Services for compression and transfer of aind-data to the cloud

aind-data-transfer

Tools for transferring large data to and between cloud storage providers.

Installation

To upload data to AWS S3, you may need to install and configure awscli. To upload data to GCP, you may need to install and configure gsutil.
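As a quick pre-flight check, the sketch below looks for both command-line tools on the PATH. It assumes the executables are named `aws` and `gsutil`, which is how awscli and gsutil are normally installed; the helper name `find_cloud_clis` is made up for illustration and is not part of this package.

```python
import shutil


def find_cloud_clis(names=("aws", "gsutil")):
    """Map each CLI name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_cloud_clis().items():
        status = path if path else "not found (install and configure it first)"
        print(f"{name}: {status}")
```

Finding the executable only shows that the tool is installed; credentials still need to be configured separately.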

Generic upload

You may need to first install pyminizip from conda if getting errors on Windows: conda install -c mzh pyminizip

From PyPI: pip install aind-data-transfer
From source: pip install -e .

Imaging

Run pip install -e .[imaging]
Run ./post_install.sh

Ephys

From PyPI: pip install aind-data-transfer[ephys]
From source: pip install -e .[ephys]

Full

Run pip install -e .[full]
Run ./post_install.sh

Development

Run pip install -e .[dev]
Run ./post_install.sh

MPI

To run scripts on a cluster, you need to install dask-mpi. This requires compiling mpi4py with the MPI implementation used by your cluster (Open MPI, MPICH, etc). The following example is for the Allen Institute HPC, but should be applicable to other HPC systems.

SSH into your cluster login node

ssh user.name@hpc-login

On the Allen cluster, the MPI modules are only available on compute nodes, so SSH into a compute node (n256 chosen arbitrarily).

ssh user.name@n256

Now load the MPI module and compiler. It is important that you use the latest MPI version and compiler, or else dask-mpi may not function properly.

module load gcc/10.1.0-centos7 mpi/mpich-3.2-x86_64

Install mpi4py

python -m pip install --no-cache-dir mpi4py

Now install dask-mpi

python -m pip install dask_mpi --upgrade
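After both installs finish, it can be worth confirming that the packages are visible to the interpreter you plan to use. The sketch below only checks that the modules can be located; it does not start MPI or verify that mpi4py was compiled against the right MPI implementation. The function name `check_mpi_stack` is hypothetical.

```python
import importlib.util


def check_mpi_stack(modules=("mpi4py", "dask_mpi")):
    """Return {module_name: bool} saying whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_mpi_stack().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```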

Usage

Running one or more upload jobs

The jobs can be defined inside a csv file. The first row of the csv file needs the following headers. Some are required for the job to run, and others are optional.

Required

s3_bucket: S3 Bucket name
experiment_type: One of [confocal, diSPIM, ecephys, exaSPIM, FIP, fMOST, HSFP, mesoSPIM, MPOPHYS, MRI, Other, SmartSPIM, single-plane-ophys] (pulled from the Modality.abbreviation field)
modality: One of [CONFOCAL, DISPIM, ECEPHYS, EPHYS, EXASPIM, FIP, FMOST, HSFP, ICEPHYS, MESOSPIM, MPOPHYS, MRI, SMARTSPIM, SPIM, SPOPHYS]
subject_id: ID of the subject
acq_date: Format can be either yyyy-MM-dd or MM/dd/yyyy
acq_time: Format can be either HH:mm:ss or HH-mm-ss
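To make the two accepted date formats and two accepted time formats concrete, here is a minimal sketch that tries each in turn with the standard library. It is an illustration only and not the parser this package uses; the helper names are hypothetical.

```python
from datetime import date, datetime, time

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # yyyy-MM-dd or MM/dd/yyyy
TIME_FORMATS = ("%H:%M:%S", "%H-%M-%S")  # HH:mm:ss or HH-mm-ss


def _parse_with(value, formats):
    """Try each strptime format in order and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"{value!r} does not match any of {formats}")


def parse_acq_date(value: str) -> date:
    return _parse_with(value, DATE_FORMATS).date()


def parse_acq_time(value: str) -> time:
    return _parse_with(value, TIME_FORMATS).time()


if __name__ == "__main__":
    print(parse_acq_date("2020-10-10"), parse_acq_time("14-10-10"))
    print(parse_acq_date("10/11/2020"), parse_acq_time("13:10:10"))
```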

One or more modalities need to be set. The csv headers can look like:

modality0: [CONFOCAL, DISPIM, ECEPHYS, EPHYS, EXASPIM, FIP, FMOST, HSFP, ICEPHYS, MESOSPIM, MPOPHYS, MRI, SMARTSPIM, SPIM, SPOPHYS]
modality0.source: path to modality0 raw data folder
modality0.compress_raw_data (Optional): Override default compression behavior. True if ECEPHYS, False otherwise.
modality0.skip_staging (Optional): If modality0.compress_raw_data is False and this is True, upload directly to s3. Default is False.
modality0.extra_configs (Optional): path to config file to override compression defaults
modality1 (Optional): [CONFOCAL, DISPIM, ECEPHYS, EPHYS, EXASPIM, FIP, FMOST, HSFP, ICEPHYS, MESOSPIM, MPOPHYS, MRI, SMARTSPIM, SPIM, SPOPHYS]
modality1.source (Optional): path to modality1 raw data folder
modality1.compress_raw_data (Optional): Override default compression behavior. True if ECEPHYS, False otherwise.
modality1.skip_staging (Optional): If modality1.compress_raw_data is False and this is True, upload directly to s3. Default is False.
modality1.extra_configs (Optional): path to config file to override compression defaults
...
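The per-modality columns and their documented defaults can be summarised in a short sketch. It generates the column names for any number of modalities and resolves the compression and staging rules as described above (compress by default only for ECEPHYS; upload directly only when compression is off and skip_staging is true). The function names and the `upload_directly` key are hypothetical, and the package's own logic may differ in detail.

```python
MODALITY_SUFFIXES = (
    "",
    ".source",
    ".compress_raw_data",
    ".skip_staging",
    ".extra_configs",
)


def modality_headers(count: int) -> list:
    """Column names modality0, modality0.source, ... for `count` modalities."""
    if count < 1:
        raise ValueError("at least one modality is required")
    return [f"modality{i}{suffix}" for i in range(count) for suffix in MODALITY_SUFFIXES]


def resolve_modality_options(modality, compress_raw_data=None, skip_staging=False):
    """Apply the documented defaults for one modality."""
    if compress_raw_data is None:
        # Documented default: True if ECEPHYS, False otherwise.
        compress_raw_data = modality.upper() == "ECEPHYS"
    # Direct upload only applies when the data is not being compressed.
    upload_directly = (not compress_raw_data) and skip_staging
    return {"compress_raw_data": compress_raw_data, "upload_directly": upload_directly}


if __name__ == "__main__":
    print(modality_headers(2))
    print(resolve_modality_options("SMARTSPIM", skip_staging=True))
```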

Somewhat optional. Set aws_param_store_name to retrieve the common endpoints, or define custom endpoints individually if desired.

aws_param_store_name: Name of the AWS parameter store entry from which to retrieve common endpoints

If aws_param_store_name is not set, define the endpoints individually:

codeocean_domain: Domain of Code Ocean platform
codeocean_trigger_capsule_id: Launch a Code Ocean pipeline
codeocean_trigger_capsule_version: Optional if Code Ocean pipeline is versioned
metadata_service_domain: Domain name of the metadata service
aind_data_transfer_repo_location: The link to this project
video_encryption_password: Password with which to encrypt video files
codeocean_api_token: Code Ocean token used to run a capsule
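When a row relies on custom endpoints instead of aws_param_store_name, it is easy to leave one blank. The sketch below reports which of the fields listed above are empty for a given row. It does not decide which fields are strictly required (codeocean_trigger_capsule_version, for instance, is described as optional); the function name is hypothetical.

```python
ENDPOINT_FIELDS = (
    "codeocean_domain",
    "codeocean_trigger_capsule_id",
    "codeocean_trigger_capsule_version",
    "metadata_service_domain",
    "aind_data_transfer_repo_location",
    "video_encryption_password",
    "codeocean_api_token",
)


def unset_endpoint_fields(row: dict) -> list:
    """List endpoint fields left blank in a csv row.

    Returns an empty list when aws_param_store_name is set, since the
    endpoints are then expected to come from the parameter store.
    """
    if (row.get("aws_param_store_name") or "").strip():
        return []
    return [name for name in ENDPOINT_FIELDS if not (row.get(name) or "").strip()]


if __name__ == "__main__":
    print(unset_endpoint_fields({"codeocean_domain": "https://codeocean.example"}))
```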

Optional

temp_directory: The job will use your OS's file system to create a temp directory as default. You can override the location by setting this parameter.
behavior_dir: Location where behavior data associated with the raw data is stored.
metadata_dir: Location where metadata associated with the raw data is stored.
log_level: Default log level is warning. Can be set here.

Optional Flags

metadata_dir_force: Default is false. If true, the metadata in the metadata folder will be regarded as the source of truth vs. the metadata pulled from aind_metadata_service
dry_run: Default is false. If set to true, it will perform a dry-run of the upload portion and not actually upload anything.
force_cloud_sync: Use with caution. If set to true, it will sync the local raw data to the cloud even if the cloud folder already exists.
compress_raw_data: Override all compress_raw_data defaults and set them to True.
skip_staging: For each modality, copy uncompressed data directly to s3.

After creating the csv file, you can run through the jobs with

python -m aind_data_transfer.jobs.s3_upload_job --jobs-csv-file "path_to_jobs_list"

Any optional flags passed on the command line apply to every job and override the values set in the csv file. For example,

python -m aind_data_transfer.jobs.s3_upload_job --jobs-csv-file "path_to_jobs_list" --dry-run --compress-raw-data

will compress the raw data source and run a dry run for all jobs defined in the csv file.
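If you launch the job from another Python script, the same command can be assembled as an argument list. This sketch uses only the module path and the three flags shown above; the function name `build_upload_command` is hypothetical.

```python
import shlex
import sys


def build_upload_command(jobs_csv_file, dry_run=False, compress_raw_data=False):
    """Build the argv list for the s3_upload_job module shown above."""
    cmd = [
        sys.executable,
        "-m",
        "aind_data_transfer.jobs.s3_upload_job",
        "--jobs-csv-file",
        str(jobs_csv_file),
    ]
    if dry_run:
        cmd.append("--dry-run")
    if compress_raw_data:
        cmd.append("--compress-raw-data")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to actually launch the jobs.
    print(shlex.join(build_upload_command("path_to_jobs_list", dry_run=True)))
```

Passing a list rather than a single string avoids shell quoting problems when the csv path contains spaces.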

An example csv file might look like:

data-source, s3-bucket, subject-id, modality, experiment_type, acq-date, acq-time, aws_param_store_name
dir/data_set_1, some_bucket, 123454, ECEPHYS, ecephys, 2020-10-10, 14-10-10, /aind/data/transfer/endpoints
dir/data_set_2, some_bucket2, 123456, OPHYS, Other, 2020-10-11, 13-10-10, /aind/data/transfer/endpoints
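The example above puts a space after each comma, which a plain csv reader would keep as part of each value. The sketch below reads the same text with the standard library and strips that whitespace. It is not the package's own parser, and it leaves the header names exactly as written in the example (hyphenated in places, whereas the field list above uses underscores).

```python
import csv
import io

EXAMPLE_CSV = """\
data-source, s3-bucket, subject-id, modality, experiment_type, acq-date, acq-time, aws_param_store_name
dir/data_set_1, some_bucket, 123454, ECEPHYS, ecephys, 2020-10-10, 14-10-10, /aind/data/transfer/endpoints
dir/data_set_2, some_bucket2, 123456, OPHYS, Other, 2020-10-11, 13-10-10, /aind/data/transfer/endpoints
"""


def read_jobs(text: str) -> list:
    """Parse csv text into one dict per job, trimming stray whitespace."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [
        {key.strip(): (value or "").strip() for key, value in row.items()}
        for row in reader
    ]


if __name__ == "__main__":
    for job in read_jobs(EXAMPLE_CSV):
        print(job["subject-id"], job["modality"], job["acq-date"])
```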

Defining a custom processing capsule to run in Code Ocean

Read the previous section on defining a csv file. Retrieve the capsule id from the Code Ocean platform. You can add an extra parameter to define a custom processing capsule that gets executed after the data is uploaded:

codeocean_process_capsule_id, data-source, s3-bucket, subject-id, modality, experiment_type, acq-date, acq-time, aws_param_store_name
xyz-123-456, dir/data_set_1, some_bucket, 123454, ECEPHYS, ecephys, 2020-10-10, 14-10-10, /aind/data/transfer/endpoints
xyz-123-456, dir/data_set_2, some_bucket2, 123456, OPHYS, Other, 2020-10-11, 13-10-10, /aind/data/transfer/endpoints
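For an existing jobs file, the extra column can be prepended programmatically. This is a minimal sketch that applies one capsule id to every row; the function name is hypothetical and the capsule id shown is the placeholder from the example above.

```python
import csv
import io


def add_process_capsule(csv_text: str, capsule_id: str) -> str:
    """Return csv text with a leading codeocean_process_capsule_id column."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]  # skip blank lines
    header, jobs = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["codeocean_process_capsule_id", *header])
    for job in jobs:
        writer.writerow([capsule_id, *job])
    return out.getvalue()


if __name__ == "__main__":
    source = "data-source, s3-bucket\ndir/data_set_1, some_bucket\n"
    print(add_process_capsule(source, "xyz-123-456"))
```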

Contributing

Linters and testing

There are several libraries used to run linters, check documentation, and run tests.

Please test your changes using the coverage library, which will run the tests and log a coverage report:

coverage run -m unittest discover && coverage report

Use interrogate to check that modules, methods, etc. have been documented thoroughly:

interrogate .

Use flake8 to check that code is up to standards (no unused imports, etc.):

flake8 .

Use black to automatically format the code to PEP 8 style:

black .

Use isort to automatically sort import statements:

isort .
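The checks above can be chained in a small helper script so that none is forgotten before a commit. This is a sketch, not something shipped with the project: it runs every command even if an earlier one fails (unlike the `&&` in the coverage command), and it raises FileNotFoundError if a tool is not installed. The `runner` parameter exists only so the function can be exercised without the tools present.

```python
import shlex
import subprocess

CHECKS = (
    "coverage run -m unittest discover",
    "coverage report",
    "interrogate .",
    "flake8 .",
    "black .",
    "isort .",
)


def run_checks(commands=CHECKS, runner=subprocess.run):
    """Run each command in order and return those that exited non-zero."""
    failed = []
    for command in commands:
        result = runner(shlex.split(command))
        if result.returncode != 0:
            failed.append(command)
    return failed


if __name__ == "__main__":
    failures = run_checks()
    print("all checks passed" if not failures else f"failed: {failures}")
```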

Pull requests

For internal members, please create a branch. For external members, please fork the repo and open a pull request from the fork. We'll primarily use Angular style for commit messages. Roughly, they should follow the pattern:

<type>(<scope>): <short summary>

where scope (optional) describes the packages affected by the code changes and type (mandatory) is one of:

build: Changes that affect the build system or external dependencies (example scopes: pyproject.toml, setup.py)
ci: Changes to our CI configuration files and scripts (examples: .github/workflows/ci.yml)
docs: Documentation only changes
feat: A new feature
fix: A bug fix
perf: A code change that improves performance
refactor: A code change that neither fixes a bug nor adds a feature
test: Adding missing tests or correcting existing tests
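The header pattern and the list of types above translate directly into a regular expression. The following sketch checks a commit header against them; it is an illustration, not a hook the project ships, and it assumes the summary is separated from the colon by a single space.

```python
import re

COMMIT_TYPES = ("build", "ci", "docs", "feat", "fix", "perf", "refactor", "test")
COMMIT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"  # mandatory type
    r"(?:\((?P<scope>[^()]+)\))?"                   # optional (scope)
    r": (?P<summary>\S.*)$"                         # colon, space, summary
)


def parse_commit_header(header: str):
    """Return the type, scope, and summary as a dict, or None if malformed."""
    match = COMMIT_PATTERN.match(header)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_commit_header("fix(jobs): handle missing acq_time"))
    print(parse_commit_header("chore: bump version"))
```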
