AACR Project GENIE ETL

These details have not been verified by PyPI

Development Status
- 5 - Production/Stable
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering

Project description

AACR Project GENIE

Introduction

This repository documents code used to gather, QC, standardize, and analyze data uploaded by institutes participating in AACR's Project GENIE (Genomics, Evidence, Neoplasia, Information, Exchange).

Dependencies

This package contains R, Python, and CLI tools. You will need the following tools and packages to reproduce these results:

Python >=3.8 and <3.10
- pip install -r requirements.txt
bedtools
R 4.2.2
- renv::install()
- Follow instructions here to install synapser
Java > 8
- For macOS users, running brew install java seems to work better
wget
- macOS users need to run brew install wget
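
A quick preflight check can catch a missing dependency before a long run. The sketch below is only an illustration: the function names are hypothetical, the executable names (bedtools, Rscript, java, wget) are assumptions about how each tool appears on PATH, and it does not check the R or Java versions.

```python
import shutil
import sys

# Executable names are assumptions about how each tool appears on PATH.
REQUIRED_TOOLS = ("bedtools", "Rscript", "java", "wget")


def python_version_supported(version=None):
    """Return True when the interpreter is >=3.8 and <3.10."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 8) <= major_minor < (3, 10)


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not python_version_supported():
        print("Unsupported Python version:", sys.version.split()[0])
    for tool in missing_tools():
        print("Not found on PATH:", tool)
```

Running the script prints nothing when the interpreter is in range and every tool is found.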

File Validator

One of the features of the aacrgenie package is that it provides a validation tool that GENIE data contributors can install and use to validate their files locally prior to uploading to Synapse.

pip install aacrgenie
genie -v

This will install all the necessary components for you to run the validator locally on all of your files, including the Synapse client. Please view the help to see how to run the validator.

genie validate -h
genie validate data_clinical_supp_SAGE.txt SAGE
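
When there are several files to check, the same command can be driven from a script. This is a minimal sketch, assuming the genie executable is on PATH and running it once per file, mirroring the invocation shown above. The helper names are hypothetical. The source does not say whether the exit code reflects the validation result, so treat the returned codes as process status only and read the printed report.

```python
import subprocess


def build_validate_command(file_path, center):
    """Build the argument list for `genie validate <file> <center>`."""
    return ["genie", "validate", file_path, center]


def validate_files(file_paths, center, runner=subprocess.run):
    """Run the validator once per file and return {file: exit code}."""
    results = {}
    for file_path in file_paths:
        completed = runner(build_validate_command(file_path, center))
        results[file_path] = completed.returncode
    return results
```

For example, validate_files(["data_clinical_supp_SAGE.txt"], "SAGE") runs the command shown above. The runner parameter exists only so the function can be exercised without invoking the real CLI.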

Contributing

Please view the contributing guide to learn how to contribute to the GENIE package.

Sage Bionetworks Only

Developing locally

These are instructions on how you would develop and test the pipeline locally.

Make sure you have read through the GENIE Onboarding Docs and have access to all of the required repositories, resources, and Synapse projects for Main GENIE.
Be sure you are invited to the Synapse GENIE Admin team.
Make sure you are a Synapse certified user (see Certified User - Synapse User Account Types).
Clone this repo and install the package locally.
```
pip install -e .
pip install -r requirements.txt
pip install -r requirements-dev.txt
```
If you are having trouble with the above, try installing via pipenv:
1. Specify a python version that is supported by this repo: pipenv --python <python_version>
2. pipenv install from requirements file
3. Activate your pipenv: pipenv shell
Configure the Synapse client to authenticate to Synapse.
1. Create a Synapse Personal Access token (PAT).
2. Add a ~/.synapseConfig file
```
[authentication]
authtoken = <PAT here>
```
3. OR set an environment variable
```
export SYNAPSE_AUTH_TOKEN=<PAT here>
```
4. Confirm you can log in from your terminal.
```
synapse login
```
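
Before calling synapse login, it can help to confirm that one of the two token locations described above is actually populated. The following is a rough sketch with a hypothetical function name. It only checks that a non-empty value exists and does not contact Synapse. Checking the environment variable first is an arbitrary choice here, not a statement about the order the Synapse client uses.

```python
import configparser
import os
from pathlib import Path


def find_synapse_token(config_path=None, environ=None):
    """Return where a token was found: "env", "config", or None."""
    environ = os.environ if environ is None else environ
    if environ.get("SYNAPSE_AUTH_TOKEN"):
        return "env"

    config_path = Path(config_path) if config_path else Path.home() / ".synapseConfig"
    # Disable interpolation so a token containing "%" is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)  # a missing file is skipped without error
    if parser.get("authentication", "authtoken", fallback=""):
        return "config"
    return None
```

A return value of None means neither the SYNAPSE_AUTH_TOKEN variable nor an authtoken entry under [authentication] was found.
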
Run the different pipelines on the test project. The --project_id syn7208886 points to the test project.
1. Validate all the files excluding vcf files:
```
python bin/input_to_database.py main --project_id syn7208886 --onlyValidate
```
2. Validate all the files:
```
python bin/input_to_database.py mutation --project_id syn7208886 --onlyValidate --genie_annotation_pkg ../annotation-tools
```
3. Process all the files aside from the mutation (maf, vcf) files. The mutation processing was split out because it takes at least 2 days to process all the production mutation data. Ideally, there would be a parameter to exclude or include file types to process/validate, but that is not implemented.
```
python bin/input_to_database.py main --project_id syn7208886 --deleteOld
```
4. Process the mutation data. Be sure to clone this repo: https://github.com/Sage-Bionetworks/annotation-tools and git checkout the version of the repo pinned in the Dockerfile. This repo houses the code that re-annotates the mutation data with Genome Nexus. The --createNewMafDatabase flag creates new mutation tables in the test project. This flag is necessary for production data for two main reasons:
  - During processing, mutation data is appended to the existing table, so without first creating an empty table, duplicated data would be uploaded.
  - By design, Synapse Tables are meant to be appended to. When a Synapse Table is updated, it takes time to index the table and return results, which can cause problems when the pipeline queries the mutation table. With millions of rows, it is actually faster to create an entirely new table than to update or delete all rows and append new ones.
  - Note: if you run this more than once on the same day, you'll run into an issue with overwriting the narrow maf table because it already exists. Rename the current narrow maf database under Tables in the test Synapse project and try again.
```
python bin/input_to_database.py mutation --project_id syn7208886 --deleteOld --genie_annotation_pkg ../annotation-tools --createNewMafDatabase
```
5. Create a consortium release. Be sure to add the --test parameter. Be sure to clone the cbioportal repo: https://github.com/cBioPortal/cbioportal and git checkout the version of the repo pinned in the Dockerfile.
```
python bin/database_to_staging.py Jan-2017 ../cbioportal TEST --test
```
6. Create a public release. Be sure to add the --test parameter. Be sure to clone the cbioportal repo: https://github.com/cBioPortal/cbioportal and git checkout the version of the repo pinned in the Dockerfile.
```
python bin/consortium_to_public.py Jan-2017 ../cbioportal TEST --test
```
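
The processing and release steps (3 through 6) are meant to run in order, so it can be convenient to keep them together in one place. The sketch below only assembles and prints the commands as a dry run. The function name and the default relative paths are taken from the examples above or invented for illustration, and nothing here is part of the package itself.

```python
import shlex

TEST_PROJECT_ID = "syn7208886"  # test project named in the steps above


def local_test_commands(release="Jan-2017", annotation_pkg="../annotation-tools",
                        cbioportal="../cbioportal"):
    """Return the processing and release commands from steps 3-6, in order."""
    input_to_db = ["python", "bin/input_to_database.py"]
    return [
        input_to_db + ["main", "--project_id", TEST_PROJECT_ID, "--deleteOld"],
        input_to_db + ["mutation", "--project_id", TEST_PROJECT_ID, "--deleteOld",
                       "--genie_annotation_pkg", annotation_pkg,
                       "--createNewMafDatabase"],
        ["python", "bin/database_to_staging.py", release, cbioportal, "TEST", "--test"],
        ["python", "bin/consortium_to_public.py", release, cbioportal, "TEST", "--test"],
    ]


if __name__ == "__main__":
    # Dry run: print each command rather than executing it.
    for command in local_test_commands():
        print(shlex.join(command))
```

Keeping the commands as argument lists makes it easy to pass each one to subprocess.run later, once the printed output has been reviewed.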

Production

The production pipeline runs on Nextflow Tower, and the Nextflow workflow is captured in nf-genie. Because GENIE contains limited PHI, it is wise to create an EC2 instance via the Sage Bionetworks service catalog to work with the production data.

Release history

17.1.0

Feb 19, 2026

17.0.0

Oct 22, 2025

16.6.0

Aug 13, 2025

16.5.0

Feb 20, 2025

16.4.0

Jul 23, 2024

16.3.0

Mar 20, 2024

This version

16.2.0

Feb 14, 2024

16.1.0

Nov 10, 2023

16.0.0

Sep 12, 2023

15.4.0

Jul 13, 2023

15.3.0

May 3, 2023

15.2.0

Mar 21, 2023

15.1.0

Feb 24, 2023

15.0.0

Jan 7, 2023

14.4.0

Dec 20, 2022

14.3.1

Nov 14, 2022

14.3.0

Nov 3, 2022

14.2.0

Nov 2, 2022

14.1.2

Sep 14, 2022

14.1.1

Aug 2, 2022

14.1.0

Jul 17, 2022

14.0.1

Jul 17, 2022

14.0.0

Jul 4, 2022

13.3.0

May 4, 2022

13.2.0

Apr 4, 2022

13.1.1

Mar 14, 2022

13.1.0

Mar 10, 2022

13.0.0

Mar 3, 2022

12.7.0

Jan 19, 2022

12.6.0

Jan 9, 2022

12.5.0

Apr 7, 2021

12.4.0

Mar 5, 2021

12.3.0

Feb 10, 2021

12.2.0

Jan 12, 2021

12.1.0

Nov 26, 2020

12.0.0

Nov 5, 2020

11.1.0

Sep 26, 2020

11.0.0

Aug 17, 2020

10.0.0

Jun 16, 2020

9.0.1

May 19, 2020

9.0.0

May 19, 2020

Download files

Source Distribution

aacrgenie-16.2.0.tar.gz (175.3 kB view details)

Uploaded Feb 14, 2024 Source

Built Distribution

aacrgenie-16.2.0-py3-none-any.whl (172.1 kB view details)

Uploaded Feb 14, 2024 Python 3

File details

Details for the file aacrgenie-16.2.0.tar.gz.

File metadata

Download URL: aacrgenie-16.2.0.tar.gz
Upload date: Feb 14, 2024
Size: 175.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for aacrgenie-16.2.0.tar.gz
Algorithm	Hash digest
SHA256	`282756064be19ac3940fa7d25243a2c1e30c455752d614a056a37b9312e34bd3`
MD5	`5dd443e5c207c7a68175182e8da31ad4`
BLAKE2b-256	`5fae54e2a8cf1161e2d1b3bc84c93ea0b4d14df49e6e50dfcaf6565f3eaa80f8`

File details

Details for the file aacrgenie-16.2.0-py3-none-any.whl.

File metadata

Download URL: aacrgenie-16.2.0-py3-none-any.whl
Upload date: Feb 14, 2024
Size: 172.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for aacrgenie-16.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`30751b222a45dc585e74ab8de212bb81a6244f6b2cf93cca9718462319644887`
MD5	`527ba8354fb8623c43a6d7970cc1b213`
BLAKE2b-256	`ad38706f5158f94fb0df4c1c3044e2ac57feb67742ee819e68ea0182122c1e30`
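
The published digests can be used to check a downloaded file before installing it. Below is a minimal sketch using Python's standard hashlib module. The SHA256 values are copied from the tables above, while the helper names and the assumption that the file sits at the given local path are illustrative.

```python
import hashlib

# SHA256 digests copied from the file hash tables above.
EXPECTED_SHA256 = {
    "aacrgenie-16.2.0.tar.gz":
        "282756064be19ac3940fa7d25243a2c1e30c455752d614a056a37b9312e34bd3",
    "aacrgenie-16.2.0-py3-none-any.whl":
        "30751b222a45dc585e74ab8de212bb81a6244f6b2cf93cca9718462319644887",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large download is not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.lower()
```

For instance, matches_expected("aacrgenie-16.2.0.tar.gz", EXPECTED_SHA256["aacrgenie-16.2.0.tar.gz"]) should be True for an intact download in the current directory.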
