satmut-utils

Tools for variant simulation and variant calling in paired end, targeted sequencing saturation mutagenesis experiments

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

satmut_utils is a Python package for simulation and variant calling of saturation mutagenesis data. The two main subcommands are:

'sim'
'call'

satmut_utils commands are designed to simulate and call variants in paired-end, targeted sequencing reads. Alignments to a mature mRNA reference (contiguous, spliced coding sequence with possible untranslated regions) are expected. Genome-wide and transcriptome-wide variant calling is not supported.

Installation

Currently, only Unix/Linux and MacOSX operating systems are supported. To get started, follow these steps:

If conda is not installed, install miniconda for managing environments. See this link for installation on your particular architecture.
Clone the satmut_utils repository:

git clone https://github.com/CenikLab/satmut_utils.git
SATMUT_ROOT="$PWD/satmut_utils"

Execute the provided shell script to generate the satmut_utils environment, install the package, and optionally download curated reference files, which are required if using Ensembl identifiers (see Reference files). Finally, activate the satmut_utils environment.

REF_DIR="$HOME/satmut_utils_refs"
$SATMUT_ROOT/install_satmut_utils.sh -h
$SATMUT_ROOT/install_satmut_utils.sh -t -g -r "$REF_DIR" "$SATMUT_ROOT"
conda activate satmut_utils

You are now ready to call the command-line executable satmut_utils

satmut_utils is the primary command, with subcommands 'sim' and 'call'.

Code examples

See the satmut_utils manual for more detailed usage information.

Parameter help:

satmut_utils -h
satmut_utils sim -h
satmut_utils call -h

Common arguments to both 'sim' and 'call' subcommands should be provided first, then the subcommand, and then the subcommand-specific arguments.

It is recommended that a new output directory is created for each job. Default is to output to the current directory. OUTPUT_DIR="/tmp/satmut_utils_test"

Additionally, the user can set a directory for temporary files by setting an environment variable $SCRATCH. If this variable is not set, temporary files are written to /tmp. export SCRATCH="$(mktemp -d -t satmut_temp)"

Run 'sim'

Run 'sim' on in silico alignments to generate SNPs, MNPs, insertions, and deletions. Structural variants and gene fusions are not currently supported.

TEST_DIR="$SATMUT_ROOT/src/tests/test_data"
OUTPUT_DIR="/tmp/satmut_utils_test"
satmut_utils -i ENST00000398165.7 -x $REF_DIR -o $OUTPUT_DIR -p $TEST_DIR/CBS_sim_primers.bed sim -f -a $TEST_DIR/CBS_sim.bam -v $TEST_DIR/CBS_sim.vcf

The 'sim' workflow outputs paired FASTQs, a realigned BAM file, and a truth VCF containing simulated variants and their expected counts and frequencies.

Run 'call'

Support exists for calling SNPs/SNVs, MNPs/MNVs, insertions, and deletions. Multi-codon changes and complex InDels are not supported (e.g. delete a codon, insert a nt).

Run 'call' on the simulated data by specifying an Ensembl transcript/gene ID and the directory containing curated reference files.

TEST_DIR="$SATMUT_ROOT/src/tests/test_data"
OUTPUT_DIR="/tmp/satmut_utils_test"
satmut_utils -i ENST00000398165.7 -x $REF_DIR -o $OUTPUT_DIR -p $TEST_DIR/CBS_sim_primers.bed call -1 $TEST_DIR/CBS_sim.R1.fq.gz -2 $TEST_DIR/CBS_sim.R2.fq.gz -v -m 1

Here, we call variants on a simulated dataset with no adapter sequences, so we pass -v. However, in most cases the user should provide 5' and 3' adapters for trimming.

If the Ensembl ID is not in the curated set of primary transcripts, or if you want to align to a custom reference, several reference files most be provided. Reference files.

satmut_utils -r $TEST_DIR/CBS.fa -o $OUTPUT_DIR -p $TEST_DIR/CBS_sim_primers.bed call -1 $TEST_DIR/CBS_sim.R1.fq.gz -2 $TEST_DIR/CBS_sim.R2.fq.gz -v -m 1 -g $TEST_DIR/CBS.gff -k $REF_DIR/GRCh38.fa.gz

The 'call' workflow produces a VCF of candidate variant calls as well as a bedgraph file reporting fragment coverage across the transcript reference. The output VCF and its corresponding tab-delimited summary.txt file contain records for each mismatched base in a MNP. See the satmut_utils manual or the corresponding VCF header for column/field descriptions.

Reference files

For convenience, a curated transcriptome of primary human transcripts from APPRIS is provided, allowing the user to pass an Ensembl gene or transcript ID to satmut_utils. However, if the requested Ensembl ID is not found in this set, custom reference files must be passed, which include:

A. Transcript reference (FASTA)

B. Transcript annotations (GFF)

C. GFF reference (FASTA)

Common transcript annotations in GFF format (file B) map coordinates in the genome. For this case, file A should specify a transcript FASTA and file C should specify the genome FASTA.

In typical saturation mutagenesis datasets, an intron-less coding sequence is expressed from a vector. In this case, set file A and C to the same composite (vector + coding sequence) reference FASTA, then make a custom GFF annotation (file B) with a coding sequence exon. See the user manual for more details on creating custom reference files.

Tests

To run unit tests, execute the following from the satmut_utils repository:

nose2 -q

Accessory scripts

Two command-line interfaces are provided to enable pre-processing of reads prior to satmut_utils 'sim':

satmut_trim
satmut_align

satmut_trim is a wrapper around cutadapt, and satmut_align a wrapper around bowtie2. satmut_align should be used to generate the BAM file accepted by 'sim'. If reads have been aligned with some other method, there is no guarantee 'sim' will complete without error, as alignment tags output by bowtie2 are required for 'sim' (MD, NM).

Additionally, a number of helpful scripts are available on the development repository of satmut_utils in the src/scripts directory. For example, if subdomains of a gene were targeted, MAVE-HGVS annotations can be updated with run_mave_hgvs_annot.py to report positions relative to target regions (as required by MAVEdb).

Citation

If you use satmut_utils, please cite the following paper:

Hoskins I, Sun S, Cote A, Roth FP, Cenik C. satmut_utils: a simulation and variant calling package for multiplexed assays of variant effect. Genome Biol. BioMed Central; 2023 Apr 20;24(1):1–2

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

1.1.2

Dec 25, 2023

1.1.1

Sep 15, 2023

1.1.0

Sep 11, 2023

1.0.4

Jun 28, 2023

1.0.3

Jun 23, 2023

1.0.2

Dec 9, 2022

1.0.1

Nov 21, 2022

1.0.0

Apr 15, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

satmut_utils-1.1.2.tar.gz (12.0 MB view hashes)

Uploaded Dec 25, 2023 Source

Built Distribution

satmut_utils-1.1.2-py3-none-any.whl (12.0 MB view hashes)

Uploaded Dec 25, 2023 Python 3

Hashes for satmut_utils-1.1.2.tar.gz

Hashes for satmut_utils-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`61506856cf0cc8af569858e30a9633701fd997b386d83400ececdb2370be3fb2`
MD5	`95a7e6ee40d923914c84c6af4f12d185`
BLAKE2b-256	`c60e9bcfb78e2802e533905bf2231f7290affcd7b8e7bb978564d0a6515e85f5`

Hashes for satmut_utils-1.1.2-py3-none-any.whl

Hashes for satmut_utils-1.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`82a1fbf9da9d3397734cb2f6d16cb9e0346d73cc88319e846f119a95d39cf371`
MD5	`e1d64da19115829d84c016bfc5639a1a`
BLAKE2b-256	`cc1c7848af782381f404512051f4b3de12c05ededbb774467ae9c3fb9bce8b43`