Python wrapper around kallisto | bustools for scRNA-seq analysis
kb-python
kb-python is a Python package for processing single-cell RNA-sequencing data. It wraps the kallisto | bustools single-cell RNA-seq command-line tools in order to unify multiple processing workflows.
kb-python was first developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi while in Lior Pachter's lab at Caltech. If you use kb-python in a publication please cite*:
Melsted, P., Booeshaghi, A.S., et al.
Modular, efficient and constant-memory single-cell RNA-seq preprocessing.
Nat Biotechnol 39, 813–818 (2021).
https://doi.org/10.1038/s41587-021-00870-2
Installation
The latest release can be installed with
pip install kb-python
The development version can be installed with
pip install git+https://github.com/pachterlab/kb_python
There are no prerequisite packages to install. The kallisto and bustools binaries are included with the package.
Usage
kb consists of five subcommands:
$ kb
usage: kb [-h] [--list] <CMD> ...
positional arguments:
<CMD>
info Display package and citation information
compile Compile `kallisto` and `bustools` binaries from source
ref Build a kallisto index and transcript-to-gene mapping
count Generate count matrices from a set of single-cell FASTQ files
extract Extract reads that were pseudoaligned to specific genes/transcripts (or extract all reads that were / were not pseudoaligned)
kb ref: generate a pseudoalignment index
The kb ref command takes in a species annotation file (GTF) and associated genome (FASTA) and builds a species-specific index for pseudoalignment of reads. This must be run before kb count. Internally, kb ref extracts the coding regions from the GTF and builds a transcriptome FASTA that is then indexed with kallisto index.
kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME> <GENOME_ANNOTATION>
- <GENOME> refers to a genome file (FASTA).
- <GENOME_ANNOTATION> refers to a genome annotation file (GTF).
- Note: The latest genome annotation and genome file for every species on Ensembl can be found with the gget command-line tool.
Prebuilt indices are available at https://github.com/pachterlab/kallisto-transcriptome-indices
Examples
# Index the transcriptome from genome FASTA (genome.fa.gz) and GTF (annotation.gtf.gz)
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa genome.fa.gz annotation.gtf.gz
# An example for downloading a prebuilt reference for mouse
$ kb ref -d mouse -i index.idx -g t2g.txt
kb count: pseudoalign and count reads
The kb count command takes in the pseudoalignment index (built with kb ref) and sequencing reads generated by a sequencing machine to generate a count matrix. Internally, kb count runs numerous kallisto and bustools commands comprising a single-cell workflow for the specified technology that generated the sequencing reads.
kb count -i index.idx -g t2g.txt -o out/ -x <TECHNOLOGY> <FASTQ FILE[s]>
- <TECHNOLOGY> refers to the assay that generated the sequencing reads.
  - For a list of supported assays run kb --list
- <FASTQ FILE[s]> refers to the list of FASTQ files generated.
  - Different assays will have a different number of FASTQ files.
  - Different assays will place the different features in different FASTQ files.
    - For example, sequencing a 10xv3 library on a NextSeq Illumina sequencer usually results in two FASTQ files.
    - The R1.fastq.gz file (colloquially called "read 1") contains a 16 basepair cell barcode and a 12 basepair unique molecular identifier (UMI).
    - The R2.fastq.gz file (colloquially called "read 2") contains the cDNA associated with the cell barcode-UMI pair in read 1.
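To make the 10xv3 read layout described above concrete, here is a minimal Python sketch that splits a read 1 sequence into its cell barcode and UMI. The function name `split_10xv3_read1` and the example sequence are hypothetical; only the 16 bp and 12 bp lengths come from the description above. kb count handles this parsing internally, so the sketch is for illustration only.

```python
from typing import Tuple

# Lengths taken from the 10xv3 description above.
BARCODE_LEN = 16  # cell barcode length in base pairs
UMI_LEN = 12      # UMI length in base pairs


def split_10xv3_read1(sequence: str) -> Tuple[str, str]:
    """Split a 10xv3 read 1 sequence into (cell_barcode, umi).

    Hypothetical helper for illustration; not part of kb-python.
    Any bases after the first 28 are ignored.
    """
    if len(sequence) < BARCODE_LEN + UMI_LEN:
        raise ValueError("read 1 is shorter than barcode + UMI (28 bp)")
    barcode = sequence[:BARCODE_LEN]
    umi = sequence[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi


if __name__ == "__main__":
    # Made-up read: a 16 bp barcode followed by a 12 bp UMI.
    read1 = "AAACCCAAGAAACACT" + "GTCATTGCAAGG"
    barcode, umi = split_10xv3_read1(read1)
    print("barcode:", barcode)
    print("umi:", umi)
```

The cDNA in read 2 is then associated with the barcode-UMI pair extracted from the matching read 1 record.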
Examples
# Quantify 10xv3 reads read1.fastq.gz and read2.fastq.gz
$ kb count -i index.idx -g t2g.txt -o out/ -x 10xv3 read1.fastq.gz read2.fastq.gz
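Because kb ref must run before kb count, the two steps can be scripted together. The following is a minimal Python sketch, assuming `kb` is on the `PATH`; the helper names (`build_ref_command`, `build_count_command`, `run_pipeline`) are hypothetical and not part of kb-python. It uses only the flags shown in the examples above and defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_ref_command(genome: str, annotation: str, index: str = "index.idx",
                      t2g: str = "t2g.txt",
                      cdna: str = "transcriptome.fa") -> List[str]:
    """Assemble the `kb ref` call from the example above (hypothetical helper)."""
    return ["kb", "ref", "-i", index, "-g", t2g, "-f1", cdna, genome, annotation]


def build_count_command(fastqs: Sequence[str], technology: str,
                        out_dir: str = "out/", index: str = "index.idx",
                        t2g: str = "t2g.txt") -> List[str]:
    """Assemble the `kb count` call from the example above (hypothetical helper)."""
    if not fastqs:
        raise ValueError("at least one FASTQ file is required")
    return ["kb", "count", "-i", index, "-g", t2g,
            "-o", out_dir, "-x", technology, *fastqs]


def run_pipeline(commands: Sequence[List[str]], dry_run: bool = True) -> None:
    """Print each command in order and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a step fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    steps = [
        build_ref_command("genome.fa.gz", "annotation.gtf.gz"),
        build_count_command(["read1.fastq.gz", "read2.fastq.gz"], "10xv3"),
    ]
    run_pipeline(steps, dry_run=True)
```

Passing the commands as argument lists rather than a single shell string avoids quoting problems with file paths that contain spaces.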
kb info: display package and citation information
The kb info command prints out package information including the version of kb-python, kallisto, and bustools along with their installation location.
$ kb info
kb_python 0.29.5 ...
kallisto: 0.51.1 ...
bustools: 0.45.1 ...
...
kb compile: compile kallisto and bustools binaries from source
The kb compile command grabs the latest kallisto and bustools source and compiles the binaries. Note: this is not required to run kb-python.
Use cases
kb-python facilitates fast and uniform pre-processing of single-cell sequencing data to answer relevant research questions.
$ pip install kb-python gget ffq
# Goal: quantify publicly available scRNAseq data
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
# -> count matrix in out/ folder
# Goal: quantify 10xv2 feature barcode data, feature_barcodes.txt is a tab-delimited file
# containing barcode_sequence<tab>barcode_name
$ kb ref -i index.idx -g f2g.txt -f1 features.fa --workflow kite feature_barcodes.txt
$ kb count -i index.idx -g f2g.txt -x 10xv2 -o out/ --workflow kite --h5ad R1.fastq.gz R2.fastq.gz
# -> count matrix in out/ folder
Submitted by @sbooeshaghi.
Do you have a cool use case for kb-python? Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.
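The feature barcode use case above expects feature_barcodes.txt to be a tab-delimited file of barcode_sequence<tab>barcode_name. As a minimal sketch, the file could be written from Python as follows. The helper name `write_feature_barcodes`, the barcode sequences, and the feature names are all made up for illustration, and the restriction of sequences to A, C, G, and T is an assumption rather than a documented requirement.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_feature_barcodes(rows: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write (barcode_sequence, barcode_name) pairs as tab-delimited lines.

    Hypothetical helper; not part of kb-python.
    """
    for sequence, name in rows:
        # Assumption: barcode sequences contain only A, C, G and T.
        if not sequence or set(sequence.upper()) - set("ACGT"):
            raise ValueError(f"invalid barcode sequence: {sequence!r}")
        # A tab or newline inside the name would break the two-column layout.
        if not name or "\t" in name or "\n" in name:
            raise ValueError(f"invalid barcode name: {name!r}")
        handle.write(f"{sequence}\t{name}\n")


if __name__ == "__main__":
    # Made-up barcodes and names, for illustration only.
    features = [
        ("AACAAGACCCTTGAG", "feature_A"),
        ("TACCCGTAATAGCGT", "feature_B"),
    ]
    buffer = io.StringIO()
    write_feature_barcodes(features, buffer)
    print(buffer.getvalue(), end="")
    # To create the file itself:
    # with open("feature_barcodes.txt", "w") as fh:
    #     write_feature_barcodes(features, fh)
```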
Tutorials
For a list of tutorials that use kb-python please see https://www.kallistobus.tools/.
Documentation
Developer documentation is hosted on Read the Docs.
Contributing
Thank you for wanting to improve kb-python! If you believe you've found a bug, please submit an issue.
If you have a new feature you'd like to add to kb-python please create a pull request. Pull requests should contain a message detailing the exact changes made, the reasons for the change, and tests that check for the correctness of those changes.
Cite
If you use kb-python in a publication, please cite the following papers:
kb-python & kallisto and/or bustools
@article{sullivan2023kallisto,
title={kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq},
author={Sullivan, Delaney K and Min, Kyung Hoi and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Luebbert, Laura and Holley, Guillaume and Moses, Lambda and Gustafsson, Johan and Bray, Nicolas L and Pimentel, Harold and Booeshaghi, A Sina and others},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
bustools
@article{melsted2021modular,
title={\href{https://doi.org/10.1038/s41587-021-00870-2}{Modular, efficient and constant-memory single-cell RNA-seq preprocessing}},
author={Melsted, P{\'a}ll and Booeshaghi, A. Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi Joseph and da Veiga Beltrame, Eduardo and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Gehring, Jase and Pachter, Lior},
author+an={1=first;2=first,highlight},
journal={Nature biotechnology},
year={2021},
month={4},
day={1},
doi={https://doi.org/10.1038/s41587-021-00870-2}
}
kallisto
@article{bray2016near,
title={Near-optimal probabilistic RNA-seq quantification},
author={Bray, Nicolas L and Pimentel, Harold and Melsted, P{\'a}ll and Pachter, Lior},
journal={Nature biotechnology},
volume={34},
number={5},
pages={525--527},
year={2016},
publisher={Nature Publishing Group}
}
kITE
@article{booeshaghi2024quantifying,
title={Quantifying orthogonal barcodes for sequence census assays},
author={Booeshaghi, A Sina and Min, Kyung Hoi and Gehring, Jase and Pachter, Lior},
journal={Bioinformatics Advances},
volume={4},
number={1},
pages={vbad181},
year={2024},
publisher={Oxford University Press}
}
BUS format
@article{melsted2019barcode,
title={The barcode, UMI, set format and BUStools},
author={Melsted, P{\'a}ll and Ntranos, Vasilis and Pachter, Lior},
journal={Bioinformatics},
volume={35},
number={21},
pages={4472--4473},
year={2019},
publisher={Oxford University Press}
}
kb-python was inspired by Sten Linnarsson’s loompy fromfq command (http://linnarssonlab.org/loompy/kallisto/index.html)