bioinfokit

Bioinformatics data analysis and visualization toolkit

The bioinfokit toolkit aims to provide easy-to-use functions to analyze,
visualize, and interpret biological data generated from genome-scale omics experiments.

How to install:

bioinfokit requires

Python 3
NumPy
scikit-learn
seaborn
pandas
matplotlib
SciPy

git clone https://github.com/reneshbedre/bioinfokit.git
cd bioinfokit
python setup.py install

Volcano plot

bioinfokit.visuz.volcano(table, lfc, pv, lfc_thr, pv_thr, color, valpha, geneid, genenames, gfont)

Parameters	Description
`table`	Comma-separated (CSV) gene expression table having at least gene IDs, log fold change, and P-values or adjusted P-values columns
`lfc`	Name of a column having log fold change values [string][default:logFC]
`pv`	Name of a column having P-values or adjusted P-values [string][default:p_values]
`lfc_thr`	Log fold change cutoff for up and downregulated genes [float][default:1.0]
`pv_thr`	P-values or adjusted P-values cutoff for up and downregulated genes [float][default:0.05]
`color`	Tuple of two colors [tuple][default: ("green", "red")]
`valpha`	Transparency of points on volcano plot [float (between 0 and 1)][default: 1.0]
`geneid`	Name of a column having gene IDs. This is necessary for plotting gene labels on the points [string][default: None]
`genenames`	Tuple of gene IDs to label the points. The gene IDs must be present in the geneid column. If this option is set to "deg", it will label all genes defined by lfc_thr and pv_thr [string, tuple, dict][default: None]
`gfont`	Font size for genenames [float][default: 10.0]

Returns:

Volcano plot image in same directory (volcano.png)
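
The `lfc_thr` and `pv_thr` cutoffs split genes into upregulated, downregulated, and non-significant groups. A minimal sketch of that classification in plain Python, assuming a gene is significant when its P-value is below `pv_thr` and its absolute log fold change is at least `lfc_thr`. The exact comparison operators used inside bioinfokit are not stated here, and `classify_gene` is a hypothetical helper, not part of the package:

```python
def classify_gene(lfc, pv, lfc_thr=1.0, pv_thr=0.05):
    """Label one gene as 'up', 'down', or 'ns' (not significant)."""
    if pv < pv_thr and lfc >= lfc_thr:
        return "up"
    if pv < pv_thr and lfc <= -lfc_thr:
        return "down"
    return "ns"

# Illustrative (made-up) rows: gene ID, log fold change, P-value
rows = [
    ("geneA", 2.3, 0.001),
    ("geneB", -1.8, 0.01),
    ("geneC", 0.4, 0.0001),  # significant P-value but small fold change
    ("geneD", 3.0, 0.2),     # large fold change but P-value above cutoff
]
labels = {gene_id: classify_gene(lfc, pv) for gene_id, lfc, pv in rows}
```

Only genes that pass both cutoffs receive a color from the `color` tuple; the rest would be drawn as background points.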

Working example

MA plot

bioinfokit.visuz.ma(table, lfc, ct_count, st_count, lfc_thr)

Parameters	Description
`table`	Comma-separated (CSV) gene expression table having at least gene IDs, log fold change, and counts (control and treatment) columns
`lfc`	Name of a column having log fold change values [default:logFC]
`ct_count`	Name of a column having count values for control sample [default:value1]
`st_count`	Name of a column having count values for treatment sample [default:value2]
`lfc_thr`	Log fold change cutoff for up and downregulated genes [default:1]

Returns:

MA plot image in same directory (ma.png)
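
For orientation, an MA plot conventionally places the mean log expression (A) on the X-axis and the log fold change (M) on the Y-axis. A minimal sketch of those two values for a single gene, assuming positive counts and the conventional base-2 definition. Whether bioinfokit computes A in exactly this way is not stated here, and `ma_values` is a hypothetical helper:

```python
import math

def ma_values(control, treatment):
    """Return (A, M) for one gene from two positive count values."""
    log_c, log_t = math.log2(control), math.log2(treatment)
    a = (log_c + log_t) / 2   # mean of the log2 counts
    m = log_t - log_c         # log2 fold change, treatment vs control
    return a, m

# Illustrative counts: control = 8, treatment = 32
a, m = ma_values(control=8, treatment=32)
```

Genes with zero counts would need a pseudocount or filtering before taking logarithms.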

Working example

Inverted Volcano plot

bioinfokit.visuz.involcano(table, lfc, pv, lfc_thr, pv_thr, color, valpha, geneid, genenames, gfont)

Parameters	Description
`table`	Comma-separated (CSV) gene expression table having at least gene IDs, log fold change, and P-values or adjusted P-values columns
`lfc`	Name of a column having log fold change values [default:logFC]
`pv`	Name of a column having P-values or adjusted P-values [default:p_values]
`lfc_thr`	Log fold change cutoff for up and downregulated genes [default:1]
`pv_thr`	P-values or adjusted P-values cutoff for up and downregulated genes [default:0.05]
`color`	Tuple of two colors [tuple][default: ("green", "red")]
`valpha`	Transparency of points on volcano plot [float (between 0 and 1)][default: 1.0]
`geneid`	Name of a column having gene IDs. This is necessary for plotting gene labels on the points [string][default: None]
`genenames`	Tuple of gene IDs to label the points. The gene IDs must be present in the geneid column. If this option is set to "deg", it will label all genes defined by lfc_thr and pv_thr [string, tuple, dict][default: None]
`gfont`	Font size for genenames [float][default: 10.0]

Returns:

Inverted volcano plot image in same directory (involcano.png)

Working example

Correlation matrix plot

bioinfokit.visuz.corr_mat(table, corm)

Parameters	Description
`table`	Dataframe object with numerical variables (columns) to find correlation. Ideally, you should have three or more variables. Dataframe should not have identifier column.
`corm`	Correlation method [pearson,kendall,spearman] [default:pearson]

Returns:

Correlation matrix plot image in same directory (corr_mat.png)
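
To make the default `pearson` method concrete, here is a minimal sketch of a pairwise Pearson correlation matrix in plain Python. It illustrates the statistic only and is not bioinfokit's implementation; `pearson` and `corr_matrix` are hypothetical helpers, and the columns are made-up data:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def corr_matrix(columns):
    """Return a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Three numerical variables, no identifier column
data = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = corr_matrix(data)
```

A constant column has zero variance and would cause a division by zero, so such columns should be dropped first.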

Working example

Merge VCF files

bioinfokit.analys.mergevcf(file)

Parameters	Description
`file`	Multiple VCF files, separated by commas

Returns:

Merged VCF file (merge_vcf.vcf)

Working example

PCA

bioinfokit.analys.pca(table)

Parameters	Description
`table`	Dataframe object with numerical variables (columns). Dataframe should not have identifier column.

Returns:

PCA summary, scree plot (screepca.png), and 2D/3D pca plots (pcaplot_2d.png and pcaplot_3d.png)

Working example

Reverse complement of DNA sequence

bioinfokit.analys.rev_com(seq, file)

Parameters	Description
`seq`	DNA sequence to perform reverse complement
`file`	DNA sequence in a fasta file

Returns:

Reverse complement of original DNA sequence
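
The operation itself is small enough to sketch in plain Python. This is an illustration of reverse complementation, assuming an unambiguous A/C/G/T alphabet, and is not bioinfokit's code; `reverse_complement` is a hypothetical helper:

```python
# Translation table mapping each base to its complement (both cases)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

result = reverse_complement("ATGCCG")  # "CGGCAT"
```

Characters outside the table, such as N, pass through `str.translate` unchanged.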

Working example

Sequencing coverage

bioinfokit.analys.seqcov(file, gs)

Parameters	Description
`file`	FASTQ file
`gs`	Genome size in Mbp

Returns:

Sequencing coverage of the given FASTQ file
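
Sequencing coverage is commonly estimated as the total number of sequenced bases divided by the genome size. A minimal sketch under that assumption, with `gs` given in Mbp as above and strictly four-line FASTQ records; `sequencing_coverage` is a hypothetical helper, and the reads are made-up:

```python
def sequencing_coverage(fastq_lines, gs):
    """Estimate coverage as total read bases / genome size (gs in Mbp)."""
    lines = [line.strip() for line in fastq_lines]
    # In a four-line FASTQ record, the sequence is the second line
    total_bases = sum(len(seq) for seq in lines[1::4])
    return total_bases / (gs * 1_000_000)

# Two illustrative reads (10 + 5 bases) against a 0.5 Mbp genome
reads = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@read2", "ACGTA", "+", "IIIII",
]
cov = sequencing_coverage(reads, gs=0.5)
```

Multi-line FASTQ records would break the fixed stride of four and need a proper parser.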

Working example

Convert TAB to CSV file

bioinfokit.analys.tcsv(file)

Parameters	Description
`file`	Tab-delimited text file

Returns:

Comma-delimited (CSV) file (out.csv)
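
The conversion can be sketched with Python's standard `csv` module. This version works on text in memory instead of writing out.csv, and `tab_to_csv` is a hypothetical helper, not the package's function:

```python
import csv
import io

def tab_to_csv(tab_text):
    """Convert tab-delimited text to comma-delimited text."""
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    out = io.StringIO()
    # The writer quotes any field that itself contains a comma
    csv.writer(out, lineterminator="\n").writerows(reader)
    return out.getvalue()

csv_text = tab_to_csv("gene\tnote\ng1\tkinase, putative\n")
```

Using a CSV writer instead of a plain string replace keeps fields that contain commas intact.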

Heatmap

bioinfokit.visuz.hmap(table, cmap='seismic', scale=True, dim=(6, 8), clus=True, zscore=None, xlabel=True, ylabel=True, tickfont=(12, 12))

Parameters	Description
`table`	CSV delimited data file. It should not have NA or missing values
`cmap`	Color palette for heatmap [string][default: 'seismic']
`scale`	Draw a color key with heatmap [boolean (True or False)][default: True]
`dim`	Heatmap figure size [tuple of two floats (width, height) in inches][default: (6, 8)]
`clus`	Draw hierarchical clustering with heatmap [boolean (True or False)][default: True]
`zscore`	Z-score standardization of row (0) or column (1). It works when clus is True. [None, 0, 1][default: None]
`xlabel`	Plot X-label [boolean (True or False)][default: True]
`ylabel`	Plot Y-label [boolean (True or False)][default: True]
`tickfont`	Font size for X and Y-axis tick labels [tuple of two floats][default: (12, 12)]

Returns:

Heatmap plot (heatmap.png, heatmap_clus.png)

Working example

Venn Diagram

bioinfokit.visuz.venn(vennset, venncolor, vennalpha, vennlabel)

Parameters	Description
`vennset`	Venn dataset for 3-way and 2-way Venn diagrams. Data should be in the format (100, 010, 110, 001, 101, 011, 111) for a 3-way Venn and (10, 01, 11) for a 2-way Venn [default: (1,1,1,1,1,1,1)]
`venncolor`	Color palette for Venn [color code][default: ('#00909e', '#f67280', '#ff971d')]
`vennalpha`	Transparency of Venn [float (0 to 1)][default: 0.5]
`vennlabel`	Labels for the Venn sets [tuple of strings][default: ('A', 'B', 'C')]

Returns:

Venn plot (venn3.png, venn2.png)
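
The `vennset` tuple holds region sizes, not the raw sets. A minimal sketch of deriving the seven 3-way region counts in the order (100, 010, 110, 001, 101, 011, 111) from three Python sets; `venn3_counts` is a hypothetical helper, and the sets are made-up:

```python
def venn3_counts(a, b, c):
    """Return region sizes in the order 100, 010, 110, 001, 101, 011, 111."""
    a, b, c = set(a), set(b), set(c)
    return (
        len(a - b - c),    # 100: only in A
        len(b - a - c),    # 010: only in B
        len((a & b) - c),  # 110: in A and B, not C
        len(c - a - b),    # 001: only in C
        len((a & c) - b),  # 101: in A and C, not B
        len((b & c) - a),  # 011: in B and C, not A
        len(a & b & c),    # 111: in all three
    )

vennset = venn3_counts({1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7})
```

The seven counts always sum to the size of the union of the three sets, which is a quick sanity check before plotting.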

Working example

Two sample t-test with equal and unequal variance

bioinfokit.analys.ttsam(table, xfac, res, evar)

Parameters	Description
`table`	CSV delimited data file. It should be a stacked table with independent (xfac) and dependent (res) variable columns.
`xfac`	Independent group column name with two levels [string][default: None]
`res`	Response variable column name [string][default: None]
`evar`	t-test with equal variance [bool (True or False)][default: True]

Returns:

Summary output and group boxplot (ttsam_boxplot.png)
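
The `evar` switch corresponds to the choice between Student's t-test (pooled variance) and Welch's t-test (separate variances). A minimal sketch of the test statistic and degrees of freedom for both cases, using the textbook formulas and not bioinfokit's code; `t_statistic` is a hypothetical helper, and the two groups are made-up:

```python
import math
import statistics

def t_statistic(x, y, evar=True):
    """Return (t, degrees of freedom) for two independent samples."""
    nx, ny = len(x), len(y)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    if evar:
        # Student's t-test: pooled variance, df = nx + ny - 2
        pooled = ((nx - 1) * var_x + (ny - 1) * var_y) / (nx + ny - 2)
        se = math.sqrt(pooled * (1 / nx + 1 / ny))
        df = nx + ny - 2
    else:
        # Welch's t-test: separate variances, Welch-Satterthwaite df
        se = math.sqrt(var_x / nx + var_y / ny)
        df = (var_x / nx + var_y / ny) ** 2 / (
            (var_x / nx) ** 2 / (nx - 1) + (var_y / ny) ** 2 / (ny - 1)
        )
    return (mean_x - mean_y) / se, df

group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_equal, df_equal = t_statistic(group_a, group_b, evar=True)
t_welch, df_welch = t_statistic(group_a, group_b, evar=False)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ, so the P-value can still change between the two settings.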

Working example

Chi-square test for independence

bioinfokit.analys.chisq(table)

Parameters	Description
`table`	CSV delimited data file. It should be a contingency table.

Returns:

Summary output and variable mosaic plot (mosaic.png)
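
The statistic behind the test compares observed counts with the counts expected under independence. A minimal sketch of the Pearson chi-square statistic without any continuity correction, so results for 2x2 tables may differ from tools that apply one; `chi_square` is a hypothetical helper, and the table is made-up:

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

stat, dof = chi_square([[10, 20], [20, 10]])
```

Every expected count here is 15, so each of the four cells contributes 25/15 to the statistic.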

Working example

File format conversions

bioinfokit.analys.format

Function	Parameters	Description
`bioinfokit.analys.format.fqtofa(file)`	`FASTQ file`	Convert FASTQ file into FASTA format
`bioinfokit.analys.format.hmmtocsv(file)`	`HMM file`	Convert HMM text output (from HMMER tool) to CSV format
`bioinfokit.analys.format.tabtocsv(file)`	`TAB file`	Convert TAB file to CSV format
`bioinfokit.analys.format.csvtotab(file)`	`CSV file`	Convert CSV file to TAB format

Returns:

Output will be saved in same directory
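
As an illustration of what the FASTQ to FASTA conversion involves, here is a minimal sketch that keeps each header and sequence and drops the separator and quality lines. It assumes strictly four-line FASTQ records and works on text in memory; `fastq_to_fasta` is a hypothetical helper, not the package's `fqtofa`:

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA text."""
    lines = fastq_text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        header, seq = lines[i], lines[i + 1]
        # Replace the leading '@' with '>' and keep only the sequence line
        records.append(">" + header[1:] + "\n" + seq + "\n")
    return "".join(records)

fasta_text = fastq_to_fasta("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
```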

Working example

One-way ANOVA

bioinfokit.stat.oanova(table, res, xfac, ph, phalpha)

Parameters	Description
`table`	Pandas dataframe in stacked table format
`res`	Response variable (dependent variable) [string][default: None]
`xfac`	Treatments or groups or factors (independent variable) [string][default: None]
`ph`	Perform pairwise comparisons with Tukey HSD test [bool (True or False)] [default: False]
`phalpha`	Significance level for the Tukey HSD test [float (0 to 1)][default: 0.05]

Returns:

ANOVA summary, multiple pairwise comparisons, and assumption test statistics
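
The core of the ANOVA summary is the F statistic, the ratio of between-group to within-group mean squares. A minimal sketch using the textbook formulas and not bioinfokit's code; `one_way_anova_f` is a hypothetical helper, and the groups are made-up:

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

f_value, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

In a stacked table, each group would be the `res` values sharing one level of `xfac`.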

Working example

Manhattan plot

bioinfokit.visuz.marker.mhat(df, chr, pv, color, dim, r, ar, gwas_sign_line, gwasp, dotsize, markeridcol, markernames, gfont, valpha)

Parameters	Description
`df`	Pandas dataframe object with at least SNP, chromosome, and P-values columns
`chr`	Name of a column having chromosome numbers [string][default:None]
`pv`	Name of a column having P-values. Must be numeric column [string][default:None]
`color`	List of color names to plot. It accepts either two alternating colors or as many colors as there are chromosomes. If None is provided, a random color is assigned to each chromosome [list][default:None]
`dim`	Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]
`r`	Figure resolution in dpi [int][default: 300]
`ar`	Rotation of X-axis labels [float][default: 90]
`gwas_sign_line`	Plot the statistical significance threshold line defined by option `gwasp` [bool (True or False)][default: False]
`gwasp`	Statistical significant threshold to identify significant SNPs [float][default: 5E-08]
`dotsize`	The size of the dots in the plot [float][default: 8]
`markeridcol`	Name of a column having SNPs. This is necessary for plotting SNP names on the plot [string][default: None]
`markernames`	The list of SNPs to display on the plot. These SNPs should be present in the SNP column. Additionally, it accepts a dict of SNPs and their associated gene names. If this option is set to True, it will label all SNPs whose P-values are significant at the threshold defined by `gwasp` [string, list, dict][default: True]
`gfont`	Font size for SNP names to display on the plot [float][default: 8]
`valpha`	Transparency of points on plot [float (between 0 and 1)][default: 1.0]

Returns:

Manhattan plot image in same directory (manhatten.png)
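
A Manhattan plot shows each SNP at -log10 of its P-value, and `gwasp` sets the line above which SNPs count as significant. A minimal sketch of that transformation and filter, assuming a SNP is significant when its P-value is strictly below `gwasp`; `significant_snps` is a hypothetical helper, and the SNP names and P-values are made-up:

```python
import math

def significant_snps(pvalues, gwasp=5e-08):
    """Map each significant SNP to its -log10(P) plotting height."""
    return {snp: -math.log10(p) for snp, p in pvalues.items() if p < gwasp}

pvalues = {"rs1": 1e-9, "rs2": 0.03, "rs3": 4e-8}
hits = significant_snps(pvalues)
```

P-values of exactly zero would need to be clipped to a small positive number before taking the logarithm.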

Working example

Extract the sequences from the FASTA file

bioinfokit.analys.extract_seq(file, id)

Parameters	Description
`file`	Input FASTA file from which sequences are to be extracted
`id`	sequence ID file

Returns:

Extracted sequences in FASTA format file in same directory (out.fasta)
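
The extraction step can be sketched as a single pass over the FASTA text. This version assumes the ID is the first whitespace-delimited token of each header and takes the IDs as a Python collection, where the real function reads them from a file; `extract_sequences` is a hypothetical helper:

```python
def extract_sequences(fasta_text, ids):
    """Return the FASTA records whose header ID is in `ids`."""
    wanted = set(ids)
    kept, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Decide once per record, based on the first token of the header
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            kept.append(line)
    return "".join(line + "\n" for line in kept)

fasta = ">s1 desc\nACGT\nGG\n>s2\nTTTT\n>s3\nCC\n"
subset = extract_sequences(fasta, ["s1", "s3"])
```

Multi-line sequences are handled because the keep flag stays set until the next header line.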

Bar-dot plot

bioinfokit.visuz.stat.bardot(df, colorbar, colordot, bw, dim, r, ar, hbsize, errorbar, dotsize, markerdot, valphabar, valphadot)

Parameters	Description
`df`	Pandas dataframe object
`colorbar`	Color of bar graph [string or list][default:"#bbcfff"]
`colordot`	Color of dots on bar [string or list][default:"#ee8972"]
`bw`	Width of bar [float][default: 0.4]
`dim`	Figure size [tuple of two floats (width, height) in inches][default: (6, 4)]
`r`	Figure resolution in dpi [int][default: 300]
`ar`	Rotation of X-axis labels [float][default: 0]
`hbsize`	Horizontal bar size for standard error bars [float][default: 4]
`errorbar`	Draw standard error bars [bool (True or False)][default: True]
`dotsize`	The size of the dots in the plot [float][default: 6]
`markerdot`	Shape of the dot marker. See more options at https://matplotlib.org/3.1.1/api/markers_api.html [string][default: "o"]
`valphabar`	Transparency of bars on plot [float (between 0 and 1)][default: 1]
`valphadot`	Transparency of dots on plot [float (between 0 and 1)][default: 1]

Returns:

Bar-dot plot image in same directory (bardot.png)

Working Example

FASTQ quality format detection

bioinfokit.analys.format.fq_qual_var(file)

Parameters	Description
`file`	FASTQ file to detect quality format [default: None]

Returns:

Quality format encoding name for FASTQ file (Supports only Sanger, Illumina 1.8+ and Illumina 1.3/1.4)
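
The three supported encodings can often be told apart from the range of ASCII codes in the quality strings. A rough heuristic sketch, assuming Sanger uses codes 33 to 73, Illumina 1.8+ uses 33 to 74, and Illumina 1.3/1.4 uses 64 to 104. It is not bioinfokit's detection logic, and `guess_quality_format` is a hypothetical helper:

```python
def guess_quality_format(quality_lines):
    """Guess the encoding from the ASCII range of non-empty quality strings."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    low, high = min(codes), max(codes)
    if low < 64:
        # Codes below '@' (64) only occur with a Phred+33 offset
        return "Illumina 1.8+" if high > 73 else "Sanger"
    if high <= 104:
        return "Illumina 1.3/1.4"
    return "unknown"

fmt_new = guess_quality_format(["IIII!!JJ"])
fmt_sanger = guess_quality_format(["IIII##"])
fmt_old = guess_quality_format(["hhhhBB"])
```

A file whose reads are all high quality can be ambiguous, so a guess based on only a few reads should be treated with caution.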

Working Example

Release history

2.1.3

Aug 28, 2023

2.1.2

Aug 18, 2023

2.1.1

Jul 29, 2023

2.1.0

Sep 13, 2022

2.0.9

Sep 4, 2022

2.0.8

Nov 21, 2021

2.0.7

Nov 21, 2021

2.0.6

Aug 18, 2021

2.0.5

Aug 17, 2021

2.0.4

May 10, 2021

2.0.3

Apr 15, 2021

2.0.2

Mar 14, 2021

2.0.1

Mar 11, 2021

2.0.0

Mar 10, 2021

1.0.9

Mar 7, 2021

1.0.8

Feb 14, 2021

1.0.7

Jan 30, 2021

1.0.6

Jan 29, 2021

1.0.5

Dec 22, 2020

1.0.4

Nov 24, 2020

1.0.3

Nov 6, 2020

1.0.2

Oct 26, 2020

1.0.1

Oct 24, 2020

1.0.0

Oct 10, 2020

0.9.9

Oct 5, 2020

0.9.8

Sep 26, 2020

0.9.7

Sep 19, 2020

0.9.6

Aug 23, 2020

0.9.5

Aug 14, 2020

0.9.4

Aug 13, 2020

0.9.3

Aug 8, 2020

0.9.2

Jul 31, 2020

0.9.1

Jul 30, 2020

0.9

Jul 29, 2020

0.8.9

Jul 28, 2020

0.8.8

Jul 2, 2020

0.8.7

Jul 2, 2020

0.8.6

Jun 27, 2020

0.8.5

Jun 22, 2020

0.8.4

Jun 17, 2020

0.8.3

Jun 4, 2020

0.8.2

Jun 1, 2020

0.8

May 24, 2020

0.7.3

May 14, 2020

0.7.2

May 9, 2020

0.7.1

Apr 25, 2020

0.7

Apr 17, 2020

0.6

Apr 10, 2020

0.5

Mar 30, 2020

This version

0.4

Mar 12, 2020

0.3

Mar 11, 2020

Download files

Source Distribution

bioinfokit-0.4.tar.gz (19.8 kB)

Uploaded Mar 12, 2020 Source

Hashes for bioinfokit-0.4.tar.gz
Algorithm	Hash digest
SHA256	`ddbacc24428b66ebad6484256558814c93c5bb43f5bef8bab966cfd5e5737be9`
MD5	`b0d0e46639a017ed1a5cdfb30d00f731`
BLAKE2b-256	`9d80506b63f846c4698e9da552e0ac522a460a1ece03eaefdc52f5cf3766317a`