govcf

govcf - Variant Call File "call" generator

This is a proprietary package that is available from GenomOncology and works with our Knowledge Management System.

For more information about licensing please contact us at:

info@genomoncology.com

Additional proprietary projects available for download via pypi include:

  • GO SDK - GenomOncology Software Development Kit
  • GO CLI - GenomOncology Command Line Interface

Our open source projects include:

  • Related - Nested Object Models in Python with dictionary, YAML, and JSON transformation support
  • Specd - Swagger v2 Specification Directories
  • Rigor - HTTP-based DSL for validating RESTful APIs

Overview

govcf is GenomOncology's Variant Call File (VCF) generator, built on top of the VCF parser in the pysam project. The generator yields two record types, distinguished by the __type__ dictionary key:

  • Header (1 per VCF file)
  • Call (1 per unique sample alt)

The header includes the following information:

  • __child__: the type of the records that will follow the header.
  • config: any configuration fields provided to the generator.
  • file_path: the file location of the VCF.
  • formats: the metadata of the FORMAT fields in the header.
  • info: the metadata of the INFO fields in the header.
  • types: the type of each field found in INFO or FORMAT.
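
Because both record types arrive on the same stream, a consumer usually branches on __type__. The following is a minimal sketch, assuming the records are plain dictionaries shaped as described above. The helper split_records and the sample data are hypothetical and are not part of govcf.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, Any]


def split_records(records: Iterable[Record]) -> Tuple[Optional[Record], List[Record]]:
    """Separate the single HEADER record from the CALL records that follow it."""
    header: Optional[Record] = None
    calls: List[Record] = []
    for record in records:
        kind = record.get("__type__")
        if kind == "HEADER":
            header = record
        elif kind == "CALL":
            calls.append(record)
    return header, calls


# Hand-written stand-ins for generator output (illustrative values only).
sample_stream = [
    {"__type__": "HEADER", "__child__": "CALL", "file_path": "spec.vcf"},
    {"__type__": "CALL", "chr": "20", "start": 14370, "ref": "G", "alt": "A"},
]
header, calls = split_records(sample_stream)
print(header["__child__"], len(calls))
```

In real use, the list passed to split_records would presumably be replaced by the generator itself.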

A call represents a single ALT allele for a given sample. For each VCF record, the generator iterates over the samples and yields one call for each unique ALT index specified by the sample's GT (genotype) field.
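
The rule above can be illustrated without pysam. This is a simplified sketch, not govcf's implementation: alt_calls_from_gt is a hypothetical function that takes a raw GT string and the record's ALT list, and it assumes that a genotype is heterozygous when it contains more than one distinct allele index and that missing alleles ('.') are ignored.

```python
import re
from typing import Dict, List


def alt_calls_from_gt(gt: str, alts: List[str]) -> List[Dict[str, object]]:
    """Return one entry per unique non-reference allele index in a GT string."""
    is_phased = "|" in gt
    # Split on either separator; '.' marks a missing allele and is skipped.
    tokens = re.split(r"[|/]", gt)
    indexes = [int(token) for token in tokens if token != "."]
    # Assumption: heterozygous means more than one distinct allele index.
    is_het = len(set(indexes)) > 1
    calls: List[Dict[str, object]] = []
    seen = set()
    for index in indexes:
        if index == 0 or index in seen:
            continue  # skip REF and ALT indexes that were already emitted
        seen.add(index)
        calls.append({"alt": alts[index - 1], "is_het": is_het, "is_phased": is_phased})
    return calls


# GT "1|2" with ALT "G,T" produces two calls; "0|0" produces none.
print(alt_calls_from_gt("1|2", ["G", "T"]))
print(alt_calls_from_gt("0|0", ["A"]))
```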

A call includes the following fields:

  • alt: alternate allele
  • chr: chromosome
  • filters: filters provided, including None for '.'
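  • info: INFO field values, combined with the sample's FORMAT values (in the example output, a sample's DP appears in place of the record-level DP)
  • is_het: boolean that is true when the genotype is heterozygous (e.g. 0/1)
  • is_phased: boolean that indicates whether the genotype is phased (|) or unphased (/)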
  • quality: quality value
  • ref: reference allele
  • rs_id: ID field
  • sample_name: name of the sample column
  • start: start position
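
As an illustration of working with these fields, the sketch below groups calls by sample and labels each with a compact identifier. It assumes only the field names listed above. The helpers variant_key and group_by_sample and the key format are hypothetical choices, not something govcf provides.

```python
from collections import defaultdict
from typing import Any, Dict, List


def variant_key(call: Dict[str, Any]) -> str:
    """Build a compact identifier from a call's locus and alleles."""
    return f"{call['chr']}:{call['start']}:{call['ref']}>{call['alt']}"


def group_by_sample(calls: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each sample name to the variant keys called for it, in input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for call in calls:
        grouped[call["sample_name"]].append(variant_key(call))
    return dict(grouped)


# Abbreviated call dictionaries (only the fields used here).
calls = [
    {"chr": "20", "start": 14370, "ref": "G", "alt": "A", "sample_name": "NA00002"},
    {"chr": "20", "start": 14370, "ref": "G", "alt": "A", "sample_name": "NA00003"},
    {"chr": "20", "start": 17330, "ref": "T", "alt": "A", "sample_name": "NA00002"},
]
print(group_by_sample(calls))
```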

This package also provides a class called BEDFilter, which can be passed to the iterator functions. It filters records by chromosome and start position, so that only calls falling within the ranges specified by the BED file are yielded.
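
To make the idea of range filtering concrete, here is a minimal stand-alone sketch. SimpleBedRegions is a hypothetical class and is not govcf's BEDFilter, whose internals and coordinate handling are not documented here. The sketch assumes standard BED conventions (0-based, half-open intervals) and 1-based VCF positions.

```python
from typing import Dict, Iterable, List, Tuple


class SimpleBedRegions:
    """Hypothetical stand-in for a BED-based filter; not govcf's BEDFilter."""

    def __init__(self, bed_lines: Iterable[str]) -> None:
        self.regions: Dict[str, List[Tuple[int, int]]] = {}
        for line in bed_lines:
            line = line.strip()
            # Skip blank lines, comments, and UCSC track/browser lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            self.regions.setdefault(chrom, []).append((int(start), int(end)))

    def contains(self, chrom: str, pos: int) -> bool:
        """True when the 1-based position falls inside a BED interval.

        BED intervals are 0-based and half-open, so [start, end) covers the
        1-based positions start + 1 through end.
        """
        return any(start < pos <= end for start, end in self.regions.get(chrom, []))


regions = SimpleBedRegions(["20\t14000\t15000", "20\t1110000\t1111000"])
calls = [{"chr": "20", "start": 14370}, {"chr": "20", "start": 17330}]
kept = [call for call in calls if regions.contains(call["chr"], call["start"])]
print(kept)
```

A linear scan per lookup is adequate for small panels; a sorted list with binary search or an interval tree would be the usual choice for large BED files.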

Quick Example

The following example shows the result of parsing the sample VCF given at the top of the VCF Specification document:

https://samtools.github.io/hts-specs/VCFv4.2.pdf

Here is the VCF:

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G;H2	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

Here is some example Python code:

from govcf import iterate_vcf_calls, BEDFilter
from pprint import pprint

bed_filter = BEDFilter("panel.bed")

for record in iterate_vcf_calls("tests/vcfs/spec.vcf", bed_filter=bed_filter):
    pprint(record)

Yields the following results:

{'__child__': 'CALL',
 '__type__': 'HEADER',
 'config': {'include_vaf': True},
 'file_path': '/Users/ian/code/govcf/tests/vcfs/spec.vcf',
 'formats': {'DP': {'description': 'Read Depth',
                    'id': 2,
                    'name': 'DP',
                    'number': 1,
                    'type': 'Integer'},
             'GQ': {'description': 'Genotype Quality',
                    'id': 10,
                    'name': 'GQ',
                    'number': 1,
                    'type': 'Integer'},
             'GT': {'description': 'Genotype',
                    'id': 9,
                    'name': 'GT',
                    'number': 1,
                    'type': 'String'},
             'HQ': {'description': 'Haplotype Quality',
                    'id': 11,
                    'name': 'HQ',
                    'number': 2,
                    'type': 'Integer'}},
 'info': {'AA': {'description': 'Ancestral Allele',
                 'id': 4,
                 'name': 'AA',
                 'number': 1,
                 'type': 'String'},
          'AF': {'description': 'Allele Frequency',
                 'id': 3,
                 'name': 'AF',
                 'number': 'A',
                 'type': 'Float'},
          'DB': {'description': 'dbSNP membership, build 129',
                 'id': 5,
                 'name': 'DB',
                 'number': 0,
                 'type': 'Flag'},
          'DP': {'description': 'Total Depth',
                 'id': 2,
                 'name': 'DP',
                 'number': 1,
                 'type': 'Integer'},
          'H2': {'description': 'HapMap2 membership',
                 'id': 6,
                 'name': 'H2',
                 'number': 0,
                 'type': 'Flag'},
          'NS': {'description': 'Number of Samples With Data',
                 'id': 1,
                 'name': 'NS',
                 'number': 1,
                 'type': 'Integer'}},
 'types': {'AA': 'string',
           'AF': 'float',
           'DB': 'boolean',
           'DP': 'int',
           'GQ': 'int',
           'H2': 'boolean',
           'HQ': 'mint',
           'NS': 'int'}}
{'__type__': 'CALL',
 'alt': 'A',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AF': 0.5,
          'DB': True,
          'DP': 8,
          'GQ': 48,
          'H2': True,
          'HQ': (51, 51),
          'NS': 3},
 'is_het': True,
 'is_phased': True,
 'quality': 29.0,
 'ref': 'G',
 'rs_id': 'rs6054257',
 'sample_name': 'NA00002',
 'start': 14370}
{'__type__': 'CALL',
 'alt': 'A',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AF': 0.5,
          'DB': True,
          'DP': 5,
          'GQ': 43,
          'H2': True,
          'HQ': (None, None),
          'NS': 3},
 'is_het': False,
 'is_phased': False,
 'quality': 29.0,
 'ref': 'G',
 'rs_id': 'rs6054257',
 'sample_name': 'NA00003',
 'start': 14370}
{'__type__': 'CALL',
 'alt': 'A',
 'chr': '20',
 'filters': ['q10'],
 'info': {'AF': 0.017000000923871994,
          'DP': 5,
          'GQ': 3,
          'HQ': (65, 3),
          'NS': 3},
 'is_het': True,
 'is_phased': True,
 'quality': 3.0,
 'ref': 'T',
 'rs_id': None,
 'sample_name': 'NA00002',
 'start': 17330}
{'__type__': 'CALL',
 'alt': 'G',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AA': 'T',
          'AF': 0.3330000042915344,
          'DB': True,
          'DP': 6,
          'GQ': 21,
          'HQ': (23, 27),
          'NS': 2},
 'is_het': True,
 'is_phased': True,
 'quality': 67.0,
 'ref': 'A',
 'rs_id': 'rs6040355',
 'sample_name': 'NA00001',
 'start': 1110696}
{'__type__': 'CALL',
 'alt': 'T',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AA': 'T',
          'AF': 0.6669999957084656,
          'DB': True,
          'DP': 6,
          'GQ': 21,
          'HQ': (23, 27),
          'NS': 2},
 'is_het': True,
 'is_phased': True,
 'quality': 67.0,
 'ref': 'A',
 'rs_id': 'rs6040355',
 'sample_name': 'NA00001',
 'start': 1110696}
{'__type__': 'CALL',
 'alt': 'G',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AA': 'T',
          'AF': 0.3330000042915344,
          'DB': True,
          'DP': 0,
          'GQ': 2,
          'HQ': (18, 2),
          'NS': 2},
 'is_het': True,
 'is_phased': True,
 'quality': 67.0,
 'ref': 'A',
 'rs_id': 'rs6040355',
 'sample_name': 'NA00002',
 'start': 1110696}
{'__type__': 'CALL',
 'alt': 'T',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AA': 'T',
          'AF': 0.6669999957084656,
          'DB': True,
          'DP': 0,
          'GQ': 2,
          'HQ': (18, 2),
          'NS': 2},
 'is_het': True,
 'is_phased': True,
 'quality': 67.0,
 'ref': 'A',
 'rs_id': 'rs6040355',
 'sample_name': 'NA00002',
 'start': 1110696}
{'__type__': 'CALL',
 'alt': 'T',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AA': 'T',
          'AF': 0.6669999957084656,
          'DB': True,
          'DP': 4,
          'GQ': 35,
          'HQ': (None,),
          'NS': 2},
 'is_het': False,
 'is_phased': False,
 'quality': 67.0,
 'ref': 'A',
 'rs_id': 'rs6040355',
 'sample_name': 'NA00003',
 'start': 1110696}
{'__type__': 'CALL',
 'alt': 'G',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AA': 'G', 'DP': 4, 'GQ': 35, 'H2': True, 'NS': 3},
 'is_het': True,
 'is_phased': False,
 'quality': 50.0,
 'ref': 'GTC',
 'rs_id': 'microsat1',
 'sample_name': 'NA00001',
 'start': 1234567}
{'__type__': 'CALL',
 'alt': 'GTCT',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AA': 'G', 'DP': 2, 'GQ': 17, 'H2': True, 'NS': 3},
 'is_het': True,
 'is_phased': False,
 'quality': 50.0,
 'ref': 'GTC',
 'rs_id': 'microsat1',
 'sample_name': 'NA00002',
 'start': 1234567}
{'__type__': 'CALL',
 'alt': 'G',
 'chr': '20',
 'filters': ['PASS'],
 'info': {'AA': 'G', 'DP': 3, 'GQ': 40, 'H2': True, 'NS': 3},
 'is_het': False,
 'is_phased': False,
 'quality': 50.0,
 'ref': 'GTC',
 'rs_id': 'microsat1',
 'sample_name': 'NA00003',
 'start': 1234567}
