libPoMo.vcf

This module provides functions to read, write and access VCF files.

Objects

Classes:
  • NucBase, store a nucleotide base
  • VCFStream, a variant call format (VCF) stream object
  • VCFSeq, a VCF file sequence object
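Exception Classes:
  • NotANucBaseError, raised if a nucleotide base is not valid
  • NotAVariantCallFormatFileError, raised if a VCF file is not valid
Functions:
  • check_fixed_field_header(), check if a line is the header of the fixed fields
  • get_header_line_string(), return a standard VCF file header string
  • get_indiv_from_field_header(), return species from a fixed field header line
  • get_nuc_base_from_line(), retrieve base data from a VCF file line
  • init_seq(), open a (gzipped) VCF4.1 file and initialize a VCFStream
  • open_seq(), open a VCF4.1 file and return a VCFSeq
  • update_base(), read a line into an existing NucBase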

exception libPoMo.vcf.NotANucBaseError[source]

Exception raised if given nucleotide base is not valid.

exception libPoMo.vcf.NotAVariantCallFormatFileError[source]

Exception raised if given VCF file is not valid.

class libPoMo.vcf.NucBase[source]

Stores a nucleotide base.

FIXME: Bases are split by ‘/’. They should also be split by ‘|’.

A class that stores a single nucleotide base and related information retrieved from a VCF file. Please see http://www.1000genomes.org/ for a detailed description of the VCF format.
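
The FIXME note above concerns the genotype separator: VCF writes unphased genotypes with ‘/’ and phased genotypes with ‘|’. A minimal sketch of splitting on both is given here. It does not use libPoMo, and split_genotype is a hypothetical helper, not a function of this module. It assumes the genotype (GT) is the first colon-separated subfield of the sample column.

```python
import re


def split_genotype(sample_field):
    """Split the genotype of a VCF sample field on '/' or '|'.

    Hypothetical helper for illustration; not part of libPoMo.
    """
    # The genotype is the first colon-separated subfield, e.g. "0|1" in "0|1:35:4".
    gt = sample_field.split(":", 1)[0]
    # Inside a character class, '|' is a literal character, not alternation.
    return re.split(r"[/|]", gt)


print(split_genotype("0/1:35:4"))  # ['0', '1']
print(split_genotype("1|0"))       # ['1', '0']
print(split_genotype("0"))         # ['0'] (haploid call)
```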

Variables:
  • chrom (str) – Chromosome name.
  • pos (int) – 1-based position on the chromosome.
  • id (str) – ID.
  • ref (str) – Reference base.
  • alt (str) – Alternative base(s).
  • qual (str) – Quality.
  • filter (str) – Filter.
  • info (str) – Additional information.
  • format (str) – String with format specification.
  • speciesData ([str]) – List with strings of the species data (e.g. 0/1:...).
  • ploidy (int) – Ploidy (number of sets of chromosomes) of the sequenced individuals. Can be set with set_ploidy().
get_alt_base_list()[source]

Return alternative bases as a list.

get_base_ind(iI, iC)[source]

Return the base of a specific individual.

Parameters:
  • iI (int) – 0-based index of individual.
  • iC (int) – 0-based index of chromosome (for n-ploid individuals).
Return type:

character with nucleotide base.
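
As an illustration of what such a lookup involves, the sketch below resolves a genotype index to a base: index 0 refers to the reference base and index k ≥ 1 to the k-th alternative base. This is an assumption-laden stand-in, not the code of get_base_ind(); base_of_individual and its arguments are hypothetical, and returning None for missing calls is a choice made for this sketch only.

```python
def base_of_individual(ref, alt, species_data, indiv, chrom):
    """Return the base of individual `indiv` on chromosome copy `chrom`.

    Hypothetical helper for illustration; returns None for missing calls.
    """
    alleles = [ref] + alt.split(",")            # index 0 is REF, 1.. are ALT
    gt = species_data[indiv].split(":", 1)[0]   # genotype subfield, e.g. "0/1"
    calls = gt.replace("|", "/").split("/")
    call = calls[chrom]
    # "." marks a missing call; anything out of range is treated the same way.
    if not call.isdigit() or int(call) >= len(alleles):
        return None
    return alleles[int(call)]


samples = ["0/1", "2|2", "./."]
print(base_of_individual("A", "G,T", samples, 0, 1))  # G
print(base_of_individual("A", "G,T", samples, 1, 0))  # T
print(base_of_individual("A", "G,T", samples, 2, 0))  # None
```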

get_info()[source]

Return nucleotide base information string.

get_ref_base()[source]

Return reference base.

Return type: char
get_speciesData()[source]

Return species data as a list.

  • data[0][0] = data of first species/individual on chromatid A

  • data[0][1] = only set for non-haploids; data of first species/individual on chromatid B

Sets data[i][j] to None if the base of individual i on chromosome j could not be read (e.g. it is not valid).

Return type: matrix of integers
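
The layout of the returned matrix can be sketched as follows. The function species_matrix is hypothetical and only mirrors the description above (one row per individual, one column per chromosome copy, None for unreadable entries); the actual implementation may differ.

```python
def species_matrix(species_data, ploidy):
    """Build data[i][j]: allele index of individual i on chromosome copy j.

    Hypothetical helper for illustration; unreadable entries become None.
    """
    matrix = []
    for sample_field in species_data:
        gt = sample_field.split(":", 1)[0]
        calls = gt.replace("|", "/").split("/")
        row = []
        for j in range(ploidy):
            # Missing (".") or absent calls cannot be read as integers.
            if j < len(calls) and calls[j].isdigit():
                row.append(int(calls[j]))
            else:
                row.append(None)
        matrix.append(row)
    return matrix


print(species_matrix(["0/1:12", "1/1", "./0"], ploidy=2))
# [[0, 1], [1, 1], [None, 0]]
```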
print_info()[source]

Print nucleotide base information.

Print the stored single nucleotide base and related information from the VCF file.

purge()[source]

Purge the data associated with this NucBase.

set_ploidy()[source]

Set self.ploidy.

class libPoMo.vcf.VCFSeq[source]

Store data retrieved from a VCF file.

Initialized with open_seq().

Variables:
  • name (str) – Sequence name.
  • header (str) – Sequence header.
  • speciesL ([str]) – List with species / individuals.
  • nSpecies (int) – Number of species / individuals.
  • baseL ([NucBase]) – List with stored NucBase objects.
  • nBases (int) – Number of NucBase objects stored.
append_nuc_base(base)[source]

Append base, a given NucBase, to the VCFSeq object.

get_header_line_string(indiv)[source]

Return a standard VCF file header string with individuals indiv.

get_nuc_base(chrom, pos)[source]

Return base at position pos of chromosome chrom.

has_base(chrom, pos)[source]

Return True if the base is found, False otherwise.

Parameters:
  • chrom (str) – Chromosome name.
  • pos (int) – 1-based position on chrom.
print_header_line(indiv)[source]

Print a standard VCF file header with individuals indiv.

print_info(maxB=50, printHeader=False)[source]

Print VCF sequence information.

Print the VCF header, the total number of nucleotides and at most maxB bases (defaults to 50). The header is only printed if printHeader=True is given.

class libPoMo.vcf.VCFStream(seqName, vcfFileObject, speciesList, firstBase)[source]

Store base data from a VCF file line by line.

It can be initialized with init_seq(). This class stores a single base retrieved from a VCF file and the file itself. It is used to parse through a VCF file line by line, processing the bases without having to read the whole file at once.

Parameters:
  • seqName (str) – Name of the stream.
  • vcfFileObject (fo) – File object associated with the stream.
  • speciesList ([str]) – List with species / individuals.
  • firstBase (NucBase) – First NucBase to be saved.
Variables:
  • name (str) – Name of the stream.
  • fo (fo) – Stored VCF file object.
  • speciesL ([str]) – List with species / individuals.
  • nSpecies (int) – Number of species / individuals.
  • base (NucBase) – Stored NucBase.
close()[source]

Closes the linked file.

print_info()[source]

Prints VCFStream information.

read_next_base()[source]

Read the next base.

Return position of next base.

Raise a ValueError if no next base is found.
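
The streaming pattern described for this class (hold one record, advance on request, signal the end with ValueError) can be sketched with the standard library alone. MiniStream below is a toy stand-in and not libPoMo's VCFStream; it only keeps the chromosome and position of the current record.

```python
import io


class MiniStream:
    """Toy line-by-line reader that holds only the current record."""

    def __init__(self, file_object):
        self.fo = file_object
        self.base = None  # (chrom, pos) of the current record

    def read_next_base(self):
        """Advance by one line and return the position of the new record."""
        line = self.fo.readline()
        if not line.strip():
            raise ValueError("No next base found.")
        fields = line.rstrip("\n").split("\t")
        self.base = (fields[0], int(fields[1]))
        return self.base[1]

    def close(self):
        self.fo.close()


data = (
    "1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\n"
    "1\t250\t.\tC\tT\t.\t.\t.\tGT\t1/1\n"
)
stream = MiniStream(io.StringIO(data))
positions = []
try:
    while True:
        positions.append(stream.read_next_base())
except ValueError:
    pass  # end of the stream reached
finally:
    stream.close()

print(positions)  # [100, 250]
```

Only one line is held in memory at a time, which is the point of a stream object for large files.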

libPoMo.vcf.check_fixed_field_header(ln)[source]

Check if the given line ln is the header of the fixed fields.

Sample header line:

#CHROM     POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SpeciesL
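
A check of this kind can be sketched as a comparison against the nine fixed column names. The helper looks_like_fixed_field_header is hypothetical and returns a boolean; how the library itself reports a mismatch is not shown here.

```python
FIXED_FIELDS = ["#CHROM", "POS", "ID", "REF", "ALT",
                "QUAL", "FILTER", "INFO", "FORMAT"]


def looks_like_fixed_field_header(ln):
    """Return True if `ln` starts with the nine fixed VCF column names.

    Hypothetical helper for illustration; splits on any whitespace.
    """
    return ln.split()[:len(FIXED_FIELDS)] == FIXED_FIELDS


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSheep\tGoat"
print(looks_like_fixed_field_header(header))                  # True
print(looks_like_fixed_field_header("##fileformat=VCFv4.1"))  # False
```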
libPoMo.vcf.get_header_line_string(indiv)[source]

Return a standard VCF file header string with individuals indiv.

libPoMo.vcf.get_indiv_from_field_header(ln)[source]

Return species from a fixed field header line ln.

Sample header line:

#CHROM     POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SpeciesL
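
The two header helpers above are inverse operations: one builds the header line from a list of individuals, the other recovers the individuals from such a line. A minimal round-trip sketch, assuming tab-separated columns and using the hypothetical names build_header_line and individuals_from_header:

```python
FIXED_FIELDS = ["#CHROM", "POS", "ID", "REF", "ALT",
                "QUAL", "FILTER", "INFO", "FORMAT"]


def build_header_line(indiv):
    """Join the fixed column names and the individuals with tabs."""
    return "\t".join(FIXED_FIELDS + list(indiv))


def individuals_from_header(ln):
    """Return the column names that follow the FORMAT column."""
    return ln.rstrip("\n").split("\t")[len(FIXED_FIELDS):]


header = build_header_line(["Sheep", "Goat"])
print(individuals_from_header(header))  # ['Sheep', 'Goat']
```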
libPoMo.vcf.get_nuc_base_from_line(ln, info=False, ploidy=None)[source]

Retrieve base data from a VCF file line ln.

Splits a given VCF file line and returns a NucBase object. If info is set to False, only #CHROM, POS, REF, ALT and speciesData will be read.

Parameters:
  • info (Bool) – Determines if info is retrieved from ln.
  • ploidy (int) – If ploidy is known and given, it is set.
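
The effect of the info flag can be illustrated with a small parser that returns a dictionary instead of a NucBase. This is a sketch under the assumption of tab-separated lines with at least one sample column; parse_vcf_line is hypothetical, and the dictionary keys merely follow the variable names documented for NucBase.

```python
def parse_vcf_line(ln, info=False):
    """Split a VCF data line into a dictionary of fields.

    Hypothetical helper for illustration; not libPoMo code.
    """
    fields = ln.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected 9 fixed columns and at least one sample column")
    record = {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "ref": fields[3],
        "alt": fields[4],
        "speciesData": fields[9:],
    }
    if info:
        # The remaining fixed columns are only stored on request.
        record.update(id=fields[2], qual=fields[5], filter=fields[6],
                      info=fields[7], format=fields[8])
    return record


line = "chr1\t1234\trs1\tA\tG,T\t50\tPASS\tDP=10\tGT\t0/1\t2/2"
print(parse_vcf_line(line)["speciesData"])        # ['0/1', '2/2']
print(parse_vcf_line(line, info=True)["filter"])  # PASS
```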
libPoMo.vcf.init_seq(VCFFileName, maxskip=100, name=None)[source]

Open a (gzipped) VCF4.1 file.

Tries to open the given VCF file and checks if it is in VCF format. Initializes a VCFStream object that contains the first base.

Please close the associated file object with VCFStream.close() when you don’t need it anymore.

Parameters:
  • VCFFileName (str) – Name of the VCF file.
  • maxskip (int) – Only look maxskip lines for the start of the bases (defaults to 100).
  • name (str) – Set the name of the sequence to name, otherwise set it to the filename.
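
The opening step described above (accept a plain or gzipped file, then search a bounded number of lines for the start of the bases) can be sketched with the standard library. The helpers open_text and skip_to_header are hypothetical; in particular, raising ValueError and counting maxskip as the number of lines read are assumptions of this sketch, not documented behaviour of init_seq().

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed text file, chosen by the gzip magic bytes."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def skip_to_header(fo, maxskip=100):
    """Read at most `maxskip` lines and return the '#CHROM' header line."""
    for _ in range(maxskip):
        line = fo.readline()
        if line.startswith("#CHROM"):
            return line.rstrip("\n")
    raise ValueError("no fixed field header within the first %d lines" % maxskip)


# Demonstration on a small temporary gzipped file.
text = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSheep\n"
    "1\t5\t.\tA\tG\t.\t.\t.\tGT\t0/1\n"
)
path = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
with gzip.open(path, "wt") as out:
    out.write(text)

fo = open_text(path)
header = skip_to_header(fo)
first_record = fo.readline()  # the file object now points at the first base
fo.close()

print(header.split("\t")[9:])  # ['Sheep']
```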
libPoMo.vcf.open_seq(VCFFileName, maxskip=100, name=None)[source]

Open a VCF4.1 file.

Tries to open the given VCF file, checks if it is in VCF format and reads the base(s). It returns a VCFSeq object that contains all the information.

Parameters:
  • VCFFileName (str) – Name of the VCF file.
  • maxskip (int) – Only look maxskip lines for the start of the bases (defaults to 100).
  • name (str) – Set the name of the sequence to name, otherwise set it to the filename.
libPoMo.vcf.update_base(ln, base, info=True)[source]

Read line ln into base base.

Splits a given VCF file line and returns a NucBase object. If info is set to False, only #CHROM, POS, REF, ALT and speciesData will be read.