libPoMo.seqbase

This module provides basic functions and classes needed to work with sequence data.

Objects

Classes:
  • Seq, stores a single sequence
  • Region, region in a genome
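Exception Classes:
  • NotAValidRefBase, reference base is not valid
  • SequenceDataError, general sequence data error
Functions:
  • gz_open(), open a file with io.open() or gzip.open()
  • stripFName(), strip the ".xyz" ending off a filename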

exception libPoMo.seqbase.NotAValidRefBase[source]

Reference base is not valid.

class libPoMo.seqbase.Region(chrom, start, end, name=None, orientation='+')[source]

Region in a genome.

The start and end points need to be given 1-based and are converted to 0-based positions that are used internally to save all positional data.

Parameters:
  • chrom (str) – Chromosome name.
  • start (int) – 1-based start position.
  • end (int) – 1-based end position.
  • name (str) – Optional, region name.
Variables:
  • chrom (str) – Chromosome name.
  • start (int) – 0-based start position.
  • end (int) – 0-based end position.
  • name (str) – Region name.
print_info()[source]

Print information about the region.
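
The coordinate handling described for Region can be illustrated with a minimal sketch. RegionSketch is a hypothetical stand-in, not the libPoMo class; it assumes that both start and end are shifted by one and that the end position stays inclusive.

```python
class RegionSketch:
    """Hypothetical stand-in for Region; stores 0-based positions."""

    def __init__(self, chrom, start, end, name=None, orientation="+"):
        self.chrom = chrom
        # Assumption: the 1-based input is shifted by one at both ends.
        self.start = start - 1
        self.end = end - 1
        self.name = name
        self.orientation = orientation


# 1-based input as it appears in a UCSC-style description.
region = RegionSketch("chr1", 58954, 59871)
```

With this convention, region.start can be used directly as a Python string index into the chromosome sequence.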

class libPoMo.seqbase.Seq[source]

A class that stores sequence data.

Variables:
  • name (str) – Name of the sequence (e.g. species or individual name).
  • descr (str) – Description of the sequence.
  • data (str) – String with sequence data.
  • dataLen (int) – Number of saved bases.
  • rc (Boolean) – True if self.data stores the reverse-complement of the real sequence.
get_base(pos)[source]

Returns base at 1-based position pos.

get_exon_nr()[source]

Try to find the current and the total exon number of the sequence.

Extract the exon number and the total number of exons, if the name of the sequence is of the form (cf. UCSC Table Browser):

>CCDS3.1_hg18_2_19
Return type:(int nEx, int nExTot)
Raises:SequenceDataError, if the format of the sequence name is invalid.
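
As a rough illustration of the name format, the sketch below extracts both numbers from such a name. The function parse_exon_nr is hypothetical; it assumes that the last two underscore-separated fields are the exon number and the total number of exons, and it raises ValueError where libPoMo raises SequenceDataError.

```python
def parse_exon_nr(name):
    """Return (nEx, nExTot) from a name such as 'CCDS3.1_hg18_2_19'."""
    fields = name.split("_")
    try:
        # Assumption: the last two fields are exon number and exon count.
        n_ex = int(fields[-2])
        n_ex_tot = int(fields[-1])
    except (IndexError, ValueError):
        raise ValueError("Invalid sequence name: " + repr(name))
    return n_ex, n_ex_tot


n_ex, n_ex_tot = parse_exon_nr("CCDS3.1_hg18_2_19")
```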
get_in_frame()[source]

Try to find the inFrame of the gene.

inFrame: the frame number of the first nucleotide in the exon. Frame numbers can be 0, 1, or 2 depending on what position that nucleotide takes in the codon which contains it. This function gets the inFrame, if the description of the sequence is of the form (cf. UCSC Table Browser):

918 0 0 chr1:58954-59871+
Return type:int
Raises:SequenceDataError, if format of description is invalid.
get_out_frame()[source]

Try to find the outFrame of the gene.

outFrame: the frame number of the last nucleotide in the exon. Frame numbers can be 0, 1, or 2 depending on what position that nucleotide takes in the codon which contains it. This function gets the outFrame, if the description of the sequence is of the form (cf. UCSC Table Browser):

918 0 0 chr1:58954-59871+
Return type:int
Raises:SequenceDataError, if format of description is invalid.
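
Both frame numbers can be read from the same description line. The sketch below is not the libPoMo code; the function parse_frames is hypothetical and assumes the whitespace-separated fields are, in order: length, inFrame, outFrame, region.

```python
def parse_frames(descr):
    """Return (inFrame, outFrame) from a UCSC-style description."""
    fields = descr.split()
    try:
        # Assumed field order: length, inFrame, outFrame, region.
        in_frame, out_frame = int(fields[1]), int(fields[2])
    except (IndexError, ValueError):
        raise ValueError("Invalid description: " + repr(descr))
    if in_frame not in (0, 1, 2) or out_frame not in (0, 1, 2):
        raise ValueError("Frame numbers must be 0, 1, or 2: " + repr(descr))
    return in_frame, out_frame


frames = parse_frames("918 0 0 chr1:58954-59871+")
```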
get_rc()[source]

Return True if the sequence is reversed and complemented.

Return type:Boolean
get_region()[source]

Try to find the Region that the sequence spans.

The sequence might not physically start at position 1 but at some arbitrary value that is indicated in the sequence description. This function gets this physical Region, if the description of the sequence is of the form (cf. UCSC Table Browser):

918 0 0 chr1:58954-59871+
Raises:SequenceDataError, if format of description is invalid.
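
The region field at the end of the description can be taken apart with a regular expression. This is a minimal sketch under the assumption that the field has the form chrom:start-end followed by a strand character; parse_region is a hypothetical helper and returns the 1-based positions as written, which is what the Region constructor expects.

```python
import re

# chrom:start-end followed by the strand, e.g. "chr1:58954-59871+"
_REGION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)([+-])$")


def parse_region(descr):
    """Return (chrom, start, end, orientation); positions stay 1-based."""
    fields = descr.split()
    match = _REGION_RE.match(fields[-1]) if fields else None
    if match is None:
        raise ValueError("Invalid description: " + repr(descr))
    chrom, start, end, orientation = match.groups()
    return chrom, int(start), int(end), orientation


parsed = parse_region("918 0 0 chr1:58954-59871+")
```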
get_region_no_description(offset=0)[source]

Get the region of the sequence.

If no regional information is available in the sequence description (cf. get_region()), the position of the first base in the reference genome can be given manually. E.g., if the first base of the sequence does not correspond to the first but to the 11th base of the reference sequence, the offset should be 10.

The name of the chromosome will be set to the name of the sequence.

Parameters:offset (int) – Optional, offset of the sequence.
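
The offset arithmetic can be sketched as follows. The helper region_from_offset is hypothetical and assumes an inclusive end position; it returns 1-based positions, so an offset of 10 yields a start of 11, as in the example above.

```python
def region_from_offset(name, data_len, offset=0):
    """Return (chrom, start, end) with 1-based, inclusive positions."""
    start = offset + 1       # first base of the sequence in the reference
    end = offset + data_len  # last base of the sequence (assumed inclusive)
    # The sequence name doubles as the chromosome name.
    return name, start, end


span = region_from_offset("chr1", data_len=5, offset=10)
```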
is_synonymous(pos)[source]

Return True if the base at pos is 4-fold degenerate.

This function checks if the base at pos is a synonymous one. The description of the sequence has to be of the form (cf. UCSC Table Browser):

918 0 0 chr1:58954-59871+
Parameters:pos (int) – Position of the base in the sequence (0 to self.dataLen).
Return type:Boolean, True if the base is 4-fold degenerate.
Raises:SequenceDataError, if format of description is invalid.
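
To make the term concrete: a site is 4-fold degenerate when any of the four bases at that site encodes the same amino acid. The sketch below tests this for the third position of a single codon using the standard genetic code. It is an illustration only; is_fourfold_degenerate is a hypothetical helper, and locating the codon from pos and the inFrame is left out.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon):
    """Return True if the third base of codon is a 4-fold degenerate site."""
    codon = codon.upper()
    if codon not in CODON_TABLE:
        raise ValueError("Not a valid codon: " + repr(codon))
    # Collect the amino acids obtained by varying the third base.
    encoded = {CODON_TABLE[codon[:2] + base] for base in BASES}
    return len(encoded) == 1


glycine = is_fourfold_degenerate("GGA")
methionine = is_fourfold_degenerate("ATG")
```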
print_data(fo=sys.stdout)[source]

Print the sequence data.

Parameters:fo (fileObject) – Print to file object fo. Defaults to stdout.
print_fa_entry(maxB=None, fo=sys.stdout)[source]

Print a fasta file entry with header and sequence data.

Parameters:maxB (int) – Print a maximum of maxB bases. Default: print all bases.
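
A minimal sketch of writing such an entry to an arbitrary file object is shown below. The function write_fa_entry is hypothetical, and the header layout (">" followed by name and description, separated by a space) is an assumption rather than the documented libPoMo output.

```python
import io
import sys


def write_fa_entry(name, descr, data, max_b=None, fo=sys.stdout):
    """Write a FASTA header line and at most max_b bases to fo."""
    # Assumed header layout: ">name description".
    print(">" + name, descr, file=fo)
    shown = data if max_b is None else data[:max_b]
    print(shown, file=fo)


# Write into an in-memory buffer instead of stdout.
buf = io.StringIO()
write_fa_entry("CCDS3.1_hg18_2_19", "918 0 0 chr1:58954-59871+",
               "ATGGCCTTA", max_b=6, fo=buf)
```

Any object with a write() method, such as an open file, can be passed as fo.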
print_fa_header(fo=sys.stdout)[source]

Print the sequence header line in fasta format.

Parameters:fo (fileObject) – Print to file object fo. Defaults to stdout.
print_info(maxB=50)[source]

Print sequence information.

Print sequence name, description, the length of the sequence and a maximum of maxB bases (defaults to 50).

purge()[source]

Purge data saved in this sequence.

rev_comp(change_sequence_only=False)[source]

Reverses and complements the sequence.

This is rather slow for long sequences.
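
For reference, one common way to reverse-complement a string in Python is sketched below. This is not necessarily how libPoMo does it; the helper reverse_complement is hypothetical and leaves characters outside A, C, G, T (for example N or gap symbols) unchanged.

```python
# Translation table for the four bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(data):
    """Return the reverse complement of the sequence string data."""
    # Complement every base, then reverse the string.
    return data.translate(_COMPLEMENT)[::-1]


rc_seq = reverse_complement("AACG")
```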

set_rc()[source]

Set self.rc.

The instance variable self.rc is a Boolean value that is true if the saved sequence is reversed and complemented. This function sets this value according to the last character in the sequence description.

Raises:ValueError, if state could not be detected.
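
The detection step can be sketched as follows. The function detect_rc is hypothetical; it assumes that a trailing "-" marks a reverse-complemented sequence and a trailing "+" marks the forward strand.

```python
def detect_rc(descr):
    """Return True if descr ends with '-', False if it ends with '+'."""
    last = descr.strip()[-1:]  # empty string if descr is empty
    if last == "+":
        return False
    if last == "-":
        # Assumption: "-" means the data is reversed and complemented.
        return True
    raise ValueError("Could not detect strand in: " + repr(descr))


forward = detect_rc("918 0 0 chr1:58954-59871+")
```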
toggle_rc()[source]

Toggle the state of self.rc.

exception libPoMo.seqbase.SequenceDataError[source]

General sequence data error exception.

libPoMo.seqbase.gz_open(fn, mode='r')[source]

Open file with io.open() or gzip.open().

Parameters:
  • fn (str) – Name of the file to open.
  • mode (str) – Mode 'r' | 'w'.
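
A wrapper of this kind can be sketched as below. The function gz_open_sketch is hypothetical, and choosing the opener by the ".gz" filename ending is an assumption, not a statement about how libPoMo decides.

```python
import gzip
import io
import os
import tempfile


def gz_open_sketch(fn, mode="r"):
    """Open fn in text mode, transparently handling gzip files."""
    if fn.endswith(".gz"):
        # gzip.open() defaults to binary, so request text mode explicitly.
        return gzip.open(fn, mode + "t")
    return io.open(fn, mode)


# Round trip: write and read back a plain and a gzipped file.
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("seq.fa", "seq.fa.gz"):
        path = os.path.join(tmp, fname)
        with gz_open_sketch(path, "w") as fo:
            fo.write(">seq1\nACGT\n")
        with gz_open_sketch(path) as fi:
            results[fname] = fi.read()
```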
libPoMo.seqbase.stripFName(fn)[source]

Convenience function to strip the ".xyz" ending off a filename.
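
A comparable result can be obtained with the standard library, as in the sketch below. The helper strip_ending is hypothetical; it removes only the last ending and keeps any directory part, which may differ from what stripFName does.

```python
import os.path


def strip_ending(fn):
    """Return fn without its last '.xyz' ending."""
    root, _ending = os.path.splitext(fn)
    return root


stripped = strip_ending("alignment.fasta")
```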