libPoMo.fasta

This module provides functions to read, write and access fasta files.

Objects

Classes:
  • FaStream, fasta file sequence stream object
  • MFaStream, multiple alignment fasta file sequence stream object
  • FaSeq, fasta file sequence object
  • MFaStrFilterProps, define multiple fasta file filter preferences
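Exception Classes:
  • NotAFastaFileError, raised if the given fasta file is not valid
Functions:
  • filter_mfa_str(), check a multiple sequence alignment of an MFaStream
  • init_seq(), open a fasta file and initialize an FaStream
  • open_seq(), open and read a fasta file
  • read_align_from_fo(), read a single fasta alignment
  • read_seq_from_fo(), read a single fasta sequence
  • save_as_vcf(), save a given FaSeq in VCF format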

class libPoMo.fasta.FaSeq[source]

Store sequence data retrieved from a fasta file.

Variables:
  • name (str) – Name of the FaSeq object.
  • seqL ([Seq]) – List of Seq objects that store the actual sequence data.
  • nSpecies (int) – Number of saved species / individuals / chromosomes.
get_distance()[source]

Number of segregating bases.
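To illustrate what counting segregating bases means, here is a minimal standalone sketch that counts the alignment columns in which the sequences are not all identical. It does not use libPoMo; the name count_segregating_sites is hypothetical, and how get_distance() treats gaps or ambiguous bases is not stated here, so the sketch simply compares characters.

```python
def count_segregating_sites(seqs):
    """Count columns where the aligned sequences are not all identical."""
    if not seqs:
        return 0
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) yields one tuple per alignment column.
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


# Columns 3 and 5 differ between the three sequences.
example = ["ACGTAC", "ACGTTC", "ACCTAC"]
n_seg = count_segregating_sites(example)
```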

get_seq_base(seq, pos)[source]

Return base at 1-based position pos in sequence with name seq.
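Because pos is 1-based while Python strings are 0-based, the lookup needs an offset of one. The following sketch shows the idea on a plain string; base_at is a hypothetical helper and not part of libPoMo.

```python
def base_at(sequence, pos):
    """Return the base at 1-based position pos of a plain string."""
    if pos < 1 or pos > len(sequence):
        raise IndexError("position out of range")
    # Position 1 corresponds to index 0.
    return sequence[pos - 1]


first = base_at("ACGT", 1)
last = base_at("ACGT", 4)
```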

get_seq_by_id(i)[source]

Return sequence number i as Seq object.

get_seq_names()[source]

Return a list with sequence names.

print_info(maxB=50)[source]

Print fasta sequence information.

Print fasta sequence identifier, species names, the length of the sequence and a maximum of maxB bases (defaults to 50).

class libPoMo.fasta.FaStream(name, firstSeq, nextHL, faFileObject)[source]

A class that stores a fasta file sequence stream.

The sequence of one species / individual / chromosome is saved and functions are provided to read in the next sequence in the file, if there is any. This saves memory if files are huge and doesn’t increase runtime.

This object is usually initialized with init_seq().

Parameters:
  • name (str) – Name of the stream.
  • firstSeq (Seq) – First sequence (Seq object) to be saved.
  • nextHL (str) – Next header line.
  • faFileObject (fo) – File object associated with the stream.
Variables:
  • name (str) – Stream name.
  • seq (Seq) – Saved sequence (Seq object)
  • nextHeaderLine (str) – Next header line.
  • fo (fo) – File object that points to the start of the data of the next sequence.
close()[source]

Close the linked file.

print_info(maxB=50)[source]

Print sequence information.

Print information about this FaStream object, the fasta sequence currently stored, the length of the sequence and a maximum of maxB bases (defaults to 50).

read_next_seq()[source]

Read next fasta sequence in file.

The return value is the name of the next sequence or None if no next sequence is found.
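The streaming idea behind FaStream, holding only one sequence in memory at a time, can be sketched with a plain Python generator. This is not libPoMo's implementation; iter_fasta is a hypothetical name, and the sketch assumes a well-formed file in which every header line starts with ">".

```python
import io


def iter_fasta(fo):
    """Yield (header, sequence) pairs one at a time from an open text file."""
    header = None
    chunks = []
    for line in fo:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header ends the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


handle = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nGGTT\n")
records = list(iter_fasta(handle))
```

Iterating over the generator instead of calling list() keeps only the current record in memory.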

class libPoMo.fasta.MFaStrFilterProps(nSpecies)[source]

Define filter preferences for multiple fasta alignments.

Define the properties of the filter to be applied to an MFaStream.

By default, all filters are applied (all variables are set to True).

Parameters:

nSpecies (int) – Number of species that are aligned.

Variables:
  • check_all_aligned (Boolean) – Check if all treated species are available in the alignment (nSpecies gives the number of species, given to the object upon initialization).
  • check_divergence (Boolean) – Check if the divergence of the reference genome (the first sequence in the alignment) is lower than maxDiv (defaults to 10 percent).
  • check_start_codons (Boolean) – Check if all start codons are conserved.
  • check_stop_codons (Boolean) – Check if all stop codons are conserved.
  • check_frame_shifting_gaps (Boolean) – Check that there are no frame-shifting gaps.
  • check_for_long_gaps (Boolean) – Check that no gap is longer than maxGapLength (defaults to 30) bases.
  • check_nonsense_codon (Boolean) – Check that there is no premature stop codon.
  • check_exon_length (Boolean) – Check that the exon is longer than minExonLen (defaults to 21).
  • check_exon_numbers (Boolean) – Check that the exon numbers match for all sequences in the alignment.
class libPoMo.fasta.MFaStream(faFileName, maxskip=50, name=None)[source]

Store a multiple alignment fasta file sequence stream.

The sequences of one gene / alignment are saved for all species / individuals / chromosomes. Functions are provided to read in the next gene / alignment in the file that fulfills the given criteria, if there is any. This saves memory if files are huge and doesn’t increase runtime.

Initialization of an MFaStream opens the given fasta file, checks if it is in fasta format and reads the first alignment. The end of an alignment is reached when a line only contains the newline character. This object can later be used to parse the whole multiple alignment fasta file.

Alignments can be filtered with filter_mfa_str().

Parameters:
  • faFileName (str) – File name of the multiple alignment fasta file.
  • maxskip (int) – Search at most maxskip lines for the start of a sequence (defaults to 50).
  • name (str) – Set the name of the stream to name, otherwise set it to the stripped filename.
Variables:
  • name (str) – Stream name.
  • seqL ([Seq]) – Saved sequences (Seq objects) in a list.
  • nSpecies (int) – Number of saved sequences / species in the alignment.
  • nextHeaderLine (str) – Next header line.
  • fo (fo) – File object that points to the start of the data of the next sequence.

Please close the associated file object with MFaStream.close() when you don’t need it anymore.

close()[source]

Close the linked file object.

orient(firstOnly=False)[source]

Orient all sequences of the alignment to be in forward direction.

This is rather slow for long sequences.

Parameters:
  • firstOnly (Boolean) – If true, orient the first sequence only.
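Bringing a reverse-strand sequence into forward direction usually means replacing it with its reverse complement. The sketch below shows that single step on a plain string. How orient() detects the strand of each sequence is not described here, so the sketch assumes the caller already knows the sequence is on the reverse strand; reverse_complement is a hypothetical helper.

```python
# Translation table for the four standard bases; other characters
# (gaps, N) are left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]


forward = reverse_complement("ATGC")
with_gap = reverse_complement("AC-N")
```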
print_info(maxB=50)[source]

Print sequence information.

Print information about this MFaStream object, the fasta sequences currently stored, the length of the sequence and a maximum of maxB bases (defaults to 50).

print_msa(fo=sys.stdout)[source]

Print multiple sequence alignment at point.

Parameters:
  • fo (fileObject) – Print to file object fo. Defaults to stdout.
read_next_align()[source]

Read next alignment in fasta file.

The return value is the name of the newly saved alignment or None if no next alignment is found.
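The MFaStream description states that an alignment ends at a line that contains only the newline character. A minimal standalone sketch of reading a file alignment by alignment under that rule could look as follows. It is not libPoMo's code; iter_alignments is a hypothetical name, and headers are kept as plain strings rather than parsed into Seq objects.

```python
import io


def iter_alignments(fo):
    """Yield each alignment as a list of (header, sequence) pairs."""
    block, header, chunks = [], None, []
    for line in fo:
        stripped = line.strip()
        if stripped.startswith(">") or not stripped:
            # A header or a blank line ends the record being read.
            if header is not None:
                block.append((header, "".join(chunks)))
                header, chunks = None, []
            if stripped:
                header = stripped[1:]
            elif block:
                # Blank line: the current alignment is complete.
                yield block
                block = []
        elif header is not None:
            chunks.append(stripped)
    # Flush whatever is left at the end of the file.
    if header is not None:
        block.append((header, "".join(chunks)))
    if block:
        yield block


handle = io.StringIO(">a\nAC\n>b\nAG\n\n>a\nTT\n>b\nTA\n")
alignments = list(iter_alignments(handle))
```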

exception libPoMo.fasta.NotAFastaFileError[source]

Exception raised if given fasta file is not valid.

libPoMo.fasta.filter_mfa_str(mfaStr, fp, verb=None)[source]

Check multiple sequence alignment of an MFaStream.

Multiple sequence alignment files usually include alignments that are not suitable for analysis. These low-quality alignments need to be filtered out of the original multiple sequence alignment fasta file. If verb is set to anything other than None, information about any rejection is printed to the standard output.

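Parameters:
  • mfaStr (MFaStream) – MFaStream object whose current alignment is checked.
  • fp (MFaStrFilterProps) – Filter preferences to apply.
  • verb – Verbosity; if not None, rejections are reported on the standard output.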
Return type:

Boolean, True if all filters have been passed.
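Two of the checks listed for MFaStrFilterProps, the maximum gap length and the premature stop codon, can be sketched on plain strings as below. This is an illustration under stated assumptions and not the logic of filter_mfa_str(): the helper names are hypothetical, gaps are assumed to be written as "-", and gaps are removed before the sequence is split into codons.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_gap(sequence):
    """Length of the longest run of '-' characters."""
    return max((len(run) for run in re.findall(r"-+", sequence)), default=0)


def has_premature_stop(sequence):
    """True if an in-frame stop codon occurs before the final codon."""
    bases = sequence.replace("-", "").upper()
    # Split into complete codons; a trailing partial codon is dropped.
    codons = [bases[i:i + 3] for i in range(0, len(bases) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def passes_filters(seqs, max_gap_length=30):
    """True if every sequence passes both checks."""
    return all(
        longest_gap(s) <= max_gap_length and not has_premature_stop(s)
        for s in seqs
    )


ok = passes_filters(["ATGAAATAA", "ATG---TAA"])
too_strict = passes_filters(["ATGAAATAA", "ATG---TAA"], max_gap_length=2)
```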

libPoMo.fasta.init_seq(faFileName, maxskip=50, name=None)[source]

Open a fasta file and initialize an FaStream.

This function tries to open the given fasta file, checks if it is in fasta format and reads the first sequence. It returns an FaStream object. This object can later be used to parse the whole fasta file.

Please close the associated file object with FaStream.close() when you don’t need it anymore.

Parameters:
  • faFileName (str) – File name of the fasta file.
  • maxskip (int) – Search at most maxskip lines for the start of a sequence (defaults to 50).
  • name (str) – Set the name of the sequence to name, otherwise set it to the stripped filename.
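The role of maxskip can be illustrated with a small standalone sketch that searches a limited number of lines for the first header and gives up otherwise. The exact way libPoMo counts skipped lines is an assumption here; find_first_header and NotFastaError are hypothetical names, the latter standing in for NotAFastaFileError.

```python
import io


class NotFastaError(Exception):
    """Hypothetical stand-in for libPoMo.fasta.NotAFastaFileError."""


def find_first_header(fo, maxskip=50):
    """Return the first header line found within maxskip lines of fo."""
    for _ in range(maxskip):
        line = fo.readline()
        if line == "":
            break  # end of file reached
        if line.startswith(">"):
            return line.rstrip("\n")
    raise NotFastaError("no fasta header within the first %d lines" % maxskip)


header = find_first_header(io.StringIO("\n\n>s1\nACGT\n"))
```

With maxskip=2 the same input raises NotFastaError, because the header sits on the third line.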
libPoMo.fasta.open_seq(faFileName, maxskip=50, name=None)[source]

Open and read a fasta file.

This function tries to open the given fasta file, checks if it is in fasta format and reads the sequence(s). It returns an FaSeq object that contains a list of species names, a list of the respective descriptions and a list with the sequences.

Parameters:
  • faFileName (str) – Name of the fasta file.
  • maxskip (int) – Search at most maxskip lines for the start of a sequence (defaults to 50).
  • name (str) – Set the name of the sequence to name, otherwise set it to the stripped filename.
libPoMo.fasta.read_align_from_fo(line, fo)[source]

Read a single fasta alignment.

Read a single fasta alignment from file object fo and save it to new Seq sequence objects. Return the header line of the next fasta alignment and the newly created sequences in a list. If no new alignment is found, the next header line will be set to None.

Parameters:
  • line (str) – Header line of the sequence.
  • fo (fo) – File object of the fasta file.
Return type:

(str, [Seq])

libPoMo.fasta.read_seq_from_fo(line, fo, getAlignEndFlag=False)[source]

Read a single fasta sequence.

Read a single fasta sequence from file object fo and save it to a new Seq sequence object. Return the header line of the next fasta sequence and the newly created sequence. If no new sequence is found, the next header line will be set to None.

Parameters:
  • line (str) – Header line of the sequence.
  • fo (fo) – File object of the fasta file.
  • getAlignEndFlag (Boolean) – If set to True, an additional Boolean value is returned that indicates whether a multiple sequence alignment ends.
Return type:

(str, Seq) | (str, Seq, Boolean)

libPoMo.fasta.save_as_vcf(faSeq, ref, VCFFileName)[source]

Save the given FaSeq in VCF format.

In general, we want to convert a fasta file that contains various individuals into a VCF file that contains all the SNPs, with the help of a reference that contains one sequence. This can be done with this function. Currently it is not possible to do this conversion for several chromosomes per individual in one run. Still, the conversion can be done chromosome by chromosome.

This function saves the SNPs of faSeq, a given FaSeq (fasta sequence) object, in VCF format to the file VCFFileName. The reference genome ref, to which faSeq is compared, needs to be passed as a Seq object.

The function compares all sequences in faSeq to the sequence given in ref. The names of the individuals in the saved VCF file will be the sequence names of the faSeq object.

#CHROM = sequence name of the reference
POS    = position relative to reference
ID     = .
REF    = base of reference
ALT    = SNP (e.g. 'C' or 'G,T' if 2 different SNPs are present)
QUAL   = .
FILTER = .
INFO   = .
FORMAT = GT
Parameters:
  • faSeq (FaSeq) – FaSeq object to be converted.
  • ref (Seq) – Seq object of the reference sequence.
  • VCFFileName (str) – Name of the VCF output file.
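The column layout listed above can be illustrated with a standalone sketch that compares plain strings to a reference and emits one VCF data line per variable position. It is not the code of save_as_vcf(): vcf_lines is a hypothetical name, the sequences are assumed to be aligned to the reference and of equal length, and the haploid genotype encoding (0 for the reference base, 1, 2, ... for the alternative bases) is an assumption.

```python
def vcf_lines(chrom, ref, samples):
    """Yield one tab-separated VCF data line per position with a SNP.

    chrom   -- name of the reference sequence
    ref     -- reference sequence as a string
    samples -- list of sample sequences aligned to ref (same length)
    """
    for pos, ref_base in enumerate(ref, start=1):
        alts = []        # alternative bases seen at this position
        genotypes = []   # one haploid genotype per sample
        for seq in samples:
            base = seq[pos - 1]
            if base == ref_base:
                genotypes.append("0")
            else:
                if base not in alts:
                    alts.append(base)
                genotypes.append(str(alts.index(base) + 1))
        if alts:
            # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, samples
            fields = [chrom, str(pos), ".", ref_base, ",".join(alts),
                      ".", ".", ".", "GT"] + genotypes
            yield "\t".join(fields)


lines = list(vcf_lines("chr1", "ACGT", ["ACGT", "AGGT", "ATGA"]))
```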