libPoMo.fasta
This module provides functions to read, write and access fasta files.
Objects
- Classes:
  - FaStream, fasta file sequence stream object
  - MFaStream, multiple alignment fasta file sequence stream object
  - FaSeq, fasta file sequence object
  - MFaStrFilterProps, define multiple fasta file filter preferences
- Exception Classes:
  - NotAFastaFileError, exception raised if the given fasta file is not valid
- Functions:
  - filter_mfa_str(), filter a given MFaStream according to the filters defined in MFaStrFilterProps
  - init_seq(), initialize fasta sequence stream from file
  - open_seq(), open fasta file
  - save_as_vcf(), save a given FaSeq in variant call format (VCF)
  - read_seq_from_fo(), read a single sequence from file object
  - read_align_from_fo(), read an alignment from file object
class libPoMo.fasta.FaSeq
Store sequence data retrieved from a fasta file.
Variables:
- name (str) – Name of the FaSeq object.
- seqL ([Seq]) – List of Seq objects that store the actual sequence data.
- nSpecies (int) – Number of saved species / individuals / chromosomes.
class libPoMo.fasta.FaStream(name, firstSeq, nextHL, faFileObject)
A class that stores a fasta file sequence stream.
The sequence of one species / individual / chromosome is saved, and functions are provided to read in the next sequence in the file, if there is any. This saves memory if files are huge and doesn’t increase runtime.
This object is usually initialized with init_seq().
Parameters:
Variables:
class libPoMo.fasta.MFaStrFilterProps(nSpecies)
Define filter preferences for multiple fasta alignments.
Define the properties of the filter to be applied to an MFaStream.
By default, all filters are applied (all variables are set to True).
Parameters: nSpecies (int) – Number of species that are aligned.
Variables:
- check_all_aligned (Boolean) – Check if all treated species are available in the alignment (nSpecies, given to the object upon initialization, is the number of species).
- check_divergence (Boolean) – Check if the divergence of the reference genome (the first sequence in the alignment) is lower than maxDiv (defaults to 10 percent).
- check_start_codons (Boolean) – Check if all start codons are conserved.
- check_stop_codons (Boolean) – Check if all stop codons are conserved.
- check_frame_shifting_gaps (Boolean) – Check that there are no frame-shifting gaps.
- check_for_long_gaps (Boolean) – Check that no gap is longer than maxGapLength (defaults to 30) bases.
- check_nonsense_codon (Boolean) – Check that there is no premature stop codon.
- check_exon_length (Boolean) – Check that the exon is longer than minExonLen (defaults to 21).
- check_exon_numbers (Boolean) – Check if the exon numbers match for all sequences in the alignment.
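To make a few of these checks concrete, here is a minimal sketch of how gap and divergence tests could be written for plain strings. It is not libPoMo's implementation: the helper names, the use of `-` as the gap character, and the way divergence is computed (mismatches over gap-free columns) are all assumptions made for illustration.

```python
import re


def has_long_gap(seq, max_gap_length=30):
    """Return True if seq contains a run of '-' longer than max_gap_length."""
    longest = max((len(m.group()) for m in re.finditer(r"-+", seq)), default=0)
    return longest > max_gap_length


def has_frame_shifting_gap(seq):
    """Return True if any gap run has a length that is not a multiple of 3."""
    return any(len(m.group()) % 3 != 0 for m in re.finditer(r"-+", seq))


def divergence(ref, other):
    """Fraction of mismatching columns; columns with a gap in either sequence are ignored."""
    pairs = [(a, b) for a, b in zip(ref.upper(), other.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)
```

A filter in this style would reject an alignment as soon as one of the enabled checks fails, which matches the Boolean result described for filter_mfa_str().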
class libPoMo.fasta.MFaStream(faFileName, maxskip=50, name=None)
Store a multiple alignment fasta file sequence stream.
The sequences of one gene / alignment are saved for all species / individuals / chromosomes. Functions are provided to read in the next gene / alignment in the file that fulfills the given criteria, if there is any. This saves memory if files are huge and doesn’t increase runtime.
Initialization of an MFaStream opens the given fasta file, checks if it is in fasta format and reads the first alignment. The end of an alignment is reached when a line only contains the newline character. This object can later be used to parse the whole multiple alignment fasta file.
Alignments can be filtered with filter_mfa_str().
Parameters:
- faFileName (str) – File name of the multiple alignment fasta file.
- maxskip (int) – Only look maxskip lines for the start of a sequence (defaults to 50).
- name (str) – Set the name of the stream to name, otherwise set it to the stripped filename.
Variables:
- name (str) – Stream name.
- seqL ([Seq]) – Saved sequences (Seq objects) in a list.
- nSpecies (int) – Number of saved sequences / species in the alignment.
- nextHeaderLine (str) – Next header line.
- fo (fo) – File object that points to the start of the data of the next sequence.
Please close the associated file object with FaStream.close() when you don’t need it anymore.
orient(firstOnly=False)
Orient all sequences of the alignment to be in forward direction.
This is rather slow for long sequences.
Parameters: firstOnly (Boolean) – If true, orient the first sequence only.
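As a rough illustration of what orienting a sequence can involve, the sketch below reverse-complements every record that is flagged as reversed. The record layout (a `(sequence, is_reversed)` tuple) and both function names are hypothetical; the documentation above does not say how libPoMo stores or detects the orientation.

```python
# Translation table for DNA bases; gap characters are left untouched.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_all(records, first_only=False):
    """Orient (sequence, is_reversed) tuples to the forward direction.

    With first_only=True only the first record is touched.
    """
    oriented = []
    for i, (seq, is_reversed) in enumerate(records):
        if is_reversed and (i == 0 or not first_only):
            seq, is_reversed = reverse_complement(seq), False
        oriented.append((seq, is_reversed))
    return oriented
```

Every oriented sequence is copied in full, so the cost grows with sequence length, in line with the note that orienting is slow for long sequences.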
print_info(maxB=50)
Print sequence information.
Print information about this MFaStream object: the fasta sequence stored at the moment, the length of the sequence, and a maximum of maxB bases (defaults to 50).
exception libPoMo.fasta.NotAFastaFileError
Exception raised if the given fasta file is not valid.
libPoMo.fasta.filter_mfa_str(mfaStr, fp, verb=None)
Check multiple sequence alignment of an MFaStream.
Multiple sequence alignment files usually include alignments that are not suitable for analysis. These low-quality alignments need to be filtered out of the original multiple sequence alignment fasta file. If verb is set to anything other than None, information about any rejection is printed to the standard output.
Parameters:
- mfaStr (MFaStream) – MFaStream object to check.
- fp (MFaStrFilterProps) – Properties of the filter to be applied.
- verb (Boolean) – Verbosity.
Return type: Boolean, True if all filters have been passed.
libPoMo.fasta.init_seq(faFileName, maxskip=50, name=None)
Open a fasta file and initialize an FaStream.
This function tries to open the given fasta file, checks if it is in fasta format and reads the first sequence. It returns an FaStream object. This object can later be used to parse the whole fasta file.
Please close the associated file object with FaStream.close() when you don’t need it anymore.
Parameters:
- faFileName (str) – File name of the fasta file.
- maxskip (int) – Only look maxskip lines for the start of a sequence (defaults to 50).
- name (str) – Set the name of the sequence to name, otherwise set it to the stripped filename.
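The role of maxskip can be pictured with a small stand-alone sketch: scan at most maxskip lines for a header line starting with `>` and give up otherwise. The names `find_first_header` and `NotFastaError` are hypothetical stand-ins, and treating "starts with `>`" as the format check is an assumption rather than the library's documented behavior.

```python
import io


class NotFastaError(Exception):
    """Hypothetical stand-in for libPoMo.fasta.NotAFastaFileError."""


def find_first_header(fo, maxskip=50):
    """Return the first header line found within maxskip lines of fo.

    Raises NotFastaError if no line starting with '>' is seen in time.
    """
    for _ in range(maxskip):
        line = fo.readline()
        if line == "":          # end of file reached
            break
        if line.startswith(">"):
            return line.rstrip("\n")
    raise NotFastaError("no header line found within %d lines" % maxskip)


# Usage on an in-memory file with two leading blank lines.
fo = io.StringIO("\n\n>seq1 desc\nACGT\n")
header = find_first_header(fo)
```

After the call, the file object is positioned at the first data line of the sequence, which is the state a stream object needs in order to continue reading.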
libPoMo.fasta.open_seq(faFileName, maxskip=50, name=None)
Open and read a fasta file.
This function tries to open the given fasta file, checks if it is in fasta format and reads the sequence(s). It returns an FaSeq object that contains a list of species names, a list of the respective descriptions and a list with the sequences.
Parameters:
- faFileName (str) – Name of the fasta file.
- maxskip (int) – Only look maxskip lines for the start of a sequence (defaults to 50).
- name (str) – Set the name of the sequence to name, otherwise set it to the stripped filename.
libPoMo.fasta.read_align_from_fo(line, fo)
Read a single fasta alignment.
Read a single fasta alignment from file object fo and save it to new Seq sequence objects. Return the header line of the next fasta alignment and the newly created sequences in a list. If no new alignment is found, the next header line will be set to None.
Parameters:
- line (str) – Header line of the sequence.
- fo (fo) – File object of the fasta file.
Return type: (str, [Seq])
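The calling convention described above (pass in the current header line, get back the next header line plus the sequences) can be sketched with plain tuples in place of Seq objects. The function name `read_alignment` and the `(name, sequence)` tuples are assumptions for illustration; the blank-line rule for the end of an alignment is the one stated for MFaStream.

```python
import io


def read_alignment(header, fo):
    """Return (next_header, [(name, sequence), ...]) for one alignment block."""
    seqs = []
    name, chunks = header[1:].strip(), []
    for line in fo:
        if line == "\n":              # a blank line closes the alignment
            break
        if line.startswith(">"):      # next sequence of the same alignment
            seqs.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    seqs.append((name, "".join(chunks)))
    # Advance to the header of the next alignment, if there is one.
    next_header = None
    for line in fo:
        if line.startswith(">"):
            next_header = line.rstrip("\n")
            break
    return next_header, seqs


# Usage: two alignments separated by a blank line.
data = ">hg\nACGT\nAC\n>pan\nACGA\nAC\n\n>hg2\nTTTT\n>pan2\nTTTA\n"
fo = io.StringIO(data)
first_header = fo.readline().rstrip("\n")
next_header, alignment = read_alignment(first_header, fo)
```

Feeding `next_header` back into the function reads the following alignment, and a returned `None` signals that the file is exhausted.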
libPoMo.fasta.read_seq_from_fo(line, fo, getAlignEndFlag=False)
Read a single fasta sequence.
Read a single fasta sequence from file object fo and save it to a new Seq sequence object. Return the header line of the next fasta sequence and the newly created sequence. If no new sequence is found, the next header line will be set to None.
Parameters:
- line (str) – Header line of the sequence.
- fo (fo) – File object of the fasta file.
- getAlignEndFlag (Boolean) – If set to true, an additional Boolean value is returned that specifies whether a multiple sequence alignment ends.
Return type: (str, Seq) | (str, Seq, Boolean)
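The two possible return shapes can be illustrated with a small sketch that uses a `(name, sequence)` tuple instead of a Seq object. The function name `read_one_sequence` is hypothetical, and the rule used for the flag (a blank line was seen after the sequence data) is an assumption based on the alignment format described for MFaStream.

```python
import io


def read_one_sequence(header, fo, get_align_end_flag=False):
    """Read one sequence; return (next_header, record[, align_end])."""
    name, chunks = header[1:].strip(), []
    next_header, align_end = None, False
    for line in fo:
        if line.startswith(">"):      # header of the next sequence
            next_header = line.rstrip("\n")
            break
        if line.strip() == "":        # assumed marker for the end of an alignment
            align_end = True
            continue
        chunks.append(line.strip())
    record = (name, "".join(chunks))
    if get_align_end_flag:
        return next_header, record, align_end
    return next_header, record


# Usage: sequence "b" is followed by a blank line.
fo = io.StringIO(">a x\nAC\nGT\n>b\nTT\n\n>c\nGG\n")
header = fo.readline().rstrip("\n")
next_header, record = read_one_sequence(header, fo)
```

Returning the next header line lets the caller loop over a file without seeking backwards: each call consumes exactly one header that belongs to the following sequence and hands it back.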
libPoMo.fasta.save_as_vcf(faSeq, ref, VCFFileName)
Save the given FaSeq in VCF format.
In general, we want to convert a fasta file with various individuals, with the help of a reference that contains one sequence, to a VCF file that contains all the SNPs. This can be done with this function. So far, it is not possible to do this conversion for several chromosomes for each individual in one run; however, the conversion can be done chromosome by chromosome.
This function saves the SNPs of faSeq, a given FaSeq (fasta sequence) object, in VCF format to the file VCFFileName. The reference genome ref, to which faSeq is compared, needs to be passed as a Seq object.
The function compares all sequences in faSeq to the sequence given in ref. The names of the individuals in the saved VCF file will be the sequence names of the faSeq object.
The VCF columns are filled as follows:
- #CHROM = sequence name of the reference
- POS = position relative to reference
- ID = .
- REF = base of reference
- ALT = SNP (e.g. 'C' or 'G,T' if 2 different SNPs are present)
- QUAL = .
- FILTER = .
- INFO = .
- FORMAT = GT
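The column layout above can be illustrated with a short sketch that compares aligned sample strings to a reference string and emits one tab-separated data line per variable site. This is not the library's code: the function name `snp_rows`, the haploid genotype encoding (0 for the reference base, 1, 2, ... for the ALT alleles in order), and the lack of any special handling for gaps or N are assumptions.

```python
def snp_rows(chrom, ref, samples):
    """Yield VCF-like data lines for sites where any sample differs from ref.

    samples is a list of sequences, each as long as ref.
    """
    for pos, ref_base in enumerate(ref.upper(), start=1):
        bases = [s[pos - 1].upper() for s in samples]
        alts = []
        for b in bases:
            if b != ref_base and b not in alts:
                alts.append(b)
        if not alts:
            continue                      # site is not variable
        # Genotype index: 0 = reference allele, 1.. = position in the ALT list.
        gts = ["0" if b == ref_base else str(alts.index(b) + 1) for b in bases]
        yield "\t".join([chrom, str(pos), ".", ref_base, ",".join(alts),
                         ".", ".", ".", "GT"] + gts)


# Usage: three individuals compared to a four-base reference.
rows = list(snp_rows("chr1", "ACGT", ["ACGT", "ATGT", "AGGA"]))
```

Positions are 1-based, as is usual for VCF, and a site with two different non-reference bases produces a comma-separated ALT field such as `T,G`.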
Parameters: