libPoMo.cf
This module provides functions to read, write and access files that are in counts format.
The Counts Format
This file format is used by PoMo and lists the base counts for every position.
- It contains:
  - 1 line that specifies the file as a counts file and states the number of populations as well as the number of sites
  - 1 header line with tab-separated sequence names
  - N lines with counts of A, C, G and T bases at position n
- It can contain:
  - any number of lines that start with a #; these are treated as comments. No comments are allowed after the header line.
COUNTSFILE NPOP 5 NSITES N
CHROM POS Sheep BlackSheep RedSheep Wolf RedWolf
1 s 0,0,1,0 0,0,1,0 0,0,1,0 0,0,5,0 0,0,0,1
1 s + 1 0,0,0,1 0,0,0,1 0,0,0,1 0,0,0,5 0,0,0,1
.
.
.
9 8373 0,0,0,1 1,0,0,0 0,1,0,0 0,1,4,0 0,0,1,0
.
.
.
Y end 0,0,0,1 0,1,0,0 0,1,0,0 0,5,0,0 0,0,1,0
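As a rough illustration of the layout described above, the following sketch parses counts-format text with the standard library only. It does not use libPoMo; the function name read_counts and the sample data are hypothetical, and the sketch assumes whitespace-separated fields and population names without spaces.

```python
import io


def read_counts(handle):
    """Parse counts-format text; return (npop, nsites, names, rows)."""
    npop = nsites = None
    names = []
    rows = []
    for line in handle:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if fields[0] == "COUNTSFILE":
            # First line: COUNTSFILE NPOP <n> NSITES <m>
            npop, nsites = int(fields[2]), int(fields[4])
        elif fields[0] == "CHROM":
            # Header line: CHROM POS <name_1> ... <name_n>
            names = fields[2:]
        else:
            # Data line: chromosome, position, then A,C,G,T counts per population
            counts = [[int(c) for c in f.split(",")] for f in fields[2:]]
            rows.append((fields[0], int(fields[1]), counts))
    return npop, nsites, names, rows


# Hypothetical two-population, two-site example (not taken from a real data set).
sample = io.StringIO(
    "# a comment line\n"
    "COUNTSFILE NPOP 2 NSITES 2\n"
    "CHROM POS Sheep Wolf\n"
    "1 100 0,0,1,0 0,0,5,0\n"
    "1 101 0,0,0,1 0,1,4,0\n"
)
npop, nsites, names, rows = read_counts(sample)
```

Note that this sketch is more lenient than the format itself: it skips comment lines wherever they occur, whereas the format only allows them before the header line.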
Convert to Counts Format
To convert a fasta reference file with SNP information from a variant call format (VCF) file to counts format, use the CFWriter. If you want to convert a multiple alignment fasta file, use the CFWriter together with the convenience function write_cf_from_MFaStream().
Tabix index files need to be provided for all VCF files. They can be created from the terminal with $(tabix -p vcf "vcf-file.vcf.gz") if tabix is installed.
A code example is:
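import import_libPoMo
import libPoMo.fasta as fa
import libPoMo.cf as cf

# VCF files with the SNP information (each needs a tabix index file).
vcfFL = ["/path/to/vcf/file1", "/path/to/vcf/file2", "..."]
cfw = cf.CFWriter(vcfFL, "name-of-outfile")
# Stream over the fasta reference.
mFaStr = fa.MFaStream("/path/to/fasta/reference")

cfw.write_HLn()  # write the header line
cf.write_cf_from_MFaStream(mFaStr, cfw)  # write the count lines
cfw.close()  # write the first line of the file and close the file objects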
Objects
- Classes:
- Exception Classes:
- Functions:
  - interpret_cf_line(), get data of a line in counts format
  - faseq_append_base_of_cfS(), append CFStream line to FaSeq
  - cf_to_fasta(), convert counts file to fasta file
  - write_cf_from_MFaStream(), write counts file using the given MFaStream and CFWriter
  - fasta_to_cf(), convert fasta to counts format
class libPoMo.cf.CFStream(CFFileName, name=None)
Store data of a CF file line by line.
Open a (gzipped) CF file. The file can be read line by line with read_next_pos().
Parameters:
- CFFileName (str) – Counts format file name to be read.
- name (str) – Optional; stream name, defaults to stripped filename.
Variables:
- name (str) – Stream name.
- chrom (str) – Chromosome name.
- pos (str) – Positional string.
- fo (fo) – Fileobject.
- indivL ([str]) – List of names of individuals (populations).
- countsL ([[int]]) – Numpy array of nucleotide counts.
- nIndiv (int) – Number of individuals (populations).
class libPoMo.cf.CFWriter(vcfFileNameL, outFileName, splitChar='-', mergeL=None, nameL=None, oneIndividual=False)
Write a counts format file.
Save information that is needed to write a CF file and use this information to write a CF file. Initialize with a list of vcf file names and an output file name:
CFWriter([vcfFileNames], "output")
Tabix index files need to be provided for all VCF files. They can be created from the terminal with $(tabix -p vcf "vcf-file.vcf.gz") if tabix is installed.
Before the counts file can be written, a reference sequence has to be specified. A single reference sequence can be set with set_seq().
Write a header line to output:
self.write_HLn()
Write lines in counts format from 1-based positions start to end on chromosome chrom to output:
rg = sb.Region("chrom", start, end)
self.write_Rn(rg)
If you want to compare the SNPs of the VCF files to a multiple alignment fasta stream (MFaStream), consider the convenience function write_cf_from_MFaStream().
To determine the different populations present in the VCF files, the names of the individuals are cropped at a specific character that can be set at initialization (default value: '-'). It is also possible to collapse all individuals of particular VCF files into a single population (cf. mergeL and nameL).
The ploidy has to be set manually if it differs from 2.
Additional filters can be set before the counts file is written (e.g. only write synonymous sites).
Important: Remember to close the attached file objects with close(). If the CFWriter is not closed, the counts file is not usable because the first line is missing!
Parameters:
- vcfFileNameL ([str]) – List with names of vcf files.
- outFileName (str) – Output file name.
- verb (int) – Optional; verbosity level.
- splitChar (char) – Optional; set the split character so that the individuals get sorted into the correct populations.
- mergeL ([Boolean]) – Optional; a list of truth values. If mL[i] is True, all individuals of self.vcfL[i] are treated as one population or species independent of their name. The respective counts are summed up. If self.nL[i] is given, the name of the summed sequence will be self.nL[i]. If not, the name of the first individual in vcfL[i] will be used.
- nameL ([str]) – Optional; a list of names. Cf. self.mL.
- oneIndividual (Boolean) – Optional; pick one individual out of each population.
Variables:
- refFN (str) – Name of reference fasta file.
- vcfL ([str]) – List with names of vcf files.
- outFN (str) – Output file name.
- v (int) – Verbosity.
- mL ([Boolean]) – A list of truth values. If mL[i] is True, all individuals of self.vcfL[i] are treated as one population or species independent of their name. The respective counts are summed up. If self.nL[i] is given, the name of the summed sequence will be self.nL[i]. If not, the name of the first individual in vcfL[i] will be used.
- nL ([str]) – A list of names. Cf. self.mL.
- nV (int) – Number of vcf files.
- vcfTfL ([fo]) – List with pysam.Tabixfile objects. Filled by self.__init_vcfTfL() during initialization.
- outFO (fo) – File object of the outfile. Filled by self.__init_outFO() during initialization.
- cD – List with allele or base counts. The alleles of individuals from the same population are summed up. Hence, self.cD[p] gives the base counts of population p in the form: [0, 0, 0, 0]. Population p does not need to be the one from self.vcfL[p] because several populations might be present in one vcf file. self.assM connects the individual j from self.vcfL[i] such that self.assM[i][j] is p.
- chrom (str) – Name of the current chromosome. Set and updated by write_Rn().
- pos (int) – Current position on chromosome. Set and updated by write_Rn().
- offset (int) – Value that can be set with set_offset(), if the reference sequence does not start at the 1-based position 1 but at the 1-based position offset.
- indM – Matrix with individuals from vcf files. self.indM[i] is the list of individuals found in self.vcfL[i].
- nIndL ([int]) – List with number of individuals in self.vcfL[i].
- assM – Assignment matrix that connects the individuals from the vcf files to the correct self.cD index. Cf. self.cD
- nPop (int) – Number of different populations in the counts format output file. Filled by self.__init_assM() during initialization.
- refSeq (Seq) – Seq object of the reference sequence. This has to be set with set_seq().
- ploidy (int) – Ploidy of individuals in vcf files. This has to be set manually to the correct value for non-diploids!
- splitCh (char) – Character that is used to split the individual names.
- onlySynonymous (Boolean) – Only write 4-fold degenerate sites.
- baseCounter (int) – Counts the total number of bases.
- __force (Boolean) – If set to true, skip name checks.
add_base_to_sequence(pop_id, base_char, double_fixed_sites=False)
Adds the base given in base_char to the counts of the population with id pop_id. If double_fixed_sites is true, fixed sites are counted twice. This makes sense when heterozygotes are encoded with IUPAC codes.
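The effect of double_fixed_sites can be sketched as follows. This is not the library's implementation; it is a minimal illustration under the assumption that a two-base IUPAC code adds one count to each of its two bases, so that an unambiguous base has to add two counts to keep the total per individual equal. The names IUPAC, BASES and add_base are hypothetical.

```python
# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}
BASES = "ACGT"  # order of the counts in a counts-format entry


def add_base(counts, base_char, double_fixed_sites=False):
    """Add one sequence character to a [A, C, G, T] count list (in place)."""
    alleles = IUPAC[base_char.upper()]
    if len(alleles) == 1:
        # Fixed site: count it twice if heterozygotes are IUPAC-encoded.
        counts[BASES.index(alleles)] += 2 if double_fixed_sites else 1
    else:
        # Assumed behaviour: a heterozygote contributes one count per allele.
        for allele in alleles:
            counts[BASES.index(allele)] += 1
    return counts


counts = [0, 0, 0, 0]
add_base(counts, "A", double_fixed_sites=True)  # homozygous A
add_base(counts, "R", double_fixed_sites=True)  # heterozygous A/G
```

With this convention every individual contributes two counts, whether it is homozygous or heterozygous.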
close()
Write file type specifier, number of populations and number of sites to the beginning of the output file. Close file objects.
set_offset(offset)
Set the offset of the sequence.
Parameters: offset (int) – Value that can be set, if the reference sequence does not start at the 1-based position 1 but at the 1-based position offset.
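The arithmetic implied by the offset can be sketched as follows, assuming the reference is held as a plain 0-based Python string. The helper reference_index and the example values are hypothetical and not part of libPoMo.

```python
def reference_index(pos, offset=1):
    """Map a 1-based genomic position to a 0-based index into the reference."""
    if pos < offset:
        raise ValueError("position lies before the start of the reference")
    return pos - offset


# Hypothetical reference that starts at the 1-based position 1000.
ref_seq = "ACGT"
base = ref_seq[reference_index(1002, offset=1000)]
```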
libPoMo.cf.cf_to_fasta(cfS, outname, consensus=False)
Convert a CFStream to a fasta file.
Extracts the sequences of a counts file that has been initialized with a CFStream. The conversion starts at the line pointed to by the CFStream.
If more than one base is present at a single site, one base is sampled out of all present ones according to its abundance.
If consensus is set to True, the consensus sequence is extracted (i.e., no sampling; the base with the highest count is chosen for each individual or population).
Parameters:
- cfS (CFStream) – Counts format file stream.
- outname (str) – Fasta output file name.
- consensus (Boolean) – Optional; Extract consensus sequence? Defaults to False.
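The consensus option can be illustrated with a short sketch that works on plain lists instead of a CFStream. The function names and the tie-breaking rule (the first base in A, C, G, T order wins) are assumptions made for this example; the library may resolve ties differently.

```python
BASES = "ACGT"


def consensus_base(counts):
    """Return the base with the highest count; ties go to the first base."""
    best = max(range(len(BASES)), key=lambda i: counts[i])
    return BASES[best]


def consensus_sequences(names, sites):
    """Build one consensus sequence per population.

    sites is a list of sites; each site holds one [A, C, G, T] count list
    per population, in the same order as names.
    """
    seqs = {name: [] for name in names}
    for site in sites:
        for name, counts in zip(names, site):
            seqs[name].append(consensus_base(counts))
    return {name: "".join(chars) for name, chars in seqs.items()}


names = ["Sheep", "Wolf"]
sites = [
    [[0, 0, 1, 0], [0, 0, 5, 0]],
    [[0, 0, 0, 1], [0, 1, 4, 0]],
]
fasta = consensus_sequences(names, sites)
```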
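libPoMo.cf.faseq_append_base_of_cfS(faS, cfS, consensus=False)
Append a CFStream line to a libPoMo.fasta.FaSeq.
Randomly chooses bases for each position according to their abundance.
Parameters:
- faS (FaSeq) – Fasta sequence object to append to.
- cfS (CFStream) – Counts format file stream.
- consensus (Boolean) – Optional; defaults to False.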
libPoMo.cf.fasta_to_cf(fastaFN, countsFN, splitChar='-', chromName='NA', double_fixed_sites=False)
Convert fasta to counts format.
The (aligned) sequences in the fasta file are read in and the data is written to a counts format file.
Sequence names are stripped at the first dash. If the stripped sequence names coincide, the individuals are put into the same population.
E.g., homo_sapiens-XXX and homo_sapiens-YYY will be in the same population homo_sapiens.
Take care with large files; this uses a lot of memory.
The input as well as the output files can additionally be gzipped (indicated by a .gz file ending).
Variables: double_fixed_sites (bool) – Set to true if heterozygotes are encoded with IUPAC codes. Then, fixed sites will be counted twice so that the level of polymorphism stays correct.
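The grouping of individuals into populations can be sketched as follows. This is a simplified illustration that takes an in-memory list of (name, sequence) pairs rather than a fasta file and ignores IUPAC codes and gaps; the function name counts_from_alignment is hypothetical.

```python
def counts_from_alignment(seqs, split_char="-"):
    """Return {population: [[A, C, G, T] counts per site]} for aligned sequences."""
    populations = {}
    for name, seq in seqs:
        # Everything before the first split character is the population name.
        pop = name.split(split_char, 1)[0]
        populations.setdefault(pop, []).append(seq.upper())

    nsites = len(seqs[0][1])
    table = {}
    for pop, members in populations.items():
        table[pop] = [
            [sum(member[i] == base for member in members) for base in "ACGT"]
            for i in range(nsites)
        ]
    return table


alignment = [
    ("homo_sapiens-XXX", "ACG"),
    ("homo_sapiens-YYY", "ATG"),
    ("pan-1", "ACG"),
]
table = counts_from_alignment(alignment)
```

Because all sequences are held in memory at once, this approach shares the memory cost that the note above warns about.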
libPoMo.cf.interpret_cf_line(ln)
Interpret a counts file line.
Return type is a tuple containing the chromosome name, the position and a list with nucleotide counts (cf. counts file).
Parameters: ln (str) – Line in counts format.
Return type: (str, int, [[int]])
libPoMo.cf.weighted_choice(lst)
Choose element in integer list according to its value.
E.g., in [1,10], the second element will be chosen 10 times as often as the first one. Returns the index of the chosen element.
Variables: lst ([int]) – List of integers.
Return type: int
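One common way to implement such a weighted draw is sketched below. It is not necessarily how weighted_choice is written in libPoMo; the function name weighted_index and the optional rng argument are assumptions for this example.

```python
import random


def weighted_index(weights, rng=random):
    """Return index i with probability weights[i] / sum(weights)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Draw an integer in [0, total) and find the bucket it falls into.
    draw = rng.randrange(total)
    upto = 0
    for i, weight in enumerate(weights):
        upto += weight
        if draw < upto:
            return i
    raise AssertionError("unreachable for non-negative integer weights")


# Seeded generator so that repeated runs give the same draws.
rng = random.Random(0)
draws = [weighted_index([1, 10], rng) for _ in range(1100)]
```

For the list [1, 10] the draw takes one of eleven values; exactly one of them maps to index 0 and ten map to index 1, which gives the 1:10 ratio described above.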
libPoMo.cf.write_cf_from_MFaStream(refMFaStr, cfWr)
Write counts file using the given MFaStream and CFWriter.
Write the counts format file using the first sequences of all alignments in the MFaStream. The sequences are automatically reversed and complemented if this is needed (indicated in the header line). This is very useful if you e.g. want to compare the VCF files to a CCDC alignment.
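Parameters:
- refMFaStr (MFaStream) – Multiple alignment fasta stream that provides the reference sequences.
- cfWr (CFWriter) – Counts format writer that is used to write the output.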
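What reversing and complementing means for a sequence, and for a count entry, can be sketched as follows. Both helpers are hypothetical and only illustrate the operation; they are not taken from libPoMo.

```python
# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def complement_counts(counts):
    """Complement a [A, C, G, T] count list.

    A pairs with T and C pairs with G, so complementing the counts is the
    same as reversing the list.
    """
    return counts[::-1]


rc = reverse_complement("AACG")
cc = complement_counts([1, 0, 4, 0])
```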