pacbio_data_processing package

Subpackages

Submodules

pacbio_data_processing.bam module

class pacbio_data_processing.bam.BamFile(bam_file_name, mode='r')[source]

Bases: object

Proxy class for _BamFileSamtools and _BamFilePysam. This is a high level class whose only role is to choose among different possible states: _ReadableBamFile and _WritableBamFile and to select the underlying implementation (strategy) to interact with the BAM file:

  • _BamFileSamtools: implementation that simply wraps the ‘samtools’ command line, and

  • _BamFilePysam: implementation that uses ‘pysam’

The strategy can be chosen by the caller, although the current mechanism is, intentionally, a bit convoluted. For instance, instead of the default implementation (pysam), another one can be selected as follows:

from pacbio_data_processing.bam import BamFile
BamFile.bamfile_strategy_name = "samtools"
bam = BamFile("my.bam")

and samtools will be used under the hood to get access to the data in a BAM file.

__init__(bam_file_name, mode='r')[source]
bamfile_strategy_name = '_BamFilePysam'
class pacbio_data_processing.bam.BamFileStrategy(*args, **kwargs)[source]

Bases: Protocol

__init__(*args, **kwargs)
pacbio_data_processing.bam._strategy_factory(name: str = '_BamFilePysam') pacbio_data_processing.bam.BamFileStrategy[source]

Internal function that returns the strategy class in a concrete BamFile instance.

pacbio_data_processing.bam.pack_lines(lines)[source]
pacbio_data_processing.bam.set_pysam_verbosity()[source]

Ad-hoc function to suppress unpleasant error messages from pysam.

pacbio_data_processing.bam_file_filter module

This module contains the high level functions necessary to apply some filters to a given input BAM file.

class pacbio_data_processing.bam_file_filter.BamFilter(parameters)[source]

Bases: object

__call__()[source]

Call self as a function.

__init__(parameters)[source]
pacbio_data_processing.bam_file_filter.main()[source]

pacbio_data_processing.bam_utils module

Some helper functions to manipulate BAM files

class pacbio_data_processing.bam_utils.CircularDNAPosition(pos: int, ref_len: int = 0)[source]

Bases: object

A type that supports arithmetic with positions in a circular topology.

>>> p = CircularDNAPosition(5, ref_len=9)

The class has a decent repr:

>>> p
CircularDNAPosition(5, ref_len=9)

And we can use it in arithmetic contexts:

>>> p + 1
CircularDNAPosition(6, ref_len=9)
>>> int(p+1)
6
>>> int(p+5)
1
>>> int(20+p)
7
>>> p - 1
CircularDNAPosition(4, ref_len=9)
>>> int(p-6)
8
>>> int(p-16)
7
>>> int(2-p)
6
>>> int(8-p)
3

Also boolean equality is supported:

>>> p == CircularDNAPosition(5, ref_len=9)
True
>>> p == CircularDNAPosition(6, ref_len=9)
False
>>> p == CircularDNAPosition(14, ref_len=9)
True
>>> p == CircularDNAPosition(5, ref_len=8)
False
>>> p == 5
False

The < comparison is supported too:

>>> p < p+1
True
>>> p < p
False
>>> p < p-1
False

Of course two instances cannot be compared if their underlying references are not equally long:

>>> s = CircularDNAPosition(5, ref_len=10)
>>> p < s
Traceback (most recent call last):
...
ValueError: cannot compare positions if topologies differ

or if they are not both CircularDNAPosition instances:

>>> s < 6
Traceback (most recent call last):
...
TypeError: '<' not supported between instances of 'CircularDNAPosition' and 'int'

The class has a convenience method:

>>> p.as_1base()
6

If the ref_len input parameter is less than or equal to 0, the topology is assumed to be linear:

>>> q = CircularDNAPosition(5, ref_len=-1)
>>> q
CircularDNAPosition(5, ref_len=0)
>>> q + 1001
CircularDNAPosition(1006, ref_len=0)
>>> q - 100
CircularDNAPosition(-95, ref_len=0)
>>> int(10-q)
5

Linear topology is the default behaviour:

>>> r = CircularDNAPosition(5)
>>> r
CircularDNAPosition(5, ref_len=0)

It is possible to use them as indices in slices:

>>> seq = "ABCDEFGHIJ"
>>> seq[r:r+2]
'FG'

And CircularDNAPosition instances can be hashed (so that they can be elements of a set or keys in a dictionary):

>>> positions = {p, q, r}

And, very conveniently, a CircularDNAPosition converts to str as ints do:

>>> str(r) == '5'
True
__init__(pos: int, ref_len: int = 0)[source]

The parameter ‘ref_len’ represents the length of the sequence, which has full meaning only if the reference is truly circular. If the length is 0 or less, it is set to 0 and it is understood that the reference has a linear topology.
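The wrap-around arithmetic illustrated by the doctests of this class can be pictured with plain integers. The following is only a minimal sketch, not the class's actual implementation; wrap is a hypothetical helper introduced here for illustration.

```python
def wrap(pos: int, ref_len: int = 0) -> int:
    """Reduce ``pos`` modulo ``ref_len`` (hypothetical helper).

    A ``ref_len`` of 0 or less is taken to mean a linear topology,
    in which case the position is returned unchanged.
    """
    if ref_len <= 0:
        return pos
    return pos % ref_len


# Same numbers as in the doctests for CircularDNAPosition(5, ref_len=9):
print(wrap(5 + 5, 9))    # 1
print(wrap(5 - 16, 9))   # 7
print(wrap(2 - 5, 9))    # 6
# Linear topology: no wrap-around
print(wrap(5 - 100, 0))  # -95
```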

as_1base() int[source]

It returns the raw 1-based position.

class pacbio_data_processing.bam_utils.Molecule(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None)[source]

Bases: object

Abstraction around a single molecule from a Bam file

__init__(id: int, src_bam_path: Optional[Union[str, pathlib.Path]] = None, _best_ccs_line: Optional[tuple[bytes]] = None) None
property ascii_quals: str

Ascii qualities of sequencing the molecule. Each symbol refers to one base.

property cigar: pacbio_data_processing.cigar.Cigar
property dna: str
property end: pacbio_data_processing.bam_utils.CircularDNAPosition

Computes the end of a molecule as CircularDNAPosition(start + length of reference), which takes into account the possible circular topology of the reference.

find_gatc_positions() list[pacbio_data_processing.bam_utils.CircularDNAPosition][source]

The function returns the position of all the GATCs found in the Molecule’s sequence, taking into account the topology of the reference.

Each returned value is the 0-based index of a GATC motif, i.e., the index of its G in the Python convention.
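As an illustration of the idea, the sketch below locates a motif in a molecule's sequence and converts each hit into a reference position. It works on plain strings and integers rather than on Molecule and CircularDNAPosition objects; the function name and its parameters are assumptions made for this example, not the package's API.

```python
def find_motif_positions(dna: str, start: int, ref_len: int = 0,
                         motif: str = "GATC") -> list[int]:
    """Return the 0-based reference positions of ``motif`` (hypothetical helper).

    ``start`` is the reference position of the first base of ``dna``.
    With ``ref_len > 0`` the positions wrap around a circular reference.
    """
    positions = []
    index = dna.find(motif)
    while index != -1:
        pos = start + index
        if ref_len > 0:
            pos %= ref_len
        positions.append(pos)
        index = dna.find(motif, index + 1)
    return positions


# A molecule starting at position 7 of a 12-base circular reference:
print(find_motif_positions("AGATCTTGATC", start=7, ref_len=12))  # [8, 2]
```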

id: int
is_crossing_origin(*, ori_pi_shifted=False) bool[source]

This method answers the question of whether the molecule crosses the origin, assuming a circular topology of the chromosome. The answer is True if the last base of the molecule is located before the first base. Otherwise the answer is False. It will return False if the molecule starts at the origin, but it will be True if it ends at the origin. There is an optional keyword-only boolean parameter, ori_pi_shifted, to indicate whether the reference has been shifted by pi radians.
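One plausible reading of this rule, expressed with plain integers, is sketched below. It assumes that the end position is computed as start plus length, wrapped around the reference, and it does not model the ori_pi_shifted parameter; crosses_origin is a hypothetical name.

```python
def crosses_origin(start: int, length: int, ref_len: int) -> bool:
    """True if the wrapped end falls before the start (hypothetical helper).

    Assumes a circular reference (``ref_len > 0``) and an end position
    defined as ``(start + length) % ref_len``.
    """
    end = (start + length) % ref_len
    return end < start


print(crosses_origin(start=0, length=4, ref_len=9))  # False: starts at the origin
print(crosses_origin(start=5, length=4, ref_len=9))  # True: ends at the origin
print(crosses_origin(start=7, length=4, ref_len=9))  # True: wraps past the origin
```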

pi_shift_back() None[source]

Method that shifts back the (start, end) positions of the molecule assuming that they were shifted before by pi radians.

src_bam_path: Optional[Union[str, pathlib.Path]] = None
property start: pacbio_data_processing.bam_utils.CircularDNAPosition

Readable/writable attribute. It was originally read-only, but the SingleMoleculeAnalysis class relies on it being writable to make it easier to shift back pi-shifted positions, which are computed from this attribute. The logic is: by default, the value is taken from the _best_ccs_line attribute until it is modified, in which case the new value is simply stored and returned upon request.

pacbio_data_processing.bam_utils.count_subreads_per_molecule(bam: pacbio_data_processing.bam.BamFile) collections.defaultdict[int, collections.Counter][source]

Given a read-open BamFile instance, it returns a defaultdict with keys being molecule ids (str) and values, a counter with subreads classified by strand. The possible keys of the returned counter are: +, -, ? meaning direct strand, reverse strand and unknown, respectively.
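The shape of the returned object can be sketched as follows. The example consumes plain (molecule id, strand) pairs instead of a BamFile, which is an assumption made to keep it self-contained; count_strands_per_molecule is a hypothetical name.

```python
from collections import Counter, defaultdict


def count_strands_per_molecule(subreads):
    """Count subreads per molecule and strand (hypothetical helper).

    ``subreads`` is an iterable of (molecule id, strand) pairs, where
    strand is one of '+', '-' or '?'.
    """
    counts = defaultdict(Counter)
    for mol_id, strand in subreads:
        counts[mol_id][strand] += 1
    return counts


counts = count_strands_per_molecule(
    [(23, "+"), (23, "-"), (23, "+"), (476, "?"), (476, "-")]
)
print(counts[23]["+"], counts[23]["-"], counts[476]["?"])  # 2 1 1
```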

pacbio_data_processing.bam_utils.estimate_max_mapping_quality(bam: pacbio_data_processing.bam.BamFile, min_lines: Optional[int] = None, max_lines: Optional[int] = None) int[source]

This function estimates the maximum mapping quality found in the given BamFile, bam. It assumes that the file has been aligned.

The function has been designed to shortcut the time needed to fully read long BAM files: it is typically not necessary to read the whole file since normal BAM files are expected to have subreads not sorted by mapping quality.

The function iterates over the lines in the given BAM file. That iteration has checkpoints at the following line numbers:

10, 100, 1000, 10_000, 100_000, …

i.e., at powers of 10. If, at a given checkpoint, the max mapping quality is the same as the max mapping quality found at the previous checkpoint, the function returns that value. This is referred to below as an early return.

The previous procedure can be modified by adding:

  • an upper bound, and/or

  • a lower bound

to the number of lines read from the input BAM file bam. If an upper bound is given (max_lines), then, after having read so many lines, the function returns the maximum mapping quality found until that point, irrespective of the value found at the previous checkpoint. If a lower bound is given (min_lines), then an early return at a checkpoint will only happen if the number of read lines is larger than the lower bound, min_lines.
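The checkpoint procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the function's actual code: it iterates over plain integers (one mapping quality per line) instead of a BamFile, and it assumes that no early return is possible at the very first checkpoint because there is no previous checkpoint to compare with. The name estimate_max_mapq is hypothetical.

```python
from typing import Iterable, Optional


def estimate_max_mapq(
    mapqs: Iterable[int],
    min_lines: Optional[int] = None,
    max_lines: Optional[int] = None,
) -> int:
    """Estimate the maximum mapping quality with early return (sketch)."""
    current_max = 0
    previous_checkpoint_max = None  # no checkpoint visited yet
    next_checkpoint = 10
    for lines_read, mapq in enumerate(mapqs, start=1):
        current_max = max(current_max, mapq)
        # Upper bound: stop unconditionally after max_lines lines.
        if max_lines is not None and lines_read >= max_lines:
            break
        if lines_read == next_checkpoint:
            # Lower bound: early return only after more than min_lines lines.
            beyond_lower_bound = min_lines is None or lines_read > min_lines
            if beyond_lower_bound and current_max == previous_checkpoint_max:
                break
            previous_checkpoint_max = current_max
            next_checkpoint *= 10
    return current_max


# The high value at line 150 is missed by the early return at line 100 ...
data = [1] * 149 + [99] + [1] * 1000
print(estimate_max_mapq(data))                 # 1
# ... unless a lower bound forces the function to read further.
print(estimate_max_mapq(data, min_lines=100))  # 99
```

The second call shows why the lower bound exists: the early return trades accuracy for speed, and min_lines limits how early that trade can be made.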

pacbio_data_processing.bam_utils.flag2strand(flag: int) Literal['+', '-', '?'][source]

Given a FLAG (see the BAM format specification), it transforms it to the corresponding strand.

Returns

+, - or ? depending on the strand the input FLAG can be assigned to (? means: it could not be assigned to any strand).
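For orientation, the sketch below derives a strand from two FLAG bits defined in the SAM/BAM specification: 0x4 (segment unmapped) and 0x10 (sequence reverse complemented). The decision rule is an assumption made for this example; the package's function may apply a stricter rule, for instance accepting only specific FLAG values. The name flag_to_strand is hypothetical.

```python
UNMAPPED = 0x4         # SAM FLAG bit: segment unmapped
REVERSE_STRAND = 0x10  # SAM FLAG bit: SEQ is reverse complemented


def flag_to_strand(flag: int) -> str:
    """Map a FLAG to '+', '-' or '?' (sketch with an assumed rule)."""
    if flag & UNMAPPED:
        return "?"
    return "-" if flag & REVERSE_STRAND else "+"


print(flag_to_strand(0), flag_to_strand(16), flag_to_strand(4))  # + - ?
```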

pacbio_data_processing.bam_utils.gen_index_single_molecule_bams(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], program: pathlib.Path) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

It generates indices in the form of .pbi files using program, which must be the path to a working pbindex executable. For each molecule read from the input pipe, program is called as follows (the argument is the BAM associated with the current molecule):

pbindex aligned.pMA683.subreads.bam

The success of the operation is determined by inspecting the return code. If the call succeeds (i.e., the return code is 0), the corresponding MoleculeWorkUnit is yielded.

If the call fails (the return code is NOT 0), an error is reported.
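The call-and-check pattern described above can be sketched with the standard subprocess module. In this illustration each work unit is simplified to a (molecule id, BAM path) pair, and errors are reported through the logging module; both choices, and the name index_bams, are assumptions rather than the package's actual behaviour.

```python
import logging
import subprocess
from pathlib import Path
from typing import Iterable, Iterator


def index_bams(
    work_units: Iterable[tuple[int, Path]], program: Path
) -> Iterator[tuple[int, Path]]:
    """Run ``program <bam>`` per work unit; yield the successful ones (sketch)."""
    for mol_id, bam_path in work_units:
        result = subprocess.run(
            [str(program), str(bam_path)], capture_output=True, text=True
        )
        if result.returncode == 0:
            yield mol_id, bam_path
        else:
            # A failed call is reported and the work unit is dropped.
            logging.error(
                "indexing of %s failed (return code %d): %s",
                bam_path, result.returncode, result.stderr.strip(),
            )
```

Because this is a generator, no external program is run until the result is iterated over, which lets the step be chained with other generators in a pipeline.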

pacbio_data_processing.bam_utils.join_gffs(work_units: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], out_file_path: Union[str, pathlib.Path]) collections.abc.Generator[pathlib.Path, None, None][source]

The gff files related to the molecules provided in the input are read and joined in a single file. The individual gff files are yielded back.

Probably this function is useless and should be removed in the future: it only provides a joint gff file that is not a valid gff file and that is never used in the rest of the processing.

pacbio_data_processing.bam_utils.old_single_molecule_work_units_gen(lines: collections.abc.Iterable, header: bytes, file_name_prefix: pathlib.Path, todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

This generator yields 2-tuples of (mol-id, Molecule) after having isolated the subreads corresponding to that molecule id from the lines (coming from the iteration over a BamFile instance). Before yielding, a one-molecule BAM file is created.

Warning

This generator assumes that the subreads are sorted by molecule_id, aka ZMW number. In that case, this implementation is probably much faster in most situations than the functionally equivalent single_molecule_work_units_gen.

pacbio_data_processing.bam_utils.single_molecule_work_units_gen(inbam: pacbio_data_processing.bam.BamFile, out_name_without_molid: pathlib.Path, todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

This generator yields 2-tuples of (mol-id, Molecule) after having isolated the subreads corresponding to that molecule id from inbam. The generator relies on inbam having a mapping, inbam.last_subreads_map, that gives, for each molecule id, the last subread index corresponding to that molecule id. This generator properly handles the case of BAM files where the subreads are not grouped by molecule id, i.e. BAM files that are not sorted by molecule id (or ZMW).

Before yielding, a one-molecule BAM file is created with all the subreads of that molecule.

Warning

The current implementation keeps in memory a dictionary with all subreads of molecules that are not yet completely read. For large BAM files that can be a large memory footprint.
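The grouping strategy, and the memory footprint mentioned in the warning, can be sketched as follows. The example works on plain (molecule id, line) pairs and a plain dictionary standing in for inbam.last_subreads_map; it does not write any BAM file, and group_subreads_by_molecule is a hypothetical name.

```python
from collections import defaultdict


def group_subreads_by_molecule(subreads, last_subread_index):
    """Yield (molecule id, subreads) once a molecule is complete (sketch).

    ``subreads`` is an iterable of (molecule id, line) pairs in file order;
    ``last_subread_index`` maps each molecule id to the index of its last
    subread in that iteration.
    """
    pending = defaultdict(list)  # incomplete molecules are kept in memory
    for index, (mol_id, line) in enumerate(subreads):
        pending[mol_id].append(line)
        if last_subread_index[mol_id] == index:
            # All subreads of this molecule have been seen: release them.
            yield mol_id, pending.pop(mol_id)


unsorted = [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (3, "e")]
last = {1: 2, 2: 3, 3: 4}
print(list(group_subreads_by_molecule(unsorted, last)))
# [(1, ['a', 'c']), (2, ['b', 'd']), (3, ['e'])]
```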

pacbio_data_processing.bam_utils.split_bam_file_in_molecules(in_bam_file: Union[str, pathlib.Path], tempdir: Union[str, pathlib.Path], todo: dict[int, pacbio_data_processing.bam_utils.Molecule]) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

All the individual molecules in the bam file path given, in_bam_file, that are found in todo, will be isolated and stored individually in the directory tempdir. The yielded Molecule instances will have their src_bam_path updated accordingly.

pacbio_data_processing.bam_utils.write_one_molecule_bam(subreads: collections.abc.Iterable, header: bytes, in_file_name: pathlib.Path, pre_suffix: Any) pathlib.Path[source]

Given a sequence of BAM lines, a header, the source name and a suffix, a new BAM file is created, with a suitable name, containing the data provided.

pacbio_data_processing.cigar module

This module provides basic ‘re-invented’ functionality to handle Cigars. A Cigar describes the differences between two sequences by providing a series of operations that one has to apply to one sequence to obtain the other one. For instance, given these two sequences:

sequence 1 (e.g. from the reference):

AAGTTCCGCAAATT

and

sequence 2 (e.g. from the aligner):

AAGCTCCCGCAATT

The Cigar that brings us from sequence 1 to sequence 2 is:

3=1X3=1I4=1D2=

where the numbers refer to the amount of letters and the symbols’ meaning can be found in the table below. Therefore the Cigar in the example is a shorthand for:

3 equal bases followed by 1 replacement followed by 3 equal bases followed by 1 insertion followed by 4 equal bases followed by 1 deletion followed by 2 equal bases

symbol    meaning
=         equal
I         insertion
D         deletion
X         replacement
S         soft clip
H         hard clip
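To make the notation concrete, the sketch below parses a Cigar string into (count, symbol) items and computes a difference ratio for the example given above. It is not the Cigar class itself: the helper names are hypothetical, and the ratio is assumed here to be the number of bases under any symbol other than = divided by the total number of bases, which may differ from how diff_ratio is computed in the class.

```python
import re

CIGAR_ITEM = re.compile(r"(\d+)([=IDXSH])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a Cigar string into (count, symbol) items (hypothetical helper)."""
    return [(int(count), symbol) for count, symbol in CIGAR_ITEM.findall(cigar)]


def diff_ratio(cigar: str) -> float:
    """Fraction of bases not marked as equal (assumed definition)."""
    items = parse_cigar(cigar)
    total = sum(count for count, _ in items)
    if total == 0:
        raise ValueError("empty Cigar")
    diffs = sum(count for count, symbol in items if symbol != "=")
    return diffs / total


print(parse_cigar("3=1X3=1I4=1D2="))
# [(3, '='), (1, 'X'), (3, '='), (1, 'I'), (4, '='), (1, 'D'), (2, '=')]
print(diff_ratio("3=1X3=1I4=1D2="))  # 0.2
```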

class pacbio_data_processing.cigar.Cigar(incigar)[source]

Bases: object

__init__(incigar)[source]
property diff_ratio

difference ratio: 1 means that each base is different; 0 means that all the bases are equal.

property number_diff_items
property number_diff_types
property number_pb_diffs
property number_pbs
property sim_ratio

similarity ratio: 1 means that all the bases are equal; 0 means that each base is different.

This is computed from diff_ratio().

pacbio_data_processing.constants module

pacbio_data_processing.errors module

exception pacbio_data_processing.errors.MissingGooeyError[source]

Bases: ModuleNotFoundError

exception pacbio_data_processing.errors.SMAMergeError[source]

Bases: pacbio_data_processing.errors.SMAPipelineError

exception pacbio_data_processing.errors.SMAPipelineError[source]

Bases: Exception

pacbio_data_processing.errors.high_level_handler(func)[source]

pacbio_data_processing.external module

class pacbio_data_processing.external.AlignerMixIn[source]

Bases: object

A MixIn providing common functionality for aligner wrappers.

class pacbio_data_processing.external.Blasr(path: Union[pathlib.Path, str])[source]

Bases: pacbio_data_processing.external.AlignerMixIn, pacbio_data_processing.external.ExternalProgram

A simple wrapper around the blasr aligner (https://github.com/BioinformaticsArchive/blasr).

__call__(in_bamfile: Union[pathlib.Path, str], fasta: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str], nprocs: int = 1) Optional[int][source]

Call self as a function.

class pacbio_data_processing.external.CCS(path: Union[pathlib.Path, str])[source]

Bases: pacbio_data_processing.external.ExternalProgram

A simple wrapper around the ccs program, from the pbccs package (https://ccs.how/)

__call__(in_bamfile: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str]) Optional[int][source]

It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all; otherwise None is returned.

One case where the executable cannot run is when the sentinel file is there before the executable process is run.

class pacbio_data_processing.external.ExternalProgram(path: Union[pathlib.Path, str])[source]

Bases: object

A base class with functionality common to the wrappers of all external programs:

  1. that produce an output file, and

  2. whose output production is to be protected by a Sentinel.

This base class provides the interface and the Sentinel protection.

__call__(infile: Union[pathlib.Path, str], outfile: Union[pathlib.Path, str], *args, **kwargs) Optional[int][source]

It runs the executable with the given parameters. The return code of the associated process is returned by this method if the executable could run at all; otherwise None is returned.

One case where the executable cannot run is when the sentinel file is there before the executable process is run.

__init__(path: Union[pathlib.Path, str]) None[source]
exception pacbio_data_processing.external.MissingExternalToolError[source]

Bases: FileNotFoundError

class pacbio_data_processing.external.Pbmm2(path: Union[pathlib.Path, str])[source]

Bases: pacbio_data_processing.external.AlignerMixIn, pacbio_data_processing.external.ExternalProgram

A simple wrapper around the pbmm2 aligner (https://github.com/PacificBiosciences/pbmm2).

__call__(in_bamfile: Union[pathlib.Path, str], fasta: Union[pathlib.Path, str], out_bamfile: Union[pathlib.Path, str], preset: str = 'SUBREAD') Optional[int][source]

Call self as a function.

pacbio_data_processing.filters module

pacbio_data_processing.filters.cleanup_molecules(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], min_mapq_cutoff: int) collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None][source]

Generator of MoleculeWorkUnits that pass all the standard filters, i.e., the sequence of filters needed by sm-analysis to select which molecules (and which subreads in those molecules) will be IPD-analyzed. The current implementation allows the lower bound for the mapping quality to be specified through the min_mapq_cutoff parameter.

It is assumed that each file contains subreads corresponding to only ONE molecule (ie, ‘molecules’ is a generator of tuples (mol id, Molecule), with Molecule being related to a single molecule id). [Note for developers: Should we allow multiple molecules per file?]

If there are subreads surviving the filtering process, the bam file is overwritten with the filtered data and the tuple (mol id, Molecule) is yielded. If no subread survives the process, nothing is done (no bam written, no tuple yielded).

pacbio_data_processing.filters.empty_buffer(buf: collections.deque, threshold: int, flags_seen: set) Generator[tuple[bytes], None, None][source]

This generator empties the passed-in buffer, either yielding its items, if the conditions are met, or throwing them away if not.

The conditions are:

  1. the number of items is at least threshold, and

  2. flags_seen is a (not necessarily proper) superset of {'+', '-'}.
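The two conditions can be sketched as follows. This is an illustration written from the description above, not the function's actual code; drain_buffer is a hypothetical name.

```python
from collections import deque


def drain_buffer(buf: deque, threshold: int, flags_seen: set):
    """Empty ``buf``, yielding its items only if both conditions hold (sketch)."""
    keep = len(buf) >= threshold and {"+", "-"} <= flags_seen
    while buf:
        item = buf.popleft()
        if keep:
            yield item


buf = deque(["s1", "s2", "s3"])
print(list(drain_buffer(buf, threshold=2, flags_seen={"+", "-"})))  # ['s1', 's2', 's3']

buf = deque(["s1", "s2", "s3"])
print(list(drain_buffer(buf, threshold=2, flags_seen={"+"})))  # []
print(len(buf))  # 0: the buffer is emptied in both cases
```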

pacbio_data_processing.filters.filter_enough_data_per_molecule(lines: collections.abc.Iterable[tuple], threshold: int) Generator[tuple[bytes], None, None][source]

This generator yields the input data if there is enough data to yield. Enough means at least threshold number of data items.

pacbio_data_processing.filters.filter_mappings_binary(lines, mappings, *rest)[source]

Simply take or reject mappings depending on passed sequence

pacbio_data_processing.filters.filter_mappings_ratio(lines, mappings, ratio)[source]

Take or reject mappings depending on ratio of wished mappings vs total

pacbio_data_processing.filters.filter_quality(lines, quality_th)[source]
pacbio_data_processing.filters.filter_seq_len(lines, len_th)[source]

pacbio_data_processing.ipd module

exception pacbio_data_processing.ipd.MissingIpdSummaryError[source]

Bases: FileNotFoundError

exception pacbio_data_processing.ipd.UnknownErrorIpdSummary[source]

Bases: Exception

pacbio_data_processing.ipd.ipd_summary(molecule: tuple[int, pacbio_data_processing.bam_utils.Molecule], fasta: Union[str, pathlib.Path], program: pathlib.Path, nprocs: int, mod_types_comma_sep: str, ipd_model: Union[str, pathlib.Path], skip_if_present: bool) Optional[tuple[int, pacbio_data_processing.bam_utils.Molecule]][source]

Lowest level interface to ipdSummary: all calls to that program are expected to be done through this function. It runs ipdSummary with an input bam file like this:

ipdSummary aligned.pMA683.subreads.bam --reference pMA683.fa --identify m6A --gff aligned.pMA683.subreads.476.bam.gff

As a result of this, a gff file is created. This function sets an attribute in the target Molecule with the path to that file.

If the process went well (ipdSummary returns 0), the input MoleculeWorkUnit is returned, otherwise the molecule is tagged as being problematic (had_processing_problems is set to True) and None is returned.

Missing features:

  • skip_if_present

pacbio_data_processing.ipd.multi_ipd_summary(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None]

Generator that yields MoleculeWorkUnit resulting from ipd_summary (None results are skipped). Parallel implementation driven by a pool of threads.

pacbio_data_processing.ipd.multi_ipd_summary_direct(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None][source]

Generator that yields MoleculeWorkUnit resulting from ipd_summary (None results are skipped). Serial implementation (one file produced after the other).

pacbio_data_processing.ipd.multi_ipd_summary_threads(molecules: collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], fasta: Union[str, pathlib.Path], program: Union[str, pathlib.Path], num_ipds: int, nprocs_per_ipd: int, modification_types: str, ipd_model: Optional[str] = None, skip_if_present: bool = False) collections.abc.Generator[collections.abc.Generator[tuple[int, pacbio_data_processing.bam_utils.Molecule], None, None], None, None][source]

Generator that yields MoleculeWorkUnit resulting from ipd_summary (None results are skipped). Parallel implementation driven by a pool of threads.
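The thread-pool pattern described here can be sketched with concurrent.futures from the standard library. The worker stands in for ipd_summary and the function name run_in_threads is hypothetical; whether the package uses this exact executor is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(work_units, worker, num_workers: int):
    """Apply ``worker`` to each work unit in a pool of threads (sketch).

    Results equal to None (failed work units) are skipped. Results are
    yielded in input order, as guaranteed by ``Executor.map``.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for result in pool.map(worker, work_units):
            if result is not None:
                yield result


def fake_worker(work_unit):
    """Stand-in for ipd_summary: fails (returns None) for odd molecule ids."""
    mol_id, _ = work_unit
    return work_unit if mol_id % 2 == 0 else None


units = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
print(list(run_in_threads(units, fake_worker, num_workers=2)))
# [(2, 'b'), (4, 'd')]
```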

pacbio_data_processing.logs module

pacbio_data_processing.logs.config_logging(verbosity: int) None[source]

pacbio_data_processing.methylation module

A module containing methylation related code.

class pacbio_data_processing.methylation.MethylationReport(detections_csv, molecules, modification_types, filtered_bam_statistics=None)[source]

Bases: object

PRELOG = '[methylation report]'
__init__(detections_csv, molecules, modification_types, filtered_bam_statistics=None)[source]
property modification_types
save()[source]
pacbio_data_processing.methylation.match_methylation_states_m6A(pos_plus, ipd_meth_states)[source]

pacbio_data_processing.parameters module

This module defines mediator classes to interact with user given parameters.

class pacbio_data_processing.parameters.BamFilteringParameters(cl_input)[source]

Bases: pacbio_data_processing.parameters.ParametersBase

Mediator class: intermediary between the user input and the BamFilter instance.

property filter_mappings
property limit_mappings
property min_relative_mapping_ratio
property out_bam_file
class pacbio_data_processing.parameters.ParametersBase(cl_input)[source]

Bases: object

__init__(cl_input)[source]
class pacbio_data_processing.parameters.SingleMoleculeAnalysisParameters(cl_input)[source]

Bases: pacbio_data_processing.parameters.ParametersBase

Mediator class: intermediary between the user input and the SingleMoleculeAnalysis instance.

property aligner_path: pathlib.Path

The path to the aligner to be used. It depends on the choice made directly by the user through the -a option, and on the usage of the --use-blasr-aligner flag.

property ipd_model: Optional[pathlib.Path]
property joint_gff_filename
property partition: Optional[tuple[int, int]]

It validates the input partition and interfaces with API clients.

property partition_done_filename
property raw_detections_filename
property summary_report_html_filename

pacbio_data_processing.plots module

pacbio_data_processing.plots.make_barsplot(dataframe: pandas.core.frame.DataFrame, plot_title: str, filename: Union[pathlib.Path, str]) None[source]
pacbio_data_processing.plots.make_continuous_rolled_data(data: dict[typing.NewType.<locals>.new_type, typing.NewType.<locals>.new_type], window: int) pandas.core.frame.DataFrame[source]

Auxiliary function used by make_rolling_history to produce a dataframe with the rolling average of the input data. The resulting dataframe starts at the min input position and ends at the max input position. The holes are set to 0 in the input data.

pacbio_data_processing.plots.make_histogram(series: pandas.core.series.Series, plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True, bins: Optional[int] = None, log_scale: Optional[tuple] = None, vertical_line_at: Optional[float] = None, vertical_line_label: Optional[str] = None) None[source]
pacbio_data_processing.plots.make_multi_histogram(data: dict[str, pandas.core.series.Series], plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True) None[source]
pacbio_data_processing.plots.make_rolling_history(data: dict[typing.NewType.<locals>.new_type, typing.NewType.<locals>.new_type], plot_title: str, filename: Union[pathlib.Path, str], legend: bool = True, window: int = 1000) None[source]

pacbio_data_processing.sam module

pacbio_data_processing.sentinel module

class pacbio_data_processing.sentinel.Sentinel(checkpoint: pathlib.Path)[source]

Bases: object

This class creates objects that are expected to be used as context managers. At __enter__ a sentinel file is created. At __exit__ the sentinel file is removed. If the file is there before entering the context, or is not there when the context is exited, an exception is raised.
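The behaviour described above can be sketched with a small context manager. This is an illustration only: the classes below are local stand-ins written for this example, and the sketch leaves out the anti-aging mechanism that the real Sentinel uses to cope with abandoned sentinel files.

```python
from pathlib import Path


class SentinelFileFound(Exception):
    """Local stand-in: the sentinel file already exists on entry."""


class SentinelFileNotFound(Exception):
    """Local stand-in: the sentinel file is missing on exit."""


class SimpleSentinel:
    """Create a marker file on entry and remove it on exit (sketch)."""

    def __init__(self, checkpoint: Path):
        self.checkpoint = Path(checkpoint)

    def __enter__(self):
        try:
            # exist_ok=False makes creation fail if the file is already there
            self.checkpoint.touch(exist_ok=False)
        except FileExistsError as error:
            raise SentinelFileFound(str(self.checkpoint)) from error
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        try:
            self.checkpoint.unlink()
        except FileNotFoundError as error:
            raise SentinelFileNotFound(str(self.checkpoint)) from error
        return False  # do not swallow exceptions raised inside the block
```

A second process entering the same context while the file exists gets SentinelFileFound, which a caller can interpret as "work in progress elsewhere" and skip the computation.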

__init__(checkpoint: pathlib.Path)[source]
_anti_aging()[source]

Method that updates the modification time of the sentinel file every SLEEP_SECONDS seconds. This is part of the mechanism to ensure that the sentinel does not get fooled by an abandoned leftover sentinel file.

property is_file_too_old

Property that answers the question: is the sentinel file too old to be taken as an active sentinel file, or not?

exception pacbio_data_processing.sentinel.SentinelFileFound[source]

Bases: Exception

Exception expected when the sentinel file is there before its creation.

exception pacbio_data_processing.sentinel.SentinelFileNotFound[source]

Bases: Exception

Exception expected if the sentinel file is missing before the Sentinel removes it.

pacbio_data_processing.sm_analysis module

This module contains the high level functions necessary to run the ‘Single Molecule Analysis’ on an input BAM file.

class pacbio_data_processing.sm_analysis.SingleMoleculeAnalysis(parameters)[source]

Bases: object

property CCS_bam_file

It produces a Circular Consensus Sequence (CCS) version of the input BAM file and returns its name. It uses generate_CCS_file() to generate the file.

__call__() None[source]

Main entry point to perform a single molecule analysis: this method triggers the analysis.

__init__(parameters)[source]
_align_bam_if_no_candidate_found(inbam: pacbio_data_processing.bam.BamFile, bam_type: str, variant: str = 'straight') Optional[str][source]

[Internal method] Auxiliary method used by _ensure_input_bam_aligned and by _ensure_ccs_bam_aligned. Given a bam_type (among input and ccs) and a variant, an initial BAM file is selected and a target aligned BAM filename is constructed. The method checks first whether the aligned file is there. If a plausible candidate is not found, the initial BAM is aligned (straight or π-shifted, depending on the variant and using the proper reference). If, on the other hand, a candidate is found, its computation is skipped.

If the aligner cannot be run (i.e. calling the aligner returns None), None is returned, meaning that the aligner was not called. This can happen when the aligner finds a sentinel file indicating that the computation is work in progress. (See pacbio_data_processing.external.Blasr.__call__() for more details on the implementation.) This mechanism allows reentrancy.

Returns

the aligned input bam file, if it is there, or None if it could not be computed (yet).

_collect_statistics() None[source]

[Internal method] It sets an attribute: ‘filtered_bam_statistics’ that contains some data to be consumed by the MethylationReport. For now the only data is the number of subreads per molecule and per strand.

_collect_suitable_molecules_from_ccs() dict[int, pacbio_data_processing.bam_utils.Molecule][source]

[Internal method] Auxiliary routine of _select_molecules in charge of choosing suitable molecules from the aligned CCS bam files. The resulting mapping contains all suitable molecules in the ‘straight’ variant and the suitable molecules in the ‘π-shifted’ variant that are not in the ‘straight’ variant. The molecules corresponding to both variants will be joined. Among all the possible subreads of each molecule in the aligned CCS, one is chosen by map_molecules_with_highest_sim_ratio. The choice of suitable molecules is done by the method _discard_molecules_with_seq_mismatch. Moreover the molecules are labeled with the variant they belong to. It is necessary to do this labeling, so that we can later trace what reference each molecule is attached to.

_create_references()[source]

[Internal method] DNA reference sequences are created here. The ‘true’ reference must exist as fasta beforehand, with its index. A π-shifted reference is created from the original one. Its index is also made.

This method sets two attributes which are, both, mappings with two keys (‘straight’ and ‘pi-shifted’) and values as follows:

  • reference: the values are DNASeq objects

  • fasta: the values are Path objects
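The idea of a π-shifted reference, and of mapping positions back to the original reference, can be sketched with plain strings. The sketch assumes that the shift amounts to rotating the sequence by half its length, rounded down; the rounding convention and the helper names are assumptions, not taken from the package.

```python
def pi_shift(sequence: str) -> str:
    """Rotate a circular sequence by half its length (assumed convention)."""
    half = len(sequence) // 2
    return sequence[half:] + sequence[:half]


def shift_back(pos: int, ref_len: int) -> int:
    """Map a 0-based position in the shifted sequence to the original one."""
    return (pos + ref_len // 2) % ref_len


original = "ABCDEFGHI"
shifted = pi_shift(original)
print(shifted)  # EFGHIABCD
# Position 5 of the shifted sequence holds the base at the original origin:
print(shifted[5], shift_back(5, len(original)))  # A 0
```

Aligning against both references means that a molecule spanning the origin of one reference lies in the interior of the other, where a linear aligner can handle it.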

_crosscheck_molecules_in_partition_with_ccs(molecules_from_ccs: dict[int, pacbio_data_processing.bam_utils.Molecule]) None[source]

[Internal method] This method ensures that only the molecules in the current partition are processed. It does it by crosschecking the sets corresponding to the partition (for all variants) with the set of valid molecules in the ccs file. The attribute _molecules_todo is set, and its type is:

dict[int, Molecule]
_disable_pi_shifted_analysis() None[source]

[Internal method] If the pi-shifted analysis cannot be carried out, it is disabled with this method.

_discard_molecules_with_seq_mismatch(molecules_from_ccs: dict[int, pacbio_data_processing.bam_utils.Molecule]) dict[int, pacbio_data_processing.bam_utils.Molecule][source]

[Internal method] The aligned CCS molecules are filtered in this method to keep only molecules that match perfectly the corresponding reference (ie, taking into account variants).

_dump_results() None[source]

[Internal method] All the output generated is driven by this method:

  • a joint gff file

  • a per detection csv file

  • a methylation report

  • a summary report

  • the molecule sets (see pacbio_data_processing.summary.SummaryReport)

_ensure_ccs_bam_aligned() None[source]

[Internal method] As its name suggests, it is ensured that the aligned variants of the CCS file exist. The summary report is informed about the aligned CCS files.

Note

The CCS BAM file is created before checking if its aligned variants are present. It might seem a logic error to proceed this way instead of checking first for the existence of the aligned variants of the CCS BAM before deciding if the computation of the CCS BAM file is needed, but it is not an error: in order to decide if a given file can be an aligned version of the CCS BAM, we need the CCS BAM itself.

_ensure_input_bam_aligned() None[source]

[Internal method] Main check point for aligned input bam files: this method calls whatever is necessary to ensure that the input bam is aligned, which means: normal (straight) alignment and π-shifted alignment.

Warning! If the input is aligned, the method tries to find a pi-shifted aligned BAM, and it succeeds only if:

  1. a file with suitable filename is found, and

  2. it is aligned.

_exists_pi_shifted_variant_from_aligned_input() bool[source]

[Internal method] It checks that the expected pi-shifted aligned file exists and is an aligned BAM file.

_filter_molecules() None[source]

[Internal method] The _molecules_todo mapping is reduced here by removing molecules that do not fulfill a minimum quality requirement. The summary report is updated accordingly. See the cleanup_molecules auxiliary function for details on the filtering process. An attribute called _filtered_molecules_generator is set, which produces MoleculeWorkUnit objects.

_fix_positions() None[source]

[Internal method] The purpose is to shift back the shifted positions in the π-shifted molecules. Two operations are required to complete that task:

  1. fixing positions in the gff files, and

  2. fixing positions in the molecules themselves.

_fix_positions_in_gffs() None[source]

[Internal method] In the case that some molecules have been processed, the positions in the gff files corresponding to molecules that have been π-shifted are shifted back.

_fix_positions_in_molecules() None[source]

[Internal method] All positions of π-shifted molecules are shifted back in the _molecules_todo dictionary (which will be used to generate the methylation report).

_generate_indices() None[source]

[Internal method] Indices are generated for all files that need to be analyzed by ipdSummary.

_init_summary() None[source]

[Internal method] This method creates an instance of SummaryReport and sets an attribute with it.

_ipd_analysis() None[source]

[Internal method] Performs the IPD analysis of the single molecule files. Sets a generator with Paths to produced GFF files.

_keep_only_pishifted_molecules_crossing_origin(molecules_from_ccs: dict[int, pacbio_data_processing.bam_utils.Molecule]) dict[int, pacbio_data_processing.bam_utils.Molecule][source]

[Internal method] This method filters out molecules from the CCS aligned list that

  1. belong to the π-shifted variant, and

  2. do not cross the origin.

These molecules are unwanted because the point of including π-shifting in the analysis is to catch molecules crossing the origin.

_merge_partitions_if_needed() None[source]

[Internal method] This method merges properly the output files produced during the processing of all the partitions, if they are ready.

Warning

It is assumed that this is called within the _post_process_partition phase.

property _minimum_mapping_quality: int

[Internal property] This attribute (cached, but not intended to be manually overwritten) returns the minimum value of the mapping quality that is acceptable for the current analysis. If a mapping quality threshold is provided in the command line, it is used and returned as the attribute value. Otherwise a value is computed using the bam_utils.estimate_max_mapping_quality function.

_post_process_partition() None[source]

[Internal method] After the analysis is done, if only a fraction (aka partition) was processed, this method declares that the analysis of the current partition is complete and tries to merge the partitions (which will only occur if the proper conditions are met).

_remove_partition_done_files()[source]

[Internal method] Remove the partition done marker files after the partitions have been successfully merged.

Warning

It is assumed that this is called within the _post_process_partition phase.

_report_discarded_molecules_with_seq_mismatch(mols_in_raw_ccs_files: dict[str, dict[int, pacbio_data_processing.bam_utils.Molecule]], molecules_from_ccs: dict[int, pacbio_data_processing.bam_utils.Molecule]) None[source]

[Internal method] This method simply logs the IDs of discarded molecules and passes the information to the SummaryReport instance.

_report_faulty_molecules() None[source]

[Internal method] The molecules that had any problem in their processing are passed to the SummaryReport as a set.

_select_molecules() None[source]

[Internal method] This method is part of the main sequence irrespective of whether the user selects to only produce the methylation report, or the full analysis. After this method the mapping _molecules_todo is created of type dict[int, Molecule], with molecules that:

  1. Belong to the partition,

  2. Are correctly mapped in the aligned CCS file, and

  3. If they belong to the pi-shifted variant (molecules obtained after aligning with a pi-shifted reference) then they cross the origin.

_set_aligner() None[source]

[Internal method] This method decides what aligner to use, sets an attribute with it and sets the prefixes (used in log messages) accordingly.

_split_bam() None[source]

[Internal method] Produces a generator with 2-tuples of the type (mol_id[int], Molecule) where the Molecule is related to a single molecule BAM file that has been generated by split_bam_file_in_molecules. It sets an attribute called _per_molecule_bam_generator that refers to that generator.

property all_partition_done_filenames: list[pathlib.Path]

Attribute that returns a list of Path objects corresponding to the files expected to be found when all the partitions are processed (in case of partitioning the input BAM).

property all_partitions_ready: bool

Attribute that answers the question: are all the partitions ready?

property partition: pacbio_data_processing.utils.Partition

The target Partition of the input BAM file that must be processed by the current analysis, according to the input provided by the user.

produce_methylation_report() None[source]
property workdir: tempfile.TemporaryDirectory

This attribute returns the necessary temporary working directory on demand and it ensures that only one temporary dir is created by caching.

pacbio_data_processing.sm_analysis._main(config) None[source]

This function drives the Single Molecule Analysis once the input has been parsed.

pacbio_data_processing.sm_analysis.create_raw_detections_file(gffs: collections.abc.Iterable[Union[pathlib.Path, str]], detections_filename: Union[pathlib.Path, str], modification_types: list[str])[source]

Function in charge of creating the raw detections file. Starting from a set of .gff files, a csv file (delimiter=”,”), the raw detections file, is saved with the following columns:

  • mol id: taken from each gff filename (e.g. ‘a.b.c.gff’ -> mol id: ‘b’);

  • modtype: column number 3 (idx: 2) of the gffs (feature type) (e.g. ‘m6A’);

  • GATC position: column number 5 (idx: 4) of each gff which corresponds to the ‘end coordinate of the feature’ in the GFF3 standard;

  • score of the feature: column number 6 (idx: 5); floating point (Phred-transformed pvalue that a kinetic deviation exists at this position)

  • strand: strand of the feature. It can be +, - with obvious meanings. It can also be ? (meaning unknown) or . (for non stranded features)

There are more columns. Although their number is not fixed by this function, in practice they are 4 in the case of a detected modification. In that case these 4 last columns correspond to the values given in the ‘attributes’ column of the gffs (col 9; idx 8). For example, given the following attributes column:

coverage=134;context=TCA...;IPDRatio=3.91;identificationQv=228

we would get the following 4 ‘extra’ columns in our raw detections file:

134,TCA...,3.91,228

and this is exactly what happens with the m6A modification type. Notice that the value of identificationQv is, again, a Phred-transformed probability of having a detection. See eq. (8) in [1].

Parsing: All the lines starting by ‘#’ in the gff files are ignored. The format of the gff file is GFF3: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
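The column mapping described above can be illustrated with a minimal sketch. The helper name gff_line_to_row and the sample GFF line are hypothetical and not part of the package; the molecule ID is passed in directly instead of being parsed from the filename.

```python
import csv
import io


def gff_line_to_row(mol_id: str, line: str) -> list[str]:
    """Turn one GFF3 feature line into a raw-detections row (illustrative)."""
    cols = line.rstrip("\n").split("\t")
    # GFF3 columns (0-based): 2=type, 4=end, 5=score, 6=strand, 8=attributes
    modtype, end, score, strand, attributes = (
        cols[2], cols[4], cols[5], cols[6], cols[8]
    )
    # Keep only the values of the 'key=value' pairs, in file order.
    extra = [item.split("=", 1)[1] for item in attributes.split(";") if "=" in item]
    return [mol_id, modtype, end, score, strand, *extra]


# Hypothetical feature line, built to match the attributes example above.
sample = "\t".join([
    "ref", "kinModCall", "m6A", "100", "100", "30", "+", ".",
    "coverage=134;context=TCA;IPDRatio=3.91;identificationQv=228",
])

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")
for gff_line in ["##gff-version 3", sample]:
    if gff_line.startswith("#"):  # comments and directives are ignored
        continue
    writer.writerow(gff_line_to_row("b", gff_line))

print(buffer.getvalue().strip())
```

The number of 'extra' columns simply follows the number of key=value pairs in the attributes column, which is why it is not fixed in advance.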

[1]: “Detection and Identification of Base Modifications with Single Molecule Real-Time Sequencing Data”

pacbio_data_processing.sm_analysis.generate_CCS_file(ccs: pacbio_data_processing.external.CCS, in_bam: pathlib.Path, ccs_bam_file: pathlib.Path) Optional[pathlib.Path][source]

Idempotent computation of the Circular Consensus Sequence (CCS) version of the passed-in in_bam file, done with the passed-in ccs object.

Returns

the CCS bam file, if it is there, or None if it could not be computed (yet).

pacbio_data_processing.sm_analysis.main_cl() None[source]

Entry point for sm-analysis executable.

pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio(bam_file_name: Optional[Union[pathlib.Path, str]]) dict[int, pacbio_data_processing.bam_utils.Molecule][source]

Given the path to a bam file, it returns a dictionary, whose keys are mol ids (ints) and the values are the corresponding Molecules. If multiple lines in the given BAM file share the mol id, only the first line found with the highest similarity ratio (computed from the cigar) is chosen: if multiple lines share the molecule ID and the highest similarity ratio (say, 1), ONLY the first one is taken, irrespective of other factors.
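The selection rule (first record with the highest similarity ratio wins) can be sketched as follows. The function name keep_first_best and the (mol_id, ratio, record) tuples are assumptions made for illustration; the actual function reads these values from the BAM file and computes the ratio from the CIGAR.

```python
from collections.abc import Iterable
from typing import Any


def keep_first_best(records: Iterable[tuple[int, float, Any]]) -> dict[int, Any]:
    """Keep, per molecule ID, the first record with the highest ratio."""
    best: dict[int, tuple[float, Any]] = {}
    for mol_id, ratio, record in records:
        # Strict '>' ensures that a later record with an equal ratio
        # never replaces the one found first.
        if mol_id not in best or ratio > best[mol_id][0]:
            best[mol_id] = (ratio, record)
    return {mol_id: record for mol_id, (_, record) in best.items()}


records = [
    (7, 0.9, "line-1"),
    (7, 1.0, "line-2"),
    (7, 1.0, "line-3"),  # same ratio as line-2, but found later
    (9, 0.8, "line-4"),
]
selected = keep_first_best(records)
```

Here molecule 7 maps to "line-2": it ties with "line-3" on the ratio, and the tie is resolved by order of appearance only.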

pacbio_data_processing.sm_analysis.restore_old_run(old_path, new_path)[source]

pacbio_data_processing.sm_analysis_gui module

pacbio_data_processing.sm_analysis_gui.main_gui()[source]

Entry point for sm-analysis-gui executable.

pacbio_data_processing.summary module

class pacbio_data_processing.summary.AlignedBamAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.AlignedCCSBamsAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.BarsPlotAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.ROAttribute

class pacbio_data_processing.summary.GATCCoverageBarsPlot(name=None)[source]

Bases: pacbio_data_processing.summary.BarsPlotAttribute

data_definition = {'GATCs NOT in BAM file (%)': ('perc_all_gatcs_not_identified_in_bam',), 'GATCs NOT in methylation report (%)': ('perc_all_gatcs_not_in_meth',), 'GATCs in BAM file (%)': ('perc_all_gatcs_identified_in_bam',), 'GATCs in methylation report (%)': ('perc_all_gatcs_in_meth',)}
dependency_names = ('aligned_ccs_bam_files', 'methylation_report')
index_labels = ('Percentage',)
title = 'GATCs in BAM file and Methylation report'
class pacbio_data_processing.summary.HistoryPlotAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.ROAttribute

class pacbio_data_processing.summary.InputBamAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.InputReferenceAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.MappingQualityHistogram(name=None)[source]

Bases: pacbio_data_processing.summary.HistoryPlotAttribute

data_name = 'mapping quality'
dependency_name = 'aligned_bam'
legend = True
make_data_for_plot(instance)[source]
make_up_extra_args(instance)[source]
title = 'Mapping quality histogram of subreads in the aligned input BAM'
class pacbio_data_processing.summary.MappingQualityThresholdAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.MethTypeBarsPlot(name=None)[source]

Bases: pacbio_data_processing.summary.BarsPlotAttribute

data_definition = {'Fully methylated (%)': ('fully_methylated_gatcs_wrt_meth',), 'Fully unmethylated (%)': ('fully_unmethylated_gatcs_wrt_meth',), 'Hemi-methylated in + strand (%)': ('hemi_plus_methylated_gatcs_wrt_meth',), 'Hemi-methylated in - strand (%)': ('hemi_minus_methylated_gatcs_wrt_meth',)}
dependency_names = ('methylation_report',)
index_labels = ('Percentage',)
title = 'Methylation types in methylation report'
class pacbio_data_processing.summary.MethylationReport(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.MoleculeLenHistogram(name=None)[source]

Bases: pacbio_data_processing.summary.HistoryPlotAttribute

column_name = 'len(molecule)'
data_name = 'length'
dependency_name = 'methylation_report'
labels = ('Initial subreads', 'Analyzed molecules')
legend = True
make_data_for_plot(instance)[source]
title = 'Initial subreads and analyzed molecule length histogram'
class pacbio_data_processing.summary.MoleculeTypeBarsPlot(name=None)[source]

Bases: pacbio_data_processing.summary.BarsPlotAttribute

data_definition = {'Faulty (with processing error)': ('perc_faulty_mols', 'perc_faulty_subreads'), 'Filtered out': ('perc_filtered_out_mols', 'perc_filtered_out_subreads'), 'In Methylation report with GATC': ('perc_mols_in_meth_report_with_gatcs', 'perc_subreads_in_meth_report_with_gatcs'), 'In Methylation report without GATC': ('perc_mols_in_meth_report_without_gatcs', 'perc_subreads_in_meth_report_without_gatcs'), 'Mismatch discards': ('perc_mols_dna_mismatches', 'perc_subreads_dna_mismatches'), 'Used in aligned CCS': ('perc_mols_used_in_aligned_ccs', 'perc_subreads_used_in_aligned_ccs')}
dependency_names = ('mols_used_in_aligned_ccs', 'methylation_report')
index_labels = ('Number of molecules (%)', 'Number of subreads (%)')
title = 'Processed molecules and subreads'
class pacbio_data_processing.summary.MolsSetAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.PercAttribute(total_attr, pref='perc_', suf='_wrt_meth', name=None)[source]

Bases: pacbio_data_processing.summary.ROAttribute

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

__init__(total_attr, pref='perc_', suf='_wrt_meth', name=None)[source]
class pacbio_data_processing.summary.PositionCoverageBarsPlot(name=None)[source]

Bases: pacbio_data_processing.summary.BarsPlotAttribute

data_definition = {'Positions NOT covered by molecules in BAM file (%)': ('perc_all_positions_not_in_bam',), 'Positions NOT covered by molecules in methylation report (%)': ('perc_all_positions_not_in_meth',), 'Positions covered by molecules in BAM file (%)': ('perc_all_positions_in_bam',), 'Positions covered by molecules in methylation report (%)': ('perc_all_positions_in_meth',)}
dependency_names = ('aligned_ccs_bam_files', 'methylation_report')
index_labels = ('Percentage',)
title = 'Position coverage in BAM file and Methylation report'
class pacbio_data_processing.summary.PositionCoverageHistory(name=None)[source]

Bases: pacbio_data_processing.summary.HistoryPlotAttribute

dependency_name = 'methylation_report'
labels = ('Positions',)
legend = False
len_column_name = 'len(molecule)'
make_data_for_plot(instance)[source]
start_column_name = 'start of molecule'
title = 'Sequencing positions covered by analyzed molecules'
class pacbio_data_processing.summary.ROAttribute(name=None)[source]

Bases: pacbio_data_processing.summary.SimpleAttribute

class pacbio_data_processing.summary.SimpleAttribute(name=None)[source]

Bases: object

The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.
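The idea of a descriptor that wraps a dictionary owned by the instance can be sketched like this. The names DataBackedAttribute and Report are hypothetical; only the use of a _data dictionary comes from the description above.

```python
class DataBackedAttribute:
    """Descriptor storing its value in the owner's ``_data`` dict (illustrative)."""

    def __init__(self, name=None):
        self.name = name

    def __set_name__(self, owner, name):
        # Fall back to the attribute name used in the class body.
        if self.name is None:
            self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance._data[self.name]

    def __set__(self, instance, value):
        instance._data[self.name] = value


class Report:
    input_bam = DataBackedAttribute()

    def __init__(self):
        self._data = {}


report = Report()
report.input_bam = "my.bam"
```

A read-only variant could be obtained by overriding __set__ to raise AttributeError, and a computed variant by overriding __get__.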

__init__(name=None)[source]
class pacbio_data_processing.summary.SummaryReport(bam_path, aligned_bam_path, dnaseq, figures_prefix='')[source]

Bases: collections.abc.Mapping

Final summary report generated by sm-analysis initially intended for humans.

This class has been crafted to carefully control most of its attributes. Data can be fed into the class by setting some attributes. That process can trigger the generation of other attributes, that are typically read-only. In some cases the attributes are simple attributes, without side effects.

After instantiating the class with the path to the input BAM and the dna sequence of the reference (instance of DNASeq), one must set some attributes to be able to save the summary report:

s = SummaryReport(bam_path, aligned_bam_path, dnaseq)
s.methylation_report = path_to_meth_report
s.raw_detections = path_to_raw_detections_file
s.gff_result = path_to_gff_result_file
s.aligned_ccs_bam_files = {
    'straight': aligned_ccs_path,
    'pi-shifted': pi_shifted_aligned_ccs_path
}
# The next is optional: it will add a vertical line to the
# mapping quality histogram:
s.mapping_quality_threshold = 30

# Some information about what happened with some molecules must
# be given as well. There are two options for that. First, in the
# *normal flow* the following would be done:
s.mols_used_in_aligned_ccs = {3, 67, ...}  # set of ints
# Optionally you can provide:
s.mols_dna_mismatches = {20, 49, ...}  # set of ints
# or/and:
s.filtered_out_mols = {22, 493, ...}  # set of ints
# or/and:
s.faulty_mols = {332, 389, ...}  # set of ints

# The second possibility is to load the data about the molecules
# from file(s). That is an option if a partitioned
# ``SingleMoleculeAnalysis`` has been carried out and the results
# must be merged. In that case, you would do:
s.load_molecule_sets("file1.pickle")
s.load_molecule_sets("file2.pickle")
...
# and as many files as necessary can be loaded. Their information
# will be added together.
# The names of the files can be also ``Path`` instances (which is
# the usual case).

At this point all the necessary data is there and the report can be created:

s.save('summary_whatever.html')
__init__(bam_path, aligned_bam_path, dnaseq, figures_prefix='')[source]
aligned_bam
aligned_ccs_bam_files
all_gatcs_identified_in_bam
all_gatcs_in_meth
all_gatcs_not_identified_in_bam
all_gatcs_not_in_meth
all_positions_in_bam
all_positions_in_meth
all_positions_not_in_bam
all_positions_not_in_meth
property as_html: str
body_md5sum
dump_molecule_sets(filename: pathlib.Path) None[source]

This method stores the _molecule_sets attribute in a file using pickle. The motivation is to be able to easily combine several SummaryReport instances coming from different partitioned analyses. To do that without repeating the filtering process, etc., it is necessary to have the information about what molecules have been discarded for different reasons and what molecules are used from the aligned files.

faulty_mols
faulty_subreads
filtered_out_mols
filtered_out_subreads
full_md5sum
fully_methylated_gatcs
fully_methylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

fully_unmethylated_gatcs
fully_unmethylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

gatc_coverage_bars
gff_result

The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.

hemi_methylated_gatcs
hemi_methylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

hemi_minus_methylated_gatcs
hemi_minus_methylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

hemi_plus_methylated_gatcs
hemi_plus_methylated_gatcs_wrt_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

input_bam
input_bam_size
input_reference
keys() a set-like object providing a view on D's keys[source]
load_molecule_sets(filename: pathlib.Path) None[source]

This method reads data from the file filename (using pickle). It assumes that a dictionary is obtained with the sets of molecule IDs (int) that are important to re-create the state of the SummaryReport without going through the SingleMoleculeAnalysis process all over again.

It can be used multiple times and the sets obtained each time will update the current ones (a mathematical union of sets).
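The accumulation of sets over several files can be sketched as below. The function name update_sets_from_file and the layout of the pickled dictionary (attribute name mapped to a set of ints) are assumptions for illustration; as usual with pickle, only trusted files should be loaded.

```python
import pickle
import tempfile
from pathlib import Path


def update_sets_from_file(current: dict[str, set[int]], filename: Path) -> None:
    """Merge the sets stored in a pickle file into ``current`` (set union)."""
    with open(filename, "rb") as handle:
        loaded = pickle.load(handle)
    for key, mol_ids in loaded.items():
        current.setdefault(key, set()).update(mol_ids)


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical partitions with overlapping information.
    contents = [
        {"filtered_out_mols": {22, 493}},
        {"filtered_out_mols": {493, 501}, "faulty_mols": {332}},
    ]
    files = []
    for i, content in enumerate(contents):
        path = Path(tmp) / f"partition{i}.pickle"
        path.write_bytes(pickle.dumps(content))
        files.append(path)

    molecule_sets: dict[str, set[int]] = {}
    for path in files:
        update_sets_from_file(molecule_sets, path)
```

Because the update is a union, loading the same file twice leaves the result unchanged.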

mapping_qualities
mapping_quality_histogram
mapping_quality_threshold
max_possible_methylations
meth_type_bars
methylation_report
molecule_len_histogram
molecule_type_bars
mols_dna_mismatches
mols_in_meth_report
mols_in_meth_report_with_gatcs
mols_in_meth_report_without_gatcs
mols_ini
mols_used_in_aligned_ccs
perc_all_gatcs_identified_in_bam

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_all_gatcs_in_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_all_gatcs_not_identified_in_bam

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_all_gatcs_not_in_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_all_positions_in_bam

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_all_positions_in_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_all_positions_not_in_bam

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_all_positions_not_in_meth

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_faulty_mols

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_faulty_subreads

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_filtered_out_mols

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_filtered_out_subreads

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_mols_dna_mismatches

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_mols_in_meth_report

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_mols_in_meth_report_with_gatcs

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_mols_in_meth_report_without_gatcs

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_mols_used_in_aligned_ccs

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_subreads_dna_mismatches

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_subreads_in_meth_report

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_subreads_in_meth_report_with_gatcs

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_subreads_in_meth_report_without_gatcs

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_subreads_used_in_aligned_ccs

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_subreads_with_high_mapq

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

perc_subreads_with_low_mapq

From a given attribute in a SummaryReport instance, the percentage is computed (wrt the value in s.total_attr) and returned as str.

position_coverage_bars
position_coverage_history
raw_detections

The base class of all other descriptor managed attributes of SummaryReport. It is a wrapper around the _data dictionary of the instance owning this attribute.

ready_to_go(*attrs) bool[source]

Method used to check if some attributes are already usable or not (in other words if they have been already set or not).

reference_base_pairs
reference_md5sum
reference_name
save(filename) None[source]
subreads_aligned_ini
subreads_dna_mismatches
subreads_in_meth_report
subreads_in_meth_report_with_gatcs
subreads_in_meth_report_without_gatcs
subreads_ini
subreads_used_in_aligned_ccs
subreads_with_high_mapq
subreads_with_low_mapq
switch_on(attribute: str) None[source]

Method used by descriptors to inform the instance of SummaryReport that some computed attributes needed by the plots are already computed and usable.

total_gatcs_in_ref

pacbio_data_processing.templates module

pacbio_data_processing.types module

pacbio_data_processing.utils module

class pacbio_data_processing.utils.AlmostUUID[source]

Bases: object

A class that provides a 5-letter summary of a UUID. It is intended to be used as a prefix in all log messages. It is not necessary that two instances are different. But it is necessary that:

  1. the string representation is short, and

  2. given two instances their string representations most probably differ.

The underlying UUID is obtained from the stdlib using uuid.uuid1. The class is implemented using the Borg pattern: all instances running in the same interpreter share a common _uuid attribute.
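A minimal sketch of the Borg pattern applied to a shared UUID follows. The class name SharedShortId is hypothetical, and taking the first 5 hexadecimal characters as the short form is an assumption; the source does not say how the 5 letters are derived.

```python
import uuid


class SharedShortId:
    # Borg pattern: this dictionary is shared by every instance.
    _shared_state: dict = {}

    def __init__(self) -> None:
        self.__dict__ = self._shared_state
        # Only the first instance creates the UUID; later ones reuse it.
        if "_uuid" not in self.__dict__:
            self._uuid = uuid.uuid1()

    def __str__(self) -> str:
        # Assumption: short form = first 5 hex characters of the UUID.
        return self._uuid.hex[:5]


a, b = SharedShortId(), SharedShortId()
```

Unlike a singleton, a and b are distinct objects; they only share state, so every log message produced in the same interpreter carries the same prefix.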

__init__() None[source]
class pacbio_data_processing.utils.DNASeq(raw_seq: pacbio_data_processing.utils.DNASeqLike, name: str = '', description: str = '')[source]

Bases: Generic[pacbio_data_processing.utils.DNASeqLike]

Wrapper around ‘Bio.Seq.Seq’.

__init__(raw_seq: pacbio_data_processing.utils.DNASeqLike, name: str = '', description: str = '')[source]
classmethod from_fasta(fasta_name: str) pacbio_data_processing.utils.DNASeqType[source]

Returns a DNASeq from the first DNA sequence stored in the fasta named ‘fasta_name’ after ensuring that the fasta index is there.

property md5sum: str

It returns the MD5 checksum’s hexdigest of the upper version of the sequence as a string.

pi_shifted() pacbio_data_processing.utils.DNASeqType[source]

Method to return a pi-shifted DNASeq from the original one. pi-shifted means that a circular topology is assumed in the DNA sequence and the origin is shifted by π radians, i.e. the sequence is split into two parts and the two parts are swapped.
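As a plain-string sketch of the operation (the function name pi_shift is hypothetical, and the split point len // 2 is an assumption inferred from the shift_me_back examples in this module, not taken from the method's code):

```python
def pi_shift(sequence: str) -> str:
    """Swap the two halves of a circular sequence (illustrative)."""
    half = len(sequence) // 2
    return sequence[half:] + sequence[:half]


odd = pi_shift("1234567")      # 7 symbols: the first part has 3 of them
even = pi_shift("ABCDEFGHIJ")  # 10 symbols: two equal halves
```

For an odd length the two parts cannot be equal; under this assumption the shorter part is the one moved to the end.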

upper() Bio.Seq.Seq[source]
write_fasta(output_file_name: Union[pathlib.Path, str]) None[source]
class pacbio_data_processing.utils.Partition(partition_specification: Optional[tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile)[source]

Bases: object

A Partition is a class that helps answer the following question: assuming that we are interested in processing a fraction of a BamFile, does the molecule ID mol_id belong to that fraction, or not? A prior implementation consisted in storing all the molecule IDs in the BamFile corresponding to a given partition in a set, so that the answer was obtained by querying whether a molecule ID belongs to the set. That former implementation is not enough for the case of multiple alignment processes for the same raw BamFile (e.g., when a combined analysis of the so-called ‘straight’ and ‘pi-shifted’ variants is performed). In that case the partition is decided with one file, and all molecule IDs belonging to the non-empty intersection with the other file must be unambiguously accommodated in a certain partition. This class has been designed to solve that problem.

__init__(partition_specification: Optional[tuple[int, int]], bamfile: pacbio_data_processing.bam.BamFile) None[source]

Creates a Partition object without validating the partition_specification, which is done at the time of reading the input given by the user. See pacbio_data_processing.parameters.SingleMoleculeAnalysisParameters.

_delimit_partitions() None[source]

[Internal method] This method decides what are the limits of all partitions given the number of partitions. The method sets an internal mapping, self._lower_limits, of the type {partition number [int]: lower limit [int]} with that information. This mapping is populated with all the partition numbers and corresponding values.

_set_current_limits() None[source]

[Internal method] Auxiliary method for __contains__. Here it is determined what is the range of molecule IDs, as ints, that belong to the partition. The method sets two integer attributes, namely:

  • _lower_limit_current: the minimum molecule ID of the current partition, and

  • _higher_limit_current: the maximum molecule ID of the current partition; it can be None, meaning that there is no maximum (last partition).
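A membership test based on two such limits could look like the sketch below. The class name MoleculeIdRange is hypothetical, and treating the upper limit as inclusive is an assumption based on its description as the maximum molecule ID of the partition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoleculeIdRange:
    lower: int
    upper: Optional[int]  # None: no maximum (last partition)

    def __contains__(self, mol_id: int) -> bool:
        if mol_id < self.lower:
            return False
        # Assumption: the upper limit itself belongs to the range.
        return self.upper is None or mol_id <= self.upper


middle = MoleculeIdRange(lower=5, upper=9)
last = MoleculeIdRange(lower=10, upper=None)
```

With limits instead of an explicit set of IDs, a molecule ID that only appears in another alignment of the same raw BAM file can still be assigned to exactly one partition.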

property is_proper: bool

A proper partition is one that refers to a proper subset of the given BamFile. Since an empty set is not permitted by the SingleMoleculeAnalysisParameters class, an improper partition can only be a partition that refers to the whole BamFile.

pacbio_data_processing.utils.combine_scores(scores: collections.abc.Sequence[float]) float[source]

It computes the combined phred transformed score of the scores provided. Some examples:

>>> combine_scores([10])
10.0
>>> q = combine_scores([10, 12, 14])
>>> print(round(q, 6))
7.204355
>>> q = combine_scores([30, 20, 100, 92])
>>> print(round(q, 6))
19.590023
>>> q_500 = combine_scores([30, 20, 500])
>>> q_no_500 = combine_scores([30, 20])
>>> q_500 == q_no_500
True
>>> combine_scores([200, 300, 500])
200.0
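The first examples above agree with the following reading: each score q corresponds to an error probability p = 10^(-q/10), and the combined score is the Phred transform of the probability that at least one of them is an error, Q = -10 log10(1 - Π(1 - p_i)). The sketch below implements that formula under this assumption; the name combine_phred is hypothetical, it is not the package's code, and it may not reproduce the exact floating point results shown above for very large scores.

```python
import math
from collections.abc import Sequence


def combine_phred(scores: Sequence[float]) -> float:
    """Combine positive Phred scores (illustrative, not the package's code)."""
    # log of the probability that none is an error: sum of log(1 - p_i)
    log_all_ok = sum(math.log1p(-(10 ** (-q / 10))) for q in scores)
    # P(at least one error) = 1 - exp(log_all_ok), computed stably
    p_any = -math.expm1(log_all_ok)
    return -10 * math.log10(p_any)


q_three = combine_phred([10, 12, 14])
q_four = combine_phred([30, 20, 100, 92])
```

The combined score can never exceed the smallest input score, and very high scores (such as 500) contribute practically nothing to the result.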
pacbio_data_processing.utils.find_gatc_positions(seq: str, offset: int = 0) set[int][source]

Convenience function that computes the positions of all GATCs found in the given sequence. The values are relative to the offset.

>>> find_gatc_positions('AAAGAGAGATCGCGCGATC') == {7, 15}
True
>>> find_gatc_positions('AAAGAGAGTCGCGCCATC')
set()
>>> find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC') == {7, 12, 19}
True
>>> s = find_gatc_positions('AAAGAGAGATCGgaTcCGCGATC', offset=23)
>>> s == {30, 35, 42}
True
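The behaviour shown in these examples (0-based positions, case-insensitive search, additive offset) can be reproduced with a short sketch based on the standard re module. The function name gatc_positions is hypothetical and this is not necessarily how the package implements it.

```python
import re


def gatc_positions(seq: str, offset: int = 0) -> set[int]:
    """Return the 0-based start positions of every GATC, plus ``offset``."""
    # Upper-casing first makes the search case-insensitive. GATC cannot
    # overlap with itself, so non-overlapping matches find all of them.
    return {match.start() + offset for match in re.finditer("GATC", seq.upper())}


positions = gatc_positions("AAAGAGAGATCGgaTcCGCGATC")
shifted = gatc_positions("AAAGAGAGATCGgaTcCGCGATC", offset=23)
```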
pacbio_data_processing.utils.make_partition_prefix(partition: int, partitions: int) str[source]

Simple function to act as a Single Source of Truth for the partition prefix used elsewhere in the project. No validation is done. It just blindly returns a string constructed with the arguments.

pacbio_data_processing.utils.merge_files(infiles: list[pathlib.Path], outfile: pathlib.Path, keep_only_first_header=False) None[source]

Utility function that concatenates files, optionally handling one-line headers correctly: if the files have a (one-line) header, it must be declared at call time and then the function will only keep the header found in the first file. All other headers (first line of the remaining files) will be discarded.
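A minimal sketch of this kind of concatenation, using only the standard library, is shown below. The function name concatenate and the sample CSV contents are hypothetical; the parameters mirror the signature above.

```python
import tempfile
from pathlib import Path


def concatenate(
    infiles: list[Path], outfile: Path, keep_only_first_header: bool = False
) -> None:
    """Concatenate text files, optionally dropping repeated one-line headers."""
    with open(outfile, "w") as out:
        for index, infile in enumerate(infiles):
            with open(infile) as handle:
                if keep_only_first_header and index > 0:
                    handle.readline()  # skip the repeated header line
                for line in handle:
                    out.write(line)


with tempfile.TemporaryDirectory() as tmp:
    first, second, merged = (
        Path(tmp) / name for name in ("a.csv", "b.csv", "merged.csv")
    )
    first.write_text("id,score\n1,30\n")
    second.write_text("id,score\n2,25\n")
    concatenate([first, second], merged, keep_only_first_header=True)
    merged_text = merged.read_text()
```

With keep_only_first_header left at False, the second header line would be copied into the middle of the merged file.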

pacbio_data_processing.utils.pishift_back_positions_in_gff(gff_path: Union[str, pathlib.Path]) None[source]

Parses the input GFF file (assumed to be valid) and shifts back the positions found in it (the 4th and 5th columns of lines not starting with #). The positions in the input file (gff_path) are assumed to refer to a pi-shifted origin. To undo the shift, the length of each sequence is read from the GFF3 directives (lines starting with ##), in particular from the ##sequence-region pragmas. This function can handle multiple sequences.

Warning

The function overwrites the input gff_path.
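
The procedure can be sketched as follows. Several points are assumptions made for illustration: the sketch works on a list of lines in memory instead of overwriting a file, it takes the sequence length to be end - start + 1 from each ##sequence-region line, and it uses the unshift rule suggested by the shift_me_back examples in this module. All names are hypothetical.

```python
def _unshift(pos: int, nbp: int) -> int:
    # Assumed rule: undo a shift of nbp // 2 on 1-based positions.
    return (pos - 1 + nbp // 2) % nbp + 1


def shift_back_gff_lines_sketch(lines: list[str]) -> list[str]:
    """Return a copy of the GFF lines with columns 4 and 5 unshifted.

    The lines are expected without trailing newline characters.
    """
    lengths: dict[str, int] = {}
    result = []
    for line in lines:
        if line.startswith("##sequence-region"):
            # GFF3 pragma: ##sequence-region seqid start end
            _, seqid, start, end = line.split()[:4]
            lengths[seqid] = int(end) - int(start) + 1
            result.append(line)
        elif line.startswith("#") or not line.strip():
            # Other directives, comments and blank lines are left untouched.
            result.append(line)
        else:
            cols = line.split("\t")
            nbp = lengths[cols[0]]
            cols[3] = str(_unshift(int(cols[3]), nbp))
            cols[4] = str(_unshift(int(cols[4]), nbp))
            result.append("\t".join(cols))
    return result
```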

pacbio_data_processing.utils.shift_me_back(pos: int, nbp: int) -> int[source]

Unshifts a given position taking into account that it has been previously shifted by half of the number of base pairs. It takes into account the possibility of having a sequence with an odd length.

@params:

  • pos - 1-based position of a base pair to unshift

  • nbp - number of base pairs in the reference

@returns:

  • unshifted position

Some examples:

>>> shift_me_back(3, 10)
8
>>> shift_me_back(1, 20)
11
>>> shift_me_back(3, 7)
6
>>> shift_me_back(4, 7)
7
>>> shift_me_back(5, 7)
1
>>> shift_me_back(7, 7)
3
>>> shift_me_back(1, 7)
4

To understand how this function works, consider the following example. Given a sequence of 7 base pairs whose indices appear in the reference in their natural order, i.e.

1 2 3 4 5 6 7

then, after being pi-shifted, the base pairs in the sequence are reordered, and the indices become (former indices in parentheses):

1’(=4) 2’(=5) 3’(=6) 4’(=7) 5’(=1) 6’(=2) 7’(=3)

The current function accepts primed indices and transforms them to the unprimed indices, i.e. the positions returned refer to the original reference.
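
All the examples above are consistent with a shift of nbp // 2 positions applied with modular arithmetic on 1-based indices. A minimal sketch under that assumption (shift_me_back_sketch is a hypothetical name, not the package's code):

```python
def shift_me_back_sketch(pos: int, nbp: int) -> int:
    """Map a pi-shifted 1-based position back to the original reference."""
    half = nbp // 2  # integer half, so odd lengths are handled too
    # Work with 0-based indices for the modulo, then return to 1-based.
    return (pos - 1 + half) % nbp + 1
```

For the 7 base pair example, the primed indices 1 to 7 map to 4, 5, 6, 7, 1, 2, 3, which matches the table above.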

pacbio_data_processing.utils.try_computations_with_variants_until_done(func: Callable, variants: collections.abc.Sequence[str], *args: Any) -> None[source]

This function runs the passed-in function func with the arguments *args for each variant in variants, e.g. something like this:

for v in variants:
    result = func(*args, variant=v)

but it keeps doing so until every result returned by func is not None. Whenever func returns None, the function sleeps before continuing. The time slept depends on how many times it has slept before; the sleep time grows exponentially with every iteration:

t -> 2*t

until all the computations (results of func for each variant) are completed, i.e. none of the results is None.

The main application of this function is to ensure that some common operations of the SingleMoleculeAnalysis are done once and only once, irrespective of how many parallel instances of the analysis (each with a different partition) are carried out. For example, this function can be used to avoid collisions in the generation of aligned BAM files, since pacbio_data_processing.external.Blasr has a mechanism that allows concurrent computations. This function delegates the decision on whether the computation is done or not to func.

Note

A special case is when a variant is None, in that case the function func is called without the variant argument:

result = func(*args)

Therefore, if variants is, e.g. (None,), then func is only called once in each iteration WITHOUT the variant keyword argument. That is useful if the function func must be called until it is done, but it takes no variant argument.
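
The retry loop with exponentially growing sleep time could look roughly like the sketch below. Several details are assumptions, since the description above does not fix them: the initial sleep time, sleeping once per round rather than once per None result, and not calling func again for variants that have already completed. The name retry_variants_sketch and the initial_sleep and sleep parameters are hypothetical.

```python
import time
from collections.abc import Callable, Sequence
from typing import Any, Optional


def retry_variants_sketch(
    func: Callable,
    variants: Sequence[Optional[str]],
    *args: Any,
    initial_sleep: float = 1.0,              # assumed starting value
    sleep: Callable[[float], Any] = time.sleep,  # injectable for testing
) -> None:
    """Call func for every variant until no call returns None."""
    pending = list(variants)
    wait = initial_sleep
    while pending:
        still_pending = []
        for variant in pending:
            if variant is None:
                # Special case: func takes no variant argument.
                result = func(*args)
            else:
                result = func(*args, variant=variant)
            if result is None:
                still_pending.append(variant)
        pending = still_pending
        if pending:
            sleep(wait)
            wait *= 2  # t -> 2*t
```

Making the sleep callable a parameter is a design choice of this sketch: it lets a test record the requested waiting times instead of actually waiting.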

Module contents

Top-level package for PacBio data processing.