Classify¶
Classifies the newly identified ORFs into groups based on the most similar known ORF
Aligns the newly identified ORFs with reference sequences within these groups and builds a phylogenetic tree for each group.
Finds clusters of newly identified ORFs within these trees
Incorporates representative sequences from these clusters into a summary tree for each retroviral gene and genus (based on classification into gamma, beta, spuma, alpha, lenti, epsilon and delta retroviruses as defined by the ICTV (https://talk.ictvonline.org/taxonomy)).
makeGroupFastas¶
Input Files
grouped.dir/GENE_groups.tsv
ERVsearch/phylogenies/group_phylogenies/*fasta
ERVsearch/phylogenies/summary_phylogenies/*fasta
ERVsearch/phylogenies/outgroups.tsv
Output Files
group_fastas.dir/GENE_(.*_)GENUS.fasta
group_fastas.dir/GENE_(.*_)GENUS_A.fasta
Parameters
[paths] path_to_ERVsearch
Two sets of reference FASTA files are available (files are stored in ERVsearch/phylogenies/group_phylogenies and ERVsearch/phylogenies/summary_phylogenies):
group_phylogenies - groups of closely related ERVs for fine classification of sequences
summary_phylogenies - groups of most distant ERVs for broad classification of sequences
Each newly identified sequence has been assigned to a group according to its most similar sequence in the provided ERV database, as scored by the Exonerate ungapped algorithm. Where the most similar sequence is not part of a well-defined group, the new sequence has been assigned to a genus instead.
A FASTA file is generated for each group. Where a group has been assigned, the file contains all members of that group from the group_phylogenies file, plus an outgroup. Where only a genus has been assigned, it contains representative sequences from that genus, taken from the summary_phylogenies file. In both cases, all the newly identified ERVs in the group are added. These files are saved as GENE_(group_name_)GENUS.fasta.
A “~” is added to all new sequence names so they can be searched for easily.
The files are aligned using the MAFFT fftns algorithm (https://mafft.cbrc.jp/alignment/software/manual/manual.html) to generate the aligned output files, GENE_(group_name_)GENUS_A.fasta.
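The naming and marking steps above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the helper names are hypothetical, appending the “~” at the end of the name is an assumption (the text only says that one is added), and the command assumes the fftns wrapper script installed with MAFFT is on the PATH.

```python
from pathlib import Path

NEW_MARKER = "~"  # marker named in the text; its position in the name is an assumption


def group_fasta_name(gene, genus, group=None):
    """Build GENE_(group_name_)GENUS.fasta; the group part is optional."""
    parts = [gene] + ([group] if group else []) + [genus]
    return "_".join(parts) + ".fasta"


def mark_new(name):
    """Append the marker so new sequences can be found by searching for '~'."""
    return name + NEW_MARKER


def write_group_fasta(path, references, new_orfs):
    """Write reference records unchanged, then new ORFs with marked names.

    Both arguments are dicts mapping sequence name to sequence string.
    """
    with open(path, "w") as handle:
        for name, seq in references.items():
            handle.write(f">{name}\n{seq}\n")
        for name, seq in new_orfs.items():
            handle.write(f">{mark_new(name)}\n{seq}\n")


def mafft_command(fasta_path):
    """Return the fftns command and the path for the aligned output.

    fftns prints the alignment on stdout, so the caller redirects it
    into the returned *_A.fasta path.
    """
    fasta_path = Path(fasta_path)
    aligned = fasta_path.with_name(fasta_path.stem + "_A.fasta")
    return ["fftns", str(fasta_path)], aligned
```

The command could then be run with subprocess.run(cmd, stdout=handle, check=True), where handle is the aligned path opened for writing.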
makeGroupTrees¶
Input Files
group_fastas.dir/GENE_(.*_)GENUS_A.fasta
Output Files
group_trees.dir/GENE_(.*_)GENUS.tre
Parameters
None
Builds a phylogenetic tree, using the FastTree2 algorithm (http://www.microbesonline.org/fasttree) with the default settings plus the GTR model, for the aligned group FASTA files generated by the makeGroupFastas function.
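A tree-building call of this kind might look like the sketch below. It assumes the executable is named FastTree and is on the PATH, and that the alignments are nucleotide sequences (the GTR model applies to nucleotides only); the function names are hypothetical and the pipeline's exact command line may differ.

```python
import subprocess


def fasttree_command(alignment_path):
    """FastTree call for a nucleotide alignment using the GTR model."""
    # -nt: the input is a nucleotide alignment
    # -gtr: generalised time-reversible model instead of the default Jukes-Cantor
    return ["FastTree", "-nt", "-gtr", alignment_path]


def build_tree(alignment_path, tree_path):
    """Run FastTree and save the Newick tree it prints on stdout."""
    with open(tree_path, "w") as out:
        subprocess.run(fasttree_command(alignment_path), stdout=out, check=True)
```

For example, build_tree("group_fastas.dir/pol_gamma_A.fasta", "group_trees.dir/pol_gamma.tre") would write one tree for one aligned group file.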
drawGroupTrees¶
Input Files
group_trees.dir/GENE_(.*_)GENUS.tre
Output Files
group_trees.dir/GENE_(.*_)GENUS.FMT
(FMT = png, svg, pdf or jpg)
Parameters
[plots] gag_colour
[plots] pol_colour
[plots] env_colour
[trees] use_gene_colour
[trees] maincolour
[trees] highlightcolour
[trees] outgroupcolour
[trees] dpi
[trees] format
Generates an image file for each tree produced in the makeGroupTrees step, using ete3 (http://etetoolkit.org). Newly identified sequences are labelled with “~” and shown in a different colour.
By default, newly identified sequences are shown in the colours specified in plots_gag_colour, plots_pol_colour and plots_env_colour. For this, trees_use_gene_colour should be set to True in the pipeline.ini. Alternatively, a fixed colour can be used by setting trees_use_gene_colour to False and setting trees_highlightcolour. The text colour of the reference sequences (default black) can be set using trees_maincolour and that of the outgroup using trees_outgroupcolour.
The output file DPI can be specified using trees_dpi and the format (which can be png, svg, pdf or jpg) using trees_format.
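The colour choice described above can be illustrated with a short sketch. It assumes that a name such as plots_gag_colour refers to the gag_colour option in the [plots] section of pipeline.ini; the colour values, the dpi value and the helper name are placeholders, not the pipeline's defaults.

```python
import configparser

# Hypothetical pipeline.ini fragment; every value here is a placeholder.
EXAMPLE_INI = """
[plots]
gag_colour = orange
pol_colour = green
env_colour = purple

[trees]
use_gene_colour = True
maincolour = black
highlightcolour = crimson
outgroupcolour = blue
dpi = 300
format = png
"""


def highlight_colour(config, gene):
    """Colour for newly identified sequences of one gene (gag, pol or env)."""
    if config.getboolean("trees", "use_gene_colour"):
        # Gene-specific colour taken from the [plots] section
        return config.get("plots", f"{gene}_colour")
    # Single fixed colour for all new sequences
    return config.get("trees", "highlightcolour")


config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)
print(highlight_colour(config, "pol"))
```

Setting use_gene_colour to False in the fragment would make the same call return the highlightcolour value instead.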
makeSummaryFastas¶
Input Files
group_fastas.dir/GENE_(.*_)GENUS.fasta
group_trees.dir/GENE_(.*_)GENUS.tre
ERVsearch/phylogenies/summary_phylogenies/GENE_GENUS.fasta
ERVsearch/phylogenies/group_phylogenies/(.*)_GENUS_GENE.fasta
Output Files
summary_fastas.dir/GENE_GENUS.fasta
summary_fastas.dir/GENE_GENUS_A.fasta
Parameters
[paths] path_to_ERVsearch
Based on the group phylogenetic trees generated in makeGroupTrees, monophyletic groups of newly identified ERVs are identified. For each of these groups, a single sequence (the longest) is selected as representative. The representative sequences are combined with the FASTA files in ERVsearch/phylogenies/summary_phylogenies, which contain representative sequences for each retroviral gene and genus. These are extended to include further reference sequences from the same small group as the newly identified sequences.
For example, if one MLV-like pol and one HERVF-like pol were identified in the gamma genus, the gamma_pol.fasta summary fasta would contain:
* The new MLV-like pol sequence
* The new HERVF-like pol sequence
* The reference sequences from ERVsearch/phylogenies/group_phylogenies/MLV-like_gamma_pol.fasta: highly related sequences from the MLV-like group
* The reference sequences from ERVsearch/phylogenies/group_phylogenies/HERVF-like_gamma_pol.fasta: highly related sequences from the HERVF-like group
* The reference sequences from ERVsearch/phylogenies/summary_phylogenies/gamma_pol.fasta: a less detailed but more diverse set of gammaretroviral pol ORFs
* An epsilonretrovirus outgroup
This ensures sufficient detail in the groups of interest while avoiding excessive detail in groups where nothing new has been identified.
These FASTA files are saved as GENE_GENUS.fasta.
The files are aligned using the MAFFT fftns algorithm (https://mafft.cbrc.jp/alignment/software/manual/manual.html) to generate the aligned output files, GENE_GENUS_A.fasta.
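The selection of representatives from monophyletic groups can be sketched in plain Python. This is an illustration under stated assumptions rather than the pipeline's implementation: the tree is written as nested tuples and treated as rooted as written, new sequences are recognised by the “~” in their names, and the sequence names and lengths are invented for the example.

```python
def leaves(node):
    """Return all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def is_new(name):
    """Newly identified sequences carry the '~' marker in their name."""
    return "~" in name


def new_clades(node):
    """Return the largest clades that contain only newly identified sequences."""
    names = leaves(node)
    if all(is_new(name) for name in names):
        return [names]  # whole subtree is new: one monophyletic group
    if isinstance(node, str):
        return []  # a reference leaf
    return [clade for child in node for clade in new_clades(child)]


def representatives(tree, lengths):
    """Pick the longest sequence of each clade and record the clade size."""
    return [
        (max(clade, key=lambda name: lengths[name]), len(clade))
        for clade in new_clades(tree)
    ]


# Invented example: two new ORFs cluster together, a third stands alone.
tree = (("MLV_ref", ("orfA~", "orfB~")), ("orfC~", "HERVF_ref"), "outgroup")
lengths = {"orfA~": 900, "orfB~": 1200, "orfC~": 800}
print(representatives(tree, lengths))
```

Here orfB~ would stand in for the cluster of two sequences and orfC~ for itself; the recorded group sizes correspond to the counts shown on the summary tree labels.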
makeSummaryTrees¶
Input Files
summary_fastas.dir/GENE_GENUS_A.fasta
Output Files
summary_trees.dir/GENE_GENUS.tre
Parameters
None
Builds a phylogenetic tree, using the FastTree2 algorithm (http://www.microbesonline.org/fasttree) with the default settings plus the GTR model, for the aligned summary FASTA files generated by the makeSummaryFastas function.
drawSummaryTrees¶
Input Files
summary_trees.dir/GENE_GENUS.tre
Output Files
summary_trees.dir/GENE_GENUS.FMT
(FMT = png, svg, pdf or jpg)
Parameters
[plots] gag_colour
[plots] pol_colour
[plots] env_colour
[trees] use_gene_colour
[trees] maincolour
[trees] highlightcolour
[trees] outgroupcolour
[trees] dpi
[trees] format
Generates an image file for each tree produced in the makeSummaryTrees step, using ete3 (http://etetoolkit.org). Newly identified sequences are labelled with “~” and shown in a different colour. Monophyletic groups of newly identified ERVs have been collapsed (by choosing a single representative sequence); the number of sequences in the group is added to the label and represented by the size of the node tip.
By default, newly identified sequences are shown in the colours specified in plots_gag_colour, plots_pol_colour and plots_env_colour. For this, trees_use_gene_colour should be set to True in the pipeline.ini. Alternatively, a fixed colour can be used by setting trees_use_gene_colour to False and setting trees_highlightcolour. The text colour of the reference sequences (default black) can be set using trees_maincolour and that of the outgroup using trees_outgroupcolour.
The output file DPI can be specified using trees_dpi and the format (which can be png, svg, pdf or jpg) using trees_format.
Classify¶
Input Files None
Output Files None
Parameters None
Helper function to run all screening functions and classification functions (all functions prior to this point).