Classify

  • Classifies the newly identified ORFs into groups based on the most similar known ORF

  • Aligns the newly identified ORFs with reference sequences within these groups and builds a phylogenetic tree for each group.

  • Finds clusters of newly identified ORFs within these trees

  • Incorporates representative sequences from these clusters into a summary tree for each retroviral gene and genus (based on classification into gamma, beta, spuma, alpha, lenti, epsilon and delta retroviruses, as defined by the ICTV, https://talk.ictvonline.org/taxonomy).

  1. makeGroupFastas

  2. makeGroupTrees

  3. drawGroupTrees

  4. makeSummaryFastas

  5. makeSummaryTrees

  6. drawSummaryTrees

  7. summariseClassify

  8. Classify

makeGroupFastas

Input Files
grouped.dir/GENE_groups.tsv
ERVsearch/phylogenies/group_phylogenies/*fasta
ERVsearch/phylogenies/summary_phylogenies/*fasta
ERVsearch/phylogenies/outgroups.tsv

Output Files
group_fastas.dir/GENE_(.*_)GENUS.fasta
group_fastas.dir/GENE_(.*_)GENUS_A.fasta

Parameters
[paths] path_to_ERVsearch

Two sets of reference FASTA files are available, stored in ERVsearch/phylogenies/group_phylogenies and ERVsearch/phylogenies/summary_phylogenies:

  • group_phylogenies - groups of closely related ERVs for fine classification of sequences

  • summary_phylogenies - groups of more distantly related ERVs for broad classification of sequences

Sequences have been assigned to groups based on the most similar sequence in the provided ERV database, as measured by the score from the Exonerate ungapped algorithm. Where the most similar sequence is not part of a well-defined group, the new sequence has been assigned to a genus instead.
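The assignment rule described above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name assign_group and the three dictionaries are hypothetical, and the scores are made-up values standing in for Exonerate ungapped scores.

```python
def assign_group(scores, ref_group, ref_genus):
    """Return (group, genus) for one newly identified ORF.

    scores    -- dict: reference name -> alignment score (hypothetical values)
    ref_group -- dict: reference name -> group, only for references that
                 belong to a well-defined group
    ref_genus -- dict: reference name -> genus, for every reference
    """
    # The most similar reference is the one with the highest score.
    best = max(scores, key=scores.get)
    # group is None when the best hit is not part of a well-defined group,
    # in which case only the genus is used downstream.
    return ref_group.get(best), ref_genus[best]


ref_group = {"MLV": "MLV-like"}
ref_genus = {"MLV": "gamma", "ungrouped_gamma_ref": "gamma"}

print(assign_group({"MLV": 310, "ungrouped_gamma_ref": 120}, ref_group, ref_genus))
print(assign_group({"MLV": 90, "ungrouped_gamma_ref": 200}, ref_group, ref_genus))
```

The second call falls back to a genus-only assignment because its best hit has no entry in ref_group.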

A FASTA file is generated for each group. Where possible, it contains all members of the group from the group_phylogenies file (plus an outgroup). Where only a genus has been assigned, it contains representative sequences from the same genus, taken from the summary_phylogenies file. In both cases, all the newly identified ERVs in the group are added. These files are saved as GENE_(group_name_)GENUS.fasta.

A “~” is added to all new sequence names so they can be searched for easily.
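As a rough illustration of this marking step, the sketch below adds the marker to every header line of a FASTA string. The function name mark_new_sequences is hypothetical, and appending the marker to the end of the name is an assumption: the text does not say where in the name the "~" is placed.

```python
def mark_new_sequences(fasta_text, marker="~"):
    """Add a marker to every sequence name in a FASTA-formatted string.

    Assumption: the marker is appended to the end of the name.
    """
    marked = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header line: tag the sequence name so it is easy to search for.
            line = line.rstrip() + marker
        marked.append(line)
    return "\n".join(marked) + "\n"


print(mark_new_sequences(">orf1\nATG\n>orf2\nCCC\n"))
```

Only the newly identified sequences would be passed through a function like this, so the reference sequences keep their original names.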

The files are aligned using the MAFFT fftns algorithm (https://mafft.cbrc.jp/alignment/software/manual/manual.html) to generate the GENE_(group_name_)GENUS_A.fasta aligned output files.
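The alignment step might look something like the sketch below. It assumes the FFT-NS-2 strategy, which the MAFFT manual gives as "mafft --retree 2 --maxiterate 0"; whether the pipeline calls MAFFT this way or through the fftns wrapper script is not stated here. The helper names and the example file name are hypothetical.

```python
import subprocess


def fftns_command(fasta_path, mafft="mafft"):
    """Build a MAFFT command line for the FFT-NS-2 strategy (assumed)."""
    return [mafft, "--retree", "2", "--maxiterate", "0", fasta_path]


def aligned_name(fasta_path):
    """Map GENE_(group_name_)GENUS.fasta to GENE_(group_name_)GENUS_A.fasta."""
    if not fasta_path.endswith(".fasta"):
        raise ValueError("expected a .fasta file: %s" % fasta_path)
    return fasta_path[:-len(".fasta")] + "_A.fasta"


def align(fasta_path):
    """Run MAFFT and write the alignment next to the input file."""
    out_path = aligned_name(fasta_path)
    # MAFFT writes the alignment to stdout, so redirect it to the output file.
    with open(out_path, "w") as out:
        subprocess.run(fftns_command(fasta_path), stdout=out, check=True)
    return out_path


print(fftns_command("group_fastas.dir/pol_MLV-like_gamma.fasta"))
print(aligned_name("group_fastas.dir/pol_MLV-like_gamma.fasta"))
```

Calling align() requires MAFFT to be installed and on the PATH; the two print statements only show the command and the output name.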

makeGroupTrees

Input Files
group_fastas.dir/GENE_(.*_)GENUS_A.fasta

Output Files
group_trees.dir/GENE_(.*_)GENUS.tre

Parameters
None

Builds a phylogenetic tree, using the FastTree2 algorithm (http://www.microbesonline.org/fasttree) with the default settings plus the GTR model, for the aligned group FASTA files generated by the makeGroupFastas function.
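A tree-building call of this kind could be sketched as below. FastTree's -nt flag (nucleotide input) and -gtr flag (GTR model) are standard options; the executable name, the helper names and the example file name are assumptions, and the pipeline's actual invocation may differ.

```python
import subprocess


def fasttree_command(aligned_fasta, fasttree="FastTree"):
    """Build a FastTree command: default settings plus the GTR model."""
    # GTR is a nucleotide model, so the alignment is passed with -nt.
    return [fasttree, "-nt", "-gtr", aligned_fasta]


def tree_path(aligned_fasta, out_dir="group_trees.dir"):
    """Map .../GENE_(group_)GENUS_A.fasta to out_dir/GENE_(group_)GENUS.tre."""
    name = aligned_fasta.split("/")[-1]
    if not name.endswith("_A.fasta"):
        raise ValueError("expected an aligned *_A.fasta file: %s" % name)
    return out_dir + "/" + name[:-len("_A.fasta")] + ".tre"


def build_tree(aligned_fasta):
    """Run FastTree and save the Newick tree it prints to stdout."""
    out_path = tree_path(aligned_fasta)
    with open(out_path, "w") as out:
        subprocess.run(fasttree_command(aligned_fasta), stdout=out, check=True)
    return out_path


print(fasttree_command("group_fastas.dir/pol_MLV-like_gamma_A.fasta"))
print(tree_path("group_fastas.dir/pol_MLV-like_gamma_A.fasta"))
```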

drawGroupTrees

Input Files
group_trees.dir/GENE_(.*_)GENUS.tre

Output Files
group_trees.dir/GENE_(.*_)GENUS.FMT (FMT = png, svg, pdf or jpg)

Parameters
[plots] gag_colour
[plots] pol_colour
[plots] env_colour
[trees] use_gene_colour
[trees] maincolour
[trees] highlightcolour
[trees] outgroupcolour
[trees] dpi
[trees] format

Generates an image file for each tree file generated in the makeGroupTrees step, using ete3 (http://etetoolkit.org). Newly identified sequences are labelled with “~” and shown in a different colour.

By default, newly identified sequences are shown in the colours specified in plots_gag_colour, plots_pol_colour and plots_env_colour; for this, trees_use_gene_colour should be set to True in the pipeline.ini. Alternatively, a fixed colour can be used by setting trees_use_gene_colour to False and setting trees_highlightcolour. The text colour of the reference sequences (default black) can be set using trees_maincolour and that of the outgroup using trees_outgroupcolour.

The output file DPI can be specified using trees_dpi and the format (which can be png, svg, pdf or jpg) using trees_format.
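The way these settings interact can be illustrated with a short sketch that reads a pipeline.ini-style fragment and picks the highlight colour for one gene. The section and option names follow the parameter list above; the colour values, dpi and format shown are placeholders rather than the pipeline's defaults (apart from black for maincolour), and the function name highlight_colour is hypothetical.

```python
import configparser

# Placeholder values for illustration only.
EXAMPLE_INI = """
[plots]
gag_colour = orange
pol_colour = steelblue
env_colour = purple

[trees]
use_gene_colour = True
maincolour = black
highlightcolour = crimson
outgroupcolour = blue
dpi = 300
format = png
"""


def highlight_colour(config, gene):
    """Return the colour used for newly identified sequences of one gene."""
    if config.getboolean("trees", "use_gene_colour"):
        # Gene-specific colour taken from the [plots] section.
        return config.get("plots", gene + "_colour")
    # Otherwise one fixed colour is used for every gene.
    return config.get("trees", "highlightcolour")


config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

print(highlight_colour(config, "pol"))
config.set("trees", "use_gene_colour", "False")
print(highlight_colour(config, "pol"))
```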

makeSummaryFastas

Input Files
group_fastas.dir/GENE_(.*_)GENUS.fasta
group_trees.dir/GENE_(.*_)GENUS.tre
ERVsearch/phylogenies/summary_phylogenies/GENE_GENUS.fasta
ERVsearch/phylogenies/group_phylogenies/(.*)_GENUS_GENE.fasta

Output Files
summary_fastas.dir/GENE_GENUS.fasta
summary_fastas.dir/GENE_GENUS_A.fasta

Parameters
[paths] path_to_ERVsearch

Monophyletic groups of newly identified ERVs are identified in the group phylogenetic trees generated by makeGroupTrees. For each of these groups, a single sequence (the longest) is selected as a representative. The representative sequences are combined with the FASTA files in ERVsearch/phylogenies/summary_phylogenies, which contain representative sequences for each retroviral gene and genus. These files are then extended to include further reference sequences from the same small group as the newly identified sequences.

For example, if one MLV-like pol and one HERVF-like pol were identified in the gamma genus, the gamma_pol.fasta summary fasta would contain:

  • The new MLV-like pol sequence

  • The new HERVF-like pol sequence

  • The reference sequences from ERVsearch/phylogenies/group_phylogenies/MLV-like_gamma_pol.fasta - highly related sequences from the MLV-like group

  • The reference sequences from ERVsearch/phylogenies/group_phylogenies/HERVF-like_gamma_pol.fasta - highly related sequences from the HERVF-like group

  • The reference sequences from ERVsearch/phylogenies/summary_phylogenies/gamma_pol.fasta - a less detailed but more diverse set of gammaretroviral pol ORFs

  • An epsilonretrovirus outgroup

This ensures sufficient detail in the groups of interest while avoiding excessive detail in groups where nothing new has been identified.

These FASTA files are saved as GENE_GENUS.fasta

The files are aligned using the MAFFT fftns algorithm (https://mafft.cbrc.jp/alignment/software/manual/manual.html) to generate the GENE_GENUS_A.fasta aligned output files.
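The selection of representative sequences described at the start of this section can be sketched without any tree library by writing a rooted tree as nested tuples. This is an illustration under stated assumptions, not the pipeline's implementation: the tuple representation, the function names and the sequence lengths are hypothetical, and new sequences are recognised simply by the "~" in their names.

```python
def leaves(node):
    """Return all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def new_clades(node, marker="~"):
    """Return the largest clades that contain only newly identified sequences."""
    names = leaves(node)
    if all(marker in name for name in names):
        # Every leaf below this node is new: a monophyletic group.
        return [names]
    if isinstance(node, str):
        return []  # a single reference sequence
    clades = []
    for child in node:
        clades.extend(new_clades(child, marker))
    return clades


def representatives(tree, lengths):
    """Pick the longest sequence of each clade and record the clade size."""
    return [(max(clade, key=lambda name: lengths[name]), len(clade))
            for clade in new_clades(tree)]


tree = ((("a~", "b~"), "ref1"), ("c~", ("ref2", "ref3")))
lengths = {"a~": 900, "b~": 1200, "c~": 600}
print(representatives(tree, lengths))
```

The clade size returned alongside each representative corresponds to the sequence count that is later shown on the summary tree.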

makeSummaryTrees

Input Files
summary_fastas.dir/GENE_GENUS_A.fasta

Output Files
summary_trees.dir/GENE_GENUS.tre

Parameters
None

Builds a phylogenetic tree, using the FastTree2 algorithm (http://www.microbesonline.org/fasttree) with the default settings plus the GTR model, for the aligned summary FASTA files generated by the makeSummaryFastas function.

drawSummaryTrees

Input Files
summary_trees.dir/GENE_GENUS.tre

Output Files
summary_trees.dir/GENE_GENUS.FMT (FMT = png, svg, pdf or jpg)

Parameters
[plots] gag_colour
[plots] pol_colour
[plots] env_colour
[trees] use_gene_colour
[trees] maincolour
[trees] highlightcolour
[trees] outgroupcolour
[trees] dpi
[trees] format

Generates an image file for each tree file generated in the makeSummaryTrees step, using ete3 (http://etetoolkit.org). Newly identified sequences are labelled with “~” and shown in a different colour. Monophyletic groups of newly identified ERVs have been collapsed (by choosing a single representative sequence); the number of sequences in each group is added to the label and represented by the size of the node tip.

By default, newly identified sequences are shown in the colours specified in plots_gag_colour, plots_pol_colour and plots_env_colour; for this, trees_use_gene_colour should be set to True in the pipeline.ini. Alternatively, a fixed colour can be used by setting trees_use_gene_colour to False and setting trees_highlightcolour. The text colour of the reference sequences (default black) can be set using trees_maincolour and that of the outgroup using trees_outgroupcolour.

The output file DPI can be specified using trees_dpi and the format (which can be png, svg, pdf or jpg) using trees_format.

summariseClassify

Input Files

Output Files

Parameters

Classify

Input Files None

Output Files None

Parameters None

Helper function to run all screening functions and classification functions (all functions prior to this point).