Screen

  • Screens the genome for ERV-like regions by comparing the genome to a set of known retroviral ORFs using Exonerate.

  • Confirms the Exonerate regions using UBLAST

  • Finds and confirms ORFs within these regions

  • Finds the most similar known retroviral ORF in the database to each of the newly identified ORFs

Functions

  1. initiate

  2. genomeToChroms

  3. prepDBs

  4. runExonerate

  5. cleanExonerate

  6. mergeOverlaps

  7. makeFastas

  8. renameFastas

  9. makeUBLASTDb

  10. runUBLASTCheck

  11. classifyWithExonerate

  12. getORFs

  13. checkORFsUBLAST

  14. assignGroups

  15. summariseScreen

  16. Screen

initiate

Input Files
pipeline.ini

Output Files
init.txt

Parameters
[genome] file
[paths] path_to_ERVsearch
[paths] path_to_usearch
[paths] path_to_exonerate

Initialises the pipeline and checks that the required parameters in the pipeline.ini are set and valid and that the required software is in your $PATH.

Checks that:

  • The input genome file exists.

  • The correct path to ERVsearch is provided.

  • samtools, bedtools, FastTree and mafft are in the $PATH

  • The correct paths to usearch and exonerate are provided.

init.txt is a placeholder to show that this step has been completed.
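A minimal sketch of the kind of $PATH check listed above, using only the Python standard library. The function name find_missing_tools is hypothetical and is not taken from ERVsearch; the pipeline's own check may work differently.

```python
import shutil

# Tools the list above says must be available in the $PATH.
REQUIRED_TOOLS = ["samtools", "bedtools", "FastTree", "mafft"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found in the $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found in $PATH: " + ", ".join(missing))
```

Returning the full list of missing tools, rather than stopping at the first, lets a user fix every problem in one pass.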

genomeToChroms

Input Files
[genome] file
keep_chroms.txt

Output Files
host_chromosomes.dir/*fasta

Parameters
[genome] file
[genomesplits] split
[genomesplits] split_n
[genomesplits] force

Splits the host genome provided by the user into FASTA files of a suitable size to run Exonerate efficiently.

If genomesplits_split in the pipeline.ini is False, the genome is split into one fasta file for each sequence - each chromosome, scaffold or contig.

If genomesplits_split in the pipeline.ini is True, the genome is split into the number of batches specified by the genomesplits_split_n parameter, unless the total number of sequences in the input file is less than this number.

The pipeline will fail if the genomesplits settings would result in more than 500 Exonerate runs. It is possible to force the pipeline to run despite this by setting genomesplits_force to True.

If the file keep_chroms.txt exists in the working directory only chromosomes listed in this file will be kept.

If the input fasta file is zipped or gzipped, an unzipped copy is created; if it is already unzipped, a link to the file is created instead. In either case this is named genome.fa and is placed in the working directory.

This function generates a series of fasta files which are stored in the host_chromosomes.dir directory.
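The splitting rules above can be sketched as follows. This is an illustration only: the function name split_into_batches, the round-robin assignment of sequences to batches and the assumption that each batch counts as one run towards the limit of 500 are not taken from the source.

```python
def split_into_batches(seq_names, split, split_n, force=False, max_runs=500):
    """Group sequence names into batches following the genomesplits settings."""
    if split:
        # Never make more batches than there are sequences.
        n_batches = max(1, min(split_n, len(seq_names)))
        # Round-robin assignment (an assumption, see the text above).
        batches = [seq_names[i::n_batches] for i in range(n_batches)]
        batches = [batch for batch in batches if batch]
    else:
        # One batch per chromosome, scaffold or contig.
        batches = [[name] for name in seq_names]
    if len(batches) > max_runs and not force:
        raise ValueError(
            "%d batches exceeds the limit of %d; set genomesplits_force "
            "to True to run anyway" % (len(batches), max_runs))
    return batches
```

For example, split_into_batches(["chr1", "chr2", "chr3"], split=True, split_n=2) gives two batches, while split=False gives three.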

prepDBs

Input Files
None

Output Files
gene_databases.dir/GENE.fasta

Parameters
[database] use_custom_db
[database] gag
[database] pol
[database] env

Retrieves the gag, pol and env amino acid sequence database fasta files and puts a copy of each in the gene_databases.dir directory.

If custom databases are used they are retrieved and named gag.fasta, pol.fasta and env.fasta, so the path doesn’t need to be changed every time.

runExonerate

Input Files
gene_databases.dir/GENE.fasta
host_chromosomes.dir/*fasta

Output Files
raw_exonerate_output.dir/GENE_*.tsv

Parameters
[paths] path_to_exonerate

Runs the protein2dna algorithm in the Exonerate software package with the host chromosomes (or other regions) in host_chromosomes.dir as target sequences and the FASTA files from prepDBs as the query sequences.

The raw output of Exonerate is stored in the raw_exonerate_output.dir directory. One file is created for each combination of query and target sequences.

This step is carried out with low stringency as results are later filtered using UBLAST and Exonerate.
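As a rough illustration, an Exonerate protein2dna command for one query and target pair could be assembled like this. The helper name exonerate_command is hypothetical, the sketch assumes path_to_exonerate points at the exonerate executable itself, and the output options shown are an assumption: the options ERVsearch actually passes are not listed in this section.

```python
def exonerate_command(exonerate_path, query_fasta, target_fasta):
    """Build an argument list for one protein2dna Exonerate run."""
    return [
        exonerate_path,
        "--model", "protein2dna",   # protein query against DNA target
        "--query", query_fasta,     # e.g. gene_databases.dir/gag.fasta
        "--target", target_fasta,   # a file from host_chromosomes.dir
        "--showalignment", "no",    # assumed: skip human-readable alignments
        "--showvulgar", "yes",      # assumed: keep the one-line summary
    ]


cmd = exonerate_command("exonerate", "gene_databases.dir/gag.fasta",
                        "host_chromosomes.dir/chr1.fasta")
print(" ".join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with unusual file names.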

cleanExonerate

Input Files
raw_exonerate_output.dir/GENE_*.tsv

Output Files
clean_exonerate_output.dir/GENE_*_unfiltered.tsv
clean_exonerate_output.dir/GENE_*_filtered.tsv
clean_exonerate_output.dir/GENE_*.bed

Parameters
[exonerate] min_hit_length

Filters and cleans up the Exonerate output.

  • Converts the raw Exonerate output files into dataframes - GENE_unfiltered.tsv

  • Filters out any regions containing introns (as defined by Exonerate)

  • Filters out regions less than exonerate_min_hit_length on the host sequence (in nucleotides).

  • Outputs the filtered regions to GENE_filtered.tsv

  • Converts this to bed format and outputs this to GENE.bed
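The filtering steps above can be sketched in plain Python. The dictionary keys (chrom, start, end, has_intron) and the function names are hypothetical, chosen for illustration; the pipeline itself works on dataframes whose column names are not given here.

```python
def filter_exonerate_hits(hits, min_hit_length):
    """Drop hits that contain introns or are too short on the host sequence."""
    kept = []
    for hit in hits:
        length_on_host = hit["end"] - hit["start"]  # in nucleotides
        if hit["has_intron"] or length_on_host < min_hit_length:
            continue
        kept.append(hit)
    return kept


def to_bed_lines(hits):
    """Format hits as minimal three-column, tab-separated BED lines."""
    return ["\t".join([hit["chrom"], str(hit["start"]), str(hit["end"])])
            for hit in hits]
```

A hit from position 100 to 400 with no intron passes a min_hit_length of 100, while a hit of 50 nucleotides, or any hit flagged as containing an intron, is removed.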

mergeOverlaps

Input Files
clean_exonerate_output.dir/GENE_*.bed

Output Files
gene_bed_files.dir/GENE_all.bed
gene_bed_files.dir/GENE_merged.bed

Parameters
[exonerate] overlap

Merges the output bed files for individual sections of the input genome into a single bed file.

Overlapping or closely spaced regions of the genome detected by Exonerate with similarity to the same retroviral gene are then merged into single regions. This is performed using bedtools merge on the bed files output by cleanExonerate.

If there is a gap of less than exonerate_overlap between the regions they will be merged.
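The merging rule can be illustrated with a small pure-Python stand-in for bedtools merge. This is not the pipeline's code; merge_regions is a hypothetical name, and the strict "less than" comparison follows the wording above, which may differ from bedtools at the exact boundary.

```python
def merge_regions(regions, max_gap):
    """Merge (chrom, start, end) regions that overlap or lie within max_gap."""
    merged = []
    for chrom, start, end in sorted(regions):
        previous = merged[-1] if merged else None
        if previous and previous[0] == chrom and start - previous[2] < max_gap:
            # Overlapping or close enough: extend the previous region.
            previous[2] = max(previous[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 320, 400),
           ("chr1", 1000, 1100), ("chr2", 10, 50)]
print(merge_regions(regions, max_gap=30))
# [('chr1', 100, 400), ('chr1', 1000, 1100), ('chr2', 10, 50)]
```

Sorting first means each region only needs to be compared with the most recently merged one.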

makeFastas

Input Files
gene_bed_files.dir/GENE_merged.bed
genome.fa

Output Files
gene_fasta_files.dir/GENE_merged.fasta

Parameters
None

Fasta files are generated containing the sequences of the merged regions of the genome identified using mergeOverlaps. These are extracted from the host chromosomes using bedtools getfasta.
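To show what this extraction produces, here is a pure-Python sketch that mimics the effect of bedtools getfasta on an in-memory genome. The function name extract_regions and the dictionary-based genome are assumptions for illustration; the pipeline calls bedtools rather than code like this.

```python
def extract_regions(genome, bed_rows):
    """Return (name, sequence) pairs for BED regions.

    genome maps sequence names to sequences. BED coordinates are
    0-based and half-open, so they map directly onto Python slices.
    """
    records = []
    for chrom, start, end in bed_rows:
        name = "%s:%d-%d" % (chrom, start, end)  # bedtools-style header
        records.append((name, genome[chrom][start:end]))
    return records


genome = {"chr1": "ACGTACGTAC"}
print(extract_regions(genome, [("chr1", 2, 6)]))
# [('chr1:2-6', 'GTAC')]
```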

renameFastas

Input Files
gene_fasta_files.dir/GENE_merged.fasta

Output Files
gene_fasta_files.dir/GENE_merged_renamed.fasta

Parameters
None

Renames the sequences in the fasta files of ERV-like regions identified with Exonerate so each record has a numbered unique ID (gag1, gag2 etc). Also removes “:” from sequence names as this causes problems later.
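The numbering scheme can be sketched as below. rename_records is a hypothetical helper, and this sketch simply replaces the whole name with the numbered ID, so no ":" remains; how the pipeline itself handles the original name is not shown in this section.

```python
def rename_records(records, gene):
    """Give each (name, sequence) record a numbered ID such as gag1, gag2."""
    renamed = []
    for number, (_old_name, seq) in enumerate(records, start=1):
        renamed.append(("%s%d" % (gene, number), seq))
    return renamed


records = [("chr1:5-20", "ACGT"), ("chr2:1-9", "GGCC")]
print(rename_records(records, "gag"))
# [('gag1', 'ACGT'), ('gag2', 'GGCC')]
```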

makeUBLASTDb

Input Files
gene_databases.dir/GENE.fasta

Output Files
UBLAST_db.dir/GENE_db.udb

Parameters
[paths] path_to_usearch

USEARCH requires an indexed database of query sequences to run. This function generates this database for the three gene amino acid fasta files used to screen the genome.

runUBLASTCheck

Input Files
UBLAST_db.dir/GENE_db.udb
gene_fasta_files.dir/GENE_merged.fasta

Output Files
ublast.dir/GENE_UBLAST_alignments.txt
ublast.dir/GENE_UBLAST.tsv
ublast.dir/GENE_filtered_UBLAST.fasta

Parameters
[paths] path_to_usearch
[usearch] min_id
[usearch] min_hit_length
[usearch] min_coverage

ERV regions in the fasta files generated by makeFastas are compared to the ERV amino acid database files for a second time, this time using USEARCH (https://www.drive5.com/usearch/). Using both Exonerate and USEARCH reduces the number of false positives.

This allows sequences with low similarity to known ERVs to be filtered out. Similarity thresholds can be set in the pipeline.ini file: usearch_min_id (minimum identity between query and target), usearch_min_hit_length (minimum length of hit on target sequence) and usearch_min_coverage (minimum proportion of the query sequence the hit should cover).

The raw output of running UBLAST against the target sequences is saved in GENE_UBLAST_alignments.txt (equivalent to the BLAST default output) and GENE_UBLAST.tsv (equivalent to the BLAST -outfmt 6 tabular output). This output is already filtered by passing the appropriate parameters to UBLAST. The regions which passed the filtering, and are therefore in these output files, are then output to a FASTA file, GENE_filtered_UBLAST.fasta.
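In the pipeline the three thresholds are applied by UBLAST itself. Purely to illustrate what they mean, the sketch below applies the same kind of test to already parsed hits. The dictionary keys, the function names, the use of fractions between 0 and 1 for identity and coverage, and the inclusive comparisons are all assumptions.

```python
def passes_thresholds(hit, min_id, min_hit_length, min_coverage):
    """Check one hit against the three usearch_* thresholds."""
    return (hit["identity"] >= min_id
            and hit["hit_length"] >= min_hit_length
            and hit["query_coverage"] >= min_coverage)


def filter_ublast_hits(hits, min_id, min_hit_length, min_coverage):
    """Keep only the hits that meet every threshold."""
    return [hit for hit in hits
            if passes_thresholds(hit, min_id, min_hit_length, min_coverage)]
```

A hit must meet all three thresholds to be kept; failing any one of them removes it.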

classifyWithExonerate

Input Files
ublast.dir/GENE_filtered_UBLAST.fasta
ERVsearch/ERV_db/all_ERVs_nt.fasta

Output Files
exonerate_classification.dir/GENE_all_matches_exonerate.tsv
exonerate_classification.dir/GENE_best_matches_exonerate.tsv
exonerate_classification.dir/GENE_refiltered_matches_exonerate.fasta

Parameters
[paths] path_to_exonerate
[exonerate] min_score

Runs the Exonerate ungapped algorithm with each ERV region in the fasta files generated by makeFastas as queries and the all_ERVs_nt.fasta fasta file as a target, to detect which known retrovirus is most similar to each newly identified ERV region. Regions which don’t meet a minimum score threshold (exonerate_min_score) are filtered out.

all_ERVs_nt.fasta contains nucleic acid sequences for many known endogenous and exogenous retroviruses with known classifications.

First all sequences are compared to the database and the raw output is saved as exonerate_classification.dir/GENE_all_matches_exonerate.tsv. Results need a score greater than exonerate_min_score against one of the genes of the same type (gag, pol or env) in the database. The highest scoring result which meets these criteria for each sequence is then identified and output to exonerate_classification.dir/GENE_best_matches_exonerate.tsv. The sequences which meet these criteria are also output to a FASTA file exonerate_classification.dir/GENE_refiltered_matches_exonerate.fasta.
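The selection of the best match can be sketched as follows. The tuple layout (query, query gene, target, target gene, score) and the name best_matches are hypothetical; they stand in for whatever columns the Exonerate output table actually has.

```python
def best_matches(matches, min_score):
    """Pick the highest scoring same-gene match above min_score per query.

    matches is an iterable of
    (query, query_gene, target, target_gene, score) tuples.
    Returns a dict mapping each query to its (target, score).
    """
    best = {}
    for query, query_gene, target, target_gene, score in matches:
        if query_gene != target_gene or score <= min_score:
            continue  # wrong gene type, or score not above the threshold
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return best
```

Queries with no qualifying match are simply absent from the result, which corresponds to them being filtered out.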

getORFs

Input Files
exonerate_classification.dir/GENE_refiltered_matches_exonerate.fasta

Output Files
ORFs.dir/GENE_orfs_raw.fasta
ORFs.dir/GENE_orfs_nt.fasta
ORFs.dir/GENE_orfs_aa.fasta

Parameters
[orfs] translation_table
[orfs] min_orf_len

Finds the longest open reading frame in each of the ERV regions in the filtered output table.

This analysis is performed using EMBOSS revseq and EMBOSS transeq.

The sequence is translated in all six frames using the user specified translation table. The longest ORF is then identified. ORFs shorter than orfs_min_orf_len are filtered out.

The positions of the ORFs are also converted so that they can be extracted directly from the input sequence file, rather than using the co-ordinates relative to the original Exonerate regions.

The raw transeq output, the nucleotide sequences of the ORFs and the amino acid sequences of the ORFs are written to the output FASTA files.
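The six-frame search can be illustrated with a pure-Python stand-in for the EMBOSS tools. This sketch is not the pipeline's code. It assumes the standard genetic code (translation table 1), treats an ORF as the longest stretch between stop codons, and measures min_orf_len in amino acids; all three are assumptions, as is the function name longest_orf.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
# Standard genetic code, codons in TCAG order.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf(seq, min_orf_len=0):
    """Return (frame, aa_start, protein) for the longest stop-free stretch.

    frame is 1..3 on the forward strand and -1..-3 on the reverse strand.
    Returns None if the longest stretch is shorter than min_orf_len.
    """
    seq = seq.upper()
    best = None
    for sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            position = 0
            for piece in translate(strand[offset:]).split("*"):
                if best is None or len(piece) > len(best[2]):
                    best = (sign * (offset + 1), position, piece)
                position += len(piece) + 1  # skip the stop codon
    if len(best[2]) < min_orf_len:
        return None
    return best


print(longest_orf("ATGTTAGCCAAA"))
# (1, 0, 'MLAK')
```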

checkORFsUBLAST

Input Files
ORFs.dir/GENE_orfs_nt.fasta
UBLAST_db.dir/GENE_db.udb

Output Files
ublast_orfs.dir/GENE_UBLAST_alignments.txt
ublast_orfs.dir/GENE_UBLAST.tsv
ublast_orfs.dir/GENE_filtered_UBLAST.fasta

Parameters
[paths] path_to_usearch
[usearch] min_id
[usearch] min_hit_length
[usearch] min_coverage

ERV ORFs in the fasta files generated by the getORFs function are compared to the original ERV amino acid files using UBLAST. This allows any remaining ORFs with low similarity to known ERVs to be filtered out.

Similarity thresholds can be set in the pipeline.ini file: usearch_min_id (minimum identity between query and target), usearch_min_hit_length (minimum length of hit on target sequence) and usearch_min_coverage (minimum proportion of the query sequence the hit should cover).

The raw output of running UBLAST against the target sequences is saved in GENE_UBLAST_alignments.txt (equivalent to the BLAST default output) and GENE_UBLAST.tsv (equivalent to the BLAST -outfmt 6 tabular output). This output is already filtered by passing the appropriate parameters to UBLAST. The regions which passed the filtering, and are therefore in these output files, are then output to a FASTA file, GENE_filtered_UBLAST.fasta.

assignGroups

Input Files
ublast_orfs.dir/GENE_UBLAST.tsv
ERVsearch/ERV_db/convert.tsv

Output Files
grouped.dir/GENE_groups.tsv

Parameters
[paths] path_to_ERVsearch

Many of the retroviruses in the input database all_ERVs_nt.fasta have been classified into groups based on sequence similarity, prior knowledge and phylogenetic clustering. Some sequences don’t fall into any well defined group. In these cases they are just assigned to a genus, usually based on prior knowledge. The information about these groups is stored in the provided file ERVsearch/ERV_db/convert.tsv.

Each sequence in the filtered fasta file of newly identified ORFs is assigned to one of these groups based on the sequence identified as the most similar in the classifyWithExonerate step.

The output table is also tidied up to include the UBLAST output, chromosome, ORF start and end positions, genus and group.

summariseScreen

Input Files

Output Files

Parameters

Screen

Input Files
None

Output Files
None

Parameters
None

Helper function to run all screening functions (all functions prior to this point).