ShadowCaster’s documentation¶
ShadowCaster detects horizontal gene transfer (HGT) events in prokaryotes. It implements an evolutionary model that calculates a Bayesian likelihood for each ‘alien gene’, i.e. a gene whose sequence composition is unusual relative to the host genome background.
ShadowCaster analysis workflow¶
- The user defines the query genome in which HGT events will be detected by providing two FASTA files (see Usage).
- ShadowCaster uses a list of proteomes from phylogenetically related species to the query genome (one proteome FASTA file per species) to construct a phylogenetic shadow.
- The list of proteomes can be either:
  - Provided by the user (a collection of FASTA files).
  - Automatically retrieved by ShadowCaster from the NCBI ftp site by using script/get_proteomes.py, see get_proteomes.py.
- A prioritized list of potential ‘alien genes’ present in the query genome is generated by analysing compositional features, i.e. 4-mers and codon usage. An unsupervised one-class support vector machine is used for the prioritization task.
- Orthology relationships between the query genome and its phylogenetically related species are obtained with the third-party algorithm OrthoMCL. This information is used to calculate the ‘probability of orthology’ between the query genome and each other genome in the phylogenetic shadow.
- BLAST is used to calculate the identity between each alien gene and the genes in the genomes of the phylogenetic shadow.
- A likelihood is calculated for each alien gene in the prioritized list. The likelihood expresses how likely it is that the pattern of identity across genomes in the phylogenetic shadow for this alien gene derives from vertical inheritance.
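As a rough illustration of the compositional features mentioned in the workflow, the sketch below computes normalized 4-mer frequencies for a single gene sequence. It is not ShadowCaster's own code: the function name `kmer_frequencies` is hypothetical, and skipping windows that contain ambiguous bases is an assumption made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return a dict mapping every possible DNA k-mer to its relative frequency.

    Hypothetical helper for illustration only. Windows containing characters
    other than A, C, G, T are skipped (an assumption of this sketch).
    """
    sequence = sequence.upper()
    alphabet = "ACGT"
    counts = Counter()
    # Slide a window of length k over the sequence, one base at a time.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if all(base in alphabet for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    # Enumerate all 4**k k-mers so every gene yields a vector of equal length.
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {kmer: (counts[kmer] / float(total) if total else 0.0)
            for kmer in all_kmers}


freqs = kmer_frequencies("ATGCATGCAT")
print(len(freqs))
```

Vectors of this kind, one per gene, are the sort of input a one-class classifier could be trained on; how ShadowCaster actually combines 4-mers with codon usage is not shown here.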
Installation¶
For a comprehensive guide on how to install ShadowCaster and its prerequisites, see Installation.
Support¶
Send additional enquiries to asanchez2@utpl.edu.ec
License¶
GNU General Public License Version 3
Contents¶
Installation¶
Prerequisites¶
Fundamental prerequisites¶
python >= v2.7.10
R >= v3.4
perl >= v5.10
These items are prerequisites for the installation of ShadowCaster as described below.
Packages¶
- Install python packages using:
pip install numpy scipy biopython pandas ete3 scikit-learn matplotlib
- R packages: cluster v2.0.*
- Install figlet. For Linux (Mint 18.3) use:
apt install figlet
Other dependencies¶
For using the phylogenetic component, some programs are required and should be in the PATH:
- BLAST 2.2.26+ (blastp needed).
If you already have a newer version of BLAST installed, you can simply set the path to its blastp executable in the ShadowCaster configuration file.
OrthoMCL needs BLAST v2.2.26 (blastall and formatdb). These binary files can be found here. Download blast-2.2.26-*.tar.gz.
For Linux (Mint 18.3) this can be installed through:
apt-get install blast2
For more information, see Suggestions.
- Package emboss.
For Linux (Mint 18.3) this is installed through:
apt-get install emboss
To check that these dependencies were correctly added to the PATH, run:
type orthomcl-pipeline blastall formatdb
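The same check can be scripted. The sketch below is a minimal example, assuming Python 3.3 or later (where `shutil.which` is available); the helper name `find_executables` is hypothetical and not part of ShadowCaster.

```python
import shutil


def find_executables(names):
    """Map each program name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


# Program names taken from the `type` command in this section.
required = ["orthomcl-pipeline", "blastall", "formatdb"]
missing = [name for name, path in find_executables(required).items()
           if path is None]
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All dependencies found.")
```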
Installation of ShadowCaster¶
To use ShadowCaster, download it from the GitHub repository and extract the files. If you have git installed, you can install ShadowCaster by running:
cd
git clone https://github.com/dani2s/ShadowCaster.git
Resolve all dependencies (see above) and then execute:
cd ShadowCaster/
python setup.py install
This will install ShadowCaster under your home folder.
Usage¶
ShadowCaster uses a configuration file (args.ini) to manage all the options needed. This file must be specified on the command line and will supply the following arguments:
Files¶
- Query genome: FASTA files only.
- Query proteome: FASTA files only.
- OrthoMCL configuration file (orthomcl.conf), previously obtained with the installation of OrthoMCL-pipeline. This file looks like:
coOrthologTable=CoOrtholog
dbConnectString=dbi:mysql:orthomcl:localhost:mysql_local_infile=1
dbLogin=orthomcl
dbPassword=orthomcl
dbVendor=mysql
evalueExponentCutoff=-5
inParalogTable=InParalog
interTaxonMatchView=InterTaxonMatch
oracleIndexTblSpc=NONE
orthologTable=Ortholog
percentMatchCutoff=50
similarSequencesTable=SimilarSequences
Path¶
- Proteomes folder: contains the proteomes (FASTA files only) of each species used to construct the shadow. The proteomes can be either:
  - Provided by the user (a collection of FASTA files), or
  - Automatically retrieved by ShadowCaster from the NCBI ftp site by using script/get_proteomes.py, see get_proteomes.py.
- Blastp26: ShadowCaster uses blastp 2.2.26; specify the binary file or the shell command used.
- Formatdb: specify the binary file or the shell command used, e.g. formatdb.
Parametric¶
- nuSVM: An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Should be in the interval (0, 1].
A template of the args.ini file can be found in the bin folder.
Specifications of fasta files¶
The id of each gene in the GENOME fasta file SHOULD NOT contain special characters such as |, #, %, /, \, *, &, $, !, :.
Ideally, each record has only one identification number.
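A quick pre-flight check of the genome file can catch problematic ids before a run. The sketch below is only an illustration: the helper `problematic_ids` is hypothetical, the id is assumed to be the first whitespace-delimited token of each header line, and the character set is limited to the examples listed above (the actual set ShadowCaster rejects may be broader).

```python
# Characters named in the specification above; treated here as the full set
# for illustration, which is an assumption.
FORBIDDEN = set("|#%/\\*&$!:")


def problematic_ids(fasta_lines):
    """Return the record ids that contain any of the listed special characters."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            # The id is taken as the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            if any(ch in FORBIDDEN for ch in record_id):
                bad.append(record_id)
    return bad


example = [">gene_0001", "ATGAAACGT", ">gi|12345|ref", "ATGCCCTAA"]
print(problematic_ids(example))
```

To check a real file, pass an open file handle instead of the example list, e.g. `problematic_ids(open("genome.fasta"))`, where the file name is a placeholder.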
- All the proteome files provided by the user MUST follow these specifications:
  - The FASTA file name can ONLY be the binomial name of the species (e.g. Rhodanobacter_denitrificans.fasta).
  - The header of each record in a FASTA file MUST contain either a single id number or the following structure:
>AGG91012.1 ribosomal protein L34 [Rhodanobacter denitrificans]
MKRTFQPSKLKRARTHGFRARMATADGRKVLNARRAKGRKRLIP
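The two proteome rules above can be approximated with regular expressions. This is a sketch under stated assumptions, not the check ShadowCaster itself performs: both patterns and both function names are hypothetical, and the file name pattern assumes a capitalized genus, an underscore, a lowercase species epithet and the .fasta extension.

```python
import os
import re

# Genus_species.fasta, e.g. Rhodanobacter_denitrificans.fasta (assumed pattern).
FILENAME_RE = re.compile(r"^[A-Z][a-z]+_[a-z]+\.fasta$")
# Either a bare id, or "id description [Genus species]" (assumed pattern).
HEADER_RE = re.compile(r"^>\S+(\s+.*\[[^\[\]]+\])?\s*$")


def valid_proteome_filename(path):
    """True if the file name looks like a binomial species name plus .fasta."""
    return FILENAME_RE.match(os.path.basename(path)) is not None


def valid_header(line):
    """True if a FASTA header is a single id or ends with a bracketed species."""
    return HEADER_RE.match(line) is not None


print(valid_proteome_filename("proteomes/Rhodanobacter_denitrificans.fasta"))
print(valid_header(
    ">AGG91012.1 ribosomal protein L34 [Rhodanobacter denitrificans]"))
```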
Examples of the input files can be found in the test data repository of ShadowCaster, see here
Run ShadowCaster¶
cd $ShadowCaster/bin/
shadowcaster --config_file args.ini
Example¶
This section is a complete example walkthrough of ShadowCaster usage. It assumes you have successfully completed the Installation.
Downloading test data¶
Download and extract the test data repository of ShadowCaster in a suitable location, see here
Software specifications¶
The results provided in the test data repository were obtained running ShadowCaster with the following software versions.
O.S: | Linux Mint 18.3 (Sylvia) |
---|---|
Package base: | Ubuntu Xenial |
Python: | v2.7.15 |
R: | v3.4.4 |
Perl: | v5.22.1 |
mysql: | v5.7.23 |
Python packages: | |
R package: | |
Run ShadowCaster¶
All fasta files are in the shadowcaster-input and proteomes-output folders. The arguments needed in the args.ini file are:
- query_genome = /home/user/path/to/shadowcaster-input/Rdenitrificans_genome.fasta
- query_proteome = /home/user/path/to/shadowcaster-input/Rhodanobacter_denitrificans.fasta
- proteomes_folder = /home/user/path/to/proteomes-output/proteomes/
- orthomcl_config = /home/user/orthomcl-pipeline/orthomcl.conf
- blastp26 = Binary file or shell command used to run blastp v2.2.26.
- formatdb26 = Binary file or shell command used to run formatdb v2.2.26.
- nuSVM = 0.4
You MUST use the full path for every file and directory.
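Before launching a run, the values destined for args.ini can be sanity-checked. The sketch below is a hypothetical helper, not part of ShadowCaster: it only verifies that the path arguments are absolute and that nuSVM lies in (0, 1]. The blastp26 and formatdb26 entries are left out on the assumption that they may be plain shell commands rather than paths, and the example paths are the placeholders from the list above.

```python
import os

# Keys mirror the argument names listed above; values are the example placeholders.
settings = {
    "query_genome": "/home/user/path/to/shadowcaster-input/Rdenitrificans_genome.fasta",
    "query_proteome": "/home/user/path/to/shadowcaster-input/Rhodanobacter_denitrificans.fasta",
    "proteomes_folder": "/home/user/path/to/proteomes-output/proteomes/",
    "orthomcl_config": "/home/user/orthomcl-pipeline/orthomcl.conf",
    "nuSVM": "0.4",
}
PATH_KEYS = ["query_genome", "query_proteome", "proteomes_folder", "orthomcl_config"]


def check_settings(values):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for key in PATH_KEYS:
        # Only the form of the path is checked, not whether the file exists.
        if not os.path.isabs(values.get(key, "")):
            problems.append(key + " is not a full path")
    try:
        nu = float(values.get("nuSVM", ""))
    except ValueError:
        nu = None
    if nu is None or not (0.0 < nu <= 1.0):
        problems.append("nuSVM must be in the interval (0, 1]")
    return problems


print(check_settings(settings))
```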
Run ShadowCaster through:
cd $ShadowCaster/bin/
shadowcaster --config_file args.ini
When ShadowCaster has finished, the message “ShadowCaster finished” is printed. The program generates a number of files in the output directory, which is named with the date and time of the run.
Output description¶
ShadowCaster generates the following files in the output directory.
log.txt: | Contains the parameters used |
---|---|
Parametric folder: | |
Phylogenetic folder: | |
A copy of all the output files can be found in the shadowcaster-output folder.
get_proteomes.py¶
get_proteomes.py implements a step that is run before finding HGT candidates with ShadowCaster.
This script retrieves, from the NCBI ftp site, a list of proteomes (fasta files) from species phylogenetically related to the query species.
ShadowCaster needs these proteomes to construct a phylogenetic shadow used in its phylogenetic component.
Prerequisites¶
- EDirect, the NCBI UNIX command-line utilities.
Before using the script, check that the commands esearch and xtract work correctly in a new shell window.
type esearch xtract
Usage¶
The usage and help documentation of get_proteomes.py can be seen by running python get_proteomes.py -h.
Example¶
An example of how to run get_proteomes.py on the test data:
cd ShadowCaster/scripts
python get_proteomes.py -n Rhodanobacter_denitrificans -sp 25
This results in the following output files in a folder named after the species name provided:
- log.txt: Name of each downloaded species and its ftp address.
- proteomes folder: Proteomes (fasta files) used to construct the shadow.
The results should be similar to those found in the proteomes-output folder of the test data repository, see here