Tools and Databases
IPEV
IPEV applies a convolutional neural network (CNN) to distinguish prokaryotic from eukaryotic viruses in virome data. It is built on Python 3.8.6 and TensorFlow 2.3.1. IPEV calculates a set of scores that reflect the probability that each input sequence fragment is a prokaryotic or a eukaryotic viral sequence. Parallelism and algorithmic optimization allow IPEV to return results quickly.
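The description above does not say how IPEV encodes sequence fragments before they reach the CNN. As a rough illustration only, the sketch below assumes one-hot encoding, a common choice for convolutional models on nucleotide data. The names `ALPHABET` and `one_hot` are hypothetical and are not taken from IPEV.

```python
# Hypothetical illustration: one-hot encoding of a nucleotide fragment.
# This is an assumed preprocessing step, not IPEV's documented input format.
ALPHABET = "ACGT"


def one_hot(fragment):
    """Encode a nucleotide string as a list of 4-element 0/1 rows (A, C, G, T).

    Characters outside ACGT (for example N) become an all-zero row.
    """
    rows = []
    for base in fragment.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


if __name__ == "__main__":
    for row in one_hot("ACGN"):
        print(row)
```

Each row corresponds to one position in the fragment, so a fragment of length n yields an n-by-4 matrix that a convolutional layer could scan.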
DeepHoF
DeepHoF (a virus-host finder using deep learning) is designed to predict the potential host types (plant, germ, invertebrate, vertebrate, human) of a given virus represented by its nucleotide sequences. The tool provides five scores and the corresponding p-values, which reflect the probabilities of the virus infecting each host type. In addition, the infection likelihood profile of the given virus is provided.
LightCUD
LightCUD is a validated, high-performance program based on a machine-learning algorithm (LightGBM). It is implemented in Python and packaged with embedded customized databases, so it can be used without installation. With WGS data or 16S rRNA sequencing data of gut samples as input, LightCUD can discriminate IBD from healthy controls with high accuracy and further identify the specific form of IBD.
DREEM
We constructed DREEM, a database of disease-related marker genes in the human gut microbiome, by comprehensively retrieving all currently available data resources (18.63T of WGS data from 1,729 samples) and applying state-of-the-art bioinformatics tools and well-designed statistical analysis. DREEM contains a total of 1,953,046 genes covering six diseases, classified into six groups, one for each disease. Moreover, 5,100 Core-DREEM genes are defined as a common set shared by the diseases.
InteMAP
We developed a pipeline named InteMAP (Integrated Metagenome Assembly Pipeline for short reads) that integrates individual assemblers so that their advantages complement one another when assembling metagenomic sequences. Comparing InteMAP with individual assemblers on both synthetic and real NGS metagenomic data, we showed that InteMAP produces better assemblies, with a longer total contig length, higher contiguity, and more genes than the individual assemblers.
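Total contig length and contiguity can be computed directly from the list of contig lengths. The sketch below is a minimal illustration that uses N50 as the contiguity measure. N50 is a widely used statistic, but the description above does not state which measure InteMAP's evaluation used, so that choice is an assumption. The function name `assembly_stats` is hypothetical.

```python
def assembly_stats(contig_lengths):
    """Return (total_length, n50) for a list of contig lengths.

    N50 is the length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half of the total length.
    An empty assembly returns (0, 0).
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return total, length
    return 0, 0


if __name__ == "__main__":
    # Toy contig lengths in bp; not real assembly output.
    print(assembly_stats([2, 3, 4, 5, 6]))
```

Comparing two assemblies of the same data then reduces to comparing their totals and N50 values, with higher values generally indicating a more complete and more contiguous assembly.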
LncADeep
We propose an ab initio lncRNA identification and functional annotation tool named LncADeep. LncADeep outperforms state-of-the-art tools in predicting lncRNAs and lncRNA-protein interactions, and automatically provides informative functional annotations for lncRNAs.
MetaComp
MetaComp is capable of processing all types of meta-omics data, such as metagenomics, metatranscriptomics, metaproteomics and metabolomics data.
MAP
MAP is a de novo assembly approach and its implementation, based on an improved Overlap/Layout/Consensus (OLC) strategy that incorporates several special algorithms. MAP uses mate-pair information, which makes it more applicable to the shotgun DNA reads (recommended length > 200 bp) currently widely used in metagenome projects. Extensive tests on simulated data show that MAP can be superior to both Celera and Phrap for typical longer reads from Sanger sequencing, and that it has an evident advantage over Celera, Newbler, and the more recent Genovo for typical shorter reads from 454 sequencing.
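The overlap step of a generic OLC assembler looks for reads whose ends match. The sketch below is a naive, exact-match illustration of that idea only. It does not reproduce MAP's improved algorithms or its use of mate-pair information, and the name `overlap_length` and the `min_len` parameter are hypothetical.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of read `a` that exactly equals a prefix of read `b`.

    Only overlaps of at least `min_len` bases count (min_len should be >= 1);
    returns 0 when no such overlap exists. Real assemblers tolerate mismatches
    and use indexing rather than this brute-force scan.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


if __name__ == "__main__":
    # Toy reads: the suffix TGCA of the first read matches the prefix of the second.
    print(overlap_length("ACGTTGCA", "TGCAGGT"))
```

In a full OLC pipeline these pairwise overlaps form a graph, the layout step orders the reads along paths in that graph, and the consensus step derives the contig sequence.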
MetaTISA
MetaTISA is a tool that aims to improve the translation initiation site (TIS) predictions of current gene finders for metagenomes. The method employs a two-step strategy: it first clusters metagenomic fragments into phylogenetic groups and then predicts TISs independently for each group in an unsupervised manner. As evaluated on experimentally verified TISs, MetaTISA greatly improves the TIS prediction accuracy of current gene finders.
MED2.1
MED2.1 is an unsupervised prokaryotic gene prediction method that integrates MED2.0 and TriTISA, an iterative self-learning translation initiation site (TIS) prediction algorithm. As an update of MED2.0, MED2.1 replaces the previous TIS model with TriTISA, which improves the prediction accuracy for both 3' and 5' ends.
MetaGUN
MetaGUN is a novel gene prediction protocol for metagenomic fragments based on a support vector machine (SVM), a machine learning approach. It predicts both the 3' and 5' ends of genes accurately for fragments of various lengths. In particular, it makes the most reliable predictions among current metagenomic gene finders. Application to two human gut microbiome samples indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.
MID
MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Our tool is therefore expected to be useful for improving the study of MIs as a type of genetic variant in the human genome.
PROPER
PROPER is a stand-alone, cross-platform tool for predicting operons and prokaryotic transcription units, with visualization of the results.
ProTISA
ProTISA is intended to collect confirmed translation initiation sites (TISs) for prokaryotic genomes. As of Oct 2008, it includes data for 728 genomes (676 Bacteria and 52 Archaea) with more than 700,000 confirmed TISs. The confirmed data have supporting evidence from different sources, including experimental records in the public protein database Swiss-Prot, literature, conserved domain search and sequence alignment among orthologous genes. Combined with predictions from the state-of-the-art TIS predictors MED-Start/MED-StartPlus and TriTISA and with annotations of potential regulatory signals, the database can serve as a refined annotation resource for the public database RefSeq.
SigmaPromoter
SigmaPromoter uses a scoring scheme to predict all promoters in a prokaryotic genome and can annotate the sigma factors used by those promoters. Tests on reliable data sets show that SigmaPromoter is effective in detecting the distinct types of sigma factors that regulate transcription events, and that its overall prediction performance clearly exceeds that of existing methods. Compared with the current best predictors, SigmaPromoter achieves both the best sensitivity and the best specificity on average. In addition, SigmaPromoter's promoter predictions can annotate alternative transcription with higher reliability.
TriTISA
TriTISA is a TIS post-processor that refines the annotation or prediction of translation initiation sites (TISs) produced by an existing system for microbial genomes. The current version provides options for post-processing genome annotations from public databases such as GenBank and RefSeq, and gene predictions from widely used gene finders such as GeneMark and Glimmer.