Supporting data for "DeePhage: distinguish virulent and temperate phage-derived sequences in metavirome data with a deep learning approach"
=======================================================================================================================================
Shufang W; Zhencheng F; Jie T; Mo L; Chunhui W; Qian G; Congmin X; Xiaoqing J; Huqiqiu Z (2020): Supporting data for "DeePhage: distinguish virulent and temperate phage-derived sequences in metavirome data with a deep learning approach".

Summary:
-------------------------------------------------------------------------------------------------------------------------------------------
DeePhage classifies metavirome sequences as virulent phage-derived or temperate phage-derived fragments. The program calculates a score between 0 and 1 for each input fragment. A sequence with a score higher than 0.5 is regarded as a virulent phage-derived fragment, and a sequence with a score lower than 0.5 is regarded as a temperate phage-derived fragment.
DeePhage can run either in a virtual machine or on a physical host. For users without a computing background, we recommend running the virtual machine version of DeePhage on a local PC, which requires no installation of dependency packages. If a GPU is available, you can instead run the physical host version, which uses the GPU automatically for acceleration and is better suited to large-scale data.
-------------------------------------------------------------------------------------------------------------------------------------------
The following describes each file.

File: md5.txt
The md5 value of each file.

File: DeePhage_v_1_0.zip
Code and executable file of DeePhage.

File: VM_Bioinfo.vdi.7z
The virtual machine of DeePhage.

File: Figure1_Figure_S2.7z
Figure 1 and Figure S2.

File: training.7z. After decompression, the folder contains data, related results and scripts.
-------------------------------------------------------------------------------------------------------------------------------------------
Folder    Description
five_fold_validation_model    Output of trained models for each validation group.
five_fold_validation_prediction_label    Output of predicted results and labels for each validation group.
seperated_test_training_phages    Phage genomes used in each cross validation, including training and test sets.
MetaSim_simulating    Output of MetaSim for training and testing sequences in each validation.
result_encoding_onehot    Output of scripts encode_onehot_for_training_set.m and encode_onehot_for_test_set.m
Figures_S3    Script and figures for Figure S3.
train_different_models    Folder containing data for the five other models (Kmer-4, No-Maxpooling, No-Dropout, No-Globalpooling, and No-BN).
    File
    cal_sn_sp_acc.m    Script for calculating the accuracy, sn and sp of each model.
    Subfolder
    kmer_4    Folder containing data for the Kmer-4 model.
        File
        adjust_uncertain_nt.m    Script called by seq_to_frequency.m
        kmer_4_string.m    Script for coding 4-mer strings.
        kmer_order_4.mat    Output of script kmer_4_string.m
        seq_to_frequency.m    Script called by kmer_frequency_for_single_seq_test_set.m and kmer_frequency_for_single_seq_and_complement_seq_train_set.m
        kmer_frequency_for_single_seq_test_set.m    Script for coding test sequences into 4-mer frequencies.
        kmer_frequency_for_single_seq_and_complement_seq_train_set.m    Script for coding training sequences into 4-mer frequencies.
        test_1200_1800_2_kmer_4.mat    Output of script kmer_frequency_for_single_seq_test_set.m
        test_label_1200_1800_2_kmer_4.mat    Output of script kmer_frequency_for_single_seq_test_set.m
        train_1200_1800_2_kmer_4.mat    Output of script kmer_frequency_for_single_seq_and_complement_seq_train_set.m
        train_label_1200_1800_2_kmer_4.mat    Output of script kmer_frequency_for_single_seq_and_complement_seq_train_set.m
        train_model_1800_kmer_4.py    Script for training the K-mer model.
        Kmer_4_model.h5    Output of script train_model_1800_kmer_4.py
        Kmer_4_prediction.csv    Output of script train_model_1800_kmer_4.py
        Kmer_4_output.txt    Output of script train_model_1800_kmer_4.py
    train_test_fasta
        Subfolder
        1200_1800_2
            File
            temp_1200_1800_2.fna    Simulated temperate training sequences from the 2nd cross validation set in Group D.
            viru_1200_1800_2.fna    Simulated virulent training sequences from the 2nd cross validation set in Group D.
            P_test.mat    One-hot encoded form of the 2nd cross validation test set in Group D.
            P_train_ds.mat    One-hot encoded form of the 2nd cross validation training set in Group D.
            T_test.mat    Labels for the 2nd cross validation test set in Group D.
            T_train_ds.mat    Labels for the 2nd cross validation training set in Group D.
            predict_1200_1800_2.csv    Prediction results by DeePhage for the 2nd cross validation test set in Group D.
    No_BN
        File
        No_BN_model.h5    Output of script train_model_1800_No_BN.py
        No_BN_prediction.csv    Output of script train_model_1800_No_BN.py
        No_BN_output.txt    Output of script train_model_1800_No_BN.py
        train_model_1800_No_BN.py    Script for training the No-BN model.
    No_Dropout
        File
        No_Dropout_model.h5    Output of script train_model_1800_No_Dropout.py
        No_Dropout_prediction.csv    Output of script train_model_1800_No_Dropout.py
        No_Dropout_output.txt    Output of script train_model_1800_No_Dropout.py
        train_model_1800_No_Dropout.py    Script for training the No-Dropout model.
    No_Globalpooling
        File
        No_Globalpooling_model.h5    Output of script train_model_1800_No_Globalpooling.py
        No_Globalpooling_prediction.csv    Output of script train_model_1800_No_Globalpooling.py
        No_Globalpooling_output.txt    Output of script train_model_1800_No_Globalpooling.py
        train_model_1800_No_Globalpooling.py    Script for training the No-Globalpooling model.
    No_Maxpooling
        File
        No_Maxpooling_model.h5    Output of script train_model_1800_No_Maxpooling.py
        No_Maxpooling_prediction.csv    Output of script train_model_1800_No_Maxpooling.py
        No_Maxpooling_output.txt    Output of script train_model_1800_No_Maxpooling.py
        train_model_1800_No_Maxpooling.py    Script for training the No-Maxpooling model.
File    Description
extract_testing_and_training_seq.m    Script for dividing the dataset into training and testing sets in each validation.
encode_onehot_for_training_set.m    Script for encoding training sequences in one-hot form in each validation.
encode_onehot_for_test_set.m    Script for encoding test sequences in one-hot form in each validation.
delet_two_virulent.m    Script for deleting two ambiguous virulent phages in Dataset-1.
Virulent_delet.fasta    Output of script delet_two_virulent.m.
Dataset-1_virulent.fasta    Complete genomes of virulent phages from Dataset-1.
Dataset-1_temperate.fasta    Complete genomes of temperate phages from Dataset-1.
Dataset-2_virulent.fasta    Complete genomes of virulent phages from Dataset-2.
Dataset-2_temperate.fasta    Complete genomes of temperate phages from Dataset-2.
calculate_validation_accuracy_sn_sp.m    Script for calculating the accuracy, sn and sp in each validation.
adjust_uncertain_nt.m    Script called by encode_onehot.m.
dividing_testing_and_training_phage.m    Script for dividing phages into testing and training parts for Dataset-1.
train_test_num.mat    Output of script dividing_testing_and_training_phage.m
validation_acc_sn_sp.mat    Output of script calculate_validation_accuracy_sn_sp.m.
train_model_400.py    Script for training with the 100-400bp group.
train_model_800.py    Script for training with the 400-800bp group.
train_model_1200.py    Script for training with the 800-1200bp group.
train_model_1800.py    Script for training with the 1200-1800bp group.
-------------------------------------------------------------------------------------------------------------------------------------------
File: all_train_all_CDS.7z.
After decompression, the folder contains data, related results and scripts.
-------------------------------------------------------------------------------------------------------------------------------------------
Folder    Description
temperate_all_gb_file    Output of get_gb_file.py.
virulent_all_gb_file    Output of get_gb_file.py.
CDS_faa_fasta_temperate    Output of all_CDS_temperate.m.
CDS_faa_fasta_virulent    Output of all_CDS_virulent.m.
phacts_result
    Subfolder
    all_CDS_temperate_faa    Predictions for all temperate CDS from PHACTS.
    all_CDS_virulent_faa    Predictions for all virulent CDS from PHACTS.
PhagePred_result
    File
    all_temperate_CDS_predict_result_use_PhagePred.csv    Predictions for all temperate CDS from PhagePred.
    all_virulent_CDS_predict_result_use_PhagePred.csv    Predictions for all virulent CDS from PhagePred.
File    Description
get_gb_file.py    Script for getting the GenBank files from NCBI for all the virulent and temperate phages.
get_NC_accession.m    Script for getting the accessions of all virulent and temperate phages.
header_test.mat    Output of get_NC_accession.m.
all_CDS_temperate.m    Script for extracting the DNA and protein sequences of all CDS regions from temperate phages.
all_CDS_virulent.m    Script for extracting the DNA and protein sequences of all CDS regions from virulent phages.
all_CDS_temperate_for_DeePhage_in_one_file.m    Script for integrating all the temperate CDS into one file.
all_CDS_virulent_for_DeePhage_in_one_file.m    Script for integrating all the virulent CDS into one file.
delet2_temp_all_CDS_in_one_file.fasta    Output of script all_CDS_temperate_for_DeePhage_in_one_file.m
delet2_viru_all_CDS_in_one_file.fasta    Output of script all_CDS_virulent_for_DeePhage_in_one_file.m
delet2_temp_all_CDS_in_one_file.csv    Predictions for all temperate CDS from DeePhage.
delet2_viru_all_CDS_in_one_file.csv    Predictions for all virulent CDS from DeePhage.
Dataset-1_temperate.fasta    Complete genomes of temperate phages in Dataset-1.
Dataset-1_virulent.fasta    Complete genomes of virulent phages in Dataset-1.
acc_DeePhage.m    Script for calculating the accuracy of DeePhage.
acc_PHACTS.m    Script for calculating the accuracy of PHACTS.
acc_PhagePred.m    Script for calculating the accuracy of PhagePred.
---------------------------------------------------------------------------------------------------------------------------------------------
File: bacteria_predict.7z. After decompression, the folder contains data, related results and scripts.
---------------------------------------------------------------------------------------------------------------------------------------------
File    Description
prok_reference_genomes.txt    Downloaded information on the 120 bacteria.
prok_reference_genomes.xlsx    The NC accession numbers of the 120 bacteria.
bacteria_120reference_fna.zip    Zip file of the whole-genome files of the 120 bacteria and their md5 file.
bacteria_100_400.fna    Output of MetaSim from the 120 bacterial genomes, with lengths ranging from 100bp to 400bp.
bacteria_400_800.fna    Output of MetaSim from the 120 bacterial genomes, with lengths ranging from 400bp to 800bp.
bacteria_800_1200.fna    Output of MetaSim from the 120 bacterial genomes, with lengths ranging from 800bp to 1200bp.
bacteria_1200_1800.fna    Output of MetaSim from the 120 bacterial genomes, with lengths ranging from 1200bp to 1800bp.
bacteria_100_400.csv    DeePhage predictions for the file bacteria_100_400.fna.
bacteria_400_800.csv    DeePhage predictions for the file bacteria_400_800.fna.
bacteria_800_1200.csv    DeePhage predictions for the file bacteria_800_1200.fna.
bacteria_1200_1800.csv    DeePhage predictions for the file bacteria_1200_1800.fna.
temp_proportion.m    Script for calculating the proportion of bacterial sequences predicted as temperate phages in each length range.
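
As an illustration of how the 0.5 score threshold from the Summary turns into a "proportion predicted as temperate", here is a minimal Python sketch. It is not the contents of temp_proportion.m, and the layout of the DeePhage csv files is not described in this document, so the function simply takes a plain list of scores; the function name and the example scores are hypothetical. The Summary does not say how a score of exactly 0.5 is treated, so this sketch counts only scores strictly below the threshold as temperate.

```python
def temperate_proportion(scores, threshold=0.5):
    """Return the fraction of scores strictly below the threshold.

    Assumption: scores are DeePhage-style values in [0, 1], where a score
    below 0.5 means "temperate phage-derived" (as stated in the Summary).
    """
    scores = list(scores)
    if not scores:
        raise ValueError("no scores given")
    temperate = sum(1 for s in scores if s < threshold)
    return temperate / len(scores)


# Hypothetical scores, for illustration only.
example_scores = [0.91, 0.12, 0.48, 0.77, 0.30]
print(temperate_proportion(example_scores))  # 3 of the 5 scores are below 0.5
```

To apply this to one of the csv files above, the score column would first have to be read into a list; which column holds the score is something to check in the file itself.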
-----------------------------------------------------------------------------------------------------------------------------------------------
File: all_data_to_train_model.7z. After decompression, the folder contains data, related results and scripts.
---------------------------------------------------------------------------------------------------------------------------------------------
Subfolder    Description
100_400
    File
    temp_100_400.fna    80000 short sequences (ranging from 100bp to 400bp) generated with MetaSim from all the temperate phage genomes (Dataset-1 and Dataset-2).
    viru_100_400.fna    80000 short sequences (ranging from 100bp to 400bp) generated with MetaSim from all the virulent phage genomes (Dataset-1 and Dataset-2).
    T_train_ds.mat    Labels of 80000 sequences.
    P_train_ds.mat    Encoded form ("one-hot") of 80000 sequences.
    100_400_all_train_model.h5    Trained model of Group A; output of script train_model_all_train_400.py
400_800
    File
    temp_400_800.fna    80000 short sequences (ranging from 400bp to 800bp) generated with MetaSim from all the temperate phage genomes (Dataset-1 and Dataset-2).
    viru_400_800.fna    80000 short sequences (ranging from 400bp to 800bp) generated with MetaSim from all the virulent phage genomes (Dataset-1 and Dataset-2).
    T_train_ds.mat    Labels of 80000 sequences.
    P_train_ds.mat    Encoded form ("one-hot") of 80000 sequences.
    400_800_all_train_model.h5    Trained model of Group B; output of script train_model_all_train_800.py
800_1200
    File
    temp_800_1200.fna    80000 short sequences (ranging from 800bp to 1200bp) generated with MetaSim from all the temperate phage genomes (Dataset-1 and Dataset-2).
    viru_800_1200.fna    80000 short sequences (ranging from 800bp to 1200bp) generated with MetaSim from all the virulent phage genomes (Dataset-1 and Dataset-2).
    T_train_ds.mat    Labels of 80000 sequences.
    P_train_ds.mat    Encoded form ("one-hot") of 80000 sequences.
    800_1200_all_train_model.h5    Trained model of Group C; output of script train_model_all_train_1200.py
1200_1800
    File
    temp_1200_1800.fna    80000 short sequences (ranging from 1200bp to 1800bp) generated with MetaSim from all the temperate phage genomes (Dataset-1 and Dataset-2).
    viru_1200_1800.fna    80000 short sequences (ranging from 1200bp to 1800bp) generated with MetaSim from all the virulent phage genomes (Dataset-1 and Dataset-2).
    T_train_ds.mat    Labels of 80000 sequences.
    P_train_ds.mat    Encoded form ("one-hot") of 80000 sequences.
    1200_1800_all_train_model.h5    Trained model of Group D; output of script train_model_all_train_1800.py
t_SNE_net
    Subfolder    Description
    results    Files output by script layer_weight.py
    plots_using_origin    Files used to produce Figure 2 with Origin.
    File    Description
    layer_weight_get_weight.py    Script for calculating the weight matrix of each layer from Group D and saving the weights.
    pca_process.py    Script for visualization by dimensionality reduction.
    T_train_ds.mat    Labels of 80000 sequences from all the phage genomes in Group D.
    P_train_ds.mat    Encoded form ("one-hot") of 80000 sequences from all the phage genomes in Group D.
    *.npy    Output of script layer_weight_get_weight.py.
File    Description
adjust_uncertain_nt.m    Script called by script.m.
encoding_to_onehot_100_400.m    Script for encoding the training sequences in Group A into "one-hot" form.
encoding_to_onehot_400_800.m    Script for encoding the training sequences in Group B into "one-hot" form.
encoding_to_onehot_800_1200.m    Script for encoding the training sequences in Group C into "one-hot" form.
encoding_to_onehot_1200_1800.m    Script for encoding the training sequences in Group D into "one-hot" form.
Dataset-1_virulent.fasta    Complete genomes of virulent phages from Dataset-1.
Dataset-1_temperate.fasta    Complete genomes of temperate phages from Dataset-1.
Dataset-2_virulent.fasta    Complete genomes of virulent phages from Dataset-2.
Dataset-2_temperate.fasta    Complete genomes of temperate phages from Dataset-2.
train_model_all_train_400.py    Script for training a new model using the whole dataset in Group A.
train_model_all_train_800.py    Script for training a new model using the whole dataset in Group B.
train_model_all_train_1200.py    Script for training a new model using the whole dataset in Group C.
train_model_all_train_1800.py    Script for training a new model using the whole dataset in Group D.
-----------------------------------------------------------------------------------------------------------------------------------------------
File: PCA.7z. After decompression, the folder contains data, related results and scripts.
---------------------------------------------------------------------------------------------------------------------------------------------
File    Description
temp_viru_kmer_frequency.m    Script for calculating the 4-mer frequencies of virulent and temperate phage whole genomes.
temp_frequency.mat    Output of script temp_viru_kmer_frequency.m
viru_frequency.mat    Output of script temp_viru_kmer_frequency.m
cal_PCA.m    Script for calculating the results of PCA.
kmer_4_PCA.opju    Origin file for two-dimensional visualization.
PCA_4.png    Figure S1.
PCA_4_300_dpi.png    A 300 dpi version of Figure S1.
-----------------------------------------------------------------------------------------------------------------------------------------------
File: runtime.7z. After decompression, the folder contains data, related results and scripts.
---------------------------------------------------------------------------------------------------------------------------------------------
Folder    Description
files    Files used for testing the running time of PHACTS, DeePhage and PhagePred.
VM_Bioinfo.vdi    A virtual machine used for testing the running time of PHACTS and DeePhage.
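
For readers unfamiliar with the 4-mer frequency representation used in PCA.7z (temp_viru_kmer_frequency.m) and in the Kmer-4 model, the following Python sketch shows one common way to compute it. It is an illustration only, not a port of the MATLAB scripts: the function name is hypothetical, and skipping windows that contain non-ACGT characters is an assumption made here (how the authors treat uncertain nucleotides in adjust_uncertain_nt.m is not described in this document).

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the relative frequency of every ACGT k-mer in seq.

    The result is a list of 4**k values ordered lexicographically
    (AAAA, AAAC, ..., TTTT for k=4).
    Assumption: windows containing any character other than A, C, G, T
    are ignored rather than resolved.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)  # counts only pure-ACGT windows
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


freqs = kmer_frequencies("ACGTACGT")
print(len(freqs))  # 256 possible 4-mers
```

Each genome then becomes a 256-dimensional vector, which is the kind of input a PCA step such as cal_PCA.m would operate on.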
-----------------------------------------------------------------------------------------------------------------------------------------------
File: real_virome_data.7z. After decompression, the folder contains data, related results and scripts.
-----------------------------------------------------------------------------------------------------------------------------------------------
File    Description
101835.fastq    Metavirome data of bodily fluid in the bovine rumen, downloaded from MG-RAST (accession: mgm4534202.3).
101836.fastq    Metavirome data of bodily fluid in the bovine rumen, downloaded from MG-RAST (accession: mgm4534203.3).
contigs.fasta    Contigs assembled from the files 101835.fastq and 101836.fastq by SPAdes.
contigs-predicted-gene.faa    Genes predicted from the file contigs.fasta by FragGeneScan.
viral.1.protein.faa    Viral proteins from the viral protein database.
viral.2.protein.faa    Viral proteins from the viral protein database.
Calculate_the_lenght_of_N50.m    Script for calculating the length of the N50 sequence.
blastx_contigs_hallmarker_no_head.txt    Output of blastx when aligning the predicted genes against the viral proteins.
contigs_result.csv    DeePhage predictions for the file contigs.fasta.
Folder
    Subfolder
    hits_16_contigs
        Subfolder    Description
        16_contigs_fasta_DNA_sequence    DNA sequences of the 16 contigs.
        16_contigs_protein_sequence_by_FragGeneScan    Protein prediction results by FragGeneScan for the 16 contigs.
        16_contigs_prediction_result_by_PHACTS    Prediction results by PHACTS for the 16 contigs.
        File    Description
        Pick_out_16_contigs_use_evaleu_e-10_hits_len_400.m    Script for picking out the 16 contigs with an e-value less than 1e-10 and a hit length of more than 400.
        contig_len_identity_evalue_hitlen.mat    Output of script Pick_out_16_contigs_use_evaleu_e-10_hits_len_400.m
        Prediction_scores_by_DeePhage_16_contig.m    Prediction scores for the 16 contigs by DeePhage.
        Contig_ID_16_contigs.mat    IDs of the 16 contigs, saved in a mat file.
    predict_all_blast_contigs_by_PHACTS
        File    Description
        single_contig_file_faa_file.zip    Zip file of each contig's faa sequences predicted by FragGeneScan (contigs with no faa sequences are not included).
        single_contig_file_faa_file_predict_result_by_PHACTS.zip    Zip file of each contig's prediction result by PHACTS.
    blast_all_viru_temp_whole_genomes
        File    Description
        contigs_blast_all_temp_viru.txt    Blast results of the contigs when using all the virulent and temperate whole genomes as the database.
        sort_blast_result_with_contig_ID_len_identity_evalue_lifestyle.m    Script for sorting the blast results and extracting the contig ID, contig length, identity score, e-value and the type of the aligned phage.
        blast_len_identity_evalue_sort.mat    Output of script sort_blast_result_with_contig_ID_len_identity_evalue_lifestyle.m
        phacts_pre_acc_for_real_data_blast_all_result.m    Script for calculating the proportion of predictions for the two types of phages by PHACTS.
        DeePhage_pre_acc_for_real_data_blast_all_result.m    Script for calculating the proportion of predictions for the two types of phages by DeePhage.
    predict_all_blast_contigs_by_PhagePred
        File    Description
        PhagePred_16.m    Script for calculating the prediction accuracy for the 16 contigs by PhagePred.
        PhagePred_pre_acc_for_real_data_blast_all_result.m    Script for calculating the proportion of predictions for the two types of phages by PhagePred.
        PhagePred_pre_real_data.csv    Prediction results for real data by PhagePred.
        blast_len_identity_evalue_sort.mat    Output of script sort_blast_result_with_contig_ID_len_identity_evalue_lifestyle.m
------------------------------------------------------------------------------------------------------------------------------------------------
File: phage_transformations.7z. After decompression, the folder contains data, related results and scripts.
------------------------------------------------------------------------------------------------------------------------------------------------
Folder
contig
    Subfolder
    meta_genome_phage
        Subfolder    Description
        SPAdes_assemble_meta_real_data    Contigs of the metagenome samples (including Health and UC samples) assembled by SPAdes.
        meta_SPAdes_phage_pro_result    Prediction results for the metagenome samples (including Health and UC samples) by PPR-Meta.
        phage_contigs_SPAdes    Phage contigs picked out from the metagenome samples (including Health and UC samples).
        DeePhage_pre_phage_contigs    Prediction results by DeePhage for the phage contigs of the metagenome samples (including Health and UC samples).
    meta_virom
        Subfolder    Description
        Health    Contigs assembled by SPAdes (fasta files) and prediction results by DeePhage (csv files) for the virome Health samples.
        UC    Contigs assembled by SPAdes (fasta files) and prediction results by DeePhage (csv files) for the virome UC samples.
blast_phage_genomes
    File    Description
    viruses.txt    Virus information obtained from the NCBI database (ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/, accessed 23 November 2020).
    phage.txt    Phage information picked out from the file viruses.txt.
    NC_num_phage.txt    Usable NC accession numbers of phages picked out from phage.txt.
    phage_genomes.fasta    Phage genome sequences downloaded from the NCBI database (https://www.ncbi.nlm.nih.gov/sites/batchentrez) according to the file NC_num_phage.txt.
    select_temp_contig.m    Script for picking out all the contigs annotated as temperate by DeePhage from the virome Health and UC samples.
    temp_all_contigs_UC_54_sample.fasta    Output of script select_temp_contig.m
    temp_all_contigs_Health_23_sample.fasta    Output of script select_temp_contig.m
    temp_all_contigs_UC_54_blast_phage_e-10_simple.out    Blast output when aligning the temperate contigs of the UC samples to the phage genomes.
    temp_all_contigs_Health_23_blast_phage_e-10_simple.out    Blast output when aligning the temperate contigs of the Health samples to the phage genomes.
    temp_all_contigs_UC_54_blast_phage_e-10_simple_with_taxid.out.out    The contig ID and its taxid (if one exists) picked out from the file temp_all_contigs_UC_blast_phage_e-10_54_simple.out
    temp_all_contigs_Health_23_blast_phage_e-10_simple_with_taxid.out.out    The contig ID and its taxid (if one exists) picked out from the file temp_all_contigs_Health_blast_phage_e-10_23_simple.out
    get_different_taxonomy.r    Script for finding the phage species that differ between the virome Health and UC samples.
    UC_taxid_unique.csv    Output of script get_different_taxonomy.r.
    Health_taxid_unique.csv    Output of script get_different_taxonomy.r.
different_UC_state
    File    Description
    different_UC_state_average_temperate_pro.m    Script for calculating the proportion of temperate phage contigs in different UC states.
    different_state_average_pro.m    Script for calculating the average proportion of temperate phage contigs in different UC states.
    pro_health_early.mat    Output of script different_UC_state_average_temperate_pro.m.
    pro_health_flare.mat    Output of script different_UC_state_average_temperate_pro.m.
    pro_health_improve.mat    Output of script different_UC_state_average_temperate_pro.m.
    pro_health_inactivate.mat    Output of script different_UC_state_average_temperate_pro.m.
    pro_health_late_resolve.mat    Output of script different_UC_state_average_temperate_pro.m.
    pro_health_mild.mat    Output of script different_UC_state_average_temperate_pro.m.
    pro_health_moderate.mat    Output of script different_UC_state_average_temperate_pro.m.
    Figure_3_0509.opju    Origin file for Figure 3.
    Figure_3_revised.png    Figure 3 in the manuscript.
File    Description
phage_pro.m    Script for calculating the phage proportion in the metagenome data of each Health and UC sample.
metagenome_temperate_pro.m    Script for calculating the temperate phage proportion in the metagenome data of each Health and UC sample.
virome_temperate_pro.m    Script for calculating the temperate phage proportion in the virome data of each Health and UC sample.
metagenome_phage_pro.mat    Output of script phage_pro.m
result_temp_pro_meta.mat    Output of script metagenome_temperate_pro.m
result_virome.mat    Output of script virome_temperate_pro.m
calculate_p_value.r    Script for testing the significance of the difference in temperate phage proportions between Health and UC samples, for both metagenome and virome data.
File
vlp_site_disease_used_in_DeePhage.xlsx    Information on the virome UC samples (54) and Health samples (23) used in our study.
------------------------------------------------------------------------------------------------------------------------------------------------
File: tools_validation_comparsion.7z. After decompression, the folder contains data, related results and scripts.
------------------------------------------------------------------------------------------------------------------------------------------------
Folder
fna_file
    Subfolder    Description
    100_400_1_test    20000 test DNA sequences from the 1st cross validation set of Group A.
    100_400_2_test    20000 test DNA sequences from the 2nd cross validation set of Group A.
    100_400_3_test    20000 test DNA sequences from the 3rd cross validation set of Group A.
    100_400_4_test    20000 test DNA sequences from the 4th cross validation set of Group A.
    100_400_5_test    20000 test DNA sequences from the 5th cross validation set of Group A.
    400_800_1_test    20000 test DNA sequences from the 1st cross validation set of Group B.
    400_800_2_test    20000 test DNA sequences from the 2nd cross validation set of Group B.
    400_800_3_test    20000 test DNA sequences from the 3rd cross validation set of Group B.
    400_800_4_test    20000 test DNA sequences from the 4th cross validation set of Group B.
    400_800_5_test    20000 test DNA sequences from the 5th cross validation set of Group B.
    800_1200_1_test    20000 test DNA sequences from the 1st cross validation set of Group C.
    800_1200_2_test    20000 test DNA sequences from the 2nd cross validation set of Group C.
    800_1200_3_test    20000 test DNA sequences from the 3rd cross validation set of Group C.
    800_1200_4_test    20000 test DNA sequences from the 4th cross validation set of Group C.
    800_1200_5_test    20000 test DNA sequences from the 5th cross validation set of Group C.
    1200_1800_1_test    20000 test DNA sequences from the 1st cross validation set of Group D.
    1200_1800_2_test    20000 test DNA sequences from the 2nd cross validation set of Group D.
    1200_1800_3_test    20000 test DNA sequences from the 3rd cross validation set of Group D.
    1200_1800_4_test    20000 test DNA sequences from the 4th cross validation set of Group D.
    1200_1800_5_test    20000 test DNA sequences from the 5th cross validation set of Group D.
FragGeneScan_pre_result
    File    Description
    whether_pre_gene.m    Script for judging whether a sequence in each test set has a protein sequence predicted by FragGeneScan.
    100_400_1_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 1st cross validation set of Group A.
    100_400_2_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 2nd cross validation set of Group A.
    100_400_3_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 3rd cross validation set of Group A.
    100_400_4_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 4th cross validation set of Group A.
    100_400_5_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 5th cross validation set of Group A.
    400_800_1_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 1st cross validation set of Group B.
    400_800_2_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 2nd cross validation set of Group B.
    400_800_3_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 3rd cross validation set of Group B.
    400_800_4_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 4th cross validation set of Group B.
    400_800_5_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 5th cross validation set of Group B.
    800_1200_1_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 1st cross validation set of Group C.
    800_1200_2_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 2nd cross validation set of Group C.
    800_1200_3_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 3rd cross validation set of Group C.
    800_1200_4_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 4th cross validation set of Group C.
    800_1200_5_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 5th cross validation set of Group C.
    1200_1800_1_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 1st cross validation set of Group D.
    1200_1800_2_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 2nd cross validation set of Group D.
    1200_1800_3_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 3rd cross validation set of Group D.
    1200_1800_4_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 4th cross validation set of Group D.
    1200_1800_5_test_pre_gene.csv    Labels indicating whether a sequence has a protein sequence predicted by FragGeneScan, for the 5th cross validation set of Group D.
    Subfolder    Description
    100_400_1_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 1st cross validation set of Group A.
    100_400_2_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 2nd cross validation set of Group A.
    100_400_3_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 3rd cross validation set of Group A.
    100_400_4_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 4th cross validation set of Group A.
    100_400_5_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 5th cross validation set of Group A.
    400_800_1_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 1st cross validation set of Group B.
    400_800_2_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 2nd cross validation set of Group B.
    400_800_3_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 3rd cross validation set of Group B.
    400_800_4_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 4th cross validation set of Group B.
    400_800_5_test    Gene prediction results by FragGeneScan for the 20000 test DNA sequences from the 5th cross validation set of Group B.
800_1200_1_test    Gene prediction results of 20000 test DNA sequences from the 1st cross validation set of Group C by FragGeneScan.
800_1200_2_test    Gene prediction results of 20000 test DNA sequences from the 2nd cross validation set of Group C by FragGeneScan.
800_1200_3_test    Gene prediction results of 20000 test DNA sequences from the 3rd cross validation set of Group C by FragGeneScan.
800_1200_4_test    Gene prediction results of 20000 test DNA sequences from the 4th cross validation set of Group C by FragGeneScan.
800_1200_5_test    Gene prediction results of 20000 test DNA sequences from the 5th cross validation set of Group C by FragGeneScan.
1200_1800_1_test    Gene prediction results of 20000 test DNA sequences from the 1st cross validation set of Group D by FragGeneScan.
1200_1800_2_test    Gene prediction results of 20000 test DNA sequences from the 2nd cross validation set of Group D by FragGeneScan.
1200_1800_3_test    Gene prediction results of 20000 test DNA sequences from the 3rd cross validation set of Group D by FragGeneScan.
1200_1800_4_test    Gene prediction results of 20000 test DNA sequences from the 4th cross validation set of Group D by FragGeneScan.
1200_1800_5_test    Gene prediction results of 20000 test DNA sequences from the 5th cross validation set of Group D by FragGeneScan.

label
File    Description
100_400_1_test_label.csv    Labels of 20000 test DNA sequences from the 1st cross validation set of Group A, output by script get_label.m
100_400_2_test_label.csv    Labels of 20000 test DNA sequences from the 2nd cross validation set of Group A, output by script get_label.m
100_400_3_test_label.csv    Labels of 20000 test DNA sequences from the 3rd cross validation set of Group A, output by script get_label.m
100_400_4_test_label.csv    Labels of 20000 test DNA sequences from the 4th cross validation set of Group A, output by script get_label.m
100_400_5_test_label.csv    Labels of 20000 test DNA sequences from the 5th cross validation set of Group A, output by script get_label.m
400_800_1_test_label.csv    Labels of 20000 test DNA sequences from the 1st cross validation set of Group B, output by script get_label.m
400_800_2_test_label.csv    Labels of 20000 test DNA sequences from the 2nd cross validation set of Group B, output by script get_label.m
400_800_3_test_label.csv    Labels of 20000 test DNA sequences from the 3rd cross validation set of Group B, output by script get_label.m
400_800_4_test_label.csv    Labels of 20000 test DNA sequences from the 4th cross validation set of Group B, output by script get_label.m
400_800_5_test_label.csv    Labels of 20000 test DNA sequences from the 5th cross validation set of Group B, output by script get_label.m
800_1200_1_test_label.csv    Labels of 20000 test DNA sequences from the 1st cross validation set of Group C, output by script get_label.m
800_1200_2_test_label.csv    Labels of 20000 test DNA sequences from the 2nd cross validation set of Group C, output by script get_label.m
800_1200_3_test_label.csv    Labels of 20000 test DNA sequences from the 3rd cross validation set of Group C, output by script get_label.m
800_1200_4_test_label.csv    Labels of 20000 test DNA sequences from the 4th cross validation set of Group C, output by script get_label.m
800_1200_5_test_label.csv    Labels of 20000 test DNA sequences from the 5th cross validation set of Group C, output by script get_label.m
1200_1800_1_test_label.csv    Labels of 20000 test DNA sequences from the 1st cross validation set of Group D, output by script get_label.m
1200_1800_2_test_label.csv    Labels of 20000 test DNA sequences from the 2nd cross validation set of Group D, output by script get_label.m
1200_1800_3_test_label.csv    Labels of 20000 test DNA sequences from the 3rd cross validation set of Group D, output by script get_label.m
1200_1800_4_test_label.csv    Labels of 20000 test DNA sequences from the 4th cross validation set of Group D, output by script get_label.m
1200_1800_5_test_label.csv    Labels of 20000 test DNA sequences from the 5th cross validation set of Group D, output by script get_label.m

phacts_predict_result
Subfolder    Description
100_400_1_test    Output files of 20000 test gene sequences from the 1st cross validation set of Group A by PHACTS.
100_400_2_test    Output files of 20000 test gene sequences from the 2nd cross validation set of Group A by PHACTS.
100_400_3_test    Output files of 20000 test gene sequences from the 3rd cross validation set of Group A by PHACTS.
100_400_4_test    Output files of 20000 test gene sequences from the 4th cross validation set of Group A by PHACTS.
100_400_5_test    Output files of 20000 test gene sequences from the 5th cross validation set of Group A by PHACTS.
400_800_1_test    Output files of 20000 test gene sequences from the 1st cross validation set of Group B by PHACTS.
400_800_2_test    Output files of 20000 test gene sequences from the 2nd cross validation set of Group B by PHACTS.
400_800_3_test    Output files of 20000 test gene sequences from the 3rd cross validation set of Group B by PHACTS.
400_800_4_test    Output files of 20000 test gene sequences from the 4th cross validation set of Group B by PHACTS.
400_800_5_test    Output files of 20000 test gene sequences from the 5th cross validation set of Group B by PHACTS.
800_1200_1_test    Output files of 20000 test gene sequences from the 1st cross validation set of Group C by PHACTS.
800_1200_2_test    Output files of 20000 test gene sequences from the 2nd cross validation set of Group C by PHACTS.
800_1200_3_test    Output files of 20000 test gene sequences from the 3rd cross validation set of Group C by PHACTS.
800_1200_4_test    Output files of 20000 test gene sequences from the 4th cross validation set of Group C by PHACTS.
800_1200_5_test    Output files of 20000 test gene sequences from the 5th cross validation set of Group C by PHACTS.
1200_1800_1_test    Output files of 20000 test gene sequences from the 1st cross validation set of Group D by PHACTS.
1200_1800_2_test    Output files of 20000 test gene sequences from the 2nd cross validation set of Group D by PHACTS.
1200_1800_3_test    Output files of 20000 test gene sequences from the 3rd cross validation set of Group D by PHACTS.
1200_1800_4_test    Output files of 20000 test gene sequences from the 4th cross validation set of Group D by PHACTS.
1200_1800_5_test    Output files of 20000 test gene sequences from the 5th cross validation set of Group D by PHACTS.

File    Description
phacts_calculate_validation_accuracy_sn_sp_for_table_1.m    Script for calculating the acc, sn and sp in each cross validation in Table 1.
phacts_except_no_gene_for_figure_1.m    Script for calculating the acc, sn and sp in each cross validation in Figure 1 (excluding sequences without a predicted gene).
100_400_1_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 1st cross validation set of Group A.
100_400_2_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 2nd cross validation set of Group A.
100_400_3_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 3rd cross validation set of Group A.
100_400_4_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 4th cross validation set of Group A.
100_400_5_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 5th cross validation set of Group A.
400_800_1_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 1st cross validation set of Group B.
400_800_2_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 2nd cross validation set of Group B.
400_800_3_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 3rd cross validation set of Group B.
400_800_4_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 4th cross validation set of Group B.
400_800_5_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 5th cross validation set of Group B.
800_1200_1_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 1st cross validation set of Group C.
800_1200_2_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 2nd cross validation set of Group C.
800_1200_3_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 3rd cross validation set of Group C.
800_1200_4_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 4th cross validation set of Group C.
800_1200_5_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 5th cross validation set of Group C.
1200_1800_1_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 1st cross validation set of Group D.
1200_1800_2_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 2nd cross validation set of Group D.
1200_1800_3_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 3rd cross validation set of Group D.
1200_1800_4_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 4th cross validation set of Group D.
1200_1800_5_test_phacts_pre.csv    Prediction results by PHACTS of 20000 test DNA sequences from the 5th cross validation set of Group D.

DeePhage_predict_result
Subfolder
DeePhage_pre_when_cross_validation    Prediction results by DeePhage and labels in each cross validation.
File
rearrange_T_test_to_calculate_validation_accuracy_sn_sp.m    Script for calculating the acc, sn and sp in each cross validation in Table 1.
rearrange_DeePhage_except_no_gene.m    Script for calculating the acc, sn and sp in each cross validation in Figure 1 (excluding sequences without a predicted gene).

PhagePred_predict_result
File    Description
PhagePred_calculate_validation_accuracy_sn_sp_for_table_1.m    Script for calculating the acc, sn and sp in each cross validation in Table 1.
PhagePred_except_no_gene_for_figure_1.m    Script for calculating the acc, sn and sp in each cross validation in Figure 1 (excluding sequences without a predicted gene).
100_400_1_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 1st cross validation set of Group A.
100_400_2_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 2nd cross validation set of Group A.
100_400_3_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 3rd cross validation set of Group A.
100_400_4_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 4th cross validation set of Group A.
100_400_5_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 5th cross validation set of Group A.
400_800_1_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 1st cross validation set of Group B.
400_800_2_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 2nd cross validation set of Group B.
400_800_3_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 3rd cross validation set of Group B.
400_800_4_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 4th cross validation set of Group B.
400_800_5_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 5th cross validation set of Group B.
800_1200_1_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 1st cross validation set of Group C.
800_1200_2_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 2nd cross validation set of Group C.
800_1200_3_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 3rd cross validation set of Group C.
800_1200_4_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 4th cross validation set of Group C.
800_1200_5_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 5th cross validation set of Group C.
1200_1800_1_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 1st cross validation set of Group D.
1200_1800_2_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 2nd cross validation set of Group D.
1200_1800_3_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 3rd cross validation set of Group D.
1200_1800_4_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 4th cross validation set of Group D.
1200_1800_5_test_PhagePred_pre.csv    Prediction results by PhagePred of 20000 test DNA sequences from the 5th cross validation set of Group D.
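
The *_calculate_validation_accuracy_sn_sp_* and *_except_no_gene_* scripts listed above report accuracy (acc), sensitivity (sn) and specificity (sp) per cross validation. The sketch below shows the standard definitions of these three metrics in Python; it is not the authors' MATLAB code. It assumes labels and predictions are already loaded as 0/1 lists with 1 as the positive class (the listing does not state which class is positive, nor the CSV layout), and the optional has_gene mask is a hypothetical stand-in for the *_test_pre_gene.csv labels.

```python
import math


def acc_sn_sp(labels, preds, has_gene=None):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels and predictions.

    Assumption: 1 is the positive class. If has_gene is given, sequences whose
    mask entry is falsy are skipped, mimicking the "excluding sequences
    without a predicted gene" evaluation.
    """
    if has_gene is None:
        has_gene = [True] * len(labels)
    if not (len(labels) == len(preds) == len(has_gene)):
        raise ValueError("labels, preds and has_gene must have equal length")

    tp = tn = fp = fn = 0
    for y, p, keep in zip(labels, preds, has_gene):
        if not keep:
            continue
        if y == 1 and p == 1:
            tp += 1
        elif y == 0 and p == 0:
            tn += 1
        elif y == 0 and p == 1:
            fp += 1
        else:
            fn += 1

    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else math.nan
    sn = tp / (tp + fn) if (tp + fn) else math.nan
    sp = tn / (tn + fp) if (tn + fp) else math.nan
    return acc, sn, sp


# Toy data, not taken from the dataset.
labels = [1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 1, 0]
has_gene = [1, 1, 1, 0, 1, 1]

all_metrics = acc_sn_sp(labels, preds)
gene_only_metrics = acc_sn_sp(labels, preds, has_gene)
print(all_metrics, gene_only_metrics)
```

Passing the mask drops the fourth toy sequence, so the false positive it contributed no longer counts against specificity.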

File    Description
100_400_1_test.fna    20000 test DNA sequences integrated into a single file from the 1st cross validation set of Group A.
100_400_2_test.fna    20000 test DNA sequences integrated into a single file from the 2nd cross validation set of Group A.
100_400_3_test.fna    20000 test DNA sequences integrated into a single file from the 3rd cross validation set of Group A.
100_400_4_test.fna    20000 test DNA sequences integrated into a single file from the 4th cross validation set of Group A.
100_400_5_test.fna    20000 test DNA sequences integrated into a single file from the 5th cross validation set of Group A.
400_800_1_test.fna    20000 test DNA sequences integrated into a single file from the 1st cross validation set of Group B.
400_800_2_test.fna    20000 test DNA sequences integrated into a single file from the 2nd cross validation set of Group B.
400_800_3_test.fna    20000 test DNA sequences integrated into a single file from the 3rd cross validation set of Group B.
400_800_4_test.fna    20000 test DNA sequences integrated into a single file from the 4th cross validation set of Group B.
400_800_5_test.fna    20000 test DNA sequences integrated into a single file from the 5th cross validation set of Group B.
800_1200_1_test.fna    20000 test DNA sequences integrated into a single file from the 1st cross validation set of Group C.
800_1200_2_test.fna    20000 test DNA sequences integrated into a single file from the 2nd cross validation set of Group C.
800_1200_3_test.fna    20000 test DNA sequences integrated into a single file from the 3rd cross validation set of Group C.
800_1200_4_test.fna    20000 test DNA sequences integrated into a single file from the 4th cross validation set of Group C.
800_1200_5_test.fna    20000 test DNA sequences integrated into a single file from the 5th cross validation set of Group C.
1200_1800_1_test.fna    20000 test DNA sequences integrated into a single file from the 1st cross validation set of Group D.
1200_1800_2_test.fna    20000 test DNA sequences integrated into a single file from the 2nd cross validation set of Group D.
1200_1800_3_test.fna    20000 test DNA sequences integrated into a single file from the 3rd cross validation set of Group D.
1200_1800_4_test.fna    20000 test DNA sequences integrated into a single file from the 4th cross validation set of Group D.
1200_1800_5_test.fna    20000 test DNA sequences integrated into a single file from the 5th cross validation set of Group D.
seperate_20000_sequence.m    Script for separating each test DNA sequence in each cross validation into its own file.
get_label.m    Script for getting the true label of every sequence in each cross validation.
-------------------------------------------------------------------------------------------------------------------------------------------------
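
The file names above follow one pattern: a length-range prefix (100_400 for Group A, 400_800 for Group B, 800_1200 for Group C, 1200_1800 for Group D), the cross validation number 1 to 5, and a suffix such as _test.fna or _test_label.csv. The Python sketch below enumerates the 20 names expected for a given suffix and reports which are absent from a folder, which may help when checking a decompressed copy. The function names and the folder argument are illustrative assumptions and are not part of the dataset.

```python
from pathlib import Path

# Group letter -> file-name prefix, as used throughout the listing above.
GROUPS = {"A": "100_400", "B": "400_800", "C": "800_1200", "D": "1200_1800"}
FOLDS = range(1, 6)  # five cross validation sets per group


def expected_names(suffix):
    """Return the 20 file or folder names expected for a suffix such as '_test.fna'."""
    return [f"{prefix}_{fold}{suffix}" for prefix in GROUPS.values() for fold in FOLDS]


def missing_entries(folder, suffix):
    """Return the expected names that do not exist inside the given folder."""
    root = Path(folder)
    return [name for name in expected_names(suffix) if not (root / name).exists()]


fna_names = expected_names("_test.fna")
print(len(fna_names), fna_names[0], fna_names[-1])
```

For example, missing_entries("path/to/folder", "_test_label.csv") would list any label files absent from that (hypothetical) folder.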