Supporting data for "DeePhage: distinguish virulent and temperate phage-derived sequences in metavirome data with a deep learning approach"
=======================================================================================================================================

Shufang W; Zhencheng F; Jie T; Mo L; Chunhui W; Qian G; Congmin X; Xiaoqing J; Huqiqiu Z (2020): Supporting data for "DeePhage: distinguish virulent and temperate phage-derived sequences in metavirome data with a deep learning approach".

Summary:
-------------------------------------------------------------------------------------------------------------------------------------------
DeePhage is designed to classify metavirome sequences as virulent phage-derived or temperate phage-derived fragments. The program calculates a score between 0 and 1 for each input fragment. A sequence with a score higher than 0.5 is regarded as a virulent phage-derived fragment, and a sequence with a score lower than 0.5 is regarded as a temperate phage-derived fragment. DeePhage can run either in a virtual machine or on a physical host. For users without a computing background, we recommend running the virtual machine version of DeePhage on a local PC, so that no dependency packages need to be installed. If a GPU is available, you can instead run the physical host version, which automatically uses the GPU for acceleration and is better suited to large-scale data.
-------------------------------------------------------------------------------------------------------------------------------------------
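
The scoring rule in the summary can be illustrated with a short Python sketch. This is not code from DeePhage itself: the function name classify_fragment and the "undetermined" label for a score of exactly 0.5 are assumptions made here for illustration, since the summary only defines the cases above and below 0.5.

```python
def classify_fragment(score, threshold=0.5):
    """Map a score in [0, 1] to a lifestyle label (illustrative helper only)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score > threshold:
        return "virulent"
    if score < threshold:
        return "temperate"
    # The summary does not say how a score of exactly 0.5 is treated;
    # returning "undetermined" is an assumption of this sketch.
    return "undetermined"


if __name__ == "__main__":
    for s in (0.91, 0.12, 0.5):
        print(s, classify_fragment(s))
```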


The following describes each file.

File: md5.txt                The md5 value of each file.

File: DeePhage_v_1_0.zip     Code and executable file of DeePhage.

File: VM_Bioinfo.vdi.7z      The virtual machine of DeePhage.

File: Figure1_Figure_S2.7z   Figure 1 and Figure S2.

File: training.7z. After decompression, the folder contains data, related results and scripts.
-------------------------------------------------------------------------------------------------------------------------------------------
Folder                                      Description
five_fold_validation_model                  Output of trained models for each validation group
five_fold_validation_prediction_label       Output of predicted results and labels for each validation group
seperated_test_training_phages              Phage genomes used in each cross validation, including training and test sets.
MetaSim_simulating                          Output of MetaSim for training and testing sequences in each validation.
result_encoding_onehot                      Output of script encode_onehot_for_training_set.m and encode_onehot_for_test_set.m
Figures_S3                                  Script and figures for Figure S3.

train_different_models                      Folder containing data for the five alternative models (Kmer-4, No-Maxpooling, No-Dropout, No-Globalpooling, and No-BN).
          File
          cal_sn_sp_acc.m    Script for calculating the accuracy, sn and sp of each model.
          
          subfolder
          kmer_4                                  Folder containing data for the Kmer-4 model
                      File
                      adjust_uncertain_nt.m    Script called by seq_to_frequency.m
                      kmer_4_string.m             Script for coding 4-mer strings.
                      kmer_order_4.mat           Output of script  kmer_4_string.m
                      seq_to_frequency.m        Script called by kmer_frequency_for_single_seq_test_set.m and kmer_frequency_for_single_seq_and_complement_seq_train_set.m
                      kmer_frequency_for_single_seq_test_set.m    Script for coding test sequences into 4-mer frequency.
                      kmer_frequency_for_single_seq_and_complement_seq_train_set.m      Script for coding train sequences into 4-mer frequency.
                      test_1200_1800_2_kmer_4.mat      Output of script  kmer_frequency_for_single_seq_test_set.m
                      test_label_1200_1800_2_kmer_4.mat   Output of script  kmer_frequency_for_single_seq_test_set.m
                      train_1200_1800_2_kmer_4.mat        Output of script kmer_frequency_for_single_seq_and_complement_seq_train_set.m
                      train_label_1200_1800_2_kmer_4.mat   Output of script kmer_frequency_for_single_seq_and_complement_seq_train_set.m
                      train_model_1800_kmer_4.py    Script for training the Kmer-4 model.
                      Kmer_4_model.h5       Output of script  train_model_1800_kmer_4.py
                      Kmer_4_prediction.csv     Output of script  train_model_1800_kmer_4.py
                      Kmer_4_output.txt      Output of script  train_model_1800_kmer_4.py
            
         train_test_fasta
                      subfolder
                      1200_1800_2
                                 File
                                 temp_1200_1800_2.fna   Simulated temperate training sequences from the 2nd cross-validation set in Group D.
                                 viru_1200_1800_2.fna   Simulated virulent training sequences from the 2nd cross-validation set in Group D.
                                 P_test.mat    One-hot encoding of the 2nd cross-validation test set in Group D.
                                 P_train_ds.mat    One-hot encoding of the 2nd cross-validation training set in Group D.
                                 T_test.mat   Labels for the 2nd cross-validation test set in Group D.
                                 T_train_ds.mat   Labels for the 2nd cross-validation training set in Group D.
                                 predict_1200_1800_2.csv  Prediction results by DeePhage for the 2nd cross-validation test set in Group D.

         No_BN
                        File
                        No_BN_model.h5       Output of script  train_model_1800_No_BN.py
                        No_BN_prediction.csv     Output of script  train_model_1800_No_BN.py
                        No_BN_output.txt      Output of script  train_model_1800_No_BN.py 
                        train_model_1800_No_BN.py    Script for training No-BN model.           

         No_Dropout
                        File
                        No_Dropout_model.h5       Output of script  train_model_1800_No_Dropout.py
                        No_Dropout_prediction.csv     Output of script  train_model_1800_No_Dropout.py
                        No_Dropout_output.txt      Output of script  train_model_1800_No_Dropout.py 
                        train_model_1800_No_Dropout.py    Script for training No-Dropout model.       

         No_Globalpooling
                        File
                        No_Globalpooling_model.h5       Output of script  train_model_1800_No_Globalpooling.py
                        No_Globalpooling_prediction.csv     Output of script  train_model_1800_No_Globalpooling.py
                        No_Globalpooling_output.txt      Output of script  train_model_1800_No_Globalpooling.py 
                        train_model_1800_No_Globalpooling.py    Script for training No-Globalpooling model. 

         No_Maxpooling
                        File
                        No_Maxpooling_model.h5       Output of script  train_model_1800_No_Maxpooling.py
                        No_Maxpooling_prediction.csv     Output of script  train_model_1800_No_Maxpooling.py
                        No_Maxpooling_output.txt      Output of script  train_model_1800_No_Maxpooling.py 
                        train_model_1800_No_Maxpooling.py    Script for training No-Maxpooling model. 

  
File                                        Description
extract_testing_and_training_seq.m          Script for dividing the dataset into training and testing set in each validation.
encode_onehot_for_training_set.m            Script for encoding training sequences with one-hot form in each validation.
encode_onehot_for_test_set.m                Script for encoding test sequences with one-hot form in each validation.
delet_two_virulent.m                        Script for deleting two ambiguous virulent phages in Dataset-1.
Virulent_delet.fasta                        Output for script delet_two_virulent.m.
Dataset-1_virulent.fasta                    Complete genomes of virulent phages from Dataset-1.
Dataset-1_temperate.fasta                   Complete genomes of temperate phages from Dataset-1.
Dataset-2_virulent.fasta                    Complete genomes of virulent phages from Dataset-2.
Dataset-2_temperate.fasta                   Complete genomes of temperate phages from Dataset-2.
calculate_validation_accuracy_sn_sp.m       Script for calculating the accuracy, sn and sp in each validation.
adjust_uncertain_nt.m                       Script called by encode_onehot.m.
dividing_testing_and_training_phage.m       Script for dividing phages into testing and training parts for Dataset-1.
train_test_num.mat                          Output of script dividing_testing_and_training_phage.m
validation_acc_sn_sp.mat                    Output of script calculate_validation_accuracy_sn_sp.m.
train_model_400.py                          Script for training with 100-400bp groups.
train_model_800.py                          Script for training with 400-800bp groups.
train_model_1200.py                         Script for training with 800-1200bp groups.
train_model_1800.py                         Script for training with 1200-1800bp groups.
-------------------------------------------------------------------------------------------------------------------------------------------
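
For readers who want to reproduce the metrics without MATLAB, the accuracy, sn and sp values computed by scripts such as calculate_validation_accuracy_sn_sp.m and cal_sn_sp_acc.m can be approximated with the Python sketch below. It reads sn and sp as sensitivity and specificity and assumes, without having checked the MATLAB code, that virulent is the positive class, that labels are encoded as 1 = virulent and 0 = temperate, and that a score above 0.5 counts as a virulent call. The function name acc_sn_sp is hypothetical.

```python
def acc_sn_sp(labels, scores, threshold=0.5):
    """Return (accuracy, sensitivity, specificity) from labels and scores.

    Assumptions of this sketch: label 1 = virulent (positive class),
    label 0 = temperate, and score > threshold is a virulent prediction.
    """
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal in length")
    tp = tn = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score > threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 0:
            tn += 1
        else:
            fp += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against a class being absent from the labels.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    print(acc_sn_sp([1, 1, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```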

File: all_train_all_CDS.7z. After decompression, the folder contains data, related results and scripts.
-------------------------------------------------------------------------------------------------------------------------------------------
Folder                                      Description
temperate_all_gb_file                       Output of get_gb_file.py.
virulent_all_gb_file                        Output of get_gb_file.py.
CDS_faa_fasta_temperate                     Output of all_CDS_temperate.m.
CDS_faa_fasta_virulent                      Output of all_CDS_virulent.m.
phacts_result                           
          Subfolder
          all_CDS_temperate_faa    Output of predictions of all temperate CDS from PHACTS.
          all_CDS_virulent_faa    Output of predictions of all virulent CDS from PHACTS.
PhagePred_result
          File
          all_temperate_CDS_predict_result_use_PhagePred.csv   Output of predictions of all temperate CDS from PhagePred.
          all_virulent_CDS_predict_result_use_PhagePred.csv   Output of predictions of all virulent CDS from PhagePred.
File                                            Description
get_gb_file.py                                  Script for getting the GenBank files from NCBI for all the virulent and temperate phages.
get_NC_accession.m                              Script for getting the accessions of all virulent and temperate phages.
header_test.mat                                 Output of get_NC_accession.m.
all_CDS_temperate.m                             Script for extracting the DNA and protein sequences of all CDS regions from temperate phages.
all_CDS_virulent.m                              Script for extracting the DNA and protein sequences of all CDS regions from virulent phages.
all_CDS_temperate_for_DeePhage_in_one_file.m    Script for integrating all the temperate CDS into one file.
all_CDS_virulent_for_DeePhage_in_one_file.m     Script for integrating all the virulent CDS into one file.
delet2_temp_all_CDS_in_one_file.fasta           Output of script all_CDS_temperate_for_DeePhage_in_one_file.m
delet2_viru_all_CDS_in_one_file.fasta           Output of script all_CDS_virulent_for_DeePhage_in_one_file.m
delet2_temp_all_CDS_in_one_file.csv             Output of predictions of all temperate CDS from DeePhage.
delet2_viru_all_CDS_in_one_file.csv             Output of predictions of all virulent CDS from DeePhage.
Dataset-1_temperate.fasta                       Complete genomes of temperate phages in Dataset-1.
Dataset-1_virulent.fasta                        Complete genomes of virulent phages in Dataset-1.
acc_DeePhage.m                                  Script for calculating the accuracy of DeePhage.
acc_PHACTS.m                                    Script for calculating the accuracy of PHACTS.
acc_PhagePred.m                                 Script for calculating the accuracy of PhagePred.
---------------------------------------------------------------------------------------------------------------------------------------------

File: bacteria_predict.7z. After decompression, the folder contains data, related results and script.
---------------------------------------------------------------------------------------------------------------------------------------------
File                                        Description
prok_reference_genomes.txt                  Downloaded information on the 120 bacteria.
prok_reference_genomes.xlsx                 The NC accession numbers of the 120 bacteria.
bacteria_120reference_fna.zip               Zip file of the whole-genome files of the 120 bacteria and their md5 file.
bacteria_100_400.fna                        MetaSim output: sequences simulated from the 120 bacterial genomes, with lengths ranging from 100bp to 400bp.
bacteria_400_800.fna                        MetaSim output: sequences simulated from the 120 bacterial genomes, with lengths ranging from 400bp to 800bp.
bacteria_800_1200.fna                       MetaSim output: sequences simulated from the 120 bacterial genomes, with lengths ranging from 800bp to 1200bp.
bacteria_1200_1800.fna                      MetaSim output: sequences simulated from the 120 bacterial genomes, with lengths ranging from 1200bp to 1800bp.
bacteria_100_400.csv                        Output of the prediction from DeePhage for the file bacteria_100_400.fna.
bacteria_400_800.csv                        Output of the prediction from DeePhage for the file bacteria_400_800.fna.
bacteria_800_1200.csv                       Output of the prediction from DeePhage for the file bacteria_800_1200.fna.
bacteria_1200_1800.csv                      Output of the prediction from DeePhage for the file bacteria_1200_1800.fna.
temp_proportion.m                           Script for calculating the proportion of bacterial sequences predicted as temperate phages in each length range.
-----------------------------------------------------------------------------------------------------------------------------------------------

File: all_data_to_train_model.7z. After decompression, the folder contains data, related results and script.
---------------------------------------------------------------------------------------------------------------------------------------------
Subfolder                                          Description
          100_400
                     File
                     temp_100_400.fna              80000 short sequences (ranging from 100bp to 400bp) generated with MetaSim from all the temperate phage genomes (from Dataset-1 and Dataset-2).
                     viru_100_400.fna              80000 short sequences (ranging from 100bp to 400bp) generated with MetaSim from all the virulent phage genomes (from Dataset-1 and Dataset-2).
                     T_train_ds.mat                Labels of 80000 sequences.
                     P_train_ds.mat                Encoded form ("one-hot") of 80000 sequences.
                     100_400_all_train_model.h5    Trained model of Group A; output of script train_model_all_train_400.py

          400_800
                     File
                     temp_400_800.fna              80000 short sequences (ranging from 400bp to 800bp) generated with MetaSim from all the temperate phage genomes (from Dataset-1 and Dataset-2).
                     viru_400_800.fna              80000 short sequences (ranging from 400bp to 800bp) generated with MetaSim from all the virulent phage genomes (from Dataset-1 and Dataset-2).
                     T_train_ds.mat                Labels of 80000 sequences.
                     P_train_ds.mat                Encoded form ("one-hot") of 80000 sequences.
                     400_800_all_train_model.h5    Trained model of Group B; output of script train_model_all_train_800.py

          800_1200
                     File
                     temp_800_1200.fna              80000 short sequences (ranging from 800bp to 1200bp) generated with MetaSim from all the temperate phage genomes (from Dataset-1 and Dataset-2).
                     viru_800_1200.fna              80000 short sequences (ranging from 800bp to 1200bp) generated with MetaSim from all the virulent phage genomes (from Dataset-1 and Dataset-2).
                     T_train_ds.mat                 Labels of 80000 sequences.
                     P_train_ds.mat                 Encoded form ("one-hot") of 80000 sequences.
                     800_1200_all_train_model.h5    Trained model of Group C; output of script train_model_all_train_1200.py

          1200_1800
                     File
                     temp_1200_1800.fna              80000 short sequences (ranging from 1200bp to 1800bp) generated with MetaSim from all the temperate phage genomes (from Dataset-1 and Dataset-2).
                     viru_1200_1800.fna              80000 short sequences (ranging from 1200bp to 1800bp) generated with MetaSim from all the virulent phage genomes (from Dataset-1 and Dataset-2).
                     T_train_ds.mat                  Labels of 80000 sequences.
                     P_train_ds.mat                  Encoded form ("one-hot") of 80000 sequences.
                     1200_1800_all_train_model.h5    Trained model of Group D; output of script train_model_all_train_1800.py

          t_SNE_net                                  Description
                 Subfolder
                 results                             Files output by script layer_weight.py
                 plots_using_origin                  Files used for plotting Figure 2 in Origin.
                 
                 File
                 layer_weight_get_weight.py          Script for calculating the weight matrix of each layer from Group D and saving the weights.
                 pca_process.py                      Script for visualizations by dimensionality reduction.
                 T_train_ds.mat                      Labels of 80000 sequences from all the phage genomes in Group D.
                 P_train_ds.mat                      Encoded form ("one-hot") of 80000 sequences from all the phage genomes in Group D.
                 *.npy                               Output of script layer_weight_get_weight.py.        

File                                   Description
adjust_uncertain_nt.m                  Script called by script.m.
encoding_to_onehot_100_400.m           Script for encoding the training sequence in Group A into "one-hot" form.
encoding_to_onehot_400_800.m           Script for encoding the training sequence in Group B into "one-hot" form.
encoding_to_onehot_800_1200.m          Script for encoding the training sequence in Group C into "one-hot" form.
encoding_to_onehot_1200_1800.m         Script for encoding the training sequence in Group D into "one-hot" form.
Dataset-1_virulent.fasta               Complete genomes of virulent phages from Dataset-1.
Dataset-1_temperate.fasta              Complete genomes of temperate phages from Dataset-1.
Dataset-2_virulent.fasta               Complete genomes of virulent phages from Dataset-2.
Dataset-2_temperate.fasta              Complete genomes of temperate phages from Dataset-2.
train_model_all_train_400.py           Script for training new model using all the dataset in Group A.
train_model_all_train_800.py           Script for training new model using all the dataset in Group B.
train_model_all_train_1200.py          Script for training new model using all the dataset in Group C.
train_model_all_train_1800.py          Script for training new model using all the dataset in Group D.
-----------------------------------------------------------------------------------------------------------------------------------------------
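
The "one-hot" form produced by the encoding_to_onehot_*.m scripts can be pictured with the Python sketch below. It is only an illustration, not a port of the MATLAB scripts: the column order A, C, G, T, the all-zero row for ambiguous bases, and the zero padding up to a fixed length are assumptions made here, and the helper name one_hot_encode is hypothetical. How adjust_uncertain_nt.m actually treats uncertain nucleotides is not described in this file.

```python
BASES = "ACGT"  # assumed column order; the MATLAB scripts may use another


def one_hot_encode(sequence, length):
    """Encode a DNA string as a length x 4 matrix of 0/1 values.

    Assumptions of this sketch: bases outside A/C/G/T become all-zero rows,
    sequences longer than `length` are truncated, shorter ones zero-padded.
    """
    column = {base: i for i, base in enumerate(BASES)}
    matrix = [[0, 0, 0, 0] for _ in range(length)]
    for position, base in enumerate(sequence.upper()[:length]):
        index = column.get(base)
        if index is not None:
            matrix[position][index] = 1
    return matrix


if __name__ == "__main__":
    for row in one_hot_encode("ACGTN", 6):
        print(row)
```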

File: PCA.7z. After decompression, the folder contains data, related results and script.
---------------------------------------------------------------------------------------------------------------------------------------------
File                                        Description
temp_viru_kmer_frequency.m                  Script for calculating the 4-mer frequency of virulent and temperate phage whole genomes.
temp_frequency.mat                          Output of script temp_viru_kmer_frequency.m
viru_frequency.mat                          Output of script temp_viru_kmer_frequency.m
cal_PCA.m                                   Script for calculating the results of PCA.
kmer_4_PCA.opju                             Origin files for two-dimension visualization.
PCA_4.png                                   Figure S1.
PCA_4_300_dpi.png                           A 300 dpi version of Figure S1.
-----------------------------------------------------------------------------------------------------------------------------------------------
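
A 4-mer frequency vector of the kind computed by temp_viru_kmer_frequency.m can be sketched in Python as follows. The details are assumptions of this sketch rather than a description of the MATLAB script: k-mers are ordered lexicographically over A, C, G, T, only the given strand is counted, and windows containing any other character are skipped. The function name kmer_frequencies is hypothetical.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the relative frequency of every k-mer over A/C/G/T.

    Assumptions of this sketch: lexicographic k-mer order, single strand
    only, and windows with ambiguous bases are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTACGT")
    print(len(freqs), sum(freqs))
```

The resulting 256-dimensional vectors, one per genome, are the kind of input a PCA such as the one in cal_PCA.m would take.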

File: runtime.7z. After decompression, the folder contains data, related results and script.
---------------------------------------------------------------------------------------------------------------------------------------------
Folder                  Description
files                   Files used for testing the running time of PHACTS, DeePhage and PhagePred.
VM_Bioinfo.vdi          A virtual machine used for testing the running time of PHACTS and DeePhage.
-----------------------------------------------------------------------------------------------------------------------------------------------

File: real_virome_data.7z. After decompression, the folder contains data, related results and script.
-----------------------------------------------------------------------------------------------------------------------------------------------
File                                        Description
101835.fastq                                Downloaded metavirome data of bodily fluid in the bovine rumen from MG-RAST (accession: mgm4534202.3).
101836.fastq                                Downloaded metavirome data of bodily fluid in the bovine rumen from MG-RAST (accession: mgm4534203.3).
contigs.fasta                               Contigs assembled by SPAdes from the files 101835.fastq and 101836.fastq.
contigs-predicted-gene.faa                  Predicted genes of the file contigs.fasta by FragGeneScan.
viral.1.protein.faa                         Viral proteins from viral protein database.
viral.2.protein.faa                         Viral proteins from viral protein database.
Calculate_the_lenght_of_N50.m               Script for calculating the length of the N50 sequence.
blastx_contigs_hallmarker_no_head.txt       Output of blastx for aligning the predicted genes against the viral proteins.
contigs_result.csv                          Output of the prediction from DeePhage for the file contigs.fasta.

Folder
          Subfolder          
          hits_16_contigs                                          
                     Subfolder                                     Description
                     16_contigs_fasta_DNA_sequence                 DNA sequences of 16 contigs.	
                     16_contigs_protein_sequence_by_FragGeneScan   Protein prediction results by FragGeneScan for 16 contigs.
                     16_contigs_prediction_result_by_PHACTS        Prediction results by PHACTS for 16 contigs.
      
                     File                                                  Description
                     Pick_out_16_contigs_use_evaleu_e-10_hits_len_400.m    Script for picking out 16 contigs with e-value less than 1e-10 and hit length more than 400.
                     contig_len_identity_evalue_hitlen.mat                 Output of script Pick_out_16_contigs_use_evaleu_e-10_hits_len_400.m
                     Prediction_scores_by_DeePhage_16_contig.m             Prediction scores for 16 contigs by DeePhage.
                     Contig_ID_16_contigs.mat                              IDs of the 16 contigs, saved in a mat file.
            
          predict_all_blast_contigs_by_PHACTS
                    File                                                          Description
                    single_contig_file_faa_file.zip                               Zip file of the faa sequences predicted by FragGeneScan for each contig (contigs with no faa sequences are not included).
                    single_contig_file_faa_file_predict_result_by_PHACTS.zip      Zip file of the PHACTS prediction results for each contig.

          blast_all_viru_temp_whole_genomes
                     File                                                                 Description
                     contigs_blast_all_temp_viru.txt                                      Contigs' blast results when using all the virulent and temperate whole genomes as database.
                     sort_blast_result_with_contig_ID_len_identity_evalue_lifestyle.m     Script for sorting the blast results and extracting the contig ID, contig length, identity score, e-value and the lifestyle of the phage each contig aligned to.
                     blast_len_identity_evalue_sort.mat                                   Output of script sort_blast_result_with_contig_ID_len_identity_evalue_lifestyle.m
                     phacts_pre_acc_for_real_data_blast_all_result.m                      Script for calculating the proportions of the two phage types predicted by PHACTS.
                     DeePhage_pre_acc_for_real_data_blast_all_result.m                    Script for calculating the proportions of the two phage types predicted by DeePhage.

          predict_all_blast_contigs_by_PhagePred
                     File                                                   Description
                     PhagePred_16.m                                         Script for calculating the prediction accuracy for the 16 contigs by PhagePred.
                     PhagePred_pre_acc_for_real_data_blast_all_result.m     Script for calculating the proportions of the two phage types predicted by PhagePred.
                     PhagePred_pre_real_data.csv                            Prediction results for real data by PhagePred.
                     blast_len_identity_evalue_sort.mat                     Output of script sort_blast_result_with_contig_ID_len_identity_evalue_lifestyle.m
------------------------------------------------------------------------------------------------------------------------------------------------
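
The N50 statistic computed by Calculate_the_lenght_of_N50.m can be illustrated in Python. The sketch uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length); whether the MATLAB script follows exactly this convention is not stated in this file, and the function name n50 is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Uses the common definition: walk the contigs from longest to shortest
    and return the length at which the running sum first reaches half of
    the total length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

To apply it to contigs.fasta, the contig lengths would first need to be read from the FASTA records, which is left out here.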

File: phage_transformations.7z. After decompression, the folder contains data, related results and script.
------------------------------------------------------------------------------------------------------------------------------------------------
Folder
contig
          Subfolder
          meta_genome_phage
                   Subfolder                                Description
                   SPAdes_assemble_meta_real_data           The assembled contigs of metagenome samples (including Health and UC samples) by SPAdes.
                   meta_SPAdes_phage_pro_result             The prediction results of metagenome samples (including Health and UC samples) by PPR-Meta.
                   phage_contigs_SPAdes                     Picked out phage contigs of metagenome samples (including Health and UC samples).
                   DeePhage_pre_phage_contigs               Prediction results of phage contigs of metagenome samples (including Health and UC samples) by DeePhage.

           meta_virom                                  
           Subfolder                 Description
                    Health           The assembled contigs by SPAdes (fasta file) and prediction results (csv file) by DeePhage of virome health samples.
                    UC               The assembled contigs by SPAdes (fasta file) and prediction results (csv file) by DeePhage of virome UC samples.

          blast_phage_genomes
                    File                                                                       Description
                    viruses.txt                                                                Viruses information obtained from NCBI database (The NCBI database. ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/. Accessed 23 November 2020).
                    phage.txt                                                                  Phages information picked out from  viruses.txt file.
                    NC_num_phage.txt                                                           Phages' useful NC accession number picked out from phage.txt.
                    phage_genomes.fasta                                                        Phages' genome sequences downloaded from NCBI database (https://www.ncbi.nlm.nih.gov/sites/batchentrez) according to the file NC_num_phage.txt.
                    select_temp_contig.m                                                       Script for picking out all the temperate contigs annotated by DeePhage from virome Health and UC samples. 
                    temp_all_contigs_UC_54_sample.fasta                                        Output of script select_temp_contig.m
                    temp_all_contigs_Health_23_sample.fasta                                    Output of script select_temp_contig.m
                    temp_all_contigs_UC_54_blast_phage_e-10_simple.out                         Output of blast method when aligning temperate UC samples' contig to phage genomes.
                    temp_all_contigs_Health_23_blast_phage_e-10_simple.out                     Output of blast method when aligning temperate Health samples' contig to phage genomes.
                    temp_all_contigs_UC_54_blast_phage_e-10_simple_with_taxid.out.out          The contig IDs and their taxids (if existing) picked out from file temp_all_contigs_UC_54_blast_phage_e-10_simple.out
                    temp_all_contigs_Health_23_blast_phage_e-10_simple_with_taxid.out.out      The contig IDs and their taxids (if existing) picked out from file temp_all_contigs_Health_23_blast_phage_e-10_simple.out
                    get_different_taxonomy.r                                                   Script for identifying the phage species that differ between virome Health and UC samples.
                    UC_taxid_unique.csv                                                        Output of script get_different_taxonomy.r.
                    Health_taxid_unique.csv                                                    Output of script get_different_taxonomy.r.
           different_UC_state
                    File                                             Description
                    different_UC_state_average_temperate_pro.m       Script for calculating the proportion of temperate phage contigs in different UC states. 
                    different_state_average_pro.m                    Script for calculating the average proportion of temperate phage contigs in different UC states.
                    pro_health_early.mat                             Output of script different_UC_state_average_temperate_pro.m.
                    pro_health_flare.mat                             Output of script different_UC_state_average_temperate_pro.m.
                    pro_health_improve.mat                           Output of script different_UC_state_average_temperate_pro.m.
                    pro_health_inactivate.mat                        Output of script different_UC_state_average_temperate_pro.m.
                    pro_health_late_resolve.mat                      Output of script different_UC_state_average_temperate_pro.m.
                    pro_health_mild.mat                              Output of script different_UC_state_average_temperate_pro.m.
                    pro_health_moderate.mat                          Output of script different_UC_state_average_temperate_pro.m.
                    Figure_3_0509.opju                               Origin file for Figure 3.
                    Figure_3_revised.png                             Figure 3 in the manuscript.


          File                             Description
          phage_pro.m                      Script for calculating the phage proportion in the metagenome data of each Health and UC sample.
          metagenome_temperate_pro.m       Script for calculating the temperate phage proportion in the metagenome data of each Health and UC sample.
          virome_temperate_pro.m           Script for calculating the temperate phage proportion in the virome data of each Health and UC sample.
          metagenome_phage_pro.mat         Output of script phage_pro.m.
          result_temp_pro_meta.mat         Output of script metagenome_temperate_pro.m.
          result_virome.mat                Output of script virome_temperate_pro.m.
          calculate_p_value.r              Script for testing whether the temperate phage proportions differ significantly between Health and UC samples, in both the metagenome and the virome data.
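
As a rough illustration of what a temperate phage proportion means here, the sketch below computes the fraction of contigs in one sample whose DeePhage score falls below 0.5, the cutoff given in the Summary. It is not the code in virome_temperate_pro.m or metagenome_temperate_pro.m. The function name, the example scores and the simple contig-count definition (no abundance weighting; a score of exactly 0.5 is not counted as temperate) are assumptions made for illustration.

```python
def temperate_proportion(scores, cutoff=0.5):
    """Fraction of contigs whose score is below the cutoff (temperate).

    Assumption: every contig counts equally, regardless of its abundance.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("no contig scores given")
    n_temperate = sum(1 for score in scores if score < cutoff)
    return n_temperate / len(scores)


# Made-up scores for one hypothetical sample: 3 of the 5 are below 0.5.
sample_scores = [0.91, 0.12, 0.47, 0.66, 0.05]
print(temperate_proportion(sample_scores))
```

Per-sample proportions obtained this way could then be compared between the Health and UC groups with a statistical test of the user's choice.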

File                                            Description
vlp_site_disease_used_in_DeePhage.xlsx          Information on the 54 UC and 23 Health virome samples used in our study.

------------------------------------------------------------------------------------------------------------------------------------------------

File: tools_validation_comparsion.7z. After decompression, the folder contains data, related results and scripts.
------------------------------------------------------------------------------------------------------------------------------------------------
Folder                                        
fna_file
          Subfolder               Description
          100_400_1_test          20000 test DNA sequences from the 1st cross validation set of Group A.
          100_400_2_test          20000 test DNA sequences from the 2nd cross validation set of Group A.
          100_400_3_test          20000 test DNA sequences from the 3rd cross validation set of Group A.
          100_400_4_test          20000 test DNA sequences from the 4th cross validation set of Group A.
          100_400_5_test          20000 test DNA sequences from the 5th cross validation set of Group A.
          400_800_1_test          20000 test DNA sequences from the 1st cross validation set of Group B.
          400_800_2_test          20000 test DNA sequences from the 2nd cross validation set of Group B.
          400_800_3_test          20000 test DNA sequences from the 3rd cross validation set of Group B.
          400_800_4_test          20000 test DNA sequences from the 4th cross validation set of Group B.
          400_800_5_test          20000 test DNA sequences from the 5th cross validation set of Group B.
          800_1200_1_test         20000 test DNA sequences from the 1st cross validation set of Group C.
          800_1200_2_test         20000 test DNA sequences from the 2nd cross validation set of Group C.
          800_1200_3_test         20000 test DNA sequences from the 3rd cross validation set of Group C.
          800_1200_4_test         20000 test DNA sequences from the 4th cross validation set of Group C.
          800_1200_5_test         20000 test DNA sequences from the 5th cross validation set of Group C.
          1200_1800_1_test        20000 test DNA sequences from the 1st cross validation set of Group D.
          1200_1800_2_test        20000 test DNA sequences from the 2nd cross validation set of Group D.
          1200_1800_3_test        20000 test DNA sequences from the 3rd cross validation set of Group D.
          1200_1800_4_test        20000 test DNA sequences from the 4th cross validation set of Group D.
          1200_1800_5_test        20000 test DNA sequences from the 5th cross validation set of Group D.

FragGeneScan_pre_result
          File                                 Description
          whether_pre_gene.m                   Script for determining whether FragGeneScan predicts a protein sequence for each sequence in each test set.
          100_400_1_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 1st cross validation set of Group A.
          100_400_2_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 2nd cross validation set of Group A.
          100_400_3_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 3rd cross validation set of Group A.
          100_400_4_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 4th cross validation set of Group A.
          100_400_5_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 5th cross validation set of Group A.
          400_800_1_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 1st cross validation set of Group B.
          400_800_2_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 2nd cross validation set of Group B.
          400_800_3_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 3rd cross validation set of Group B.
          400_800_4_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 4th cross validation set of Group B.
          400_800_5_test_pre_gene.csv          Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 5th cross validation set of Group B.
          800_1200_1_test_pre_gene.csv         Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 1st cross validation set of Group C.
          800_1200_2_test_pre_gene.csv         Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 2nd cross validation set of Group C.
          800_1200_3_test_pre_gene.csv         Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 3rd cross validation set of Group C.
          800_1200_4_test_pre_gene.csv         Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 4th cross validation set of Group C.
          800_1200_5_test_pre_gene.csv         Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 5th cross validation set of Group C.
          1200_1800_1_test_pre_gene.csv        Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 1st cross validation set of Group D.
          1200_1800_2_test_pre_gene.csv        Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 2nd cross validation set of Group D.
          1200_1800_3_test_pre_gene.csv        Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 3rd cross validation set of Group D.
          1200_1800_4_test_pre_gene.csv        Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 4th cross validation set of Group D.
          1200_1800_5_test_pre_gene.csv        Labels indicating whether FragGeneScan predicted a protein sequence for each sequence in the 5th cross validation set of Group D.
          
          Subfolder               Description
          100_400_1_test          Gene prediction results of 20000 test DNA sequences from the 1st cross validation set of Group A by FragGeneScan.
          100_400_2_test          Gene prediction results of 20000 test DNA sequences from the 2nd cross validation set of Group A by FragGeneScan.
          100_400_3_test          Gene prediction results of 20000 test DNA sequences from the 3rd cross validation set of Group A by FragGeneScan.
          100_400_4_test          Gene prediction results of 20000 test DNA sequences from the 4th cross validation set of Group A by FragGeneScan.
          100_400_5_test          Gene prediction results of 20000 test DNA sequences from the 5th cross validation set of Group A by FragGeneScan.
          400_800_1_test          Gene prediction results of 20000 test DNA sequences from the 1st cross validation set of Group B by FragGeneScan.
          400_800_2_test          Gene prediction results of 20000 test DNA sequences from the 2nd cross validation set of Group B by FragGeneScan.
          400_800_3_test          Gene prediction results of 20000 test DNA sequences from the 3rd cross validation set of Group B by FragGeneScan.
          400_800_4_test          Gene prediction results of 20000 test DNA sequences from the 4th cross validation set of Group B by FragGeneScan.
          400_800_5_test          Gene prediction results of 20000 test DNA sequences from the 5th cross validation set of Group B by FragGeneScan.
          800_1200_1_test         Gene prediction results of 20000 test DNA sequences from the 1st cross validation set of Group C by FragGeneScan.
          800_1200_2_test         Gene prediction results of 20000 test DNA sequences from the 2nd cross validation set of Group C by FragGeneScan.
          800_1200_3_test         Gene prediction results of 20000 test DNA sequences from the 3rd cross validation set of Group C by FragGeneScan.
          800_1200_4_test         Gene prediction results of 20000 test DNA sequences from the 4th cross validation set of Group C by FragGeneScan.
          800_1200_5_test         Gene prediction results of 20000 test DNA sequences from the 5th cross validation set of Group C by FragGeneScan.
          1200_1800_1_test        Gene prediction results of 20000 test DNA sequences from the 1st cross validation set of Group D by FragGeneScan.
          1200_1800_2_test        Gene prediction results of 20000 test DNA sequences from the 2nd cross validation set of Group D by FragGeneScan.
          1200_1800_3_test        Gene prediction results of 20000 test DNA sequences from the 3rd cross validation set of Group D by FragGeneScan.
          1200_1800_4_test        Gene prediction results of 20000 test DNA sequences from the 4th cross validation set of Group D by FragGeneScan.
          1200_1800_5_test        Gene prediction results of 20000 test DNA sequences from the 5th cross validation set of Group D by FragGeneScan.          

label
          File                               Description
          100_400_1_test_label.csv           Labels of 20000 test DNA sequences from the 1st cross validation set of Group A, output by script get_label.m.
          100_400_2_test_label.csv           Labels of 20000 test DNA sequences from the 2nd cross validation set of Group A, output by script get_label.m.
          100_400_3_test_label.csv           Labels of 20000 test DNA sequences from the 3rd cross validation set of Group A, output by script get_label.m.
          100_400_4_test_label.csv           Labels of 20000 test DNA sequences from the 4th cross validation set of Group A, output by script get_label.m.
          100_400_5_test_label.csv           Labels of 20000 test DNA sequences from the 5th cross validation set of Group A, output by script get_label.m.
          400_800_1_test_label.csv           Labels of 20000 test DNA sequences from the 1st cross validation set of Group B, output by script get_label.m.
          400_800_2_test_label.csv           Labels of 20000 test DNA sequences from the 2nd cross validation set of Group B, output by script get_label.m.
          400_800_3_test_label.csv           Labels of 20000 test DNA sequences from the 3rd cross validation set of Group B, output by script get_label.m.
          400_800_4_test_label.csv           Labels of 20000 test DNA sequences from the 4th cross validation set of Group B, output by script get_label.m.
          400_800_5_test_label.csv           Labels of 20000 test DNA sequences from the 5th cross validation set of Group B, output by script get_label.m.
          800_1200_1_test_label.csv          Labels of 20000 test DNA sequences from the 1st cross validation set of Group C, output by script get_label.m.
          800_1200_2_test_label.csv          Labels of 20000 test DNA sequences from the 2nd cross validation set of Group C, output by script get_label.m.
          800_1200_3_test_label.csv          Labels of 20000 test DNA sequences from the 3rd cross validation set of Group C, output by script get_label.m.
          800_1200_4_test_label.csv          Labels of 20000 test DNA sequences from the 4th cross validation set of Group C, output by script get_label.m.
          800_1200_5_test_label.csv          Labels of 20000 test DNA sequences from the 5th cross validation set of Group C, output by script get_label.m.
          1200_1800_1_test_label.csv         Labels of 20000 test DNA sequences from the 1st cross validation set of Group D, output by script get_label.m.
          1200_1800_2_test_label.csv         Labels of 20000 test DNA sequences from the 2nd cross validation set of Group D, output by script get_label.m.
          1200_1800_3_test_label.csv         Labels of 20000 test DNA sequences from the 3rd cross validation set of Group D, output by script get_label.m.
          1200_1800_4_test_label.csv         Labels of 20000 test DNA sequences from the 4th cross validation set of Group D, output by script get_label.m.
          1200_1800_5_test_label.csv         Labels of 20000 test DNA sequences from the 5th cross validation set of Group D, output by script get_label.m.

phacts_predict_result
          Subfolder               Description
          100_400_1_test          Output files of 20000 test gene sequences from the 1st cross validation set of Group A by PHACTS.
          100_400_2_test          Output files of 20000 test gene sequences from the 2nd cross validation set of Group A by PHACTS.
          100_400_3_test          Output files of 20000 test gene sequences from the 3rd cross validation set of Group A by PHACTS.
          100_400_4_test          Output files of 20000 test gene sequences from the 4th cross validation set of Group A by PHACTS.
          100_400_5_test          Output files of 20000 test gene sequences from the 5th cross validation set of Group A by PHACTS.
          400_800_1_test          Output files of 20000 test gene sequences from the 1st cross validation set of Group B by PHACTS.
          400_800_2_test          Output files of 20000 test gene sequences from the 2nd cross validation set of Group B by PHACTS.
          400_800_3_test          Output files of 20000 test gene sequences from the 3rd cross validation set of Group B by PHACTS.
          400_800_4_test          Output files of 20000 test gene sequences from the 4th cross validation set of Group B by PHACTS.
          400_800_5_test          Output files of 20000 test gene sequences from the 5th cross validation set of Group B by PHACTS.
          800_1200_1_test         Output files of 20000 test gene sequences from the 1st cross validation set of Group C by PHACTS.
          800_1200_2_test         Output files of 20000 test gene sequences from the 2nd cross validation set of Group C by PHACTS.
          800_1200_3_test         Output files of 20000 test gene sequences from the 3rd cross validation set of Group C by PHACTS.
          800_1200_4_test         Output files of 20000 test gene sequences from the 4th cross validation set of Group C by PHACTS.
          800_1200_5_test         Output files of 20000 test gene sequences from the 5th cross validation set of Group C by PHACTS.
          1200_1800_1_test        Output files of 20000 test gene sequences from the 1st cross validation set of Group D by PHACTS.
          1200_1800_2_test        Output files of 20000 test gene sequences from the 2nd cross validation set of Group D by PHACTS.
          1200_1800_3_test        Output files of 20000 test gene sequences from the 3rd cross validation set of Group D by PHACTS.
          1200_1800_4_test        Output files of 20000 test gene sequences from the 4th cross validation set of Group D by PHACTS.
          1200_1800_5_test        Output files of 20000 test gene sequences from the 5th cross validation set of Group D by PHACTS.
         
          File                                                              Description
          phacts_calculate_validation_accuracy_sn_sp_for_table_1.m          Script for calculating the accuracy (acc), sensitivity (sn) and specificity (sp) of each cross validation reported in Table 1.
          phacts_except_no_gene_for_figure_1.m                              Script for calculating the accuracy (acc), sensitivity (sn) and specificity (sp) of each cross validation shown in Figure 1 (excluding sequences without a predicted gene).
          100_400_1_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 1st cross validation set of Group A.
          100_400_2_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 2nd cross validation set of Group A.
          100_400_3_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 3rd cross validation set of Group A.
          100_400_4_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 4th cross validation set of Group A.
          100_400_5_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 5th cross validation set of Group A.
          400_800_1_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 1st cross validation set of Group B.
          400_800_2_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 2nd cross validation set of Group B.
          400_800_3_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 3rd cross validation set of Group B.
          400_800_4_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 4th cross validation set of Group B.
          400_800_5_test_phacts_pre.csv                                     Prediction results by PHACTS of 20000 test DNA sequences from the 5th cross validation set of Group B.
          800_1200_1_test_phacts_pre.csv                                    Prediction results by PHACTS of 20000 test DNA sequences from the 1st cross validation set of Group C.
          800_1200_2_test_phacts_pre.csv                                    Prediction results by PHACTS of 20000 test DNA sequences from the 2nd cross validation set of Group C.
          800_1200_3_test_phacts_pre.csv                                    Prediction results by PHACTS of 20000 test DNA sequences from the 3rd cross validation set of Group C.
          800_1200_4_test_phacts_pre.csv                                    Prediction results by PHACTS of 20000 test DNA sequences from the 4th cross validation set of Group C.
          800_1200_5_test_phacts_pre.csv                                    Prediction results by PHACTS of 20000 test DNA sequences from the 5th cross validation set of Group C.
          1200_1800_1_test_phacts_pre.csv                                   Prediction results by PHACTS of 20000 test DNA sequences from the 1st cross validation set of Group D.
          1200_1800_2_test_phacts_pre.csv                                   Prediction results by PHACTS of 20000 test DNA sequences from the 2nd cross validation set of Group D.
          1200_1800_3_test_phacts_pre.csv                                   Prediction results by PHACTS of 20000 test DNA sequences from the 3rd cross validation set of Group D.
          1200_1800_4_test_phacts_pre.csv                                   Prediction results by PHACTS of 20000 test DNA sequences from the 4th cross validation set of Group D.
          1200_1800_5_test_phacts_pre.csv                                   Prediction results by PHACTS of 20000 test DNA sequences from the 5th cross validation set of Group D.
          
DeePhage_predict_result
          Subfolder                                                       Description
          DeePhage_pre_when_cross_validation                              Prediction results by DeePhage and labels in each cross validation.

          File                                                            Description
          rearrange_T_test_to_calculate_validation_accuracy_sn_sp.m       Script for calculating the accuracy (acc), sensitivity (sn) and specificity (sp) of each cross validation reported in Table 1.
          rearrange_DeePhage_except_no_gene.m                             Script for calculating the accuracy (acc), sensitivity (sn) and specificity (sp) of each cross validation shown in Figure 1 (excluding sequences without a predicted gene).

PhagePred_predict_result
          File                                                                 Description
          PhagePred_calculate_validation_accuracy_sn_sp_for_table_1.m          Script for calculating the accuracy (acc), sensitivity (sn) and specificity (sp) of each cross validation reported in Table 1.
          PhagePred_except_no_gene_for_figure_1.m                              Script for calculating the accuracy (acc), sensitivity (sn) and specificity (sp) of each cross validation shown in Figure 1 (excluding sequences without a predicted gene).
          100_400_1_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 1st cross validation set of Group A.
          100_400_2_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 2nd cross validation set of Group A.
          100_400_3_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 3rd cross validation set of Group A.
          100_400_4_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 4th cross validation set of Group A.
          100_400_5_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 5th cross validation set of Group A.
          400_800_1_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 1st cross validation set of Group B.
          400_800_2_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 2nd cross validation set of Group B.
          400_800_3_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 3rd cross validation set of Group B.
          400_800_4_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 4th cross validation set of Group B.
          400_800_5_test_PhagePred_pre.csv                                     Prediction results by PhagePred of 20000 test DNA sequences from the 5th cross validation set of Group B.
          800_1200_1_test_PhagePred_pre.csv                                    Prediction results by PhagePred of 20000 test DNA sequences from the 1st cross validation set of Group C.
          800_1200_2_test_PhagePred_pre.csv                                    Prediction results by PhagePred of 20000 test DNA sequences from the 2nd cross validation set of Group C.
          800_1200_3_test_PhagePred_pre.csv                                    Prediction results by PhagePred of 20000 test DNA sequences from the 3rd cross validation set of Group C.
          800_1200_4_test_PhagePred_pre.csv                                    Prediction results by PhagePred of 20000 test DNA sequences from the 4th cross validation set of Group C.
          800_1200_5_test_PhagePred_pre.csv                                    Prediction results by PhagePred of 20000 test DNA sequences from the 5th cross validation set of Group C.
          1200_1800_1_test_PhagePred_pre.csv                                   Prediction results by PhagePred of 20000 test DNA sequences from the 1st cross validation set of Group D.
          1200_1800_2_test_PhagePred_pre.csv                                   Prediction results by PhagePred of 20000 test DNA sequences from the 2nd cross validation set of Group D.
          1200_1800_3_test_PhagePred_pre.csv                                   Prediction results by PhagePred of 20000 test DNA sequences from the 3rd cross validation set of Group D.
          1200_1800_4_test_PhagePred_pre.csv                                   Prediction results by PhagePred of 20000 test DNA sequences from the 4th cross validation set of Group D.
          1200_1800_5_test_PhagePred_pre.csv                                   Prediction results by PhagePred of 20000 test DNA sequences from the 5th cross validation set of Group D.
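
The acc, sn and sp values computed by the evaluation scripts listed for PHACTS, DeePhage and PhagePred are conventionally accuracy, sensitivity and specificity. The sketch below shows how these three values can be derived from a label file and a prediction file once both are loaded as lists. It is not the authors' MATLAB code. The 0/1 encoding, the choice of virulent as the positive class, and the optional has_gene mask (imitating the *_pre_gene.csv labels, for an evaluation that excludes sequences without a predicted gene) are assumptions.

```python
def _ratio(numerator, denominator):
    """Divide, returning NaN instead of failing when the denominator is 0."""
    return numerator / denominator if denominator else float("nan")


def acc_sn_sp(true_labels, predicted_labels, has_gene=None):
    """Return (accuracy, sensitivity, specificity) for binary 0/1 labels.

    Assumption: 1 = virulent (positive class), 0 = temperate.
    If has_gene is given, only positions where it is truthy are scored.
    """
    if has_gene is None:
        has_gene = [True] * len(true_labels)
    tp = tn = fp = fn = 0
    for truth, pred, keep in zip(true_labels, predicted_labels, has_gene):
        if not keep:
            continue  # sequence excluded from this evaluation
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    acc = _ratio(tp + tn, tp + tn + fp + fn)
    sn = _ratio(tp, tp + fn)   # fraction of positives recovered
    sp = _ratio(tn, tn + fp)   # fraction of negatives recovered
    return acc, sn, sp


# Made-up labels and predictions for six hypothetical sequences.
truth = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1]
print(acc_sn_sp(truth, pred))
print(acc_sn_sp(truth, pred, has_gene=[1, 1, 0, 1, 1, 0]))
```

Passing the mask restricts the evaluation to sequences for which a gene was predicted, which is the kind of comparison the "except_no_gene" scripts are described as performing.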


File                        Description
100_400_1_test.fna          20000 test DNA sequences integrated into a single file from the 1st cross validation set of Group A.
100_400_2_test.fna          20000 test DNA sequences integrated into a single file from the 2nd cross validation set of Group A.
100_400_3_test.fna          20000 test DNA sequences integrated into a single file from the 3rd cross validation set of Group A.
100_400_4_test.fna          20000 test DNA sequences integrated into a single file from the 4th cross validation set of Group A.
100_400_5_test.fna          20000 test DNA sequences integrated into a single file from the 5th cross validation set of Group A.
400_800_1_test.fna          20000 test DNA sequences integrated into a single file from the 1st cross validation set of Group B.
400_800_2_test.fna          20000 test DNA sequences integrated into a single file from the 2nd cross validation set of Group B.
400_800_3_test.fna          20000 test DNA sequences integrated into a single file from the 3rd cross validation set of Group B.
400_800_4_test.fna          20000 test DNA sequences integrated into a single file from the 4th cross validation set of Group B.
400_800_5_test.fna          20000 test DNA sequences integrated into a single file from the 5th cross validation set of Group B.
800_1200_1_test.fna         20000 test DNA sequences integrated into a single file from the 1st cross validation set of Group C.
800_1200_2_test.fna         20000 test DNA sequences integrated into a single file from the 2nd cross validation set of Group C.
800_1200_3_test.fna         20000 test DNA sequences integrated into a single file from the 3rd cross validation set of Group C.
800_1200_4_test.fna         20000 test DNA sequences integrated into a single file from the 4th cross validation set of Group C.
800_1200_5_test.fna         20000 test DNA sequences integrated into a single file from the 5th cross validation set of Group C.
1200_1800_1_test.fna        20000 test DNA sequences integrated into a single file from the 1st cross validation set of Group D.
1200_1800_2_test.fna        20000 test DNA sequences integrated into a single file from the 2nd cross validation set of Group D.
1200_1800_3_test.fna        20000 test DNA sequences integrated into a single file from the 3rd cross validation set of Group D.
1200_1800_4_test.fna        20000 test DNA sequences integrated into a single file from the 4th cross validation set of Group D.
1200_1800_5_test.fna        20000 test DNA sequences integrated into a single file from the 5th cross validation set of Group D.
seperate_20000_sequence.m   Script for separating each test DNA sequence in each cross validation set into its own file.
get_label.m                 Script for getting the true label of every sequence in each cross validation set.
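
The relationship between an integrated .fna file and a folder holding one file per sequence can be illustrated with the sketch below, which splits multi-record FASTA text into records and writes each record to its own file. It is not a port of seperate_20000_sequence.m. The function names and the output naming scheme (a running index followed by .fna) are hypothetical, and the demo sequences are made up.

```python
import os
import tempfile


def split_fasta_records(text):
    """Split FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def write_one_file_per_record(records, out_dir):
    """Write each record to out_dir/<index>.fna (hypothetical naming)."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, (header, sequence) in enumerate(records, start=1):
        path = os.path.join(out_dir, f"{index}.fna")
        with open(path, "w") as handle:
            handle.write(f">{header}\n{sequence}\n")
        paths.append(path)
    return paths


# Demo on two made-up records, written to a temporary directory.
demo = ">seq1\nACGT\nTTGA\n>seq2\nGGCC\n"
with tempfile.TemporaryDirectory() as tmp:
    written = write_one_file_per_record(split_fasta_records(demo), tmp)
    print(len(written))  # one file per record
```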
-------------------------------------------------------------------------------------------------------------------------------------------------