HC => high complexity; MC => medium complexity; LC => low complexity.
800 => the mean read length is 800bp; 200 => the mean read length is 200bp.
fr => error free; er => sequencing error rate > 0 (in all such cases, the mean error rate is 0.005 errors per base).

All the datasets consist of mate pair reads whose mate pair length follows a normal distribution (mean: 2000bp, stdev: 200bp).

For the 800bp read datasets, three files are included:
1) The ".fna" file holds the read sequences in FASTA format.
2) The ".fna.mate" file holds the mate pair information of the dataset. Each line describes one mate pair with four items in order: the first two items are the ids of the paired reads, and the last two items are the two parameters (mean and stdev) of the mate pair length. The id of a read is the same as the one on the first (header) line of that read in the FASTA file.
3) The ".info" file holds the origin information of the reads. Each read record begins with a ">" and its id. Four items follow: GI is the GenBank id of the genome from which the read was sampled; forb indicates whether the read sequence is forward or backward (0 means forward, 1 means backward); pos1 and pos2 are the start and end positions of the read projected onto the original genome.

For the 200bp read datasets, two files are included: the ".sff" file and the ".info" file. The ".sff" file holds the read sequences in SFF format. As the 200bp reads model 454 mate pair reads, each sequence contains the two ends of the mate pair template together with the linker sequence in the middle. The linker sequence is the "flx" linker: "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC". The ".info" file holds the origin information of the reads, in the same way as for the 800bp files. As each read in it is one end of the mate pair template, the id of a read is the id of the template plus either 'a' or 'b'.
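
As an illustration of the file layouts described above, here is a minimal Python sketch, not part of the dataset itself. It makes several assumptions: that the four items on a ".fna.mate" line are separated by whitespace (the delimiter is not stated above), that read sequences from the ".sff" file have already been extracted to plain strings by a tool of your choice, and that the linker appears exactly as given. The names parse_mate_line, split_on_linker and MatePair are hypothetical helpers introduced only for this example.

```python
from typing import NamedTuple, Optional, Tuple

# The "flx" linker sequence quoted in this README.
FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"


class MatePair(NamedTuple):
    """One line of a ".fna.mate" file (hypothetical container)."""
    read_a: str   # id of the first read of the pair
    read_b: str   # id of the second read of the pair
    mean: float   # mean of the mate pair length
    stdev: float  # standard deviation of the mate pair length


def parse_mate_line(line: str) -> MatePair:
    """Parse one mate pair line.

    Assumption: the four items are whitespace-separated.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 items, got {len(fields)}: {line!r}")
    read_a, read_b, mean, stdev = fields
    return MatePair(read_a, read_b, float(mean), float(stdev))


def split_on_linker(
    seq: str, linker: str = FLX_LINKER
) -> Optional[Tuple[str, str]]:
    """Split a 200bp-style read into the two ends around the linker.

    Only an exact (case-insensitive) match of the linker is detected.
    Returns None when no exact copy of the linker is found.
    """
    idx = seq.upper().find(linker)
    if idx < 0:
        return None
    return seq[:idx], seq[idx + len(linker):]
```

For example, parse_mate_line("r1 r2 2000 200") yields a MatePair with ids "r1" and "r2" (made-up ids). Because split_on_linker looks for an exact match, reads from the "er" datasets whose linker contains a simulated sequencing error will return None; handling those would need approximate matching, which this sketch does not attempt.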

The file reference.genomes is in FASTA format and contains all the reference genomes of the 113 species that were used to simulate the datasets in the paper.