INPUT DATA FILES

MAP accepts input sequence files in two formats: FASTA, and FRG, the format designed as input for the Celera Assembler. Input sequence files are specified after the option -s; multiple sequence files can be given, separated by commas. MAP identifies the format of each input file from its filename: a file whose name ends with the suffix ".frg" is recognized as an FRG file, and any other file is treated as FASTA. The FRG format is described at
http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FRG_Files
If FRG files are given as input, MAP reads all the information it needs from them, including the sequences, the base quality scores and the mate pair information, if there is any.
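
The filename rule above can be sketched in a few lines of Python. This is only an illustration of the documented rule, not MAP's own code: the function names detect_format and split_inputs are hypothetical, and the case-sensitive suffix match is an assumption.

```python
def detect_format(filename):
    """Return "FRG" if the name ends in ".frg", otherwise "FASTA".

    Assumption: the suffix match is case-sensitive.
    """
    return "FRG" if filename.endswith(".frg") else "FASTA"


def split_inputs(arg):
    """Split a comma-separated file list and tag each file with its format."""
    return [(name, detect_format(name)) for name in arg.split(",") if name]
```

For example, split_inputs("a.fasta,b.frg") tags the first file as FASTA and the second as FRG.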

MAP also accepts sequence files in FASTA format. If a FASTA file is specified, you may optionally specify a corresponding file of base quality information after the option -q. MAP uses the base qualities in the overlap calculation and in the consensus stage. If multiple FASTA sequence files are specified, multiple quality files can be specified, also separated by commas. Note: the name of the quality file corresponding to a sequence file must be the name of the FASTA file with ".qual" appended.

The format of the .qual file is similar to that of the corresponding FASTA sequence file. For each read there should be a header line identical to the one in the FASTA file, followed by one or more lines giving the quality of each base. Quality values should be integers between 0 and 99 (inclusive), separated by spaces. The total number of quality values for each read must match the number of bases for that read in the FASTA file. The quality scores should be Phred quality scores, defined by q = -10 log_10(p), where p is the probability of an error in the base call and q is the quality score of that base. You may also specify a constant quality score via the option -d; this value is assigned to every base of any FASTA file that has no corresponding quality file. The default is 23, which corresponds to a mean sequencing error rate of about 0.005.
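
The Phred transformation and the .qual consistency rules can be checked with a short Python sketch. It is not part of MAP. The names phred_to_error, error_to_phred, parse_records and check_qual are hypothetical, and the sketch assumes that the reads appear in the same order in both files, which the description above does not state.

```python
import math


def phred_to_error(q):
    """Error probability p for a Phred score q, from q = -10 log_10(p)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score q for an error probability p."""
    return -10 * math.log10(p)


def parse_records(text):
    """Split FASTA-like text into (header line, list of body lines) records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line, []))
        elif records:
            records[-1][1].append(line)
    return records


def check_qual(fasta_text, qual_text):
    """Return a list of problems; an empty list means the files agree.

    Assumption: reads are listed in the same order in both files.
    Non-integer quality values raise ValueError.
    """
    problems = []
    fasta = parse_records(fasta_text)
    qual = parse_records(qual_text)
    if len(fasta) != len(qual):
        problems.append("number of reads differs")
    for (f_head, f_body), (q_head, q_body) in zip(fasta, qual):
        if f_head != q_head:
            problems.append(f"header mismatch: {f_head!r} vs {q_head!r}")
            continue
        n_bases = sum(len(part) for part in f_body)
        scores = [int(tok) for part in q_body for tok in part.split()]
        if len(scores) != n_bases:
            problems.append(f"{f_head}: {n_bases} bases, {len(scores)} scores")
        if any(not 0 <= s <= 99 for s in scores):
            problems.append(f"{f_head}: score outside 0..99")
    return problems
```

With the default score of 23, phred_to_error(23) gives 10^(-2.3), which is about 0.005, in agreement with the figure quoted above.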

Mate pair information for a FASTA file should be given in a file whose name is the name of the FASTA file with ".mate" appended. Mate pair files are specified with the option -m. Again, multiple mate files can be given, separated by commas. Each line of a mate pair file describes one mate pair and consists of four strings: the first two are the ids of the two reads, and the last two are the distribution parameters (the mean and the standard deviation, both integers) of the mate pair length, or insert length. The id of a read is the string that starts immediately after the ">" in the read's header line in the FASTA file and ends at the first blank space. We strongly recommend providing mate pair files so that MAP can make full use of its strengths, but MAP can also run without mate pair information: mate pair files may be missing for some or all of the sequence files.
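
As an illustration of the mate file layout, the following Python sketch extracts a read id from a FASTA header and parses one mate line. The function names read_id and parse_mate_line are hypothetical, and treating any whitespace as the field separator is an assumption; the description above only says that each line holds four strings.

```python
def read_id(header_line):
    """Read id: the text after ">" up to the first whitespace."""
    return header_line[1:].split()[0]


def parse_mate_line(line):
    """Parse "id1 id2 mean sd" into (id1, id2, mean, sd).

    Assumption: fields are separated by whitespace.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    id1, id2, mean, sd = fields
    return id1, id2, int(mean), int(sd)
```

A line such as "r1 r2 3000 300" would then describe reads r1 and r2 with a mean insert length of 3000 and a standard deviation of 300.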

OUTPUT FILES

MAP generates four output files. The output prefix is specified via the option -o (by default, "assembly"). The first file, with the suffix ".contigs", gives the contig consensus sequences in FASTA format. The second, with the suffix ".contiginfo", gives the read map of each contig: each contig begins with a header identical to its header in the ".contigs" file, followed by one line per read giving that read's map information (id, strand, starting position, ending position). The third, with the suffix ".singlets", gives the reads that were not assembled into contigs. The last, with the suffix ".stat", gives some statistics of the final assembly.
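
A ".contiginfo" file could be read back with a sketch like the one below. It is not part of MAP and rests on assumptions the description above does not confirm: that contig headers start with ">" as in FASTA, that the four fields of a read line are separated by whitespace, and that the positions are integers. The names output_files and parse_contiginfo are hypothetical.

```python
SUFFIXES = (".contigs", ".contiginfo", ".singlets", ".stat")


def output_files(prefix="assembly"):
    """Names of the four output files for a given output prefix."""
    return [prefix + suffix for suffix in SUFFIXES]


def parse_contiginfo(text):
    """Map each contig header to a list of (id, strand, start, end) tuples.

    Assumptions: headers start with ">", fields are whitespace-separated,
    and start/end are integers. The strand is kept as a plain string.
    """
    contigs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line
            contigs[current] = []
        elif current is not None:
            read, strand, start, end = line.split()
            contigs[current].append((read, strand, int(start), int(end)))
    return contigs
```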

COMMAND LINE OPTIONS

-s Sequence file(s) in FASTA format or FRG format, separated by commas
-q Quality file(s), separated by commas
-m Mate pair file(s), separated by commas
-o [string] Output prefix (by default "assembly")
-k [integer] Kmer Length (by default 17)

Before calculating overlaps, MAP selects the pairs of reads that share kmers as the candidate pairs of reads that may overlap.
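
The idea of kmer-based candidate selection can be sketched as follows. This is a minimal illustration, not MAP's implementation: the function name candidate_pairs is hypothetical, every shared kmer counts as evidence, and reverse-complement kmers are ignored for simplicity.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=17):
    """Return the set of read-id pairs that share at least one kmer.

    reads maps a read id to its sequence. Reverse complements are not
    considered in this simplified sketch.
    """
    index = defaultdict(set)  # kmer -> ids of the reads containing it
    for read, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read)
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

Only the pairs returned by such a filter need to go through the more expensive overlap calculation, which is why a longer kmer reduces the number of candidates at the risk of missing overlaps in error-rich reads.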

-n [integer] Number of kmer archives to write into the temporary files at once to release memory (by default 10000000)

This parameter is used while MAP reads the kmers of all the reads and records the position of each kmer in its read. The number of kmers grows with the number of reads, and a large number of kmers requires a large amount of memory, so the kmers are kept in temporary files to reduce the memory demand at the cost of a longer runtime.

-l [integer] Minimal overlap length (by default 30)
-d [integer] Constant quality score assigned to the bases of FASTA files that have no quality file (by default 23)
-e [float] Allowed maximal overlap error rate (by default 0.03)

MAP uses this value in the overlap calculation. Providing a precise mean error rate increases the accuracy with which overlaps between reads are identified. Generally, a higher error rate increases the rate of false positive overlaps, while a lower error rate decreases the sensitivity for correct overlaps.

-t [integer] Maximal thread number (by default 1)
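
The options listed above can be assembled into an argument list with a small helper such as the one below. It is a sketch only: the function name build_args and its parameters are hypothetical, the name of the MAP executable is deliberately left out because it is not given in this section, and the helper simply applies the ".qual" and ".mate" naming rule to every FASTA input.

```python
def build_args(seq_files, with_qual=False, with_mate=False, prefix="assembly",
               kmer=17, min_overlap=30, max_error=0.03, threads=1):
    """Build the option list for a MAP run (executable name not included)."""
    args = ["-s", ",".join(seq_files)]
    # FRG files carry their own quality and mate information.
    fasta = [name for name in seq_files if not name.endswith(".frg")]
    if with_qual and fasta:
        args += ["-q", ",".join(name + ".qual" for name in fasta)]
    if with_mate and fasta:
        args += ["-m", ",".join(name + ".mate" for name in fasta)]
    args += ["-o", prefix, "-k", str(kmer), "-l", str(min_overlap),
             "-e", str(max_error), "-t", str(threads)]
    return args
```

The resulting list can be appended to the command that starts MAP on your system.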

DOCUMENTATION FOR MAP

CONTENTS

PREREQUISITES

INSTALLATION

RUNNING MAP

INPUT DATA FILES

OUTPUT FILES

COMMAND LINE OPTIONS