MAP accepts input sequence files in two formats: FASTA, and FRG, the format designed as input
for the Celera Assembler.
Input sequence files are specified after the option -p; multiple sequence files can be given,
separated by commas. MAP identifies the format of each input file from its filename: a file
whose name ends in ".frg" is recognized as an FRG file, and any other file is treated as
FASTA.
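
The filename rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not MAP's own code: the function names detect_format and split_inputs are hypothetical, and the case-sensitive suffix match is an assumption.

```python
def detect_format(filename: str) -> str:
    """Return "FRG" for names ending in ".frg", otherwise "FASTA".

    Assumption: the suffix match is case-sensitive.
    """
    return "FRG" if filename.endswith(".frg") else "FASTA"


def split_inputs(p_value: str) -> list[tuple[str, str]]:
    """Split a comma-separated -p value and tag each file with its format."""
    return [(name, detect_format(name)) for name in p_value.split(",") if name]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the rule.
    for name, fmt in split_inputs("lib1.frg,lib2.fasta"):
        print(name, fmt)
```

Note that under this rule any file without the ".frg" suffix is read as FASTA, so a misnamed FRG file would be parsed with the wrong format.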
The FRG format is described at
http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FRG_Files
If FRG files are given as input, MAP reads everything it needs from them: sequences, base
quality scores, and mate pair information, if there is any.
MAP also accepts sequence files in FASTA format. If a FASTA file is specified, you may optionally specify a corresponding file of quality information after the option -q. MAP uses the base qualities in the overlap calculation and in the consensus stage. If multiple FASTA sequence files are specified, multiple quality files can be specified as well, also separated by commas. Note: the name of the quality file for a sequence file must be the name of the FASTA file with ".qual" appended (for example, reads.fasta and reads.fasta.qual).
The format of the .qual file is similar to that of the corresponding FASTA sequence file. For each read there should be a header line identical to the one in the FASTA file, followed by one or more lines giving the quality of each base. Quality values should be integers between 0 and 99 (inclusive), separated by spaces. The total number of quality values for each read must match the number of bases for that read in the FASTA file. The quality scores should be Phred quality scores, defined by q = -10 log_10(p), where p is the probability of an error in the base call and q is the quality score of that base. You may also specify a constant quality score with the option -d; this value is assigned to every base of any FASTA file that has no corresponding quality file. The default value is 23, which corresponds to a mean sequencing error rate of about 0.005.
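
As a rough check of the rules above, the following Python sketch converts between Phred scores and error probabilities and validates a .qual file against its FASTA file. It is not part of MAP: the function names are hypothetical, and it compares only the read id (the header text up to the first whitespace) rather than the full header line.

```python
import math


def phred_to_error(q: float) -> float:
    """Invert q = -10 log_10(p): error probability for a Phred score."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score q = -10 log_10(p) for an error probability p."""
    return -10 * math.log10(p)


def parse_records(text: str) -> dict[str, list[str]]:
    """Map each read id (first token after '>') to its body lines."""
    records: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return records


def check_quals(fasta_text: str, qual_text: str) -> list[str]:
    """Return a list of problems; an empty list means the files agree."""
    seqs = parse_records(fasta_text)
    quals = parse_records(qual_text)
    problems = []
    for read, seq_lines in seqs.items():
        if read not in quals:
            problems.append(f"{read}: no quality record")
            continue
        n_bases = sum(len(s) for s in seq_lines)
        # int() raises ValueError if a quality value is not an integer.
        values = [int(v) for q in quals[read] for v in q.split()]
        if len(values) != n_bases:
            problems.append(f"{read}: {len(values)} qualities for {n_bases} bases")
        if any(not 0 <= v <= 99 for v in values):
            problems.append(f"{read}: quality value outside 0-99")
    return problems


if __name__ == "__main__":
    # The default score 23 corresponds to an error rate of about 0.005.
    print(round(phred_to_error(23), 4))
```

Running such a check before assembly catches length mismatches early, since the number of quality values must equal the number of bases for every read.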
Mate pair information should be stored in a file whose name is the name of the FASTA file with ".mate" appended. Mate pair files are specified with the option -m. Again, multiple mate files can be given, separated by commas. A mate pair file consists of lines of mate pair information, and each line consists of four strings: the first two are the ids of the two reads, and the last two are the distribution parameters (the mean and the standard deviation, both integers) of the mate pair length, i.e. the insert length. The id of a read is the string that follows the ">" in the read's header line in the FASTA file and ends at the first blank space. Although we strongly recommend providing mate pair files so that MAP can perform at its best, it is also fine to run MAP without mate pair information: mate pair files may be missing for some or all of the sequence files.
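
The mate pair layout described above can be read with a short Python sketch like the one below. It is an illustration under stated assumptions, not MAP's parser: the names MatePair, read_id and parse_mates are hypothetical, and the four strings on each line are assumed to be separated by whitespace.

```python
from typing import NamedTuple


class MatePair(NamedTuple):
    read1: str
    read2: str
    mean: int    # mean insert length
    stddev: int  # standard deviation of the insert length


def read_id(header_line: str) -> str:
    """Read id: text after '>' up to the first blank space."""
    return header_line.strip()[1:].split()[0]


def parse_mates(text: str) -> list[MatePair]:
    """Parse mate pair lines of the form: id1 id2 mean stddev."""
    pairs = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 fields, got {len(fields)}")
        id1, id2, mean, stddev = fields
        # int() raises ValueError if a parameter is not an integer.
        pairs.append(MatePair(id1, id2, int(mean), int(stddev)))
    return pairs


if __name__ == "__main__":
    # Hypothetical read ids and insert sizes, for illustration only.
    for pair in parse_mates("r1 r2 3000 300\nr3 r4 8000 800\n"):
        print(pair)
```

The ids in the mate file must match what read_id would extract from the FASTA headers; a pair whose ids do not appear in the FASTA file cannot be linked to any read.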
MAP generates four output files. The output prefix can be specified with the option -o (by default, MAP uses the output prefix "assembly"). The first output file, with the suffix ".contigs", gives the contig consensus sequences in FASTA format. The second output file, with the suffix ".contiginfo", gives the read map of each contig. Each contig begins with a header identical to the one in the ".contigs" file, followed by several lines, each giving the map information (id, strand, starting position, ending position) of one read in that contig. The third output file, with the suffix ".singlets", gives the reads that were not assembled into contigs. The last file, with the suffix ".stat", gives some statistics of the final assembly.
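
For downstream scripts, the output naming and the ".contiginfo" layout described above might be handled as in the following Python sketch. The function names are hypothetical, and two details are assumptions not stated in this document: that contig header lines start with ">" as in FASTA, and that the four fields of a read line are separated by whitespace. The strand field is kept as a string because its encoding is not specified here.

```python
def output_files(prefix: str = "assembly") -> dict[str, str]:
    """Names of the four output files for a given -o prefix."""
    kinds = ("contigs", "contiginfo", "singlets", "stat")
    return {kind: f"{prefix}.{kind}" for kind in kinds}


def parse_contiginfo(text: str) -> dict[str, list[tuple[str, str, int, int]]]:
    """Map each contig header to its (id, strand, start, end) read entries.

    Assumptions: header lines begin with '>' and read fields are
    whitespace-separated.
    """
    contigs: dict[str, list[tuple[str, str, int, int]]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            contigs[current] = []
        elif current is not None:
            read, strand, start, end = line.split()
            contigs[current].append((read, strand, int(start), int(end)))
    return contigs


if __name__ == "__main__":
    print(output_files())
    # Hypothetical contiginfo content, for illustration only.
    sample = ">contig1\nr1 + 1 100\nr2 - 50 160\n"
    print(parse_contiginfo(sample))
```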