bam2fastq


This software is no longer supported by Discovery Genomics. We recommend using Picard's SamToFastq to accomplish this task.

The BAM format is an efficient way to store and share data from modern, highly parallel sequencers. While primarily used for storing alignment information, BAM files can (and frequently do) store unaligned reads as well.

There is a growing number of general-purpose SAM/BAM manipulation programs, including SAMtools, Picard, and Bamtools. This tool is not intended to duplicate the complex suite of tasks those programs perform. Rather, it does one thing: extract the raw sequences (with qualities) from a BAM file. We envision this tool being primarily useful to those wishing to duplicate or extend previous analyses.

Installing

This program requires the standard GNU development environment (gcc and make), along with the SAMtools source code (included) and the zlib compression library.

Download the distribution and extract it with tar -xzf. Change into the extracted directory and run make.


Version History

1.1.0 (18 August 2010)

Added --strict option

Altered default handling of read names

1.0.0 (17 August 2010)

Initial Release

Usage Notes

Several assumptions are made about the format of the BAM file:

If these assumptions are correct, the extracted sequence.txt files will contain the same information as the files used to create the BAM file. However, the presentation of the data will differ slightly from the original:

  1. Extracted quality scores will be encoded using the "Phred+33" scheme, as opposed to the "Phred+64" scheme used by Illumina.
  2. Illumina output contains the read name twice for each read - once as the read name (prefixed by '@'), and once as the quality name (prefixed by '+'). To save space, we do not output the quality name.
  3. If the BAM file has been sorted, the order of the reads in the output files will not match the original order.
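
To make the first difference concrete, the following is a minimal Python sketch of the re-encoding between the two quality schemes. It is not taken from the bam2fastq source; the function name phred64_to_phred33 is hypothetical, and the sketch assumes the input uses plain Phred+64 with no negative scores.

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33.

    Assumption: every character encodes a score >= 0 under the +64 offset.
    """
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # numeric Phred score under the +64 offset
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))  # same score, +33 offset
    return "".join(converted)


# Scores 40, 40, 30 and 2 in each encoding.
print(phred64_to_phred33("hh^B"))  # prints "II?#"
```

The scores themselves are unchanged; only the character used to write each score differs.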

None of these differences affects the actual data, only its representation. So while the files will not be byte-for-byte identical, they will contain the same biological data. For example, compare the output of the original sequence files (s_1_1_sequence.txt and s_1_2_sequence.txt) to the same data extracted from a BAM file (s_1_1_extracted.txt and s_1_2_extracted.txt). Notice that the Quality Encoding is different, but the remaining values are identical. Also note that the extracted files are smaller than the originals.
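
One way to check this equivalence yourself is sketched below in Python. It treats each file as an unordered collection of reads, ignores the line beginning with '+', and re-encodes the original qualities before comparing. The helper names read_fastq and same_reads and the read names in the example are hypothetical. The sketch also assumes four-line FASTQ records, a Phred+64 original, and read names that are identical in both files, which may not hold if bam2fastq has adjusted the names.

```python
from collections import Counter


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes the simple four-line record layout; the '+' line is ignored.
    """
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, qual = stripped[i:i + 4]
        yield header[1:], seq, qual  # drop the leading '@'


def same_reads(original_lines, extracted_lines):
    """Return True if both files hold the same reads, in any order.

    Assumes the original is Phred+64 and the extracted file is Phred+33.
    """
    original = Counter(
        (name, seq, "".join(chr(ord(c) - 31) for c in qual))  # 64 - 33 = 31
        for name, seq, qual in read_fastq(original_lines)
    )
    extracted = Counter(read_fastq(extracted_lines))
    return original == extracted


# Two made-up reads: repeated quality name and Phred+64 in the original,
# different order, bare '+' line and Phred+33 in the extracted file.
original = ["@r1/1\n", "ACGT\n", "+r1/1\n", "hhhh\n",
            "@r2/1\n", "GGCC\n", "+r2/1\n", "hh^B\n"]
extracted = ["@r2/1\n", "GGCC\n", "+\n", "II?#\n",
             "@r1/1\n", "ACGT\n", "+\n", "IIII\n"]
print(same_reads(original, extracted))  # prints "True"
```

For real files, pass open file handles instead of the in-memory lists. Note that the sketch loads every record into memory, so it is only suitable for modest file sizes.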

By default, pair names in the BAM file are modified slightly to allow for BAMs that don't quite meet specification. This behavior can be disabled with the --strict flag.

Parameters

Usage:

bam2fastq [options] <bam file>

Options:

-o FILENAME, --output FILENAME
Specifies the name of the FASTQ file(s) that will be generated. May contain the special characters % (replaced with the lane number) and # (replaced with _1 or _2 to distinguish paired-end reads, removed for single-end reads). [Default: s_%#_sequence.txt]
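
The substitution rule can be illustrated with a short Python sketch. This is only a model of the behavior described above, not the tool's own code; the function expand_output_name and its parameters are hypothetical.

```python
def expand_output_name(template="s_%#_sequence.txt", lane=1, mate=None):
    """Expand an --output template the way the option text describes.

    lane: lane number substituted for '%'.
    mate: 1 or 2 for paired-end reads, None for single-end reads.
    """
    if mate not in (None, 1, 2):
        raise ValueError("mate must be None, 1 or 2")
    suffix = "" if mate is None else f"_{mate}"  # '#' is removed for single-end
    return template.replace("%", str(lane)).replace("#", suffix)


print(expand_output_name(lane=1, mate=1))  # prints "s_1_1_sequence.txt"
print(expand_output_name(lane=1, mate=2))  # prints "s_1_2_sequence.txt"
print(expand_output_name(lane=3))          # prints "s_3_sequence.txt"
```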

-f, --force, --overwrite
Create output files specified with --output, overwriting existing files if necessary [Default: exit program rather than overwrite files]

--aligned
--no-aligned
Reads in the BAM that are aligned will (will not) be extracted. [Default: extract aligned reads]

--unaligned
--no-unaligned
Reads in the BAM that are not aligned will (will not) be extracted. [Default: extract unaligned reads]

--filtered
--no-filtered
Reads that are marked as failing QC checks will (will not) be extracted. [Default: extract filtered reads]

-q, --quiet
Suppress informational messages [Default: print messages]

-s, --strict
Keep bam2fastq's processing to a minimum, assuming that the BAM strictly meets specifications. [Default: allow some errors in the BAM]
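
As an illustration of how the options combine, here is a small Python sketch that assembles a bam2fastq command line from the flags listed above. The helper build_command and the file names are hypothetical. Only options documented on this page are used, and the sketch does not run the program.

```python
def build_command(bam_path, output=None, force=False, aligned=True,
                  unaligned=True, filtered=True, quiet=False, strict=False):
    """Return an argument list for bam2fastq using the documented options.

    Flags that match the documented defaults are left out.
    """
    cmd = ["bam2fastq"]
    if output is not None:
        cmd += ["--output", output]
    if force:
        cmd.append("--force")
    if not aligned:
        cmd.append("--no-aligned")
    if not unaligned:
        cmd.append("--no-unaligned")
    if not filtered:
        cmd.append("--no-filtered")
    if quiet:
        cmd.append("--quiet")
    if strict:
        cmd.append("--strict")
    cmd.append(bam_path)  # the BAM file comes last, after the options
    return cmd


# Extract only the unaligned reads that passed QC from a made-up lane1.bam.
cmd = build_command("lane1.bam", output="unaligned_%#.txt",
                    aligned=False, filtered=False)
print(cmd)
```

The resulting list can be passed to subprocess.run(cmd, check=True) once bam2fastq is on the PATH. Passing the arguments as a list avoids any shell interpretation of the % and # characters in the output template.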