Statistical analysis of NGS data

Next-generation sequencing (NGS) data poses an exciting challenge for statistical data analysis. Why?


error model
It is not only substantially different from that of classical Sanger technology, but also specific to each sequencing technology (Roche 454, Illumina Solexa, ABI SOLiD, ...).
data size
Due to short read lengths, you end up with big sample sizes, which in turn require considerable computing power.


And why would statistical analysis be useful? Here are some examples:


data description
Read length distribution, quality scores, ... or nucleotide composition along an assembled sequence, which could lead you to plasmid detection.
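As a rough illustration of such descriptive statistics, here is a minimal Python sketch. It assumes reads and the assembled sequence are already available as plain strings; the function names, window size and step are hypothetical choices, not part of any particular tool. A window whose GC fraction departs clearly from the rest of the profile is only a hint that the region deserves a closer look.

```python
from collections import Counter


def read_length_distribution(reads):
    """Count how many reads have each length."""
    return Counter(len(read) for read in reads)


def gc_content_windows(sequence, window=1000, step=500):
    """GC fraction in sliding windows along an assembled sequence.

    Returns a list of (window_start, gc_fraction) pairs.
    The window and step sizes are arbitrary illustrative defaults.
    """
    seq = sequence.upper()
    if not seq:
        return []
    profile = []
    # If the sequence is shorter than one window, use it as a single window.
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        profile.append((start, gc / len(chunk)))
    return profile
```

For example, `read_length_distribution(["ACGT", "ACG", "ACGT"])` counts two reads of length 4 and one of length 3.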
refine base-calling
Detecting and filtering base-calling errors, or applying alternative base-calling methods, could also make the difference when working with, for example, SNPs.
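One simple way to filter likely base-calling errors is to use the quality scores themselves. The sketch below assumes Phred-scaled qualities, where a score Q corresponds to an error probability of 10^(-Q/10), stored as ASCII characters with an offset of 33. Both the offset and the threshold are assumptions you would need to adapt to your platform and data; the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-q / 10)


def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities for one read.

    The ASCII offset of 33 is an assumption; some older formats use 64.
    """
    return sum(phred_to_error_prob(ord(c) - offset) for c in quality_string)


def filter_reads(records, max_expected_errors=1.0, offset=33):
    """Keep (sequence, quality_string) pairs with few expected errors.

    The threshold of 1.0 is an arbitrary illustrative value.
    """
    return [
        (seq, qual)
        for seq, qual in records
        if expected_errors(qual, offset) <= max_expected_errors
    ]
```

With these assumptions, a read whose qualities are all `I` (Q40) passes easily, while a read containing two `!` characters (Q0) is dropped at the default threshold.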
clustering and classification
With respect to parameters such as sequence composition, motifs, higher-order Markov dependencies, ... This is of fundamental importance for viral quasispecies or metagenomics studies.
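A common starting point for composition-based clustering is to turn each sequence into a vector of k-mer frequencies and compare the vectors. Counts of k-mers are also what you would use to estimate a Markov chain of order k-1. The following is only a sketch with hypothetical function names and a plain Euclidean distance; real studies would typically feed such vectors into a proper clustering or classification method.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Relative frequencies of all 4**k k-mers over the alphabet ACGT.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean_distance(u, v):
    """Euclidean distance between two equally long frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Two AT-rich sequences should end up closer to each other than either is to a GC-rich one, which is the property a clustering step would exploit.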
your data vs a probabilistic model
Testing your data against a technology-specific probabilistic model could reveal a possible sequence composition bias or an unexpected number of identical reads, and thus act as a quality control. Furthermore, this kind of process could easily add value for sequencing providers when implemented and deployed as a workflow.
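To make the idea of testing data against a model concrete, here is a minimal sketch of a chi-square goodness-of-fit check on base composition, plus a count of identical reads. The expected frequencies are a placeholder you supply yourself, not a real technology-specific model, and the function names are hypothetical. Bases within a read are not independent, so the statistic should be read as a rough quality-control signal rather than a rigorous test.

```python
from collections import Counter

# Chi-square critical value for 3 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF3_ALPHA_05 = 7.815


def base_counts(reads):
    """Count A, C, G and T over all reads, ignoring other symbols."""
    counts = Counter()
    for read in reads:
        counts.update(b for b in read.upper() if b in "ACGT")
    return counts


def chi_square_statistic(observed, expected_freqs):
    """Pearson chi-square statistic of observed counts vs. model frequencies.

    expected_freqs maps each base to its (non-zero) model probability.
    """
    total = sum(observed[b] for b in expected_freqs)
    return sum(
        (observed[b] - total * p) ** 2 / (total * p)
        for b, p in expected_freqs.items()
    )


def duplicate_read_counts(reads):
    """Return the reads that occur more than once, with their counts."""
    return {read: n for read, n in Counter(reads).items() if n > 1}
```

A statistic above `CHI2_CRITICAL_DF3_ALPHA_05` would suggest that the observed composition departs from the assumed model, for instance when calling `chi_square_statistic(base_counts(reads), {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})` with a uniform placeholder model.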

