Read Types

From ExpressionPlot
Revision as of 11:16, 13 June 2011 by Brad (Talk | contribs) (Canonical Distribution Plots)

Example "read_type" plot (readclass=matching, normalize=true)
Example "canonical distribution" plot (readclass=positional)

Generate bar plots showing the numbers or percentages of reads with different types of alignments. "Types" of alignments can include "none", "multiple", "single", "paired", "exonic", "intronic", etc.

Or, generate a "canonical distribution" plot, showing the positional distribution of read densities flanking 11 types of genomic "landmarks", based on the transcripts in the "knownCanonical" UCSC table:

  1. intergenic: one landmark created half-way between every pair of adjacent, non-overlapping genes
  2. tss: transcription start sites
  3. start: start codon
  4. fe5ss: first exon 5' splice site
  5. intronic: one landmark created in the middle of each intron
  6. ie3ss: internal exon 3' splice site
  7. ie5ss: internal exon 5' splice site
  8. splice: a special landmark that quantifies read density over all splice junctions of the knownCanonical transcripts
  9. le3ss: last exon 3' splice site
  10. stop: stop codon
  11. pacs: poly-adenylation/cleavage site
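
The two "synthetic" landmark types (intergenic and intronic) are midpoints rather than annotated features. A minimal sketch of how such midpoints could be derived is shown here. The function names, the (start, end) tuple representation, the single-chromosome input and the rounding-down of midpoints are all assumptions made for illustration; they are not taken from the ExpressionPlot source.

```python
def intergenic_landmarks(genes):
    """Midpoints between adjacent, non-overlapping genes on one chromosome.

    genes: list of (start, end) integer tuples (hypothetical representation).
    """
    genes = sorted(genes)
    landmarks = []
    if not genes:
        return landmarks
    prev_end = genes[0][1]
    for start, end in genes[1:]:
        # Only a real gap between genes yields a landmark.
        if start > prev_end:
            landmarks.append((prev_end + start) // 2)
        # Track the furthest end seen so nested genes do not create false gaps.
        prev_end = max(prev_end, end)
    return landmarks


def intronic_landmarks(exons):
    """Midpoint of each intron of one transcript.

    exons: list of (start, end) integer tuples (hypothetical representation);
    introns are taken to be the gaps between consecutive exons.
    """
    exons = sorted(exons)
    return [(e1 + s2) // 2 for (_, e1), (s2, _) in zip(exons, exons[1:])]
```

For example, two genes at (100, 200) and (400, 500) would give a single intergenic landmark at position 300 under these assumptions.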

The options for read_types are as follows:

read_types Options

Set: The project you would like to analyze.

Read Class: Choosing "All" for read class tabulates the fate of all reads: non-matching, multiply matching, single-end unique matching or paired-end unique matching. Choosing "matching" shows the genomic features hit by just the aligning reads: exons, introns, both, splice junctions or intergenic. Choosing "positional" generates a "canonical distribution" plot, showing the positional distribution of reads flanking canonical genomic landmarks.

Normalize: Whether the bars in the bar plots should be normalized to 100% ("Yes") or shown as numbers of reads ("No"). Not an option when readclass=positional.

width: Width of the image in pixels.

height: Height of the image in pixels.
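
The effect of the Normalize option on a bar plot can be illustrated with a small sketch that converts per-category read counts to percentages of the total. The function name and the dictionary input are hypothetical, and the read counts in the usage note are made up for illustration; this is not ExpressionPlot's own code.

```python
def normalize_counts(counts):
    """Convert read counts per alignment type into percentages summing to 100.

    counts: dict mapping an alignment type (e.g. "none", "single") to a read count.
    """
    total = sum(counts.values())
    if total == 0:
        # No reads at all: report every category as 0% rather than dividing by zero.
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}
```

With counts of 25 "none", 25 "multiple" and 50 "single" reads, the normalized bars would be 25%, 25% and 50%.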

Canonical Distribution Plots

To calculate the canonical distributions, each chromosomal position is mapped to its closest landmark. All positions farther than 200 bases from the landmark (or whatever value of --radius-canon-dist is supplied) are counted at 201 bases from the landmark, which may lead to an apparent drop-off or spike at the extremal positions (see, for example, the right extreme of the pacs distributions in the figure, in red).

For each distance <math>d</math> from -201 to 201 and for each landmark type <math>t</math>, the number of positions <math>NC(t,d)</math> in the genome at that distance from the nearest landmark is counted. The sign indicates whether the position is downstream or upstream on the same strand as the landmark, or on the plus strand for intergenic landmarks. Then the number of reads <math>NR(t,d)</math> overlapping each distance from each type of landmark is counted. Each base of each alignment is counted at the appropriate distance from its nearest landmark; usually the entire read is nearest to a single landmark, but the beginning of a read can have a different nearest landmark than its middle or end. Finally, the number of splice junctions in canonical transcripts and the number of reads aligning to those splice junctions are counted.

The RPKM for a particular landmark type at a particular distance is calculated by normalizing the total number of reads whose alignment overlaps that distance from their nearest landmark of that type by the total number of chromosomal positions at that distance from that type of landmark and by the total number of reads <math>N</math> for that lane:


<math>\begin{align}
t &= \mbox{landmark type} \\
d &= \mbox{signed distance to nearest landmark} \\
NC(t,d) &= \mbox{number of chromosomal positions at distance } d \mbox{ to nearest landmark of type } t \\
NR(t,d) &= \mbox{number of read alignment positions at distance } d \mbox{ to nearest landmark of type } t \\
N &= \mbox{total number of aligning reads for lane} \\
RPKM(t,d) &= \frac{NR(t,d) \times 10^9}{NC(t,d) \times N}
\end{align}</math>


for <math>t \ne \mbox{splice}</math>. For splice junction reads there is no positional parameter since distance from a splice junction is already captured by distance from a splice site, so, letting <math>NC(\mbox{splice})</math> denote the total number of splice junctions in canonical transcripts and <math>NR(\mbox{splice})</math> denote the total number of read alignments overlapping those junctions,

<math>RPKM(\mbox{splice}) = \frac{NR(\mbox{splice}) \times 10^9}{NC(\mbox{splice}) \times N}</math>
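
The calculation described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not ExpressionPlot's implementation: the function names are hypothetical, the inputs are assumed to be precomputed signed distances to the nearest landmark of a single type, and the default radius of 200 mirrors the default described above.

```python
from collections import Counter


def clamp_distance(d, radius=200):
    """Count anything farther than `radius` bases at radius + 1 (sign preserved)."""
    limit = radius + 1
    return max(-limit, min(limit, d))


def rpkm(nr, nc, n_reads):
    """RPKM = NR * 10^9 / (NC * N); also applies to the splice landmark,
    where nc is the number of canonical splice junctions."""
    return nr * 1e9 / (nc * n_reads)


def rpkm_by_distance(position_distances, read_base_distances, n_reads, radius=200):
    """RPKM(t, d) for one landmark type t.

    position_distances: signed distance of every chromosomal position to its
        nearest landmark of this type (assumed precomputed).
    read_base_distances: signed distance of every aligned read base to its
        nearest landmark of this type (assumed precomputed).
    n_reads: total number of aligning reads for the lane (N).
    """
    nc = Counter(clamp_distance(d, radius) for d in position_distances)
    nr = Counter(clamp_distance(d, radius) for d in read_base_distances)
    # Only distances that occur in the genome have a defined denominator.
    return {d: rpkm(nr[d], nc[d], n_reads) for d in nc}
```

Because every position beyond the radius is pooled into the bins at ±(radius + 1), those two bins can hold far more positions and reads than their neighbours, which is the source of the edge effects noted above.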