Format for ExpressionPlot annotation files
Annotation files can be supplied to the ExpressionPlot backend with the following switches:
-cl clusters_fn -nc ncRNA_clusters_fn -ensT ensembl_tRNA_clusters_fn -ae AE_fn -ri RI_fn -ate ATE_fn
These are described in more detail in the long usage message. Briefly, the first three (-cl, -nc and -ensT) can be used to supply three different gene annotations, -ae is for alternative (cassette) exon splicing events, -ri is for retained intron events and -ate is for alternative terminal exon events. In practice we usually use UCSC genes for -cl and Ensembl genes for -ensT (actually a set of Ensembl genes with annotated tRNAs added in). We don't use the -nc switch much, but in the past we've used it specifically for non-coding RNAs. In reality you could use any three different gene sets with these switches, and you don't have to use more than one.
All of the files are in headered TSV format. The columns are described below.
Gene Cluster Files
This is the format of files supplied with the -cl, -nc or -ensT switches. These are called "gene cluster files" because I usually think of them as describing the cluster of transcripts associated with a gene, with the coordinates giving the union of all the exons of any transcript. It is not necessary to interpret them this way, though: you could create a file where each row is a different transcript. In that case, reads would get counted onto all the transcripts.
I usually create this file and then run a script to trim overlapping genes (from the opposite strand). This is important to avoid counting overlapping minus strand transcripts. The files available for download from the website already have this trimming.
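The trimming script itself is not described here, so the following is only a sketch of one plausible reading: every position of a gene that is also covered by a gene on the opposite strand is removed from its coordinates. The function name `subtract_intervals`, the inclusive-coordinate convention and the toy coordinates are all assumptions made for illustration, not the behavior of the ExpressionPlot script.

```python
def subtract_intervals(exons, blockers):
    """Remove from `exons` every position covered by `blockers`.

    Both arguments are lists of (start, end) tuples. Coordinates are
    treated as inclusive, which is an assumption of this sketch.
    """
    trimmed = []
    for start, end in sorted(exons):
        pieces = [(start, end)]
        for b_start, b_end in sorted(blockers):
            next_pieces = []
            for p_start, p_end in pieces:
                if b_end < p_start or b_start > p_end:
                    # No overlap: keep the piece unchanged.
                    next_pieces.append((p_start, p_end))
                    continue
                if b_start > p_start:
                    # Keep the part to the left of the blocker.
                    next_pieces.append((p_start, b_start - 1))
                if b_end < p_end:
                    # Keep the part to the right of the blocker.
                    next_pieces.append((b_end + 1, p_end))
            pieces = next_pieces
        trimmed.extend(pieces)
    return trimmed


# Hypothetical toy genes on opposite strands.
plus_gene = [(100, 200), (300, 400)]
minus_gene = [(150, 320)]
trimmed_plus = subtract_intervals(plus_gene, minus_gene)
print(trimmed_plus)
```

The same function would be applied in the other direction to trim the minus-strand gene against the plus-strand one.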
The columns are as follows:
Gene cluster annotation file:

column name | description |
---|---|
clusterID | This is any identifier for the gene. For example, I use UCSC cluster ID (from knownIsoforms table) or Ensembl ENSG ID. |
nKG | This is the number of transcripts associated with the gene. This field is not used by ExpressionPlot and can be omitted. |
aliases | This is a comma-separated list of names for the gene. These fields will be used by tools that need to look up genes by names rather than IDs. |
chr | This is the chromosome of the gene (usually begins with the characters "chr"). |
str | The strand of the gene: "+" or "-". |
nInterval | The number of "intervals" for the gene. This field is not really used and should probably be taken out. It is meant to give the number of exons (or exon-like units) in the coords field. |
coords | Each exon (or exon-like unit) is described by a string of the form "$start,$end", and then all of the exons are joined together with ";". |
Here is an example of the first few lines of the mm9_gene_clusters.tsv file, organized into a table, and then as unformatted text:
clusterID | nKG | aliases | chr | str | nInterval | coords |
---|---|---|---|---|---|---|
1 | 2 | Xrg4,mKIAA1889,Xkr4,XKR4_MOUSE | chr1 | - | 4 | 3195985,3197398;3203520,3207049;3411783,3411982;3660633,3661579 |
2 | 1 | uc007aev.1 | chr1 | - | 2 | 3638392,3640590;3648928,3648985 |
3 | 2 | RP1_MOUSE,Rp1h,Orp1,Rp1 | chr1 | - | 6 | 4280927,4283093;4334224,4340172;4341991,4342162;4342283,4342918;4350281,4350473;4399251,4399322 |
4 | 5 | SOX17_MOUSE,Sox-17,Sox17 | chr1 | - | 5 | 4481009,4482749;4483181,4483816;4483853,4483944;4485217,4486023;4486372,4486494 |
5 | 3 | Mrpl15,RM15_MOUSE | chr1 | - | 6 | 4764015,4764597;4766458,4766882;4767606,4767729;4772649,4772814;4774032,4774186;4775654,4775768 |
6 | 2 | Lypla1,Apt1,Pla1a,LYPA1_MOUSE | chr1 | + | 9 | 4797974,4798063;4798536,4798567;4818665,4818730;4820349,4820396;4822392,4822462;4827082,4827155;4829468,4829569;4831037,4832908;4835044,4836816 |
clusterID	nKG	aliases	chr	str	nInterval	coords
1	2	Xrg4,mKIAA1889,Xkr4,XKR4_MOUSE	chr1	-	4	3195985,3197398;3203520,3207049;3411783,3411982;3660633,3661579
2	1	uc007aev.1	chr1	-	2	3638392,3640590;3648928,3648985
3	2	RP1_MOUSE,Rp1h,Orp1,Rp1	chr1	-	6	4280927,4283093;4334224,4340172;4341991,4342162;4342283,4342918;4350281,4350473;4399251,4399322
4	5	SOX17_MOUSE,Sox-17,Sox17	chr1	-	5	4481009,4482749;4483181,4483816;4483853,4483944;4485217,4486023;4486372,4486494
5	3	Mrpl15,RM15_MOUSE	chr1	-	6	4764015,4764597;4766458,4766882;4767606,4767729;4772649,4772814;4774032,4774186;4775654,4775768
6	2	Lypla1,Apt1,Pla1a,LYPA1_MOUSE	chr1	+	9	4797974,4798063;4798536,4798567;4818665,4818730;4820349,4820396;4822392,4822462;4827082,4827155;4829468,4829569;4831037,4832908;4835044,4836816
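To make the column layout concrete, here is a minimal sketch of reading a gene cluster file with Python's standard `csv` module. The helper names `parse_coords` and `read_gene_clusters` are made up for this example and are not part of ExpressionPlot. The nKG and nInterval columns are deliberately not read, since the table above says they are not needed.

```python
import csv
import io


def parse_coords(coords):
    """Turn "s1,e1;s2,e2" into [(s1, e1), (s2, e2)]."""
    intervals = []
    for chunk in coords.split(";"):
        start, end = chunk.split(",")
        intervals.append((int(start), int(end)))
    return intervals


def read_gene_clusters(handle):
    """Yield one dict per gene cluster from a headered TSV handle."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            "clusterID": row["clusterID"],
            "aliases": row["aliases"].split(","),
            "chr": row["chr"],
            "str": row["str"],
            "intervals": parse_coords(row["coords"]),
        }


# Header plus the second data row of the example above.
example = (
    "clusterID\tnKG\taliases\tchr\tstr\tnInterval\tcoords\n"
    "2\t1\tuc007aev.1\tchr1\t-\t2\t3638392,3640590;3648928,3648985\n"
)
clusters = list(read_gene_clusters(io.StringIO(example)))
print(clusters[0]["intervals"])
```

For a file on disk, the same generator can be fed an open file handle instead of the `io.StringIO` wrapper.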
ExpressionPlot includes scripts to create gene cluster files based on UCSC-like transcript tables.
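Those scripts are not shown here. As a rough illustration of what "the union of all the exons of any transcript" means for the coords field, the sketch below merges overlapping exons from several transcripts and formats the result as a coords string. The function name `union_coords` and the toy coordinates are hypothetical, and only overlapping exons are merged (exons that merely touch are left separate), which is an assumption of this example.

```python
def union_coords(transcripts):
    """Merge the exons of several transcripts into one coords string.

    `transcripts` is a list of exon lists; each exon is a (start, end)
    tuple in chromosomal coordinates.
    """
    exons = sorted(exon for transcript in transcripts for exon in transcript)
    merged = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return ";".join(f"{s},{e}" for s, e in merged)


# Two hypothetical transcripts of the same gene.
transcript_a = [(100, 200), (300, 400)]
transcript_b = [(150, 250), (500, 600)]
coords = union_coords([transcript_a, transcript_b])
print(coords)  # 100,250;300,400;500,600
```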
Alternative Exon Files
The -ae switch lets you provide alternative cassette exon events to the ExpressionPlot RNA-Seq backend. Cassette exons are exons which can be either included or entirely skipped in any particular transcript. They are all by definition internal (not first or last) exons. Since assigning a read to the exon-skipping isoform requires that the read be anchored in the same gene of which the exon is a part, the known splice sites of that gene must be included in the annotation of each event. The alternative exon files that come with ExpressionPlot are notable in that every internal exon of every transcript is considered a candidate cassette exon.
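As a rough illustration of treating every internal exon as a candidate, the sketch below lists the internal exons of a gene's transcripts and collects, for each one, the known exon boundaries on either side that could splice over it. This is not the ExpressionPlot script: the function name `candidate_cassette_exons`, the toy coordinates, and the choice to compare sites purely by chromosomal position are assumptions made for the example.

```python
def candidate_cassette_exons(transcripts):
    """Return one event per distinct internal exon.

    `transcripts` is a list of exon lists; each exon is a (start, end)
    tuple and each list is sorted by chromosomal coordinate.
    """
    # Internal exons: everything except the first and last exon.
    candidates = sorted({exon for t in transcripts for exon in t[1:-1]})
    # Exon ends that are followed by an intron, and exon starts that
    # are preceded by one, pooled over all transcripts of the gene.
    exon_ends = {exon[1] for t in transcripts for exon in t[:-1]}
    exon_starts = {exon[0] for t in transcripts for exon in t[1:]}
    events = []
    for start, end in candidates:
        events.append({
            "reg.exon": f"{start},{end}",
            # Sites at lower chromosomal coordinates than the exon.
            "ss.up": sorted(site for site in exon_ends if site < start),
            # Sites at higher chromosomal coordinates than the exon.
            "ss.dn": sorted(site for site in exon_starts if site > end),
        })
    return events


# Two hypothetical transcripts; the second skips the middle exon.
transcripts = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],
]
events = candidate_cassette_exons(transcripts)
print(events)
```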
The columns of the alternative exon files are as follows:
Alternative exon annotation file:

column name | description |
---|---|
gene | The name of the gene that hosts the exon. Here "name" can be based on any system, for example an accepted gene symbol or an Ensembl identifier. The field is only important in that it will show up on output tables, so you can easily identify the gene of which the exon is a part. |
chr | The chromosome on which the exon is located. |
strand | The strand on which the exon is located. |
reg.up | The "upstream region" of the exon. This is a union of genomic intervals which represents all the upstream exonic regions of the same gene. These are used to calculate the "Psg" and "LORG" statistics. The regions are represented by a string of the form "start1,end1;start2,end2;...". In this context "upstream" has a slightly confusing definition: it is defined as "lower chromosomal coordinates". This means that it is 5' of the exon for a plus-strand gene but 3' of the exon for a minus-strand gene. |
ss.up | A comma-separated list of the known upstream splice sites. As with reg.up, "upstream" means lower chromosomal coordinates, which are not 5' to the exon for minus-strand genes. For plus-strand genes 5' splice sites are given and for minus-strand genes 3' splice sites are given: these are the sites which could potentially splice over the candidate exon. The coordinates of the terminal exonic nucleotide are given. |
reg.exon | The coordinates of the candidate exon itself, in a string of the form "start,end". |
reg.dn, ss.dn | Same as reg.up and ss.up, but for the downstream region. |
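Because "upstream" here means lower chromosomal coordinates rather than 5', it can help to convert an event into transcript order before reasoning about it. The helper below is a small sketch of that conversion. The name `transcript_order` is made up, and the accepted strand spellings are an assumption: "+" and "-" follow the gene cluster files, "1" appears in the example file below, and "-1" is only a guess at its minus-strand counterpart.

```python
def transcript_order(strand, reg_up, reg_dn):
    """Return (five_prime_region, three_prime_region) for an event.

    `reg_up` and `reg_dn` are the values of the reg.up and reg.dn
    columns, which are ordered by chromosomal coordinate.
    """
    if strand in ("+", "1"):
        # Plus strand: lower coordinates are 5' of the exon.
        return reg_up, reg_dn
    if strand in ("-", "-1"):
        # Minus strand: lower coordinates are 3' of the exon.
        return reg_dn, reg_up
    raise ValueError(f"unrecognized strand: {strand!r}")


# Hypothetical regions, just to show the swap.
five_prime, three_prime = transcript_order("-", "100,200", "500,600")
print(five_prime, three_prime)
```

The same swap applies to ss.up and ss.dn.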
Here is an example of the first few lines of the mm9_acembly_AE_events_with_flanking_SS.tsv file, organized into a table, and then as unformatted text:
gene | chr | strand | reg.up | ss.up | reg.exon | reg.dn | ss.dn | exon.ids | alt |
---|---|---|---|---|---|---|---|---|---|
CREB3L1 | chr11 | 1 | 46255796,46256340;46278062,46278290 | 46256340,46278290,46273460 | 46285943,46286127 | 46288116,46288194;46289159,46289316;46290452,46290601;46290739,46290797;46290978,46291046 | 46289159,46290739,46294413,46298391,46290978,46290452,46288116,46298837,46293298,46295488 | ex108878 | Const |
CREB3L1 | chr11 | 1 | 46255796,46256340;46278062,46278290;46285943,46286127;46288116,46288194 | 46256340,46286127,46278290,46273460,46288194 | 46289159,46289316 | 46290452,46290601;46290739,46290797;46290978,46291046 | 46290739,46294413,46298391,46290978,46290452,46298837,46293298,46295488 | ex108880 | Const |
CREB3L1 | chr11 | 1 | | 46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194 | 46290739,46290797 | 46290978,46291046 | 46294413,46298391,46290978,46298837,46293298,46295488 | ex108882 | Const |
CREB3L1 | chr11 | 1 | 46290739,46290797;46290978,46291046 | 46291046,46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797 | 46294413,46294512 | 46295488,46295614;46298391,46298655;46298837,46299519 | 46298391,46298837,46295488 | ex108884 | Const |
CREB3L1 | chr11 | 1 | 46294413,46294512;46295488,46295614 | 46295614,46291046,46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797,46294512 | 46298391,46298655 | 46298837,46299519 | 46298837 | ex108886 | Const |
CREB3L1 | chr11 | 1 | 46290739,46290797 | 46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797 | 46290978,46291046 | | 46294413,46298391,46298837,46293298,46295488 | ex108883 | Const |
gene	chr	strand	reg.up	ss.up	reg.exon	reg.dn	ss.dn	exon.ids	alt
CREB3L1	chr11	1	46255796,46256340;46278062,46278290	46256340,46278290,46273460	46285943,46286127	46288116,46288194;46289159,46289316;46290452,46290601;46290739,46290797;46290978,46291046	46289159,46290739,46294413,46298391,46290978,46290452,46288116,46298837,46293298,46295488	ex108878	Const
CREB3L1	chr11	1	46255796,46256340;46278062,46278290;46285943,46286127;46288116,46288194	46256340,46286127,46278290,46273460,46288194	46289159,46289316	46290452,46290601;46290739,46290797;46290978,46291046	46290739,46294413,46298391,46290978,46290452,46298837,46293298,46295488	ex108880	Const
CREB3L1	chr11	1		46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194	46290739,46290797	46290978,46291046	46294413,46298391,46290978,46298837,46293298,46295488	ex108882	Const
CREB3L1	chr11	1	46290739,46290797;46290978,46291046	46291046,46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797	46294413,46294512	46295488,46295614;46298391,46298655;46298837,46299519	46298391,46298837,46295488	ex108884	Const
CREB3L1	chr11	1	46294413,46294512;46295488,46295614	46295614,46291046,46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797,46294512	46298391,46298655	46298837,46299519	46298837	ex108886	Const
CREB3L1	chr11	1	46290739,46290797	46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797	46290978,46291046		46294413,46298391,46298837,46293298,46295488	ex108883	Const
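The alternative exon file mixes two list encodings: regions joined with ";" and splice sites joined with ",". A reader therefore has to treat the two kinds of column differently. The sketch below does that for one row taken from the example above. The helper names are hypothetical, and returning an empty list for an empty field is an assumption of this sketch rather than documented behavior.

```python
import csv
import io


def parse_regions(text):
    """Turn "s1,e1;s2,e2" into [(s1, e1), (s2, e2)]; empty text gives []."""
    if not text:
        return []
    regions = []
    for chunk in text.split(";"):
        start, end = chunk.split(",")
        regions.append((int(start), int(end)))
    return regions


def parse_sites(text):
    """Turn "a,b,c" into [a, b, c]; empty text gives []."""
    return [int(site) for site in text.split(",")] if text else []


def read_ae_events(handle):
    """Yield one dict per alternative exon event from a headered TSV."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            "gene": row["gene"],
            "chr": row["chr"],
            "strand": row["strand"],
            "reg.up": parse_regions(row["reg.up"]),
            "ss.up": parse_sites(row["ss.up"]),
            "reg.exon": parse_regions(row["reg.exon"])[0],
            "reg.dn": parse_regions(row["reg.dn"]),
            "ss.dn": parse_sites(row["ss.dn"]),
        }


# Header plus the ex108886 row of the example above.
example = "\n".join([
    "\t".join(["gene", "chr", "strand", "reg.up", "ss.up", "reg.exon",
               "reg.dn", "ss.dn", "exon.ids", "alt"]),
    "\t".join([
        "CREB3L1", "chr11", "1",
        "46294413,46294512;46295488,46295614",
        "46295614,46291046,46289603,46290601,46256340,46286127,"
        "46278290,46273460,46289316,46288194,46290797,46294512",
        "46298391,46298655",
        "46298837,46299519",
        "46298837",
        "ex108886", "Const",
    ]),
]) + "\n"
ae_events = list(read_ae_events(io.StringIO(example)))
print(ae_events[0]["reg.exon"], ae_events[0]["ss.dn"])
```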
ExpressionPlot includes scripts to create alternative exon files based on UCSC-like transcript tables.
Retained Introns
The -ri switch lets you provide intron retention events to the ExpressionPlot RNA-Seq backend.
This section is under construction.
Alternative Terminal Exon Files
The -ate switch lets you provide alternative terminal exon events to the ExpressionPlot RNA-Seq backend.
This section is under construction.