Format for ExpressionPlot annotation files

Annotation files can be supplied to the ExpressionPlot backend with the following switches:

 -cl clusters_fn
 -nc ncRNA_clusters_fn
 -ensT ensembl_tRNA_clusters_fn
 -ae AE_fn
 -ri RI_fn
 -ate ATE_fn

These are described in more detail in the long usage message. Briefly, the first three (-cl, -nc and -ensT) can be used to supply up to three different gene annotations, -ae is for alternative (cassette) exon splicing events, -ri is for retained intron events and -ate is for alternative terminal exon events. In practice we usually use UCSC genes with -cl and Ensembl genes with -ensT (actually a set of Ensembl genes with annotated tRNAs added in). We don't use the -nc switch much, but in the past we've used it specifically for non-coding RNAs. You could use any three gene sets with these switches, and you don't have to use more than one.

All of the files are in tab-separated (TSV) format with a header row. The columns are described below.

Gene Cluster Files

This is the format of files supplied with the -cl, -nc or -ensT switches. These are called "gene cluster files" because I usually think of them as describing the cluster of transcripts associated with a gene, with the coordinates giving the union of all the exons of any transcript. It is not necessary to interpret them this way, though: you could create a file where each row is a different transcript. In that case, however, reads would get counted onto all the transcripts.
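As an illustration of the "union of all the exons" idea, here is a minimal Python sketch that merges the exons of several transcripts into one set of intervals and writes them in the coords format. The function names and the toy coordinates are hypothetical and are not part of ExpressionPlot; the sketch also assumes that intervals which overlap or share an endpoint should be merged, which may differ from what the ExpressionPlot scripts actually do.

```python
def merge_exons(transcripts):
    """Union the exons of several transcripts into sorted, non-overlapping intervals.

    `transcripts` is a list of transcripts, each a list of (start, end) exons.
    """
    exons = sorted(exon for tx in transcripts for exon in tx)
    merged = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def format_coords(intervals):
    """Write intervals as "start,end" pairs joined with ";"."""
    return ";".join(f"{start},{end}" for start, end in intervals)


# Two made-up transcripts of the same gene.
transcripts = [
    [(100, 200), (300, 400)],
    [(150, 250), (300, 350), (500, 600)],
]
cluster = merge_exons(transcripts)
n_interval = len(cluster)          # would go in the nInterval column
coords = format_coords(cluster)    # would go in the coords column
print(n_interval, coords)
```

With one row per gene built this way, each read is counted once against the gene rather than once per transcript.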

I usually create this file and then run a script to trim the parts of genes that overlap a gene on the opposite strand. This is important to avoid counting reads from overlapping opposite-strand transcripts. The files available for download from the website have already been trimmed in this way.

The columns are as follows:

gene cluster annotation file
column name description
clusterID This is any identifier for the gene. For example, I use UCSC cluster ID (from knownIsoforms table) or Ensembl ENSG ID.
nKG This is the number of transcripts associated with the gene cluster. This field is not used by ExpressionPlot and can be omitted.
aliases This is a comma-separated list of names for the gene. These fields will be used by tools that need to look up genes by names rather than IDs.

chr This is the chromosome of the gene (usually begins with the characters "chr").
str The strand of the gene: "+" or "-".
nInterval The number of "intervals" for the gene. This field is not really used and should probably be taken out. It is meant to give the number of exons (or exon-like units) in the coords field.
coords Each exon (or exon-like unit) is described by a string of the form "$start,$end", and then all of the exons are joined together with ";".

Here is an example of the first few lines of the mm9_gene_clusters.tsv file, organized into a table, and then as unformatted text:

clusterID nKG aliases chr str nInterval coords
1 2 Xrg4,mKIAA1889,Xkr4,XKR4_MOUSE chr1 - 4 3195985,3197398;3203520,3207049;3411783,3411982;3660633,3661579
2 1 uc007aev.1 chr1 - 2 3638392,3640590;3648928,3648985
3 2 RP1_MOUSE,Rp1h,Orp1,Rp1 chr1 - 6 4280927,4283093;4334224,4340172;4341991,4342162;4342283,4342918;4350281,4350473;4399251,4399322
4 5 SOX17_MOUSE,Sox-17,Sox17 chr1 - 5 4481009,4482749;4483181,4483816;4483853,4483944;4485217,4486023;4486372,4486494
5 3 Mrpl15,RM15_MOUSE chr1 - 6 4764015,4764597;4766458,4766882;4767606,4767729;4772649,4772814;4774032,4774186;4775654,4775768
6 2 Lypla1,Apt1,Pla1a,LYPA1_MOUSE chr1 + 9 4797974,4798063;4798536,4798567;4818665,4818730;4820349,4820396;4822392,4822462;4827082,4827155;4829468,4829569;4831037,4832908;4835044,4836816
clusterID	nKG	aliases	chr	str	nInterval	coords
1	2	Xrg4,mKIAA1889,Xkr4,XKR4_MOUSE	chr1	-	4	3195985,3197398;3203520,3207049;3411783,3411982;3660633,3661579
2	1	uc007aev.1	chr1	-	2	3638392,3640590;3648928,3648985
3	2	RP1_MOUSE,Rp1h,Orp1,Rp1	chr1	-	6	4280927,4283093;4334224,4340172;4341991,4342162;4342283,4342918;4350281,4350473;4399251,4399322
4	5	SOX17_MOUSE,Sox-17,Sox17	chr1	-	5	4481009,4482749;4483181,4483816;4483853,4483944;4485217,4486023;4486372,4486494
5	3	Mrpl15,RM15_MOUSE	chr1	-	6	4764015,4764597;4766458,4766882;4767606,4767729;4772649,4772814;4774032,4774186;4775654,4775768
6	2	Lypla1,Apt1,Pla1a,LYPA1_MOUSE	chr1	+	9	4797974,4798063;4798536,4798567;4818665,4818730;4820349,4820396;4822392,4822462;4827082,4827155;4829468,4829569;4831037,4832908;4835044,4836816


ExpressionPlot includes scripts to create gene cluster files based on UCSC-like transcript tables.
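
To read a gene cluster file back in, something like the following Python sketch may be useful. It is not part of ExpressionPlot: the function names are hypothetical, and the code assumes only the column layout described above (a header row, tab-separated fields, coords as "start,end" pairs joined with ";"). The nKG and nInterval columns are ignored, since neither is relied on.

```python
import csv
import io


def parse_coords(coords):
    """Turn "s1,e1;s2,e2;..." into a list of (start, end) integer pairs."""
    return [tuple(int(x) for x in pair.split(",")) for pair in coords.split(";")]


def read_gene_clusters(handle):
    """Yield one dict per row of a headered, tab-separated gene cluster file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "clusterID": row["clusterID"],
            "aliases": row["aliases"].split(","),
            "chr": row["chr"],
            "str": row["str"],
            "coords": parse_coords(row["coords"]),
        }


# Header plus one row copied from the mm9 example above.
example = (
    "clusterID\tnKG\taliases\tchr\tstr\tnInterval\tcoords\n"
    "2\t1\tuc007aev.1\tchr1\t-\t2\t3638392,3640590;3648928,3648985\n"
)
genes = list(read_gene_clusters(io.StringIO(example)))
print(genes[0]["clusterID"], genes[0]["str"], genes[0]["coords"])
```

For a real file, pass an open file handle instead of the io.StringIO object.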

Alternative Exon Files

The -ae switch lets you provide alternative cassette exon events to the ExpressionPlot RNA-Seq backend. Cassette exons are exons which can be either included or entirely skipped in any particular transcript. They are all by definition internal (not first or last) exons. Since assigning a read to the exon-skipping isoform requires that the read be anchored in the same gene of which the exon is a part, the known splice sites of that gene must be included in the annotation of each event. The alternative exon files that come with ExpressionPlot are notable in that every internal exon of every transcript is considered a candidate cassette exon.

The columns of the alternative exon files are as follows:

alternative exon annotation file
column name description
gene The name of the gene that hosts the exon. Here "name" can be based on any system, for example an accepted gene symbol or an Ensembl identifier. The field is only important in that it will show up on output tables, so you can easily identify the gene of which the exon is a part.
chr The chromosome on which the exon is located
strand The strand on which the exon is located.
reg.up The "upstream region" of the exon. This is a union of genomic intervals which represents all the upstream exonic regions of the same gene. These are used to calculate the "Psg" and "LORG" statistics. The regions are represented by a string of the form "start1,end1;start2,end2;...". In this context "upstream" has a slightly confusing definition: it is defined as "lower chromosomal coordinates". This means that it is 5' of the exon for a plus-strand gene but 3' of the exon for a minus-strand gene.

ss.up A comma-separated list of the known upstream splice sites. As with reg.up, "upstream" means lower chromosomal coordinates, which are not 5' of the exon for minus-strand genes. For plus-strand genes 5' splice sites are given and for minus-strand genes 3' splice sites are given: these are the sites which could potentially splice over the candidate exon. The coordinates of the terminal exonic nucleotide are given.
reg.exon The coordinates of the candidate exon itself, in a string of the form "start,end".
reg.dn, ss.dn Same as reg.up and ss.up, but for downstream region.
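
The coordinate-based meaning of "upstream" can be checked mechanically. The Python sketch below is not part of ExpressionPlot and its function names are hypothetical. It parses the region and splice-site fields of one event and confirms that everything in reg.up and ss.up lies at lower coordinates than the exon. It also assumes that "downstream" mirrors this and means higher coordinates. The upstream_side helper simply restates the 5'/3' rule above for "+" and "-" strands.

```python
def parse_regions(field):
    """'s1,e1;s2,e2' -> [(s1, e1), (s2, e2)]; an empty field gives []."""
    if not field:
        return []
    return [tuple(int(x) for x in pair.split(",")) for pair in field.split(";")]


def parse_sites(field):
    """'a,b,c' -> [a, b, c]; an empty field gives []."""
    return [int(x) for x in field.split(",")] if field else []


def flanks_are_consistent(row):
    """True if up fields lie below the exon and dn fields lie above it."""
    exon_start, exon_end = parse_regions(row["reg.exon"])[0]
    up_ok = all(end < exon_start for _, end in parse_regions(row["reg.up"]))
    up_ok = up_ok and all(s < exon_start for s in parse_sites(row["ss.up"]))
    dn_ok = all(start > exon_end for start, _ in parse_regions(row["reg.dn"]))
    dn_ok = dn_ok and all(s > exon_end for s in parse_sites(row["ss.dn"]))
    return up_ok and dn_ok


def upstream_side(strand):
    """Which side of the exon "upstream" (lower coordinates) corresponds to."""
    if strand == "+":
        return "5'"
    if strand == "-":
        return "3'"
    raise ValueError(f"unexpected strand: {strand!r}")


# Fields of the first CREB3L1 event in the mm9 example file.
event = {
    "reg.up": "46255796,46256340;46278062,46278290",
    "ss.up": "46256340,46278290,46273460",
    "reg.exon": "46285943,46286127",
    "reg.dn": "46288116,46288194;46289159,46289316;46290452,46290601;"
              "46290739,46290797;46290978,46291046",
    "ss.dn": "46289159,46290739,46294413,46298391,46290978,46290452,"
             "46288116,46298837,46293298,46295488",
}
print(flanks_are_consistent(event), upstream_side("-"))
```

Empty reg.up or reg.dn fields, which do occur in the example file, are treated as having no intervals.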

Here is an example of the first few lines of the mm9_acembly_AE_events_with_flanking_SS.tsv file, organized into a table, and then as unformatted text:

gene chr strand reg.up ss.up reg.exon reg.dn ss.dn exon.ids alt
CREB3L1 chr11 1 46255796,46256340;46278062,46278290 46256340,46278290,46273460 46285943,46286127 46288116,46288194;46289159,46289316;46290452,46290601;46290739,46290797;46290978,46291046 46289159,46290739,46294413,46298391,46290978,46290452,46288116,46298837,46293298,46295488 ex108878 Const
CREB3L1 chr11 1 46255796,46256340;46278062,46278290;46285943,46286127;46288116,46288194 46256340,46286127,46278290,46273460,46288194 46289159,46289316 46290452,46290601;46290739,46290797;46290978,46291046 46290739,46294413,46298391,46290978,46290452,46298837,46293298,46295488 ex108880 Const
CREB3L1 chr11 1 46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194 46290739,46290797 46290978,46291046 46294413,46298391,46290978,46298837,46293298,46295488 ex108882 Const
CREB3L1 chr11 1 46290739,46290797;46290978,46291046 46291046,46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797 46294413,46294512 46295488,46295614;46298391,46298655;46298837,46299519 46298391,46298837,46295488 ex108884 Const
CREB3L1 chr11 1 46294413,46294512;46295488,46295614 46295614,46291046,46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797,46294512 46298391,46298655 46298837,46299519 46298837 ex108886 Const
CREB3L1 chr11 1 46290739,46290797 46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797 46290978,46291046 46294413,46298391,46298837,46293298,46295488 ex108883 Const
gene	chr	strand	reg.up	ss.up	reg.exon	reg.dn	ss.dn	exon.ids	alt
CREB3L1	chr11	1	46255796,46256340;46278062,46278290	46256340,46278290,46273460	46285943,46286127	46288116,46288194;46289159,46289316;46290452,46290601;46290739,46290797;46290978,46291046	46289159,46290739,46294413,46298391,46290978,46290452,46288116,46298837,46293298,46295488	ex108878	Const
CREB3L1	chr11	1	46255796,46256340;46278062,46278290;46285943,46286127;46288116,46288194	46256340,46286127,46278290,46273460,46288194	46289159,46289316	46290452,46290601;46290739,46290797;46290978,46291046	46290739,46294413,46298391,46290978,46290452,46298837,46293298,46295488	ex108880	Const
CREB3L1	chr11	1		46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194	46290739,46290797	46290978,46291046	46294413,46298391,46290978,46298837,46293298,46295488	ex108882	Const
CREB3L1	chr11	1	46290739,46290797;46290978,46291046	46291046,46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797	46294413,46294512	46295488,46295614;46298391,46298655;46298837,46299519	46298391,46298837,46295488	ex108884	Const
CREB3L1	chr11	1	46294413,46294512;46295488,46295614	46295614,46291046,46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797,46294512	46298391,46298655	46298837,46299519	46298837	ex108886	Const
CREB3L1	chr11	1	46290739,46290797	46289603,46290601,46256340,46286127,46278290,46273460,46289316,46288194,46290797	46290978,46291046		46294413,46298391,46298837,46293298,46295488	ex108883	Const

ExpressionPlot includes scripts to create alternative exon files based on UCSC-like transcript tables.

Retained Introns

The -ri switch lets you provide intron retention events to the ExpressionPlot RNA-Seq backend.

This section is under construction.

Alternative Terminal Exon Files

The -ate switch lets you provide alternative terminal exon events to the ExpressionPlot RNA-Seq backend.

This section is under construction.