Generating new annotations

From ExpressionPlot
Revision as of 18:13, 24 March 2011 by Brad


You can see which genomes already have annotations by running `expressionplot-config`/util/EP-manage.pl list repos_annot. If your genome is on the list then you can automatically download and install the annotations with EP-manage.pl. If it is not on the list, or if you don't like the annotations on the repository, then you can create a new annotation in one of two ways. The first is to use whatever method you please and generate files according to this specification. This method offers the most flexibility. The second is to use the suite of scripts in the annot directory (available in version 0.7 and later).

If you create a new annotation, either on your own or by the steps described below, and you think it might be of use to others, please contact the [[ExpressionPlot google group|expressionplot@]] so that we can upload your files to the repository and make them available to others.


The Complete ExpressionPlot Annotation

The complete ExpressionPlot annotation consists of the following files. Not all of the files are strictly required; which ones you need depends on the type of analysis you want to do. For example, a gene clusters file alone is enough for a gene expression analysis.

File | Short description
gene_clusters.tsv | Gene clusters.
trimmed_gene_clusters.tsv | Gene clusters, with alternate-strand-overlapping regions cut out. (In these regions it is hard to tell which gene the reads came from, so it is better not to count reads to either gene cluster. This includes the fairly common scenario of two alternate-strand genes with overlapping 3' UTRs, such as Tdp43 and Masp2.)
alternative_exons.tsv | Candidate alternative (internal) exons.
retained_introns.tsv | Candidate retained introns.
alt_term_exons.tsv | Candidate alternative terminal (5' or 3' UTR) exons. Perhaps a more precise name for these events is alternative transcript termini, since we are really interested only in the beginning or end of the transcript, and not in the splice site part of the terminal exons. However, we use the reads supporting the body of the exon to determine the usage of the associated terminus. Furthermore, although biologically we think of alternative promoter usage (first exons) and polyA/cleavage site usage (last exons) as very different, algorithmically they are identical.
genome.*ebwt | The bowtie indexes for your genome (get these from the bowtie FTP server).
junctions.hjl=X.*ebwt | This set of files constitutes a bowtie index for the splice junctions. A new splice junction database should be created for each read length so that overlap constraints are enforced. A possible rule of thumb is to require 8 nucleotides of overlap. The "half junction length" (the X in the file name) should then be the read length minus 8, and the total length of each sequence in the splice junction database would be <math>2 \times \text{readlen} - 16</math>.
junctions
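
The overlap rule of thumb for the splice junction index can be worked through in a few lines. The following is only a sketch of the arithmetic described in the table, not part of the ExpressionPlot scripts; the function names and the example read lengths are made up for illustration.

```python
def half_junction_length(read_len: int, min_overlap: int = 8) -> int:
    """Bases kept on each side of a splice junction.

    A read that lies entirely inside a junction sequence of this size
    must cover at least `min_overlap` bases on both sides of the junction.
    """
    if read_len <= min_overlap:
        raise ValueError("read length must exceed the required overlap")
    return read_len - min_overlap


def junction_sequence_length(read_len: int, min_overlap: int = 8) -> int:
    """Total length of one junction sequence: two half junctions."""
    return 2 * half_junction_length(read_len, min_overlap)


# Example read lengths (illustrative values only).
for read_len in (36, 50, 76):
    hjl = half_junction_length(read_len)
    total = junction_sequence_length(read_len)
    print(f"read length {read_len}: hjl={hjl}, junction sequence length={total}")
```

With the default overlap of 8, the second function is the same as the formula 2*readlen-16 given above.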


Generating the annotation files

To generate the annotation files you should first get the necessary tables from the UCSC MySQL server. You can fetch them with the script $EP_HOME/annot/fetch_mysql_table.pl. There is some flexibility as to which table(s) you use. At the very least you need a "transcript" table such as knownGene. For some genomes, the default UCSC genes are not the best choice. For example, for the rat genome it is much better to use ensGene because it is much more complete. If you use knownGene you will also need the knownIsoforms table, which groups the knownGene entries (transcripts) into gene clusters. If you use ensGene then you don't need it, since the name2 field of that table holds the Ensembl gene ID and serves the same purpose. Finally, another useful table for generating a very rich annotation is the acembly table. Here the transcript IDs begin with a gene name, and the annotation scripts know how to parse this to join the transcripts into gene clusters.
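
To make the idea of joining transcripts into gene clusters concrete, here is a minimal sketch. It is not the code used by the scripts in the annot directory; it only assumes that each transcript comes paired with a gene key (the clusterId from knownIsoforms, or name2 from ensGene). The function name and all IDs below are hypothetical.

```python
from collections import defaultdict


def group_transcripts(pairs):
    """Group transcript IDs by gene key.

    `pairs` is an iterable of (gene_key, transcript_id) tuples.
    Returns a dict mapping each gene key to its list of transcripts.
    """
    clusters = defaultdict(list)
    for gene_key, transcript_id in pairs:
        clusters[gene_key].append(transcript_id)
    return dict(clusters)


# Hypothetical (name2, name) pairs as they might be read from ensGene.
ens_rows = [("GENE_A", "TX_A1"), ("GENE_A", "TX_A2"), ("GENE_B", "TX_B1")]

# Hypothetical (clusterId, transcript) pairs as in knownIsoforms.
known_isoforms = [(1, "tx_x"), (1, "tx_y"), (2, "tx_z")]

print(group_transcripts(ens_rows))
print(group_transcripts(known_isoforms))
```

The same grouping serves both tables, which is why ensGene does not need a separate isoforms table: the gene key is already a column of the transcript table.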

In addition, it is good to grab the kgXref and kgAlias tables to get readable gene names. So I recommend getting the following tables, if they exist:

 knownGene
 kgXref
 kgAlias
 ensGene
 acembly

fetch_mysql_table.pl already knows the address of the UCSC MySQL server and will grab your tables from there. It downloads each table to a temporary file, changes the table name to have an ${org}_ prefix, and then loads the data into your ExpressionPlot MySQL server. It figures out the hostname, database name, user name and password by querying your ExpressionPlot config file. You can override these settings with switches but probably won't need to.
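
The renaming step can be pictured as follows. This is a sketch under assumptions, not the actual logic of fetch_mysql_table.pl: it assumes the downloaded table is a SQL dump in which the table name appears in backticks, and it uses "mm9" as a hypothetical value for ${org}. The function name is also made up.

```python
import re


def add_org_prefix(sql_dump: str, table: str, org: str) -> str:
    """Rename `table` to `<org>_<table>` wherever it appears in backticks."""
    pattern = re.compile("`" + re.escape(table) + "`")
    return pattern.sub(f"`{org}_{table}`", sql_dump)


# A tiny hand-written stand-in for a table dump (not real UCSC data).
dump = (
    "DROP TABLE IF EXISTS `knownGene`;\n"
    "CREATE TABLE `knownGene` (`name` varchar(255));\n"
    "INSERT INTO `knownGene` VALUES ('example');\n"
)

print(add_org_prefix(dump, "knownGene", "mm9"))
```

Matching the backticks along with the name keeps the substitution from touching other identifiers that merely contain the table name, such as knownGeneOld.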

Gene cluster files

The first file you'll want to generate is the gene cluster file.