Generating new annotations

From ExpressionPlot
Revision as of 02:15, 6 December 2011 by Brad (Talk | contribs)

Jump to: navigation, search

You can see which genomes already have annotations by running `expressionplot-config`/util/ list repos_annot. If your genome is on the list then you can automatically download and install the annotations with If it is not on the list, or if you don't like the annotations on the repository, then you can create a new annotation. To create a new annotation, you have two options. The first is to use whatever method you please and generate files according to this specification. This method offers the most flexibility. The second is to use a suite of scripts in the annot directory (available in version 0.7 and later).

If you do create a new annotation either on your on or by the steps described below that you think might be of use to others, please contact the ExpressionPlot google group so that we can upload your files to the repository and make them available to others.

The Complete ExpressionPlot Annotation

The complete ExpressionPlot annotation consists of the following files. It is not strictly required to have all of the files. It depends on the type of analysis you want to do. For example, with just a gene clusters file and a bowtie index it is possible to do a gene expression analysis!

file short description
gene_clusters.tsv Gene clusters
trimmed_gene_clusters.tsv Gene clusters, with alternate-strand-overlapping regions cut out. (In these regions it is hard to tell which gene the reads came from, so it is better not to count reads to either gene cluster. This includes the fairly common scenario of two alternate strand genes with overlapping 3' UTRs, such as Tdp43 and Masp2.
alternative_exons.tsv Candidate alternative (internal) exons.
retained_introns.tsv Candidate retained introns
alt_term_exons.tsv Candidate alternative terminal (5' or 3' UTR) exons. Perhaps a more precise name for these event are alternative transcript termini, since really we are interested only in the beginning/end of the transcript, and not the splice site part of the terminal exons. However, we use the reads supporting the body of the exon to determine the usage of the associated terminus. Furthermore, although biologically we think of alternative promoter usage (first exons) and polyA/cleavage site usage (last exons) as very different, algorithmically they are identical.
genome.*ebwt The bowtie indexes for your genome (get these from the bowtie FTP server.)
junctions.hjl=X.*ebwt This set of files constitute a bowtie index for the splice junctions. Note that a new splice junction database should be created for each read length so that overlap constraints are thereby enforced. A possible rule of thumb is to requiring 8 nucleotides of overlap. Therefore, the "half junction length" should be the read length - 8, and the total length of each sequence in the splice junction database would be <math>2*readlen-16</math>.

Generating the annotation files

To generate the annotation files you should first get the necessary tables from the UCSC mysql server. These can be found in $EP_HOME/annot/util/ There is some flexibility as to which table(s) you will use. You need at the very least a "transcript" table such as knownGene. For some genomes, the default UCSC genes are not so good. For example, for the rat genome it is much better to use ensGene because it is much more complete. If you use knownGene you will also need the knownIsoforms table that groups the knownGenes (transcripts) into gene clusters. If you use ensGene then you don't need that, since the name2 field of that table is the Ensembl gene number and serves the same purpose. Finally, another useful table for generating a very rich annotation is the acembly table. Here the transcript IDs begin with a gene name, and the annotation scripts know how to parse this to join the transcripts into gene clusters.

In addition, it is good to grab the kgXref and kgAlias tables to be able to have readable gene names. So I recommend getting the following tables, if they exist:

 acembly knows already the address of the UCSC MySQL server and will grab your tables from there. It downloads the tables to a temporary file, then changes the table names to have an ${org}_ prefix, then loads the data into your ExpressionPlot MySQL server. It figures out the hostname, database name, user name and password by querying your ExpressionPlot config file. You can override these things with switches but probably won't need to. Here's a bash quickie to get what you need to make an hg19 annotation:

 for tbl in knownGene kgXref kgAlias ensGene ensGtp acembly knownCanonical; do
   `expressionplot-config`/util/annot/ hg19 $tbl

I put in the knownCanonical table, too. There is no harm in having extra tables and I like to use this one to analyze peaks, like from ChIP-Seq data.

Gene cluster files

The first file you'll want to generate is the gene cluster file. The script to do this is $EP_HOME/util/annot/ The usage message for that script contains a lot of details on how to run it. A typical example would be to use the knownGene and knownIsoforms table to get transcripts and genes, then the kgAlias table to get aliases:

 ANNOT_DIR=`expressionplot-config ANNOT_DIR`
 $EP_HOME/util/annot/ hg19 hg19_knownGene "SELECT * FROM hg19_knownIsoforms" hg19_kgAlias > $ANNOT_DIR/hg19/hg19_gene_clusters.tsv

After you run, I recommend "trimming" the parts of genes that are overlapped on the opposite strand by other annotated genes. This is done with the following command:

 $EP_HOME/util/annot/ $ANNOT_DIR/hg19/hg19_gene_clusters.tsv > $ANNOT_DIR/hg19/hg19_trimmed_gene_clusters.tsv

Alternative exons file

The "alternative exons" events consider the relative skipping or inclusion of entire exons. To create these events, as well as the other events further down, an intermediate file format called GTE (gene-transcript-exon) is used. The GTE file for an annotation is generated by the script $EP_HOME/util/annot/ The call is identical to that for, for example as follows:

 $EP_HOME/util/annot/ hg19 hg19_knownGene "SELECT * FROM hg19_knownIsoforms" hg19_kgAlias > $ANNOT_DIR/hg19/hg19.gte

Run with no arguments for more details, including if you want to you use a different annotation such as Ensembl or acembly.

Once the .gte file is created it is then used as input to the other annotation scripts. To create the alternative exons file, simply provide the .gte file as the only argument to

 $EP_HOME/util/annot/ $ANNOT_DIR/hg19/hg19.gte > $ANNOT_DIR/hg19/hg19_skipped_exons.tsv

Retained introns file

The "retained intron" events consider the relative splicing out or retention of entire introns. These events are also made from the GTE file describe above. Let's assume you have a GTE file called $ANNOT_DIR/hg19/hg19.gte. To create the retained introns file, simply provide the .gte file as the only argument to

 $EP_HOME/util/annot/ $ANNOT_DIR/hg19/hg19.gte > $ANNOT_DIR/hg19/hg19_retained_introns.tsv

Alternative terminal exons file

The "alternative terminal exons" events consider the relative usage of transcription start sites or poly-adenylation/cleavage sites. Each event considers the alternatives of a particular site with more distal sites. These events are also made from the GTE file described above. Let's assume you have a GTE file called $ANNOT_DIR/hg19/hg19.gte. To create the alternative terminal exons file, simply provide the .gte file as the only argument to

 $EP_HOME/util/annot/ $ANNOT_DIR/hg19/hg19.gte > $ANNOT_DIR/hg19/hg19_alt_terminal_exons.tsv

Junctions Database

The junctions database is made using the $EP_HOME/util/ script. To use this script you will need the sequence of your target genome as a .2bit file. This script will output a FASTA file which is then converted into a bowtie index with bowtie-build. works by pairing up all 5' splice sites with all downstream 3' splice sites for each gene in your annotation. This allows limited "discovery" of unannotated splice junctions.

In some annotations the antibody loci has many different transcripts, corresponding to different DNA recombinations. It is recommended to leave these loci out of the junction database, because the all-versus-all combinatorial expansion means that they can account for up to half of the entire junction database. So you may want to know their gene or cluster IDs to supply to the script.

The other thing you'll need to know is your desired "half-junction-length". Usually it is a good idea to enforce a "splice junction overhang": only accept splice junction reads with at least a certain number of bases (typically around 7 or 8) aligning on both sides of the junctions. This is thought to reduce mis-alignments due to sequencing errors. The way ExpressionPlot enforces this is to use a junction database with smaller half junctions. The formula is as follows:

 half-junction-length = read-length - minimum-overhang. 

For example, if you have 72 base reads and want to enforce an 8 base overhang, use a 72-8=64 base half-junction length. This means that you have to create a new junction database when you have a dataset with a different read length.

Once you have your .2bit file, your omitted gene IDs, and your half junction length, you are ready to run the script. Here is an example:

 $EP_HOME/util/ hg19 expressionplot hg19_knownGene 64 -cl knownIsoforms -omit 7622 -omit 14273 -omit 16423 > hg19_junctions,hjl=64,omit=7622,14273,16423.fa

I like to use a long and informative file name since there is no other record of how the database was constructed.

The arguments are as follows:

hg19 This refers to the 2bit file, which would be located in $EP_HOME/genomes/hg19.2bit. Alternatively, you could supply the full path to a 2bit file in this position.
expressionplot This is the MySQL database that contains the UCSC tables.
hg19_knownGene This is the table in that database that contains the transcripts from which you want to extract splice junctions.
64 This is the half-junction length.
-cl knownIsoforms This indicates that you want to cluster the transcripts based on the knownIsoforms table field.
-omit 7622 -omit 14273 -omit 16423 These are three antibody loci to omit: 7622 is IgHG1, 14273 is IgK and 16423 is IgL.

Run with no arguments for more information.

Once you've created this file, which only takes a few minutes of running time, you can create the bowtie junction database with a call such as the following:

 $EP/bowtie/bowtie-build hg19_junctions,hjl=64,omit=7622,14273,16423.fa $EP/bowtie/indexes/hg19_junctions,hjl=64,omit=7622,14273,16423

Once this is done you can use the resulting bowtie index as a value for the junctions index switch (-j) in [[Preparing_Raw_Data_with_the_EP_Backend#RNA-Seq_.28from_FASTQ.29]].

Starting from a GFF file

If you have a GFF file you can use the script $EP_HOME/util/annot/ to bring the annotation into ExpressionPlot. That script will generate a gene clusters file, a GTE file and a UCSC-like MySQL table. From these you can then continue above at the Alternative exons file to generate the rest of the alternative events files and junction database. Generating the MySQL table will also allow you to visualize your splice reads in the SeqView Tool.

The script requires a few parameters. First, you have to decide what levels within your GFF file will correspond to genes, transcripts and exons. By default these will be gene, mRNA and exon. As of the writing of this manual coding sequence is not implemented (although the presence of CDS or any other level will be ignored). Second, you have to decide if you will try to extract gene IDs or aliases from the tags field. If possible, this is a good idea. You then simply provide the name of the respective tags to the script. Third, you have to decide the name of the MySQL table. I recommed it take the form ${genome}_${tableName}, so that ExpressionPlot can possibly find it when looking up gene IDs in the SeqView and gene level barplotter tools.

Finally, you decide the names of the gene clusters file and the GTE file, which is the temporary file format used internally by ExpressionPlot only for creating the alternative event files.

Here is an example invocation:

 $EP_HOME/util/annot/ ~/annotation.gff3 --id ID --alias Alias $EP_HOME/annot/newOrg/gene_clusters_unsorted_untrimmed.tsv --table-name newOrg_genes

Here gene IDs and aliases will be extracted from the ID and Alias tags.

Since the output is untrimmed you may have gene models overlapping on opposite strands. Therefore I recommend running, and using the output of that script for all your gene expression analyses.

Once you've done this, you can follow the steps above to make the alternative events files. If you want to make the junctions database, which is recommended so that you can have ExpressionPlot analyze splice junctions, then you'll need the genome sequence as a .2bit file. You can use the faToTwoBit utility for this.

This section has not been thoroughly tested, so if you get stuck then please feel free to contact the google group and we will respond as quickly as possible. The script will not be available until version 2.1, but you can get it from the repository before then as follows:

 cd `expressionplot-config`/util/annot
 svn export