Mapping between annotations

From ExpressionPlot
Jump to: navigation, search

"Annotations" that come with ExpressionPlot include genes and splicing events, but can be expanded to include any user-defined genomic regions, such as binding peaks from CLIP-Seq or ChIP-Seq. The term "annotation" is used in two sometimes confusing ways in this manual. It can mean a set of files that describe different types of events, including genes, cassette exons, and retained introns, and this is the sense in which it is used by when annotations are downloaded from the repository. The second sense, which is the sense meant in this file, is a set of genomic features of the same type, such as "all the genes in a genome" or "all the binding sites for a transcription factor" or "all the known skipped exon events". These would be represented by a single flat-file in which each feature is a row of the file and the columns describe chromosomal coordinates and ancillary data about the features. A collection of such files would make up an "annotation" of the first sense.

Given two different annotations, it is natural to want to map the underlying objects. For example, it is handy to be able to map the UCSC genes onto Ensembl genes. To give more complicated examples, it might be good to known the genes to which particular retained introns belong, or the exons closest to CLIP-Seq binding sites. Sometimes the underlying genomes of the annotations differ, but the objects themselves are analogous: it is natural to map mouse genes onto human genes, or mouse cassette exons onto human cassette exons. ExpressionPlot does not yet include utilities to generate these maps de novo, but pre-computed maps can be automatically downloaded from the repository and installed using get_map. Once maps are generated and registered, ExpressionPlot knows how to use them to create 4way plots, heatmaps and event_heatmaps. The annotation files are extracted from the first lines of the levels or contrast files (such as gene_levels.tsv or 2way_comps/$COMP.xls), maps are looked up in the map registry, and then the statistics are converted from one annotation to the other. This is one of ExpressionPlot's strongest features: automated mapping between annotations. This document describes how to format new map files and register them with the front end.

Map file format

Each map file describes a not-necessarily-one-to-one mapping between two annotation files. Examples of non-one-to-one mappings including multiple exons onto one gene (mostly many to one), genes between organisms (mostly but not entirely one-to-one) or two gene annotations of the same organism (again mostly but not entirely one-to-one).

Each annotation file must have an ID field called either ID or clusterID. If both exist, then the first one is taken to be the identifier for the features of the annotation. If neither exist, then IDs are assigned to the features in the order in which they appear in the file, beginning with 1 (since a header row is included in the annotation files this usually means that row 2 is assigned ID 1, row 3 assigned ID 2, and so forth). Given these identifiers, the map file format is an hTSV file with two required columns: startID and targetID. Each row is a mapping between a ID in the first annotation file and the second annotation file.

Here are a couple of examples of the first few rows of map files (taken from the repository).

hg18 UCSC onto Ensembl genes

The first example gives a mapping from hg18 UCSC gene clusters to Ensembl genes. This would be installed with get_map hg18. The map file appears in $EP/annot/maps/hg18/hg18_knownGene_clusters_onto_ensembl.tsv, where EP=`expressionplot-config`. Here are the first few lines:

startID targetID
1 ENSG00000223972
2 ENSG00000227232
3 ENSG00000243485
4 ENSG00000237613
6 ENSG00000228463
6 ENSG00000230021

These few lines alone illustrate two features of the map. First, it is not one-to-one. UCSC cluster 6 maps onto both ENSG00000228463 and ENSG00000230021. Second, not every feature from both annotations even appears in the map (this map file happens to be ordered by startID—not required, but it makes it easy to see that UCSC cluster 5 has no counterpart in the Ensembl annotation).

hg18 cassette exon events onto genes

This example gives a mapping from hg18 cassette exon events onto gene clusters. This is also part of get_map hg18. The map file appears in $EP/annot/maps/hg18/hg18_acembly_AE_events_onto_gene_clusters.tsv. Here are the first few lines:

startID targetID
1 4033
2 4033
3 4033
4 4033
5 4033
6 4033

The first few cassette exon events are all from the same gene cluster (4033, which happens to be the gene CREB3L1).

mm9 to hg18 genes

The last example gives a mapping from mm9 to hg18 genes (UCSC gene clusters). This is part of get_map mm9-hg18. The map file appears in $EP/annot/maps/mm9-hg18/mm9_to_hg18_gene_clusters_map.tsv. Here are the first few lines:

startID targetID start.len target.len align.len
1 23820 6091 3909 3894
3 23819 9189 9067 7221
4 23818 3399 2350 2282
5 23817 1568 1749 965
6 23816 4128 2574 2376

Here, for example, mm9 cluster 1 corresponds to hg18 cluster 23820. You can reveal these genes with the following commands (leading "$" represents the linux prompt):

 $ gawk '$1 == 1' $EP/annot/mm9/mm9_gene_clusters.tsv
 1	2	Xrg4,mKIAA1889,Xkr4,XKR4_MOUSE	chr1	-	4	3195985,3197398;3203520,3207049;3411783,3411982;3660633,3661579
 $ gawk '$1 == 23820' $EP/annot/hg18/hg18_gene_clusters.tsv
 23820	1	KIAA1889,XKR4,XKR4_HUMAN,XRG4	chr8	+	3	56177571,56178408;56432792,56432991;56598394,56601264

They are the mouse and human orthologs of the gene called "XKR4".

The columns start.len, target.len and align.len indicate the length of the gene (only counting exonic sequences) in the two annotations, and the length of the part that aligns between them. This illustrates how extra fields can be included in the file.

Registering maps

Once the map files are created it is necessary to register them with ExpressionPlot. This is done either by using the script $EP/util/ or by editing the file $ANNOT_DIR/maps.tsv (where ANNOT_DIR=`expressionplot-config ANNOT_DIR`). Using is recommended since it will prevent typos that might be introduced editing by hand.

To register the map file map.tsv as a map from annotation file start_annot.tsv to target_annot.tsv, simply run $EP/util/ start_annot.tsv target_annot.tsv map.tsv. will automatically expand your file names to full paths, so they will not be dependent on having the same current directory. Sometimes you will want to have an isomorphism between two annotation files. This can happen, for example, when they are basically two versions of the same file, perhaps one having a few extra fields. In that case it is not necessary to create map.tsv, the map file. You can provide the special string "___SAME___" (that's three underscores on both sides!) as the third argument and ExpressionPlot will know that it can assume the two files are isomorphic.

The map registry

The actual map registry is stored in $ANNOT_DIR/maps.tsv, an hTSV file with three fields: fn1, fn2 and map. fn1 and fn2 give the full paths to the start and target annotation files respectively, and map gives the full path to the map file. If you register maps using get_map and then you can avoid editing this file directly, which would be good since doing so is an easy way to introduce typos and break the system.

Aside from each row having three file names this file has two tricks. First is that the special string "___SAME___" (three underscores on both sides) in the map column indicates that the two annotation files are isomorphic. The second trick is that the string "$ANNOT_DIR" can appear in the file names and it will be replaced with the annotation directory from your config.ini. This can improve the portability of your map registry, and is the standard for all maps available on the repository. Both of these tricks are accessible through the script—isomorphisms can be specified with the string ___SAME___, and annotation or map filenames in any subdirectory of $ANNOT_DIR are by default converted to use the string $ANNOT_DIR.

As an example, here are a few lines of $ANNOT_DIR/maps.tsv on my machine:

fn1 fn2 map
$ANNOT_DIR/mm9/mm9_gene_clusters.tsv /home/baf/RNASeq/annot/hg18/hg18_trimmed_gene_clusters.tsv $ANNOT_DIR/maps/mm9-hg18/mm9_to_hg18_gene_clusters_map.tsv
$ANNOT_DIR/mm9/mm9_gene_clusters.tsv /home/baf/RNASeq/annot/hg18/hg18_gene_clusters.tsv $ANNOT_DIR/maps/mm9-hg18/mm9_to_hg18_gene_clusters_map.tsv
$ANNOT_DIR/mm9/mm9_gene_clusters.tsv $ANNOT_DIR/hg18/hg18_gene_clusters.tsv $ANNOT_DIR/maps/mm9-hg18/mm9_to_hg18_gene_clusters_map.tsv
$ANNOT_DIR/hg18/hg18_gene_clusters.tsv $ANNOT_DIR/hg18/hg18_trimmed_gene_clusters.tsv ___SAME___

You can see how the special string "$ANNOT_DIR" can be interchanged with full paths. You can also see an example of the special map "___SAME___" in the last row. Here it identifies the untrimmed and trimmed versions of the hg18 gene clusters.