Mapping between annotations
"Annotations" that come with ExpressionPlot include genes and splicing events, but can be expanded to include any user-defined genomic regions, such as binding peaks from CLIP-Seq or ChIP-Seq. The term "annotation" is used in two sometimes confusing ways in this manual. It can mean a set of files that describe different types of events, including genes, cassette exons, and retained introns, and this is the sense in which it is used by
EP-manage.pl when annotations are downloaded from the repository. The second sense, which is the sense meant in this file, is a set of genomic features of the same type, such as "all the genes in a genome" or "all the binding sites for a transcription factor" or "all the known skipped exon events". These would be represented by a single flat-file in which each feature is a row of the file and the columns describe chromosomal coordinates and ancillary data about the features. A collection of such files would make up an "annotation" of the first sense.
Given two different annotations, it is natural to want to map the underlying objects. For example, it is handy to be able to map the UCSC genes onto Ensembl genes. To give more complicated examples, it might be good to known the genes to which particular retained introns belong, or the exons closest to CLIP-Seq binding sites. Sometimes the underlying genomes of the annotations differ, but the objects themselves are analogous: it is natural to map mouse genes onto human genes, or mouse cassette exons onto human cassette exons. ExpressionPlot does not yet include utilities to generate these maps de novo, but pre-computed maps can be automatically downloaded from the repository and installed using
EP-manage.pl get_map. Once maps are generated and registered, ExpressionPlot knows how to use them to create 4way plots, heatmaps and event_heatmaps. The annotation files are extracted from the first lines of the levels or contrast files (such as
2way_comps/$COMP.xls), maps are looked up in the map registry, and then the statistics are converted from one annotation to the other. This is one of ExpressionPlot's strongest features: automated mapping between annotations. This document describes how to format new map files and register them with the front end.
Map file format
Each map file describes a not-necessarily-one-to-one mapping between two annotation files. Examples of non-one-to-one mappings including multiple exons onto one gene (mostly many to one), genes between organisms (mostly but not entirely one-to-one) or two gene annotations of the same organism (again mostly but not entirely one-to-one).
Each annotation file must have an ID field called either
clusterID. If both exist, then the first one is taken to be the identifier for the features of the annotation. If neither exist, then IDs are assigned to the features in the order in which they appear in the file, beginning with 1 (since a header row is included in the annotation files this usually means that row 2 is assigned ID 1, row 3 assigned ID 2, and so forth). Given these identifiers, the map file format is an hTSV file with two required columns:
targetID. Each row is a mapping between a ID in the first annotation file and the second annotation file.
Here are a couple of examples of the first few rows of map files (taken from the repository).
hg18 UCSC onto Ensembl genes
The first example gives a mapping from
hg18 UCSC gene clusters to Ensembl genes. This would be installed with
EP-manage.pl get_map hg18. The map file appears in
EP=`expressionplot-config`. Here are the first few lines:
These few lines alone illustrate two features of the map. First, it is not one-to-one. UCSC cluster 6 maps onto both
ENSG00000230021. Second, not every feature from both annotations even appears in the map (this map file happens to be ordered by
startID—not required, but it makes it easy to see that UCSC cluster 5 has no counterpart in the Ensembl annotation).
hg18 cassette exon events onto genes
This example gives a mapping from
hg18 cassette exon events onto gene clusters. This is also part of
EP-manage.pl get_map hg18. The map file appears in
$EP/annot/maps/hg18/hg18_acembly_AE_events_onto_gene_clusters.tsv. Here are the first few lines:
The first few cassette exon events are all from the same gene cluster (4033, which happens to be the gene CREB3L1).
mm9 to hg18 genes
The last example gives a mapping from
hg18 genes (UCSC gene clusters). This is part of
EP-manage.pl get_map mm9-hg18. The map file appears in
$EP/annot/maps/mm9-hg18/mm9_to_hg18_gene_clusters_map.tsv. Here are the first few lines:
Here, for example,
mm9 cluster 1 corresponds to
hg18 cluster 23820. You can reveal these genes with the following commands (leading "$" represents the linux prompt):
$ gawk '$1 == 1' $EP/annot/mm9/mm9_gene_clusters.tsv 1 2 Xrg4,mKIAA1889,Xkr4,XKR4_MOUSE chr1 - 4 3195985,3197398;3203520,3207049;3411783,3411982;3660633,3661579 $ gawk '$1 == 23820' $EP/annot/hg18/hg18_gene_clusters.tsv 23820 1 KIAA1889,XKR4,XKR4_HUMAN,XRG4 chr8 + 3 56177571,56178408;56432792,56432991;56598394,56601264
They are the mouse and human orthologs of the gene called "XKR4".
align.len indicate the length of the gene (only counting exonic sequences) in the two annotations, and the length of the part that aligns between them. This illustrates how extra fields can be included in the file.
Once the map files are created it is necessary to register them with ExpressionPlot. This is done either by using the script
$EP/util/register_maps.pl or by editing the file
ANNOT_DIR=`expressionplot-config ANNOT_DIR`). Using
register_maps.pl is recommended since it will prevent typos that might be introduced editing by hand.
To register the map file
map.tsv as a map from annotation file
target_annot.tsv, simply run
$EP/util/register_maps.pl start_annot.tsv target_annot.tsv map.tsv.
register_maps.pl will automatically expand your file names to full paths, so they will not be dependent on having the same current directory. Sometimes you will want to have an isomorphism between two annotation files. This can happen, for example, when they are basically two versions of the same file, perhaps one having a few extra fields. In that case it is not necessary to create
map.tsv, the map file. You can provide the special string "___SAME___" (that's three underscores on both sides!) as the third argument and ExpressionPlot will know that it can assume the two files are isomorphic.
The map registry
The actual map registry is stored in
$ANNOT_DIR/maps.tsv, an hTSV file with three fields:
fn2 give the full paths to the start and target annotation files respectively, and
map gives the full path to the map file. If you register maps using
EP_manage.pl get_map and
register_maps.pl then you can avoid editing this file directly, which would be good since doing so is an easy way to introduce typos and break the system.
Aside from each row having three file names this file has two tricks. First is that the special string "___SAME___" (three underscores on both sides) in the
map column indicates that the two annotation files are isomorphic. The second trick is that the string "$ANNOT_DIR" can appear in the file names and it will be replaced with the annotation directory from your config.ini. This can improve the portability of your map registry, and is the standard for all maps available on the repository. Both of these tricks are accessible through the
register_maps.pl script—isomorphisms can be specified with the string
___SAME___, and annotation or map filenames in any subdirectory of
$ANNOT_DIR are by default converted to use the string
As an example, here are a few lines of
$ANNOT_DIR/maps.tsv on my machine:
You can see how the special string "$ANNOT_DIR" can be interchanged with full paths. You can also see an example of the special map "___SAME___" in the last row. Here it identifies the untrimmed and trimmed versions of the
hg18 gene clusters.