# A Generic Interface for Genomic Data

## Purpose

Genomic data are data satisfying the following the following conditions

1. Genomic: Linked to features of a reference genome.
2. Genomic: Organized into large sets of data.
3. Linked to a particular experimental condition.

We call data satisfying only the first two conditions annotations.

Genomic data usually arise from genomic technologies. Here are a few examples:

1. Genome analysis: SNP/CNV arrays, exome-seq, DNA-Seq
2. Biochemical analysis: ChIP-chip/ChIP-Seq, CLIP-Seq
3. Gene Expression Data: 3'-UTR array, exon array, smRNA-Seq, mRNA-Seq

This page will be a space to flesh out ideas about creating a generic interface between different genomic back ends and visualization front ends.

The immediate purpose is to make it easier to use a different back end with ExpressionPlot, but we also hope that this interface will be general enough to be useful for other front ends.

Regardless of its sources, genomic data analysis often compromises the following steps:

1. Processing of raw data: This step involves taking raw data off a machine and processing it to most basic form. This is typically highly platform-specific and done using manufacturer-provided software. For example, this would include imagine processing/base calling on an Illumina machine, or image processing/spot quantification on a microarray. It also includes basic quality filtering of sequence data, and background subtraction/normalization of microarray data.
2. Feature extraction: This step further processes the data from each sample to summarize specific genomic regions. For RNA-Seq or microarray this might be gene or splicing levels, and for ChIP-Seq this could be peak calls and associated read numbers. At this point the raw data of the previous step (read sequences or spot intensities) is abstracted out.
3. Statistical comparisons: This step involves statistical analyses to sort out significant events from noise. It typically involves the comparison of different samples or groups of samples.
4. Biological insight: This final step is the most nebulous and represents the true work of the computational biologist. Of all the possible events and comparisons that could be made, one must decide which ones are relevant and what new hypotheses can be gleamed from the data.

The goal of this document is to specify abstractly the requirements for an interface between the third and fourth steps. It should hide all of the complexities of the first three steps (which might be run as a standard pipeline regardless of the biological system), but still be general enough to encode all the information that might be of interest to browse, visual and compare data from a biological point of view.

In the case of ExpressionPlot, the first three steps are part of the back end and the front end is a tool to help with the fourth step. The purpose of this Interface, therefore, is to make it possible to exchange information from different back ends (including different genomes or genomic platforms) with different tools for biological insight.

## Definitions

### Data

In our set theory framework, data of all types are represented by functions onto some range $R$. $R$ can be any number of sets but is usually a cartesian product of some mixture of reals, integers and/or finite sets, allowing our data to have multiple dimensions. The dimensions can also be arbitrary-length tuples or subsets of these basic sets. We can formalize this with $R = \bigotimes_{i=1}^{n(R)} R_i$ and each $R_i = \mathbb R, \mathbb Z$ or some finite set. Arbritrary-length tuples are formalized with the star operator, $R_i = S^* := \bigcup_{d=0}^\infty S^d$, and subsets of arbitrary (finite) size as power sets, $R_i = \mathcal P(S)$ (power set of S).

### Genomes

In this framework, genome is any set of biological sequences. It can be a typical reference genome, such as mm9 or hg18, but it can also have extra sequences like transfected plasmid sequences. It could also be a proteome, where the sequences are the different proteins. In this framework, all of these types of sequences are still called chromosomes.

In the set theory formalization we denote a genome $\mathcal G$ as a set of $n(\mathcal G)$ chromosomes: $G=\{C_i\}_{i=1}^{n(\mathcal G)}$. Letting $n(C_i)$ denote the length of chromosome $i$, the chromosomes are linear sequences $C = x_1 \cdots x_{n(C)}$ taken from some finite alphabet $\mathcal A$.

### Features

Formally, a feature of a reference genome is a sequence of not necessarily contiguous genomic intervals. The intervals can be just position or both position and strand.

Typically the raw data will be processed with respect to a set, or multiple sets, of genomic features. These features can either be pre-defined, for example the set of UCSC knownGenes, recursively defined based on the data itself, or a mixture of the two.

As with experiments, genomic data can be linked to a single feature (such as the level of one exon), or to a tuple of features, (such as the relative levels of a set of isoforms of a gene, or the fraction of multiple different genotypes at particular locus).

In the set theory formalization, a feature $f$ is a sequence of $n(f)$ genomic intervals $i_1,\ldots,i_{n(f)}$. An interval is a 4-tuple of the chromosome, the start coordinate, the end coordinate and the strand: $i = (chr, start, end, strand)$ where

• $chr \in \mathcal G$
• $1 \le start \le end+1 \le n(chr)+1$
• and $strand \in \{+,-\}$.

The requirements for the start and end coordinates are that we use 1-based indexing and specify intervals inclusively (fully closed) so that the start and end coordinates are included in the interval. For example, the first 10 bases of the genome are specified with $start=1$ and $end=10$. A 1-base interval can be specified with $start=end$. In general we require $start \le end$ except for the special case of the empty interval where $start=end-1$. Although such an interval would be empty it is considered to lie between the two bases specified by $start$ and $end$. The meaning of this is not specified here, but one example would be an exon of length 0, also known as a resplicing site[1]. The empty intervals include the special cases $(start,end)=(n+1,n)$ and $(1,0)$, which indicate the position after the last base and before the first.

### Experiments

An experiment is a set of samples, each of which is characterized by a set of common variables (sometimes called phenoData). These can be of any dimension and type, and indicates things like genotypes (like wildtype or mutant), experimental conditions (like treated or untreated), and co-observed (but dependent) variables (like age, sex or blood pressure).

Genomic data can be linked to a single experiment (like a gene level), to n-tuples of experiments (like a fold-change in a pair of experiments, or statistics for groups of experiments), or, in general, to any sets of experiments.

In the set theoretical formulation an experiment $E$ has $n(E)$ samples. It is a function $E : [n(E)] \rightarrow R$ where $R$ is some appropriate data space.

### Annotation

An annotation is a set of features usually associated with other non-experimental variables. For example, it could be a set of genes and their names, or exons and their splice sites.

In the set theory formulation, an annotation is an ordered triple $\mathcal A = (\mathcal G, A,a)$ where $A=\{f_i\}_{i=1}^{n(\mathcal A)}$ is a set of features on the underlying genome $\mathcal G$ and $a:A \rightarrow R$ is the function giving the values of the associated variables in the appropriate data space. It is possible that among other numerical and string data some of the other variables are themselves features. For example, for coding transcripts the basic exon-intron structures could be the features in $A$ and the coordinates of the coding sequences, which are also features, could be a finite-set dimensions of $R$.

### Genomic Data

A genomic data set $\mathcal D=(\mathcal G, \mathcal A, E, D)$ is an underlying genome $\mathcal G=\mathcal G(\mathcal D)$, an annotation on that genome $\mathcal A=\mathcal A(D)$, an experiment $E=E(D)$, and a dependent variable function $D=D(\mathcal D)$. The domain of $D$ can formally be any family of (possibly ordered) groups of experiment samples or annotation features ($D : E^* \times A^* \to R$ where $S^*$ denotes the family of all finite tuples (of any length) taken from the set $S$ and $R$ denotes the range of the function). The function $D$ is not necessarily defined over the entire domain. For example, it might only be defined for tuples of a fixed length.

### Project

A genomic project is a family $\mathcal P=\{\mathcal D_i\}_{i=1}^{n(\mathcal P)}$ of genomic data sets with a common underlying genome and experiment: $\mathcal D_i=(\mathcal G, E, \mathcal A_i, D_i)$.

### Map

Different annotations can be linked together by a map, which can be thought of as a bipartite graph whose two parts are annotations. Maps can be between different genomes, for example between mouse and human genes, between different types of annotations, such as between genes and alternative splicing events of a single, or both. In general the maps are many-to-many so, for example, one human gene could be mapped onto two different mouse genes or vice versa. The nature of the map is not specified here. For example, maps could be used to represent overlapping features (like exons contained in genes), or just nearby features (like ChIP-Seq binding peaks located in promoter regions). Automaps (from one annotation onto itself) are also permitted, and these could be used, for example, to represent interactomes. In addition to the information about which features are linked, each such link can have auxiliary variables. For example, for gene maps between organisms these variables might give the number of bases from each genome that are aligned, or a perhaps string encoding the alignment itself. For an automap representing an interactome, the auxiliary variables might represent the strength and evidence for the interaction.

Should there be a type of genomic data that has links in a map as a dimension of its domain?

In the set theory formulation, a map is an ordered 4-tuple $\mathcal M=(\mathcal A_1, \mathcal A_2, L, m)$ where $\mathcal A_i$ are annotations, $L \subseteq A_1 \times A_2$ is the set of links between the feature sets of the annotations, and $m:L \rightarrow R$ is a function that gives the associated data for each link in the appropriate data space $R$.

Some maps are bijections (1-1 correspondences), so that they identify the features of $\mathcal A_1$ with those of $\mathcal A_2$. An example of this would be two different gene aggregations based on the UCSC knownIsoforms table. One might be the union of all the associated knownGene transcripts, and the other could be the intersection of those transcripts ("constitutive exons"). Although the annotations are different, there is a natural 1-1 correspondence between their gene models.

Maps, especially bijections, should be transitive, so that if $\mathcal M_1=(\mathcal A_1, \mathcal A_2, L_1, m_1)$ is a map from annotation $\mathcal A_1$ to $\mathcal A_2$ and $\mathcal M_2=(\mathcal A_2, \mathcal A_3, L_2, m_2)$ is a map from $\mathcal A_2$ to $\mathcal A_3$ then the compostition of the maps $\mathcal M_1 \circ \mathcal M_2$ is defined as $(\mathcal A_1, \mathcal A_3, L_3, m_1 \circ m_2)$ where $L_3 = \{(i,k) : \exists j \in A_2 : (i,j) \in L_1 \mbox { and } (j,k) \in L_2\}$.

Given a family of annotations $\mathcal A = \{\mathcal A_i\}$ and a family of maps $\mathcal M = \{\mathcal M_i\}$ defined on those annotations, we define the map graph whose nodes are the annotations and edges are the maps. An important algorithmic operation will be to find a map, if necessary by composing other maps, from one annotation to another. Weights of 0 could be placed on bijective edges and of 1 on non-bijective edges, and basic path-finding algorithms could be run efficiently, possibly with a maximum path-length restriction.

## Data Representation

With our set theoretic definition of a genomic project in hand we now propose a concrete interface to implement this definition. The basic building block of the interface will be a relational database. The engine is not specified here, only the table structures. We define different types of tables to represent the set theoretic objects described above. The actual names of the tables are specified in a separate configuration file. Ideally we will then create an API, for example using Perl's DBI scheme, that will allow the user to provide the project using any desired relational database engine. Although we think of interface as being composed of database tables, it is not strictly require that database software be running: it is possible a user of this system would simply generate flat files and then use the DBD::CSV interface to import them.

### Star tables

Before we define the tables that make up the Interface, we begin with a definition of star tables. These tables solve the problem of encoding $n$-tuples of varying length into a relational database.

Although $n$-tuples of fixed length are easy to encode (just use $n$ columns), when the length is varying this is not as straightforward. A typical example would be specifying the exons that make up a transcript. In short there are two solutions.

The flat format solution is to use strings to join together IDs representing the exons. Then a variable-length $n$-tuple can be represented in a single column. The star format solution is to create a table just to specify the tuples. Each row of this star table corresponds to a single index of a single tuple. It therefore would have an identifier column for the tuple, an index column, and one or more columns specifying the elements of tuple. The relative merits of the two methods are discussed elsewhere, but this interface will generally allow either specification.

### Genomes

Genomes will be specified in .2bit format. However, for many analyses the actual genomic sequence is not necessary.

### Features

Genomic features are specified in a table whose rows correspond to contiguous intervals. They are joined together into features by a combination of the feature_ID and index fields. This is a type of star table, with the features being the "tuples" and the intervals being their "elements".

interval table
column description
chr Chromosome of the interval. If there is an underlying .2bit file this should match. Otherwise any name can be used.
start Start position of the interval on that chromosome. start must always be less than or equal to end-1. We use 1-based indexing and fully closed intervals, so that

both the start and end positions are considered part of the interval. For example, to specify an interval of length 1 at position 2000 use start=2000 and end=2000. To specify the empty interval between 2000 and 2001 use start=2001 and end=2000.

end End position of the interval on that chromosome.
strand "+", "-" or the empty string "" for an unstranded interval
feature_ID This is an identifier for the feature of which this interval is a part.
idx This is the 1-based index of the interval within the feature. For example, for a transcript with 5 exons, the intervals (which would correspond to the exons) would have indices 1 through 5.

In MySQL you might define an interval table as follows ("interval" is a reserved word in MySQL). The table can have any name, such as "gene_interval" or "Stat1_peak" (you might have idx=1 throughout the table). However, the names of the columns are required.

 CREATE TABLE f_interval (
chrom VARCHAR(255) NOT NULL,
start INT(10) UNSIGNED NOT NULL,
end INT(10) UNSIGNED NOT NULL,
strand ENUM("+","-",""),
feature_id VARCHAR(255) NOT NULL,
idx INT(10) UNSIGNED NOT NULL,
INDEX cse (chrom, start, end),
INDEX feature_id (feature_id)
);


It is important to index the feature_ID field since that makes looking up the intervals in a given feature fast.

### Experiments

Experiments are specified in a table whose rows correspond to samples and columns correspond to different dimensions of the phenoData. The details of the data are specific to the scientific project but at least one column offering a unique ID must be included. This column should be called simply id.

With regards to indexing, the id column should definitely be indexed, as well as other columns on which one might want to search. The indexing might seem superfluous since many genomic projects involve only a few samples (sometimes only 2!). However there are many projects now that are becoming quite large and even some GWAS studies with several thousand patients. Therefore in the interest of scalability it is recommended to include the index as a matter of habit.

experiment table
column description
id A unique identifier for the sample.
... The rest of the columns are specifically defined for the experiment in question.

In MySQL you might define an experiment table as follows. This example shows a few made-up fields, but only the id field is absolutely essential.

 CREATE TABLE experiment (
id VARCHAR(255) PRIMARY KEY,
age FLOAT INDEX,
sex CHAR(1),
treatment VARCHAR(255),
INDEX age (age)
);


### Annotations

An annotation is a set of features along with the associated variables. Since the features are independently specified in an intervals table the annotation is specified in a table whose rows correspond to samples and columns correspond to different dimensions of the associated data. Similarly to the experiments, the details of the data are specific to the annotation but must include an id column with values corresponding to those of the feature_id column from the intervals table. Unlike the intervals table, in the annotation table the id column must be a unique key.

With regards to indexing, the id column should definitely be indexed, but that might be sufficient, since finding features by genomic location will already be efficient based on the indexing in the intervals table. The feature_id column of the intervals table corresponds to the id column of the annotation table. However, the relationship between the IDs is not necessarily a strict reference, since there could be features in the interval table that are not in the annotation table.

annotation table
column description
id A unique identifier for the underlying feature. These identifiers should correspond to identifiers from the intervals table.
... The rest of the columns are specifically defined for the experiment in question.

In MySQL you might define an annotation table as follows. This example shows a few made-up fields that might be associated with an annotation of genes. In general only the id field is absolutely essential.

 CREATE TABLE gene (
id VARCHAR(255) PRIMARY KEY,
geneSymbol VARCHAR(255),
len UNSIGNED INT,
percent_gc FLOAT
);


In terms of a data structure, an annotation would be fully specified by the pair of the intervals table and annotation table, and possibly also some sort of handle into the genome (like a path to the .2bit file).

### Genomic Data

A table represent a genomic data $\mathcal D=(\mathcal G, \mathcal A, E, D)$ has three groups of columns. The first group are columns that specify the annotation features and the second group specify the experimental features. All of these columns together form a unique key for the table. Finally, the third group specify the data.

As a data structure, a genomic data set consists of the annotation, the experiments, and a description of the first two groups of columns. This column description is given in a database table with the following fields:

annotation and experiment column description table
column description
name a string giving the name of the column
col_type (column type) whether it is an "annotation" column or an "experiment" column
dim (dimensionality) whether it is a "scalar" column, which specifies a single annotation or experiment, a "tuple" column, which specifies an arbitrary-length tuple of annotations or experiments, or a "set" column, which specifies an arbitrary sized set of annotations or experiments
format if it is a "tuple" or "set" column, whether it uses the "flat" or "star" encoding format (NULL if it is a scalar column)
sep if it uses the flat format, its separating character (NULL if it is not a flat format column)
star if it uses the star format, the name of its associated star table. In that table the "elements" would be identifiers into either experiment or annotation table. (NULL if it is not a star format column)

This would be a typical data definition for such a table:

 CREATE TABLE annot_exp_cols (name VARCHAR(255) PRIMARY KEY,
col_type ENUM("annotation","experiment") NOT NULL,
dim ENUM("scalar","tuple","set") NOT NULL,
format ENUM("flat", "star"),
sep CHAR(1),
star VARCHAR(255)
);


Note that the data columns are not described in this table. All the columns of the data table that are not annotation or experiment columns are assumed to be data columns.

#### Example: Gene levels

Here is an example data definition for a data table that can hold keep information about gene levels in an RNA-Seq data set.

 CREATE TABLE gene_levels (gene VARCHAR(255),
sample VARCHAR(255),
rpkm FLOAT
);


The associated column description table would be filled like this

 +--------+------------+--------+--------+------+------+
| name   | col_type   | dim    | format | sep  | star |
+--------+------------+--------+--------+------+------+
| gene   | annotation | scalar | NULL   | NULL | NULL |
| sample | experiment | scalar | NULL   | NULL | NULL |
+--------+------------+--------+--------+------+------+


The VARCHAR(255) type of gene and sample would be the same as the id columns from the annotation and experiments tables (but could in principle be something else).

#### Example: Gene Level Changes

To give a more complicated example, consider a table that represents the comparison of gene levels in two groups of samples. The groups will be stored in a star table, rather than as comma-separated values. The data definition could be like this:

 CREATE TABLE gene_level_changes (gene VARCHAR(255),
group1 INT UNSIGNED,
group2 INT UNSIGNED,
lfc float,
p float
);


Here group1 and group2 are integers, but any data type in principle could be used for the identifiers of the star table. lfc and p represent the log-fold-change and P-value of the changes. The star table would have the following data definition:

 CREATE TABLE experiment_groups (group_id INT UNSIGNED,
sample VARCHAR(255)
);


Since the groups are actually (unordered) sets there is no need for an idx column on this star table. Also note that the word group is reserved in MySQL, so the column name has to be something different. The values in the group_id column would then correspond to the values in the group1 and group2 columns of the gene_level_changes table. The values in the sample column would correspond to values in the experiment table.

The column description table for gene_level_changes would look like this:

 +--------+------------+--------+--------+------+-------------------+
| name   | col_type   | dim    | format | sep  | star              |
+--------+------------+--------+--------+------+-------------------+
| gene   | annotation | scalar | NULL   | NULL | NULL              |
| group1 | experiment | set    | star   | NULL | experiment_groups |
| group2 | experiment | set    | star   | NULL | experiment_groups |
+--------+------------+--------+--------+------+-------------------+


#### Data structure

The entire genomic data set data structure is specified by the following:

• the genome
• the gene annotation (which in turn specifies the names of the intervals and annotation table)
• the experiments table
• the column description table (which might be named something like gene_levels_annot_exp_cols), and
• the data table itself (called gene_levels and gene_level_changes in these two examples).

## Notes/Plans/Errata

• Add set variation to star table section (i.e. without index).
• Add primary keys to all the tables---e.g. feature_id + idx for intervals table
• It should be fairly straightforward to include star sets and power sets as part of ranges.
• Does a general system for representation of set theoretic structures in relational databases already exist?
• The genome should be folded into the annotation data structure.
• Power set is not correct since that would include infinite subsets. The notation $\mathcal F(S)$ has been used to indicate the family of finite subsets of a base set $S$, sometimes called the finite power set of $S$
• A Catchy name. How about "GINGER: a Generic INterface for GEnomic Relations"?