Project directory structure

From ExpressionPlot
(Redirected from HTSV)
Jump to: navigation, search

Each ExpressionPlot project is kept in a specific directory structure. I organize my filesystem so all the projects are in the projects/ subdirectory of the ExpressionPlot root directory, but in theory you could keep them anywhere, since ExpressionPlot finds your project by looking in a datasets.txt file with all registered projects. This is a simple two-column TSV file. The first column is the short name for the project and the second is the full path to the project directory. The location of datatsets.txt can be found by running expressionplot-config DATASETS_FN. The default location is in the root directory of the ExpressionPlot installation.

Meta-files

A complete project would contain meta-files in the root project directory (which we'll call $ROOT in this file). These are described here. You can also create a file called platform.txt which has just one line of text, one of RNASeq, microarray, HuExon or MoExon (this is normally created automatically by the backend).

A $ROOT/config.ini file can also be created. This is in standard .ini format. This file is needed to store specific variables that will be described below as needed.

File format

Most of the files within the ExpressionPlot directory structure begin with a possible annotation line and continue as a headered tab-separated values (hTSV) file.

Annotation line

The annotation line has the form "# $ANNOT_FILE", where $ANNOT_FILE is the full path to a file containing the annotation of exons/genes/events that are described by the subsequents rows of the file. For example, your annotation line could look like this:

# /usr/local/bin/expressionplot/annot/mm9/mm9_trimmed_gene_clusters.tsv

ExpressionPlot does two special substitutions on these annotation lines. If the string "%A" appears then it is substituted with the ANNOT_DIR parameter that would be returned by expressionplot-config ANNOT_DIR. This makes the projects more portable if you move them to a machine with a different ExpressionPlot directory. The above line would look like this:

# %A/mm9/mm9_trimmed_gene_clusters.tsv

The second substitution is "%C", which is replaced with the "current directory", which means the same directory where the hTSV is located. This is handy if you are using annotations specific to a particular project (such as ChIP-Seq binding sites). Using a relative path also helps to make the projects more portable. As an example, imagine in your project you have two subdirectories of the base directory called "comps" and "annot" containing, respectively, comparison files and annotation files. Then your annotation line for a file in the "comps" directory could look like this, without the "%C" substititon,

# /usr/local/bin/expressionplot/projects/drugstudy/annot/chipseq_peaks.tsv

or, using the "%C" substitution (which is recommended), like this:

# %C/../annot/chipseq_peaks.tsv

The "../annot" takes you up a directory out of "comps" into the base directory, then back down into the annotation directory.

headered TSV format

The hTSV (headered TSV) format used by ExpressionPlot is bare bones---tab-separated values with a first line of column headers. Escaping and quoting are generally not supported, so it is possible to freely include quotes and spaces in the field values (but not possible to escape in tabs or newlines).

These annotation lines are used for three things:

  1. Table browser joins the annotations onto the statistics, which are not generally stored in the hard drive along with the statistical values.
  2. map registry joins tables from different annotations based on their annotation lines. This is used, for example, by the 4way tool to join gene statistics from different organisms.
  3. The script util/annotate.pl is a command line tool that will automatically find the annotation line, join the annotation to your data file, and output to STDOUT.

The data files allow for two types of joins with the annotation files. If the number of rows is exactly the number of rows in the annotation file then it is assumed that the rows are in the same order. If the data file has fewer rows than the annotation file, then the first column of the data file should have the name ID and is taken to be indexes into the annotation file. If the column is integral (matches /^\d+$/) then it is taken to be 1-based row indexed (1 corresponds to first non-header row of the annotation file, so actually line 2 of the file). Otherwise it is considered to be a string, and the first column of the annotation file is matched using string equality.

Gene levels and read numbers files

ExpressionPlot looks for files in two different places describing gene/transcript levels. Two files are found in $ROOT/analysis/gene_levels/. The first, gene_levels.tsv gives the raw read counts and RPKMs of every gene in your annotation. It is an hTSV file with annotation line and the following fields:

cluster Identifier for the gene or transcript. It is called "cluster" because it originally referred to a cluster of transcripts (from UCSC knownIsoforms table). However, you can use any here, including

UCSC knownGene IDs and Ensembl ENSG or ENST IDs.

aliases comma-separated aliases for the gene or transcript. Usually this would have symbols associated with the gene, like "Brca2" or "Stat1".
chr The chromosome on which the gene is located. Usually this begins with "chr".
start The start position of the gene on that chromosome (1-indexed). Start is always less than end, even for minus-strand genes.
end The end position of the gene on that chromosome (1-indexed). End is always greater than start, even for minus-strand genes.
str The strand of the gene ("+" or "-").
len The length of the gene in nucleotides. This is used for calculating RPKM values, so should actually be the length of the gene model you used to aggregate reads.
SAMP (multiple fields) Following len is a column for each of the samples, with the same names as the samples. These should correspond to group names in groups.txt. These fields contain the raw numbers of reads in each sample for each gene. For example, in the drug study example, you would have the following four fields: WT_Ctl WT_Drug MT_Ctl MT_Drug.
rpkm.SAMP (multiple fields) Following the SAMP fields are another set of columns, one for each of the samples. The sample names here have "rpkm." prepended. These fields contain the calculated RPKM values for each sample for each gene. For example, in the drug study example, you would have the following four fields: rpkm.WT_Ctl rpkm.WT_Drug rpkm.MT_Ctl rpkm.MT_Drug.

The other file is $ROOT/analysis/gene_levels/read_types.tsv. It is also a headered TSV file, and it has the following fields:

type A type of read. Usually there are only 5-10 different types in one of these files. The important types to include are "match" (meaning a genomic alignment) and "junction" (meaning a genomic alignment). If you want the read_types tool to work then you should include the other relevant types: "exon", "intron" and "intergenic".

For paired-end data sets you can also have "uniq.paired" and "uniq.single". Other read types include "Mult" for multiply matching and "NM" for non-matching.

The default calculation for total number of matching reads is uniq.paired + uniq.single for paired-end data sets and match + junction for single-end data sets. If your backend doesn't fit well into these classes, you can override the total number of matching reads by providing a $ROOT/2way_comps/nmr.txt file.

SAMP (multiple fields) Following the type is a column for each of the samples, with the number of reads of that type in each sample.

Pre-computed RPKM values

By default ExpressionPlot's genelev tool will compute RPKMs from the raw read numbers, even if the rpkm.SAMP fields are present. You can tell it to display RPKM values that are already in your gene_levels.tsv file by changing the following values in the [genelev] section of your $ROOT/config.ini file:

precomputed.rpkm Set to 1 to indicate that precomputed RPKMs should be used. Omit or leave blank for default behavior.
rpkm.prefix, rpkm.suffix If precomputed.rpkm is set, then the gene level for the sample called SAMP is taken from the field of the same name in gene_levels.tsv. If you want to have a prefix or suffix added to that name, then set one or both of these variables.

Here is an example $ROOT/config.ini file telling ExpressionPlot to interpret the fields called SAMP as pre-computed RPKMs:

 [genelev]
 precomputed.rpkm=1

Note that neither rpkm.prefix or rpkm.suffix is specified.

Here is another example $ROOT/config.ini file telling ExpressionPlot to interpret the fields called rpkm.SAMP as pre-computed RPKMs (and plot them in the genelev tool):

 [genelev]
 precomputed.rpkm=1
 rpkm.prefix=rpkm.

And here is one to use the fields called SAMP.rpkm instead of rpkm.SAMP:

 [genelev]
 precomputed.rpkm=1
 rpkm.suffix=.rpkm

You can also specify your own error bar lengths (beginning in version 1.8). To do this add the field precomputed.err as well as err.prefix or err.suffix to the same section of your $ROOT/config.ini. Here's an example with user defined RPKM (fields called rpkm.SAMP) and error (fields called err.SAMP):

 [genelev]
 precomputed.rpkm=1
 rpkm.prefix=rpkm.
 precomputed.err=1
 err.prefix=err.

Gene/transcript level changes

Statistics about gene level changes are found in the $ROOT/2way_comps directory. One file for each comparison (from the first row of comps.txt) should be present with the name $ROOT/2way_comps/$COMP.xls where $COMP is the comparison name. Each of these files contains an annotation line followed by an hTSV file. The rows of the hTSV file correspond to the rows of that annotation file, and the columns are as follows:

ID Identifier for the gene. This field is optional, but if omitted it is assumed that the identifiers are the row numbers themselves, and that they correspond exactly to the rows of the annotation file.
n1, n2 The raw number of reads mapping to the gene in the first and second groups respectively.
rpkm1, rpkm2 The RPKM values of the gene in the two groups.
lfc The log2(fold change) of the gene. The sign goes with the second group: Positive values indicate an increase in the second group relative to the first, negative indicate a decrease.
p The significance (P-value) of the change
pwr.dn, pwr.up The statistic power to detect a change in the gene given the total number of reads seen in this gene in either group. The effect size and P-value cutoff

for the power are not specified in this file. These fields are not really used by the front end and can be omitted in most applications.

If the platform of your project (as specified in platform.txt is "microarray", "MoExon", or "HuExon" then the follow fields are expected:

lfc The log2(fold change) of the gene. The sign goes with the second group: Positive values indicate an increase in the second group relative to the first, negative indicate a decrease.
p The significance (P-value) of the change
lev The average gene level (in log2 units) across the two groups.

Skipped Exon Levels Files

A file giving the counts of each read type related to every skipped exon event is found in $ROOT/analysis/alt_exons/AE_counts.xls. This file is an intermediate file for the RNA-Seq pipeline and not used by the front end.

Skipped Exon changes

For each comparison name $COMP a file to describe the changes in cassette exon splicing is found at $ROOT/analysis/alt_exons/$COMP.AEstats.incl_and_skip.xls. It begins with an annotation line to the alt exons events file which is followed by a complete hTSV file. Here only a subset of the rows of the annotation file are described, since most exons don't even have support for both inclusion and skipping isoform. (The default pipeline only includes those with at least one inclusion and one skipping read.). The fields are as follows:

ID The identifier of the event. It corresponds to either a line number (with the header being line 0) of the annotation file.
eup1, eup2 The number of reads supporting the upstream exons in the first and second groups respectively. Upstream means lower genomic coordinates, even for a minus strand gene.
jsk1, jsk2 The number of reads supporting the skipping junction in the first and second groups respectively. This is any read overlapping a splice junction that skips the exon and is anchored on at least one side in a known splice site of that gene (this is an important restriction so that intronic-encoded genes don't count the splice junctions of their host introns, like ChAT/VAChT).
eal1, eal2 The number of reads supporting the candidate alternative exon in the first and second groups respectively
jdn1, jdn2 The number of reads supporting the downstream junction in the first and second groups respectively. This is any read overlapping a splice junction from the end of the exon going downstream. Downstream means higher genomic coordinates, even for a minus strand gene.
edn1, edn2 The number of reads supporting the downstream exons in the first and second groups respectively
pup1, pup2 If present, the number of paired-end reads with one mate's alignment terminating in the upstream exons and the other in the candidate alternative exon. By "terminating" we mean the aligned position of the last (3'-most) base of the read.
psk1, psk2 If present, the number of paired-end reads with one mate's alignment terminating in the upstream exons and the other in the downstream exons. If the exon is longer than the inner-mate-distance then this supports the skipping isoform.
pdn1, pdn2 If present, the number of paired-end reads with one mate's alignment terminating in the candidate alternative exon and the other in the upstream exons. By "terminating" we mean the aligned position of the last (3'-most) base of the read.
Pso Skipping-only P-value. This is based on the ratio of inclusion reads to skipping reads only (ignoring reads aligning to flanking exons and, even if available, paired-end information).
LISR1, LISR2 Log2(Inclusion:Skip Ratio) in the two groups, respectively. In formula <math>LISR1=\log_2((jup1+eal1+jdn1)/jsk1)</math> and analogously for LISR2.
Psg "General" P-value. This is based on the ratio of inclusion reads to the sum of skipping and flanking reads. (ignoring paired-end information, even if available)
LISGR1, LISGR2 Log2(Inclusion:Skip+General Ratio) in the two groups, respectively. In formula <math>LISGR1=\log_2((jup1+eal1+jdn1)/(eup1+jsk1+edn1))</math> and analogously for LISGR2.
Psp Paired-end P-value. This is based on the ratio of inclusion reads to skipping reads, including paired-end skipping reads for those exons in the 90th percentile or above of the inner-mate-distance distribution. For shorter exons this is identical to Pso. (This statistic also ignores flanking exon reads).
LISPR1, LISPR2 Log2(Inclusion:Skip+Paired Ratio) in the two groups, respectively. In formula <math>LISPR1=\log_2((jup1+eal1+jdn1)/(jsk1+psk1))</math> and analogously for LISPR2.
incl Total number of inclusion reads from either group: jup1 + eal1 + jdn1 + jup2 + eal2 + jdn2
skip Total number of junction-skipping reads from either group: jsk1 + jsk2
Psp Total number of paired-end mates flanking the candidate exon: psp1 + psp2

For the array pipeline this file has completely different fields.

Retained Intron changes

For each comparison name $COMP a file to describe the changes in intron retention is found at $ROOT/analysis/retained_introns/$COMP.RIstats.incl_and_splice.xls. It begins with an annotation line to the retained introns events file which is followed by a complete hTSV file. Here only a subset of the rows of the annotation file are described, since most introns don't even have support for both inclusion and splicing isoform. (The default pipeline only includes those with at least one inclusion and one splicing read.). The fields are as follows:

ID The identifier of the event. It corresponds to either a line number (with the header being line 0) of the annotation file.
gin1, gin2 The number of reads with genomic alignments to the intron in the two groups. The "intron" only includes bases that are not part of exons of other transcripts. This avoids, for example, counting reads from a cassette exon within the intron.
jsp1, jsp2 The number of reads with alignments to the junction corresponding to the splicing out of the intron.
jin1, jin2 The number of junction reads from the intron's 5' or 3' splice site to an internal splice site. This would typically correspond to

the inclusion of a cassette exon within the intron.

jsk1, jsk2 number of junction reads skipping over the intron: Junction start <= intron start and Junction end >= intron end and at least 1 inequality strict (but not necessarily both).
eup1, eup2 The number of reads supporting the upstream exons in the first and second groups respectively. Upstream means lower genomic coordinates, even for a minus strand gene.
edn1, edn2 The number of reads supporting the downstream exons in the first and second groups respectively. Downstream means greater genomic coordinates, even for a minus strand gene.
psk1, psk2 number of paired-end pairs flanking the intron (both ends in known exons of the same gene). This field is only included for paired-end data.
pup1, pup2 number of paired-end pairs ending in the intron and starting upstream (paired-end data only). Upstream means smaller genomic coordinates, even for minus strand genes.
pdn1, pdn2 number of paired-end pairs starting in the intron and ending downstream (paired-end data only). Downstream mean larger genomic coordinates, even for minus strand genes.
rpkm1, rpkm2 RPKM levels for the intron in the two groups. This only counts bases that don't overlap annotated exons.
LIFR1, LIFR2 Log2(Intron:Flank Ratio). In formula <math>LIFR1=\log_2(\frac{gin1/LI}{(eup1+edn1)/(LU+LD)})</math>. This indicates the level of the intron relative to the flanking exons. Here "LI" indicates the length of the intron, and "LU" and "LD" indicate the length of the upstream and downstream exonic regions. It is thus normalized for the region size. As with gin, rpkm1 and rpkm2, only bases than are not part of exons of other transcripts are counted towards "LI".
LORF Log-Odds-Ratio (Flanking). In formula <math>LORF=LIFR2 - LIFR1</math>. The sign convention means that higher value indicates more intron retention (unspliced isoform) in the second group.
Pf P-value based on inclusion and flanking reads.
LISR1, LISR2 Log2(Intron:Splice Ratio). In formula <math>LISR1=\log_2(gin1/(jsp1+psk1))</math>. Here there is no normalization for the region length. This ratio compares the level of the intron to the level of reads attributable to the spliced isoform (either because the alignment spans the splice junction or because it is a paired-end read flanking the intron).
LORS Log-Odds-Ratio (Spliced). In formula <math>LORS=LISR2 - LISR1</math>. The sign convention means that higher value indicates more intron retention (unspliced isoform) in the second group.
Ps P-value based on inclusion and skipping reads.

Project-specific event types

Aside from these default event types, you can add any number of extra event types to each project. These are defined in a hTSV file called $ROOT/event_types.tsv. Each row corresponds to a different event type. The only required field is event_type, which gives a name for the event type. The rest of the fields tell how to find that event type's contrast files (one for each comparison), and how to interpret the fields of the contrast files:

event_type A short name for the event type. This will appear in the drop-down list on the website.
dir The directory, relative to $ROOT, where the contrasts are stored. This is optional and defaults to analysis/$event_type for each event type.
ext The extension for the files. This defaults to ".xls". The full file name for a comparison $comp_name would be $ROOT/$dir/$comp_name.$ext, where $dir is the directory from the dir column. The "." between $comp_name and $ext is omitted if $ext already begins with a "." (so ".xls" and "xls" are equivalent).
p_field This is the field of the contrast files that contains the significance values. Usually this is a P-value (and in fact defaults to "p" if not given), but it could be any statistic where lower numbers are associated with higher significance. For MISO, where the statistic is a Bayes factor and high numbers indicate more significance, I have a script add a field which is the negative of this number.
ch_field This is the statistic that indicates the size of the change. It defaults to "lfc", but it could be something like log-odds ratio. For MISO the statistic is called "diff" and it represents a difference in PSI values.
lev1_field, lev2_field Names of fields containing the "levels" in the two groups. Defaults are "rpkm1" and "rpkm2". For MISO the fields are "sample1_posterior_mean" and "sample2_posterior_mean", and are actually PSI (percent spliced-in) values.
ch_scale This field gives the type of scale for the ch_field statistics. The valid values are "log2" "log10" "linear" and "delta". "linear" would take values on the interval <math>[0,\infty]</math> whereas delta is also linear but could be any real. "delta" is handy for MISO, which uses so-called "delta-PSI" statistics (change in percent inclusion). When "delta" is specified, then the fold-change cutoff is applied directly to the (absolute value of the) change statistic. Default is "log2". NOTE: "log10" and "linear" are not yet fully implemented!
lev_scale The type of scale lev*_field statistics. The valid values are "log2" "log10" and "linear". Default is "linear".
lev_display This field gives instructions to the 2way plot how to display the event levels. The valid values are "log" and "linear". For gene levels, where expression varies over several orders of magnitude, it is important to specify "log", but for PSI statistics, which range from 0 to 100%, "linear" is more appropriate. "log" is the default.
num_fields A separated list of fields which should be interpreted by the table browser tool as numeric. This will make them sortable and filterable, and also print them using "%.3g". The separator is ",", but this can be changed by providing a different separator with the "separator" field (see below).
int_fields A separated list of fields which should be interpreted by the table browser tool as integer. Like "numeric" they are still sortable and filterable, but they are printed as-is. The separator is ",", but this can be changed by providing a different separator with the "separator" field (see below).
text_fields A separated list of fields which should be interpreted by the table browser tool as text. This makes them (alphabetically) sortable and filterable. The separator is ",", but this can be changed by providing a different separator with the "separator" field (see below).
omit_fields A separated list of fields which should be omitted from the table browser tool. Sometimes the statistical software gives information that is not commonly interesting. This is a way to clean up the display a little. The separator is ",", but this can be changed by providing a different separator with the "separator" field (see below).
separator A string (interpreted as a perl regex) giving the separator for the num_fields, int_fields, text_fields and omit_fields fields. This is handy if you have a comma (which is the default) in the names of one of the fields.