Preparing Raw Data with the EP Backend

From ExpressionPlot
(Redirected from RNASeq-pipeline.pl)
Jump to: navigation, search

Uploading your data

ExpressionPlot takes FASTQ files as input. You can keep the data associated with ExpressionPlot anywhere you like on your hard drive so long as each project has its own directory (directory tree, really---ExpressionPlot will generate lots of subdirectories). One possible location is to have each project in a subdirectory of projects/ within the ExpressionPlot hierarchy. (If you are running selinux then this has the added advantage that the files will inherit httpd-accessible contexts, whereas keeping them in a home directory you'll have to change their contexts (since httpd is not by default allowed in home directories).

You can find the root of the ExpressionPlot hierarchy using

 expressionplot-config

Setting up your project

Once you've uploaded your sequence data, setting up your project requires only the creation of three to five text files, which all should appear in the same directory. This will become the base directory of your ExpressionPlot project.

  1. lanes.txt
  2. groups.txt
  3. comps.txt
  4. long_name.txt (optional)
  5. colors.txt (optional)

These files, except for long_name.txt are all in "tab-separated values" format, which here means that each row has a fixed number of fields, each separated by tabs. The fields typically contain names for your samples, groups of samples, or pairwise comparisons. In theory any names not containing whitespace would be handled properly but this feature has not been fully tested. It is therefore recommended to stick with names that begin with non-digits, and continue with digits, letters and the underscore (_) character.

We'll use a toy example of a simple experimental design. Let's imagine we are looking at wildtype and mutant cells treated with a drug or control. There are four samples: WT_CTL, MT_CTL, WT_DRUG and MT_DRUG. Let's imagine also that you ran two lanes of paired-end sequencing for each sample, each lane generating two FASTQ files (one for each end).

lanes.txt

Now say you've decided to put your project in a directory called drugstudy. Create drugstudy/lanes.txt. This will be a tab-separated values file with one row for each lane of sequencing. Each row should have three columns. The first two columns will be relative paths to the FASTQ files, and the third column should be an identifier for that lane of sequencing. You might put your sequencing data in a subdirectory of drugstudy called seq. Then the contents of lanes.txt might look like this:

 seq/30MW9AAXX_SL303_s_1_1.fq.gz	seq/30MW9AAXX_SL303_s_1_2.fq.gz	WT_Ctl_1
 seq/30MW9AAXX_SL304_s_2_1.fq.gz	seq/30MW9AAXX_SL304_s_2_2.fq.gz	WT_Ctl_2
 seq/30MW9AAXX_SL305_s_3_1.fq.gz	seq/30MW9AAXX_SL305_s_3_2.fq.gz	WT_Drug_1
 seq/30MW9AAXX_SL306_s_4_1.fq.gz	seq/30MW9AAXX_SL306_s_4_2.fq.gz	WT_Drug_2
 seq/30MW9AAXX_SL307_s_5_1.fq.gz	seq/30MW9AAXX_SL307_s_5_2.fq.gz	MT_Ctl_1
 seq/30MW9AAXX_SL308_s_6_1.fq.gz	seq/30MW9AAXX_SL308_s_6_2.fq.gz	MT_Ctl_2
 seq/30MW9AAXX_SL309_s_7_1.fq.gz	seq/30MW9AAXX_SL309_s_7_2.fq.gz	MT_Drug_1
 seq/30MW9AAXX_SL310_s_8_1.fq.gz	seq/30MW9AAXX_SL310_s_8_2.fq.gz	MT_Drug_2

That's three columns. The spaces between the columns should be tabs (if you copy and paste from this wiki make sure they are still tabs).

In our sequencing facility, 30MW9AAXX is an identifier for the flowcell, SLXXX is an identifier for the RNA-Seq library, s_X is an identifier for the lane of the flowcell, and the terminal _1 or _2 indicates the end of the paired-end reads. The FASTQ (.fq) files are stored gzip'd on the hard drive. ExpressionPlot will automatically detect that extension and unzip them in memory for input into bowtie. It can also autodetect the .bz2 extension for bzip2'd files, and of course a filename simply with any other extension, such as .fq, will not be unzipped but used as-is. I recommend gzip over bzip2. Although the compression ratios are slightly better for bzip2, gzip runs a lot faster.

The third column gives names that ExpressionPlot to refer to that lane of sequencing. In this case we use the _1 and _2 suffixes to indicate the first and second lane of sequencing for each library, but you can use any system you like.

If you have single-end data then lanes.txt will have only two columns---the fastq filenames in the first column and the lane names in the second column:

 seq/30MW9AAXX_SL303_s_1.fq.gz	WT_Ctl_1
 seq/30MW9AAXX_SL304_s_2.fq.gz	WT_Ctl_2
 seq/30MW9AAXX_SL305_s_3.fq.gz	WT_Drug_1
 seq/30MW9AAXX_SL306_s_4.fq.gz	WT_Drug_2
 seq/30MW9AAXX_SL307_s_5.fq.gz	MT_Ctl_1
 seq/30MW9AAXX_SL308_s_6.fq.gz	MT_Ctl_2
 seq/30MW9AAXX_SL309_s_7.fq.gz	MT_Drug_1
 seq/30MW9AAXX_SL310_s_8.fq.gz	MT_Drug_2


If you have microarray data, including exon arrays, then you could just leave the first column blank, like this:

 	WT_Ctl_1
 	WT_Ctl_2
 	WT_Drug_1
 	WT_Drug_2
 	MT_Ctl_1
 	MT_Ctl_2
 	MT_Drug_1
 	MT_Drug_2

groups.txt

This file gives you the opportunity to group your samples together. The simplest grouping is no grouping---just leave each lane as a single sample. This allows you to examine lane-specific effects later. The format of the file is two columns, where the first column is the name of the lane from the lanes.txt file, and the second is the name of the group. To keep each lane as its own "group", you would use the following (as above, spaces should be converted to tabs):

 WT_Ctl_1	WT_Ctl_1
 WT_Ctl_2	WT_Ctl_2
 WT_Drug_1	WT_Drug_1
 WT_Drug_2	WT_Drug_2
 MT_Ctl_1	MT_Ctl_1
 MT_Ctl_2	MT_Ctl_2
 MT_Drug_1	MT_Drug_1
 MT_Drug_2	MT_Drug_2

Here we gave the "groups" the same names as the "lanes". Another possible grouping might be to tell ExpressionPlot that it should just add the two lanes together and consider them as a single sample. To do that, you would use the following groups.txt (note the different last 2 characters of each line):

 WT_Ctl_1	WT_Ctl
 WT_Ctl_2	WT_Ctl
 WT_Drug_1	WT_Drug
 WT_Drug_2	WT_Drug
 MT_Ctl_1	MT_Ctl
 MT_Ctl_2	MT_Ctl
 MT_Drug_1	MT_Drug
 MT_Drug_2	MT_Drug

While grouping multiple lanes of the same library makes sense, biological replicates should not be grouped at this stage—that would make it very difficult to explore experimental or inter-individual variation, which is always an important question in gene expression analysis (and many other types of scientific inquiry!). Therefore, if you have one lane per library, which is now quite common, you can make your groups file with this command:

 perl -ne 'chomp; split /\t/, $_; $x = pop @_; print "$x\t$x\n"' lanes.txt > groups.txt

comps.txt

This file tells ExpressionPlot what pairwise comparisons you'd like to make. I recommend you take a few minutes to make an exhaustive list here, including any comparison at all that you think you might like to make. The first column gives the name of the comparison, and the second and third give the "control" and "treatment" groups. The group(s) in the "control" column will be associated with negative log-fold-changes, and the group(s) in the "treatment" column will be associated with positive log-fold-changes. You can combine groups here using the "+" character. For example, if you used the first groups.txt file above, you might use this for comps.txt:

 Drug_in_WT    WT_Ctl_1+WT_Ctl_2     WT_Drug_1+WT_Drug_2
 Drug_in_MT    MT_Ctl_1+MT_Ctl_2     MT_Drug_1+MT_Drug_2
 MT_in_Ctl     WT_Ctl_1+WT_Ctl_2     MT_Ctl_1+MT_Ctl_2
 MT_in_Drug    WT_Drug_1+WT_Drug_2   MT_Drug_1+MT_Drug_2
 Rep_WT_Ctl    WT_Ctl_1              WT_Ctl_2
 Rep_WT_Drug   WT_Drug_1             WT_Drug_2
 Rep_MT_Ctl    MT_Ctl_1              MT_Ctl_2
 Rep_MT_Drug   MT_Drug_1             MT_Drug_2

Here the first two comparisons would examine the effect of the drug in wildtype and mutant cells, respectively. Typically these two would be the most interesting. The second two would examine the effect of the mutation in control and drug-treated cells, respectively. The last four comparisons would be of technical interest, looking for genes that change between replicates. Ideally there would be very few changes there, but it is nice to check.

If you had used the second style of groups.txt file, then the first four lines would be simpler (no _1/_2 suffixes, and no + since there would be just one group in each field), but then you couldn't do the replicate comparisons.

Sometimes when the experimental design is complicated, I even write a little Perl script to generate the file. This can help avoid typos.

long_name.txt

This is the easiest of the four files, and it is optional. When you start up the ExpressionPlot pipeline you are going to give a name to this project. It should be a short, simple name such as the ones you used for the lanes (no whitespace or special characters). In long_name.txt you can specify a longer name to appear in the drop-down menus on the website. The file should have a single line of text, and that will be the name. For example, for this project, the short name will be drugstudy, but you might put the following in long_name.txt:

 Drug Study on Wildtype and Mutant Cells

colors.txt

This optional file allows you to assign colors to the different samples, which are later used by the website in making different types of plots. It is a headerless tab-separated file with either 2 or 3 columns. The first column is the name of the sample. The second gives a color name. This can be any valid Wikipedia:R color, which includes color names like "blue" and "magenta", as well as hexcodes like "#A300FF", and also the integers 1-8 to access R's default palette. R's default palette gives high-contrast colors, beginning black, red, green, blue. They are not the most esthetically pleasing but they are guaranteed to be clearly distinguishable, at least for non-colorblind people. The colors do not have to be unique, and in fact it is recommended to use the same color for replicates. The third column is optional, and only applies for exon array data. It specifies an R linetype, which can be either a string like "solid", "dashed" or "dotted", or, analogous to the color palette, an integer from 1-8. The linetypes are used to plot exon array intensities across a gene. Default is to make them all solid. Here is a possible file for our running example, in the case where the replicates were maintained as separate groups:

 WT_Ctl_1	1
 WT_Ctl_2	1
 WT_Drug_1	2
 WT_Drug_2	2
 MT_Ctl_1	3
 MT_Ctl_2	3
 MT_Drug_1	4
 MT_Drug_2	4

If the replicates were grouped together the file could look like this:

 WT_Ctl        1
 WT_Drug       2
 MT_Ctl        3
 MT_Drug       4

The four samples would be shown in black, red, green and blue.

The colors.txt file also gives an order to the samples to be used by the website in plotting. This overrides the order in groups.txt.

Starting the pipeline

The microarray (including exon array) and RNA-Seq pipelines are controlled by two different master scripts. I recommend running the pipeline from within a screen terminal so that you can disconnect the terminal but allow the pipeline to continue. This program is comes installed standard on most linux distributions, including ubuntu. Screen is a powerful program, but here is a quick primer: To start a new screen just type screen -S screen_name. Then start the pipeline. To disconnect once the pipeline has started use C-a d (control-'A' followed by 'D'). Then to reconnect later use screen -r screen_name.

RNA-Seq (from FASTQ)

The RNA-Seq pipeline is run by the RNASeq-pipeline.pl script. This is located in the RNASeq subdirectory of the ExpressionPlot hierarchy. Run RNASeq-pipeline.pl for a short usage message and RNASeq-pipeline.pl -lu for the complete usage message. For the drug study example above, you would use the following command, assuming that your current directory is projects/drugstudy:

 ../../RNASeq/RNASeq-pipeline.pl lanes.txt hg18/hg18 -g hg18 \
   -pn drugstudy -admin expressionplot \
   -l pipeline.%d%b%y.log -np 4 \
   -j hg18/hg18_all_junctions -hjl 31 -tl 32 \
   -cl ../../annot/hg18/hg18_trimmed_gene_clusters.tsv \
   -ae ../../annot/hg18/hg18_acembly_AE_events_with_flanking_SS.tsv \
   -ate ../../annot/hg18/hg18_ensGene_term_exons.tsv \
   -ri ../../annot/hg18/hg18_acembly_intron_events.xls \
   -ensT ../../annot/hg18/hg18_ensembl_and_tRNA_clusters.tsv

The backslashes ('\') at the end of each line signal to the BASH shell that it is a single continuous command. Here is what all the parts of the command mean:

lanes.txt Path to lanes.txt file
hg18/hg18 Path to genomic bowtie indexes. This will be supplied directly to the bowtie command. It can either be a full path, or a path

relative to the bowtie index directory. In the default installation this is /usr/local/expressionplot/bowtie/indexes.

-j hg18/hg18_all_junctions -hjl 31 Specifies the bowtie index for the junction database, and half junction length with which the database was constructed.
-g hg18 This specifies the genome. It will be used by the website when drawing read maps to decide which annotations to show underneath, and in

which annotation to look up gene names.

-pn drugstudy Gives a short name for the project. It is recommended to begin with a non-digit and use only digits, letters and numbers.
-admin expressionplot This gives the name of one website user who will receive administrative privileges. Other users can then be granted access later through the website.
-l pipeline.%d%b%y.log Name of a log file for this pipeline run. Will be passed through the unix date program, which substitutes day/month/year for %d%b%y
-np 4 Number of processors to use. Try to set this as high as you can up to the max on one node. This will be provided to bowtie with its -p switch, so using more nodes can speed up your run time significantly (1 node is default—not recommended). The latter parts of the pipeline don't yet take advantage of any parallelization, but the alignments are the slowest part anyway.
-cl ../../annot/hg18/hg18_trimmed_gene_clusters.tsv Path to gene clusters file (based on UCSC knownGene clusters).
-ae ../../annot/hg18/hg18_acembly_AE_events_with_flanking_SS.tsv Path to alternative (cassette) exons file. Leave it out if you only want to analyze gene expression.
-ate ../../annot/hg18/hg18_ensGene_term_exons.tsv Path to alternative terminal exons file. This include both alternative transcription start sites and

alternative poly-adenylation/cleavage sites. Leave it out if you only want to analyze gene expression.

-ri ../../annot/hg18/hg18_acembly_intron_events.xls Path to retained introns file. Leave it out if you only want to analyze gene expression.
-ensT ../../annot/hg18/hg18_ensembl_and_tRNA_clusters.tsv Path to combined Ensembl/tRNA gene clusters. This is similar to the UCSC clusters, but provides better

annotation of non-coding RNAs. Leave it out if you only want to use the UCSC clusters.

If you want to have ExpressionPlot output a BAM file (for example, so you can upload it to the UCSC website), supply the -BAM switch. They will appear in the analysis/ subdirectory of your project. It is also possible to generate these later using parse_paired_bowtie.pl. See the usage message for more information.

Special cases:

RNA-Seq (from BAM)

Although ExpressionPlot is designed to run using its own alignment pipeline starting with FASTQ (sequence/quality files), you can also provide BAM (alignment) files to ExpressionPlot. The samples should be in separate BAM files. The BAM files should be sorted by sequence name rather than alignment position so that all the alignments (including paired-end mates) for a given read will be adjacent. This can be achieved with samtools sort -n.

You can then import each BAM file into ExpressionPlot using the BAM_to_EP.pl script. The output prefix should be the analysis/ subdirectory of the ExpressionPlot project you are creating, followed by the name you will use for the sample in ExpressionPlot. For example, assuming you are already in the projects/drugstudy directory, you could run

 for lane in `cut -f1 groups.txt`; do
   ../../RNASeq/BAM_to_EP.pl $lane.bam analysis/$lane
 done

Then you would start the pipeline with the same command as above, but adding in -sa (skip alignments) and -rl readlen, where readlen is the length of reads in your BAM files. Here is an example with readlen=36.

 ../../RNASeq/RNASeq-pipeline.pl lanes.txt hg18/hg18 -g hg18 \
   -sa -rl 36 \
   -pn drugstudy -admin expressionplot \
   -l pipeline.%d%b%y.log \
   -j hg18/hg18_all_junctions -hjl 31 -tl 32 \
   -cl ../../annot/hg18/hg18_trimmed_gene_clusters.tsv \
   -ae ../../annot/hg18/hg18_acembly_AE_events_with_flanking_SS.tsv \
   -ate ../../annot/hg18/hg18_ensGene_term_exons.tsv \
   -ri ../../annot/hg18/hg18_acembly_intron_events.xls \
   -ensT ../../annot/hg18/hg18_ensembl_and_tRNA_clusters.tsv

Microarray

The microarray pipeline is run by the array_pipeline.pl script. This is located in the microarray subdirectory of the ExpressionPlot hierarchy. Run array_pipeline.pl without arguments for a usage message. For the drug study example above, you could use the following command, assuming that your current directory is projects/drugstudy:

 EP=`expressionplot-config`
 $EP/microarray/array_pipeline.pl . $EP/microarray/array_files/HuExon HuExon drugstudy cel_files/*.cel \
  -epua expressionplot \
  -l pipeline.%d%b%y.log \
  -cl $EP/annot/hg18/hg18_gene_clusters.tsv \
  -ce $EP/annot/hg18/hg18_acembly_AE_events_with_flanking_SS.tsv

The backslashes ('\') at the end of each line signal to the BASH shell that it is a single continuous command. Here is what all the parts of the command mean:

. Path to the base directory of the project (in this case "." refers to the current directory, projects/drugstudy.
$EP/microarray/array_files/HuExon Path to directory containing the array files for this array. For exon arrays, the directory should contain files called "pgf", "bgp" and "clf" (which could be just links to other files). These are provided by affymetrix and give information about the probe layout and GC-background sets. For 3' UTR arrays the cdf file is fine. It has to have the name ARRAY_NAME.CDF where ARRAY_NAME is affymetrix's short code for the name of the array (like HG-U133A or MOE430A). The best way to set up these directories is to have EP-manage.pl do it for you so you can be sure everything is in place.
HuExon This gives the actual affymetrix name of the array you are going to use, and it also corresponds to a table in the MySQL database that has the coordinates of the probe alignments.
drugstudy The fourth argument is the short name that ExpressionPlot will use to identify this project. As usual, longer names used for display and drop-down menus can be chosen by creating a long_name.txt file.
cel_files/*.cel All further arguments are the CEL files that you want to process. The sample names are taken from the CEL files, after dropping the .CEL extension, so sometimes I set up links to the the original CEL files with more useful names. You could also specify the files explicitly instead of using the * wildcard.
-epua expressionplot This gives the name of one website user who will receive administrative privileges. Other users can then be granted access later through the website. It stands for "ExpressionPlot Users Admin", and it corresponds to the -admin switch in the RNA-Seq pipeline.
-l pipeline.%d%b%y.log Name of a log file for this pipeline run. Will be passed through the unix date program, which substitutes day/month/year for %d%b%y
-cl $EP/annot/hg18/hg18_trimmed_gene_clusters.tsv Path to gene clusters file (based on UCSC knownGene clusters).
-ce $EP/annot/hg18/hg18_acembly_AE_events_with_flanking_SS.tsv Path to cassette (alternative) exons file. Leave it out if you only want to analyze gene expression. This corresponds to the -ae switch in the RNA-Seq pipeline

Using .CDF microarrays

The 3' UTR arrays come with .CDF files instead of the .pgf/.bgp/.clf of the exon arrays. You have add to the "-d array_name.cdf" switch to inform ExpressionPlot of this fact. Here is an example using HG-U133A arrays.

 $EP/microarray/array_pipeline.pl . $EP/microarray/array_files/HG-U133 HG-U133 drugstudy cel_files/*.CEL \
   -epua brad \
   -l pipeline.%d%b%y.log \
   -cl $EP/annot/hg18/hg18_gene_clusters.tsv \
   -d HG-U133A.CDF

The "HT_HG-U133" CDF files will be in the same directory. So the call to array_pipeline.pl will be the same except that the final switch should be -d HT_HG-U133A.CDF.

Some arrays, like the mouse arrays MOE430A and MOE430B come in pairs. It is a little more complicated to process these, because the probes intensities have to be extracted, then combined into a single table, then the pipeline proceeds. To do this, first call the pipeline with the -probes_only switch separately on the A and B arrays. Then use the combine_arrays.pl script to merge the normalized probe intensity tables. Finally, call the array pipeline once to continue on the merged table. In the following example, supped that all the 430A CEL files are in a subdirectory called 430A_CELs and all the 430B cel files in the 430B_CELs subdirectory.

 $EP/microarray/array_pipeline.pl 430A_CELs $EP/microarray/array_files/MOE430 MOE430 drugstudy_A 430A_CELs/*.CEL -d MOE430A.CDF -probes_only
 $EP/microarray/array_pipeline.pl 430B_CELs $EP/microarray/array_files/MOE430 MOE430 drugstudy_B 430B_CELs/*.CEL -d MOE430B.CDF -probes_only
 $EP/microarray/combine_arrays.pl 430A_CELs/qnpm.tsv 430B_CELs/qnpm.tsv -proj drugstudy > qnpm.tsv
 $EP/microarray/array_pipeline.pl . $EP/microarray/array_files/MOE430 MOE430 drugstudy -from_mysql -cl $EP/annot/mm9/mm9_gene_clusters.tsv

If you are using the HG-U133A/B arrays then you should additionally provide the switches -A U133A: -B U133B to combine_arrays.pl. See the usage message for more information.

Next

Once you've started the pipeline, it might take hours to a couple of days to finish, depending on the power of your computer and complexity of the project. You can follow its progress by examining the log file. The pipeline will automatically register your project with the web server, so once it finishes you can immediately begin harnessing the power of the web interface.