The overall experimental approach of sampling rate design with OTS is
shown in the above figure. In the first step, an organism and a
treatment or condition are selected.
In the second step, a biological experiment is
performed, and samples are collected at dense timepoints and preserved
by freezing. A broad time range of interest should be selected based
on the expected response time of the gene group(s) of interest to
the treatment used. Samples should be collected frequently, and bias
in the sampling rate by the researcher should be avoided by sampling
with uniform timepoint spacing (though this is not a requirement of
the algorithm). In the third step, a sparse set of timepoints is
sampled to provide an initial “current” dataset as input to OTS.
This dataset must include at least two timepoints: the first and the
last timepoint in the time range of interest.
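The sampling layout of steps 2 and 3 can be sketched in a few lines. This is only an illustration and not part of OTS: the function names, the 0 to 120 minute range and the 5 minute spacing are hypothetical values chosen for the example.

```python
def dense_timepoints(start, end, spacing):
    """Uniformly spaced timepoints beginning at start and not exceeding end (step 2)."""
    if spacing <= 0 or end <= start:
        raise ValueError("need end > start and spacing > 0")
    count = int((end - start) // spacing)
    return [start + i * spacing for i in range(count + 1)]


def initial_current_set(timepoints):
    """Smallest valid initial 'current' set: the first and last timepoint (step 3)."""
    if len(timepoints) < 2:
        raise ValueError("at least two timepoints are required")
    return [timepoints[0], timepoints[-1]]


# Hypothetical example: samples frozen every 5 minutes from 0 to 120 minutes.
frozen = dense_timepoints(0, 120, 5)
current = initial_current_set(frozen)
# Everything not yet measured remains available for later selection.
candidates = [t for t in frozen if t not in current]
```

Here `candidates` plays the role of the list of timepoints that are still available to be added to the current dataset.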
In step 4, high-throughput
time-series gene expression “training” datasets are collected to
guide the optimal timepoint selection for the current dataset. It is
not necessary for the training datasets to be collected using the
same technology (i.e., microarray or RNA-seq experiments) as the
current dataset, but the training datasets should use treatments or
conditions that are expected to affect target treatment-response
genes in the same way as in the current dataset. The results of the
timepoint selection will be improved with more training data, more
densely sampled time-series training datasets, and training datasets
with genetic responses very similar to the current experiment.
Some biological background knowledge should be applied when choosing
training datasets. If a researcher inputs training experiments that do
not elicit genetic responses similar to those in the current
experiment, sub-optimal timepoints may be selected, even though the
NNLS curve-matching step can reduce the influence of poorly matched
datasets. For example, in the iterative-online yeast study in the OTS
publication, the Spellman et al. α-factor synchronization dataset used
the same treatment as our current dataset from Pramila et al., so it
was included in our training datasets. However, the other two
treatments from the same
study synchronized cells at a different stage in the cell cycle
(cdc25 temperature synchronization) or selected for cells which were
significantly smaller than normal and may have affected cyclic gene
expression (elutriation treatment), so they were not included. The
Cho et al. temperature synchronization method was expected to
synchronize normal-sized cells to the same cell cycle stage (G1) as
the current dataset, so it was included here. If the match between a
training dataset and the current dataset nevertheless turns out to be
very poor, the NNLS curve-matching step will automatically reduce the
influence of the poorly matching training data.
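To illustrate why a non-negative least-squares (NNLS) fit can down-weight a poorly matching training dataset, the following sketch fits one current expression profile as a non-negative combination of training profiles. It is not the OTS implementation: the solver (plain cyclic coordinate descent), the function name and the toy profiles are all assumptions made for the example.

```python
def nnls_weights(training_profiles, current_profile, sweeps=200):
    """Non-negative least squares by cyclic coordinate descent.

    Finds weights w >= 0 that minimise
    ||current_profile - sum_j w[j] * training_profiles[j]||^2.
    """
    weights = [0.0] * len(training_profiles)
    residual = list(current_profile)  # current profile minus the fitted curve
    for _ in range(sweeps):
        for j, profile in enumerate(training_profiles):
            norm_sq = sum(v * v for v in profile)
            if norm_sq == 0.0:
                continue  # an all-zero profile cannot contribute
            step = sum(p * r for p, r in zip(profile, residual)) / norm_sq
            new_weight = max(0.0, weights[j] + step)  # enforce w[j] >= 0
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - delta * p for r, p in zip(residual, profile)]
                weights[j] = new_weight
    return weights


# Toy profiles (made-up values): one training curve has the same shape as
# the current curve, the other has the opposite shape.
current_curve = [0.0, 1.0, 2.0, 1.0, 0.0]
well_matched = [0.0, 2.0, 4.0, 2.0, 0.0]
poorly_matched = [2.0, 1.0, 0.0, 1.0, 2.0]
weights = nnls_weights([well_matched, poorly_matched], current_curve)
```

In this toy case the well-matched profile receives all of the weight and the poorly matched profile receives none, which is the behaviour the text describes.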
The treatment-response gene set of interest is selected in step 5.
There are several possible approaches for choosing target genes.
Knowledge-based approaches include identifying important
treatment-response genes from literature studies or Gene Ontology (GO)
categories. If annotation data are sparse, or the treatment or
condition is relatively unstudied, then a mathematical approach can be
used, in which genes with large differential expression values in the
training or current datasets are selected. OTS is designed to handle
large gene sets, so the two approaches can be combined, and genes
identified by either approach can be added to the gene set of
interest. Note that the
gene identifiers (either standard gene names or probe names for
microarray datasets) must be entered in the same format for all of
the training datasets as well as the current dataset.
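One way to combine the two selection approaches of step 5, and to check that identifiers agree across datasets, is sketched below. The function names, the cut-off of 1.5, the gene identifiers and the expression values are all hypothetical; OTS itself does not prescribe a particular cut-off.

```python
def select_target_genes(expression, curated_genes, min_abs_change):
    """Union of curated genes and genes with a large expression change.

    expression maps a gene identifier to its differential expression
    values (for example log ratios) across timepoints.
    """
    data_driven = {
        gene for gene, values in expression.items()
        if max(abs(v) for v in values) >= min_abs_change
    }
    knowledge_based = {g for g in curated_genes if g in expression}
    return data_driven | knowledge_based


def shared_identifiers(*datasets):
    """Identifiers that appear, in identical format, in every dataset."""
    return set.intersection(*(set(d) for d in datasets))


# Made-up toy data; "GENEC" shows an identifier formatted differently.
current_expr = {
    "geneA": [0.1, 2.3, -0.4],
    "geneB": [0.2, -0.1, 0.3],
    "geneC": [0.0, 0.5, -1.8],
    "geneD": [0.1, 0.2, 0.1],
}
training_expr = {
    "geneA": [0.0, 1.9, -0.2],
    "geneB": [0.1, 0.0, 0.2],
    "GENEC": [0.1, 0.4, -1.5],
    "geneD": [0.0, 0.1, 0.0],
}
targets = select_target_genes(current_expr, ["geneB"], min_abs_change=1.5)
usable = targets & shared_identifiers(current_expr, training_expr)
```

The gene whose identifier differs between the two datasets drops out of `usable`, which is why a consistent identifier format matters.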
The current dataset and the training datasets (containing only genes
of interest), the list of all timepoints available to be added to
the current dataset, and two parameters (cluster number and
threshold number) are then entered into OTS (Step 6). The default
cluster number is calculated as the square root of N/2, but can be
adjusted by the user. The threshold number should be increased from
the default value of 3 if the data have very low levels of noise.
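As a worked example of the default cluster number, the sketch below evaluates the square root of N/2. Two points are assumptions rather than statements from the text: that N is the number of genes of interest, and that the result is rounded to the nearest integer with a minimum of 1.

```python
import math


def default_cluster_number(n_genes):
    """Square root of N/2, rounded to the nearest integer (rounding is assumed)."""
    if n_genes < 1:
        raise ValueError("n_genes must be at least 1")
    return max(1, round(math.sqrt(n_genes / 2)))


# With 200 genes of interest: sqrt(200 / 2) = sqrt(100) = 10 clusters.
clusters_for_200 = default_cluster_number(200)
```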
OTS produces a ranked list of the optimal timepoints to be selected
next, in order to best model the differential expression of the
target genes. The optimal timepoint can then be sampled and added to
the current dataset (Step 7), and if the researcher has sufficient
resources available, then another timepoint can be added to further
improve the accuracy of the gene expression data (Step 8).
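The iterative loop of steps 7 and 8 can be sketched as follows. The ranking function here is a deliberately simple stand-in (farthest from any already measured timepoint) and is NOT the OTS criterion; in practice each round would consist of measuring the chosen sample and re-running OTS on the updated current dataset. All names and the `budget` parameter are hypothetical.

```python
def rank_candidates_placeholder(current, candidates):
    """Stand-in ranking, not the OTS criterion.

    Orders candidates by distance to the nearest measured timepoint,
    farthest first; ties are broken by the earlier timepoint.
    """
    def distance_to_current(t):
        return min(abs(t - c) for c in current)

    return sorted(candidates, key=lambda t: (-distance_to_current(t), t))


def iterative_selection(current, candidates, budget, rank=rank_candidates_placeholder):
    """Add the top-ranked timepoint, then repeat while resources remain."""
    current = sorted(current)
    remaining = list(candidates)
    for _ in range(budget):
        if not remaining:
            break
        best = rank(current, remaining)[0]  # step 7: take the top-ranked timepoint
        remaining.remove(best)
        current = sorted(current + [best])  # step 8: repeat with the updated dataset
    return current


# Hypothetical run: start from the first and last timepoint, add three more.
selected = iterative_selection([0, 120], list(range(10, 120, 10)), budget=3)
```

Passing a different `rank` function is the intended extension point; the placeholder only makes the loop runnable.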
The number of timepoints selected to be thawed
and sequenced will depend on the researcher’s experimental design.
For example, if a microarray experiment is being conducted, then the
researcher may only want to choose the top-ranked timepoint, and run
two chips (with control and treatment, each in duplicate). However,
RNA-seq flow cells have seven lanes, and adapters can be used to run
several samples per lane, so several highly ranked optimal
timepoints can be accommodated on one flow cell.
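The capacity arithmetic behind this choice can be made explicit. In the sketch below, the seven lanes come from the text, while the multiplexing level of four samples per lane and the use of duplicates for RNA-seq are assumptions chosen purely for illustration.

```python
def timepoints_per_run(lanes, samples_per_lane, conditions, replicates):
    """Number of whole timepoints that fit into one sequencing run."""
    samples_per_timepoint = conditions * replicates
    if samples_per_timepoint < 1:
        raise ValueError("each timepoint needs at least one sample")
    return (lanes * samples_per_lane) // samples_per_timepoint


# 7 lanes x 4 samples per lane (assumed) = 28 samples per run;
# control and treatment in duplicate = 4 samples per timepoint.
capacity = timepoints_per_run(lanes=7, samples_per_lane=4, conditions=2, replicates=2)
```

Under these assumptions one run accommodates seven timepoints, so the seven highest-ranked timepoints could be processed together.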