Introduction

Determining the best sampling rates for time-series high-throughput gene expression experiments is a challenging optimization problem. Existing approaches infer the best timepoints either with low-throughput technology, or by measuring interpolation uncertainty in existing gene expression curves, but cannot integrate knowledge from existing datasets. The Optimal Timepoint Selection (OTS) algorithm outlined here addresses the sampling rate problem by utilizing existing training datasets to determine the best timepoint to add to a sparsely sampled dataset. OTS combines training datasets using a novel global and local optimization approach, then identifies differences between the current and integrated training datasets.

Overview

The overall experimental approach to sampling rate design with OTS is shown in the above figure. In the first step, an organism and a treatment or condition are selected. In the second step, a biological experiment is performed, and samples are preserved by freezing at dense timepoints. A broad time range of interest should be selected based on the expected response time of the gene group(s) of interest to the treatment used. Samples should be collected frequently, and researcher bias in the sampling rate should be avoided by sampling with uniform timepoint spacing (though this is not a requirement of the algorithm). In the third step, a sparse set of timepoints is sampled to provide an initial “current” dataset as input to OTS. This dataset needs to include at least two timepoints (the first and the last timepoints in the time range of interest).

In step 4, high-throughput time-series gene expression “training” datasets are collected to guide the optimal timepoint selection for the current dataset. The training datasets do not need to be collected using the same technology (i.e., microarray or RNA-seq experiments) as the current dataset, but they should use treatments or conditions that are expected to affect the target treatment-response genes in the same way as in the current dataset. The results of the timepoint selection improve with more training data, more densely sampled time-series training datasets, and training datasets whose genetic responses are very similar to those of the current experiment.

Some biological background knowledge should be applied when choosing training datasets: if a researcher inputs training experiments that do not elicit genetic responses similar to the current experiment, sub-optimal timepoints may be selected, even though the NNLS curve-matching step automatically reduces the influence of training data that match the current dataset very poorly. For example, in the iterative-online yeast study in the OTS publication, the Spellman et al. α-factor synchronization dataset used the same treatment as our current dataset from Pramila et al., so it was included in our training datasets. However, the other two treatments from the same study either synchronized cells at a different stage of the cell cycle (cdc25 temperature synchronization) or selected for cells that were significantly smaller than normal, which may have affected cyclic gene expression (elutriation treatment), so they were not included. The Cho et al. temperature synchronization method was expected to synchronize normal-sized cells to the same cell cycle stage (G1) as the current dataset, so it was included.
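The down-weighting of poorly matched training data can be illustrated with a small non-negative least squares (NNLS) fit. The sketch below is not the OTS implementation: it assumes each training curve has already been sampled at the same timepoints as the current curve, and it solves the NNLS problem with a plain projected gradient descent so that it needs no external libraries. The function name nnls_weights and the example curves are hypothetical.

```python
def nnls_weights(training_curves, current_curve, n_iter=2000):
    """Find weights w >= 0 minimizing ||sum_k w[k] * training_curves[k] - current_curve||^2.

    Illustrative projected gradient descent, not the solver used by OTS.
    All curves are assumed to be lists of equal length (same timepoints).
    """
    n_train = len(training_curves)
    n_time = len(current_curve)
    # The squared Frobenius norm is an upper bound on the gradient's
    # Lipschitz constant, so 1 / frob_sq is a safe step size.
    frob_sq = sum(v * v for curve in training_curves for v in curve)
    if frob_sq == 0:
        return [0.0] * n_train
    step = 1.0 / frob_sq
    weights = [0.0] * n_train
    for _ in range(n_iter):
        # Residual of the weighted combination at each timepoint.
        residual = [
            sum(weights[k] * training_curves[k][t] for k in range(n_train))
            - current_curve[t]
            for t in range(n_time)
        ]
        # Gradient of 0.5 * ||residual||^2 with respect to each weight.
        grad = [
            sum(training_curves[k][t] * residual[t] for t in range(n_time))
            for k in range(n_train)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, weights[k] - step * grad[k]) for k in range(n_train)]
    return weights


# Hypothetical curves for one gene: the first training curve has the same
# shape as the current curve, the second does not.
well_matched = [0.0, 1.0, 2.0, 1.0]
poorly_matched = [3.0, 0.0, -1.0, 0.0]
current = [0.0, 2.0, 4.0, 2.0]
weights = nnls_weights([well_matched, poorly_matched], current)
```

In this toy case the well-matched curve receives essentially all of the weight and the poorly matched curve is driven to zero, which is the behaviour the paragraph above describes. A production analysis would more likely call an established NNLS solver.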

The treatment-response gene set of interest is selected in step 5. There are several possible approaches for choosing target genes. Knowledge-based approaches include identifying important treatment-response genes from literature studies or Gene Ontology (GO) categories. If annotation data is sparse, or the treatment or condition is relatively unstudied, a mathematical approach can be used instead, in which the differential expression patterns in the training or current datasets are used to select genes with large differential expression values. OTS is designed to handle large gene sets, so the two approaches can be combined, with genes identified by either approach added to the gene set of interest. Note that the gene identifiers (either standard gene names or probe names for microarray datasets) must be entered in the same format for all of the training datasets as well as the current dataset.
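The combined gene selection described above can be sketched as follows. This is an illustration under stated assumptions rather than part of OTS: expression data is assumed to be a dictionary mapping gene identifiers to lists of differential expression values (for example log ratios), a "large" value is taken to mean an absolute value at or above a user-chosen cutoff, and the function names, cutoff, and gene identifiers are hypothetical.

```python
def select_target_genes(expression, known_genes, min_abs_value=1.0):
    """Combine knowledge-based and mathematical gene selection.

    expression: dict mapping gene identifier -> list of differential
        expression values over time (assumed format).
    known_genes: identifiers chosen from the literature or GO categories.
    min_abs_value: hypothetical cutoff for a "large" differential expression.
    """
    # Mathematical approach: genes whose largest absolute value meets the cutoff.
    variable = {
        gene for gene, values in expression.items()
        if values and max(abs(v) for v in values) >= min_abs_value
    }
    # Knowledge-based approach: keep only genes that were actually measured.
    measured_known = {gene for gene in known_genes if gene in expression}
    return sorted(variable | measured_known)


def missing_identifiers(gene_set, datasets):
    """Report which selected genes are absent from each named dataset.

    A non-empty result often signals inconsistent identifier formats.
    """
    return {
        name: sorted(g for g in gene_set if g not in data)
        for name, data in datasets.items()
        if any(g not in data for g in gene_set)
    }


current = {
    "YAL001C": [0.1, -0.2, 0.0],
    "YBR002W": [0.5, 2.3, -1.8],
    "YCR003W": [0.0, 0.1, 0.2],
}
training = {"YBR002W": [0.4, 2.0, -1.5], "ycr003w": [0.0, 0.2, 0.1]}
genes = select_target_genes(current, ["YCR003W", "YDR004W"])
problems = missing_identifiers(genes, {"current": current, "training": training})
```

Here the lower-case identifier in the hypothetical training dataset is reported as a missing gene, which is the kind of format mismatch that should be resolved before the datasets are entered into OTS.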

The current dataset and the training datasets (containing only genes of interest), the list of all timepoints available to be added to the current dataset, and two parameters (cluster number and threshold number) are then entered into OTS (Step 6). The default cluster number is calculated as the square root of (N/2), but can be adjusted by the user. The threshold number should be increased from the default value of 3 if the data has very low levels of noise.
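As a quick worked example of the default cluster number, the sketch below evaluates the square root of (N/2). The text does not define N or say how the result is rounded, so two assumptions are made here: N is taken to be the number of genes in the gene set of interest, and the result is rounded to the nearest integer with a minimum of one cluster. The function name is hypothetical.

```python
import math


def default_cluster_number(n_genes):
    """Rule-of-thumb cluster count sqrt(N / 2).

    Assumes N is the number of genes of interest; rounding to the nearest
    integer (minimum 1) is also an assumption, not taken from OTS.
    """
    if n_genes < 1:
        raise ValueError("n_genes must be at least 1")
    return max(1, round(math.sqrt(n_genes / 2)))


clusters_for_200_genes = default_cluster_number(200)  # sqrt(100) = 10
clusters_for_50_genes = default_cluster_number(50)    # sqrt(25) = 5
```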

OTS produces a ranked list of the optimal timepoints to be selected next, in order to best model the differential expression of the target genes. The optimal timepoint can then be sampled and added to the current dataset (Step 7), and if the researcher has sufficient resources available, then another timepoint can be added to further improve the accuracy of the gene expression data (Step 8).

The number of timepoints selected to be thawed and sequenced will depend on the researcher’s experimental design. For example, if a microarray experiment is being conducted, then the researcher may only want to choose the top-ranked timepoint, and run two chips (with control and treatment, each in duplicate). However, RNA-seq chips have seven lanes and adapters can be used to run several samples per lane, so several highly-ranked optimal timepoints can be accommodated on one chip.