Data Cleaning in Plant Photosynthesis Phenomics Data

Lei Xu, Jeffrey A Cruz, Linda J Savage, David M Kramer, Jin Chen

Plant phenomics, the collection of large-scale plant phenotype data, is growing exponentially, and the resulting resources have become an essential component of modern plant science. Such complex data sets are critical for understanding the mechanisms governing energy intake and storage in plants, which is essential for improving crop productivity. However, a major issue facing these efforts is determining the quality of phenotypic data. Automated methods are needed to identify and characterize alterations caused by system errors, which are difficult to remove in the data collection step, and to distinguish them from more interesting cases of altered biological responses.

As a step towards solving this problem, we have developed a coarse-to-refined model called Dynamic Filter to identify abnormalities in plant photosynthesis phenotype data by comparing light responses of photosynthesis using a simplified kinetic model of photosynthesis. Dynamic Filter employs an Expectation-Maximization process to adjust the kinetic model in coarse and refined regions to identify both abnormalities and biological outliers. The experimental results show that our algorithm can effectively identify most of the abnormalities in both the real and synthetic datasets.


We introduce Dynamic Filter, a computational framework (Figure 1) for identifying abnormalities in time-streaming data. The package was originally developed for automatic cleaning of plant phenotyping data in which Phi2 and light intensity are related by Michaelis-Menten kinetics. By substituting the corresponding model equations, it can also be used to clean data described by other regression models.


Dynamic Filter integrates coarse-to-refined residual analysis algorithms and employs an Expectation-Maximization process to repeatedly classify the temporal data into two groups: abnormalities and normal data.

At the coarse level, residuals are computed from the regression model. Based on these residuals, a Gaussian mixture model classifies the data into two classes of candidates: normal data and abnormalities.
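
The coarse step can be illustrated with a minimal sketch. It is written in Python rather than the package's MATLAB, and it is not the package's implementation: the function names, the initialization (one component at the smallest residual, one at the largest), and the rule that the component with mean closest to zero is the "normal" class are all assumptions made for illustration. The residuals are taken as given, since the kinetic model itself is not reproduced here.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def classify_residuals(residuals, n_iter=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM; return 1 (normal) or 0 (abnormal)."""
    n = len(residuals)
    # Assumed initialization: components start at the extreme residuals.
    means = [min(residuals), max(residuals)]
    overall_mean = sum(residuals) / n
    overall_var = max(sum((r - overall_mean) ** 2 for r in residuals) / n, min_var)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in residuals]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each residual.
        for i, r in enumerate(residuals):
            p = [weights[k] * normal_pdf(r, means[k], variances[k]) for k in (0, 1)]
            total = p[0] + p[1]
            if total > 0.0:
                resp[i] = [p[0] / total, p[1] / total]
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in (0, 1):
            nk = sum(row[k] for row in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / n
            means[k] = sum(row[k] * r for row, r in zip(resp, residuals)) / nk
            var = sum(row[k] * (r - means[k]) ** 2 for row, r in zip(resp, residuals)) / nk
            variances[k] = max(var, min_var)  # floor avoids a collapsed component

    # Assumption: the component centered nearest zero holds the normal data.
    normal = 0 if abs(means[0]) <= abs(means[1]) else 1
    return [1 if row[normal] >= 0.5 else 0 for row in resp]


# Hypothetical residuals: ten small ones and three large negative ones.
example = [0.01, -0.02, 0.0, 0.015, -0.01, 0.005, -0.005, 0.02, -0.015, 0.0,
           -0.30, -0.28, -0.33]
labels = classify_residuals(example)
```

The labels follow the convention of the output matrix described in the user guide (1 for normal, 0 for abnormal); at this stage they are only candidates that the refinement step may overturn.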

In the refinement step, the distributional pattern of the abnormalities across the whole dataset is learned from their projection into principal component space (a feature space that maximizes the differences between normal data and abnormalities). In this space, a consensus obtained from the k-nearest-neighbors algorithm (based on the assumption that data points with similar feature values should share the same class membership) refines the classification of normal data and abnormalities. Using the refined memberships, localized regions centered on abnormalities are defined, so that each region contains both normal data and abnormalities. Within each region, an Expectation-Maximization process iteratively refines the class membership of the data points: the data in the region's normal class are used to fit regression curves, and the class membership of every value is then reassigned according to how well it fits those curves.
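
The fit-and-reassign loop inside a single region might look like the following Python sketch. Several things here are assumptions rather than the package's behavior: a straight line stands in for the kinetic model, a point counts as normal when its absolute residual is within `factor` times the residual standard deviation of the current normal class, and the names `fit_line` and `refine_region` are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def refine_region(xs, ys, labels, factor=1.96, max_iter=20):
    """Alternate between fitting on the normal class and reassigning every point."""
    labels = list(labels)
    for _ in range(max_iter):
        normal = [i for i, label in enumerate(labels) if label == 1]
        if len(normal) < 3:
            break  # too few points left to fit a line and estimate its spread
        # Fit the stand-in regression curve on the current normal class only.
        slope, intercept = fit_line([xs[i] for i in normal], [ys[i] for i in normal])
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        sigma = math.sqrt(sum(residuals[i] ** 2 for i in normal) / (len(normal) - 2))
        sigma = max(sigma, 1e-9)  # guard against a perfect fit
        # Reassign every point in the region by how well it fits the curve.
        new_labels = [1 if abs(r) <= factor * sigma else 0 for r in residuals]
        if new_labels == labels:
            break  # memberships are stable
        labels = new_labels
    return labels


# Hypothetical region: index 4 is a real dip, index 5 was wrongly flagged earlier.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0.700, 0.691, 0.679, 0.671, 0.460, 0.650, 0.641, 0.629, 0.620, 0.611]
coarse = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
refined = refine_region(xs, ys, coarse)
```

In this example the loop restores the point that the earlier stage flagged by mistake while keeping the real dip marked as abnormal, which is the behavior the refinement step is meant to provide.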

After every region has finished its EM process, consensus voting resolves membership conflicts between overlapping regions. Dynamic Filter outputs the result once the consensus votes converge.
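
One simple way to picture the voting is a per-point majority over all regions that contain the point. The Python sketch below is an illustration only: the text does not specify the voting rule, so the majority rule, the tie-breaking in favor of normal, the treatment of points outside every region as normal, and the name `consensus_vote` are all assumptions.

```python
from collections import defaultdict


def consensus_vote(n_points, region_votes):
    """Combine per-region labels (dicts of index -> 0/1) into one label per point."""
    ballots = defaultdict(list)
    for votes in region_votes:
        for index, label in votes.items():
            ballots[index].append(label)

    labels = []
    for i in range(n_points):
        cast = ballots.get(i)
        if not cast:
            labels.append(1)  # assumption: points outside every region stay normal
        else:
            # Assumed rule: majority vote, with ties resolved as normal.
            labels.append(1 if 2 * sum(cast) >= len(cast) else 0)
    return labels


# Three hypothetical overlapping regions over six data points.
regions = [
    {1: 1, 2: 0, 3: 1},
    {2: 0, 3: 0, 4: 1},
    {3: 1, 4: 1},
]
final = consensus_vote(6, regions)
```

Here point 2 is abnormal in every region that contains it, while point 3 is outvoted two to one and stays normal.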

The results show our algorithm can identify most of the abnormalities in both real and synthetic datasets. In summary, our model has the following advantages:

1. Dynamic Filter is the first work to integrate biological constraints with time-series phenotype data in a unified framework for abnormality detection.

2. Our model can identify both abnormalities and biological discoveries using a residual analysis model.

3. Unlike other residual analyses, our model does not require a precise theoretical curve.

User Guide

Dynamic Filter is implemented in MATLAB 2013b and uses the Statistics Toolbox. For users who do not have MATLAB installed, the package can be compiled into a standalone executable or a Java class. Users may either install the standalone executable and run it from the command line, or load the Java class into a project in a Java-capable IDE and call Dynamic Filter from there.

In the current version, Dynamic Filter takes the following files as inputs:

-- a text file of the time-streaming variable (light intensity data in the current paper)
-- a text file of the time-streaming observed data to be cleaned (Phi2 data in the current paper)
-- a text file of parameter configuration

The package outputs the cleaned data in the same format as the input text file of time-streaming observed data, with abnormal values replaced by NaN.

1. Format of input files

a) Text file of the time-streaming variable:

The first row contains all the time points, separated by tabs.
The second row contains the variable values, separated by tabs.
t1 t2 t3 ... tn
v1 v2 v3 ... vn

For example:
8 9 10 11 ...
100 200 300 350 ...
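
To make the layout concrete, here is a minimal Python sketch of a reader for this two-row format. It is not part of the package (which is written in MATLAB); the function name `parse_variable_file` is hypothetical, and the sketch assumes every field is numeric and that the two rows have equal length.

```python
def parse_variable_file(text):
    """Parse the two-row, tab-separated variable file into (times, values)."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if len(rows) != 2:
        raise ValueError("expected exactly two rows: time points and values")
    times = [float(field) for field in rows[0]]
    values = [float(field) for field in rows[1]]
    if len(times) != len(values):
        raise ValueError("time points and values must have the same length")
    return times, values


# Hypothetical file content with four time points.
times, light = parse_variable_file("8\t9\t10\t11\n100\t200\t300\t350\n")
```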

b) Text file of the time-streaming observed data to be cleaned:
The first row starts with "Time", followed by the time point values, separated by tabs.
Each following row represents one experiment:
Time 8 9 10 11 ...
Row1 0.7 0.67 0.65 0.64 ...
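
A reader for this format could be sketched as follows, again in Python for illustration only. The name `parse_observed_file` is hypothetical, and the sketch assumes that the first field of each experiment row is a label and that every row has one value per time point.

```python
def parse_observed_file(text):
    """Parse the observed-data file into (times, {row label: values})."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    if header[0] != "Time":
        raise ValueError('the first row must start with "Time"')
    times = [float(field) for field in header[1:]]

    experiments = {}
    for line in lines[1:]:
        fields = line.split("\t")
        values = [float(field) for field in fields[1:]]
        if len(values) != len(times):
            raise ValueError("row %r does not match the time points" % fields[0])
        experiments[fields[0]] = values
    return times, experiments


# Hypothetical file content with one experiment.
obs_times, phi2 = parse_observed_file("Time\t8\t9\t10\t11\nRow1\t0.7\t0.67\t0.65\t0.64\n")
```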

c) Text file of parameter configuration:
The configuration text file contains the parameters used in the cleaning process. Each row configures one parameter: the parameter name, followed by the parameter value, separated by a tab.
training_range 5
abnormal_Region_Interval 5
confidence fator 1.96
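
As an illustration of the name/value layout, the following Python sketch reads such a file into a dictionary. The function name `parse_config` is hypothetical, and the sketch assumes that every value is numeric and that the first tab on a row separates the name from the value.

```python
def parse_config(text):
    """Parse 'name<TAB>value' rows into a dict of floats."""
    params = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank rows
        name, _, value = line.partition("\t")
        params[name.strip()] = float(value)
    return params


# Hypothetical file content using two of the parameter names listed above.
config = parse_config("training_range\t5\nabnormal_Region_Interval\t5\n")
```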

2. In the output, a matrix represents the membership of each value as normal or abnormal:
0 - abnormality, 1 - normal value
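
The relation between the membership matrix and the cleaned data can be shown with a short Python sketch. It is an illustration, not the package's code: the name `mask_abnormal` is hypothetical, and both inputs are assumed to be lists of rows with identical shapes.

```python
def mask_abnormal(data, membership):
    """Replace values whose membership is 0 (abnormal) with NaN."""
    return [
        [value if flag == 1 else float("nan") for value, flag in zip(data_row, flag_row)]
        for data_row, flag_row in zip(data, membership)
    ]


# Hypothetical experiment row in which the third value was classified as abnormal.
cleaned = mask_abnormal([[0.7, 0.67, 0.2, 0.64]], [[1, 1, 0, 1]])
```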

3. Main function/class to be called
The main function is called as follows:
[Output_Matrix,~,~]=Supervised_Cleaning(fLight,fPhi2,fParas,dirOutput)
Here, fLight is the path to the light file, fPhi2 is the path to the Phi2 file, and fParas is the path to the parameters file. dirOutput is an optional argument: if it is set to a directory path, the output matrix is written to that directory as a text file.


Source code and example files are here.