Plant phenomics, the collection of large-scale plant phenotype data, is growing exponentially, and the resulting resources have become an essential component of modern plant science. Such complex data sets are critical for understanding the mechanisms governing energy intake and storage in plants, which in turn is essential for improving crop productivity. However, a major issue facing these efforts is determining the quality of phenotypic data. Automated methods are needed to identify and characterize alterations caused by system errors, which are difficult to remove in the data collection step, and to distinguish them from more interesting cases of altered biological responses.
As a step towards solving this problem, we have developed a coarse-to-refined model called Dynamic Filter, which identifies abnormalities in plant photosynthesis phenotype data by comparing light responses of photosynthesis using a simplified kinetic model. Dynamic Filter employs an Expectation-Maximization process to adjust the kinetic model in coarse and refined regions to identify both abnormalities and biological outliers. The experimental results show that our algorithm can effectively identify most of the abnormalities in both the real and synthetic datasets.
Dynamic Filter integrates coarse-to-refined residual analysis algorithms and employs an Expectation-Maximization process to repeatedly classify the temporal data into two groups: abnormalities and normal data.
At the coarse level, residuals are computed from the regression model. Based on these residuals, a Gaussian mixture model classifies the data into two classes of candidates: normal data and abnormalities.
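The coarse step can be illustrated with a minimal sketch. It assumes a straight-line least-squares fit as a stand-in for the kinetic model, a two-component one-dimensional Gaussian mixture fitted by EM on the residuals, and the convention that the wider component represents abnormalities. The function names, the initialization, and the synthetic data are all hypothetical and are not taken from Dynamic Filter itself.

```python
import math

def fit_line(xs, ys):
    """Least-squares line; a stand-in for the kinetic model (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def gauss_pdf(r, mean, var):
    return math.exp(-((r - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def abnormal_posteriors(residuals, n_iter=50, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM and return, for each
    residual, the posterior probability of the wider component."""
    n = len(residuals)
    mean = sum(residuals) / n
    var = max(sum((r - mean) ** 2 for r in residuals) / n, var_floor)
    # Hypothetical initialization: one narrow and one broad component.
    means, variances, weights = [mean, mean], [var / 4, var * 4], [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in residuals]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each residual.
        for i, r in enumerate(residuals):
            p = [weights[k] * gauss_pdf(r, means[k], variances[k]) for k in (0, 1)]
            total = p[0] + p[1]
            resp[i] = [p[0] / total, p[1] / total] if total > 0 else [0.0, 1.0]
        # M-step: re-estimate weight, mean and variance of each component.
        for k in (0, 1):
            nk = sum(row[k] for row in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(row[k] * r for row, r in zip(resp, residuals)) / nk
            spread = sum(row[k] * (r - means[k]) ** 2 for row, r in zip(resp, residuals))
            variances[k] = max(spread / nk, var_floor)
            weights[k] = nk / n
    wide = 0 if variances[0] > variances[1] else 1
    return [row[wide] for row in resp]

# Synthetic series with three injected errors (illustrative data only).
xs = list(range(30))
ys = [2.0 * x + 1.0 + 0.1 * math.sin(x) for x in xs]
for i, err in {5: 15.0, 17: -12.0, 24: 20.0}.items():
    ys[i] += err

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
posteriors = abnormal_posteriors(residuals)
flagged = [i for i, p in enumerate(posteriors) if p > 0.5]
print(flagged)
```

The flagged indices are only candidates at this stage; in the method described here they are revisited by the refinement step.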
In the refinement step, the distributional pattern of the abnormalities across the whole dataset is learned from their projection in principal component space (the feature space that maximizes the differences between normal data and abnormalities). In this space, consensus obtained from the k-nearest-neighbors algorithm (based on the assumption that data points with similar feature values should be assigned the same class membership) refines the membership classification of normal data and abnormalities. With the refined membership of data points, localized regions centered on abnormalities are defined, each containing both normal data and abnormalities. In each region, an Expectation-Maximization process is applied to iteratively refine the class membership of the data points: regression curves are fitted to the data in the region's normal class, and the class membership of every value is reassigned based on its fit to the curves.
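Two pieces of the refinement step can be sketched under simplifying assumptions. The first function takes points that are assumed to be already projected into a low-dimensional feature space and replaces each label with the majority label of its k nearest neighbors. The second alternates between fitting a curve to the currently normal points of one region and reassigning every point by its distance to that curve. The straight-line fit, the fixed tolerance `tol`, and all names are hypothetical choices for illustration, not the fitness criterion used by Dynamic Filter.

```python
import math

def knn_consensus(points, labels, k=3):
    """Relabel each point by majority vote of its k nearest neighbors
    (label 1 = abnormal, 0 = normal). Votes use the original labels."""
    refined = []
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        votes = sum(labels[j] for _, j in others[:k])
        refined.append(1 if 2 * votes > k else 0)
    return refined

def fit_line(xs, ys):
    """Least-squares line; a stand-in for the regression curve (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def refine_region(xs, ys, labels, tol=3.0, max_iter=20):
    """Alternate between fitting on the normal class and reassigning labels
    until the labels stop changing. `tol` is a hypothetical fixed threshold."""
    labels = list(labels)
    for _ in range(max_iter):
        normal = [i for i, lab in enumerate(labels) if lab == 0]
        if len(normal) < 2:
            break
        slope, intercept = fit_line([xs[i] for i in normal], [ys[i] for i in normal])
        new_labels = [1 if abs(y - (slope * x + intercept)) > tol else 0
                      for x, y in zip(xs, ys)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Two clusters in feature space, each with one label that disagrees with its neighbors.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
coarse = [0, 0, 0, 0, 1, 1, 1, 1, 0]
smoothed = knn_consensus(points, coarse, k=3)

# One local region: a line with a single corrupted value at index 4,
# started from labels that wrongly flag index 2 instead.
xs = list(range(10))
ys = [3.0 * x for x in xs]
ys[4] += 10.0
region_labels = refine_region(xs, ys, [0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
print(smoothed, region_labels)
```

An odd `k` avoids tied votes in the two-class case. The iteration in `refine_region` follows the same fit-then-reassign pattern as the EM process described above, but uses hard labels rather than posterior probabilities.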
After each region finishes the EM process, consensus voting is applied to resolve conflicts of membership between overlapping regions. Dynamic Filter outputs the result when the consensus votes converge.
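Conflict resolution between overlapping regions can be sketched as a simple majority vote. The example assumes that each region reports a mapping from data point index to label (1 for abnormal, 0 for normal) and that a tie is resolved as normal; both the data layout and the tie rule are assumptions made for illustration.

```python
from collections import defaultdict

def consensus_vote(region_labels):
    """Combine per-region labels into one label per data point by majority vote.
    Ties are resolved as normal (0), which is a hypothetical choice."""
    votes = defaultdict(list)
    for labels in region_labels:
        for idx, lab in labels.items():
            votes[idx].append(lab)
    return {idx: 1 if 2 * sum(v) > len(v) else 0 for idx, v in sorted(votes.items())}

# Three overlapping regions that disagree about point 3.
regions = [
    {1: 0, 2: 1, 3: 1},
    {2: 1, 3: 0, 4: 0},
    {3: 0, 4: 0, 5: 1},
]
final = consensus_vote(regions)
print(final)
```

In an outer loop, one way to test for convergence would be to rerun the regional refinement with the voted labels and stop once the voted result equals the one from the previous round.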
The results show our algorithm can identify most of the abnormalities in both real and synthetic datasets. In summary, our model has the following advantages:
1. Dynamic Filter is the first work to integrate biological constraints with time-series phenotype data in a unified framework for abnormality detection.
2. Our model can identify both abnormalities and biological discoveries using a residual analysis model.
3. Unlike other residual analyses, our model does not require a precise theoretical curve.