PPDA Year Three
What are the major goals of the project?
The overarching goal of the PPDA project
is to identify genes and processes that control
photosynthesis efficiency in response to fluctuating
environmental conditions, which are critical for
understanding and improving plant energy storage and
improving crop productivity. To achieve this, we proposed to
discover, develop, and apply Plant Phenomics Data Analytics
(PPDA) solutions, such that massive phenomics data is
transformed into knowledge or testable hypotheses to
identify important genes to improve photosynthesis
efficiency under dynamic environmental conditions.
The proposed research comprises four aims. Aim 1: Develop and apply a phenomics data quality control program to identify abnormal data and to distinguish whether they arise from noise, artifacts, or more interesting cases of altered biological responses. Aim 2: Develop and apply phenomics pattern discovery algorithms to identify important energy-related genes from photosynthesis phenomics data. Aim 3: Develop a data visualization package for displaying complex phenomics data. Aim 4: Provide proof of utility by applying PPDA to generate and test functional (biophysical, biochemical, or physiological) hypotheses on photosynthesis efficiency using natural variation. In summary, PPDA will ensure high data quality, identify and visualize important genes from complex plant phenomics data, and advance knowledge discovery in the broader community.
What was accomplished under these goals?
Emergent phenotype discovery
We completed the emergent phenotype discovery aim. Rapid improvements in phenotyping
capability, accuracy, and throughput have greatly increased
the volume and diversity of phenomics data. A computational
challenge is to efficiently identify phenotypic patterns to
improve our understanding of the quantitative variation of
complex phenotypes, and to attribute gene functions. To
address this challenge, we developed a new algorithm to
identify emerging phenomena from large-scale temporal plant
phenotyping experiments. An emerging phenomenon is defined as
a group of genotypes that exhibit a coherent phenotype
pattern over a relatively short period. Emerging phenomena
are highly transient and diverse, and are dependent in
complex ways on both environmental conditions and
development. Identifying emerging phenomena may help
biologists to examine potential relationships among
phenotypes and genotypes in a genetically diverse population
and to associate such relationships with the change of
environments or development. We present an emerging
phenomenon identification tool called Temporal Emerging
Phenomenon Finder (TEP-Finder). Using large-scale
longitudinal phenomics data as input, TEP-Finder first
encodes the complicated phenotypic patterns into a dynamic
phenotype network. Then, emerging phenomena at different
temporal scales are identified from the dynamic phenotype network
using a maximal-clique-based approach. Meanwhile, a directed
acyclic network of emerging phenomena is composed to model
the relationships among the emerging phenomena. An
experiment comparing TEP-Finder with two state-of-the-art
algorithms shows that the emerging phenomena identified by
TEP-Finder are more functionally specific, robust, and
biologically significant.
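The maximal-clique step can be illustrated with a minimal sketch. It is not the TEP-Finder implementation: the edge rule (two genotypes are linked when their values stay within a tolerance at every time point of a window), the tolerance, and the toy data are assumptions made only for illustration. Maximal cliques are enumerated with the standard Bron-Kerbosch algorithm with pivoting.

```python
from itertools import combinations


def build_window_graph(series, tol):
    """Link two genotypes if their values differ by less than `tol`
    at every time point of the window (hypothetical coherence rule)."""
    adj = {g: set() for g in series}
    for g, h in combinations(series, 2):
        if all(abs(a - b) < tol for a, b in zip(series[g], series[h])):
            adj[g].add(h)
            adj[h].add(g)
    return adj


def maximal_cliques(adj, min_size=2):
    """Enumerate maximal cliques with Bron-Kerbosch (with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy window: one photosynthetic parameter at three time points (made-up values).
window = {
    "g1": [0.50, 0.52, 0.55],
    "g2": [0.51, 0.53, 0.54],
    "g3": [0.49, 0.52, 0.56],
    "g4": [0.30, 0.28, 0.25],
    "g5": [0.31, 0.27, 0.26],
}
groups = maximal_cliques(build_window_graph(window, tol=0.05))
```

Each returned clique is a candidate group of genotypes with a coherent pattern in that window. Repeating the procedure over windows of different lengths would give candidates at different temporal scales.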
Furthermore, we developed a framework that allows researchers to use TEP-Finder in their own plant phenotyping experiments. As a demonstration, we applied TEP-Finder to the identification of quantitative trait loci (QTLs) in cowpea (Vigna unguiculata) associated with the robustness and efficiency of photosynthesis, with the aim of increasing productivity. Cowpea, an annual herbaceous legume, is an important crop in the semi-arid regions of Africa. Understanding and improving the productivity and robustness of plant photosynthesis requires high-throughput phenotyping under environmental conditions that are relevant to the field. In this project, three key photosynthetic parameters of 79 cowpea genotypes (3-5 biological replicates each) were measured every 30 minutes during the daytime over three days under changing temperature and light intensity, resulting in large-scale temporal phenomics data. Because photosynthesis depends on both time and environmental conditions, it is difficult to use the temporal phenomics data directly without relating them to the changes in environmental conditions. To identify QTLs associated with responses to temperature changes, we first identified a set of emergent phenotypes using TEP-Finder. Then, we developed a new probabilistic model to estimate secondary phenotypes that describe how likely a plant is to respond to a change in environmental conditions. This model has three steps: 1) identify plants with significant coherent phenotype patterns over specified continuous time periods; 2) for each plant, train a Hidden Markov Model (HMM) to obtain the transition probabilities between the stressed and unstressed states; and 3) identify the important QTLs using the transition probabilities of all genotypes. Experimental results on synthetic data show that, with the emergent phenotypes identified by TEP-Finder, our model is more accurate than traditional models that use phenotype data directly.
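As an illustration of the second step, the sketch below fits a two-state HMM to a single plant with the Baum-Welch algorithm and reads off the transition probabilities. It is a sketch under stated assumptions, not the model used in the project: the observations are assumed to be discretized to two symbols (0 = typical response, 1 = depressed response), and the initial parameters and the toy sequence are hypothetical.

```python
def baum_welch(obs, A, B, pi, n_iter=20):
    """Re-estimate a discrete HMM from one observation sequence.

    obs: list of symbol indices; A: state transition matrix;
    B: emission matrix (state x symbol); pi: initial state distribution.
    """
    n, T, n_sym = len(pi), len(obs), len(B[0])
    for _ in range(n_iter):
        # Scaled forward pass: alpha[t][i] = P(state_t = i | obs[0..t]).
        alpha, scale = [], []
        for t in range(T):
            if t == 0:
                raw = [pi[i] * B[i][obs[0]] for i in range(n)]
            else:
                raw = [sum(alpha[t - 1][i] * A[i][j] for i in range(n))
                       * B[j][obs[t]] for j in range(n)]
            c = sum(raw)
            scale.append(c)
            alpha.append([r / c for r in raw])
        # Scaled backward pass.
        beta = [[1.0] * n for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in range(n):
                beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                                 for j in range(n)) / scale[t + 1]
        # Posterior state probabilities and expected transition counts.
        gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
        xi = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                   / scale[t + 1] for t in range(T - 1))
               for j in range(n)] for i in range(n)]
        # M-step: normalize the expected counts.
        A = [[xi[i][j] / sum(xi[i]) for j in range(n)] for i in range(n)]
        B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T)) for k in range(n_sym)]
             for i in range(n)]
        pi = gamma[0][:]
    return A, B, pi


# Hypothetical discretized measurements for one plant (state 0 = unstressed, 1 = stressed).
observations = [0] * 10 + [1] * 10 + [0] * 10 + [1] * 10
A0 = [[0.7, 0.3], [0.3, 0.7]]   # assumed initial transition matrix
B0 = [[0.8, 0.2], [0.2, 0.8]]   # assumed initial emission matrix
A_hat, B_hat, pi_hat = baum_welch(observations, A0, B0, [0.5, 0.5])
p_enter_stress = A_hat[0][1]    # unstressed -> stressed
p_recover = A_hat[1][0]         # stressed -> unstressed
```

The fitted values `p_enter_stress` and `p_recover` would then play the role of per-plant secondary phenotypes passed to the QTL analysis in the third step.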
In this project, the first manuscript has been published in Bioinformatics, and the second manuscript is under review.
High-dimensional phenotype data analysis using deep learning
Phenotype data can be collected from multiple sources, including imaging devices and diagnostic devices. A phenotypic dataset is therefore usually high-dimensional, which makes it difficult to visualize, manage, manipulate, and analyze. It is often plausible to assume that high-dimensional phenotype data are controlled by, or generated from, underlying low-dimensional phenotypic semantic factors (e.g., the color and smoothness of peas). Because the phenotypic semantic factors are compact and interpretable, functions that map between the factors and the data are useful for phenotype data analysis. For example, high-dimensional data can be projected to the low-dimensional factor space for visualization. We can also modify the value of a specific factor and observe its effect by mapping from the factor space back to the data space to generate new data. This is especially useful if the factor space can be recovered without any ground-truth labels. In this project, we approximate those factors by learning a disentangled variational latent representation in an unsupervised way (without ground-truth phenotypic semantic factor values). Specifically, we propose a neural network model called the Lie group auto-encoder (LGAE) to learn the latent representation.
We developed a new method to analyze phenotype data by exploring a latent space that approximately aligns with the phenotypic semantic factor space, including functionality to project data points to latent vectors and to generate new data points from latent vectors. We assume that phenotype data x are generated by a two-step process. First, a multivariate latent random variable z is sampled from a prior distribution P(z). Second, the phenotype data x are sampled from the conditional distribution P(x|z). We model the prior as a standard Gaussian distribution. We parametrize the conditional distribution P(x|z) with a deep neural network, the so-called decoder. Similarly, the distribution P(z|x) is approximated by a variational distribution Q(z|x), again parametrized by a deep neural network. Q(z|x) is modeled as a diagonal multivariate Gaussian distribution N(μ,σ), which is represented as a transformation matrix in a Lie group. The neural network for Q(z|x) is further divided into two parts: the first part is the so-called encoder, which outputs vectors in the tangent Lie algebra of the Lie group; the second part is an exponential mapping layer, which projects a vector from the Lie algebra to the Lie group. Fig. 1 illustrates the model.
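The exponential mapping layer can be made concrete with a small sketch for a single latent dimension. The report does not state which Lie group is used, so the choice below is an assumption: a Gaussian with mean μ and standard deviation σ is identified with the affine map z = σε + μ, written as the matrix [[σ, μ], [0, 1]]. The encoder output (a, b) is then read as the Lie algebra element [[a, b], [0, 0]], whose matrix exponential has a closed form. All function names are hypothetical.

```python
import math
import random


def exp_map(a, b):
    """Matrix exponential of [[a, b], [0, 0]], returned as (sigma, mu)
    of the group element [[sigma, mu], [0, 1]]."""
    sigma = math.exp(a)
    # (e^a - 1) / a tends to 1 as a tends to 0, so mu tends to b.
    mu = b if abs(a) < 1e-8 else b * math.expm1(a) / a
    return sigma, mu


def sample_latent(a, b, rng):
    """Apply the group element to a standard normal draw: z = sigma * eps + mu."""
    sigma, mu = exp_map(a, b)
    return sigma * rng.gauss(0.0, 1.0) + mu


# Hypothetical encoder output for one latent dimension.
rng = random.Random(0)
z = sample_latent(a=-0.5, b=1.2, rng=rng)
```

Under this parametrization σ is positive for any encoder output, and the zero vector of the Lie algebra maps to the identity element, which corresponds to the standard Gaussian prior.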

Figure 1: LGAE model
To train the neural networks, we minimize the loss function L = L_E + λL_LG, where L_E is the reconstruction loss, which measures the distance between the input data x and the corresponding recovered data x̂. The Lie group loss L_LG measures the distance between Q(z|x) and the prior distribution. The hyper-parameter λ controls the balance between reconstruction accuracy and disentanglement: a large value of λ encourages disentanglement at the cost of reconstruction accuracy.
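The structure of the objective L = L_E + λL_LG can be sketched as follows. The exact distances used in the project are not given here, so two stand-ins are assumed: mean squared error for L_E, and the closed-form KL divergence between a diagonal Gaussian N(μ,σ) and the standard Gaussian prior in place of L_LG. The sketch shows only how λ trades off the two terms.

```python
import math


def reconstruction_loss(x, x_hat):
    """Mean squared error between input and reconstruction (assumed form of L_E)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), used here as a stand-in for L_LG."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


def total_loss(x, x_hat, mu, sigma, lam):
    """L = L_E + lambda * L_LG."""
    return reconstruction_loss(x, x_hat) + lam * kl_to_standard_normal(mu, sigma)


# Hypothetical values for one sample with a two-dimensional latent.
x, x_hat = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
mu, sigma = [0.3, -0.2], [0.9, 1.1]
loss_small_lam = total_loss(x, x_hat, mu, sigma, lam=1.0)
loss_large_lam = total_loss(x, x_hat, mu, sigma, lam=4.0)
```

With the same reconstruction, a larger λ gives more weight to the mismatch between Q(z|x) and the prior, which is the trade-off described above.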
Modeling temporal phenotype data using deep learning
Phenotype data are composites of an organism's observable characteristics or traits, including morphology (physical form and structure), developmental processes, biochemical and physiological properties, and behaviors. Phenotype data are common in the real world, and much of the research on phenotype data analysis is carried out in biology and medicine. One difficulty in phenotype data analysis is modeling its high temporal complexity. Leveraging recently developed deep learning models, we developed a new tool called LSTMAE+SVM for modeling long temporal phenotype data.
Phenotype data are often represented as long sequences with measurements under dynamic environmental conditions over a long period. Extracting robust features from long temporal sequences is critical for developing phenotype data analysis methods.
Recent work in deep learning has produced several auto-encoding techniques that are well suited to global feature extraction and dimension reduction. Among them, the LSTMAE (an auto-encoder based on LSTM) extracts global features from long sequences. The LSTMAE approach has been tested on existing signal data, such as EEG, for outcome classification. In our experiment, LSTMAE was pre-trained on previously saved phenotype data. Using the cross-entropy loss in the decoder, we verified whether the LSTMAE was well trained on the newly collected phenotype data. Finally, LSTMAE was used to extract global features from new phenotype segments, which were then used as inputs to an SVM model to further extract features critically related to the class labels. Our experimental results showed that the performance of the SVM improved significantly, with sensitivity, specificity, F1 score, precision, and accuracy of 0.6300, 0.6100, 0.6238, 0.6176, and 0.6200, respectively.
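For reference, the five reported metrics are all functions of the binary confusion matrix, as the sketch below shows. The counts are hypothetical: the report does not give the test-set size or class balance, and a balanced set of 100 positives and 100 negatives was chosen only because it is arithmetically consistent with the reported values.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall on the positive class
    specificity = tn / (tn + fp)            # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "precision": precision,
        "accuracy": accuracy,
    }


# Hypothetical counts (not taken from the report).
metrics = classification_metrics(tp=63, fn=37, tn=61, fp=39)
rounded = {name: round(value, 4) for name, value in metrics.items()}
```

With these assumed counts, the rounded values match the reported sensitivity, specificity, F1 score, precision, and accuracy.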