What are the major goals of the project?

The overarching goal of the PPDA project is to identify genes and processes that control photosynthesis efficiency in response to fluctuating environmental conditions, which are critical for understanding and improving plant energy storage and crop productivity. To achieve this, we proposed to discover, develop, and apply Plant Phenomics Data Analytics (PPDA) solutions that transform massive phenomics data into knowledge or testable hypotheses, so that genes important for improving photosynthesis efficiency under dynamic environmental conditions can be identified.

The proposed research comprises four aims:

Aim 1. Develop and apply a phenomics data quality control program to identify abnormal data and to distinguish whether they arise from noise, artifacts, or more interesting cases of altered biological responses.
Aim 2. Develop and apply phenomics pattern discovery algorithms to identify important energy-related genes from photosynthesis phenomics data.
Aim 3. Develop a data visualization package for displaying complex phenomics data.
Aim 4. Provide proof of utility by applying PPDA to generate and test functional (biophysical, biochemical, or physiological) hypotheses on photosynthesis efficiency using natural variation.

In summary, PPDA will ensure high data quality, identify and visualize important genes from complex plant phenomics data, and advance knowledge discovery in the broader community.

What was accomplished under these goals?

Emergent phenotype discovery

We finished the emergent phenotype discovery aim. The rapid improvement of phenotyping capability, accuracy, and throughput has greatly increased the volume and diversity of phenomics data. A computational challenge is to identify phenotypic patterns efficiently, both to improve our understanding of the quantitative variation of complex phenotypes and to attribute gene functions. To address this challenge, we developed a new algorithm to identify emerging phenomena from large-scale temporal plant phenotyping experiments. An emerging phenomenon is defined as a group of genotypes that exhibit a coherent phenotype pattern during a relatively short time. Emerging phenomena are highly transient and diverse, and they depend in complex ways on both environmental conditions and development. Identifying emerging phenomena may help biologists examine potential relationships among phenotypes and genotypes in a genetically diverse population and associate such relationships with changes in environment or development. We present an emerging phenomenon identification tool called Temporal Emerging Phenomenon Finder (TEP-Finder). Using large-scale longitudinal phenomics data as input, TEP-Finder first encodes the complicated phenotypic patterns into a dynamic phenotype network. Then, emerging phenomena at different temporal scales are identified from the dynamic phenotype network using a maximal-clique-based approach. Meanwhile, a directed acyclic network of emerging phenomena is constructed to model the relationships among them. An experiment comparing TEP-Finder with two state-of-the-art algorithms shows that the emerging phenomena identified by TEP-Finder are more functionally specific, robust, and biologically significant.
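The idea of finding coherent genotype groups with a maximal-clique search can be illustrated with a minimal sketch. This is not the TEP-Finder implementation: the way the network is built here (connecting two genotypes when their traces within a time window have a Pearson correlation above a threshold), the function names, the threshold, and the toy traces are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def window_graph(series, start, width, threshold):
    """Assumed network construction: link genotypes whose traces in the
    window [start, start + width) correlate at or above the threshold."""
    adj = {g: set() for g in series}
    for g, h in combinations(series, 2):
        r = pearson(series[g][start:start + width], series[h][start:start + width])
        if r >= threshold:
            adj[g].add(h)
            adj[h].add(g)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def coherent_groups(series, start, width, threshold=0.9, min_size=2):
    """Return maximal cliques of at least min_size genotypes in one window."""
    adj = window_graph(series, start, width, threshold)
    return sorted(sorted(c) for c in maximal_cliques(adj) if len(c) >= min_size)


# Made-up traces for four hypothetical genotypes at six time points.
traces = {
    "A": [1.0, 2.0, 3.0, 4.0, 1.0, 0.5],
    "B": [2.0, 4.1, 6.0, 8.2, 3.0, 9.0],
    "C": [0.5, 1.1, 1.4, 2.1, 7.0, 2.0],
    "D": [4.0, 3.0, 2.0, 1.0, 5.0, 5.0],
}
groups = coherent_groups(traces, start=0, width=4)
print(groups)
```

Sliding the window (varying `start` and `width`) would yield groups at different temporal scales; how TEP-Finder links such groups into a directed acyclic network is not reproduced here.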

    Furthermore, we developed a framework that allows researchers to use TEP-Finder in their plant phenotyping experiments. As a demonstration, we applied TEP-Finder to the identification of quantitative trait loci (QTLs) in cowpea (Vigna unguiculata) for improving the robustness and efficiency of photosynthesis, leading to increased productivity. Cowpea, an annual herbaceous legume, is an important crop in the semi-arid regions across Africa. Understanding and improving the productivity and robustness of plant photosynthesis requires high-throughput phenotyping under environmental conditions that are relevant to the field. In this project, three key photosynthetic parameters of 79 cowpea genotypes (3-5 biological replicates each) were measured every 30 minutes during the daytime over three days under changing temperature and light intensity, resulting in large-scale temporal phenomics data. Since photosynthesis depends on time and environmental conditions, it is difficult to use the temporal phenomics data directly without relating them to the changes in environmental conditions. To identify QTLs associated with responses to temperature changes, we first identified a set of emergent phenotypes using TEP-Finder. Then, we developed a new probabilistic model to estimate secondary phenotypes that describe how likely a plant is to respond to a change in environmental conditions. This model has three steps: 1) identify plants with significant coherent phenotype patterns over specified continuous time periods, 2) for any given plant, train a Hidden Markov Model (HMM) to obtain all the transition probabilities between the stressed state and the unstressed state, and 3) identify the important QTLs using the transition probabilities of all genotypes. The experimental results on synthetic data show that, with the emergent phenotypes identified using TEP-Finder, our model is more accurate than traditional models that use phenotype data directly.
In this project, the first manuscript has been published in Bioinformatics, and the second manuscript is under review.
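To make the role of the transition probabilities in step 2 concrete, the following simplified sketch estimates them by counting transitions in a state sequence that is assumed to be already labeled. This is a deliberate simplification and not the project's model: in an HMM the stressed and unstressed states are hidden and the probabilities would be fitted with an algorithm such as Baum-Welch. The state codes, the pseudocount, and the example sequence are hypothetical.

```python
from collections import Counter

# Hypothetical state codes: "U" = unstressed, "S" = stressed.
STATES = ("U", "S")


def transition_probabilities(states, pseudocount=1.0):
    """Estimate P(next state | current state) by counting adjacent pairs.

    Assumes the state sequence is observed; with hidden states an HMM
    training procedure would be needed instead.
    """
    counts = Counter(zip(states, states[1:]))
    probs = {}
    for src in STATES:
        total = sum(counts[(src, dst)] for dst in STATES) + pseudocount * len(STATES)
        for dst in STATES:
            if total == 0:
                # No evidence for this source state: fall back to uniform.
                probs[(src, dst)] = 1.0 / len(STATES)
            else:
                probs[(src, dst)] = (counts[(src, dst)] + pseudocount) / total
    return probs


# Made-up half-hourly state labels for one plant.
labels = list("UUUSSUUSSS")
probs = transition_probabilities(labels, pseudocount=0.0)

# A per-plant feature vector that could serve as a secondary phenotype.
features = [probs[("U", "S")], probs[("S", "U")]]
print(features)
```

Under this reading, each genotype is summarized by a small vector of transition probabilities, and step 3 would test genetic markers for association with those values rather than with the raw time series.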

High-dimensional phenotype data analysis using deep learning

Phenotype data can be collected from multiple sources, including imaging devices and diagnostic devices. A phenotypic dataset is therefore usually high-dimensional, which makes it difficult to visualize, manage, manipulate, and analyze. Often, it is plausible to assume that high-dimensional phenotype data are controlled by or generated from underlying low-dimensional phenotypic semantic factors (e.g., color and smoothness of peas). Since the phenotypic semantic factors are compact and interpretable, functions that map between the factors and the data are useful for addressing phenotype data analysis problems. For example, high-dimensional data can be projected to the low-dimensional factor space for visualization. We can also modify the value of a specific factor and observe its effect on the data by mapping from the factor space back to the data space to generate new data. This is especially useful if we can recover the factor space without any ground truth labels. In this project, we approximate those factors by learning a disentangled variational latent representation in an unsupervised way (without ground truth phenotypic semantic factor values). Specifically, we propose a neural network model called the Lie group auto-encoder (LGAE) to learn this latent representation.

    We developed a new method to analyze phenotype data by exploring a latent space that approximately aligns with the phenotypic semantic factor space, including functionality to project data points to latent vectors and to generate new data points from latent vectors. We assume that phenotype data x are generated by a two-step generative process. First, a multivariate latent random variable z is sampled from a prior distribution P(z). Second, the phenotype data x are sampled from the conditional distribution P(x|z). We model the prior distribution as a standard Gaussian distribution. We parametrize the conditional distribution P(x|z) as a deep neural network, the so-called decoder. Similarly, the distribution P(z|x) is approximated using a variational distribution Q(z|x), again parametrized using a deep neural network. Q(z|x) is modeled as a diagonal multivariate Gaussian distribution N(μ,σ), which is represented as a transformation matrix in a Lie group. The neural network of Q(z|x) is further divided into two parts: the first part is the so-called encoder, which outputs vectors in the tangent Lie algebra of the Lie group; the second part is an exponential mapping layer, which projects a vector from the Lie algebra to the Lie group. Fig. 1 gives an illustration of the model.

Figure 1: LGAE model
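The exponential mapping layer can be illustrated for a single latent dimension. The sketch below assumes one particular representation that the report does not spell out: a one-dimensional Gaussian N(μ,σ) is written as the affine matrix [[σ, μ], [0, 1]], and the encoder is assumed to output a Lie algebra element [[a, b], [0, 0]]. The matrix exponential of that element has the closed form used in `exp_map`. The function names and sample values are hypothetical.

```python
import math
import random


def exp_map(a, b):
    """Matrix exponential of [[a, b], [0, 0]], returned as (sigma, mu).

    exp([[a, b], [0, 0]]) = [[e^a, b * (e^a - 1) / a], [0, 1]],
    with the limit b when a tends to 0. sigma = e^a is always positive.
    """
    sigma = math.exp(a)
    scale = math.expm1(a) / a if abs(a) > 1e-12 else 1.0
    return sigma, b * scale


def sample_latent(algebra_vectors, rng):
    """Draw z per dimension by applying the affine map to eps ~ N(0, 1)."""
    z = []
    for a, b in algebra_vectors:
        sigma, mu = exp_map(a, b)
        eps = rng.gauss(0.0, 1.0)
        z.append(sigma * eps + mu)  # z ~ N(mu, sigma^2)
    return z


# Made-up encoder outputs for a three-dimensional latent vector.
encoder_output = [(0.0, 0.0), (math.log(2.0), 1.0), (-1.0, 0.5)]
rng = random.Random(0)
z = sample_latent(encoder_output, rng)
print(z)
```

One appeal of such a parametrization is that the encoder can output unconstrained real numbers while the resulting standard deviation is guaranteed to be positive.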

    To train the neural networks, we minimize the loss function L = L_E + λL_LG, where L_E is the reconstruction loss, which measures the distance between the input data x and the corresponding recovered data x̂. The Lie group loss L_LG measures the distance between Q(z|x) and the prior distribution. The hyper-parameter λ controls the balance between reconstruction accuracy and disentanglement: a large value of λ encourages disentanglement at the cost of reconstruction accuracy.
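The trade-off controlled by λ can be seen in a small numerical sketch of a loss with this two-term structure. The report does not state which distances are used, so two stand-ins are assumed here: mean squared error for the reconstruction term, and the Kullback-Leibler divergence from a diagonal Gaussian to the standard Gaussian prior in place of the Lie group term L_LG. Function names and values are hypothetical.

```python
import math


def reconstruction_loss(x, x_hat):
    """Assumed reconstruction term: mean squared error between x and x_hat."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), used as a stand-in for L_LG.

    Per dimension: 0.5 * (sigma^2 + mu^2 - 1) - ln(sigma).
    """
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))


def total_loss(x, x_hat, mu, sigma, lam):
    """Reconstruction term plus lam times the prior-matching term."""
    return reconstruction_loss(x, x_hat) + lam * kl_to_standard_normal(mu, sigma)


# Made-up values: the same reconstruction error weighted against the same
# posterior under two different settings of lam.
x, x_hat = [0.0, 0.0], [1.0, 1.0]
mu, sigma = [1.0], [1.0]
loss_small_lam = total_loss(x, x_hat, mu, sigma, lam=0.5)
loss_large_lam = total_loss(x, x_hat, mu, sigma, lam=2.0)
print(loss_small_lam, loss_large_lam)
```

With a larger weight, the same deviation of Q(z|x) from the prior costs more, so the optimizer is pushed toward matching the prior even if reconstruction error grows.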
 
Modeling temporal phenotype data using deep learning

Phenotype data are composites of an organism's observable characteristics or traits, including morphology or physical form and structure, developmental processes, biochemical and physiological properties, and behaviors. Phenotype data are common in the real world, and much of the research on phenotype data analysis is carried out in biology and medicine. One difficulty in phenotype data analysis is modeling the high temporal complexity of the data. Leveraging recently developed deep learning models, we developed a new tool called LSTMAE+SVM for modeling long temporal phenotype data.

    Phenotype data are often represented as long sequences of measurements taken under dynamic environmental conditions over a long period. Extracting robust features from long temporal sequences is critical for developing phenotype data analysis methods. In recent deep learning studies, several auto-encoding techniques have been developed that offer advantages in global feature extraction and dimension reduction. Among them, an auto-encoder based on LSTM, called LSTMAE, is used to extract global features from long sequences. The LSTMAE approach has been tested on existing signal data, such as EEG, for outcome classification. In the experiment, LSTMAE was pre-trained on previously saved phenotype data. Using the cross-entropy loss in the decoder, we verified whether the LSTMAE was well trained using the newly collected phenotype data. Finally, LSTMAE was used to extract global features from new phenotype segments, which were then used as inputs to an SVM model to further extract features critically related to class labels. Our experimental results showed that the performance of the SVM improved significantly, with sensitivity, specificity, F1 score, precision, and accuracy of 0.6300, 0.6100, 0.6238, 0.6176, and 0.6200, respectively.
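For reference, the five reported metrics are all derived from the four cells of a binary confusion matrix, as the short sketch below shows. The counts used here are hypothetical: the report does not give the confusion matrix, and these values were chosen only because, under the assumption of 100 positive and 100 negative test segments, they are consistent with the reported figures.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary classification metrics from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),           # recall on the positive class
        "specificity": tn / (tn + fp),           # recall on the negative class
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),       # harmonic mean of precision and sensitivity
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts, assuming a balanced test set of 100 + 100 segments.
metrics = classification_metrics(tp=63, fn=37, tn=61, fp=39)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because sensitivity and specificity are close to each other, the classifier in this illustration errs at a similar rate on both classes rather than favoring one.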
