PPDA stands for Plant Phenotyping and Data
Analysis. By continuously monitoring plant photosynthesis
phenotypes over varying environmental conditions (such as
light intensity change or temperature change), we expect to
discover genomic mechanisms that regulate photosynthetic
efficiency in higher plants. PPDA aims to address three
computational challenges in phenomics data analysis in
general: 1) to maintain high data quality, 2) to discover new
knowledge from ultra-long time-series data, and 3) to
automatically identify and visualize patterns in the raw
spatial-temporal data.
1. Research Summary
Large-scale phenotyping (phenomics) promises to bridge the
gap between genomics, gene functions and traits.
Specifically, to meet our growing needs for food and fuel,
new bio-imaging approaches were developed to allow
high-throughput, detailed plant phenotyping, with a focus on
improving the efficiency of photosynthesis. We aim to
identify genes and processes that control photosynthesis
efficiency in response to fluctuating environmental
conditions, which are critical for understanding and
improving plant energy storage and improving crop
productivity. To achieve this, we must resolve a wide range
of interacting factors that respond to environmental factors
over very wide dynamic ranges of frequency, duration and
intensity of conditions.
Recently, we have developed the Dynamic Environmental
Phenotype Imager (DEPI), a novel platform for monitoring
responses of plant phenotypes under dynamic conditions.
Initial data from DEPI reveal previously unseen effects
attributable to genes formerly thought to have no known
function. While these developments in plant phenotyping are
exciting, we are limited by the tools available to fully
analyze the phenomics data. Removing that limitation is the
goal of this project.
The figure below briefly introduces the four key components
of the phenotyping workflow. First, important plant traits
are captured under simulated environmental conditions.
Second, phenotyping images are processed to compute various
photosynthesis and growth phenotypes at the plant level.
Third, leaf-level photosynthesis, movement and growth are
measured using leaf alignment techniques. Fourth, temporal
and spatial heterogeneity patterns in plant phenotype images
are captured using advanced computer vision techniques
followed by statistical analysis.
We developed and applied Plant Phenomics Data Analytics
(PPDA) solutions, with support from NSF and DOE, such that
massive phenomics data can be transformed into knowledge or
testable hypotheses to identify important genes to improve
photosynthesis efficiency under dynamic environmental
conditions. PPDA ensures high data quality, identifies and
visualizes important genes from complex plant phenomics data,
and will advance knowledge discovery in the broader
community.
The plant phenomics research comprises four
components: 1) to develop, test and apply a phenomics data
quality control program that identifies abnormal data and
distinguishes whether they arise from noise, artifacts or more
interesting cases of altered biological responses; 2) to
develop, test and apply phenomics pattern discovery
algorithms that identify important energy-related genes from
photosynthesis phenomics data, including dynamic
phenotype network construction and phenotype module
discovery algorithms that turn sophisticated phenomics data
into testable hypotheses, discover unknown genes, and
connect biological processes; 3) to develop a data
visualization package for displaying complex phenomics data
using integrative multi-dimensional visualization methods,
in order to facilitate scientific discovery of
energy-related genes that respond to changing environmental
conditions; and 4) to provide proof of utility by applying
PPDA to test dynamic environment-induced
regulation of photosynthesis efficiency.
PPDA source code and phenomics data are available online,
which allows PPDA to be used in classroom and research
settings where students and researchers can get
training and experience in phenomics, bioinformatics, and
plant biology. We welcome collaborations with research
groups working on sensor development, computer vision, or
biomedical problems to solve simulated or real-world
problems. Meanwhile, we will continue to maintain strength
in computer science as a foundation for work in
pheno-informatics projects.
Our world is confronting a confluence of unprecedented challenges in the form of food and energy shortages. The U.N. estimates that food production will have to increase by 50-70% over the next 30-50 years. Plant-derived products are at the center of this grand challenge. Integrating approaches across all scales, from molecular to field applications, is necessary to develop sustainable plant production with high yield and high resource-use efficiency. While significant progress has been made in molecular and genetic approaches in recent years, the quantitative analysis of plant phenotypes has become the major bottleneck restricting the flow-through of genomics advances into improvements in crop performance.
2.1 Plant phenotyping, then and now
Plants develop through a complex interaction between genotype and environment. This determines their structure, functions and thus performance such as yield or efficient use of resources. To understand the genetic basis of these economically important traits, it is essential to quantitatively assess plant phenotypes and identify relationships to genotypes and environmental factors.
Plant phenotyping has been performed by farmers and breeders for more than 5,000 years, essentially since the time humans started to select traits to increase yield. Traditional phenotyping was based on experience and intuition and was laborious. Recent progress in sensor, robotics and automation technologies has led to the rapidly growing field of highly automated, non-destructive plant phenotyping. Modern plant phenotyping is a comprehensive, large-scale assessment of plant traits such as growth, development, tolerance, resistance, architecture, physiology, ecology, yield, and metadata that form the basis for more complex parameters. Examples of such measurements are chlorophyll fluorescence, stem diameter, plant height/width, canopy compactness, stress pigment concentration, leaf area, thickness, color and pose, seed number and size, flowering time, germination time and so on. It is crucial to perform multiple measurements simultaneously, at high throughput and high precision, to arrive at a more holistic characterization of plant performance.
Phenomics has been a central field of research and application in academia and industry. It provides a nourishing ground for the development of new phenotyping platforms and new phenomics data analysis methods in parallel. Challenges such as the establishment of robust phenotyping and data quality control protocols arise throughout this development. Furthermore, comprehensive data analysis approaches to integrate multiple measurements under dynamic environments are essential to meet our growing needs for food and fuel.
It is one thing to obtain a huge amount of phenomics data; it is quite another to use them effectively. While researchers are eager to boost their research using the modern phenotyping platforms, they may encounter difficulties with the data analysis portion of their workflow, rendering high-throughput phenotyping a less attractive option in their research plans. However, once pheno-informatics applications are developed and deployed, this bottleneck can be removed, resulting in smooth data interpretation processes and expedited research discoveries.
The major research aim in pheno-informatics is to turn sophisticated phenomics data into testable hypotheses for important gene discovery. During this post-genome era, there is a critical need to interpret phenomics data regarding the organism as a biological system. By extracting statistically significant relationships among measurable processes from the phenomics data, researchers will move scientific inquiry beyond the limitations of human perception. In the absence of such efforts, the promise of identifying emergent phenotypes and important processes for improving plant productivity will likely remain difficult to fulfill.
Our long-term goal is to develop effective strategies for design and construction of the pheno-informatics infrastructure for large-scale phenotyping projects, to discover the most important genes or processes from the data, and to predict their most probable mechanisms in key biological processes.
Our objective in this direction is to develop improved software packages for processing, modeling and visualizing the sophisticated and overwhelming amounts of phenomics data in plant science. The rationale is that applying rapidly developing data mining, statistics, and computer vision techniques will allow identification of emergent phenotypes, and consequently lead to accurate prediction of unknown gene functions. These new multi-disciplinary bio-data modeling and interpretation approaches will enable new ways of extracting information from massive phenomics data in a timely fashion, which complement and broadly extend existing methods of hypothesis testing and statistical inference.
Understanding how particular genotypes interact with the environment to produce specific phenotypic properties is a central goal of modern biology. However, associating phenotypes with the interactions of genotype and environment is generally a difficult problem because of the large number of genes and gene products that contribute to multiple phenotypes in concert with complex and dynamic environmental influences. We are developing, testing and applying phenome-genome-environment interaction discovery algorithms to identify emergent phenotype-genotype patterns under dynamic environmental conditions. We will take full advantage of the availability of multi-omics data, and will optimize the computational models using domain knowledge.
Exploring complex interactions among phenome, genome, and environment can lead to key scientific advances such as new drugs, effective treatments for aging, and increased crop yield. By studying the relationships between phenotype and environment, we have developed Dynamic Filter to identify outliers in phenotype data. The tool can characterize abnormalities caused by system errors, which are difficult to remove in the data collection step, thus distinguishing errors from more interesting cases of altered biological responses. Specifically, Dynamic Filter derives a theoretical curve representing the interaction between light intensity and photosynthesis efficiency, adjusts the curve to fit the phenotype data via optimization, and studies the deviations of individual phenotype values from the theoretical curve. The resulting patterns in residuals indicate abnormalities, and the optimized theoretical curves reveal true biological outliers.
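The residual-based idea behind Dynamic Filter can be sketched in a few lines of Python. This is an illustration only, not the Dynamic Filter implementation: the saturating light-response model phi(I) = phi_max / (1 + I / K), the Theil-Sen fit on its linearized form (swapped in here for the unspecified optimization step), the MAD-based cutoff, and the names `fit_light_response`, `flag_outliers` and `noise_floor` are all assumptions chosen for the example.

```python
from statistics import median

def fit_light_response(light, phi):
    """Robustly fit the hypothetical model phi(I) = phi_max / (1 + I / K).

    The model is linear in reciprocal space: 1/phi = 1/phi_max + I / (phi_max * K),
    so a Theil-Sen line (median of pairwise slopes) gives an outlier-resistant fit.
    Assumes phi > 0, a decreasing response, and at least two distinct light levels.
    """
    y = [1.0 / p for p in phi]
    slopes = [
        (y[j] - y[i]) / (light[j] - light[i])
        for i in range(len(light))
        for j in range(i + 1, len(light))
        if light[j] != light[i]
    ]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(light, y))
    phi_max = 1.0 / intercept
    k_half = intercept / slope  # light level at which phi drops to phi_max / 2
    return phi_max, k_half

def flag_outliers(light, phi, cutoff=3.5, noise_floor=0.01):
    """Return indices whose residual from the fitted curve is unusually large."""
    phi_max, k_half = fit_light_response(light, phi)
    residuals = [p - phi_max / (1.0 + i / k_half) for i, p in zip(light, phi)]
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    scale = max(1.4826 * mad, noise_floor)  # floor avoids a zero scale on clean data
    return [n for n, r in enumerate(residuals) if abs(r - center) / scale > cutoff]

# Synthetic data generated from the assumed model, with one injected artifact.
light = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
phi = [0.8 / (1.0 + i / 500.0) for i in light]
phi[4] = 0.2
print(flag_outliers(light, phi))  # prints [4]
```

A robust fit is used so that a single artifact does not pull the curve toward itself and mask its own residual; whether a flagged point is a system error or a genuine biological response still has to be judged from the pattern of residuals.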
Our current research aims at learning the functional relationships between phenotypes and environmental parameters, with the ultimate goal of learning the relationships between genotypes. While many models assume there is only one function describing the average relationship, it would be more precise to adopt multiple functions, each describing the phenotype-environment relationship in a fixed environment, that are connected following the rule of phenotypic plasticity. Using Bayes' theorem, we will develop robust curve-fitting algorithms for function parameter estimation.
We will also develop schemes to evaluate the performance of our methods. Specifically, to test whether an algorithm is robust to noise, we will measure how strongly random noise affects the similarity between plants using precision/recall or Kendall's tau. To determine whether a model can capture the overall pattern in phenotype measures, we will identify the most similar and the most dissimilar pairs of plants, and query biological databases to check whether the genotypes of the similar (dissimilar) plants are likely to play similar (different) roles.
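As one way to make the noise-robustness check concrete, the sketch below computes Kendall's tau between pairwise plant similarities obtained from clean and from perturbed data. It is a minimal illustration: the function name `kendall_tau`, the tau-a variant (ties count as neither concordant nor discordant), and the similarity values are assumptions, not the evaluation code used in this project.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau-a between two equally long score lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical similarity scores for the same four plant pairs,
# computed once from clean data and once after adding random noise.
clean_similarity = [0.91, 0.75, 0.40, 0.12]
noisy_similarity = [0.88, 0.79, 0.10, 0.35]
print(kendall_tau(clean_similarity, noisy_similarity))
```

A value near 1 means the noise barely changes which plant pairs are ranked as most similar; values near 0 mean the ranking is largely destroyed.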
With the rapid development of advanced phenotyping tools, there is increasing recognition and appreciation of the need to model the dynamics of phenotypes and to study how they change in response to perturbation of genetic and environmental conditions. However, the traditional RIL/GWAS platforms primarily focus on steady-state phenotypes measured at a specified condition or time, regardless of the fact that, in the real world, many phenotypes correlate with each other and change over time and conditions. To address this clear need, we present a dynamic RIL/GWAS model to estimate key parameters of the dynamic system and to test the association of genetic variants with multiple temporal phenotypes.
This work is based on our recent progress in
modeling temporal phenotypes for early plant disease detection
(in preparation). The rationale is that the particular disease
we study mainly affects plant metabolism, which may disturb
photosynthesis phenotypes in diseased plants even in the early
stage. A simple linear regression showed that the phenotypes of
the diseased and the normal plants are different, but the
differences are not statistically significant. However, by
modeling temporal phenotypes as continuous functions of
time/conditions using a kernel smoother, our new model is able
to separate diseased and normal plants in the early stage with
precision as high as 95%. This work allows us to extend the
current RIL/GWAS models towards understanding how genetic
variations and environmental perturbations act together to
dynamically alter regulation and metabolism, leading to the
emergence of complex phenotypes.
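To show what modeling a temporal phenotype as a continuous function can look like, here is a minimal Nadaraya-Watson kernel smoother with a Gaussian kernel. The source does not say which kernel smoother or bandwidth was used, so the kernel choice, the `bandwidth` parameter, the function name `kernel_smooth` and the sample readings are assumptions made for illustration.

```python
import math

def kernel_smooth(times, values, query_times, bandwidth):
    """Nadaraya-Watson estimate with a Gaussian kernel at each query time.

    Assumes every query time lies close enough to the data that the
    kernel weights do not all underflow to zero.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    smoothed = []
    for t in query_times:
        weights = [math.exp(-0.5 * ((t - ti) / bandwidth) ** 2) for ti in times]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed

# Hypothetical hourly photosynthesis readings for one plant,
# evaluated on a finer half-hourly grid.
hours = [0, 1, 2, 3, 4]
readings = [0.70, 0.68, 0.61, 0.66, 0.69]
grid = [h / 2 for h in range(9)]
print(kernel_smooth(hours, readings, grid, bandwidth=0.75))
```

Once each plant is represented by a smooth curve, curves can be compared at any time point or condition, which is not possible when only the raw, irregularly noisy samples are available.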
Using plant photosynthesis phenomics as proof of principle, we are developing algorithms for phenomics knowledge discovery, including phenotype clustering, phenotype network construction, and phenotype ontology development. The set of tools provides a one-stop solution for phenomics data analytics as an integrated system. It turns sophisticated phenomics data into testable hypotheses, discovers important genes, and connects biological processes.
4.1 Inter-functional phenomics data clustering
Photosynthesis must respond to the needs of the organism to provide the optimal amount of energy, in the correct forms, without producing toxic byproducts, e.g. reactive oxygen species or glycolate. In this context, photosynthesis is a set of integrated modules (called central components) that form a self-regulating network, which is modulated by signal transduction (peripheral processes). In general, photosynthesis phenotypes can be altered in two ways: either through the changes of peripheral processes of the photosynthesis regulatory network or through the modification of central components of the network. Changes in peripheral processes tend to preserve regulatory relationships within the network. In contrast, altering the central components of photosynthesis will perturb the relationships between key regulatory processes, leading to a different correlation among photosynthesis phenotypes. Inter-functional analysis, an umbrella term describing a number of technologies to integrate multiple phenotypes into one unified data representation, enables effective knowledge discovery in unknown protein identification in both central and peripheral components of photosynthesis.
Our first approach is a nonparametric clustering algorithm for dissecting multi-dimensional phenomics data. The center of the work is a cloud-of-points data representation, in which the properties of a genotype are characterized by a set of data points. Following the framework of mixture models, we assume that there are k different underlying distributions in the data, where each distribution is introduced to capture a different shape of the cloud-of-points representation, and all the data points observed in the cloud-of-points representation are drawn independently from one of the k distributions. The main challenge arises from finding the optimal density function for each distribution, in which the variables to optimize are functions or vectors of infinite dimension. This is in contrast to most optimization problems, where the variables are of finite dimension. The optimization problem was solved by drawing on the theory of kernel density estimation based on the Nadaraya-Watson method. An experiment on photosynthesis phenotype data showed that our technique is more effective at capturing mutant lines with similar photosynthesis profiles than conventional clustering algorithms.
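To illustrate the cloud-of-points idea, the sketch below scores how well one genotype's cloud is explained by a Gaussian kernel density estimate built from another genotype's cloud. It is a simplified stand-in, not the nonparametric mixture algorithm described above: the isotropic Gaussian kernel, the fixed bandwidth, the function names and the sample points are assumptions.

```python
import math

def kde(cloud, x, bandwidth):
    """Gaussian kernel density estimate at point x from a cloud of d-dimensional points."""
    d = len(x)
    norm = len(cloud) * (bandwidth * math.sqrt(2.0 * math.pi)) ** d
    total = 0.0
    for p in cloud:
        sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
        total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
    return total / norm

def mean_log_density(cloud_a, cloud_b, bandwidth):
    """Average log density of cloud_a's points under the KDE of cloud_b.

    Assumes the clouds are close enough that no density underflows to zero.
    """
    return sum(math.log(kde(cloud_b, x, bandwidth)) for x in cloud_a) / len(cloud_a)

# Hypothetical two-phenotype measurements (one point per observation).
wild_type = [(0.70, 1.0), (0.72, 1.1), (0.69, 0.9)]
mutant_like = [(0.71, 1.0), (0.70, 1.05)]
mutant_far = [(0.30, 2.5), (0.28, 2.6)]
print(mean_log_density(mutant_like, wild_type, 0.2))
print(mean_log_density(mutant_far, wild_type, 0.2))
```

A higher score means the two clouds occupy the same region of phenotype space; a clustering procedure in this spirit would group genotypes whose clouds explain each other well.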
Our second approach aims to further improve the analysis by taking into consideration the sequential order among phenotype measurements. By modeling phenotype data as multi-dimensional time series, we will adopt the partition-and-detect framework to compute the distance between any two plants with a kernel density estimation function, where the distance function in the kernel is defined as a weighted sum of three different types of normalized distances, i.e. perpendicular, parallel and angle distances. Our algorithm will enable researchers to identify important trajectory features of multi-dimensional temporal phenomics data, which are crucial for function prediction for core photosynthesis proteins.
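The three distance components can be sketched for a pair of 2-D line segments as follows. The definitions follow those commonly used in the trajectory partitioning literature, which we assume here is what the text refers to; the function name `segment_distance`, the default weights, and the omission of the normalization step mentioned above are simplifications for illustration.

```python
import math

def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1])

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def _norm(u):
    return math.hypot(u[0], u[1])

def segment_distance(seg_a, seg_b, w_perp=1.0, w_par=1.0, w_angle=1.0):
    """Weighted sum of perpendicular, parallel and angle distances between 2-D segments."""
    # The longer segment plays the role of the reference line.
    long_seg, short_seg = sorted(
        (seg_a, seg_b), key=lambda s: _norm(_sub(s[1], s[0])), reverse=True
    )
    (si, ei), (sj, ej) = long_seg, short_seg
    vi, vj = _sub(ei, si), _sub(ej, sj)
    len_i, len_j = _norm(vi), _norm(vj)
    if len_i == 0.0 or len_j == 0.0:
        raise ValueError("segments must have non-zero length")

    def project(p):
        # Orthogonal projection of p onto the line through the longer segment.
        t = _dot(_sub(p, si), vi) / (len_i ** 2)
        return (si[0] + t * vi[0], si[1] + t * vi[1])

    ps, pe = project(sj), project(ej)
    l1, l2 = _norm(_sub(sj, ps)), _norm(_sub(ej, pe))
    d_perp = 0.0 if l1 + l2 == 0.0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)
    d_par = min(_norm(_sub(ps, si)), _norm(_sub(ps, ei)),
                _norm(_sub(pe, si)), _norm(_sub(pe, ei)))
    cos_theta = max(-1.0, min(1.0, _dot(vi, vj) / (len_i * len_j)))
    # Angles of 90 degrees or more are penalized by the full length of the shorter segment.
    d_angle = len_j * math.sqrt(1.0 - cos_theta ** 2) if cos_theta > 0 else len_j
    return w_perp * d_perp + w_par * d_par + w_angle * d_angle

# Two parallel segments one unit apart: perpendicular 1, parallel 2, angle 0.
print(segment_distance(((0, 0), (10, 0)), ((2, 1), (6, 1))))  # prints 3.0
```

For phenotype data, each segment would be a piece of a partitioned phenotype trajectory, and the weights control whether offset, displacement along the trajectory, or change of direction dominates the comparison.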
4.2 Constructing dynamic phenotype networks
The acquisition of high-quality phenotype data provides opportunities for modeling the molecular networks controlling complex traits such as development, stress tolerance, and metabolism or even the interactions of organisms in a community. In order to identify unique phenotype patterns under dynamic environmental conditions, thus linking environmental changes to genotypes, we are developing tools for constructing dynamic phenotype networks.
We have developed a series of biological network construction and functional module discovery algorithms for different aims. Among them, a soft-thresholding approach was proposed to construct networks of functional modules using gene expression datasets. Another algorithm, NeMoFinder, was developed to mine meso-scale repeated and unique network motifs in large biological networks. It utilizes repeated trees to partition a network into a set of graphs, and adopts the notion of graph cousins to facilitate the candidate generation and frequency counting processes. We labeled the network motifs with biological features associated with genes and proposed a frequent pattern mining algorithm to capture the interesting biological contexts of the network motifs.
Our current approach aims to develop dynamic phenotype network algorithms to model the responses of plants that vary as conditions change. We first measure the similarity between any two individual plants using a window-based approach, and then model the dynamics of the phenomics network. The topological properties of the dynamic phenotype networks will provide key information towards scientific discoveries. For example, mutants that are always connected under multiple environments may affect plant mechanisms similarly; if two mutant lines are connected in the morning but separated in the afternoon, they may have different levels of vulnerability to photodamage.
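A window-based network construction of this kind can be sketched as below. The choice of Pearson correlation as the similarity measure, non-overlapping windows, the 0.9 threshold, the function names and the sample traces are assumptions made for the example, not the project's algorithm.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def dynamic_network(series, window, threshold=0.9):
    """One edge set per non-overlapping window; an edge links two similar plants."""
    length = min(len(v) for v in series.values())
    snapshots = []
    for start in range(0, length - window + 1, window):
        edges = set()
        for a, b in combinations(sorted(series), 2):
            r = pearson(series[a][start:start + window],
                        series[b][start:start + window])
            if r >= threshold:
                edges.add((a, b))
        snapshots.append(edges)
    return snapshots

# Hypothetical phenotype traces: three "morning" and three "afternoon" readings per line.
series = {
    "mutant_A": [1, 2, 3, 3, 2, 1],
    "mutant_B": [2, 4, 6, 1, 2, 3],
    "mutant_C": [3, 2, 1, 1, 2, 3],
}
snapshots = dynamic_network(series, window=3)
always_connected = set.intersection(*snapshots)
print(snapshots)
print(always_connected)
```

In this toy data, mutant_A and mutant_B are linked in the first window but not in the second, while no pair stays linked throughout, mirroring the morning/afternoon example in the text.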
We aim to develop dynamic graph clustering algorithms to identify functional modules of phenotypes. While existing work such as DHAC highlights the continuous change of edges in adjacent networks, a broader concept of comparison is needed. This is because the order of snapshots in a dynamic phenotype network should be interchangeable, allowing for the discovery of subgraphs that appear frequently under any combination of environmental conditions.
4.3 Data-driven ontologies to describe plant phenotypes
The sheer volume of data from large-scale phenotyping is now driving the need to manage phenotype information in a way that enables the efficient use of data. A major problem of current phenotype data management is the lack of common semantics across phenomics data sources, which inhibits data integration and comparison across different domains or species. This problem becomes even more challenging as modern phenotyping experiments are often conducted under dynamic environmental conditions over a relatively long period, resulting in even more complicated phenotype data. The key to solving this problem is to develop an appropriate phenotype ontology construction strategy that uses controlled vocabularies to describe different phenotypes in a consistent and structured way.
Ontologies are important tools for structuring information. In the field of biomedical research, the use of ontologies for functional description is pervasive, such as Gene Ontology (GO) and Human Phenotype Ontology (HPO). We have developed a series of ontology construction and mining algorithms for different aims. Our current aim is to automatically construct phenotype ontologies with a logically well-formed structure across different domains and species, allowing for semantic and consistent data representation, such that the phenotype descriptions can be objectively related to each other and can be annotated to the corresponding phenotype data. We are developing a new meta-ontology approach to combine terms from multiple standard ontologies, each supporting a particular domain of knowledge, and to use a specified schema to provide the overall logical structure.
5. PhenoCloud – Interactive phenotype data exploration
Phenomics data are usually collected from numerous sources representing different kinds of characteristics and over relatively long periods. The data are far beyond the direct perception of the human eye. Biologists face the challenge of developing efficient and robust computational tools to reduce large and diverse phenomics data into representations that can be interpreted in a biological context. Moreover, there are currently few tools that allow researchers to wander around the phenomics data and make discoveries by following intuition or simple serendipity.
As phenotype data continue to grow in complexity, diversity and volume, existing systems find it harder and harder to maintain a highly interactive experience. New intuitive exploration and visualization tools that support mapping and synchronous adjustment are required to reduce the barrier to entry for researchers. We are developing instant interactive data mining tools designed and optimized for exploratory analysis of massive phenomics data. This topic is novel because traditional data mining aims at finding highly interesting results at the cost of being computationally demanding and time-consuming, and is hence not suitable for people without a computing background who want to explore large data.
We have developed an active learning software package with an interactive interface for determining the sampling rate of gene expression experiments. We have also developed a multi-heatmap visualization software tool called OLIVER, representing six types of data exploration: Observe, Link, Investigate, Visualize, Explore and Relate. The main workflow of OLIVER consists of four major steps: visualize multiple heatmaps, sort and cluster, select genes in each heatmap with statistical tests, and map gene selections between heatmaps. Using integrative approximation and multi-dimensional visualization methods, OLIVER enables biologists to quickly integrate and compare large amounts of phenomics and genomics data with modest equipment and technical background.
The goal has now shifted to developing new techniques that instantly generate high-quality results, which are presented understandably, interactively and adaptively so as to allow people to rapidly steer the method to the most informative areas. There are three perceptually-aware optimizations in our system. First, we model human perception as perceptual functions. The system automatically approximates data transformations in ways that are perceptually indistinguishable, thus avoiding unnecessary computation. The system can therefore remain interactive while visualizing big phenomics data that need to be processed with computationally expensive procedures. Second, we model users' operations as feedback functions. Keeping humans in the loop is key to enabling our system to silently adjust itself on the fly. Taking advantage of existing work that has presented guidelines for using human perceptual insights to justify approximation algorithms, we are building an active-learning based phenotype data visualization system that learns from the user's actions, iteratively adjusts its models and parameters, and extracts constraints based on human-computer interaction. Third, we develop fast any-time approximation algorithms by taking advantage of the high redundancy in biomedical data. With a coarse-to-fine procedure, common tasks such as search and match can be finished in sublinear time.
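One simple instance of a perceptually indistinguishable approximation is to reduce a long time series to one minimum and one maximum per pixel column before drawing it, since a line chart cannot show more horizontal detail than the display has columns. The sketch below is an assumption made for illustration; the text does not state which approximations PhenoCloud uses, and `minmax_per_column` is a hypothetical name.

```python
def minmax_per_column(values, n_columns):
    """Reduce a series to (min, max) per pixel column of an n_columns-wide plot."""
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    n = len(values)
    k = min(n_columns, n)  # never more columns than data points
    columns = []
    for c in range(k):
        # Integer arithmetic splits the series into k nearly equal contiguous chunks.
        chunk = values[c * n // k:(c + 1) * n // k]
        columns.append((min(chunk), max(chunk)))
    return columns

# Ten readings drawn on a two-pixel-wide plot collapse to two vertical extents.
print(minmax_per_column(list(range(10)), 2))  # prints [(0, 4), (5, 9)]
```

The cost of drawing then depends on the width of the display rather than on the length of the recording, which is what keeps the interaction responsive on ultra-long time series.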
Ultimately, with interactive visualization, exploring phenotype/genotype data will become a pleasant journey. Researchers can quickly grasp highly interesting results from large-scale phenomics data, which will potentially lead to advances in understanding biological machinery or breakthroughs in biotechnology.