Research philosophy – the adventure in pheno-informatics

We are facing a revolution in phenomics just like the one that genomics has undergone over the last two decades. Tremendous effort has recently gone into developing high-throughput phenotyping devices for probing important traits such as tumor size and volume, plant photosynthesis, and nutrient-use efficiency, and into using these devices to screen millions of individuals under real or simulated environmental conditions.

Large-scale, high-throughput phenotyping has provided huge amounts of phenotype data at relatively low cost. High-quality phenotype data with explicit biological meanings provide opportunities for computationally modeling the molecular networks that control complex traits such as outcome, stress tolerance and metabolism, or even the interactions of organisms in a community. With the large volume of phenotype data, researchers expect to immediately identify traits with efficiency-boosting machinery, and to quickly generate and test biomedical hypotheses that may lead to breakthroughs in biological research or clinical trials. However, phenotype data are incredibly complex. Unlike the genotype, the phenotype of an organism can be described at many levels of granularity, ranging from a single molecule to metabolic networks to physiological systems, all the way to population behaviors. Moreover, phenotypes are dynamic and subject to environmental changes, and most phenotypes are described as continuous functions, as opposed to the discrete characters (e.g. ACGT) of the genotype.

Phenomics poses formidable big data challenges. It is difficult to analyze a single phenotype represented as a huge matrix containing millions of individuals across hundreds of thousands of sampling time points under varying environmental conditions, let alone multiple phenotypes, each with a different biological meaning. How can we process, mine and visualize all the phenotype data? Which variety has better yield than the others? How do the varieties differ from each other? There are now high-throughput phenotyping devices that can non-invasively measure hundreds or thousands of individuals in parallel throughout their life span, generating huge amounts of data. But the infrastructure for analyzing phenotypic information is lacking, and the ability to correlate it with other big biological data (e.g. transcriptomics data generated using NGS technologies) is still limited, which makes the development of novel pheno-informatics approaches imperative.
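As a rough illustration of the matrix view described above, a single phenotype can be stored as one row per individual and one column per sampling time point, so that a question such as "which variety has better yield" becomes an aggregation over rows. The following is only a minimal sketch in plain Python; the variety names, the trait values, and the helper `mean_trait_by_variety` are hypothetical and not taken from any real data set or tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: each entry is one individual, tagged with its
# variety, followed by a trait value measured at successive time points.
phenotype_rows = [
    ("variety_A", [1.0, 2.0, 3.0]),
    ("variety_A", [1.0, 2.0, 5.0]),
    ("variety_B", [2.0, 3.0, 6.0]),
    ("variety_B", [2.0, 5.0, 6.0]),
]


def mean_trait_by_variety(rows):
    """Average the trait value at the final time point for each variety."""
    final_values = defaultdict(list)
    for variety, series in rows:
        final_values[variety].append(series[-1])
    return {variety: mean(values) for variety, values in final_values.items()}


summary = mean_trait_by_variety(phenotype_rows)
# The variety with the highest average final value in this toy example.
best_variety = max(summary, key=summary.get)
```

At the scale described above (millions of individuals, hundreds of thousands of time points) an in-memory list of this kind would not be practical; the sketch only shows the shape of the question being asked.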

Our research mainly focuses on exploring strategies, algorithms and tools for organizing, mining and visualizing phenomics data. By collaborating closely with labs that develop phenotyping technology, especially that of Dr. David M Kramer, and with funding support from the DOE, NSF and MSU, we have developed a series of computational tools for phenomics data analysis and have demonstrated their effectiveness in phenotype knowledge discovery.

The rapid development of phenotyping platforms is never ending. New platforms, e.g. the Plant Accelerator in Australia and the Dynamic Environmental Phenotype Imager (DEPI) at PRL, have effectively increased the speed and scale of plant physiological characterization. To keep pace with this development, we set three aims for phenomics data analysis. They will result in intelligent pheno-informatics tools that enable effective phenotypic pattern identification and precise gene function discovery. Taken together, we will introduce new phenomics data representations and develop novel algorithms for relating and visualizing phenotype data, thus significantly broadening the research area of phenomics.

•    Aim 1. Develop inter-functional pattern discovery algorithms to identify phenotype clusters, construct dynamic phenotype networks, and develop a data-driven phenotype ontology, in order to turn sophisticated phenomics data into testable hypotheses, to discover important genes, and to connect biological processes.

•    Aim 2. Develop phenome-genome-environment interaction discovery algorithms to identify emergent phenotype-genotype patterns under dynamic environments. The algorithms will take full advantage of the availability of multi-omics data and domain knowledge.

•    Aim 3. Develop an interactive instant phenotype analysis package for complex phenome/genome data processing and display using integrative approximation and multi-dimensional visualization methods.
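To make the idea of identifying phenotype clusters in Aim 1 more concrete, the sketch below groups phenotype time series with plain k-means. This is a standard textbook technique used here only for illustration, not the pattern discovery algorithm the aims refer to. The trait curves, the choice of two clusters, and the starting centers are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Point-wise mean of a non-empty list of equal-length time series."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(series, initial_centers, iterations=10):
    """Plain k-means with caller-supplied starting centers (deterministic)."""
    centers = [list(c) for c in initial_centers]
    labels = [0] * len(series)
    for _ in range(iterations):
        # Assignment step: attach each series to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: euclidean(s, centers[k]))
            for s in series
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [s for s, lab in zip(series, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return labels, centers


# Hypothetical trait curves for four individuals over three time points:
# two stay roughly constant, two decline over time.
series = [
    [0.80, 0.78, 0.79],
    [0.82, 0.80, 0.81],
    [0.80, 0.60, 0.40],
    [0.78, 0.58, 0.42],
]
labels, centers = kmeans(series, initial_centers=[series[0], series[2]])
```

In this toy case the stable and declining curves fall into separate clusters. Real phenomics data would likely need a distance measure that accounts for time shifts and environmental covariates, which plain Euclidean distance ignores.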

Background and Significance – plant phenotyping and beyond

Our world is confronting a confluence of unprecedented challenges in food and energy shortages. The U.N. estimates that food production will have to increase by 50-70% over the next 30-50 years. Plant-derived products are at the center of this grand challenge. Integrated approaches across all scales, from molecular to field applications, are necessary to develop sustainable plant production with high yield and high resource-use efficiency. While significant progress has been made in molecular and genetic approaches in recent years, the quantitative analysis of plant phenotypes has become the major bottleneck restricting the flow-through of genomics advances into improvements in crop performance.

Plant phenotyping, then and now

Plants develop through a complex interaction between genotype and environment. This interaction determines their structure and function, and thus their performance in terms of yield or efficient use of resources. To understand the genetic basis of these economically important traits, it is essential to quantitatively assess plant phenotypes and identify their relationships to genotypes and environmental factors.

Plant phenotyping has been performed by farmers and breeders for more than 5,000 years, essentially since the time humans started to select traits to increase yield. Traditional phenotyping was based on experience and intuition and was laborious. Recent progress in sensor, robotics and automation technologies has led to the rapidly growing field of highly automated, non-destructive plant phenotyping. Modern plant phenotyping is a comprehensive, large-scale assessment of plant traits such as growth, development, tolerance, resistance, architecture, physiology, ecology, yield, and metadata that form the basis for more complex parameters. Examples of such measurements are chlorophyll fluorescence, stem diameter, plant height/width, canopy compactness, stress pigment concentration, leaf area, thickness, color and pose, seed number and size, flowering time, germination time and so on. It is crucial to perform multiple measurements simultaneously, at high throughput and high precision, to arrive at a more holistic characterization of plant performance.

Phenomics has been a central field of research and application in academia and industry. It provides a nourishing ground for the development of new phenotyping platforms and new phenomics data analysis methods in parallel. Challenges such as the establishment of robust phenotyping and data quality control protocols arise throughout this development. Furthermore, comprehensive data analysis approaches to integrate multiple measurements under dynamic environments are essential to meet our growing needs for food and fuel.

Algorithm development for phenomics applications

It is one thing to obtain a huge amount of phenomics data; it is quite another to use them effectively. While researchers are eager to boost their research using the modern phenotyping platforms, they may encounter difficulties with the data analysis portion of their workflow, rendering high-throughput phenotyping a less attractive option in their research plans. However, once pheno-informatics applications are developed and deployed, this bottleneck can be removed, resulting in smooth data interpretation processes and expedited research discoveries.

The major research aim in pheno-informatics is to turn sophisticated phenomics data into testable hypotheses for important gene discovery. In this post-genome era, there is a critical need to interpret phenomics data in terms of the organism as a biological system. By extracting statistically significant relationships among measurable processes from phenomics data, researchers will move scientific inquiry beyond the limitations of human perception. In the absence of such efforts, the promise of identifying emergent phenotypes and important processes for improving plant productivity will likely remain difficult to realize.
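One simple way to picture a statistically significant relationship between two measured processes is a correlation coefficient paired with a permutation test. The sketch below is an illustration under stated assumptions, not a method taken from this research program. The two measurement series are made-up values, and the helper names `pearson` and `exact_permutation_p_value` are hypothetical.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def exact_permutation_p_value(x, y, tol=1e-12):
    """Two-sided p-value from enumerating every reordering of y.

    Only feasible for very small samples, since n! orderings are visited.
    """
    observed = abs(pearson(x, y))
    total = 0
    at_least_as_extreme = 0
    for shuffled in permutations(y):
        total += 1
        if abs(pearson(x, shuffled)) >= observed - tol:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical paired measurements of two processes on five individuals.
process_a = [1.0, 2.0, 3.0, 4.0, 5.0]
process_b = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson(process_a, process_b)
p_value = exact_permutation_p_value(process_a, process_b)
```

For realistic sample sizes the full enumeration would be replaced by random resampling, and testing many trait pairs at once would call for a multiple-testing correction.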

Our long-term goal is to develop effective strategies for the design and construction of pheno-informatics infrastructure for large-scale phenotyping projects, to discover the most important genes or processes from the data, and to predict their most probable mechanisms in key biological processes.

Our objective in this direction is to develop improved software packages for processing, modeling and visualizing the complex and overwhelming volume of phenomics data in plant science. The rationale is that applying rapidly developing data mining, statistics, and computer vision techniques will allow the identification of emergent phenotypes, and consequently lead to accurate prediction of unknown gene functions. These new multi-disciplinary approaches to bio-data modeling and interpretation will enable new ways of extracting information from massive phenomics data in a timely fashion, complementing and broadly extending existing methods of hypothesis testing and statistical inference.
