Introduction to Phenoinformatics

Jin Chen © 2020


Large-scale phenomics data analysis, also known as phenoinformatics, is an emerging informatics research direction that simultaneously models many phenotypic traits of many individuals over time and space. It promises to bridge the wide gap between genome and disease, inspiring science aimed at understanding gene function, cell development, evolution, and more.
    We develop and apply phenoinformatics tools to better understand phenotypic characteristics, such as health, disease, and evolutionary fitness, in the fields of precision medicine and the biological sciences. Specifically, we aim to construct a 'genotype-phenotype map' that describes complex phenotypic variation across conditions using detailed phenomics data.

  • Project: A New Phenoinformatics Framework for Dissecting Large-scale Longitudinal Phenomics Data
  • Principal Investigator: Dr. Jin Chen
  • Funding: National Science Foundation (ABI-1458556), Department of Energy (DE–FG02–91ER20021), National Cancer Institute (NIH R21XXX), Kentucky Lung Cancer Research (KLCR XXX)
  • Keywords: Phenoinformatics; Spatiotemporal data analysis; Imaging and image data analysis; Signal processing and data analysis


The phenome, referring to the full set of phenotypes of an individual above the molecular level, is critical for explaining important outcomes such as human disease and crop yield. In recent years, new sensor techniques have been developed to enable high-throughput, detailed phenotyping. In medicine, clinically characterized traits such as fever, rash, limping, or an irregular heartbeat signify health or disease. In agriculture, phenotypes help improve yield, nutrient-use efficiency, and photosynthesis, meeting our growing needs for food and fuel in a changing climate. To date, high-throughput and high-dimensional phenotyping experiments have generated large amounts of phenomics data.

    While these developments in phenotyping techniques are exciting, clinicians and researchers are limited by the tools available to fully analyze the phenomics data. Removing that limitation is the goal of this research project. To address this clear need, we are designing, developing, and applying phenomics data analytics solutions, collectively called phenoinformatics, such that big phenomics data can be transformed into knowledge or testable hypotheses to identify important genes or pathways, or to infer treatment prognosis. The solutions will ensure high phenomics data quality, identify and visualize important emergent phenotype patterns, and advance knowledge discovery in the broader community.

    In conclusion, phenomics has broad importance in applied and basic biology. Unprecedented advances have been made in the throughput and pace of large-scale phenotyping platforms. However, our ability to measure the effects of interactions between genetic variants and environmental conditions far outstrips our ability to understand them. This imposes increasingly high demands on the phenoinformatics tools needed to analyze the exponentially growing phenomics data. We aim to develop, test, and apply phenomics data analytics solutions. These computational tools will turn sophisticated phenomics data into testable hypotheses, facilitate scientific discovery of novel gene function, and thus significantly broaden the research area of phenomics.

    We welcome collaborations with research groups working on sensor development, computer vision, or biomedical problems to solve simulated or real-world problems. Meanwhile, we will continue to maintain strength in computer science as a foundation for our phenoinformatics projects.

Specific Aims

We are facing a revolution in phenomics just like the one genomics underwent over the last two decades. Tremendous effort has recently been devoted to developing high-throughput phenotyping devices that probe important traits such as tumor size and volume, plant photosynthesis, and nutrient-use efficiency, and to using these devices to screen millions of individuals under real or simulated environmental conditions.

Large-scale, high-throughput phenotyping has provided huge amounts of phenotype data at relatively low cost. High-quality phenotype data with explicit biological meanings provide opportunities for computationally modeling the molecular networks that control complex traits such as outcome, stress tolerance, and metabolism, or even the interactions of organisms in a community. With this large volume of phenotype data, researchers expect to quickly identify traits linked to efficiency-boosting machinery, and to rapidly generate and test biomedical hypotheses that may lead to new breakthroughs in biological research or clinical trials. However, phenotype data are incredibly complex. Unlike the genotype, the phenotype of an organism can be described at many levels of granularity, ranging from a single molecule to metabolic networks to physiological systems, all the way to population behaviors. Moreover, phenotypes are dynamic and subject to environmental changes, and most phenotypes are described as continuous functions, as opposed to the discrete characters (e.g., ACGT) of the genotype.

Phenomics poses formidable big data challenges. It is difficult to analyze even a single phenotype represented as a huge matrix covering millions of individuals across hundreds of thousands of sampling time points under varying environmental conditions, let alone multiple phenotypes, each with a different biological meaning. How can we process, mine, and visualize all the phenotype data? Which variety yields better than the others? How do the varieties differ from each other? High-throughput phenotyping devices can now non-invasively measure hundreds or thousands of individuals in parallel throughout their life span, generating huge amounts of data. However, the infrastructure to analyze this phenotypic information, and to correlate it with other big biological data (e.g., transcriptomics data generated using NGS technologies), is still limited, making the development of novel phenoinformatics approaches imperative.
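As a minimal sketch of what such a phenotype matrix might look like in practice (the dimensions, variable names, and data below are hypothetical illustrations, not the project's actual pipeline), per-condition time series for each individual can be stored as a 3-D NumPy array and flattened into per-individual feature vectors for downstream mining:

```python
import numpy as np

# Hypothetical phenomics "data cube": individuals x time points x conditions,
# filled with synthetic values for illustration only.
rng = np.random.default_rng(0)
n_individuals, n_times, n_conditions = 100, 50, 3
pheno = rng.normal(size=(n_individuals, n_times, n_conditions))

# Flatten each individual's spatiotemporal profile into one feature vector
# so that standard mining tools (clustering, PCA, ...) can be applied.
profiles = pheno.reshape(n_individuals, -1)

# Rank individuals by their mean trait value under condition 0 --
# a toy stand-in for "which variety yields better than the others?".
mean_trait = pheno[:, :, 0].mean(axis=1)
ranking = np.argsort(mean_trait)[::-1]   # best individual first

print(profiles.shape)   # (100, 150)
print(ranking[:5])      # indices of the top-5 individuals
```

At real scale the array would not fit in memory as a dense cube, which is precisely the infrastructure gap the paragraph above describes; the sketch only shows the logical shape of the data.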

Our research focuses mainly on exploring strategies, algorithms, and tools for organizing, mining, and visualizing phenomics data. By collaborating tightly with labs working on phenotyping technology development, especially with Dr. David M Kramer, and with funding support from DOE, NSF, and MSU, we have developed a series of computational tools for phenomics data analysis and have demonstrated their effectiveness for phenotype knowledge discovery.

The rapid development of phenotyping platforms is never ending. New platforms, e.g., the Plant Accelerator in Australia and the Dynamic Environmental Phenotype Imager (DEPI) in PRL, have effectively increased the speed and scale of plant physiological characterization. To cope with this rapid development, we set three aims for phenomics data analysis, which will result in intelligent phenoinformatics tools for effective phenotypic pattern identification and precise gene function discovery. Taken together, we will introduce new phenomics data representations and develop novel algorithms for relating and visualizing phenotype data, thus significantly broadening the research area of phenomics.

•    Aim 1. Develop inter-functional pattern discovery algorithms to identify phenotype clusters, construct dynamic phenotype networks, and develop a data-driven phenotype ontology, thereby turning sophisticated phenomics data into testable hypotheses, discovering important genes, and connecting biological processes.

•    Aim 2. Develop phenome-genome-environment interaction discovery algorithms to identify emergent phenotype-genotype patterns under dynamic environments. The algorithms will take full advantage of the availability of multi-omics data and domain knowledge.

•    Aim 3. Develop an interactive instant phenotype analysis package for complex phenome/genome data processing and display using integrative approximation and multi-dimensional visualization methods.
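As a rough illustration of the kind of analysis Aim 1 envisions (not the project's actual algorithms), the sketch below builds a simple dynamic phenotype network: phenotype trajectories are linked whenever they correlate strongly within a sliding time window. All function names, thresholds, and data here are hypothetical.

```python
import numpy as np

def dynamic_phenotype_network(traits, window=20, step=10, threshold=0.8):
    """Link phenotype pairs whose trajectories correlate strongly
    within each sliding time window.

    traits: array of shape (n_phenotypes, n_times), a synthetic
    stand-in for real phenomics measurements.
    Returns a list of (window_start, boolean adjacency matrix) pairs.
    """
    n_pheno, n_times = traits.shape
    networks = []
    for start in range(0, n_times - window + 1, step):
        segment = traits[:, start:start + window]
        corr = np.corrcoef(segment)            # pairwise Pearson correlation
        adj = np.abs(corr) >= threshold        # edge if |r| passes threshold
        np.fill_diagonal(adj, False)           # no self-edges
        networks.append((start, adj))
    return networks

# Toy data: 5 phenotypes over 100 time points; phenotypes 0 and 1 covary.
rng = np.random.default_rng(1)
base = rng.normal(size=100)
traits = rng.normal(size=(5, 100))
traits[0] = base + 0.1 * rng.normal(size=100)
traits[1] = base + 0.1 * rng.normal(size=100)

nets = dynamic_phenotype_network(traits)
print(len(nets), "windows; edge 0-1 in first window:", nets[0][1][0, 1])
```

Tracking how the edge set changes from window to window is one simple way to expose the dynamic, environment-dependent character of phenotypes that the Aims target.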