PhenoLogic – Phenomics knowledge discovery
Using plant photosynthesis phenomics as a proof of principle, we are developing algorithms for phenomics knowledge discovery, including phenotype clustering, phenotype network construction, and phenotype ontology development. Together, these tools provide a one-stop solution for phenomics data analytics as an integrated system. The system turns complex phenomics data into testable hypotheses, helping researchers discover important genes and connect biological processes.
Inter-functional phenomics data clustering
Photosynthesis must respond to the needs of the organism to provide the optimal amount of energy, in the correct forms, without producing toxic byproducts such as reactive oxygen species or glycolate. In this context, photosynthesis is a set of integrated modules (called central components) that form a self-regulating network, which is modulated by signal transduction (peripheral processes). In general, photosynthesis phenotypes can be altered in two ways: through changes in the peripheral processes of the photosynthesis regulatory network, or through modification of the central components of the network. Changes in peripheral processes tend to preserve regulatory relationships within the network. In contrast, altering the central components of photosynthesis perturbs the relationships between key regulatory processes, leading to different correlations among photosynthesis phenotypes. Inter-functional analysis, an umbrella term for a number of techniques that integrate multiple phenotypes into one unified data representation, supports the identification of unknown proteins in both the central and peripheral components of photosynthesis.
Our first approach is a nonparametric clustering algorithm for dissecting multi-dimensional phenomics data. At the center of the work is a cloud-of-points data representation, in which the properties of a genotype are characterized by a set of data points. Following the framework of mixture models, we assume that there are k different underlying distributions in the data, where each distribution captures a different shape of the cloud-of-points representation, and each observed data point is drawn independently from one of the k distributions. The main challenge lies in finding the optimal density function for each distribution, because the variables to optimize are functions, that is, vectors of infinite dimension. This is in contrast to most optimization problems, where the variables are of finite dimension. We solved the optimization problem by drawing on the theory of kernel density estimation, based on the Nadaraya-Watson method. An experiment on photosynthesis phenotype data showed that our technique captures mutant lines with similar photosynthesis profiles more effectively than conventional clustering algorithms.
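To make the cloud-of-points idea concrete, the following is a minimal sketch and not the algorithm described above. It assumes each genotype is a small array of measurement points, fits an isotropic Gaussian kernel density estimate to each cloud, and compares two genotypes by how well each cloud is explained by the other's density. The function names, the fixed bandwidth, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_logpdf(cloud, queries, bandwidth=0.5):
    """Log-density of each query point under a Gaussian KDE fitted to `cloud`."""
    cloud = np.asarray(cloud, dtype=float)
    queries = np.asarray(queries, dtype=float)
    n, d = cloud.shape
    # Squared distance between every query point and every cloud point.
    diff = queries[:, None, :] - cloud[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    log_kernel = -0.5 * sq_dist / bandwidth ** 2
    # Normalising constant of an isotropic Gaussian kernel, averaged over n points.
    log_norm = -0.5 * d * np.log(2 * np.pi * bandwidth ** 2) - np.log(n)
    # Log-sum-exp over the cloud points for numerical stability.
    peak = log_kernel.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_kernel - peak).sum(axis=1)) + log_norm

def cloud_dissimilarity(a, b, bandwidth=0.5):
    """Symmetrised negative mean log-likelihood of each cloud under the other's KDE."""
    return -0.5 * (gaussian_kde_logpdf(a, b, bandwidth).mean()
                   + gaussian_kde_logpdf(b, a, bandwidth).mean())

# Hypothetical genotypes: each is a cloud of 50 two-dimensional phenotype points.
rng = np.random.default_rng(0)
wild_type = rng.normal([0.0, 0.0], 0.3, size=(50, 2))
mutant_near = rng.normal([0.1, 0.0], 0.3, size=(50, 2))
mutant_far = rng.normal([3.0, 3.0], 0.3, size=(50, 2))

print(cloud_dissimilarity(wild_type, mutant_near))
print(cloud_dissimilarity(wild_type, mutant_far))
```

A pairwise dissimilarity of this kind could be passed to any standard clustering routine. The mixture-model formulation described above instead optimizes the k density functions jointly, which this sketch does not attempt.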
Our current approach aims to further improve our analysis by taking into account the sequential order among phenotype measurements. By modeling phenotype data as multi-dimensional time series, we will adopt the partition-and-detect framework to compute the distance between any two plants with a kernel density estimation function, where the distance function in the kernel is defined as a weighted sum of three types of normalized distances: perpendicular, parallel, and angle distances. Our algorithm will enable researchers to identify important trajectory features of multi-dimensional temporal phenomics data, which are crucial for predicting the function of core photosynthesis proteins.
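The three component distances can be sketched for a pair of trajectory line segments as follows. This is an illustration under assumptions: it uses the perpendicular, parallel, and angle distance definitions commonly given in the trajectory partitioning literature, treats the longer segment as the reference, uses equal default weights, and omits the normalization step mentioned above. The function names are hypothetical.

```python
import numpy as np

def _project(point, start, end):
    """Orthogonal projection of `point` onto the line through `start` and `end`."""
    direction = end - start
    t = np.dot(point - start, direction) / np.dot(direction, direction)
    return start + t * direction

def segment_distances(seg_a, seg_b):
    """Return (perpendicular, parallel, angle) distances between two segments.

    Each segment is a pair of endpoints. The longer segment is used as the
    reference and is assumed to have non-zero length.
    """
    a = np.asarray(seg_a, dtype=float)
    b = np.asarray(seg_b, dtype=float)
    if np.linalg.norm(a[1] - a[0]) < np.linalg.norm(b[1] - b[0]):
        a, b = b, a  # make `a` the longer (reference) segment
    start, end = a
    proj_s, proj_e = _project(b[0], start, end), _project(b[1], start, end)

    # Perpendicular distance: Lehmer mean (order 2) of the two endpoint offsets.
    l1 = np.linalg.norm(b[0] - proj_s)
    l2 = np.linalg.norm(b[1] - proj_e)
    d_perp = 0.0 if l1 + l2 == 0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)

    # Parallel distance: smallest gap between a projected endpoint and an
    # endpoint of the reference segment.
    d_par = min(
        min(np.linalg.norm(proj_s - start), np.linalg.norm(proj_s - end)),
        min(np.linalg.norm(proj_e - start), np.linalg.norm(proj_e - end)),
    )

    # Angle distance: shorter length times sin(theta) below 90 degrees,
    # and the full shorter length otherwise.
    vec_a, vec_b = end - start, b[1] - b[0]
    len_b = np.linalg.norm(vec_b)
    if len_b == 0:
        d_theta = 0.0
    else:
        cos_t = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * len_b)
        sin_t = np.sqrt(max(0.0, 1.0 - cos_t ** 2))
        d_theta = len_b * sin_t if cos_t >= 0 else len_b
    return d_perp, d_par, d_theta

def weighted_segment_distance(seg_a, seg_b, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three component distances (weights are assumptions)."""
    return float(np.dot(weights, segment_distances(seg_a, seg_b)))

print(segment_distances([(0, 0), (4, 0)], [(1, 1), (3, 1)]))
```

In this reading, the weights control how much a difference in position (perpendicular and parallel terms) matters relative to a difference in direction (angle term).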
Constructing dynamic phenotype networks
The acquisition of high-quality phenotype data provides opportunities for modeling the molecular networks controlling complex traits such as development, stress tolerance, and metabolism or even the interactions of organisms in a community. In order to identify unique phenotype patterns under dynamic environmental conditions, thus linking environmental changes to genotypes, we are developing tools for constructing dynamic phenotype networks.
We have developed a series of biological network construction and functional module discovery algorithms for different aims. Among them, a soft-thresholding approach was proposed to construct networks of functional modules from gene expression datasets. Another algorithm, NeMoFinder, was developed to mine meso-scale repeated and unique network motifs in large biological networks. It uses repeated trees to partition a network into a set of graphs, and adopts the notion of graph cousins to facilitate candidate generation and frequency counting. We labeled the network motifs with biological features associated with genes and proposed a frequent pattern mining algorithm to capture the interesting biological contexts of the network motifs.
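As an illustration of soft thresholding, the sketch below uses one common form: the absolute pairwise correlation between expression profiles is raised to a power, so weak correlations shrink toward zero without a hard cutoff. Whether the approach mentioned above uses exactly this form is an assumption, and the function name, the exponent `beta`, and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold_adjacency(expression, beta=6):
    """Weighted adjacency from a genes-by-samples expression matrix.

    Edge weight is |Pearson correlation| ** beta, so strong correlations stay
    close to 1 while weak ones are pushed toward 0.
    """
    corr = np.corrcoef(np.asarray(expression, dtype=float))
    adjacency = np.abs(corr) ** beta
    np.fill_diagonal(adjacency, 0.0)  # ignore self-connections
    return adjacency

# Toy data: genes 0 and 1 are perfectly correlated, gene 2 is only weakly related.
expression = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, -1, 1, -1],
]
adjacency = soft_threshold_adjacency(expression)
print(np.round(adjacency, 3))
```

Compared with a hard cutoff, the weighted adjacency keeps the ordering of correlation strengths, which module detection methods can then exploit.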
Our current approach aims to develop dynamic phenotype network algorithms that model how plant responses vary as conditions change. We first measure the similarity between any two individual plants using a window-based approach, and then model the dynamics of the resulting phenomics network. The topological properties of the dynamic phenotype networks will provide key information for scientific discovery. For example, mutants that remain connected across multiple environments may affect plant mechanisms in similar ways; if two mutant lines are connected in the morning but separated in the afternoon, they may differ in their vulnerability to photodamage.
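The window-based construction can be sketched as follows, under several assumptions: each plant is a single time series, windows do not overlap, similarity is the Pearson correlation within a window, and two plants are connected when that correlation reaches a fixed threshold. None of these choices is stated above, and the function names and toy data are hypothetical.

```python
import numpy as np

def windowed_networks(series, window, threshold=0.8):
    """One boolean adjacency matrix per non-overlapping time window.

    `series` is a plants-by-timepoints array. Rows that are constant within a
    window yield undefined correlations and are treated as unconnected.
    """
    series = np.asarray(series, dtype=float)
    snapshots = []
    for start in range(0, series.shape[1] - window + 1, window):
        corr = np.corrcoef(series[:, start:start + window])
        adjacency = corr >= threshold
        np.fill_diagonal(adjacency, False)  # no self-loops
        snapshots.append(adjacency)
    return snapshots

def always_connected(snapshots):
    """Pairs of plants that are connected in every snapshot."""
    return np.logical_and.reduce(snapshots)

# Toy data: plants A and B track each other all day; plant C matches them in
# the first window ("morning") and diverges in the second ("afternoon").
series = [
    [1, 2, 3, 4, 4, 3, 2, 1],  # A
    [2, 4, 6, 8, 8, 6, 4, 2],  # B
    [1, 2, 3, 4, 1, 2, 3, 4],  # C
]
snapshots = windowed_networks(series, window=4)
stable = always_connected(snapshots)
print(stable)
```

Here A and B stay connected in both windows, while the A-C edge exists only in the first window, mirroring the morning and afternoon example above.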
We aim to develop dynamic graph clustering algorithms to identify functional modules of phenotypes. While existing work such as DHAC focuses on the continuous change of edges between adjacent networks, a broader notion of comparison is needed. This is because the order of snapshots in a dynamic phenotype network should be interchangeable, allowing the discovery of subgraphs that appear frequently under any combination of environmental conditions.
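A very simple way to see what order-independent comparison means is to count how often each edge occurs across snapshots, regardless of their sequence, and then group the frequent edges into connected components. The sketch below is only that baseline, not a dynamic graph clustering algorithm; the function names, the `min_support` parameter, and the condition labels are assumptions, and edges are assumed to join two distinct nodes.

```python
from collections import Counter

def frequent_edges(snapshots, min_support):
    """Undirected edges that occur in at least `min_support` snapshots."""
    counts = Counter()
    for edges in snapshots:
        # A set per snapshot so that duplicate edges are counted once.
        counts.update({frozenset(edge) for edge in edges})
    return {edge for edge, count in counts.items() if count >= min_support}

def connected_components(edges):
    """Group edges into connected components using depth-first search."""
    neighbours = {}
    for edge in edges:
        u, v = tuple(edge)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    seen, components = set(), []
    for node in neighbours:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical snapshots, one edge list per environmental condition.
snapshots = [
    [("m1", "m2"), ("m2", "m3")],                # high light
    [("m1", "m2"), ("m4", "m5")],                # low light
    [("m2", "m1"), ("m2", "m3"), ("m4", "m5")],  # fluctuating light
]
modules = connected_components(frequent_edges(snapshots, min_support=2))
print(sorted(sorted(module) for module in modules))
```

Because the count ignores snapshot order, shuffling the list of conditions gives the same modules, which is the property that methods comparing only adjacent snapshots do not provide.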
Data-driven ontologies to describe plant phenotypes
The sheer volume of data from large-scale phenotyping is now driving the need to manage phenotype information in a way that enables efficient use of the data. A major problem in current phenotype data management is the lack of common semantics across phenomics data sources, which inhibits data integration and comparison across different domains or species. This problem becomes even more challenging as modern phenotyping experiments are often conducted under dynamic environmental conditions over a relatively long period, resulting in even more complicated phenotype data. The key to solving this problem is to develop an appropriate phenotype ontology construction strategy that uses controlled vocabularies to describe different phenotypes in a consistent and structured way.
Ontologies are important tools for structuring information. In biomedical research, ontologies such as the Gene Ontology (GO) and the Human Phenotype Ontology (HPO) are used pervasively for functional description. We have developed a series of ontology construction and mining algorithms for different aims. Our current aim is to automatically construct phenotype ontologies with a logically well-formed structure across different domains and species, allowing for semantically consistent data representation, such that phenotype descriptions can be objectively related to each other and annotated to the corresponding phenotype data. We are developing a new meta-ontology approach that combines terms from multiple standard ontologies, each supporting a particular domain of knowledge, and uses a specified schema to provide the overall logical structure.