What are the major goals of the project?

The overarching goal of the project is to identify genes and processes that control photosynthesis efficiency in response to fluctuating environmental conditions, which are critical for understanding and improving plant energy storage and improving crop productivity. To achieve this, we proposed to discover, develop, and apply Plant Phenomics Data Analytics (PPDA) solutions, such that massive phenomics data is transformed into knowledge or testable hypotheses to identify important genes to improve photosynthesis efficiency under dynamic environmental conditions.

    The proposed research comprises four aims. Aim 1. Develop and apply a phenomics data quality control program to identify abnormal data and distinguish whether they arise from noise, artifacts, or more interesting cases of altered biological responses. Aim 2. Develop and apply phenomics pattern discovery algorithms to identify important energy-related genes from photosynthesis phenomics data. Aim 3. Develop a data visualization package for complex phenomics data display. Aim 4. Provide proof of utility by applying PPDA to generate and test functional (biophysical, biochemical or physiological) hypotheses on photosynthesis efficiency using natural variations. In summary, PPDA will ensure high data quality, identify and visualize important genes from complex plant phenomics data, and advance knowledge discovery in the broader community.

What was accomplished under these goals?

Identification of phenotype modules. We have developed a knowledge-based curve fitting algorithm that aims to identify the complex relationships between phenotypes and environments, thereby capturing both the values and the trends of phenomics data. Our method has three steps. First, it splits the phenotype and environment data into temporal windows. Second, it applies a non-linear curve fitting method to each window and classifies the windows into two groups, reliable and unreliable, according to the fitting results. Third, using the phenotype-environment relationships learned from the reliable windows together with the local data, it estimates the curves for all the unreliable windows. Performance evaluation showed that our method significantly outperforms existing methods. Its application to photosynthesis hysteresis pattern identification reveals new functions of core genes that control photosynthetic efficiency in response to varying environmental conditions, which are critical for understanding plant energy storage and improving crop productivity. The manuscript has been published in Bioinformatics (Yang et al. 2017).
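The three steps above can be illustrated with a minimal sketch. It is not the published algorithm: an ordinary least-squares line stands in for the non-linear fit, R-squared stands in for the reliability criterion, and an unreliable window simply borrows the slope of the nearest reliable window and re-estimates the intercept from its own local data. The names fit_line, fit_windows, window and r2_min are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares line through (x, y); returns (slope, intercept, r_squared)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r2


def fit_windows(env, pheno, window=4, r2_min=0.8):
    """Fit each temporal window, then repair the unreliable windows."""
    fits = []
    # Step 1: split the environment and phenotype series into temporal windows.
    for start in range(0, len(env), window):
        x, y = env[start:start + window], pheno[start:start + window]
        # Step 2: fit each window and classify it by goodness of fit.
        slope, intercept, r2 = fit_line(x, y)
        fits.append({"start": start, "slope": slope,
                     "intercept": intercept, "reliable": r2 >= r2_min})
    # Step 3: re-estimate unreliable windows from reliable ones plus local data.
    reliable = [f for f in fits if f["reliable"]]
    if reliable:
        for f in fits:
            if not f["reliable"]:
                nearest = min(reliable, key=lambda r: abs(r["start"] - f["start"]))
                x = env[f["start"]:f["start"] + window]
                y = pheno[f["start"]:f["start"] + window]
                f["slope"] = nearest["slope"]  # relationship learned elsewhere
                f["intercept"] = mean(y) - nearest["slope"] * mean(x)  # local data
    return fits
```

In this sketch the "knowledge" carried over from reliable windows is only a slope; the actual method learns richer phenotype-environment relationships.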

Dynamic phenotype network construction. We analyzed complex phenomics data by constructing dynamic phenotype networks, which take into consideration multiple types of phenotypes that interact with each other and depend on external factors, such as genotype and environmental conditions. We have developed a framework for partitioning complex, high-dimensional phenotype data into distinct phenotype subnetworks that vary over time. To achieve this, we first represented the measured phenotype data from each genotype as a cloud of points, divided them using a running window approach, and applied our nonparametric clustering algorithm (developed in the previous year) to cluster all the genotypes in every time window. We then developed a continuous subspace clustering algorithm to identify longitudinal phenotype subnetworks, in which two genotypes are connected if they have similar phenotypes over a relatively long temporal period. Finally, we visualized the results using Cytoscape. Compared to conventional network construction approaches, the new method has the advantage that it can capture dynamic subnetworks that change with environmental factors. We demonstrated the utility of the new technique by distinguishing novel phenotypic patterns in both synthetic data and a high-throughput plant photosynthetic phenotype dataset. The paper has been published in XXX (Peng et al 2017).
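The idea of connecting genotypes that stay similar over a long run of time windows can be sketched as follows. This is a deliberately simplified stand-in, not the framework itself: each window is summarized by its mean rather than a cloud of points, and a fixed tolerance replaces the nonparametric and continuous subspace clustering steps. The names window_means, longitudinal_edges, tol and min_run are hypothetical.

```python
from itertools import combinations


def window_means(series, window, step):
    """Mean of each running window over one genotype's phenotype series."""
    return [sum(series[s:s + window]) / window
            for s in range(0, len(series) - window + 1, step)]


def longitudinal_edges(phenotypes, window=3, step=1, tol=0.5, min_run=3):
    """Connect two genotypes if they are similar for >= min_run consecutive windows.

    phenotypes maps a genotype name to a list of measurements (equal lengths).
    Returns a list of (genotype_a, genotype_b, longest_similar_run) edges.
    """
    summaries = {g: window_means(v, window, step) for g, v in phenotypes.items()}
    edges = []
    for a, b in combinations(sorted(summaries), 2):
        run = best = 0
        for ma, mb in zip(summaries[a], summaries[b]):
            # Extend the current run while the pair stays within tolerance.
            run = run + 1 if abs(ma - mb) <= tol else 0
            best = max(best, run)
        if best >= min_run:
            edges.append((a, b, best))
    return edges
```

The resulting edge list could be written out as a simple table for import into a network viewer such as Cytoscape.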

NaVaLigHT (Natural Variation Linkage Hypothesis Testing). Based on our recent results, we have developed a new scientific approach that could be adopted by machine learning methods. We call this method NaVaLigHT (Natural Variation Linkage Hypothesis Testing) because it generates a certain class of hypotheses based on experimental observations of natural variations in phenotypes, and tests these by mapping cause-and-effect relationships onto variations in the genome. The following is a simple example that illustrates NaVaLigHT based on our recent work. Briefly, we explored the effects of low night temperature on daytime photosynthesis in cowpea using a RIL library derived from parent lines that behaved similarly under ambient conditions but responded very differently to low night temperatures. One line (CB27) was found to be very tolerant of low night temperature, while the other (24-125B-1) was quite sensitive. We used DEPI to measure various photosynthetic parameters and noted that 1) there was substantial heritable genetic variation in several photosynthetic parameters in response to low temperature; and 2) CB27 showed much larger diurnal leaf motions, with the leaves resting at a steeper angle (almost parallel to the stem) at night and opening more slowly compared to 24-125B-1. One possible hypothesis is that the cold tolerance of photosynthesis in CB27 is related to its more robust leaf motions.

    From a mechanistic point of view, keeping leaves in the supine position should decrease their exposure to light early in the morning, when the greatest cold-related damage may occur. To test whether the photosynthetic parameters and leaf movements were, indeed, mechanistically connected, we mapped the natural variations in these parameters to a series of QTLs. The variations in photosynthetic responses mapped to two very sharp QTLs, one on chromosome 8 and another on chromosome 11. We observed two clear QTLs for leaf movements: one, associated with the early morning rising of the leaves, was found on chromosome 5, and the second, associated with a swaying motion that is probably related to stem elongation, was found on chromosome 9. Most importantly, neither of these QTLs overlapped the QTLs that mapped to genetic variations in photosynthetic processes, providing a clear answer to a specific hypothesis, i.e. we can conclude that the genetic variation in leaf movements was not functionally or genetically linked to the variations in photosynthetic responses to cold.
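The overlap check at the heart of this argument can be sketched in a few lines. The chromosome numbers follow the text; the interval positions are invented purely for illustration, and the function name overlapping_qtls and the (chromosome, start, end) representation are assumptions rather than part of the project's software.

```python
def overlapping_qtls(qtls_a, qtls_b):
    """Return pairs of QTL intervals that share a chromosome and overlap.

    Each QTL is a (chromosome, start, end) tuple in arbitrary map units.
    """
    hits = []
    for ca, sa, ea in qtls_a:
        for cb, sb, eb in qtls_b:
            # Two closed intervals overlap when each starts before the other ends.
            if ca == cb and sa <= eb and sb <= ea:
                hits.append(((ca, sa, ea), (cb, sb, eb)))
    return hits


# Chromosomes as reported in the text; positions are hypothetical placeholders.
photosynthesis_qtls = [(8, 10.0, 12.0), (11, 40.0, 41.5)]
leaf_movement_qtls = [(5, 22.0, 30.0), (9, 5.0, 15.0)]

shared = overlapping_qtls(photosynthesis_qtls, leaf_movement_qtls)
print("linked candidates:", shared)
```

An empty result corresponds to the conclusion drawn above: no shared locus, and therefore no evidence of genetic linkage between the two sets of traits under these conditions.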

    Note that this analysis does not allow us to exclude the possibility that leaf movements have effects on photosynthesis under other conditions or in other genetic variants. For instance, it may be possible that longer-term exposure to high light in the morning would induce a new effect on photosynthesis that would map to a leaf movement QTL, indicating a possible mechanistic or genetic linkage. We must keep this restriction in mind when interpreting results, but this is also true for any use of the scientific method, i.e. the results are only certain for the specific sets of conditions used in the experiments, but can lead to more general hypotheses that can be validated under a wider range of conditions. In this case, what we can conclude is that the observed variations in leaf movements are (almost certainly) not mechanistically or genetically linked to the observed variations in photosynthetic responses. If such links were present, we should have observed some overlap in the QTLs. Observing such overlap would suggest, but not prove, a causal connection. However, a more rigorous statistical approach is needed to assess how confidently we can test these connections, and we propose to develop these methods in the current year of funding. The basic premise of the approach is that we can generate and test classes of models, using the well-established Occam’s Razor approach to science, in which more complex hypotheses are proposed only after the simplest sets of hypotheses have been proposed and tested. For the testing to be valid, we must develop methods to calculate the probability that our null hypothesis (that there is no causal connection between two processes) is rejected. Given that none of the parameters or their interrelationships are likely to have normal distributions, we are developing nonparametric methods, based on the random forest models already used for QTL analyses.
Potentially, this offers a powerful approach to addressing complex interactions among genes and mechanisms. One can use it to test a broad range of hypothetical linkages as long as 1) there are strong genetic variations in a series of processes; 2) one can adequately and quantitatively measure these processes; and 3) the variations are sufficiently heritable and thus map to specific loci.
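As a rough illustration of what a distribution-free test of "no connection between two processes" can look like, the sketch below runs a permutation test on the correlation between two traits measured across the same lines. This is a generic nonparametric technique chosen for brevity; it is not the random-forest-based method under development, and it tests only covariation across lines, not co-localization of QTLs. The names pearson_r, permutation_p_value, n_perm and seed are hypothetical.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0


def permutation_p_value(trait_a, trait_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the null of no association.

    Shuffling trait_b across lines breaks any real link to trait_a while
    keeping both marginal distributions intact, so no normality is assumed.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(trait_a, trait_b))
    shuffled = list(trait_b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(trait_a, shuffled)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)
```

A small p-value would argue against the null hypothesis of independence; as noted above, it would suggest, but not prove, a causal connection.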

    We have also found that the approach can work for processes that we know are mechanistically linked. In the experiment described above we measured several photosynthetic processes that are known to be mechanistically linked. For example, we observed overlaps in QTLs related to the photosynthetic parameters Phi_II, nonphotochemical quenching (NPQ), qE (rapidly reversible NPQ) and qI (slowly reversible NPQ associated with photodamage or chloroplast movements). Interestingly, individual QTLs appeared in different combinations of phenotypes at different times of exposure to the environmental conditions, probably indicating that the causal relationships were different. For example, on the first day of low temperature exposure, a prominent QTL peak on chromosome 8 appeared first in the qE form of NPQ; an overlapping band then appeared in the qI form and, at about the same time, in the Phi_II parameter. We interpret these results as implying the existence of a genetic variation that causes the cold-sensitive parent to lose photosynthetic capacity in a manner consistent with the expected “classical” regulatory and photoinhibitory responses. The loss of photosynthesis leads first to the buildup of thylakoid proton motive force (pmf), which acidifies the thylakoid lumen and activates the photoprotective qE response. At later time points, when light intensities are higher, the qE response is overwhelmed, resulting in decreases in photochemical efficiency (Phi_II) and increases in photodamage-related quenching. The effects of these processes can be seen in QTLs associated with other photosynthetic parameters, including Phi_NO, which measures the regulatory balance for light capture by the chloroplast antenna system. A different story emerges for a QTL on chromosome 3, which is first seen in the slowly reversible (qI) parameter without a corresponding QTL for Phi_II or qE, leading us to propose a different mechanism, possibly related to light-induced chloroplast movements.

    In terms of bioinformatics, our NaVaLigHT approach is amenable to “automated science”, e.g. machine-learning approaches to developing and testing hypotheses. In the specific example given above, the hypothesis was derived from the observation that the two parent lines displayed distinct behaviors in both photosynthesis and leaf movements. From this observation, a robot scientist can easily generate the hypothesis that these processes have a cause-effect relationship. This hypothesis can then be tested (automatically) by exploring the genetic linkages for both processes.


What do you plan to do during the next reporting period to accomplish the goals?

Using Natural Variations to Generate and Test Functional (Biophysical, Biochemical or Physiological) Hypotheses
Biological systems such as photosynthesis are extremely complex, consisting of myriad functional pathways that are controlled by a dizzying array of environmental sensors and regulatory systems. Although traditional genetics has allowed scientists to dissect many of the core functionalities of photosynthesis, the majority of the regulatory processes are not understood. As a case in point, our recent work suggests that over 90% of the genes that code for chloroplast-targeted proteins do not have functions under laboratory conditions, but are “ancillary” and serve fine-tuning functions under specific sets of environmental conditions. Thus, mutants in these genes show “emergent” phenotypes under non-laboratory conditions. Figuring out what these proteins do using traditional genetics approaches is a daunting task because the interactions amongst the environment, the genetics and the physiology of a plant are hyper-dimensional and involve complex interactions among environmental, developmental, genetic and physiological constraints.

    Recent progress in our NSF project points to a new approach for this type of “hyper-dimensional” problem by exploring the relationships between specific phenotypes in libraries of natural variants. The major finding, which leads us to propose a modified strategy for automated hypothesis testing, comes from analyses of the phenotypes of T-DNA knockout lines of Arabidopsis that are disrupted in chloroplast-targeted nuclear genes. The vast majority of these genes have no identified function. Mutant lines defective in these genes show no strong photosynthetic phenotypes under laboratory conditions, but phenotypes “emerge” as environmental conditions are altered to reflect the types of fluctuations seen in the field. We assessed the degree of fluctuation required for each mutant to show statistically significant changes in photosynthetic parameters (with respect to wild type), and compared this to the evolutionary diversity of the genes. The results show that the core genes for photosynthesis show strong phenotypes even under laboratory conditions and tend to be evolutionarily highly constrained, i.e. they do not appear to have been adapted substantially to local environments. In contrast, the mutants that require stronger environmental fluctuations to show a phenotype tend to be uncharacterized (no known function) and have substantially higher rates of evolution. This makes sense in terms of selection and evolution because the ancillary components are not essential under all conditions and thus are released from strict functional constraints. They therefore can undergo rapid evolutionary changes to modify plants’ responses to specific environments. These components are also important targets for short-term improvements in plant productivity. Does this statement imply that genes with high evolution rates are more likely to have emergent photosynthetic phenotypes under dynamic conditions?
If true, would it be reasonable to prioritize the genes with rapid evolution rates?

    The big question is: how can we determine the functions of these ancillary components given that they may respond to different combinations of environmental, physiological and developmental factors? Our new approach uses “big data” analyses of the natural variations in phenotypes and genotypes to both develop and test mechanistic hypotheses. A good example is shown in the following. The approach uses some of the same tools that plant breeders have developed to map “quantitative trait loci” (QTLs) that condition certain desirable traits. In a typical experiment, plant breeders will phenotype a library of genetically diverse varieties of a crop or model organism. There are several types of libraries that can be used for these approaches. For example, a collection of closely related natural variants can be used to conduct a genome-wide association study (GWAS). A more targeted approach uses recombinant inbred lines (RILs) that arise from crossing two distinct parent lines. There are also hybrid approaches that use multiple parent lines, e.g. nested association mapping (NAM) populations or Multiparent Advanced Generation Intercross (MAGIC) populations. The phenotype outputs are fed into statistical or genetic algorithms that compare the appearance of phenotypes with genetic markers and determine the likely range of locations on chromosomes that account for the observed phenotypes. In most cases, these QTL regions can contain tens to hundreds of genes, and further work is needed to narrow down the identity of the specific genes involved; for basic research questions, this step of unambiguously identifying the effective genes in a QTL is often the most difficult part, though the introduction of CRISPR technology may change this. For practical applications, it is often possible to obtain the desired combinations of traits by introgressing the QTL regions into elite lines, without knowledge of the mechanisms or specific genes involved.
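The step of comparing phenotypes with genetic markers can be illustrated with a toy single-marker scan for a biparental (RIL-like) population. This is a minimal sketch only: real QTL pipelines use interval mapping, LOD scores and significance thresholds, none of which are modeled here. The function name single_marker_scan, the 0/1 allele coding and the example data are all hypothetical.

```python
from statistics import mean


def single_marker_scan(genotypes, phenotype):
    """Absolute difference in mean phenotype between the two parental alleles.

    genotypes maps a marker name to a list of 0/1 allele calls, one per line;
    phenotype is a list of trait values in the same line order.
    Markers carried by only one allele class are skipped.
    """
    effects = {}
    for marker, alleles in genotypes.items():
        group0 = [p for g, p in zip(alleles, phenotype) if g == 0]
        group1 = [p for g, p in zip(alleles, phenotype) if g == 1]
        if group0 and group1:
            effects[marker] = abs(mean(group0) - mean(group1))
    return effects


# Invented example: six lines, two markers, one trait.
markers = {"m1": [0, 0, 1, 1, 0, 1], "m2": [0, 1, 0, 1, 0, 1]}
trait = [1.0, 1.2, 3.1, 2.9, 0.9, 3.0]

effects = single_marker_scan(markers, trait)
peak = max(effects, key=effects.get)
print("marker with the largest allele effect:", peak)
```

The marker with the largest effect marks the candidate region; as described above, such a region may still contain tens to hundreds of genes.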

    The QTL process has key similarities to and differences from traditional (forward or reverse) genetics. Both explore the effects of genetic variations on a phenotype or behavior of the organism. With genetics, the variations are generally loss-of-function changes induced by various types of mutagenesis. With the QTL approach, the differences are likely to have beneficial, gain-of-function effects, at least under the specific environmental conditions where the variants evolved. In addition, QTL mapping can reveal multiple factors that operate independently or interact with other factors and are thus dependent on several genetic components. Our work is taking this approach several steps forward by 1) using our DEPI systems to massively increase the throughput and control of conditions, allowing us to explore a range of environments; 2) measuring a range of phenotypes that reflect processes that are potentially mechanistically related; 3) developing a new statistical framework to generate and assess hypotheses that any two processes are genetically or mechanistically related; and 4) testing the feasibility of using these methods to constrain both biochemical/biophysical and genetic models. In addition to the modules of PPDA that have been developed in the last year, we will evaluate and improve PPDA using user feedback and new phenomics data from the Kramer lab.
