What are the major goals of the project?

The overarching goal of the project is to identify genes and processes that control photosynthesis efficiency in response to fluctuating environmental conditions, which is critical for understanding and improving plant energy storage and crop productivity. To achieve this, we proposed to discover, develop, and apply Plant Phenomics Data Analytics (PPDA) solutions that transform massive phenomics data into knowledge or testable hypotheses, in order to identify genes important for improving photosynthesis efficiency under dynamic environmental conditions.

The proposed research comprises four aims: Aim 1. Develop and apply a phenomics data quality control program to identify abnormal data and distinguish whether they arise from noise, artifacts, or more interesting cases of altered biological responses. Aim 2. Develop and apply phenomics pattern discovery algorithms to identify important energy-related genes from photosynthesis phenomics data. Aim 3. Develop a data visualization package for displaying complex phenomics data. Aim 4. Provide proof of utility by applying PPDA to generate and test functional (biophysical, biochemical, or physiological) hypotheses on photosynthesis efficiency using natural variations. In summary, PPDA will ensure high data quality, identify and visualize important genes from complex plant phenomics data, and advance knowledge discovery in the broader community.

What was accomplished under these goals?

Emergent phenotype discovery

The rapid improvement of phenotyping capability, accuracy, and throughput has greatly increased the volume and diversity of phenomics data. A remaining challenge is to find an efficient way to identify phenotypic patterns, both to improve our understanding of the quantitative variation of complex phenotypes and to attribute gene functions. To address this challenge, we developed a new algorithm to identify emerging phenomena from large-scale temporal plant phenotyping experiments. An emerging phenomenon is defined as a group of genotypes that exhibit a coherent phenotype pattern during a relatively short time. Emerging phenomena are highly transient and diverse, and depend in complex ways on both environmental conditions and development. Identifying emerging phenomena may help biologists examine potential relationships among phenotypes and genotypes in a genetically diverse population and associate such relationships with changes in environment or development. We present an emerging phenomenon identification tool called Temporal Emerging Phenomenon Finder (TEP-Finder). Using large-scale longitudinal phenomics data as input, TEP-Finder first encodes the complicated phenotypic patterns into a dynamic phenotype network. Emerging phenomena at different temporal scales are then identified from the dynamic phenotype network using a maximal-clique-based approach. In addition, a directed acyclic network of emerging phenomena is constructed to model the relationships among them. An experiment comparing TEP-Finder with two state-of-the-art algorithms shows that the emerging phenomena identified by TEP-Finder are more functionally specific, robust, and biologically significant. The manuscript is currently under review at Bioinformatics.
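
The general idea of finding coherent genotype groups with maximal cliques can be sketched as follows. This is a minimal illustration only, not the TEP-Finder implementation: the sliding-window correlation graph, the function names, and the parameters (`width`, `threshold`, `min_size`) are all assumptions made for the example. Each time window yields a graph linking genotypes whose phenotype traces correlate strongly, and the Bron-Kerbosch algorithm enumerates its maximal cliques.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def window_graph(series, start, width, threshold):
    """Adjacency sets linking genotypes whose traces correlate within one time window."""
    adj = {g: set() for g in series}
    for g, h in combinations(series, 2):
        r = pearson(series[g][start:start + width], series[h][start:start + width])
        if r >= threshold:
            adj[g].add(h)
            adj[h].add(g)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the Bron-Kerbosch algorithm (pivoting variant)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def candidate_groups(series, width=4, threshold=0.9, min_size=3):
    """Return (window_start, clique) pairs for every sliding window (hypothetical helper)."""
    length = len(next(iter(series.values())))
    found = []
    for start in range(length - width + 1):
        adj = window_graph(series, start, width, threshold)
        for clique in maximal_cliques(adj):
            if len(clique) >= min_size:
                found.append((start, clique))
    return found
```

Here `series` maps a genotype name to its phenotype time series. A group that is reported only in a few adjacent windows corresponds to the transient behavior described above; how such groups are merged across temporal scales is outside the scope of this sketch.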

Generating and Testing Mechanistic Hypotheses by Genomic-Projection

A major goal of the project is to harness data science tools to both generate and test unbiased scientific hypotheses. These approaches are needed because all human-directed hypothesis testing contains some form of bias, at the very least because we tend to test for what we expect. We have formulated an approach that combines our high-throughput and highly detailed phenotyping capabilities with quantitative genomics to reveal and test for linkages among processes. It is well established that statistical correlations between the appearance of a phenotype and genetic markers can identify regions of the genome that quantitatively contribute to a particular trait.

    In the past, most QTL analyses used bulk or aggregate phenotypes, such as yield or disease symptoms, partly because large numbers of measurements are required. However, the lack of specificity in these measurements makes it difficult to assess the contributions from individual processes. The proposed work takes advantage of high-throughput phenotyping that measures multiple phenotypes simultaneously (Cruz et al., 2016; Kuhlgert et al., 2016), which will allow us to assess linkages between processes and thus test specific hypothetical models. By comparing the QTL profiles for the different processes or phenotypes, we can ask whether, to a reasonable statistical level, the genetic diversity in one process is linked to that of another. By “linked” we mean that the two are either controlled by the same genetic loci or mechanistically related, so that one process influences the other.
This “comparative QTL” approach may allow us to assess the mechanistic bases of natural variations in plant responses, subject to the limitations discussed below.

    Quantitative Trait Loci (QTL) represent a range of genetic components that are statistically associated with the presence of a certain trait (Broman, 2003). Mapping (identifying) QTLs means finding the genomic locations that are associated with phenotypes. QTL mapping has predominantly been used by plant breeders to identify genetic markers for desirable traits, which can then be used to introgress multiple desirable traits into elite production lines. Our goal is to expand the utility of this type of approach, not only to provide useful QTLs that could improve the responses of cowpea varieties under chilling conditions, but also to test specific hypotheses related to environmental effects on photosynthetic processes.

    To conduct a QTL analysis, one first needs a genetically diverse population for which polymorphisms, usually single nucleotide polymorphisms (SNPs), have been mapped to indicate how the lines differ from each other. These populations include recombinant inbred lines (RILs), collections of divergent accessions collected in the field (for genome-wide association studies, GWAS), and various other populations, including nested association mapping (NAM) populations and multiparent advanced generation intercross (MAGIC) populations (Rakshit et al., 2012). These populations are exposed to particular conditions and various phenotypes are recorded. Quantitative or categorical phenotype parameters are then statistically compared to the occurrence of the polymorphisms.
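
The final step, comparing a phenotype with the occurrence of polymorphisms, can be illustrated with a single-marker regression scan. This is a simplified sketch under stated assumptions, not the analysis pipeline used in the project: markers are assumed to be biallelic and coded 0/1 (as in a RIL population), there is no missing data, and the function names are hypothetical. The LOD score uses the standard relationship for a one-predictor linear regression, LOD = -(n/2) log10(1 - r^2).

```python
from math import log10


def marker_lod(genotypes, phenotype):
    """LOD score from regressing a phenotype on one biallelic marker (coded 0/1)."""
    n = len(phenotype)
    mg = sum(genotypes) / n
    mp = sum(phenotype) / n
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotypes, phenotype))
    sgg = sum((g - mg) ** 2 for g in genotypes)
    spp = sum((p - mp) ** 2 for p in phenotype)
    if sgg == 0 or spp == 0:
        return 0.0  # monomorphic marker or constant trait: no information
    # Cap r^2 just below 1 so that log10 never receives zero.
    r2 = min(sgp * sgp / (sgg * spp), 1.0 - 1e-12)
    return -(n / 2) * log10(1.0 - r2)


def genome_scan(marker_matrix, phenotype):
    """Return {marker_name: LOD} for every marker in the panel."""
    return {name: marker_lod(calls, phenotype) for name, calls in marker_matrix.items()}
```

For example, `genome_scan({"m1": [0, 0, 0, 1, 1, 1], "m2": [0, 1, 0, 1, 0, 1]}, [1.0, 1.2, 0.9, 3.1, 2.9, 3.0])` assigns a much higher LOD to `m1`, whose alleles separate the low and high phenotype values. Real analyses typically add interval mapping, covariates, and multiple-testing control, which this sketch omits.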

    Our genome-projection approach compares the QTL patterns that arise from analyses of different phenotypes or behaviors, to assess potential mechanistic bases of natural variations in plant responses. It is critical, though, that the limitations of the approach be carefully considered. For example, an observation that QTLs for two phenotypes do not overlap would strongly indicate that the genetic diversity controlling these processes is not genetically or mechanistically linked, at least in this particular population and under the experimental conditions and timeframe used. In other words, a linkage could exist in another population or under different conditions. Observations of QTL overlaps must also be considered with caution because each QTL may contain multiple genes, i.e. the observed “linkage” is to the entire region, not necessarily to any particular gene. However, such overlaps are good clues to possible linkages. Finally, observation of a potential linkage does not necessarily imply a particular cause-effect relationship, though in certain cases time-resolved QTL measurements can provide insights into mechanisms, e.g. a phenomenon related to the “cause” may appear at an earlier time than those associated with “effects.” It is also possible that a third (or more complex) factor controls multiple linkages.

    As long as these factors are considered, one may use “genome-projection” to test for mechanistic or genetic linkages between processes. Take, for example, the simple case of a two-enzyme metabolic pathway in which A is converted to B by enzyme E1: A -> B. Now imagine that one observes a set of QTLs associated with variations in the accumulation of B. The obvious hypothesis is that these differences are controlled by the activity of E1. The hypothesis can be tested by assessing the probability that there are overlapping QTLs for the activity of E1. If there are none, the hypothesis is rejected and a more complex model must be generated, e.g. that the content of B is also controlled by its degradation to a component X (A -> B -> X), or that the production of A from X is rate-limiting (X -> A -> B). Linkages may also be due to interactions among regulatory genetic processes such as transcription factors, signal cascades, etc., but the logic remains similar.

    To establish this method, we derived an (apparently) novel statistical method for assessing whether QTL bands overlap. To test our genome-projection approach, we measured a wide range of photosynthetic processes and other physiological parameters in a cowpea diversity panel under chilling stress conditions. We found, as expected, strong linkages (overlapping QTLs) among processes known to be mechanistically linked, such as photosynthetic efficiency, the establishment of the thylakoid proton motive force, photodamage, photoprotection, xanthophyll accumulation, the activity of the ATP synthase, and the redox state of electron carriers.
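
The report does not describe the overlap statistic itself, so the following is only a generic stand-in and not the method derived in this project: a permutation test that asks how often randomly relocated QTL intervals would overlap as much as the observed ones. The interval representation `(chrom, start, end)`, the chromosome lengths, and all function names are assumptions made for illustration; coordinates are integers in arbitrary map units.

```python
import random


def shared_length(a, b):
    """Total length shared between two lists of (chrom, start, end) intervals."""
    total = 0
    for chrom_a, s_a, e_a in a:
        for chrom_b, s_b, e_b in b:
            if chrom_a == chrom_b:
                total += max(0, min(e_a, e_b) - max(s_a, s_b))
    return total


def relocate(intervals, chrom_lengths, rng):
    """Place each interval at a random genomic position, keeping its length."""
    chroms = list(chrom_lengths)
    moved = []
    for _, start, end in intervals:
        width = end - start
        # Only chromosomes long enough to hold the interval are eligible,
        # weighted by the number of possible start positions.
        weights = [max(0, chrom_lengths[c] - width + 1) for c in chroms]
        chrom = rng.choices(chroms, weights=weights, k=1)[0]
        new_start = rng.randint(0, chrom_lengths[chrom] - width)
        moved.append((chrom, new_start, new_start + width))
    return moved


def overlap_p_value(qtl_a, qtl_b, chrom_lengths, n_perm=2000, seed=0):
    """One-sided permutation p-value for overlap at least as large as observed."""
    rng = random.Random(seed)
    observed = shared_length(qtl_a, qtl_b)
    hits = sum(
        shared_length(qtl_a, relocate(qtl_b, chrom_lengths, rng)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

A small p-value indicates more overlap than expected from random placement, while two traits whose QTLs do not overlap at all give a p-value of 1. The uniform-placement null ignores marker density and recombination structure, so it should be read as a rough first check rather than a calibrated test.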

    One novel observation is that these effects appeared progressively over time in a way that suggests that, in the more tolerant lines, photoprotection may precede photodamage, whereas in cold-sensitive lines photodamage may precede the onset of photoprotection under these conditions. Most interestingly, we found strong linkages between very specific classes of thylakoid lipids and fatty acids and differences in photosynthetic responses at low temperature. The relative abundance of broad classes of lipids did not show strong phenotypic variations that could be mapped to QTLs, but that of two very specific fatty acids, PG 16:1t (LG 4 and LG 9) and phosphatidylethanolamine/phosphatidylinositol (PE/PI) 18:1 (LG 9), showed strong QTL bands that overlapped those associated with all of the photosynthetic parameters. PG 16:1t showed a negative correlation with photosynthetic performance whereas PE/PI 18:1 showed a positive correlation, suggesting that the two fatty acids may play complementary roles in photosynthesis under cold conditions.

    Considering all of our linkages together suggests a model in which variations in photosynthetic processes control the relative rate of decrease in ATP synthase activity, which in turn leads to the buildup of the thylakoid pmf, followed by acidification of the thylakoid lumen and activation of qE, as proposed for plants under low CO2 or other stress conditions. The apparent linkage between qEt and qIt and the increased electric field component of the pmf suggests possible secondary effects of high pmf on PSII recombination reactions and 1O2 production (Davis et al., 2018). The linkages to lipids occurred as soon as chilling was initiated, suggesting a causative relationship between membrane-related reactions and the effects on photosynthesis, e.g. slow diffusion of plastoquinone, slowed rotation of the thylakoid ATP synthase, etc. We are currently performing new assays of these processes to test for QTL coincidences.

Phenotype data visualization

There is a critical unmet need for new tools to analyze and understand “big data” in the biological sciences, where breakthroughs come from connecting massive genomics data with complex phenomics data. By integrating instant data visualization and statistical hypothesis testing, we have developed a new tool called OLIVER for visual analysis of phenomics data. Its unique feature is that any user adjustment triggers real-time display updates for all affected elements in the workspace. By visualizing and analyzing omics data with OLIVER, biomedical researchers can quickly generate hypotheses and then test them within the same tool, leading to efficient knowledge discovery from complex, multi-dimensional biological data. Applying OLIVER to multiple plant phenotyping experiments has shown that it can facilitate scientific discoveries. In the use case of large-scale plant phenotyping, a quick visualization identified emerging phenomena that are highly transient and heterogeneous. The unique circular heat map with false-color plant images also indicates that such emerging phenomena appear in different leaves under different conditions, suggesting that such previously unseen processes are critical for plant responses to dynamic environments. The paper was published at the BIBM conference (Tessmer et al., 2017).

NaVaLigHT (Natural Variation Linkage Hypothesis Testing)

We continue working on the NaVaLigHT (Natural Variation Linkage Hypothesis Testing) project, a new scientific approach based on the development of machine learning and phenotyping methods in natural environments. We aim to generate a certain class of hypotheses based on experimental observations of natural variations in phenotypes, and to test these by mapping cause-and-effect relationships onto variations in the genome.

    The following is a simple example that illustrates NaVaLigHT based on our recent work. Briefly, we explored the effects of low night temperature on daytime photosynthesis in cowpea using a RIL library derived from parent lines that behaved similarly under ambient conditions but responded very differently to low night temperatures. One line (CB27) was found to be very tolerant of low night temperature, while the other (24-125B-1) was quite sensitive. We used DEPI to measure various photosynthetic parameters and noted that 1) there was substantial heritable, genetic variation in several photosynthetic parameters in response to low temperature; and 2) CB27 showed much larger diurnal leaf motions, with the leaves resting at a steeper angle (almost parallel to the stem) at night and opening more slowly compared to 24-125B-1. One possible hypothesis is that the cold tolerance of photosynthesis in CB27 is related to its more robust leaf motions.

    From a mechanistic point of view, keeping leaves in the supine position should decrease their exposure to light early in the morning, when the greatest cold-related damage may occur. To test whether the photosynthetic responses and leaf movements were indeed mechanistically connected, we mapped the natural variations in these parameters to a series of QTLs. The variations in photosynthetic responses mapped to two very sharp QTLs, one on chromosome 8 and another on chromosome 11. We observed two clear QTLs for leaf movements: one, associated with the early morning rising of the leaves, was found on chromosome 5, and the second, associated with a swaying motion that is probably related to stem elongation, was found on chromosome 9. Most importantly, neither of these QTLs overlapped the QTLs that mapped to genetic variations in photosynthetic processes, providing a clear answer to a specific hypothesis: we can conclude that the genetic variation in leaf movements was not functionally or genetically linked to the variations in photosynthetic responses to cold.

    Notice that this analysis does not allow us to exclude the possibility that leaf movements have effects on photosynthesis under other conditions or in other genetic variants. For instance, it may be possible that longer-term exposure to high light in the morning would induce a new effect on photosynthesis that would map to a leaf movement QTL, indicating a possible mechanistic or genetic linkage. We must keep this restriction in mind when interpreting results, but this is also true for any use of the scientific method, i.e. the results are only certain for the specific sets of conditions used in the experiments, but can lead to more general hypotheses that can be validated under a wider range of conditions. In this case, what we can conclude is that the observed variations in leaf movements are (almost certainly) not mechanistically or genetically linked to the observed variations in photosynthetic responses. If such links were present, we should have observed some overlap in the QTLs.

    Observing such overlap would suggest, but not prove, a causal connection. However, a more rigorous statistical approach is needed to assess how confidently we can test these connections, and we propose to develop these methods in the current year of funding. The basic premise of the approach is that we can generate and test classes of models using the well-established Occam’s Razor approach to science, in which more complex hypotheses are proposed only after the simplest sets of hypotheses have been proposed and tested. For the testing to be valid, we must develop methods to calculate the statistical confidence with which our null hypothesis (that there is no causal connection between two processes) can be rejected. Given that none of the parameters or their interrelationships are likely to have normal distributions, we are developing nonparametric methods based on the random forest models already used for QTL analyses. Potentially, this offers a powerful way to address complex interactions among genes and mechanisms. One can use it to test a broad range of hypothetical linkages as long as 1) there are strong genetic variations in a series of processes; 2) one can adequately and quantitatively measure these processes; and 3) the variations are sufficiently heritable and thus map to specific loci.
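
One widely used distribution-free way to decide whether a marker-trait association is stronger than chance is a permutation threshold: the phenotype values are shuffled across lines many times, and the strongest association found in each shuffled data set forms the null distribution. The sketch below illustrates that generic idea only. It is not the random-forest-based method under development, it uses squared correlation as a simple association statistic, and all names and defaults are assumptions.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation (0.0 if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def max_association(marker_matrix, phenotype):
    """Strongest marker-trait association anywhere in the marker panel."""
    return max(r_squared(calls, phenotype) for calls in marker_matrix.values())


def permutation_threshold(marker_matrix, phenotype, alpha=0.05, n_perm=1000, seed=0):
    """Genome-wide threshold: upper-alpha quantile of the permuted maxima."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    null_maxima = []
    for _ in range(n_perm):
        # Shuffling breaks any real genotype-phenotype link but keeps
        # the marker correlation structure and the trait distribution.
        rng.shuffle(shuffled)
        null_maxima.append(max_association(marker_matrix, shuffled))
    null_maxima.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return null_maxima[index]
```

An observed `max_association` value above the returned threshold would be declared significant at roughly the `alpha` level across the whole panel. Because the null distribution is built from the data themselves, no normality assumption is needed, which is the property the paragraph above calls for.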

    We have also found that the approach can work for processes that we know are mechanistically linked. In the experiment described above we measured several such photosynthetic processes. For example, we observed overlaps in QTLs related to the photosynthetic parameters PhiII, nonphotochemical quenching (NPQ), qE (rapidly reversible NPQ), and qI (slowly reversible NPQ associated with photodamage or chloroplast movements). Interestingly, individual QTLs appeared in different combinations of phenotypes at different times of exposure to the environmental conditions, probably indicating that the causal relationships were different. For example, on the first day of low temperature exposure, a prominent QTL peak on chromosome 8 appeared first for the qE form of NPQ; an overlapping band then appeared for the qI form and, at about the same time, for the PhiII parameter. We interpret these results as implying the existence of a genetic variation that causes the cold-sensitive parent to lose photosynthetic capacity in a manner consistent with the expected “classical” regulatory and photoinhibitory responses. The loss of photosynthesis leads first to the buildup of thylakoid proton motive force (pmf), which acidifies the thylakoid lumen and activates the photoprotective qE response. At later time points, when light intensities are higher, the qE response is overwhelmed, resulting in 1) decreases in photochemical efficiency (PhiII) and 2) increases in photodamage-related quenching. The effects of these processes can be seen in QTLs associated with other photosynthetic parameters, including PhiNPQ and PhiNO, which measure the regulatory balance for light capture by the chloroplast antenna system. A different story emerges for a QTL on chromosome 3, which is first seen in the slowly reversible (qI) parameter without a corresponding QTL for PhiII or qE, leading us to propose a different mechanism, possibly related to light-induced chloroplast movements.
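
The temporal reasoning above, in which the order of QTL appearance hints at the order of events, can be made explicit with a small helper. This is a hypothetical sketch: the function names, the LOD threshold, and the example numbers are invented for illustration and are not measurements from the project.

```python
def onset_times(lod_by_trait, times, threshold):
    """First time point at which each trait's LOD at a locus reaches the threshold."""
    onsets = {}
    for trait, lods in lod_by_trait.items():
        onsets[trait] = next(
            (t for t, lod in zip(times, lods) if lod >= threshold), None
        )
    return onsets


def appearance_order(onsets):
    """Traits that reach the threshold, sorted from earliest to latest onset."""
    detected = [(t, trait) for trait, t in onsets.items() if t is not None]
    return [trait for t, trait in sorted(detected)]


# Invented LOD traces at a single locus, sampled at four time points.
example_times = [1, 2, 3, 4]
example_lods = {
    "qE": [1.0, 4.0, 5.0, 5.0],
    "qI": [0.5, 1.0, 3.5, 4.0],
    "PhiII": [0.0, 1.0, 3.2, 4.0],
    "leaf_angle": [0.0, 0.0, 1.0, 1.0],
}
example_onsets = onset_times(example_lods, example_times, threshold=3.0)
example_order = appearance_order(example_onsets)
```

In this invented example the qE trace crosses the threshold before qI and PhiII, and the leaf-angle trait never does, mirroring the kind of pattern described for the chromosome 8 locus. As noted earlier in the report, an earlier onset is consistent with, but does not prove, a causal role.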


What do you plan to do during the next reporting period to accomplish the goals?


Using Natural Variations to Generate and Test Functional (Biophysical, Biochemical or Physiological) Hypotheses

Biological systems such as photosynthesis are extremely complex, consisting of a myriad of functional pathways that are controlled by a dizzying array of environmental sensors and regulatory systems. Although traditional genetics has allowed scientists to dissect many of the core functionalities of photosynthesis, the majority of the regulatory processes are not understood. As a case in point, our recent work suggests that over 90% of the genes that code for chloroplast-targeted proteins have no apparent function under laboratory conditions, but are “ancillary” and serve fine-tuning functions under specific sets of environmental conditions. Thus, mutants in these genes show “emergent” phenotypes under non-laboratory conditions. Figuring out what these proteins do using traditional genetics approaches is a daunting task because the problem is hyper-dimensional, involving complex interactions among environmental, developmental, genetic, and physiological constraints.

    Recent progress in our NSF project points to a new approach for this type of “hyper-dimensional” problem: exploring the relationships between specific phenotypes in libraries of natural variants. The major finding, which leads us to propose a modified strategy for automated hypothesis testing, comes from analyses of the phenotypes of T-DNA knockout lines of Arabidopsis that are disrupted in chloroplast-targeted nuclear genes. The vast majority of these genes have no identified function. Mutant lines defective in these genes show no strong photosynthetic phenotypes under laboratory conditions, but phenotypes “emerge” as environmental conditions are altered to reflect the types of fluctuations seen in the field. We assessed the degree of fluctuation required for each mutant to show statistically significant changes in photosynthetic parameters (with respect to wild type), and compared this to the evolutionary diversity of the genes. The results show that 1) the core genes for photosynthesis show strong phenotypes even under laboratory conditions and tend to be evolutionarily highly constrained, i.e. they do not appear to have been adapted substantially to local environments; and 2) in contrast, the mutants that require stronger environmental fluctuations to show a phenotype tend to be in uncharacterized genes (no known function) that have substantially higher rates of evolution. This makes sense in terms of selection and evolution because the ancillary components are not essential under all conditions and thus are released from strict functional constraints. They can therefore undergo rapid evolutionary changes to modify plant responses to specific environments. These components are also important targets for short-term improvements in plant productivity. An open question is whether this implies that genes with high evolution rates are more likely to have emergent photosynthetic phenotypes under dynamic conditions and, if so, whether it would be reasonable to prioritize the genes with rapid evolution rates.

    The big question is: how can we determine the functions of these ancillary components given that they may respond to different combinations of environmental, physiological, and developmental factors? Our new approach uses “big data” analyses of the natural variations in phenotypes and genotypes to both develop and test mechanistic hypotheses. A good example is shown in the following. The approach uses some of the same tools that plant breeders have developed to map “quantitative trait loci” (QTLs) that condition certain desirable traits. In a typical experiment, plant breeders will phenotype a library of genetically diverse varieties of a crop or model organism. There are several types of libraries that can be used for these approaches. For example, a collection of closely related natural variants can be used to conduct a genome-wide association study (GWAS). A more targeted approach uses recombinant inbred lines (RILs) that arise from crossing two distinct parent lines. There are also hybrid approaches that use multiple parent lines, e.g. nested association mapping (NAM) populations or multiparent advanced generation intercross (MAGIC) populations. The phenotype outputs are fed into statistical or genetic algorithms that compare the appearance of phenotypes with genetic markers and determine the likely range of locations on chromosomes that account for the observed phenotypes. In most cases, these QTL regions can contain tens to hundreds of genes, and further work is needed to narrow down the identity of the specific genes involved. For basic research questions, this step of unambiguously identifying the effective genes in a QTL is often the most difficult part, though the introduction of CRISPR technology may change this. For practical applications, it is often possible to obtain the desired combinations of traits by introgressing the QTL regions into elite lines, without knowledge of the mechanisms or specific genes involved.

    The QTL process has key similarities and differences compared with traditional (forward or reverse) genetics. Both explore the effects of genetic variations on a phenotype or behavior of the organism. With traditional genetics, the variations are generally loss-of-function changes induced by various types of mutagenesis. With the QTL approach, the differences are likely to have beneficial, gain-of-function effects, at least under the specific environmental conditions where the variants evolved. In addition, QTL mapping can reveal multiple factors that operate independently or interact with other factors and are thus dependent on several genetic components.

    Our work is taking this approach several steps forward by 1) using our DEPI systems to massively increase the throughput and control of conditions, allowing us to explore a range of environments; 2) measuring a range of phenotypes that reflect processes that are potentially mechanistically related; 3) developing a new statistical framework to generate and assess hypotheses that any two processes are genetically or mechanistically related; and 4) testing the feasibility of using these methods to constrain both biochemical/biophysical and genetic models.
