In the Gene Ontology (GO), the "Molecular Function" (MF) category is a widely used knowledge framework for gene function comparison and prediction.
Its structure and annotation provide a convenient way to compare gene functional similarities at the molecular level. Existing gene similarity measures,
however, rely on only one or a few aspects of MF and do not use all of the rich information available, including structure, annotation, common terms, and lowest common parents.
In this project, we introduce InteGO, a rank-based gene semantic similarity measure that synergistically integrates state-of-the-art gene-to-gene measures. By integrating three GO-based seed measures, InteGO significantly improves measurement performance, by about two-fold, in all three species studied (yeast, Arabidopsis and human).
InteGO is divided into two steps: 1) compute similarity scores with every seed measure individually and rank the scores; and 2) integrate the ranks of multiple seed
measures. The framework of InteGO is shown in Figure 1.
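The two steps above can be sketched as follows. This is a minimal illustration in Python, not the code shipped in the InteGO packages. The function names (`normalized_ranks`, `integrate`), the rank convention (rank divided by the number of pairs, so the highest score maps to 1.0, with ties sharing their average rank), and the toy scores are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def normalized_ranks(scores: Dict[Pair, float]) -> Dict[Pair, float]:
    """Step 1: rank the gene pairs of one seed measure.

    Assumed convention: ascending rank divided by the number of pairs,
    so the most similar pair gets 1.0; tied scores share the average rank.
    """
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    ranks: Dict[Pair, float] = {}
    i = 0
    while i < n:
        j = i
        # Extend j over the run of pairs tied with position i.
        while j + 1 < n and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg_rank = ((i + 1) + (j + 1)) / 2.0
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank / n
        i = j + 1
    return ranks


def integrate(
    seed_scores: List[Dict[Pair, float]],
    combine: Callable[[Iterable[float]], float] = max,
) -> Dict[Pair, float]:
    """Step 2: combine the per-seed ranks of each pair (MAX by default)."""
    ranked = [normalized_ranks(s) for s in seed_scores]
    # Only pairs scored by every seed measure are integrated.
    pairs = set(ranked[0])
    for r in ranked[1:]:
        pairs &= set(r)
    return {p: combine(r[p] for r in ranked) for p in pairs}


# Toy scores for three hypothetical seed measures (not real data).
seed_a = {("g1", "g2"): 0.9, ("g1", "g3"): 0.2, ("g2", "g3"): 0.5}
seed_b = {("g1", "g2"): 0.1, ("g1", "g3"): 0.8, ("g2", "g3"): 0.4}
seed_c = {("g1", "g2"): 0.3, ("g1", "g3"): 0.3, ("g2", "g3"): 0.7}

max_sim = integrate([seed_a, seed_b, seed_c], combine=max)
min_sim = integrate([seed_a, seed_b, seed_c], combine=min)
```

Under this convention, MAX keeps a pair highly ranked whenever at least one seed measure ranks it highly, whereas MIN requires every seed measure to agree.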
To systematically evaluate the performance of InteGO, we tested it on three model organisms with different
levels of GO annotation scale and complexity. For each of them, we adopted EC numbers and protein
sequences as independent biological evidence. Fig 2 shows that InteGO(MAX) performs the best among the four
integration methods. In yeast, although almost all of the measures have the same median value, the 25th
percentile of InteGO(MAX) is 5, significantly higher than those of the Yu, Schlicker and Wang measures (1.68, 3.00 and 2.04 respectively) and the other integration methods. In Arabidopsis and human, the medians of MAX are both
5, which is also significantly higher than those of all of the other integration methods. This indicates that
InteGO(MAX), a simple integration approach, improves performance about two-fold. This is because the integration considers all of the aspects of GO, while an individual seed measure, although nicely designed, is limited in that it focuses on only one or a few kinds of knowledge in GO. The other integration measures, especially MIN, however, do not noticeably improve gene similarity performance. As shown in Fig 2 (c), the result of MIN is even worse than the seed measures.
This indicates that the performance of gene-to-gene
similarity can be significantly improved only with an appropriate integration method.
Furthermore, each seed measure has its own favorable EC groups. To test whether InteGO(MAX) takes advantage of the strengths of all of the seed measures, we compared InteGO(MAX) with the Yu,
Schlicker and Wang measures on all of the ECs. In detail, Fig 3 shows that InteGO(MAX) performs the best in 140 and 172 out of
325 and 315 ECs in Arabidopsis and human respectively, while the numbers are only 2, 9, 6 in Arabidopsis
and 2, 2, 1 in human for the Yu, Wang and Schlicker measures respectively. In summary, the experiment
indicates that integrating multiple measures can improve the performance of gene similarity measurement
and that MAX is the best integration method.
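A per-EC tally like the one reported above could be computed along the following lines. This is only a sketch under stated assumptions: the function name `tally_best`, the EC group labels, and the scores are hypothetical. It also assumes that a measure counts as "best" for an EC group only when it strictly beats every other measure, so tied groups are credited to no one. The page does not state how ties were handled.

```python
from collections import Counter
from typing import Dict


def tally_best(per_ec_scores: Dict[str, Dict[str, float]]) -> Counter:
    """Count, per measure, the EC groups where it strictly outperforms all others."""
    wins: Counter = Counter()
    for scores in per_ec_scores.values():
        top = max(scores.values())
        leaders = [m for m, s in scores.items() if s == top]
        if len(leaders) == 1:  # assumed rule: ties are credited to no measure
            wins[leaders[0]] += 1
    return wins


# Hypothetical per-EC performance values (not real results).
toy = {
    "EC_A": {"InteGO(MAX)": 5.0, "Yu": 2.0, "Wang": 3.0, "Schlicker": 3.0},
    "EC_B": {"InteGO(MAX)": 4.0, "Yu": 4.0, "Wang": 1.0, "Schlicker": 2.0},  # tie
    "EC_C": {"InteGO(MAX)": 2.0, "Yu": 1.0, "Wang": 3.5, "Schlicker": 1.5},
}
wins = tally_best(toy)
```

With a strict-win rule like this, the per-measure counts need not add up to the total number of EC groups.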
Supporting Package: JUNG library. A software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.
Package: InteGOyeast.jar. The main package to compute gene-gene similarity on yeast.
Experiment data: yeastdata.zip. All the sample data used in yeast experiments.
Package: InteGOarabidopsis.jar. The main package to compute gene-gene similarity on Arabidopsis.
Experiment data: arabidopsisdata.zip. All the sample data used in Arabidopsis experiments.
Package: InteGOhuman.jar. The main package to compute gene-gene similarity on human.
Experiment data: humandata.zip. All the sample data used in human experiments.
If you have any questions, please contact Jiajie Peng via jiajiepeng@hit.edu.cn