InteGO: Towards Integrative Gene Functional Similarity Measurement

MSU-DOE Plant Research Lab, Michigan State University

In Gene Ontology, the "Molecular Function" (MF) categorization is a widely used knowledge framework for gene function comparison and prediction. Its structure and annotation provide a convenient way to compare gene functional similarities at the molecular level. Existing gene similarity measures, however, rely on only one or a few aspects of MF and do not use all of the rich information available, including structure, annotation, common terms, and lowest common parents.

In this project, we introduce a rank-based gene semantic similarity measure called InteGO, which synergistically integrates state-of-the-art gene-to-gene measures. By integrating three GO-based seed measures, InteGO improves measurement performance about two-fold in all three species studied (yeast, Arabidopsis and human). InteGO works in two steps: 1) compute similarity scores with every seed measure individually and rank the scores; and 2) integrate the ranks of the multiple seed measures. The framework of InteGO is shown in Figure 1.

[Figure 1: Framework of InteGO (mini-example)]
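The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the InteGO implementation: the page does not say how ranks are normalized or how ties are handled, so the sketch assumes ranks are scaled to (0, 1] with higher meaning more similar, and that tied scores share their average rank. The names normalized_ranks and integrate and the toy scores are hypothetical.

```python
def normalized_ranks(scores):
    """Step 1: turn one seed measure's scores into ranks scaled to (0, 1].

    Assumption: higher score = more similar, and tied scores share
    their average rank. The original normalization is not documented here.
    """
    pairs = sorted(scores, key=scores.get)
    n = len(pairs)
    ranks = {}
    i = 0
    while i < n:
        j = i
        # Extend j over the run of pairs tied with pairs[i].
        while j + 1 < n and scores[pairs[j + 1]] == scores[pairs[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[pairs[k]] = average_rank / n
        i = j + 1
    return ranks


def integrate(seed_scores, combine=max):
    """Step 2: combine the per-measure ranks of each gene pair.

    combine=max corresponds to a MAX-style integration and combine=min
    to a MIN-style one (assumed semantics).
    """
    ranked = [normalized_ranks(s) for s in seed_scores.values()]
    return {pair: combine(r[pair] for r in ranked) for pair in ranked[0]}


# Made-up toy scores for three gene pairs under the three seed measures.
seed_scores = {
    "Yu": {("g1", "g2"): 0.9, ("g1", "g3"): 0.2, ("g2", "g3"): 0.5},
    "Wang": {("g1", "g2"): 0.4, ("g1", "g3"): 0.8, ("g2", "g3"): 0.6},
    "Schlicker": {("g1", "g2"): 0.7, ("g1", "g3"): 0.1, ("g2", "g3"): 0.3},
}
integrated_max = integrate(seed_scores, combine=max)
integrated_min = integrate(seed_scores, combine=min)
```

Working on ranks rather than raw scores keeps the seed measures comparable even when their score distributions differ, which is the usual motivation for a rank-based combination.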

To systematically evaluate the performance of InteGO, we tested it on three model organisms with different levels of GO annotation scale and complexity. For each organism, we used EC numbers and protein sequences as independent biological evidence. Fig 2 shows that InteGO(MAX) performs the best among the four integration methods. In yeast, although almost all of the measures have the same median value, the 25th percentile of InteGO(MAX) is 5, significantly higher than that of the Yu, Schlicker and Wang measures (1.68, 3.00 and 2.04 respectively) and of the other integration methods. In Arabidopsis and human, the median of MAX is 5 in both cases, which is also significantly higher than that of all of the other integration methods. This indicates that InteGO(MAX), a simple integration approach, improves performance about two-fold. This is because the integration considers all of the aspects of GO, whereas an individual seed measure, although well designed, is limited in that it focuses on only one or a few kinds of knowledge in GO. The other integration methods, especially MIN, however, do not clearly improve gene similarity performance. As shown in Fig 2 (c), the result of MIN is even worse than that of the seed measures. This indicates that gene-to-gene similarity performance can be significantly improved only with an appropriate integration method.

[Figure 2: Performance of the seed measures and the four integration methods]

Furthermore, each seed measure has its own favorable EC groups. To test whether InteGO(MAX) takes advantage of the strengths of all of the seed measures, we compared InteGO(MAX) with the Yu, Schlicker and Wang measures on all of the ECs. In detail, Fig 3 shows that InteGO(MAX) performs the best in 140 of 325 ECs in Arabidopsis and 172 of 315 ECs in human, while the numbers are only 2, 9 and 6 in Arabidopsis and 2, 2 and 1 in human for the Yu, Wang and Schlicker measures respectively. In summary, the experiment indicates that integrating multiple measures can improve the performance of gene similarity measurement and that MAX is the best integration method.

[Figure 3: Number of ECs on which each measure performs best in Arabidopsis and human]
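The per-EC comparison can be illustrated with a short Python sketch that counts, for each measure, the ECs on which it scores strictly best. It rests on assumptions: a higher per-EC score is taken to mean better performance, and an EC on which two or more measures share the top score is counted as a tie rather than credited to any measure (the reported counts do not add up to the total number of ECs, which would be consistent with this, but the page does not state the rule). The function name count_best and the toy scores are hypothetical.

```python
from collections import Counter


def count_best(per_ec_scores):
    """Count the ECs on which each measure is the unique top performer.

    per_ec_scores maps an EC number to {measure name: score}.
    Assumption: higher score = better; shared top scores count as "tie".
    """
    wins = Counter()
    for scores in per_ec_scores.values():
        best = max(scores.values())
        winners = [name for name, value in scores.items() if value == best]
        wins[winners[0] if len(winners) == 1 else "tie"] += 1
    return wins


# Made-up per-EC scores, for illustration only.
toy_scores = {
    "1.1.1.1": {"InteGO(MAX)": 5.0, "Yu": 2.0, "Wang": 3.0, "Schlicker": 1.0},
    "2.7.1.1": {"InteGO(MAX)": 4.0, "Yu": 4.0, "Wang": 3.0, "Schlicker": 2.0},
    "3.1.3.2": {"InteGO(MAX)": 2.0, "Yu": 1.0, "Wang": 3.5, "Schlicker": 3.0},
}
wins = count_best(toy_scores)
```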



InteGO User Manual

InteGO was developed on a Windows 7 (x64) computer and is implemented on JDK 1.6 and the JUNG library. The JAR files are platform independent (tested on Windows 7). In this version, we integrated three measures: Yu, Wang and Schlicker. To run the JAR file, users must prepare the input files and place them, together with the JUNG library files, in the same folder as the JAR file. The background gene set file, named genes.txt, is provided in the "data" folder. The "data" folder includes all the data used in our experiments. The JUNG library files are in the "lib" folder.

Usage: to compute gene-to-gene similarities for a set of genes using InteGO, run the following command on the command line. The output of the program is saved in the file out.txt.

java -jar InteGOyeast.jar <genelistFile>

Example:

Run the command "java -jar InteGOyeast.jar data\genes.txt" to compare all the genes listed in the input file data\genes.txt.
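For batch use, the command above can be wrapped in a small Python script. This is a convenience sketch, not part of InteGO: build_command and run_intego are hypothetical helper names, and the sketch assumes that out.txt is written to the folder the command is run from, which the manual does not state explicitly.

```python
import subprocess
from pathlib import Path


def build_command(jar, gene_list):
    """Return the documented command line as an argument list."""
    return ["java", "-jar", str(jar), str(gene_list)]


def run_intego(jar="InteGOyeast.jar", gene_list=Path("data") / "genes.txt", workdir="."):
    """Run InteGO from workdir and return the expected output path.

    Assumptions: the JAR, the JUNG library files and the input file are
    already in place under workdir, and out.txt is written to workdir.
    """
    workdir = Path(workdir)
    for required in (workdir / jar, workdir / gene_list):
        if not required.exists():
            raise FileNotFoundError(required)
    # check=True raises CalledProcessError if java exits with an error.
    subprocess.run(build_command(jar, gene_list), cwd=workdir, check=True)
    return workdir / "out.txt"


# Example call (requires Java and the InteGO files in the current folder):
# output_file = run_intego()
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.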



InteGO Download

Supporting Package: JUNG library. A software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.

InteGO yeast Download

Package: InteGOyeast.jar. The main package to compute gene-gene similarity on yeast.

Experiment data: yeastdata.zip. All the sample data used in yeast experiments.

InteGO Arabidopsis Download

Package: InteGOarabidopsis.jar. The main package to compute gene-gene similarity on Arabidopsis.

Experiment data: arabidopsisdata.zip. All the sample data used in Arabidopsis experiments.

InteGO human Download

Package: InteGOhuman.jar. The main package to compute gene-gene similarity on human.

Experiment data: humandata.zip. All the sample data used in human experiments.

 

If you have any questions, please contact Jiajie Peng via jiajiepeng@hit.edu.cn