CroGO: Identifying Cross-category Relations in Gene Ontology and Constructing Genome-specific Term Association Networks

MSU-DOE Plant Research Lab, Michigan State University

Gene Ontology (GO) has been widely used in biological databases, annotation projects, and computational analyses. Although the three GO categories are structured as independent ontologies, the biological relationships across categories matter for biological reasoning and knowledge integration. However, existing cross-category term similarity measures either rely on GO data alone or are based on manually curated term-name similarities, ignoring the fact that GO is evolving quickly and gene annotations are far from complete.

In this project, we introduce a new cross-category similarity measure called CroGO that incorporates genome-specific gene co-function network data. A performance study showed that our measure outperforms the existing algorithms. We also generated genome-specific term association networks for yeast and human. An enrichment-based test showed that our networks are better than those generated by the other measures.

Conclusions: The genome-specific term association networks constructed using CroGO provide a platform for more consistent use of GO. In the networks, the frequently occurring MF-centered hubs indicate that a molecular function may be shared by different genes in multiple biological processes, or that a set of genes with the same function may participate in distinct biological processes. Common subgraphs across multiple organisms also reveal conserved GO term relationships.

A mini-example is shown in Figure 1.

[Figure 1: mini-example]

Genome-specific MF-BP Association Networks

Network G_yeast has 613 MF terms, 843 BP terms, and 1,485 edges between them. As shown in the figure below, the yeast association network consists of many small disconnected graphs.

[Figure: yeast MF-BP association network]

Network G_human has 1,209 MF terms, 2,250 BP terms, and 5,138 edges between them, of which 1,583 edges connect terms that share no annotated genes.

[Figure: human MF-BP association network]
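The structural observations above (many small disconnected graphs, MF-centered hubs) can be checked on any MF-BP edge list. The following is a minimal sketch, assuming the network is available as a list of (MF term, BP term) pairs; the function names and the toy edges are hypothetical and are not part of CroGO or its data files.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components (sets of term IDs) of an undirected edge list."""
    adjacency = defaultdict(set)
    for mf_term, bp_term in edges:
        adjacency[mf_term].add(bp_term)
        adjacency[bp_term].add(mf_term)

    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects every term reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def degree_by_term(edges):
    """Count how many edges touch each term; high-degree MF terms are hub candidates."""
    degree = defaultdict(int)
    for mf_term, bp_term in edges:
        degree[mf_term] += 1
        degree[bp_term] += 1
    return dict(degree)


# Hypothetical toy edges with placeholder IDs (not real network data).
toy_edges = [("MF:a", "BP:1"), ("MF:a", "BP:2"), ("MF:a", "BP:3"), ("MF:b", "BP:4")]
components = connected_components(toy_edges)
degrees = degree_by_term(toy_edges)
```

In the toy data, "MF:a" plays the role of an MF-centered hub linked to three BP terms, while "MF:b" and "BP:4" form a separate small graph.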

CroGO Method

CroGO measures the similarity between terms in different GO categories in three steps. First, the association between the two sets of genes annotated to any two given GO terms is calculated. Second, the gene annotations and gene set associations are integrated to calculate the pair-wise term similarity. Third, the directions of all the pair-wise term relationships are inferred with a GO structure based approach. Please refer to the paper for the detailed method.

We compared the performance of CroGO with the existing measures against confirmed biological knowledge, using a small gold-standard set based on the known reaction-to-pathway relationships in yeast. We calculated pair-wise term similarities for the term pairs in the gold-standard set and the term pairs in a random set using CroGO, and compared its performance with the ASR and VSM based measures by drawing a receiver operating characteristic (ROC) curve for each measure. The ROC curves in Figure 2 show clearly that CroGO has the best performance. At a false positive rate of 15%, the true positive rate of CroGO is 88%, while the true positive rates of the ASR and VSM based measures are both 83%. This analysis also showed that CroGO recognized 102 more MF-BP pairs than the ASR and VSM based measures when the number of true positives equals the number of false positives. This indicates that, by incorporating the co-function network, CroGO achieves better coverage than the other measures by recognizing more associations between genes that are annotated to the connected GO terms in the gold-standard set. In addition, the same experiments were applied to human data, and the results were consistent with those for yeast.
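The comparison above reads a true positive rate off a ROC curve at a fixed false positive rate. A minimal sketch of that calculation follows, assuming similarity scores are already available for the gold-standard pairs (positives) and the random pairs (negatives). The function names and the toy scores are hypothetical and do not reproduce the experiment's data.

```python
def roc_points(positive_scores, negative_scores):
    """Return (false positive rate, true positive rate) points for every score threshold."""
    thresholds = sorted(set(positive_scores) | set(negative_scores), reverse=True)
    points = [(0.0, 0.0)]
    for threshold in thresholds:
        # A pair is predicted as related when its score reaches the threshold.
        true_positives = sum(1 for s in positive_scores if s >= threshold)
        false_positives = sum(1 for s in negative_scores if s >= threshold)
        points.append((false_positives / len(negative_scores),
                       true_positives / len(positive_scores)))
    return points


def tpr_at_fpr(points, max_fpr):
    """Highest true positive rate among points whose false positive rate is <= max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Hypothetical toy scores (not results from the paper).
gold_standard_scores = [0.9, 0.8, 0.7, 0.4]
random_set_scores = [0.6, 0.3, 0.2, 0.1]
points = roc_points(gold_standard_scores, random_set_scores)
```

Calling `tpr_at_fpr(points, 0.15)` on real scores would give the kind of figure quoted above for a 15% false positive rate.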

[Figure 2: ROC curves for CroGO and the ASR and VSM based measures]

CroGO User Manual

CroGO was developed on a Windows 7 (x64) computer and implemented with JDK 1.6 and the JUNG library. The JAR file CroGO.jar is platform independent (tested on Windows 7 and CentOS release 5.4).

To run the JAR file, prepare the input files and place them, together with the JUNG library files, in the same folder as the JAR file. Sample input files for human and yeast are provided in the “data” folder and include all the data needed in our experiments; the JUNG library files are in the “lib” folder.

Usage: to compute cross-category GO term-term similarities, run the following command on the command line:

java -jar CroGO.jar < organism name > < MFtermID > < BPtermID >

Where “organism name” is either “yeast” or “human”, “MFtermID” is the term ID of an MF term, and “BPtermID” is the term ID of a BP term.

The output of the program is:

Similarity = < similarity score >

Where “similarity score” is the cross-category GO term-term similarity score.

Example: to calculate the similarity between MF term GO:0004652 and BP term GO:0043629 based on the yeast co-function network, the command is:

java -jar CroGO.jar yeast 0004652 0043629

The output is:

Similarity = 0.9970899415729219
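For scoring many term pairs, the command can be driven from a script. The sketch below is a hypothetical wrapper, not part of the CroGO package: it builds the documented command line and parses the documented "Similarity = ..." output. It assumes that `java` is on the PATH, that the script runs from the folder containing CroGO.jar with its data and library files, and that term IDs are passed without the "GO:" prefix, as in the example above.

```python
import re
import subprocess


def build_command(organism, mf_term_id, bp_term_id, jar_path="CroGO.jar"):
    """Build the documented command: java -jar CroGO.jar <organism> <MFtermID> <BPtermID>."""
    if organism not in ("yeast", "human"):
        raise ValueError("organism must be 'yeast' or 'human'")

    def bare_id(term_id):
        # Assumption: the tool expects bare numeric IDs, as in the example command.
        return term_id.replace("GO:", "").strip()

    return ["java", "-jar", jar_path, organism, bare_id(mf_term_id), bare_id(bp_term_id)]


def parse_similarity(output):
    """Extract the score from a line of the form 'Similarity = <similarity score>'."""
    match = re.search(r"Similarity\s*=\s*([-+0-9.eE]+)", output)
    if match is None:
        raise ValueError("no 'Similarity = ...' line found in the output")
    return float(match.group(1))


def crogo_similarity(organism, mf_term_id, bp_term_id, jar_path="CroGO.jar"):
    """Run CroGO once and return the similarity score as a float."""
    command = build_command(organism, mf_term_id, bp_term_id, jar_path)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_similarity(result.stdout)
```

With the files in place, `crogo_similarity("yeast", "GO:0004652", "GO:0043629")` would issue the same command as the example above.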

CroGO Download

Package: CroGO.jar. The main package for computing cross-category term-term similarity.

Supporting Package: JUNG library. A software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.

Experiment data: CroGOdata.zip. All the sample data used in our experiments.

MF-BP networks: Yeast network and Human network

How to cite CroGO

If you use CroGO or the MF-BP networks generated by CroGO, please cite:

Peng J, Chen J and Wang Y, Identifying Cross-category Relations in Gene Ontology and Constructing Genome-specific Term Association Networks. BMC Bioinformatics (special issue for selected papers presented at the 11th Asia-Pacific Bioinformatics Conference) 2012

If you have any questions, please contact Jiajie Peng via pengjj@msu.edu.