Computational approaches to identifying gene regulatory networks in Arabidopsis thaliana

This work was supported by grants from the BBSRC Exploiting Genomics Initiative (EGM16126, EGM16127, EGM16128) to Mike Bevan, Gavin Cawley (UEA) and Sean May (Nottingham). The grants brought together three groups to establish a multidisciplinary approach to the analysis of gene expression data and the prediction of promoter element functions in the model plant Arabidopsis. Our premise was that the cis-regulatory code of Arabidopsis could be deciphered using compendia of whole-genome gene expression data, the sequences of promoter regions, and appropriate mathematics. We used three inputs to interpret promoter sequences: sets of Affymetrix ATH1 expression data, the predicted promoter regions of genes, and a careful manual analysis of gene function.


We set out to identify small sequence motifs that provide discriminatory information for predicting the expression patterns of genes. These motifs are candidate transcriptional regulatory elements that can then be tested experimentally. We wanted to apply emerging mathematical approaches that can handle large datasets and cope with noise (both important in biology) and that have established real-life applications. Machine learning is a suitable approach: it is used for text categorisation, face detection and other pattern recognition tasks. We reasoned that recognising transcription factor binding sites in promoter sequences is a related pattern recognition problem. We used machine learning methods (specifically a Relevance Vector Machine, or RVM) to establish a decision rule from a training set of genes with known expression responses. The algorithm was applied to the whole data set using ten-fold cross-validation to estimate its performance. We also applied a discriminative Bayesian approach that automatically determines the complexity of the model (i.e. the number of features to select for optimal prediction), together with kernel features that allow multiple promoter elements, and their frequency of occurrence in promoters, to form the decision rule. When applied to different sets of Affymetrix data, the algorithm achieved an error rate of 26.16% in predicting gene expression responses and an AUROC (area under the receiver operating characteristic curve) of 0.7767. In biologists' terms, the algorithm correctly predicts the expression pattern of about 74% of genes in the test set (the complement of the error rate). This is a useful level of accuracy and provides candidate promoter elements that can be used to establish regulatory networks. One such network is shown below.
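The error rate and the AUROC quoted above measure different things, which a small worked example may make clearer. The sketch below is illustrative only: the labels and scores are made up, the 0.5 threshold is an assumption, and the function names are hypothetical rather than part of BLOGREG or BRED.

```python
def error_rate(labels, scores, threshold=0.5):
    """Fraction of genes whose thresholded prediction disagrees with the label."""
    wrong = sum((s >= threshold) != bool(y) for y, s in zip(labels, scores))
    return wrong / len(labels)


def auroc(labels, scores):
    """Probability that a random positive gene outscores a random negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 1 = gene responds to the treatment, 0 = it does not.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Made-up classifier outputs (higher = more likely to respond).
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

print(error_rate(labels, scores))  # 0.25   (2 of 8 genes misclassified)
print(auroc(labels, scores))       # 0.8125 (13 of 16 positive/negative pairs ranked correctly)
```

In this toy case the classifier is right for 75% of genes, while the AUROC is 0.8125. The error rate depends on the chosen threshold; the AUROC summarises how well the scores rank responsive genes above non-responsive ones across all thresholds.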

Figure: Cartoon of a predicted transcriptional regulatory network controlled by glucose and ABA. The inputs to the model were a careful functional categorisation of regulated genes, promoter elements with a significant association with glucose- or ABA-regulated gene expression, and knowledge of transcription factor binding sites. The model gives rise to several hypotheses about candidate transcription factors.



This work is described more fully in the following publication:

Yunhai Li et al. (2006) Genome Res. 16: 414-427 (link to pdf)





Access to protocols and data

The machine learning methods, developed by Gavin Cawley and colleagues, are collectively called BLOGREG. The software is freely available under the GNU General Public Licence (http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/). To promote the use of the resources developed in this project, the RVM has been implemented as an online tool called BRED (Bayesian Regulatory Element Detection: http://theoval.cmp.uea.ac.uk/~gcc/cbl/bred/).

The promoter database is drawn from the TAIR v.6 Arabidopsis annotation and is provided as the two Excel spreadsheets below. The first file contains 1000 bp upstream of the predicted ATG initiation codon, and the second contains 1000 bp upstream of the transcriptional start site, where known from full-length cDNA sequences.

Link to Excel file: ("1000_bp_up_from_ATG_and_uppercase_gene_region")

Link to Excel file: ("1000_bp_up_from_UTR_and_down_to_ATG_and_uppercase_gene_region")
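For readers who want to build comparable promoter sequences from their own annotation, the following is a minimal sketch of strand-aware upstream extraction. It is not the procedure used to produce the spreadsheets above. The function names, the 0-based end-exclusive coordinates and the toy sequence are assumptions, and the lowercase-flank/uppercase-gene convention is inferred from the file names only.

```python
# Translation table for complementing DNA, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_and_gene(chrom, start, end, strand, flank=1000):
    """Return the upstream flank (lowercase) followed by the gene region (uppercase).

    start/end are 0-based, end-exclusive forward-strand coordinates of the
    gene region (an assumed convention, not taken from TAIR).
    """
    if strand == "+":
        up = chrom[max(0, start - flank):start]
        gene = chrom[start:end]
    else:
        # For a reverse-strand gene, "upstream" lies to the right of the
        # gene on the forward sequence and must be reverse-complemented.
        up = reverse_complement(chrom[end:end + flank])
        gene = reverse_complement(chrom[start:end])
    return up.lower() + gene.upper()


# Toy chromosome with one forward gene (ATGCCC) and one reverse gene (CAACAT).
chrom = "GGTTATGCCCAACATTTGG"
print(upstream_and_gene(chrom, 4, 10, "+", flank=4))   # ggttATGCCC
print(upstream_and_gene(chrom, 9, 15, "-", flank=4))   # ccaaATGTTG
```

Near a chromosome end the flank is simply shorter than requested, so downstream code should not assume every promoter is exactly 1000 bp long.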

The Affymetrix array data are available from the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress/) under accession E-MXP-475.

A curated set of genes regulated by glucose (one of the experimental conditions used) is available in the following spreadsheet and tables:

Link to Excel file of glucose regulated genes in Arabidopsis

Link to Word table: Glucose-regulated growth genes

Link to Word table: Glucose-regulated light and circadian regulated genes

Link to Word table: Glucose-regulated regulatory genes