Computational approaches to identifying gene regulatory networks in Arabidopsis thaliana
This work was supported by grants from the BBSRC Exploiting Genomics Initiative (EGM16126, EGM 16127, EGM 16128) to Mike Bevan, Gavin Cawley (UEA) and Sean May (Nottingham). The grant brought together three groups to establish a multidisciplinary approach to analysing gene expression data and predicting promoter element function in the model plant Arabidopsis. Our premise was that the cis-regulatory code of Arabidopsis can be deciphered by combining compendia of whole-genome gene expression data, the sequences of promoter regions, and appropriate mathematics. We used three inputs to interpret promoter sequences: sets of Affymetrix ATH1 expression data, the predicted promoter regions of genes, and a careful manual analysis of gene function.
We set out to identify small sequence motifs that
provide discriminatory information for predicting the expression patterns of
genes. These sequence motifs are candidate transcriptional regulatory motifs
that can then be tested experimentally. We wanted to apply emerging
mathematical approaches that can deal with large datasets and cope with noise
(both useful properties in biology) and that have established real-life
applications. Machine learning is a suitable approach: it is used for text
categorisation, face detection and other pattern recognition tasks. We reasoned
that recognising transcription factor binding sites in promoter sequences was a
related pattern recognition problem. We used machine learning methods
(specifically a Relevance Vector Machine, or RVM) to establish a decision rule
based on a training set of known examples of expression. The algorithm was
applied to the whole data set using ten-fold cross-validation to estimate its
performance. We also applied a discriminative Bayesian approach to determine
automatically the complexity of the model (i.e. the number of features to
select for optimal prediction), and used kernel features that allow multiple
promoter elements and their frequency of occurrence in promoters to form the
decision rule. When
applied to different sets of Affymetrix data, the algorithm achieved an error
rate of 26.16% in predicting gene expression responses and an AUROC (area under
the ROC curve) score of 0.7767, a useful level of skill. In biologists' terms,
the algorithm correctly predicts the expression response of roughly three
quarters of the genes in the test set. This level of accuracy provides
candidate promoter elements that can be used to establish regulatory networks.
One such network is shown.
Cartoon of a predicted transcriptional regulatory network controlled by glucose and ABA. The inputs to the model were a careful functional categorisation of regulated genes, promoter elements with a significant association with glucose- or ABA-regulated gene expression, and knowledge of transcription factor binding sites. The model gives rise to several hypotheses about candidate transcription factors.
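The feature counting and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the BLOGREG implementation and contains no classifier: it only shows how motif occurrence counts might be derived from a promoter sequence, how genes can be dealt into ten cross-validation folds, and how an error rate and an AUROC score are computed from predictions. The function names, the motif length k=6 and the random seed are assumptions made for this illustration.

```python
import random


def kmer_counts(promoter, k=6):
    """Count overlapping occurrences of each k-mer in one promoter sequence.

    k=6 is an illustrative choice, not a value taken from the project.
    """
    seq = promoter.upper()
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other codes
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def cross_validation_folds(n_items, n_folds=10, seed=0):
    """Shuffle item indices and deal them into n_folds disjoint test folds."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return [order[f::n_folds] for f in range(n_folds)]


def error_rate(labels, predictions):
    """Fraction of genes whose predicted class differs from the observed one."""
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)


def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In each round, one fold would serve as the test set and the remaining nine as the training set. The error rate depends on the threshold used to turn scores into class predictions, whereas the AUROC measures how well the scores rank regulated genes above unregulated ones, which is why the two figures are reported separately.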
This work is described more fully in the following publication:
Yunhai Li et al. Genome Res. 16: 414-427 (2006)
Access to protocols and data
The machine learning methods, developed by Gavin Cawley and colleagues, are
collectively called BLOGREG. The software is freely available under the GNU
General Public Licence from http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/.
To promote the use of the resources developed in this project, the RVM has been
implemented as an online tool called BRED (Bayesian Regulatory Element
Detection, http://theoval.cmp.uea.ac.uk/~gcc/cbl/bred/).
The promoter database is drawn from the TAIR v.6 Arabidopsis annotation and is
provided as two Excel spreadsheets below. The first file contains 1000 bp
upstream of the predicted ATG initiation codon, and the second contains 1000 bp
upstream of the transcriptional start site, where known from full-length cDNA
sequences.
Link to Excel file: ("1000_bp_up_from_ATG_and_uppercase_gene_region")
Link to Excel file: ("1000_bp_up_from_UTR_and_down_to_ATG_and_uppercase_gene_region")
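As a rough illustration of what "1000 bp upstream of the ATG" means for genes on either strand, the extraction can be sketched in Python as follows. This is not the script used to build the spreadsheets. The function names and the coordinate convention are hypothetical: `atg_index` is taken to be the 0-based forward-strand position of the A of the ATG for genes on the "+" strand, or of the T that pairs with that A for genes on the "-" strand.

```python
# Translation table for complementing DNA while preserving letter case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_of_atg(chromosome, atg_index, strand, length=1000):
    """Return up to `length` bases immediately upstream of a start codon.

    The coordinate convention for atg_index is an assumption of this sketch
    (see the accompanying text); real annotation files may differ.
    """
    if strand == "+":
        start = max(0, atg_index - length)  # truncate at the chromosome start
        return chromosome[start:atg_index]
    if strand == "-":
        # Upstream of a minus-strand gene lies at higher forward coordinates.
        window = chromosome[atg_index + 1:atg_index + 1 + length]
        return reverse_complement(window)
    raise ValueError("strand must be '+' or '-'")
```

The returned sequence always reads 5' to 3' towards the start codon, so promoters from both strands can be scanned for motifs in the same orientation. Genes close to a chromosome end simply yield a shorter region.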
The Affymetrix array data are available from the ArrayExpress database
(http://www.ebi.ac.uk/arrayexpress/) under accession E-MXP-475.
A curated set of genes regulated by glucose (one of the experimental conditions
used) is available from this spreadsheet:
Link to Excel file of glucose-regulated genes in Arabidopsis
Link to Word table: Glucose-regulated growth genes
Link to Word table: Glucose-regulated light and circadian regulated genes