Computational & Systems Biology at the John Innes Centre

Introduction

The completion of the genome sequences for Arabidopsis, Streptomyces and rice has brought us into the first period of the 'post-genomic age', in which there will be a dramatic growth in the amount of raw data being generated at JIC and other research centres worldwide. This creates problems in data handling and storage, in integrating data from a variety of sources and in 'data mining'. JIC was involved in the systematic genomic sequencing of Arabidopsis and Streptomyces. Part of these projects involved the creation of databases to store, retrieve and analyse data and the development of methods to handle large datasets efficiently. Building on this expertise, we are continuing to develop methods to allow access to databases through user-friendly web-based interfaces and search systems.

The role of computational and systems biology

The development of a joint proteomics facility with the Institute of Food Research has opened up exciting new areas of science. However, it has been estimated that these technologies alone will generate 5-6 terabytes of data a year at JIC. Many JIC groups are involved in gene expression experiments that have the potential to generate large volumes of data. About one third of the gene sequences identified from the Arabidopsis genome sequence have no equivalents in existing protein databases. Characterisation of the 26,000 Arabidopsis genes will largely depend on gene disruption experiments using transposon insertion. The transposon insertion programmes planned for Arabidopsis could generate 50,000-100,000 different insertion events. Databases of disrupted genes will have to be maintained, but to be useful these data will have to be screened repeatedly against existing gene and protein databases. Current systems will be unable to cope with the number of sequences and volume of data involved.
The same problems will arise with other genome sequences (Streptomyces, Lotus, Medicago, Oryza, Triticum) that may be explored at JIC. Analysing the effects of gene mutations, of genetic manipulation or of environmental changes on plant metabolism (metabolomics) similarly has the potential to produce huge quantities of data.

Data handling and processing

To extract the maximum value from these diverse techniques, their results must be fully integrated into a single data resource. Future systems have to allow inexperienced users to ask questions such as: 'How does this spot from a 2D gel relate to expression of this gene and how would it affect metabolite levels?' Developing these resources requires experts in data mining, data analysis, data presentation and biomathematical techniques, and this is a critical area in current biology. The need to combine secure storage of such data, for at least ten years, with ready and repeated access for end-users is a major technical and resource issue.

Computer modelling

Our understanding of how the primary (chemical) structure of molecules determines biological function, via the secondary and tertiary structure, is increasing, and computer modelling of molecules (proteins, nucleic acids and metabolites) to predict function is a growth area. JIC has built the hardware and software infrastructure needed to ensure that data can easily be exchanged regardless of operating system (Mac, PC or Unix) and software (commercial or shareware). Our bioinformatics programme is spearheading the transfer and conversion of data, core database construction and maintenance, and the provision of easy WWW interfaces to programs (using middleware such as CORBA or XML), as well as dealing with data production, user assistance etc.