Exploiting Emerging Open I/O Protocols for High-Throughput BioAnalysis

Lawrence Lau

Advanced Computational Modelling Centre (ACMC)
University of Queensland

Storage and data-mining requirements for genomic analysis are expected to reach Terabyte or even Petabyte scale in the near future (cbcg.lbl.gov/ssi-csb/Program.html). In preparation for analysing the richness and complexity of cellular substructures, consideration should be given, even at this early stage, to deploying *efficient* protocols based on open standards where possible. With the biological community converging on a consensus for genomic data standards (www.bioperl.org), this talk focuses instead on a later stage: the need to communicate biological information between arbitrary computational models, as in the mapping of genetic regulatory networks.

The ACMC is currently engaged in planning the computational and knowledge infrastructure to support a newly established Institute of Molecular BioSciences (http://www.imb.uq.edu.au/IMB_Research.html). The challenge is to identify *scalable* technologies to enable multi-institutional access to extremely high throughput services for bioanalysis, bioinformatics, data mining, and biovisualisation. Design considerations include support for multi-lateral trans-disciplinary collaboration, end-to-end applications from raw genomic analysis to full-scale immersive biovisualisation, and multimode dissemination of research results. This includes the need to replicate simulated models, which requires capturing intermediate information for peer review and possibly for re-input into higher-stage models.

The HyperText Transfer Protocol (HTTP) was originally designed as a simple client-server protocol for fetching small files and images. While its (relatively) recent commercialisation has produced a multitude of supporting tools and applications, inherent design limitations constrain its performance (as measured by latency and throughput) for online data mining and visualisation of dynamic models. The proposed Blocks eXtensible eXchange Protocol (BXXP), which is intended for large-scale server-server interactions, is considered instead (http://search.ietf.org/internet-drafts/draft-mrose-blocks-protocol-04.txt). Both Java and Perl OpenSource reference toolkits have been released (http://mappa.mundi.net/rocket/). The properties of this protocol (as compared with HTTP) are:

peer-peer connection-oriented (vs client-server stateless connections)
asynchronous request/response interactions (vs serialised get/put)
multiplexing of independent request/response streams
support for both binary and text (vs text-encoded MIME)
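
The multiplexing and binary-payload properties listed above can be illustrated with a minimal Python sketch. It assumes a hypothetical frame header (channel id, message number, payload length) invented for this illustration; it is not the BXXP wire format, and the names encode_frame, decode_frames and demultiplex are likewise assumptions rather than part of any released toolkit.

```python
import struct
from collections import defaultdict

# Hypothetical header, NOT the BXXP wire format:
# channel id (uint16), message number (uint32), payload length (uint16),
# all in network byte order with no padding (8 bytes in total).
HEADER = struct.Struct("!HIH")


def encode_frame(channel, msgno, payload):
    """Prefix a binary payload with the illustrative header."""
    return HEADER.pack(channel, msgno, len(payload)) + payload


def decode_frames(stream):
    """Yield (channel, msgno, payload) for each complete frame in a byte string."""
    offset = 0
    while offset + HEADER.size <= len(stream):
        channel, msgno, length = HEADER.unpack_from(stream, offset)
        start = offset + HEADER.size
        payload = stream[start:start + length]
        if len(payload) < length:
            break  # trailing frame is incomplete; stop rather than guess
        offset = start + length
        yield channel, msgno, payload


def demultiplex(stream):
    """Group interleaved frames back into one message list per channel."""
    by_channel = defaultdict(list)
    for channel, msgno, payload in decode_frames(stream):
        by_channel[channel].append((msgno, payload))
    return dict(by_channel)


# Two independent streams share one connection: channel 1 carries text,
# channel 2 carries raw binary, and their frames are interleaved on the wire.
wire = b"".join([
    encode_frame(1, 0, b"ATGC"),
    encode_frame(2, 0, bytes([0, 255, 16])),
    encode_frame(1, 1, b"GGTA"),
])
streams = demultiplex(wire)
```

Because every frame names its channel, a receiver can process each stream as frames arrive, without waiting for an earlier exchange on another channel to complete, and binary payloads need no text encoding.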

Much as ACEDB evolved organically in response to early uncertainty in genome annotation strategies, there are advantages in supporting a flexible framework in a rapidly evolving field, rather than static file formats that may solve immediate problems but prove cumbersome at later stages. As the author of ACEDB notes (http://www.faqs.org/faqs/acedb-faq/), the advantage of ACEDB over a conventional RDBMS is "a very complex schema ..[needing] .. continuous refinement in parallel with the accumulation of the data" and "... rather fuzzy answers [to questions], that one tries to refine progressively". Keeping to the precepts of flexibility while maintaining performance, one can similarly adapt BXXP to support biological modelling, particularly on SMP or cluster platforms. The questions that need to be addressed are which broad classes of bioinformatics applications will evolve, and how they can be represented efficiently.