BOSC 2000

Distributing Bioinformatics Applications with Piper

J.W. Bizzaro*, Gary Van Domselaar, Brad Chapman, Jean-Marc Valin, Jarl van Katwijk, Dominic Letourneau and Deanne Taylor

* Corresponding author:
Bioinformatics.org: The Open Lab
c/o Department of Chemistry
University of Massachusetts Lowell
Lowell, MA 01854
jeff@bioinformatics.org

A typical problem in bioinformatics research is linking together multiple programs to analyze a set of information. A common example is building phylogenetic trees from DNA sequences: the sequences are first aligned using a sequence alignment program, then analyzed in a phylogeny program to produce trees, and finally the trees are visualized in a viewer. This process can be further complicated by the fact that the programs may have incompatible inputs and outputs, as well as extensive memory and processing requirements. To address these problems, the authors have developed Piper, a distributed platform that can be used to link bioinformatics programs. Piper is ideally suited to broker bioinformatics analyses in that it can combine highly specific, modular data repositories and data analysis functions to provide sophisticated and efficient data processing networks. Piper provides a wrapper around existing bioinformatics programs so that they can be connected and executed in an intuitive manner. In addition, individual programs can be located on remote computers so that expensive calculations can be executed on faster, even dedicated, hardware. Piper is designed as a modular system using CORBA connectivity protocols as the backbone to link the modules. This design allows a number of different user interfaces to control the core processing engine. Piper is also being developed under the Open Source model, allowing contribution and design feedback from individuals in multiple areas of bioinformatics. Piper is especially well suited for automated, high-throughput data analysis such as protein fold identification, sequence conditioning such as repeat/vector masking and data format conversion, and customization of batch scripting processes such as building phylogenetic trees from DNA sequences.
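
The idea of wrapping programs as connectable nodes can be illustrated with a minimal sketch. This is not Piper's API or implementation: the names `Node`, `Network`, and `build_example_network` are hypothetical, the three stages are toy stand-ins for a real alignment program, phylogeny program, and tree viewer, and everything runs in one local process rather than over CORBA.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One wrapped program: a name, a callable, and the names of its upstream nodes."""
    name: str
    run: Callable[..., object]
    inputs: List[str] = field(default_factory=list)


class Network:
    """A tiny dataflow network; each node runs once, after its inputs are computed."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, name: str, run: Callable[..., object], inputs=()) -> None:
        self.nodes[name] = Node(name, run, list(inputs))

    def execute(self, target: str, cache: Optional[Dict[str, object]] = None):
        # Resolve upstream nodes recursively, caching results so that a node
        # shared by several consumers is only executed once per call.
        cache = {} if cache is None else cache
        if target not in cache:
            node = self.nodes[target]
            args = [self.execute(dep, cache) for dep in node.inputs]
            cache[target] = node.run(*args)
        return cache[target]


# Toy stand-ins for real tools (hypothetical; no real program is invoked).
def load_sequences():
    return {"seqA": "ACGT", "seqB": "ACG", "seqC": "AC"}


def align(seqs):
    # Placeholder "alignment": pad every sequence to the same width with gaps.
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def build_tree(alignment):
    # Placeholder "phylogeny": an unresolved star tree in Newick notation.
    return "(" + ",".join(sorted(alignment)) + ");"


def view(tree):
    # Placeholder "viewer": render the tree as a labeled string.
    return f"tree: {tree}"


def build_example_network() -> Network:
    net = Network()
    net.add("load", load_sequences)
    net.add("align", align, inputs=["load"])
    net.add("tree", build_tree, inputs=["align"])
    net.add("view", view, inputs=["tree"])
    return net


print(build_example_network().execute("view"))
```

In a distributed setting each `run` callable would instead forward its arguments to a remote service, and a conversion node could be placed between two stages whose formats do not match.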