Biotechnology plants its analytic head deep into the cloud, deploying algorithms to derive meaning from a flood of information. But what’s the difference between “big data” and simply having lots of information? Sometimes we get enamored with the data itself and forget that what matters is not just big data but meaningful data: data with which we can accept or reject hypotheses and take a significant step forward in our knowledge of the science.
Posts from the ‘Statistics’ Category
Wu, C., Kalyanaraman, A., Cannon, W.R. IEEE Trans. Par. Dist. Sys. 2012 http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.19.
Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet, there is currently no robust support to parallelize this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for detecting homology on large data sets using distributed memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048 processor distributed memory cluster for a wide range of inputs ranging from as small as 20,000 sequences to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.
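As a rough illustration of the producer-consumer idea mentioned in the abstract, here is a minimal single-machine sketch in Python. It is not pGraph: it uses threads rather than distributed-memory processes, a single producer rather than a hierarchy of masters, and a hypothetical `toy_similarity` function (k-mer Jaccard overlap) in place of real sequence alignment. The function names, threshold, and queue size are all assumptions chosen for the example.

```python
import itertools
import queue
import threading


def kmers(seq, k=3):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets; a cheap stand-in for alignment."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def build_homology_edges(seqs, threshold=0.3, n_workers=4, queue_size=64):
    """One producer generates sequence pairs; workers score them.

    Returns the sorted list of index pairs whose score meets the threshold.
    """
    # Bounded queue: the producer blocks when the workers fall behind.
    tasks = queue.Queue(maxsize=queue_size)
    edges = []
    lock = threading.Lock()

    def producer():
        for i, j in itertools.combinations(range(len(seqs)), 2):
            tasks.put((i, j))
        for _ in range(n_workers):
            tasks.put(None)  # one stop signal per worker

    def worker():
        while True:
            item = tasks.get()
            if item is None:
                break
            i, j = item
            if toy_similarity(seqs[i], seqs[j]) >= threshold:
                with lock:
                    edges.append((i, j))

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(edges)
```

Calling `build_homology_edges(["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSGGGGS"])` links only the first two sequences. The bounded queue is the part that matters for the idea: work generation and alignment run at different, irregular rates, and the queue decouples them.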
VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data
Peterson E.S., McCue L.A., Schrimpe-Rutledge A.C., Jensen J.L., Walker H., Kobold M.A., Webb S.R., Payne S.H., Ansong C., Adkins J.N., Cannon W.R, Webb-Robertson B.J., VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data. BMC Genomics. 2012 Apr 5;13:131. doi: 10.1186/1471-2164-13-131
The procedural aspects of genome sequencing and assembly have become relatively inexpensive, yet the full, accurate structural annotation of these genomes remains a challenge. Next-generation sequencing transcriptomics (RNA-Seq), global microarrays, and tandem mass spectrometry (MS/MS)-based proteomics have demonstrated immense value to genome curators as individual sources of information; however, integrating these data types to validate and improve structural annotation remains a major challenge. Current visual and statistical analytic tools are focused on a single data type, or existing software tools are retrofitted to analyze new data forms. We present Visual Exploration and Statistics to Promote Annotation (VESPA), a new interactive visual analysis software tool focused on assisting scientists with the annotation of prokaryotic genomes through the integration of proteomics and transcriptomics data with current genome location coordinates.
VESPA is a desktop Java™ application that integrates high-throughput proteomics data (peptide-centric) and transcriptomics (probe or RNA-Seq) data into a genomic context, all of which can be visualized at three levels of genomic resolution. Data is interrogated via searches linked to the genome visualizations to find regions with high likelihood of mis-annotation. Search results are linked to exports for further validation outside of VESPA, or potential coding regions can be analyzed concurrently with the software through interaction with BLAST. VESPA is demonstrated on two use cases (Yersinia pestis Pestoides F and Synechococcus sp. PCC 7002) to show the rapid manner in which mis-annotations can be found and explored in VESPA using either proteomics data alone or in combination with transcriptomic data.
VESPA is an interactive visual analytics tool that integrates high-throughput data into a genomic context to facilitate the discovery of structural mis-annotations in prokaryotic genomes. Data is evaluated via visual analysis across multiple levels of genomic resolution, linked searches and interaction with existing bioinformatics tools. We highlight the novel functionality of VESPA and core programming requirements for visualization of these large heterogeneous datasets for a client-side application. The software is freely available at https://www.biopilot.org/docs/Software/Vespa.php
We report the development of a novel high performance computing method for the identification of proteins from unknown (environmental) samples. The method uses computational optimization to provide an effective way to control the false discovery rate for environmental samples and complements de novo peptide sequencing. Furthermore, the method provides information based on the expressed protein in a microbial community, and thus complements DNA-based identification methods. Testing on blind samples demonstrates that the method provides 79-95% overlap with analogous results from searches involving only the correct genomes. We provide scaling and performance evaluations for the software that demonstrate the ability to carry out large-scale optimizations on 1258 genomes containing 4.2M proteins.
A MapReduce Implementation of a Hybrid Spectral Library-Database Search Method for Peptide Identification
Kalyanaraman, A., Latt, B., Baxter, D. J. and Cannon, W.R.. Bioinformatics (2011) 27 (21): 3072-3073. doi: 10.1093/bioinformatics/btr523
A MapReduce-based implementation called MR-MSPolygraph for parallelizing peptide identification from mass spectrometry data is presented. The underlying serial method, MSPolygraph, uses a novel hybrid approach to match an experimental spectrum against a combination of a protein sequence database and a spectral library. Our MapReduce implementation can run on any Hadoop cluster environment. Experimental results demonstrate that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours, for processing tens of thousands of experimental spectra. Speedup and other related performance studies are also reported on a 400-core Hadoop cluster using spectral data sets from environmental microbial communities as inputs.
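To make the MapReduce pattern concrete, the toy below mimics the map, shuffle, and reduce phases in plain Python for a spectrum-matching problem. It is only a sketch under stated assumptions: it does not use Hadoop, it is not the MR-MSPolygraph code, and the `score` function (a count of peaks matching within a tolerance) is a hypothetical placeholder for the real hybrid scoring.

```python
from collections import defaultdict


def score(spectrum, reference, tol=0.5):
    """Toy score: number of observed peaks within tol of a reference peak."""
    return sum(any(abs(p - r) <= tol for r in reference) for p in spectrum)


def map_phase(spectra, candidates):
    """Emit (spectrum_id, (score, peptide)) for each spectrum/candidate pair."""
    for sid, peaks in spectra.items():
        for peptide, ref_peaks in candidates.items():
            yield sid, (score(peaks, ref_peaks), peptide)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Keep the best-scoring (score, peptide) tuple for each spectrum."""
    return {sid: max(values) for sid, values in grouped.items()}


def identify(spectra, candidates):
    """Run the three phases in sequence on in-memory data."""
    return reduce_phase(shuffle(map_phase(spectra, candidates)))
```

Each spectrum is scored independently of every other spectrum, which is why this kind of workload splits cleanly across a cluster: the map calls share no state, and the reduce step only needs the values for one key at a time.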
Analyzing Data for Systems Biology: Working at the Intersection of Thermodynamics and Data Analytics
Many challenges in systems biology have to do with analyzing data within the framework of molecular phenomena and cellular pathways. How does this relate to the thermodynamics that we know governs the behavior of molecules? Making progress in relating data analysis to thermodynamics is essential in systems biology if we are to build predictive models that enable the field of synthetic biology. This report discusses work at the crossroads of thermodynamics and data analysis, and demonstrates that statistical mechanical free energy is a multinomial log likelihood. Applications to systems biology are presented.
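One textbook-style way to see the link between free energy and a multinomial log likelihood can be checked numerically. This is an illustration under assumptions, not the derivation from the report: suppose N independent molecules occupy states with energies E_i, and the model probabilities are Boltzmann weights exp(-E_i/kT)/Z. Then, by Stirling's approximation, -kT/N times the multinomial log likelihood of the observed counts approaches the free energy of the observed distribution minus the equilibrium free energy -kT ln Z. The function names and the example energies below are hypothetical.

```python
import math


def multinomial_log_likelihood(counts, probs):
    """Exact log of the multinomial probability of counts given probs."""
    n = sum(counts)
    ll = math.lgamma(n + 1)
    for c, p in zip(counts, probs):
        ll += c * math.log(p) - math.lgamma(c + 1)
    return ll


def boltzmann(energies, kT=1.0):
    """Return (Boltzmann probabilities, partition function Z)."""
    weights = [math.exp(-e / kT) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], z


def free_energy_gap(counts, energies, kT=1.0):
    """Per-molecule F[p] - F_eq for the empirical distribution p = counts/N.

    F[p] = <E> - kT * S with S = -sum(p ln p); F_eq = -kT ln Z.
    """
    n = sum(counts)
    p = [c / n for c in counts]
    _, z = boltzmann(energies, kT)
    mean_energy = sum(pi * e for pi, e in zip(p, energies))
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (mean_energy - kT * entropy) - (-kT * math.log(z))


counts = [5000, 3000, 2000]        # hypothetical observed state counts
energies = [0.0, 1.0, 2.0]         # hypothetical state energies, units of kT
theta, _ = boltzmann(energies)
per_molecule_nll = -multinomial_log_likelihood(counts, theta) / sum(counts)
gap = free_energy_gap(counts, energies)
```

Here `per_molecule_nll` and `gap` agree up to a correction that shrinks roughly like ln(N)/N, so maximizing the likelihood and minimizing the free energy pick out the same distribution in the large-N limit.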
Anyone who has tried to match an unfamiliar bird’s features to its field guide portrait knows that reality rarely provides a perfect comparison to the ideal specimen.
Scientists have faced a similar problem when attempting to decode protein patterns found in living cells, a field known as proteomics. Using mass spectrometry, the technology of choice for protein identification, scientists try to match protein fragments, or peptides, against idealized patterns in peptide databases. These databases often provide a poor correspondence: the industry standard for positive peptide identification is usually a dismal 15 to 20 percent.
But using bioinformatics techniques, researchers at Pacific Northwest National Laboratory (PNNL) have developed a pattern-matching algorithm that improves the accuracy of peptide identification by between 50 and 150 percent, compared with standard approaches.
Cannon, W.R., Rawlins, M. M., Baxter, D. J., Lipton, M., Callister, S., and Bryant, D. A., J. Proteome Res., 2011, 10 (5), pp 2306–2317, DOI: 10.1021/pr101130b
We report a hybrid search method combining database and spectral library searches that allows for a straightforward approach to characterizing the error rates from the combined data. Using these methods, we demonstrate significantly increased sensitivity and specificity in matching peptides to tandem mass spectra. The hybrid search method increased the number of spectra that can be assigned to a peptide in a global proteomics study by 57−147% at an estimated false discovery rate of 5%, with clear room for even greater improvements. The approach combines the general utility of using consensus model spectra typical of database search methods with the accuracy of the intensity information contained in spectral libraries. A common scoring metric based on recent developments linking data analysis and statistical thermodynamics is used, which allows the use of a conservative estimate of error rates for the combined data. We applied this approach to proteomics analysis of Synechococcus sp. PCC 7002, a cyanobacterium that is a model organism for studies of photosynthetic carbon fixation and biofuels development. The increased specificity and sensitivity of this approach allowed us to identify many more peptides involved in the processes important for photoautotrophic growth.
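The abstract quotes results "at an estimated false discovery rate of 5%" but does not say here how that estimate is computed. One widely used approach in proteomics is target-decoy counting, sketched below purely as an assumption about how such a cutoff could be chosen; it should not be read as the paper's method. The function name and input layout are hypothetical, and the sketch assumes scores are distinct and that higher is better.

```python
def threshold_at_fdr(matches, max_fdr=0.05):
    """Pick a score cutoff by target-decoy counting.

    matches: iterable of (score, is_decoy) pairs, higher score = better.
    Returns (cutoff_score, n_targets_accepted) for the largest accepted set
    whose decoys/targets ratio is at most max_fdr, or None if there is none.
    """
    best = None
    targets = decoys = 0
    for score, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        # Only a target hit can become the cutoff, so the returned score
        # always belongs to an accepted identification.
        if decoys / targets <= max_fdr:
            best = (score, targets)
    return best
```

Read this way, a method that raises the number of target matches above the cutoff without pulling in more decoys assigns more spectra at the same estimated error rate, which is the kind of gain the abstract reports.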