
Bioinformatics And Proteomic Approaches To Disease: In Vivo And In Silico Proteome Analysis Tools
Correspondence Address:
Fakher Rahim, MSc Bioinformatics, Physiology Research Center, Ahwaz Jondishapur University of Medical Sciences, Ahwaz, Iran.
The availability of human genome sequences and of transcriptomic, proteomic, and metabolomic data provides a challenging opportunity to develop computational approaches for the systematic analysis of metabolic disorders. Mass spectrometry represents an important set of in vivo technologies for protein expression measurement. Among them, surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS), because of its high throughput and on-chip sample processing capability, has become a popular tool for clinical proteomics. Bioinformatics plays a critical role in the analysis of SELDI data, and it is therefore important to understand the issues associated with the analysis of proteomic data. A variety of protein sequence databases exist, ranging from simple sequence repositories, which store data with little or no manual intervention in the creation of the records, to expertly curated universal databases that cover all species and in which the original sequence data are enhanced by the manual addition of further information to each sequence record. As the focus of researchers moves from the genome to the proteins encoded by it, these databases play an even more important role as central, comprehensive resources of protein information. In this review, we discuss these issues, the bioinformatics strategies, and several leading protein sequence databases used for in silico proteomic analysis in association with in vivo techniques.
proteinchip, surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS), bioinformatics, proteomic, in vivo, in silico
Introduction
One of the major goals of the post-genomic era is to understand the structures, interactions, and functions of all cellular proteins. Since the cellular proteome is a dynamic profile, subject to change in response to various signals through posttranslational modification, translocation, and protein-protein and protein-nucleic acid interactions, the task becomes even more complex, potentially involving a million or more modification events. Proteomics encompasses the study of expressed proteins, including the identification and elucidation of the structure-function interrelationships that define healthy and disease conditions. Information at the level of the proteome is critical to understanding the function of the cellular phenotype and its role in health and disease. Since posttranslational events, and indeed protein expression levels, cannot always be predicted accurately by mRNA analysis, proteomics, used in concert with genomics, can provide a holistic understanding of the biology underlying the disease process. The challenge in deciphering the proteome is the development and integration of analytical instrumentation, combined with bioinformatics, that provides rapid, high-throughput, sensitive, and reproducible tools. Continual advancement in proteome research has led to an influx of protein sequences from a wide range of species, which represents a challenge for the field of bioinformatics. Genome sequencing is also proceeding at an increasingly rapid rate, and this has led to an equally rapid increase in predicted protein sequences. All these sequences, both experimentally derived and predicted, need to be stored in comprehensive, non-redundant protein sequence databases. Moreover, they need to be assembled and analyzed to provide a solid basis for further comparisons and investigations. The human sequences in particular, but also those of the mouse and other model organisms, are of interest in the effort towards a better understanding of health and disease.
An important instrument is in silico proteome analysis. The term “proteome” is used to describe the protein equivalent of the genome. Most of the predicted protein sequences lack a documented functional characterization. The challenge is to provide statistical and comparative analysis, as well as structural and other information, for these sequences as an essential step towards the integrated analysis of organisms at the gene, transcript, protein, and functional levels. Whole proteomes in particular represent an important source for meaningful comparisons between species and, furthermore, between individuals in different states of health. To fully exploit the potential of this vast quantity of data, tools for in silico proteome analysis are necessary. This article describes some important resources for proteome analysis, such as sequence databases and analysis tools, which are highly useful for the discovery of protein function and for protein characterization.
In vivo Techniques
Now that the sequencing of the human genome is complete, the characterization of the proteins encoded by the sequence remains a challenging task. The study of the complete protein complement of the genome, the “proteome,” is referred to as proteomics, and it will be essential if new therapeutic drugs and new disease biomarkers for early diagnosis are to be developed. Research efforts are already underway to develop the technology necessary to compare the specific protein profiles of diseased versus non-diseased states.
2D gel electrophoresis:
Two-dimensional gel electrophoresis (2DE) has been by far the most widely used tool in proteomics for more than 25 years (1). This technique involves the separation of complex mixtures of proteins, first on the basis of isoelectric point (pI) using isoelectric focusing (IEF), and then, in a second dimension, on the basis of molecular mass. The proteins are separated by migration in a polyacrylamide gel. With different gel staining techniques, such as silver staining (2), Coomassie blue stain, fluorescent dyes (3), or radiolabels, a few thousand proteins can be visualized on a single gel. Fluorescent dyes are being developed to overcome some of the drawbacks of silver staining by making the protein samples more amenable to mass spectrometry (4),(5). The data can be analyzed with software such as PDQuest by Bio-Rad Laboratories (Hercules, Calif, USA) (6), Melanie 3 by GeneBio (Geneva, Switzerland), Imagemaster 2D Elite by Amersham Biosciences, and DeCyder 2D Analysis by Amersham Biosciences (Buckinghamshire, UK) (7). Ratio analysis is used to detect quantitative changes in proteins between two samples. 2DE is currently being adapted to high-throughput platforms (8). As an example of its application, Periplaneta americana is the predominant cockroach (CR) species and a major source of indoor allergens in Thailand, yet data on the nature and molecular characteristics of its allergenic components are rare. One study set out to identify and characterize the P. americana allergenic protein: two-dimensional gel electrophoresis, liquid chromatography, mass spectrometry, and peptide mass fingerprinting were used to identify the P. americana protein containing the MAb-specific epitope, as shown in (Table/Fig 1),(Table/Fig 2) and (Table/Fig 3) (9).
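The ratio analysis mentioned above can be illustrated with a minimal sketch. It assumes that spot volumes have already been quantified and matched across two gels. The spot identifiers, the total-volume normalization, and the two-fold threshold are illustrative assumptions, not the procedure of any of the software packages named above.

```python
import math

def ratio_analysis(control, disease, fold_threshold=2.0):
    """Flag matched spots whose normalized volume changes by at least fold_threshold.

    control, disease: dicts mapping a spot ID to its raw volume on each gel.
    Returns {spot_id: log2(disease / control)} for the flagged spots only.
    """
    # Normalize each gel to its total spot volume (assumed correction for
    # differences in protein loading and staining between the two gels).
    total_c = sum(control.values())
    total_d = sum(disease.values())
    cutoff = math.log2(fold_threshold)
    changed = {}
    for spot in control.keys() & disease.keys():  # spots matched on both gels
        c = control[spot] / total_c
        d = disease[spot] / total_d
        if c <= 0 or d <= 0:
            continue  # a ratio is undefined when a spot has no volume
        log_ratio = math.log2(d / c)
        if abs(log_ratio) >= cutoff:
            changed[spot] = log_ratio
    return changed

# Hypothetical spot volumes in arbitrary units
control = {"spot_1": 100.0, "spot_2": 200.0, "spot_3": 300.0, "spot_4": 400.0}
disease = {"spot_1": 400.0, "spot_2": 200.0, "spot_3": 300.0, "spot_4": 100.0}
changed = ratio_analysis(control, disease)
print(sorted(changed.items()))
```

Working on the log2 scale treats increases and decreases symmetrically, so a four-fold rise and a four-fold fall have the same magnitude with opposite signs.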
ProteinChips:
Unique ionization techniques, such as electrospray ionization and matrix-assisted laser desorption-ionization (MALDI), have facilitated the characterization of proteins by mass spectrometry (MS) (10),(11). Hence, a spectrum is generated containing the molecular masses of individual peptides, which are used to search databases to find matching proteins. A minimum of three peptide molecular weights is necessary to minimize false-positive matches. The principle behind peptide mass mapping is the matching of experimentally generated peptide masses with those calculated for each entry in a sequence database. The alternative ionization process, electrospray ionization, involves dispersion of the sample through a capillary device at high voltage (12). Recent developments have led to the MALDI quadrupole TOF instrument, which combines peptide mapping with a peptide sequencing approach (13),(14),(15). An important feature of tandem MS (MS-MS) analysis is the ability to accurately identify posttranslational modifications, such as phosphorylation and glycosylation, through the measurement of mass shifts. Another MS-based proteinChip technology
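The peptide mass mapping principle described above can be sketched as follows. This is a minimal illustration, not the algorithm of any particular search engine. It assumes tryptic cleavage (after K or R, but not before P), neutral monoisotopic masses, a fixed mass tolerance, and a toy two-entry database. The function names, the entry names prot_A and prot_B, and the simulated "observed" masses are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(seq, min_len=5):
    """Cleave after K or R unless the next residue is P; drop very short peptides."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return [p for p in peptides if len(p) >= min_len]

def peptide_mass(peptide):
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def identify(observed, database, tol=0.2, min_matches=3):
    """Return {entry: number of matched masses} for entries with >= min_matches."""
    hits = {}
    for name, seq in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(seq)]
        n = sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)
        if n >= min_matches:  # at least three matching peptides, as in the text
            hits[name] = n
    return hits

# Toy database (sequences are for illustration only)
database = {
    "prot_A": "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFK",
    "prot_B": "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPK",
}
# Simulated measurement: prot_A peptide masses with a 0.05 Da error added
observed = [peptide_mass(p) + 0.05 for p in tryptic_digest(database["prot_A"])]
hits = identify(observed, database)
print(hits)
```

A real search would also have to handle missed cleavages, charged ions, and modification mass shifts, which this sketch leaves out.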
The post-genomic era holds phenomenal promise for identifying the mechanistic bases of organismal development, metabolic processes, and disease, and we can confidently predict that bioinformatics research will have a dramatic impact on our understanding of such diverse areas as the regulation of gene expression, protein structure determination, comparative evolution, and drug discovery. Software packages and bioinformatics tools have been, and are being, developed to analyze 2D gel protein patterns. These software applications have user-friendly interfaces that incorporate tools for the linearization and merging of scanned images. The tools also help in the segmentation and detection of protein spots on the images, and in matching and editing (44). Additional features include pattern recognition capabilities and the ability to perform multivariate statistics. New techniques and new collaborations between computer scientists, biostatisticians, and biologists are called for. There is a need to develop and integrate database repositories for the various sources of data being collected, to develop tools for transforming raw primary data into forms suitable for public dissemination or formal data analysis, to obtain and develop user interfaces to store, retrieve, and visualize data from databases, and to develop efficient and valid methods of data analysis.
In recent years, there has been a tremendous increase in the amount of data available concerning the human genome and, more particularly, the molecular basis of genetic diseases. Every week, new discoveries are made that link one or more genetic diseases to defects in specific genes. To take these developments into account, the SWISS-PROT protein sequence database, for example, is gradually being enhanced by the addition of a number of features specifically intended for researchers working on the basis of human genetic diseases, as well as on the extent of polymorphisms. The latter are very important too, since they may underlie differences between individuals, which are of particular interest for some aspects of medicine and drug research. Such comprehensive sequence databases are essential for the use of proteome analysis tools, such as the proteome analysis database, which combines the different protein sequences of a given organism into a complete proteome. This proteome can be regarded as a whole new unit that can be analyzed from different points of view (such as the distribution of domains and protein families, and the secondary and tertiary structures of proteins) and compared with other proteomes. In general, if proteomics data are to be used for healthcare and drug development, the characteristics of the proteomes of entire species, mainly the human, must first be understood before differences between individuals can be surveyed. Although the number of proteome analysis tools and databases is increasing, and most of them provide computational analysis and/or annotation of very good quality, the user should not forget that automated analysis can always contain errors. Data in databases are reliable, but only up to a point.
Automatic tools that use data derived from databases can thus be error-prone, rules built on such data can be wrong, and sequence similarities can arise by chance rather than from a true relationship. Users of bioinformatics tools should in no way feel discouraged from using them, provided they keep in mind the potential pitfalls of automated systems, and even of humans. They are encouraged to check all data as far as possible and not to rely on them blindly.
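One way to check whether a sequence similarity could have arisen by chance is a shuffle test: the observed alignment score is compared with the scores obtained after randomly permuting one of the sequences. The sketch below is only an illustration of that idea. The simple match/mismatch/gap scoring, the number of shuffles, and the function names are assumptions, and real tools typically use substitution matrices and analytical statistics instead.

```python
import random

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a simple linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def shuffle_p_value(query, subject, n_shuffles=200, seed=0):
    """Empirical p-value: fraction of shuffled subjects scoring >= the observed score."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = local_alignment_score(query, subject)
    residues = list(subject)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps composition, destroys residue order
        if local_alignment_score(query, "".join(residues)) >= observed:
            at_least += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return (at_least + 1) / (n_shuffles + 1)

# Toy sequence for illustration only
seq = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSE"
p_related = shuffle_p_value(seq, seq)
print(p_related)
```

A small p-value indicates that the observed score is unlikely under shuffling, while a large one suggests that the similarity may reflect residue composition alone.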