Author: Chokshi Dave
Institution: Biology
Date: September 2005

Publication of working drafts of the human genome in February 2001 was the capstone achievement of two decades' work on deciphering the human "genetic code." It was hailed as one of the greatest scientific endeavors of mankind; it was certainly the greatest in recent history.

But any molecular biologist will tell you that cataloguing the genome is only the first step to understanding the human body on a molecular level. Scientists have already begun researching "the next big thing," a venture so complex that it dwarfs the Human Genome Project. Welcome to the era of proteomics.

Proteomics is the study of the way proteins expressed by genes interact inside cells. Essentially, the proteome is to proteins what the genome is to genes. However, the questions that proteomics researchers seek to answer are broader than those asked by genomicists: proteomics will not be limited to documenting protein inventory. Specifically, scientists will also have to examine the relative abundance of proteins, the function and activation states of proteins, and the myriad permutations of protein interactions.

A gene is a gene is a gene: to a first approximation, the DNA in a neuron is the same as the DNA in a skin cell. The same cannot be said about proteins. Some cells produce hemoglobin and insulin abundantly, and some cells do not. Moreover, the amount of protein produced varies by cell, and not just by cell type. Which proteins a cell contains depends on its age, its physical environment, the signals it receives from the nervous system, and even the time of day!

Many scientists are coming to regard proteomics as the pre-eminent approach to complex biological problems such as the nature of particular molecular complexes or pathways in disease pathogenesis (Banks et al., 2000). As evinced by the billions of dollars already poured into proteomics research by venture capital, biotechnology, and pharmaceutical companies, proteomics also holds hopes for innovative drug development and advances in diagnostic medicine. But much basic research must still be done, and experimental protocols must mature, before clinical breakthroughs are realized.

Novel Challenges in Proteomics

A glance at the numbers reveals the main challenge in proteomics: complexity. Proteins are made of approximately 20 modifiable amino acid building blocks; compare that to the four static nucleotide bases found in genes. Latest estimates peg the number of genes in the human body at about 34,000; there are probably 500,000 or more structurally distinct proteins in the body. Scientists once believed that genes could tell the story of the remarkable complexity of the human body, hence previous estimates of ~100,000 genes.

It may be that proteins are more responsible for our complexity than genes. And the folded structure of a single protein is so complicated that IBM plans to spend the next five years deciphering how just one protein forms its particular shape. To do that, the company will need to create a computer 500 times more powerful than any in existence today (Fischer, 2000).

Investigating the biological relationship between genes and proteins also reveals the intricacies of proteomics. The genome is the set of instructions for making proteins; a gene is a blueprint for making an individual protein. What the intracellular organelles, or protein "factories," decide to make is indeed based on these blueprints, but not strictly bound by them.

Some blueprints are so popular that they will be utilized millions of times. Others might not be accessed at all in a particular cell. Some organelles will mix and match genetic instructions to create fusion or hybrid proteins. Unlike relatively stable DNA, proteins get phosphorylated, sulfated, glycosylated, acetylated, and ubiquitinated. A single gene ends up encoding multiple different proteins, and by different methods: alternative splicing of the mRNA transcript, variation of translation start and stop signals, and frameshifting of codons (Fields, 2001).
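To make the idea of one gene yielding several proteins concrete, here is a minimal sketch of alternative splicing in Python. It assumes a deliberately simplified model in which some exons are always kept and others may each be kept or skipped independently; the exon names and the function `splice_isoforms` are hypothetical and used only for illustration. Real splicing is regulated and far from this exhaustive.

```python
from itertools import product

def splice_isoforms(exons, optional):
    """Return every transcript obtained by keeping or skipping each optional exon.

    exons    -- exon names in their order along the gene (hypothetical labels)
    optional -- set of exon names that may be skipped (an assumption of this toy model)
    """
    optional_in_order = [e for e in exons if e in optional]
    isoforms = []
    # Try every keep/skip combination for the optional exons.
    for choices in product([True, False], repeat=len(optional_in_order)):
        keep = dict(zip(optional_in_order, choices))
        # Exons not listed as optional are always retained.
        isoforms.append(tuple(e for e in exons if keep.get(e, True)))
    return isoforms

# A toy gene with four exons, two of which can be skipped.
toy_isoforms = splice_isoforms(["E1", "E2", "E3", "E4"], optional={"E2", "E3"})
```

In this toy model, k optional exons give 2^k possible transcripts from a single gene, which is one reason the number of distinct proteins can far exceed the number of genes.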

Genes have only one principal function, to conserve and provide information. Proteins, on the other hand, have myriad functions, from serving as intercellular messengers to identifying invading pathogens. Conversely, some functions may be performed by multiple different proteins. Furthermore, the proteome must be dynamic, and versatile proteins must respond to altered environmental conditions by relocating within the cell, adjusting their stability, and changing the molecules they bind to.

A particularly difficult task in unraveling the complexity of the proteome is mathematically describing the shapes of proteins. Proteins are manufactured as linear strings of amino acids but self-organize to yield the secondary, tertiary, and quaternary structures that characterize the three-dimensional protein in a cellular environment. We do not completely know how to predict a protein's ultimate structure simply from its amino acid sequence. Computational tools for describing this self-organization, as well as computer simulation of ligand-to-protein docking, are only starting to be developed. At the same time, the processes that link structure to function are fundamental to understanding all proteins (Meredith, 2001).

There has been much speculation about a large-scale proteomics initiative, or a Human Proteome Project, if you will. Without a doubt, researchers would benefit from networking on so complicated a problem. But is such a project in the cards?

There are some important disanalogies between genomics and proteomics that render the prospects of a Human Proteome Project unfavorable. Perhaps most importantly, the study of proteins diverges from the study of genes in that there is no analog to the linear sequence of DNA with a definite start and finish to examine. If proteomicists were actually in pursuit of identifying every single protein in the human body, the scope of the project would encompass almost all of biology!

There are undoubtedly some aspects of proteomics that lend themselves to systematic analysis. For instance, the Human Genome Project did not tell us which genes are actually expressed in a cell or tissue. Most genes are only predicted to be genes, based on the application of gene-prediction algorithms to the genomic sequence. Yet on the whole, the consensus among biologists is to shy away from a Human Proteome Project because the most biologically relevant questions in proteomics deal with the dynamics of protein interactions rather than the cataloguing of each individual protein (Bradbury, 2000).

The complexity inherent in proteomics calls for novel experimental protocols and technology. For example, there is no protein analog to the polymerase chain reaction (for DNA amplification) for simple and efficient amplification of low-abundance proteins, so a range of detection from one to several million molecules per cell is needed. The analysis and significance of post-translational modifications pose a major hurdle. Proteins' properties arise largely from their folded structures, so general experimental methods, which do not necessarily maintain the integrity of a protein's structure, are difficult to apply. Clearly, ingenuity and innovation are required to understand the astounding complexity of the human proteome.

Experimental Approaches to Proteomics

The gold standard in basic proteomics research is a technology known as two-dimensional gel electrophoresis. A mixed protein sample is separated first by charge (more precisely, by isoelectric point) and then by molecular mass. In this technique, a solution of cell contents (or other protein sample) is placed on a narrow polyacrylamide strip with an immobilized pH gradient. Application of an electric current induces the polypeptides to migrate until each reaches the region of the gradient where it carries no net charge, its isoelectric point. This results in a gel strip showing discrete protein bands that correspond to isoelectric point.

This strip is then placed against a rectangular polyacrylamide gel containing sodium dodecyl sulfate (SDS). An electric current induces variable migration of the bands from the strip to the rectangular slab; thus, proteins are separated by size (Banks et al., 2000). Two-dimensional gel electrophoresis is not without limitations, however. High-charge or low-mass polypeptides are not resolved well. Proteins with large hydrophobic regions (such as membrane-bound receptors) also are not clearly visualized. The latter restriction has important implications for drug development (see below), since membrane receptors are targets for pharmaceutical intervention (Ezzell, 2000).
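As a rough illustration of the two separation steps described above, the following Python sketch places proteins on an idealized gel. It is a toy model under stated assumptions: the first dimension is a linear map of isoelectric point onto the pH gradient, the second dimension assumes migration distance falls off linearly with the logarithm of molecular mass (a common approximation for SDS gels), and the pH and mass windows are arbitrary defaults. The class `Protein`, the function `gel_position`, and all numeric values are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Protein:
    name: str
    pI: float        # isoelectric point (hypothetical value)
    mass_kda: float  # molecular mass in kilodaltons (hypothetical value)

def gel_position(protein, ph_min=3.0, ph_max=10.0, mass_min=10.0, mass_max=200.0):
    """Return (x, y) coordinates in [0, 1] on an idealized 2D gel.

    x is the position along the pH gradient (first dimension).
    y is the migration distance in the SDS slab (second dimension);
    smaller proteins travel farther, so y grows as mass shrinks.
    Returns None for proteins outside the assumed resolvable window,
    mirroring the limitation that some proteins are not resolved well.
    """
    if not (ph_min <= protein.pI <= ph_max):
        return None
    if not (mass_min <= protein.mass_kda <= mass_max):
        return None
    x = (protein.pI - ph_min) / (ph_max - ph_min)
    log_span = math.log10(mass_max) - math.log10(mass_min)
    y = (math.log10(mass_max) - math.log10(protein.mass_kda)) / log_span
    return (x, y)

# Hypothetical sample: names and values are invented for illustration.
sample = [
    Protein("protein_A", pI=6.5, mass_kda=200.0),
    Protein("protein_B", pI=3.0, mass_kda=10.0),
    Protein("protein_C", pI=11.2, mass_kda=50.0),  # too basic for this gradient
]
spots = {p.name: gel_position(p) for p in sample}
```

Two proteins with the same mass but different isoelectric points land in different columns of the gel, which is what lets the second dimension separate bands that the first could not, and vice versa.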

Computer analysis of gel images can be used to compare a sample with others from the lab or with proteome databases accessible through the Internet. For example, researchers may wish to compare protein expression in healthy and diseased cells in the same tissue. Two-dimensional gels also can be used to identify specific proteins, although other methods have been developed that complement gel-based identification. Mass spectrometric techniques yield particularly precise characterization. Proteins or peptides (which can be isolated using gel electrophoresis) are ionized using various procedures; the mass-to-charge ratio of the ions is then measured very accurately by coupled analyzers.

In a related protocol, a protein is first broken down into several peptide fragments; mass spectrometry of these fragments yields a characteristic fingerprint of peptide masses. Comparing this fingerprint with the peptide masses predicted from digestion of sequences in genomic databases identifies the protein.
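The fingerprint-matching idea can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular laboratory's software. It assumes digestion with trypsin under the common simplified rule (cleave after lysine or arginine unless the next residue is proline), uses standard monoisotopic residue masses, and scores candidates by counting matched masses within a fixed tolerance. The function names, the tolerance, and the two toy sequences are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def tryptic_digest(sequence):
    """Split a sequence after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER

def fingerprint(sequence):
    """Sorted list of predicted peptide masses for one protein sequence."""
    return sorted(peptide_mass(p) for p in tryptic_digest(sequence))

def match_score(observed, sequence, tolerance=0.01):
    """Count observed masses that match a predicted mass within the tolerance."""
    predicted = fingerprint(sequence)
    return sum(any(abs(m - p) <= tolerance for p in predicted) for m in observed)

def identify(observed, database, tolerance=0.01):
    """Return the database entry whose predicted fingerprint matches best."""
    return max(database, key=lambda name: match_score(observed, database[name], tolerance))

# Toy database: the sequences are invented, not real proteins.
toy_database = {
    "toy_protein_1": "MKWVTFISLLR",
    "toy_protein_2": "GAVKLLSDER",
}
observed_masses = fingerprint("MKWVTFISLLR")  # stand-in for a measured spectrum
best_match = identify(observed_masses, toy_database)
```

Real search tools also account for missed cleavages, chemical modifications, and measurement error, all of which this sketch ignores.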

Two-dimensional gel electrophoresis and mass spectrometry could be described as "classical proteomics," the branch of proteomics concerned with protein cataloguing. But as described previously, many challenges in proteomics stem from functional characterization of proteins. A method known as two-hybrid analysis is the principal experimental tool used to probe protein functionality.

The general idea is that if two proteins interact with one another, they usually participate in similar cellular functions. Two-hybrid analysis gauges whether two proteins physically associate using a clever technique. First, each protein is fused to a separate fragment of a third protein. The third protein is a transcription factor; that is, it has the ability to switch on genes. In this case, the third protein switches on a reporter gene. There are generally two domains in a transcription factor, the DNA-binding domain and the activating domain. The DNA-binding domain is fused to one protein, the "bait" protein. The activating domain is fused to the "prey" protein. Neither hybrid can activate transcription of the reporter gene by itself. However, if the two proteins of interest interact, then the two fragments of the transcription factor come into sufficiently close contact to switch on the reporter gene.
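The logic of the assay can be written as a small Python model. This is only a sketch of the reasoning in the paragraph above, assuming an idealized system with no false positives or negatives; the class `Hybrid`, the function `reporter_active`, and the protein labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hybrid:
    protein: str  # the protein of interest fused to the fragment
    domain: str   # "DNA-binding" (bait) or "activating" (prey)

def reporter_active(hybrids, interactions):
    """Return True only if a bait and a prey are present and their proteins interact.

    interactions is a set of frozensets, each holding the two protein
    names of a pair that physically associates (an idealized assumption).
    """
    baits = [h for h in hybrids if h.domain == "DNA-binding"]
    preys = [h for h in hybrids if h.domain == "activating"]
    return any(
        frozenset((bait.protein, prey.protein)) in interactions
        for bait in baits
        for prey in preys
    )

bait = Hybrid("protein_X", "DNA-binding")
prey = Hybrid("protein_Y", "activating")
known_pairs = {frozenset(("protein_X", "protein_Y"))}

alone = reporter_active([bait], known_pairs)           # bait alone: reporter stays off
together = reporter_active([bait, prey], known_pairs)  # interacting pair: reporter on
```

The model captures the key design point of the assay: the reporter reads out an interaction only because neither half of the transcription factor is functional on its own.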

The name "two-hybrid" refers to the fact that two hybrid proteins are actually interacting. Discovering that an unidentified protein interacts with a protein of known function using two-hybrid analysis yields important information. This concept has been termed "guilt by association" (Oliver, 2000).
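"Guilt by association" can be illustrated with a short Python sketch. It assumes the simplest possible rule, namely that an unannotated protein is assigned the function most common among its annotated interaction partners; this rule, the function `infer_functions`, and the protein and function labels are hypothetical choices for illustration, not a method taken from the cited work.

```python
from collections import Counter

def infer_functions(interactions, known_functions):
    """Propose a function for each unannotated protein from its partners.

    interactions    -- iterable of (protein_a, protein_b) pairs
    known_functions -- dict mapping annotated proteins to a function label
    Ties are broken alphabetically so the result is deterministic.
    """
    partners = {}
    for a, b in interactions:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    proposals = {}
    for protein, neighbors in partners.items():
        if protein in known_functions:
            continue  # already annotated
        votes = Counter(known_functions[n] for n in neighbors if n in known_functions)
        if votes:
            proposals[protein] = max(sorted(votes), key=votes.get)
    return proposals

# Toy interaction data; every name here is invented.
toy_interactions = [
    ("unknown_1", "repair_A"),
    ("unknown_1", "repair_B"),
    ("unknown_1", "transport_C"),
    ("unknown_2", "unknown_3"),
]
toy_annotations = {
    "repair_A": "DNA repair",
    "repair_B": "DNA repair",
    "transport_C": "transport",
}
proposed = infer_functions(toy_interactions, toy_annotations)
```

Here `unknown_1` is proposed to act in DNA repair because two of its three partners do, while `unknown_2` and `unknown_3` receive no proposal because neither has an annotated partner.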

There are, of course, limitations to this technique. Two-hybrid analysis reveals potential protein interactions, but not the biological context in which they occur. Particular physiological conditions may yield false positive or false negative interactions. Some interactions may never be revealed by two-hybrid systems because the proteins involved are actually located in separate cellular compartments. Nevertheless, two-hybrid analysis is an important investigative tool for assaying protein function.

The Role of Proteomics in Disease

J. Craig Venter, guru of the private genome-mapping efforts, contends that all of today's medicine will seem antiquated once proteomics research begins yielding fruit (Fischer, 2000). Venter waxes optimistic because characterization of defective or missing proteins is the key to understanding diseases. Thus far, over half a dozen genes have been implicated in increased risk of Alzheimer's disease. But the only unambiguous diagnosis for Alzheimer's disease results from the presence of protein fragments in the brain.

Generally, the study of proteomics is relevant to molecular medicine for three reasons. First, almost all successful drugs either target or are themselves proteins. Second, proteins constitute the "final" product of gene expression. Finally, the function/dysfunction of a protein and the pathways that it participates in are often dependent on post-translational modifications that are not directly encoded in the genome. Thus, proteomics has been advanced as a core technology for translating genomic advances into a more coherent and pharmacologically useful understanding of proteins in disease.

The first (and perhaps most obvious) biomedical application of proteomics is extending progress in medical diagnosis. Diagnostic benefits were advertised as genomics' bread-and-butter; proteomics would build upon genomics in a predictable manner. One technology envisioned is a sort of clinical molecular scanner, a device that could examine tissue samples and detect subtle deviations from baseline normal states of health based on protein analysis. A protein analog to the DNA chip is also foreseen; such a chip would cheaply diagnose and precisely stage a range of diseases in a single patient (Weber, 2000).

Other medical breakthroughs from proteomics share these parallels to genomics. The challenge with genetic-based drug development was refining understanding of a biological process with the aim of identifying proteins pivotal to function. In genomics, a specific genetic lesion was identified, the resultant changes in proteins were elucidated, and a drug to counteract or correct aberrations was designed.

The difficult part of the method is determining the changes to proteins, because the function of the protein is often not well understood. From that point, numerous compounds are tested against the target protein as potential drug candidates, an expensive process. Thus, companies have an economic incentive to use the knowledge gained from proteomics for target validation. The ability to narrow down the possible proteins affected (and to isolate the actual effects) saves both time and money.

Conversely, a major challenge in drug development is to increase the number of potential protein targets from the approximately 500 against which virtually all drugs available today act to the estimated 10,000 potential protein targets (Banks et al., 2000). Pharmaceutical companies generally shy away from allocating resources to find new protein targets because of the large overhead costs involved. Proteomics, with its ability to identify novel protein targets cheaply, would provide opportunities for drug companies to move into new areas of research.

Further, certain proteins are associated with drug toxicity; proteomics research to this effect might serve as an early warning that a drug candidate is associated with unacceptable side effects. By developing profiles of proteins associated with side effects, proteomics can help to identify side effects in drug candidates that might not otherwise be identified until after expensive and lengthy clinical trials.

Proteomics also has been hailed as the next step in basic science contributing to diagnosis and therapy for neurological disorders (e.g., Creutzfeldt-Jakob disease), infectious diseases such as tuberculosis, heart failure, and cancer. Proteomics will no doubt present novel challenges in research, but the purported clinical benefits seem well worth the difficulty. The information yielded by proteomics will not only push the limits of genomics, but also push the frontiers of the current biomedical revolution.

Proteomics: Pushing the Frontiers of Genomics
