Issue 9, June 2002
So Many Numbers - What Do You Do With The Data?
Margaret Harris
Physics, Duke University
harris@jyi.org
The genetic instructions for making you, a human being, are written
in three billion DNA base pairs and tucked inside the nucleus of
every cell in your body (except red blood cells). Of those three
billion, only 2% are actually part of your roughly 35,000 genes.
The remainder may hold your chromosomes' structure together, play
unknown roles in regulating protein production, or simply take up
space as "junk" DNA, the detritus of humankind's long
evolution from earlier species.
If all that information - junk, genes, and all - were printed on
paper, it would fill 200 volumes each the size of a Manhattan telephone
book. If you started to read this weighty collection tomorrow, you'd
be at it for another 9.5 years. And if you took the DNA from every
cell in your body and laid it end-to-end, as you would inevitably
be inclined to do, it would reach pretty much anywhere on Earth
you wanted - several times over. When describing the size of these
numbers, "staggering" is an understatement.
Computers have, to a certain extent, solved the problem: The
Complete Works of You, Vols. 1-200, would fit comfortably onto
a reasonably sized computer hard drive. At one byte per base pair,
three billion pairs means roughly three gigabytes (GB) of disk space -
a big number, certainly, but nothing modern computers can't handle.
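As a rough check on that figure, here is a minimal sketch of the arithmetic in Python. It assumes one byte per base pair, as in a plain text file of A, C, G, and T; the 2-bit packing shown for comparison and the function names are illustrative assumptions, not anything taken from a real genome database.

```python
# Back-of-the-envelope storage estimate for one genome.
# Assumption: 3 billion base pairs, the figure quoted in the article.
BASES_PER_GENOME = 3_000_000_000


def genome_bytes(n_bases: int, bits_per_base: int = 8) -> int:
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8  # round up to a whole number of bytes


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9


# One byte per base: the plain-text case.
text_gb = to_gb(genome_bytes(BASES_PER_GENOME, bits_per_base=8))

# Hypothetical packed encoding: four possible bases fit in 2 bits.
packed_gb = to_gb(genome_bytes(BASES_PER_GENOME, bits_per_base=2))

print(f"plain text: {text_gb:.2f} GB")   # plain text: 3.00 GB
print(f"2-bit pack: {packed_gb:.2f} GB")  # 2-bit pack: 0.75 GB
```

Either way, a single genome fits on an ordinary disk; the trouble starts when the number of genomes grows.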
But what happens if researchers want to look at more than one genome
at a time? What if they want to examine, say, a few thousand, in
order to seek out and compare the genetic quirks that make us unique
- or give us rare diseases? What if they want to add their own comments
every few lines, as an aid to others' understanding? What should
they do with the data?
Different field, same problem: The next generation of fiber optic
cables will be fast enough to transmit the informational content
of the entire Library of Congress from New York to Paris in about
a millisecond, quicker than you can say "overdue fine."
But what happens once it gets there? What do you do with the data?
Consider the weatherman, the poor chap on TV who predicts flurries
and gets a blizzard. Creating better mathematical and computer models
of weather patterns - or any other near-chaotic, complex, and time-dependent
system - requires not only tremendous number-crunching power and
an immense amount of storage space, but also some means of organizing
and displaying the data in a meaningful way. Teasing predictions
out of such a system is, in the understated jargon of science, "nontrivial"
- that is, not easy. How do forecasters deal with the data?
The data problem is not new. Researchers have used computers for
scientific purposes since the days of vacuum tubes and relays, and
the need for effective data storage and access has always been a
driving force behind computer architecture. So far, the results
have been impressive; an empirical rule of thumb, Moore's Law, holds that
computing power doubles roughly every 18 months, and historically the
doubling period has often been even shorter.
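The doubling rule is easy to put into numbers. The sketch below assumes a fixed 18-month doubling period, the figure quoted above, and uses hypothetical function names; real hardware has never followed the curve this neatly.

```python
import math


def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `years`, doubling every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


def years_to_reach(factor: float, doubling_months: float = 18.0) -> float:
    """Years needed to grow by `factor` at the same doubling rate."""
    return math.log2(factor) * doubling_months / 12.0


# Growth over a few illustrative time spans.
for span in (1.5, 3, 15):
    print(f"{span:>4} years -> x{growth_factor(span):,.0f}")

# How long until capacity has grown about a thousandfold?
print(f"x1024 takes {years_to_reach(1024):.1f} years")
```

Ten doublings, or about 15 years at this pace, multiply capacity roughly a thousandfold.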
In recent years, however, changes in the scientific process have
led to such a rapid proliferation of data that advances in storage
are no longer keeping up with its sheer volume. The biggest
culprit is genomics, a new field whose reams of raw data gobble
up storage space on desktops and mainframes alike - both in high-profile
efforts like the Human Genome Project and in studies of smaller
gene fragments, or other species. Unlike older areas of biology,
genomics is largely driven by discoveries rather than by hypotheses.
In discovery-driven science, researchers mine collected data for
anomalies and trends, instead of using the more traditional process
of formulating a hypothesis or theory and testing it experimentally.
For example, a genomic scientist might track the expression of various
genes in yeast samples at different temperatures, to see if differences
in temperature lead to changes in the yeast's life cycle. The result
is a data-storage-and-access nightmare; even a humble fungus has
some 6,300 genes for researchers to monitor over its six-hour life
cycle, and it's easy to extrapolate that humans are almost unimaginably
more complex.
Other fields also use the discovery approach to research, and they
face similar problems with data. Neuroscientists, for example, often
compare large amounts of data in an effort to find patterns and
trends, and for them data storage and access has proved even more
problematic. In one experiment at Duke University, volunteers' brains
were scanned as they looked at various words. The resulting images,
depicted in false color to show regions of activity in the brain,
were recorded on high-density DVDs. This would have been ideal,
since DVDs are efficient information-storage platforms. However,
once the images were in place, the DVDs were stacked in a closet.
Although a closet full of disembodied brain images might sound creepy
to a layman, it bothered the scientists even more: How do you compare
images and data when all the raw information is literally closeted
and no two scans will fit on the computer at the same time?
The scientific data problem is inherently interesting to computer
scientists. This is especially true because information proliferation
will eventually bring Moore's Law up against a formidable
theoretical barrier: the quantum-mechanical limit on how much information
and how much computing power can fit onto a silicon wafer. Research
on quantum computing - an effort that hopes to use the quantum properties
of atoms or nuclei to mimic some functions of a computer's processor
and memory - is at least partly aimed at overcoming this physical
barrier.
Suggestions for a shorter-term "fix" are not hard to find.
In one typically clever proposal - with the catchy title "A
Petabyte in Your Pocket" - researchers at the University of
Wisconsin-Madison and the Oregon Graduate Institute suggested that
a new method of structuring databases could offer unheard-of storage
capacity without resorting to exotic quantum- or DNA-based computers.
Another approach proposes trading two-dimensional CDs and computer
chips for three-dimensional "information cubes"; the addition
of a z-axis to the x and y would allow more information to be encoded
at a single point.
Whatever the solution, though, the underlying problem is not going
to disappear anytime soon. Until it does, the question remains: What
do you do with the data?
Journal of Young Investigators. 2002. Volume Five.
Copyright © 2002 by Margaret Harris and JYI. All rights reserved.