What is life? USC computational biologists are adding it up, and the sum of their efforts is fueling the gene science revolution.

THIS IS THE CHALLENGE AT THE FAR EDGES OF BIOLOGY – the one that has mathematicians, computer scientists, engineers and bioscientists all in a ferment: You can’t do research by telling stories any more. Once upon a time, biology was all about telling stories. First, simple, careful stories by alert observers about how plants and animals live – the story of how an egg turns into a caterpillar, a cocoon, a butterfly; as well as the far more baroque metamorphoses of the malaria parasite. Later, increasingly complex stories followed: about the cells that make up all living things and, more intricate still, about the chemistry that makes these cells run.
The storytelling isn’t over. Much work remains to be done, in medicine and biology, to relate the hidden tales of life. But now, at the genetic frontiers of biological science, the world has become too complicated and seemingly chaotic to explain in terms of narratives. The remaining stories involve hundreds or even thousands of near-indistinguishable characters, all interacting furiously.
Matters have gone beyond the old baseball catchphrase, “You can’t tell the players without a program.” In the new biology, you can’t read the program without a program.
To feel their way in a blinding blizzard of seemingly featureless data, gene scientists more and more must rely on a new set of tools – tools more familiar to theoretical physicists and communications engineers. The tools are mathematical, created by researchers who call themselves “computational biologists.”

UNTIL RECENTLY, SCIENTISTS WHO CARED TO straddle these worlds were rare. “One of the main attractions here at USC,” notes molecular biologist Norman Arnheim, who has been at the forefront of basic genetic research for decades, “is that we’ve long had a group of sophisticated mathematicians interested in analyzing complex biological data. Now, more and more people recognize the significance of this work,” he says, and the field has mushroomed. “But that has only been in the last four or five years.”
At USC, the story goes back decades. Mathematician-biologist Simon Tavaré and other USC researchers, particularly mathematician-biologist-computer scientist Michael Waterman, were among the pioneers in this new subject, one that promises to pay extraordinary dividends not just in the design of drugs and therapies but in the basic understanding of life itself.

Frequently called “the father of computational biology,” Michael Waterman built the ingenious algorithms that make large-scale DNA analysis possible.

Perhaps belatedly, but resolutely and enthusiastically, USC has now built on that strength, taking advantage of the presence of Waterman, Tavaré and their collaborators in computational biology to formally establish a center in the field. Gifts from the Ralph M. Parsons Foundation, the Fletcher Jones Foundation, and $2.5 million from trustee Andrew J. Viterbi Ph.D. ’62 are meant to bring USC into the first decade of the 21st century as a pre-eminent powerhouse in this discipline.
No one is more pleased with these developments than Tavaré, a ruddy Englishman with an international reputation for using DNA to unravel evolutionary relationships between cells. He talks excitedly about activity now bubbling up at the Center for Computational and Experimental Genomics – part of the Department of Molecular Biology in USC’s College of Letters, Arts and Sciences.
The effort already includes 20 Ph.D.s – mostly post-docs but also three promising new hires, assistant professors Magnus Nordborg and Tim Chen and associate professor Fengzhu Sun – plus numerous graduate students. The attraction isn’t just the personnel in computational biology, but also the prospect of working with extraordinarily gifted molecular biologists like Arnheim and Myron Goodman, as well as Leonard Adleman, the cross-disciplinary computer scientist from the USC School of Engineering who was the first to use DNA as a computing medium. Linked activity animates the USC Health Sciences campus, with ties to Childrens Hospital Los Angeles and beyond.


TO BE SURE, THE USE OF ADVANCED MATHEMATICS, particularly statistics, in biology is hardly new.
“It has always been a part of genetics, almost right from the beginning,” says Tavaré. But a whole next generation of new and strange problems began to knock at biology’s door in the late 1970s, when scientists first acquired the lab tools to follow up on Nobel Prize-winners James Watson and Francis Crick’s 1953 discovery of the structure and information-carrying properties of DNA.
USC’s Michael Waterman was present at this dawning. The lanky University Professor, usually dressed in blue jeans, grew up on a ranch on the Oregon coast. Biology was far from his thoughts when he began his career in the recondite field of theoretical statistics; the subject of computational biology didn’t yet exist.

“It’s like an orchestra with 40,000 musicians. We recognize the end result as music, but it’s very difficult to hear each individual musician at every moment. We have identified the musicians. Now we have to understand the score.”

Waterman had pictured a future teaching mathematics in a small Western college where he could spend time outdoors (he is a passionate hiker and fly fisherman). By his late 20s, Waterman was on his way to fulfilling that destiny. A tenured professor at Idaho State University, he published highly abstract papers with titles like “On Jacobi’s Solution of Linear Diophantine Equations.” Weekends, he happily journeyed to nearby rivers to try to outwit the trout.
Then in the early 1970s, the young statistician met Temple Smith, then a nuclear physicist visiting the Los Alamos National Laboratory. Smith was working with Stanislaw Ulam, a Polish theorist who had played a key role in the creation of the atomic and hydrogen bombs, but whose interest had since turned to the mysteries of biology – specifically, to chemical differences in the makeup of protein molecules from related species.
Waterman began collaborating with Smith in 1974. After 1976, his name began to appear with Smith’s (now a biomedical engineer at Boston University) on papers exploring a whole new subject matter, with titles like “Additive Evolutionary Trees.” Waterman himself moved to Los Alamos to follow the unexpected twist in his career path.
The new subject Smith and Waterman were pioneering soon began to explode. Researchers almost daily found faster and more accurate techniques for “sequencing,” or reading off, the letters of the genetic alphabet from DNA molecules. They got a quantum boost with the discovery – by a Cetus Corp. team including Arnheim and Nobelist Kary Mullis – of the polymerase chain reaction. The invention allows scientists to amplify a small sample of DNA to a quantity sufficiently large for analysis.
The brilliant lab work brought with it, however, a swarm of information that needed to be understood. The hard part was finding intelligible messages in the endless strings of A’s, G’s, C’s and T’s (see “Scrambled Library,” this page).
In 1978, scientists uncovered a fact that multiplied the level of difficulty. In complex animals like humans, it turned out, genetic information isn’t written on chromosomes in a continuous stream. Instead, the message skips. It begins, flows smoothly for a few hundred letters and then hiccoughs many thousands of apparent nonsense letters. The message resumes with another few hundred letters, more nonsense, another message bit, and so on. It stops and starts again up to 25 times before finally ending, like a TV show interrupted again and again for commercial breaks. Only there’s no easy way to tell the message units – called “exons” – apart from the surrounding nonsense.
“When we found out about this,” says Waterman, recollecting his and Smith’s reaction, “we just stood and looked at each other.” The researchers realized that the statistical ideas they’d adopted to unravel genetic messages needed to be completely rethought, and that the problem was far more difficult than they had imagined.
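The interrupted layout described above can be pictured with a toy sketch. Everything in it is made up for illustration: the sequence, the coordinates, the helper name `splice_exons`, and the use of uppercase for message units and lowercase for the interruptions. It assumes the exon positions are already known, which is exactly the information researchers lacked.

```python
def splice_exons(dna, exon_spans):
    """Join the exon segments of `dna`, dropping everything in between.

    `exon_spans` is a list of (start, end) positions, end exclusive.
    """
    return "".join(dna[start:end] for start, end in exon_spans)


# Toy data (hypothetical): uppercase marks message units (exons),
# lowercase marks the apparent nonsense between them.
dna = "ATGGCC" + "gtaagtttcag" + "TTTGAA" + "gtccctag" + "CGCTAA"

# Assumed to be known here; in real data these boundaries must be inferred.
exon_spans = [(0, 6), (17, 23), (31, 37)]

message = splice_exons(dna, exon_spans)
print(message)
```

Once the spans are given, the joining step is trivial. In real sequence data there is no uppercase or lowercase to lean on, which is why the boundaries themselves became a statistical problem.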
Their solution became the Smith-Waterman algorithm, an extraordinarily clever set of mathematical tricks that has remained the gold standard for comparing genes. Smith and Waterman refined the algorithm from its first 1981 version and have long been distributing successive versions as free software. Many other researchers – including Tavaré – have devised alternative methods for quicker sorting. Waterman himself collaborated recently with Eric Lander of the MIT Whitehead Institute to create the Lander-Waterman algorithm, which provides a road map for the sequencing of large amounts of DNA.
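For readers curious what comparing two sequences this way looks like in practice, here is a minimal sketch of Smith-Waterman-style local alignment. It is not the software Smith and Waterman distribute. The scoring values (match +2, mismatch -1, gap -1), the simple per-letter gap penalty, and the function name `smith_waterman` are all assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not published defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart anywhere (local, not global).
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTTT", "GGACGTGG")
print(score, top, bottom)
```

The table is filled once, so the work grows with the product of the two sequence lengths. That cost is one reason quicker, approximate methods were later devised for searching very large collections.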
Bottom line, virtually all the numerous genes – human, microorganism, plant and animal – identified so far were decoded using variations on the algorithms Waterman and his collaborators invented, giving the USC professor unique status in the field.
“His work has advanced biological sequence analysis from a collection of ad hoc procedures to a rigorous and mature subject,” says David Eisenberg, director of the UCLA-U.S. Department of Energy Laboratory of Structural Biology and Molecular Medicine.

