You may think the Tree of Life was settled a long time ago, but scientists continue to refine, and sometimes radically alter, our understanding of how species are related. Once evolutionary history was based on the relationships of bones, skeletons and other morphological clues, but today DNA is the key player in the story of life.
At UT, scientists are using supercomputers to piece together this ancient puzzle.
Phylogenetics is the branch of life science that studies the evolutionary relationships among organisms based on genetic evidence. By aligning the molecular sequences of different species, scientists can determine where species diverged and create branching trees of relationships.
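To make the idea concrete, here is a minimal Python sketch of how aligned sequences can hint at relatedness. It assumes the sequences are already aligned. The species names, the toy sequences, and the helper functions `p_distance` and `closest_pair` are all invented for illustration; real phylogenetic software uses far richer statistical models than a simple count of differences.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, non-gap positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns where either sequence has a gap character.
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)


def closest_pair(aligned: dict[str, str]) -> tuple[str, str]:
    """Return the pair of names whose sequences differ the least."""
    return min(
        combinations(sorted(aligned), 2),
        key=lambda pair: p_distance(aligned[pair[0]], aligned[pair[1]]),
    )


# Hypothetical toy alignment, not real data.
toy = {
    "species_a": "ACGTACGTAC",
    "species_b": "ACGTACGTTC",
    "species_c": "ACGAACCTTG",
    "species_d": "TCGAACCTTA",
}

print(closest_pair(toy))  # ('species_a', 'species_b')
```

In this toy case, the two sequences with the fewest differences would be grouped first, which is roughly the intuition behind placing close relatives on neighboring branches.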
As gene sequencing becomes cheaper, researchers are performing more phylogenetic studies, helping them draw new conclusions about how organisms have evolved. However, the process of lining up tens of thousands of sequences from hundreds or thousands of different species is incredibly complicated, even for a computer.
UT computer science professor Tandy Warnow says crunching the numbers isn’t as easy as it sounds. “While those solutions can be done on small datasets or moderate sized data sets, on large datasets, they can take a very long time — weeks to months to years of computational time. The Texas Advanced Computing Center ends up being essential for those problems.”
The Texas Advanced Computing Center runs some of the biggest and most powerful systems in the world, but even its supercomputers can hardly keep up with the pace of genetic research. Moore’s law is commonly summarized as computer performance doubling roughly every two years; the ability of gene sequencers to generate data, however, has grown at an even faster rate.
Warnow, working with postdoctoral researcher Kevin Liu (Rice University) and Siavash Mirarab, a PhD student at UT, has been addressing these problems by creating smarter, faster, and more accurate algorithms and applying them to some of the biggest datasets ever created. With support from the National Science Foundation (through the Assembling the Tree of Life project), she and her colleagues have developed software that allows computers to draw better evolutionary trees faster.
Divide and Conquer
The software Warnow’s group developed over the course of several years is called SATé: Simultaneous Alignment and Tree Estimation. The method uses a novel divide-and-conquer approach.
“By dividing a really big data set that’s hard to align into small data sets that are closely related, you can get good estimates on each subset and then get an alignment on the full data set,” Warnow explains.
Massive supercomputers, like Ranger at TACC, align the sequences of each subset and combine the alignments into an alignment on the full set of sequences.
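The overall shape of that divide-and-conquer strategy can be sketched in a few lines of Python. This is only an illustration of the structure, not SATé itself: the function names are hypothetical, the input order is assumed to place related sequences next to each other, and both `align_subset` and `merge_alignments` are placeholders that pad sequences with gap characters rather than performing a real alignment.

```python
def split_into_subsets(names, max_size):
    """Recursively halve an ordered list of names until each piece is small enough."""
    if len(names) <= max_size:
        return [names]
    mid = len(names) // 2
    return (split_into_subsets(names[:mid], max_size)
            + split_into_subsets(names[mid:], max_size))


def align_subset(seqs):
    """Placeholder aligner: right-pads with gaps to equal length (NOT a real alignment)."""
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def merge_alignments(alignments):
    """Placeholder merger: pads every sub-alignment to the widest one and combines them."""
    width = max(len(next(iter(a.values()))) for a in alignments)
    merged = {}
    for alignment in alignments:
        for name, row in alignment.items():
            merged[name] = row.ljust(width, "-")
    return merged


def divide_and_conquer_align(seqs, order, max_size=2):
    """Divide the dataset, align each small subset, then merge into one alignment."""
    subsets = split_into_subsets(order, max_size)
    sub_alignments = [align_subset({n: seqs[n] for n in subset}) for subset in subsets]
    return merge_alignments(sub_alignments)


# Hypothetical toy input, not real data.
toy_seqs = {"a": "ACGT", "b": "ACG", "c": "ACGTTA", "d": "ACGTT", "e": "AC"}
full = divide_and_conquer_align(toy_seqs, ["a", "b", "c", "d", "e"])
for name, row in full.items():
    print(name, row)
```

Because each subset is aligned independently, the per-subset work is the part that could be spread across many processors on a machine like Ranger.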
There’s no way to know if the tree that emerges is absolutely accurate. Some trees are obviously wrong—for example, those that show humans and crocodiles on the same branch, separated from chimps—but most are probable.
For that reason, SATé uses a statistical method to provide a maximum likelihood score: a measure by which to assess its accuracy against other answers. SATé repeats the process of alignment and tree-building many times until a tree with the highest likelihood score is reached.
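The repeat-and-keep-the-best loop described above might look something like the following Python sketch. Everything here is an assumption made for illustration: `iterate_to_best`, `propose`, and `score` are hypothetical names, the stopping rule (give up after a few rounds without improvement) is one plausible choice rather than SATé's documented behavior, and the toy example scores integers instead of trees.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def iterate_to_best(
    start: T,
    propose: Callable[[T], T],
    score: Callable[[T], float],
    max_rounds: int = 100,
    patience: int = 3,
) -> Tuple[T, float]:
    """Repeatedly propose a new candidate and remember the highest-scoring one.

    Stops after max_rounds, or after `patience` consecutive rounds
    that fail to beat the best score seen so far.
    """
    best, best_score = start, score(start)
    current = start
    stale = 0
    for _ in range(max_rounds):
        current = propose(current)  # stand-in for "re-align and rebuild the tree"
        current_score = score(current)
        if current_score > best_score:
            best, best_score, stale = current, current_score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_score


# Toy stand-ins: candidates are integers and the score peaks at 5.
result = iterate_to_best(0, propose=lambda x: x + 1, score=lambda x: -(x - 5) ** 2)
print(result)  # (5, 0)
```

In a real analysis, the score would be the likelihood of a tree given the alignment, and each round would cost hours of computing time rather than microseconds.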
In software development, it’s not enough to invent a new product. One must also prove the product is better than the alternatives. To this end, Warnow and her team have been working as quality assurance and reliability testers, solving hard evolutionary tree problems multiple times, with different methods and parameters, to ensure that SATé produces the highest-quality result.
In work first reported in Science and later explored in PLoS Currents and Systematic Biology, the researchers have shown repeatedly that SATé either matches the accuracy of commonly used alignment and tree estimation methods while running far faster, or delivers greater accuracy in the same amount of time.
For the Birds
Warnow and her team’s efforts go beyond algorithmic and software development. They also collaborate with evolutionary biologists on projects where their methodological improvements can lead to new insights.
Since Charles Darwin’s day, scientists have debated the evolutionary history of flightless birds, known as ratites. How did so many similar species get to the far-flung corners of the Earth?
“The theory of continental drift provided a convenient answer,” says Michael Braun, a curator in the department of systematic biology at the Smithsonian Institution. “These birds evolved from a common flightless ancestor and then drifted to their current distributions. For 40 years, this remained the textbook explanation of species dispersal.”
That is, until Braun discovered through DNA analysis that the tinamous, an ancient family of birds found in South America, were among the closest relatives of emus and ostriches, and tinamous can fly.
This fact, combined with the lack of skeletal evidence for flightless birds before the time of continental breakup, led to a re-conceptualization of the ratite branch of the avian tree. Ratites were in fact descended from flying birds that traveled to places where flight was no longer an evolutionary advantage, and consequently lost their ability to fly.
Improving the quality of the avian tree of life brought a new history to light.
Recently, Warnow worked with Braun, using SATé, to reanalyze his controversial findings. Their study confirmed the evolutionary relationship that Braun found.
Beyond telling us about the family history of the dodo bird, better, faster, more accurate phylogenetic methods can have a life-or-death impact. The Centers for Disease Control use sequence alignment and evolutionary tree-building tools when a new virus emerges to determine where it may have come from and how it differs from previous viruses. Plant scientists also use tree-building tools to determine which genes are associated with positive traits like hardiness and drought tolerance.
This knowledge is enabling scientists to breed more productive crops, helping to feed the world. But none of these problems are easily solved.
“Many research groups are estimating trees containing anywhere from a few thousand to hundreds of thousands of species, toward the eventual goal of estimating a Tree of Life, containing perhaps as many as several million leaves,” Warnow wrote in a recent article in Systematic Biology. “These phylogenetic estimations present enormous computational challenges, and current computational methods are likely to fail to run even on datasets in the low end of this range.”
In other words, small problems may be within reach, but the big ones remain.
“It’s not getting any easier, but it is getting more fun,” Warnow says.
Top photo by ecstaticist on Flickr. Inset photo: Tandy Warnow (UT photo).
This article first appeared on the TACC website.