Peter H. Weston
Royal Botanic Gardens
Mrs Macquarie's Road
Sydney NSW 2000
Michael D. Crisp
Botany and Zoology, School of Biology
Australian National University, Canberra ACT 2601
- Logical basis of phylogenetic analysis
- Maximum parsimony as a logical criterion
- Analysis incorporating explicit models of evolution
- Computer programs
- Further reading
Phylogenetic analysis was originally developed by biological systematists who wanted to reconstruct evolutionary genealogies of species based on morphological similarities. The German entomologist Willi Hennig was the first author to propose an explicit method of phylogenetic analysis, and the publication of his work in English (Hennig 1966) quickly led to the widespread use of his approach. Phylogenetic methods used to reconstruct the relationships between macromolecular sequences also involve the application of Hennigian principles.
Logical basis of phylogenetic analysis
The results of phylogenetic analysis may be depicted as a hierarchical branching diagram, a "cladogram" or "phylogenetic tree" (e.g. Figure 1a). Each cluster (branch) represents a postulated clade or monophyletic group, that is, a group comprising all the sampled descendants of a single ancestral lineage. Alternatively, the cladogram may be represented as a set of nested boxes or bracketed groups (e.g. Figure 1b).
Clades are characterised by shared possession of uniquely derived evolutionary novelties or "synapomorphies" (literally "together, derived shape"). Since all similarities between organisms must have arisen as evolutionary novelties at some time, it follows that phylogenetic analysis is an attempt to recognise the identity and taxonomic distribution of synapomorphies. Figure 1c shows a phylogenetic tree with several synapomorphies plotted along the branches. These could be any kind of inherited phenotypic or genotypic characteristics; synapomorphy 3, for instance, could be the evolutionary appearance of nerve cells or the fixation of a change from guanine to adenine at a particular site in a DNA sequence. Note that the synapomorphies are perfectly congruent, covarying with one another, and are thus consistent with a single cladogram.
Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states. In an alignment of DNA sequences, for example, each aligned site is a separate character, each with four character states, the four nucleotides. Similarly, in an alignment of amino acid sequences, each aligned site is a character, each with 20 states. Usually an alignment gap (an insertion or deletion) is treated as an absence of data rather than as an extra state.
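The coding described above can be sketched in a few lines of Python. This is a minimal illustration only: the three-taxon alignment and the function name `to_characters` are invented for the example and are not taken from the figures. Each aligned site becomes one character, and a gap is recorded as missing data (`None`) rather than as a fifth state.

```python
# Toy alignment invented for illustration; not taken from the article's figures.
alignment = {
    "taxon_A": "ACGT-A",
    "taxon_B": "ACGTTA",
    "taxon_C": "ATGT-G",
}

def to_characters(alignment, gap="-"):
    """Return one dict per aligned site, mapping taxon -> state (None = missing)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in an alignment must have the same length")
    n_sites = lengths.pop()
    characters = []
    for site in range(n_sites):
        column = {}
        for taxon, seq in alignment.items():
            state = seq[site]
            # A gap is treated as absence of data, not as an extra state.
            column[taxon] = None if state == gap else state
        characters.append(column)
    return characters

characters = to_characters(alignment)
```

Here `characters[4]` holds a nucleotide only for `taxon_B`; the other two taxa are scored as missing at that site.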
Maximum parsimony as a logical criterion
Sets of characters sampled from a group of organisms, such as an alignment of DNA sequences, rarely covary perfectly. Usually, some of the characters are incongruent with one another, a phenomenon called "homoplasy". Homoplasy is most commonly due to multiple independent origins of indistinguishable evolutionary novelties. For example, if synapomorphies 2 and 9 in Figure 1c both involved a change to adenine at the same homologous site in a DNA sequence, then we would observe a misleading similarity between the Corn and Human sequences at that site (Figure 2a). The evidence of this site alone would suggest a clade of (Corn, Human); however, such a group is incongruent with the evidence of some of the other sites. For instance, the changes from G to A at site 6 and G to T at site 7 in the common ancestor of Human and Frog (Figure 2a,b) would, under that grouping, have to be interpreted as having occurred independently in Human and Frog.
Given the existence of incongruence, a logical criterion is required in order to choose the cladogram that best fits the observed distribution of characters. The simplest and most commonly used criterion is maximum parsimony -- choosing the cladogram that minimises the number of postulated character-state changes (e.g. Figure 2b), thus minimising homoplasy and maximising congruence between the characters (Farris 1983).
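One way to count the character-state changes that a given cladogram requires is Fitch's algorithm for unordered characters. The sketch below is an illustration under stated assumptions, not the procedure of any particular program: trees are written as nested 2-tuples of taxon names, the four-taxon alignment is invented for the example, and gaps or missing data are not handled.

```python
def fitch_site(tree, states):
    """Return (state_set, changes) for one character on a rooted binary tree.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name -> observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state: no extra change is needed here.
        return common, left_changes + right_changes
    # No shared state: one change must be postulated at this node.
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_length(tree, alignment):
    """Total number of postulated changes over all aligned sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total

# Hypothetical data: sites 1-3 support (W, X) and (Y, Z); site 4 conflicts.
alignment = {"W": "AAGG", "X": "AAGA", "Y": "GGAA", "Z": "GGAG"}
candidate_trees = [
    (("W", "X"), ("Y", "Z")),
    (("W", "Y"), ("X", "Z")),
    (("W", "Z"), ("X", "Y")),
]
lengths = [parsimony_length(tree, alignment) for tree in candidate_trees]
best_tree = candidate_trees[lengths.index(min(lengths))]
```

For this toy data the three possible groupings require 5, 8 and 7 changes respectively, so the first tree is preferred. The conflicting fourth site must change twice on that tree, which is the homoplasy the criterion seeks to minimise.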
Finding a maximally parsimonious cladogram is usually a computationally intensive task requiring computer analysis. Moreover, exact algorithms can handle only small problems, typically fewer than 20 taxa or sequences. For larger problems, fast heuristic algorithms must be employed, and although some of these have been found empirically to be effective, they cannot be guaranteed to find the optimal cladogram(s).
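The reason exact searches become impractical can be seen by counting the candidate trees. For n taxa there are (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5) distinct unrooted, fully bifurcating trees. The short sketch below simply evaluates that product; the function name is chosen for this example only.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa labelled taxa."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the factor 1 is implicit).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Four taxa give 3 trees and ten taxa give 2,027,025, while twenty taxa give more than 10^20, far too many to evaluate one by one.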
A point that we have avoided so far is that phylogenetic analysis using maximum parsimony produces trees with no evolutionary root. For example, the cladogram in Figure 2b shows the lineage leading to Corn diverging earlier than those leading to the other species; that is, the cladogram is "rooted" on the internode connecting Corn to the rest of the tree. This rooting is not derived from the sequence alignment being analysed, but from other information: we know from other sources of evidence that all metazoans are more closely related to each other than any of them is to green plants. Analysis of the sequence alignment alone allows us only to produce an unrooted tree (Figure 2c). We then place a root on the tree by using Corn as an "outgroup". Consideration of the kind of evidence that is relevant to cladogram rooting is beyond the scope of this chapter, but a review of this topic is provided by Weston (1994).
Having produced a cladogram, we will want to know how well corroborated it is as an estimate of evolutionary history. A number of techniques have been developed for quantifying the level of congruence shown by the characters used in an analysis, and in particular, for providing a measure of the level of support for groupings of taxa. These methods attempt to estimate the degree to which an analysis has converged on a stable result. Bootstrap analysis, a statistical technique, is the most commonly used method. This involves analysis of a sample of (usually 100 to 1000) randomly perturbed data sets. In each perturbation, the original characters (e.g. sites of a DNA sequence alignment) are randomly resampled with replacement, producing a new data set of the same size in which some characters are represented more than once, some appear once, and some are omitted. The perturbed data sets are each analysed in the same manner as the real data, and the number of times that each grouping of species appears in the resulting profile of cladograms is taken as an index of relative support for that grouping. Other indices of support have been developed, but the statistical bases of all of them (including the bootstrap) have been criticised.
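The resampling step of the bootstrap can be sketched as follows. This is a minimal illustration, assuming an alignment stored as a dict of equal-length strings; `infer_groups` is a hypothetical callable standing in for whatever tree-building analysis is applied to the real data, and it is assumed to return the groups (as frozensets of taxon names) found in its cladogram.

```python
import random
from collections import Counter

def bootstrap_replicate(alignment, rng):
    """Resample aligned sites with replacement, keeping each column intact."""
    taxa = list(alignment)
    n_sites = len(alignment[taxa[0]])
    # Draw as many column indices as there are sites, with replacement, so
    # some sites appear more than once and others are left out.
    sampled = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {t: "".join(alignment[t][i] for i in sampled) for t in taxa}

def bootstrap_support(alignment, infer_groups, n_replicates=100, seed=0):
    """Fraction of replicates in which each group is recovered.

    infer_groups is a hypothetical, caller-supplied function: it takes an
    alignment and returns an iterable of frozensets of taxon names.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        replicate = bootstrap_replicate(alignment, rng)
        # Analyse the perturbed data in the same way as the real data.
        counts.update(set(infer_groups(replicate)))
    return {group: n / n_replicates for group, n in counts.items()}
```

The returned fractions correspond to the index of relative support described above; a fixed `seed` is used here only so that a run can be repeated exactly.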
Analysis incorporating explicit models of evolution
Under certain conditions, analyses of molecular sequences using maximum parsimony will converge to the wrong cladogram. The simplest such case is illustrated in Figure 3. This example represents the more general problem of "long branch attraction" (see e.g. Penny et al. 1990). Where two long evolutionary lineages (lineages that have undergone a high level of sequence evolution) are separated by a short lineage, the long lineages will tend to be spuriously joined in the most parsimonious cladogram produced from the resulting sequence data.
A number of methods have been developed to overcome this problem, most of which involve non-linear transformations of sequence alignments or of evolutionary distances derived from the sequence data. The "cost" of performing such transformations is that an explicit model of sequence evolution must be assumed, violation of which may also lead to spurious results. Maximum-likelihood estimation is the most rigorous of these approaches, but it is computationally very intensive. A quick approximation to maximum likelihood may be achieved by calculating transformed evolutionary distances and fitting these to a tree using either minimum evolution or least-squares criteria.
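As an example of a transformed evolutionary distance, the sketch below applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The Jukes-Cantor model is chosen here only because it is the simplest explicit model (equal base frequencies and equal rates for all substitutions); it is an assumption of this example, not a model named in the text. Sites with a gap in either sequence are skipped.

```python
import math

def p_distance(seq_a, seq_b, gap="-"):
    """Observed proportion of differing sites, ignoring sites with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor_distance(seq_a, seq_b):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined once sequences look fully saturated.
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The transformation is non-linear: the corrected distance grows faster than p as sequences diverge, compensating for multiple changes at the same site that the raw count of differences cannot see.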
Computer programs
A number of programs for phylogenetic analysis are available. Some of these are summarised in the accompanying document Introduction to Some Computer Programs Used in Phylogenetics [not on this website].
Further reading
Hillis et al. (1993), Swofford et al. (1996) and Morrison (1996) provide balanced and more detailed introductions, particularly to the phylogenetic analysis of sequence data. Penny et al. (1990) is an excellent discussion of sources of error in phylogenetic reconstruction. Forey et al. (1992) is a useful textbook on maximum parsimony. The documents included with the PHYLIP package (Felsenstein 1993) constitute a readily accessible (free) introduction to phylogenetic analysis with numerous cited references.
Farris J.S. 1983. The logical basis of phylogenetic analysis. In: Advances in Cladistics (Edited by Platnick N.I. and Funk V.A.), pp. 1-36. Columbia Uni. Press, New York.
Felsenstein J. 1993. PHYLIP (Phylogeny Inference Package), version 3.5c. [Distributed by the author: email@example.com, from the anonymous ftp site at 126.96.36.199 or the WWW site at http://evolution.genetics.washington.edu/phylip.html.]
Forey P.L., Humphries C.J., Kitching I.J., Scotland R.W., Siebert D.J., and Williams D.M. 1992. Cladistics: a Practical Course in Systematics. Oxford Uni. Press, Oxford.
Hennig W. 1966. Phylogenetic Systematics. Uni. Illinois Press, Urbana. [Translated by Davis D.D. and Zangerl R. from Hennig W. 1950. Grundzüge einer Theorie der Phylogenetischen Systematik. Deutscher Zentralverlag, Berlin.]
Hillis D.M., Allard M.W. and Miyamoto M.M. 1993. Analysis of DNA sequence data: phylogenetic inference. Methods in Enzymology 224: 456-487. [Molecular Evolution: Producing the Biochemical Data (Edited by Zimmer E.A., White T.J., Cann R.L. and Wilson A.C.), pp. 456-487. Academic Press, San Diego.]
Morrison D.A. 1996. Phylogenetic tree-building. International Journal for Parasitology 26: 589-617.
Penny D., Hendy M.D., Zimmer E.A. and Hamby R.K. 1990. Trees from sequences: panacea or Pandora's box? Australian Systematic Botany 3: 21-38.
Swofford D.L., Olsen G.J., Waddell P.J. and Hillis D.M. 1996. Phylogenetic inference. In: Molecular Systematics, second edition (Edited by Hillis D.M., Moritz C. and Mable B.K.), pp. 407-514. Sinauer Associates, Sunderland.
Weston P.H. 1994. Methods for rooting cladistic trees. In: Models in Phylogeny Reconstruction (Edited by Scotland R.W., Siebert D.J. and Williams D.M.), pp. 125-155. Oxford Uni. Press, Oxford.
The original version of this article appeared as part of the Bioinformatics special issue on the Australian Biotechnology Association WWW home pages. The copyright in the original text and illustrations is the property of the authors.