phylip-3.697/doc/clique.html
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program uses the compatibility method for unrooted two-state characters to obtain the largest cliques of characters and the trees which they suggest. This approach originated in the work of Le Quesne (1969), though the algorithms were not precisely specified until the later work of Estabrook, Johnson, and McMorris (1976a, 1976b). These authors proved the theorem that a group of two-state characters which were pairwise compatible would be jointly compatible. This program uses an algorithm inspired by the Kent Fiala - George Estabrook program CLINCH, though closer in detail to the algorithm of Bron and Kerbosch (1973). I am indebted to Kent Fiala for pointing out that paper to me, and to David Penny for describing to me his branch-and-bound approach to finding the largest cliques, from which I have also borrowed. I am particularly grateful to Kent Fiala for catching a bug in versions 2.0 and 2.1 which resulted in those versions failing to find all of the cliques which they should. The program computes a compatibility matrix for the characters, then uses a recursive procedure to examine all possible cliques of characters.
After one pass through all possible cliques, the program knows the size of the largest clique, and during a second pass it prints out the cliques of the right size. It also, along with each clique, prints out the tree suggested by that clique.
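The pairwise test behind the compatibility matrix is simple for two-state characters: two characters are compatible exactly when the four state combinations 00, 01, 10, and 11 do not all occur among the species. A small sketch in Python (illustrative code, not PHYLIP's own source) of the matrix computation, using characters 1, 2, and 4 of the example data shown later in this file:

```python
# Sketch of pairwise compatibility for two-state characters: two columns
# are compatible iff the four combinations 00, 01, 10, 11 do NOT all occur.
def compatible(col_i, col_j):
    return len(set(zip(col_i, col_j))) < 4

def compatibility_matrix(columns):
    n = len(columns)
    return [[compatible(columns[i], columns[j]) for j in range(n)]
            for i in range(n)]

# Characters 1, 2, 4 of the clique example data, read down the species:
char1, char2, char4 = "11100", "11000", "10101"
print(compatible(char1, char2))  # True: only 11, 10, 00 occur
print(compatible(char1, char4))  # False: all four combinations occur
```

The full matrix is just this test applied to every pair, which is why the storage requirement quoted below grows as the square of the number of characters.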
Input to the algorithm is standard, but the "?", "P", and "B" states are not allowed. This is a serious limitation of this program. If you want to find large cliques in data that has "?" states, I recommend that you use MIX instead with the T (Threshold) option and the value of the threshold set to 2.0. The theory underlying this is given in my paper on character weighting (Felsenstein, 1981b).
The options are chosen from a menu, which looks like this:
Largest clique program, version 3.69

Settings for this run:
  A   Use ancestral states in input file?  No
  F              Use factors information?  No
  W                       Sites weighted?  No
  C          Specify minimum clique size?  No
  O                        Outgroup root?  No, use as outgroup species 1
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3        Print out compatibility matrix  No
  4                        Print out tree  Yes
  5       Write out trees onto tree file?  Yes

Y to accept these or type the letter for one to change
The A (Ancestors), F (Factors), O (Outgroup), M (Multiple Data Sets), and W (Weights) options are the usual ones, described in the main documentation file.
If your input file contains ancestral-state information, you must also choose option A (Ancestors) in the menu. When the Ancestors option is invoked, the compatibility matrix calculation in effect assumes that there is in the data another species that has all the ancestral states. This changes the compatibility patterns in the proper way. The Ancestors option also requires information on the ancestral states of each character to be in the input file.
The O (Outgroup) option will take effect only if the tree is not rooted by the Ancestral States option.
The C (Clique Size) option indicates that you wish to specify a minimum clique size and print out all cliques (and their associated trees) greater than or equal to that size. The program prompts you for the minimum clique size.
Note that this allows you to list all cliques (each with its tree) by simply setting the minimum clique size to 1. If you do one run and find that the largest clique has 23 characters, you can do another run with the minimum clique size set at 18, thus listing all cliques within 5 characters of the largest one.
Output involves a compatibility matrix (using the symbols "." and "1") and the cliques and trees.
If you have used the F option there will be two lists of characters for each clique, one the original multistate characters and the other the binary characters. It is the latter that are shown on the tree. When the F option is not used the output and the cliques reflect only the binary characters.
The trees produced indicate on each branch the points at which derived character states arise in the characters that define the clique. There is a legend above the tree showing which binary character is involved. Of course if the tree is unrooted you can read the changes as going in either direction.
The program runs very quickly but if the maximum number of characters is large it will need a good deal of storage, since the compatibility matrix requires ActualChars x ActualChars boolean variables, where ActualChars is the number of characters (in the case of the factors option, the total number of true multistate characters).
Basically the following assumptions are made:
The assumptions of compatibility methods have been treated in several of my papers (1978b, 1979, 1981b, 1988b), especially the 1981 paper. For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b), but also read the exchange between Felsenstein and Sober (1986).
A constant available for alteration at the beginning of the program is the form width, "FormWide", which you may want to change to make it as large as possible consistent with the page width available on your output device, so as to avoid the output of cliques and of trees getting wrapped around unnecessarily.
    5    6
Alpha     110110
Beta      110000
Gamma     100110
Delta     001001
Epsilon   001110
Largest clique program, version 3.69

 5 species,   6 characters

Species    Character states
-------    --------- ------

Alpha        11011 0
Beta         11000 0
Gamma        10011 0
Delta        00100 1
Epsilon      00111 0

Character Compatibility Matrix (1 if compatible)
--------- ------------- ------ -- -- -----------

   111..1
   111..1
   111..1
   ...111
   ...111
   111111

Largest Cliques
------- -------

Characters: (  1  2  3  6)

Tree and characters:

   2  1  3  6
   0  0  1  1

        +1-Delta
  +0--1-+
+--0-+  +--Epsilon
!    !
!    +--------Gamma
!
+-------------Alpha
!
+-------------Beta

remember: this is an unrooted tree!
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
Consense reads a file of computer-readable trees and prints out (and may also write out onto a file) a consensus tree. At the moment it carries out a family of consensus tree methods called the Ml methods (Margush and McMorris, 1981). These include strict consensus and majority rule consensus. Basically the consensus tree consists of monophyletic groups that occur as often as possible in the data. If a group occurs in more than a fraction l of all the input trees it will definitely appear in the consensus tree.
The tree printed out has at each fork a number indicating how many times the group which consists of the species to the right of (descended from) the fork occurred. Thus if we read in 15 trees and find that a fork has the number 15, that group occurred in all of the trees. The strict consensus tree consists of all groups that occurred 100% of the time, the rest of the resolution being ignored. The tree printed out here includes groups down to 50%, and below it until the tree is fully resolved.
The majority rule consensus tree consists of all groups that occur more than 50% of the time. Any other percentage level between 50% and 100% can also be used, and that is why the program in effect carries out a family of methods. You have to decide on the percentage level, figure out for yourself what number of occurrences that would be (e.g. 15 in the above case for 100%), and resolutely ignore any group below that number. Do not use numbers at or below 50%, because some groups occurring (say) 35% of the time will not be shown on the tree. The collection of all groups that occur 35% or more of the time may include two groups that are mutually self contradictory and cannot appear in the same tree. In this program, as the default method I have included groups that occur less than 50% of the time, working downwards in their frequency of occurrence, as long as they continue to resolve the tree and do not contradict more frequent groups. In this respect the method is similar to the Nelson consensus method (Nelson, 1979) as explicated by Page (1989) although it is not identical to it.
The program can also carry out Strict consensus, Majority Rule consensus without the extension which adds groups until the tree is fully resolved, and other members of the Ml family, where the user supplies the fraction of times the group must appear in the input trees to be included in the consensus tree. For the moment the program cannot carry out any other consensus tree method, such as Adams consensus (Adams, 1972, 1986) or methods based on quadruples of species (Estabrook, McMorris, and Meacham, 1985).
Input is a tree file (called intree) which contains a series of trees in the Newick standard form -- the form used when many of the programs in this package write out tree files. Each tree starts on a new line. Each tree can have a weight, which is a real number and is located in comment brackets "[" and "]" just before the final ";" which ends the description of the tree. When the input trees have weights (like [0.01000]) then the total number of trees will be the total of those weights, which is often a number like 1.00. When a tree doesn't have a weight it will be assigned a weight of 1. This means that when we have tied trees (as from a parsimony program) three alternative tied trees will be counted as if each was 1/3 of a tree.
Note that this program can correctly read trees whether or not they are bifurcating: in fact they can be multifurcating at any level in the tree.
The options are selected from a menu, which looks like this:
Consensus tree program, version 3.69

Settings for this run:
  C   Consensus type (MRe, strict, MR, Ml):  Majority rule (extended)
  O                          Outgroup root:  No, use as outgroup species 1
  R          Trees to be treated as Rooted:  No
  T     Terminal type (IBM PC, ANSI, none):  ANSI
  1          Print out the sets of species:  Yes
  2   Print indications of progress of run:  Yes
  3                         Print out tree:  Yes
  4         Write out trees onto tree file:  Yes

Are these settings correct? (type Y or the letter for one to change)
Option C (Consensus method) selects which of four methods the program uses. The program defaults to using the extended Majority Rule method. Each time the C option is chosen the program moves on to another method, the others being in order Strict, Majority Rule, and Ml. Here are descriptions of the methods. In each case the fraction of times a set appears among the input trees is counted by weighting by the weights of the trees (the numbers like [0.6000] that appear at the ends of trees in some cases).
Option R (Rooted) toggles between the default assumption that the input trees are unrooted and the alternative, which specifies that each tree is to be treated as a rooted tree and not re-rooted. If the trees are not treated as rooted, each will be treated as outgroup-rooted and will be re-rooted automatically at the first species encountered on the first tree (or at a species designated by the Outgroup option).
Option O is the usual Outgroup rooting option. It is in effect only if the Rooted option selection is not in effect. The trees will be re-rooted with a species of your choosing. You will be asked for the number of the species that is to be the outgroup. If we want to outgroup-root the tree on the line leading to a species which appears as the third species (counting left-to-right) in the first computer-readable tree in the input file, we would select menu option O and specify species 3.
Output is a list of the species (in the order in which they appear in the first tree, which is the numerical order used in the program), a list of the subsets that appear in the consensus tree, a list of those that appeared in one or another of the individual trees but did not occur frequently enough to get into the consensus tree, followed by a diagram showing the consensus tree. The lists of subsets consist of a row of symbols, each either "." or "*". The species that are in the set are marked by "*". After every ten species there is a blank, to help you keep track of the alignment of columns. The order of symbols corresponds to the order of species in the species list. Thus a set that consisted of the second, seventh, and eighth out of 13 species would be represented by:
.*....**.. ...
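Generating such a symbol row is mechanical; here is a small sketch (a hypothetical helper, not part of Consense) that reproduces the example above:

```python
# Sketch: render a species set as consense-style "."/"*" symbols, with a
# blank after every ten species to keep the columns readable.
def set_symbols(members, n_species):
    syms = ["*" if i in members else "." for i in range(1, n_species + 1)]
    return " ".join("".join(syms[k:k + 10]) for k in range(0, len(syms), 10))

print(set_symbols({2, 7, 8}, 13))  # .*....**.. ...
```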
Note that if the trees are unrooted the final tree will have one group, consisting of every species except the Outgroup (which by default is the first species encountered on the first tree), which always appears. It will not be listed in either of the lists of sets, but it will be shown in the final tree as occurring all of the time. This is hardly surprising: in telling the program that this species is the outgroup we have specified that the set consisting of all of the others is always a monophyletic set. So this is not to be taken as interesting information, despite its dramatic appearance.
Option 1 in the menu gives you the option of turning off the writing of these sets into the output file. This may be useful if you are primarily interested in getting the tree file.
Option 4 is the usual tree file option. If this is on (it is by default) then the final tree will be written onto an output tree file (whose default name is "outtree").
Note that the lengths on the tree on the output tree file are not branch lengths but the number of times that each group appeared in the input trees. This number is the sum of the weights of the trees in which it appeared, so that if there are 11 trees, ten of them having weight 0.1 and one weight 1.0, a group that appeared in the last tree and in 6 others would be shown as appearing 1.6 times and its branch length will be 1.6. This means that if you take the consensus tree from the output tree file and try to draw it, the branch lengths will be strange. I am often asked how to put the correct branch lengths on these (this is one of our Frequently Asked Questions).
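The weighted count described above is just a sum of tree weights; a sketch (hypothetical helper names, not Consense's code) of the 1.6 example:

```python
# Sketch: the "branch length" Consense writes for a group is the sum of the
# weights of the input trees that contain that group.
def group_support(tree_weights, contains_group):
    return sum(w for w, present in zip(tree_weights, contains_group) if present)

weights = [0.1] * 10 + [1.0]                 # ten trees of weight 0.1, one of weight 1.0
present = [True] * 6 + [False] * 4 + [True]  # group in 6 small trees and the last tree
print(round(group_support(weights, present), 2))  # 1.6
```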
There is no simple answer to this. It depends on what "correct" means. For example, if you have a group of species that shows up in 80% of the trees, and the branch leading to that group has average length 0.1 among that 80%, is the "correct" length 0.1? Or is it (0.80 x 0.1)? There is no simple answer.
However, if you want to take the consensus tree as an estimate of the true tree (rather than as an indicator of the conflicts among trees) you may be able to use the User Tree (option U) mode of the phylogeny program that you used, and use it to put branch lengths on that tree. Thus, if you used Dnaml, you can take the consensus tree, make sure it is an unrooted tree, and feed that to Dnaml using the original data set (before bootstrapping) and Dnaml's option U. As Dnaml wants an unrooted tree, you may have to use Retree to make the tree unrooted (using the W option of Retree and choosing the unrooted option within it). Of course you will also want to change the tree file name from "outtree" to "intree".
If you used a phylogeny program that does not infer branch lengths, you might want to use a different one (such as Fitch or Dnaml) to infer the branch lengths, again making sure the tree is unrooted, if the program needs that.
The program uses the consensus tree algorithm originally designed for the bootstrap programs. It is quite fast, and execution time is unlikely to be limiting for you (assembling the input file will be much more of a limiting step). In the future, if possible, more consensus tree methods will be incorporated (although the current methods are the ones needed for the component analysis of bootstrap estimates of phylogenies, and in other respects I also think that the present ones are among the best).
TEST SET OF INPUT TREES
(A,(B,(H,(D,(J,(((G,E),(F,I)),C))))));
(A,(B,(D,((J,H),(((G,E),(F,I)),C)))));
(A,(B,(D,(H,(J,(((G,E),(F,I)),C))))));
(A,(B,(E,(G,((F,I),((J,(H,D)),C))))));
(A,(B,(E,(G,((F,I),(((J,H),D),C))))));
(A,(B,(E,((F,I),(G,((J,(H,D)),C))))));
(A,(B,(E,((F,I),(G,(((J,H),D),C))))));
(A,(B,(E,((G,(F,I)),((J,(H,D)),C)))));
(A,(B,(E,((G,(F,I)),(((J,H),D),C)))));
Consensus tree program, version 3.69

Species in order:

   1. A
   2. B
   3. H
   4. D
   5. J
   6. G
   7. E
   8. F
   9. I
  10. C

Sets included in the consensus tree

Set (species in order)      How many times out of 9.00

.......**.                       9.00
..********                       9.00
..****.***                       6.00
..***.....                       6.00
..***....*                       6.00
..*.*.....                       4.00
..***..***                       2.00

Sets NOT included in consensus tree:

Set (species in order)      How many times out of 9.00

.....**...                       3.00
.....*****                       3.00
..**......                       3.00
.....****.                       3.00
..****...*                       2.00
.....*.**.                       2.00
..*.******                       2.00
....******                       2.00
...*******                       1.00

Extended majority rule consensus tree

CONSENSUS TREE: the numbers on the branches indicate the number
of times the partition of the species into the two sets
which are separated by that branch occurred among the trees,
out of 9.00 trees

                                +-----------------------C
                                |
                        +--6.00-|       +-------H
                |       |       +--4.00-|
                |       +--6.00-|       +-------J
                +--2.00-|       |
                |       |       +---------------D
                |       |
        +--6.00-|       |       +-------F
        |       |       +------------------9.00-|
        |       |                               +-------I
+--9.00-|       |
|       |
|       +---------------------------------------G
+-------|       |
|       |
|       +-----------------------------------------------E
|       |
|       +-------------------------------------------------------B
|
+---------------------------------------------------------------A

remember: this is an unrooted tree!
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
The programs in this group use gene frequencies and quantitative character values. One (Contml) constructs maximum likelihood estimates of the phylogeny, another (Gendist) computes genetic distances for use in the distance matrix programs, and the third (Contrast) examines correlation of traits as they evolve along a given phylogeny.
When the gene frequencies data are used in Contml or Gendist, this involves the following assumptions:
How these assumptions affect the methods will be seen in my papers on inference of phylogenies from gene frequency and continuous character data (Felsenstein, 1973b, 1981c, 1985c).
The input formats are fairly similar to the discrete-character programs, but with one difference. When Contml is used in the gene-frequency mode (its usual, default mode), or when Gendist is used, the first line contains the number of species (or populations) and the number of loci and the options information. There then follows a line which gives the numbers of alleles at each locus, in order. This must be the full number of alleles, not the number of alleles which will be input: i. e. for a two-allele locus the number should be 2, not 1. There then follow the species (population) data, each species beginning on a new line. The first 10 characters are taken as the name, and thereafter the values of the individual characters are read free-format, preceded and separated by blanks. They can go to a new line if desired, though of course not in the middle of a number. Missing data is not allowed - an important limitation. In the default configuration, for each locus, the numbers should be the frequencies of all but one allele. The menu option A (All) signals that the frequencies of all alleles are provided in the input data -- the program will then automatically ignore the last of them. So without the A option, for a three-allele locus there should be two numbers, the frequencies of two of the alleles (and of course it must always be the same two!). Here is a typical data set without the A option:
    5    3
 2 3 2
Alpha      0.90 0.80 0.10 0.56
Beta       0.72 0.54 0.30 0.20
Gamma      0.38 0.10 0.05 0.98
Delta      0.42 0.40 0.43 0.97
Epsilon    0.10 0.30 0.70 0.62
whereas here is what it would have to look like if the A option were invoked:
    5    3
 2 3 2
Alpha      0.90 0.10 0.80 0.10 0.10 0.56 0.44
Beta       0.72 0.28 0.54 0.30 0.16 0.20 0.80
Gamma      0.38 0.62 0.10 0.05 0.85 0.98 0.02
Delta      0.42 0.58 0.40 0.43 0.17 0.97 0.03
Epsilon    0.10 0.90 0.30 0.70 0.00 0.62 0.38
The first line has the number of species (or populations) and the number of loci. The second line has the number of alleles for each of the 3 loci. The species lines have names (filled out to 10 characters with blanks) followed by the gene frequencies of the 2 alleles for the first locus, the 3 alleles for the second locus, and the 2 alleles for the third locus. You can start a new line after any of these allele frequencies, and continue to give the frequencies on that line (without repeating the species name).
If all alleles of a locus are given, it is important to have them add up to 1. Roundoff of the frequencies may cause the program to conclude that the numbers do not sum to 1, and stop with an error message.
While many compilers may be more tolerant, it is probably wise to make sure that each number, including the first, is preceded by a blank, and that there are digits both preceding and following any decimal points.
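A quick way to avoid the sum-to-1 error message mentioned above is to check your data before running the program. This sketch is a hypothetical checker (not part of PHYLIP; the tolerance value 0.02 is my assumption, standing in for the role of the program's "epsilon2" constant):

```python
# Sketch: verify that a species line of all-alleles gene frequencies
# (the A-option format) sums to 1 at each locus, within a tolerance.
def check_all_alleles(freqs, alleles_per_locus, tol=0.02):
    pos, ok = 0, True
    for nall in alleles_per_locus:
        locus = freqs[pos:pos + nall]
        if abs(sum(locus) - 1.0) > tol:
            ok = False
        pos += nall
    return ok

# Alpha's line from the A-option example: loci with 2, 3, and 2 alleles.
alpha = [0.90, 0.10, 0.80, 0.10, 0.10, 0.56, 0.44]
print(check_all_alleles(alpha, [2, 3, 2]))  # True
```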
Contml and Contrast also treat quantitative characters (the continuous-characters mode in Contml, which is option C). It is assumed that each character is evolving according to a Brownian motion model, at the same rate, and independently. In reality it is almost always impossible to guarantee this. The issue is discussed at length in my review article in Annual Review of Ecology and Systematics (Felsenstein, 1988a), where I point out the difficulty of transforming the characters so that they are not only genetically independent but have independent selection acting on them. If you are going to use Contml to model evolution of continuous characters, then you should at least make some attempt to remove genetic correlations between the characters (usually all one can do is remove phenotypic correlations by transforming the characters so that there is no within-population covariance and so that the within-population variances of the characters are equal -- this is equivalent to using Canonical Variates). However, this will only guarantee that one has removed phenotypic covariances between characters. Genetic covariances could only be removed by knowing the coheritabilities of the characters, which would require genetic experiments, and selective covariances (covariances due to covariation of selection pressures) would require knowledge of the sources and extent of selection pressure in all variables.
Contrast is a program designed to infer, for a given phylogeny that is provided to the program, the covariation between characters in a data set. Thus we have a program in this set that allows us to take information about the covariation and rates of evolution of characters and make an estimate of the phylogeny (Contml), and a program that takes an estimate of the phylogeny and infers the variances and covariances of the character changes. But we have no program that infers both the phylogenies and the character covariation from the same data set.
In the quantitative characters mode, a typical small data set would be:
    5    6
Alpha      0.345 0.467 1.213 2.2 -1.2 1.0
Beta       0.457 0.444 1.1 1.987 -0.2 2.678
Gamma      0.6 0.12 0.97 2.3 -0.11 1.54
Delta      0.68 0.203 0.888 2.0 1.67
Epsilon    0.297 0.22 0.90 1.9 1.74
Note that in the quantitative characters case there is no line giving the numbers of alleles at each locus. In this case no square-root transformation of the coordinates is done: each is assumed to give directly the position on the Brownian motion scale.
For further discussion of options and modifiable constants in Contml, Gendist, and Contrast see the documentation files for those programs.

phylip-3.697/doc/contml.html
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program estimates phylogenies by the restricted maximum likelihood method based on the Brownian motion model. It is based on the model of Edwards and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967). Gomberg (1966), Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work leading to efficient algorithms. Contml uses restricted maximum likelihood estimation (REML), which is the criterion used by Felsenstein (1973b). The actual algorithm is an iterative EM Algorithm (Dempster, Laird, and Rubin, 1977) which is guaranteed to always give increasing likelihoods. The algorithm is described in detail in a paper of mine (Felsenstein, 1981c), which you should definitely consult if you are going to use this program. Some simulation tests of it are given by Rohlf and Wooten (1988) and Kim and Burgman (1988).
The default (gene frequency) mode treats the input as gene frequencies at a series of loci, and square-root-transforms the allele frequencies (constructing the frequency of the missing allele at each locus first). This enables us to use the Brownian motion model on the resulting coordinates, in an approximation equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure of genetic distance and taking that to give distance between particles undergoing pure Brownian motion. It assumes that each locus evolves independently by pure genetic drift.
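The transformation just described can be sketched as follows (illustrative code, not Contml's source):

```python
# Sketch of the gene-frequency mode preprocessing: for each locus,
# reconstruct the frequency of the omitted allele, then square-root
# transform all of them so that drift is approximately Brownian motion.
from math import sqrt

def sqrt_transform(freqs_by_locus):
    out = []
    for locus in freqs_by_locus:
        full = list(locus) + [1.0 - sum(locus)]   # add the missing allele
        out.extend(sqrt(f) for f in full)
    return out

# Alpha's first locus from the example data: one allele given, one implied.
print([round(x, 4) for x in sqrt_transform([[0.90]])])  # [0.9487, 0.3162]
```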
The alternative continuous characters mode (menu option C) treats the input as a series of coordinates of each species in N dimensions. It assumes that we have transformed the characters to remove correlations and to standardize their variances.
Many current users of Contml use it to analyze microsatellite data. There are three ways to do this:
http://evolution.gs.washington.edu/phylip/software.html
Those distance measures allow for mutation during the divergence of the populations. But even they are not perfect -- they do not allow us to use all the information contained in the gene frequency differences within a copy number allele. There is a need for a more complete statistical treatment of inference of phylogenies from microsatellite models, ones that take both mutation and genetic drift fully into account.
0.10 X 18 + 0.24 X 19 + 0.60 X 20 + 0.06 X 21 = 19.62
copies. These values can, I believe, be calculated by a spreadsheet program. Each microsatellite is represented by one character, and the continuous character mode of Contml is used (not the gene frequencies mode). This coding allows for mutation that changes copy number. It does not make complete use of all data, but neither does the treatment of microsatellite gene frequencies as changing only by genetic drift.
The input file is as described in the continuous characters documentation file above. Options are selected using a menu:
Continuous character Maximum Likelihood method version 3.69

Settings for this run:
  U                        Search for best tree?  Yes
  C  Gene frequencies or continuous characters?  Gene frequencies
  A   Input file has all alleles at each locus?  No, one allele missing at each
  O                              Outgroup root?  No, use as outgroup species 1
  G                      Global rearrangements?  No
  J          Randomize input order of species?  No. Use input order
  M                 Analyze multiple data sets?  No
  0         Terminal type (IBM PC, ANSI, none)?  ANSI
  1          Print out the data at start of run  No
  2        Print indications of progress of run  Yes
  3                              Print out tree  Yes
  4             Write out trees onto tree file?  Yes

Y to accept these or type the letter for one to change
Option U is the usual User Tree option. Options C (Continuous Characters) and A (All alleles present) have been described in the Gene Frequencies and Continuous Characters Programs documentation file. The options G, J, O and M are the usual Global Rearrangements, Jumble order of species, Outgroup root, and Multiple Data Sets options.
The M (Multiple data sets) option does not allow multiple sets of weights instead of multiple data sets, as there are no weights in this program.
The G and J options have no effect if the User Tree option is selected. User trees are given with a trifurcation (three-way split) at the base. They can start from any interior node. Thus the tree:
A
!
*--B
!
*-----C
!
*--D
!
E
can be represented by any of the following:
(A,B,(C,(D,E)));
((A,B),C,(D,E));
(((A,B),C),D,E);
(there are of course 69 other representations as well obtained from these by swapping the order of branches at an interior node).
The output has a standard appearance. The topology of the tree is given by an unrooted tree diagram. The lengths (in time or in expected amounts of variance) are given in a table below the topology, and a rough confidence interval given for each length. Negative lower bounds on length indicate that rearrangements may be acceptable.
The units of length are amounts of expected accumulated variance (not time). The log likelihood (natural log) of each tree is also given, and it is indicated how many topologies have been tried. The tree does not necessarily have all tips contemporary, and the log likelihood may be either positive or negative (this simply corresponds to whether the density function does or does not exceed 1) and a negative log likelihood does not indicate any error. The log likelihood allows various formal likelihood ratio hypothesis tests. The description of the tree includes approximate standard errors on the lengths of segments of the tree. These are calculated by considering only the curvature of the likelihood surface as the length of the segment is varied, holding all other lengths constant. As such they are most probably underestimates of the variance, and hence may give too much confidence in the given tree.
One should use caution in interpreting the likelihoods that are printed out. If the model is wrong, it will not be possible to use the likelihoods to make formal statistical statements. Thus, if gene frequencies are being analyzed, but the gene frequencies change not only by genetic drift, but also by mutation, the model is not correct. It would be as well-justified in this case to use Gendist to compute the Nei (1972) genetic distance and then use Fitch, Kitsch or Neighbor to make a tree. If continuous characters are being analyzed, but if the characters have not been transformed to new coordinates that evolve independently and at equal rates, then the model is also violated and no statistical analysis is possible. Doing such a transformation is not easy, and usually not even possible.
If the U (User Tree) option is used and more than one tree is supplied, the program also performs a statistical test of each of these trees against the one with highest likelihood. If there are two user trees, the test done is one which is due to Kishino and Hasegawa (1989), a version of a test originally introduced by Templeton (1983). In this implementation it uses the mean and variance of log-likelihood differences between trees, taken across loci. If the two trees' means are more than 1.96 standard deviations different then the trees are declared significantly different. This use of the empirical variance of log-likelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for any lack of realism in the model underlying this program.
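The two-tree test logic can be sketched as follows (an illustrative reconstruction of the idea, not Contml's implementation; `kht_significant` is a hypothetical name):

```python
# Sketch of the Kishino-Hasegawa-Templeton idea: compare two trees using the
# mean and variance, across loci, of their per-locus log-likelihood
# differences; declare a difference significant beyond 1.96 standard
# deviations of the summed difference.
from math import sqrt

def kht_significant(lnl_tree1, lnl_tree2):
    d = [a - b for a, b in zip(lnl_tree1, lnl_tree2)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance per locus
    se = sqrt(n * var)   # standard deviation of the sum of n differences
    return abs(sum(d)) > 1.96 * se

# One tree consistently better at every locus: significant.
print(kht_significant([-10.0] * 8, [-15.0] * 8))  # True
```

Because the variance is taken from the empirical spread across loci, a tree that is better only at a few loci (with the advantage cancelling elsewhere) will not be declared significantly better.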
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. The version used here is a multivariate normal approximation to their test; it is due to Shimodaira (1998). The variances and covariances of the sum of log likelihoods across loci are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected log-likelihood, log-likelihoods for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the highest log-likelihood exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.
In either the KHT or the SH test the program prints out a table of the log-likelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the log-likelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one.
One problem which sometimes arises is that the program is fed two species (or populations) with identical transformed gene frequencies: this can happen if sample sizes are small and/or many loci are monomorphic. In this case the program "gets its knickers in a twist" and can divide by zero, usually causing a crash. If you suspect that this has happened, check for two species with identical coordinates. If you find them, eliminate one from the problem: the two must always show up as being at the same point on the tree anyway.
The constants available for modification at the beginning of the program include "epsilon1", a small quantity used in the iterations of branch lengths, "epsilon2", another not quite so small quantity used to check whether gene frequencies that were fed in for all alleles do not add up to 1, "smoothings", the number of passes through a given tree in the iterative likelihood maximization for a given topology, "maxtrees", the maximum number of user trees that will be used for the Kishino-Hasegawa-Templeton test, and "namelength", the length of species names. There is no provision in this program for saving multiple trees that are tied for having the highest likelihood, mostly because an exact tie is unlikely anyway.
The algorithm does not run as quickly as the discrete character methods but is not enormously slower. Like them, its execution time should rise as the cube of the number of species.
This data set was compiled by me from the compilation of human gene frequencies by Mourant (1976). It appeared in a paper of mine (Felsenstein, 1981c) on maximum likelihood phylogenies from gene frequencies. The names of the loci and alleles are given in that paper.
    5   10
2 2 2 2 2 2 2 2 2 2
European   0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205 0.8055 0.5043
African    0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600 0.7582 0.6207
Chinese    0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726 0.7482 0.7334
American   0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000 0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396 0.9097 0.2976
Continuous character Maximum Likelihood method version 3.69

   5 Populations,   10 Loci

Numbers of alleles at the loci:
------- -- ------- -- --- -----

   2   2   2   2   2   2   2   2   2   2

Name                 Gene Frequencies
----                 ---- -----------

  locus:        1        2        3        4        5        6        7        8        9       10

European   0.28680  0.56840  0.44220  0.42860  0.38280  0.72850  0.63860  0.02050  0.80550  0.50430
African    0.13560  0.48400  0.06020  0.03970  0.59770  0.96750  0.95110  0.06000  0.75820  0.62070
Chinese    0.16280  0.59580  0.72980  1.00000  0.38110  0.79860  0.77820  0.07260  0.74820  0.73340
American   0.01440  0.69900  0.32800  0.74210  0.66060  0.86030  0.79240  0.00000  0.80860  0.86360
Australian 0.12110  0.22740  0.58210  1.00000  0.20180  0.90000  0.98370  0.03960  0.90970  0.29760


  +-----------------------------------------------------------African
  !
  !              +-------------------------------Australian
  1-------------3
  !              !      +-----------------------American
  !              +-----2
  !                     +Chinese
  !
  +European

remember: this is an unrooted tree!

Ln Likelihood =    38.71914

Between     And             Length      Approx. Confidence Limits
-------     ---             ------      ------- ---------- ------

   1      African         0.09693444   (  0.03123910,  0.19853605)
   1         3            0.02252816   (  0.00089799,  0.05598045)
   3      Australian      0.05247406   (  0.01177094,  0.11542376)
   3         2            0.00945315   ( -0.00897717,  0.03795670)
   2      American        0.03806240   (  0.01095938,  0.07997877)
   2      Chinese         0.00208822   ( -0.00960622,  0.02017434)
   1      European        0.00000000   ( -0.01627246,  0.02516630)
This program implements the contrasts calculation described in my 1985 paper on the comparative method (Felsenstein, 1985d). It reads in a data set of the standard quantitative characters sort, and also a tree from the treefile. It then forms the contrasts between species that, according to that tree, are statistically independent. This is done for each character. The contrasts are all standardized by branch lengths (actually, square roots of branch lengths).
The method is explained in the 1985 paper. It assumes a Brownian motion model. This model was introduced by Edwards and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967) as an approximation to the evolution of gene frequencies. I have discussed (Felsenstein, 1973b, 1981c, 1985d, 1988b) the difficulties inherent in using it as a model for the evolution of quantitative characters. Chief among these is that the characters do not necessarily evolve independently or at equal rates. This program allows one to evaluate this, if there is independent information on the phylogeny. You can compute the variance of the contrasts for each character, as a measure of the variance accumulating per unit branch length. You can also test covariances of characters.
The input file is as described in the continuous characters documentation file above, for the case of continuous quantitative characters (not gene frequencies). Options are selected using a menu:
Continuous character comparative analysis, version 3.69

Settings for this run:
  W        Within-population variation in data?  No, species values are means
  R     Print out correlations and regressions?  Yes
  C                        Print out contrasts?  No
  M                     Analyze multiple trees?  No
  0         Terminal type (IBM PC, ANSI, none)?  ANSI
  1          Print out the data at start of run  No
  2        Print indications of progress of run  Yes

Y to accept these or type the letter for one to change
Option W makes the program expect not means of the phenotypes in each species, but phenotypes of individual specimens. The details of the input file format in that case are given below. In that case the program estimates the covariances of the phenotypic change, as well as covariances of within-species phenotypic variation. The model used is similar to (but not identical to) that of Lynch (1990). The algorithms used differ from the ones he gives in that paper. They are described in a recent paper (Felsenstein, 2008). When there are within-species samples, contrasts are used internally by the program, but it does not make sense to write them out to an output file for direct analysis. They are of two kinds, contrasts within species and contrasts between species. The former are affected only by the within-species phenotypic covariation, but the latter are affected by both within- and between-species covariation. Contrast infers these two kinds of covariances and writes the estimates out.
M is similar to the usual multiple data sets input option, but is used here to allow multiple trees to be read from the treefile, not multiple data sets to be read from the input file. In this way you can use bootstrapping on the data that estimated these trees, get multiple bootstrap estimates of the tree, and then use the M option to make multiple analyses of the contrasts and the covariances, correlations, and regressions. In this way (Felsenstein, 1988b) you can assess the effect of the inaccuracy of the trees on your estimates of these statistics.
R allows you to turn off or on the printing out of the statistics. If it is off only the contrasts will be printed out (unless option 1 is selected). With only the contrasts printed out, they are in a simple array that is in a form that many statistics packages should be able to read. The contrasts are rows, and each row has one contrast for each character. Any multivariate statistics package should be able to analyze these (but keep in mind that the contrasts have, by virtue of the way they are generated, expectation zero, so all regressions must pass through the origin). If the W option has been set to analyze within-species as well as between-species variation, the R option does not appear in the menu as the regression and correlation statistics should always be computed in that case.
As usual, the tree file has the default name intree. It should contain the desired tree or trees. These can be either in bifurcating form, or may have the bottommost fork be a trifurcation (it should not matter which of these ways you present the tree). Note that the tree may not contain any multifurcations aside from a trifurcation at the root! If there are any, the program may not work, or may give misleading results.
The tree must, of course, have branch lengths. These cannot be negative. Trees from some distance methods, particularly Neighbor-Joining, are sometimes inferred to have negative branch lengths, so be sure to choose options in those programs that prevent negative branch lengths.
If you have a molecular data set (for example) and also, on the same species, quantitative measurements, here is how you can allow for the uncertainty of your estimate of the tree. Use Seqboot to generate multiple data sets from your molecular data. Then, whichever method you use to analyze it (the relevant ones are those that produce estimates of the branch lengths: Dnaml, Dnamlk, Fitch, Kitsch, and Neighbor -- the latter three require you to use Dnadist to turn the bootstrap data sets into multiple distance matrices), you should use the Multiple Data Sets option of that program. This will result in a tree file with many trees on it. Then use this tree file with the input file containing your continuous quantitative characters, choosing the Multiple Trees (M) option. You will get one set of contrasts and statistics for each tree in the tree file. At the moment there is no overall summary: you will have to tabulate these by hand. A similar process can be followed if you have restriction sites data (using Restml) or gene frequencies data.
The statistics that are printed out include the covariances between all pairs of characters, the regressions of each character on each other (column j is regressed on row i), and the correlations between all pairs of characters. In assessing degrees of freedom it is important to realize that each contrast was taken to have expectation zero, which is known because each contrast could as easily have been computed xi-xj instead of xj-xi. Thus there is no loss of a degree of freedom for estimation of a mean. The degrees of freedom are thus the same as the number of contrasts, namely one less than the number of species (tips). If you feed these contrasts into a multivariate statistics program make sure that it knows that each variable has expectation exactly zero.
10   5
Alpha      2
2.01 5.3 1.5 -3.41 0.3
1.98 4.3 2.1 -2.98 0.45
Gammarus   3
6.57 3.1 2.0 -1.89 0.6
7.62 3.4 1.9 -2.01 0.7
6.02 3.0 1.9 -2.03 0.6
...
number of species, number of characters
name of 1st species, # of individuals
data for individual #1
data for individual #2
name of 2nd species, # of individuals
data for individual #1
data for individual #2
data for individual #3
(and so on)
The covariances, correlations, and regressions for the "additive" (between-species evolutionary) variation and the "environmental" (within-species phenotypic) variation are printed out (the maximum likelihood estimates of each). The program also estimates the within-species phenotypic variation in the case where the between-species evolutionary covariances are forced to be zero. The log-likelihoods of these two cases are compared and a likelihood ratio test (LRT) is carried out. The program prints the result of this test as a chi-square variate, and gives the number of degrees of freedom of the LRT. You have to look up the chi-square variable on a table of the chi-square distribution. The A option is available (if the W option is invoked) to let you turn off this test if you want to.
The program prints out the log-likelihood of the data under the models with and without between-species variation. It shows the degrees of freedom and chi-square value for a likelihood ratio test of the absence of between-species variation. For the moment the program cannot handle the case where within-species variation is to be taken into account but where only species means are available. (It can handle cases where some species have only one member in their sample).
We hope to fix this soon. We are also on our way to incorporating full-sib, half-sib, or clonal groups within species, so as to do one analysis for within-species genetic and between-species phylogenetic variation.
The data set used as an example below is the example from a paper by Michael Lynch (1990), his characters having been log-transformed. In the case where there is only one specimen per species, Lynch's model is identical to our model of within-species variation (for multiple individuals per species it is not a subcase of his model).
5   2
Homo       4.09434  4.74493
Pongo      3.61092  3.33220
Macaca     2.37024  3.36730
Ateles     2.02815  2.89037
Galago    -1.46968  2.30259
((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);
Continuous character contrasts analysis, version 3.69

   5 Populations,    2 Characters

Name                 Phenotypes
----                 ----------

Homo         4.09434   4.74493
Pongo        3.61092   3.33220
Macaca       2.37024   3.36730
Ateles       2.02815   2.89037
Galago      -1.46968   2.30259

Contrasts (columns are different characters)
--------- -------- --- --------- -----------

   0.74593   2.17989
   1.58474   0.71761
   1.19293   0.86790
   3.35832   0.89706

Covariance matrix
---------- ------

    3.9423    1.7028
    1.7028    1.7062

Regressions (columns on rows)
----------- -------- -- -----

    1.0000    0.4319
    0.9980    1.0000

Correlations
------------

    1.0000    0.6566
    0.6566    1.0000
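A minimal sketch of the pruning pass that produces such contrasts (standalone Python for illustration, not the program's C code; only character 1 of the example data set is used, and the tree is the one from the example tree file):

```python
import math

# phenotypes for character 1 of the example data set
trait = {"Homo": 4.09434, "Pongo": 3.61092, "Macaca": 2.37024,
         "Ateles": 2.02815, "Galago": -1.46968}

def prune(node, out):
    """Post-order pruning for independent contrasts (Felsenstein, 1985).
    A node is a species name or a pair ((child1, v1), (child2, v2)),
    v being the branch length below each child.  Returns the inferred
    trait value at the node and the extra branch length to add above
    it; standardized contrasts accumulate in `out`."""
    if isinstance(node, str):
        return trait[node], 0.0
    (c1, v1), (c2, v2) = node
    x1, e1 = prune(c1, out)
    x2, e2 = prune(c2, out)
    v1, v2 = v1 + e1, v2 + e2                    # lengthen pruned branches
    out.append((x1 - x2) / math.sqrt(v1 + v2))   # standardized contrast
    # weighted average passed up, plus the v1*v2/(v1+v2) extra variance
    return (v2 * x1 + v1 * x2) / (v1 + v2), v1 * v2 / (v1 + v2)

# ((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00)
n1 = (("Homo", 0.21), ("Pongo", 0.21))
n2 = ((n1, 0.28), ("Macaca", 0.49))
n3 = ((n2, 0.13), ("Ateles", 0.62))
root = ((n3, 0.38), ("Galago", 1.00))

contrasts = []
prune(root, contrasts)
```

Running this reproduces the first column of the Contrasts table above, which is a good check that the branch-length standardization is doing what the text says.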
These programs are intended for the use of morphological systematists who are dealing with discrete characters, or by molecular evolutionists dealing with presence-absence data on restriction sites. One of the programs (Pars) allows multistate characters, with up to 8 states, plus the unknown state symbol "?". For the others, the characters are assumed to be coded into a series of (0,1) two-state characters. For most of the programs there are two other states possible, "P", which stands for the state of Polymorphism for both states (0 and 1), and "?", which stands for the state of ignorance: it is the state "unknown", or "does not apply". The state "P" can also be denoted by "B", for "both".
There is a method invented by Sokal and Sneath (1963) for linear sequences of character states, and fully developed for branching sequences of character states by Kluge and Farris (1969) for recoding a multistate character into a series of two-state (0,1) characters. Suppose we had a character with four states whose character-state tree had the rooted form:
        1 ---> 0 ---> 2
               |
               |
               V
               3
so that 1 is the ancestral state and 0, 2 and 3 derived states. We can represent this as three two-state characters:
Old State     New States
--- -----     --- ------

    0            001
    1            000
    2            011
    3            101
The three new states correspond to the three arrows in the above character state tree. Possession of one of the new states corresponds to whether or not the old state had that arrow in its ancestry. Thus the first new state corresponds to the bottommost arrow, which only state 3 has in its ancestry, the second state to the rightmost of the top arrows, and the third state to the leftmost top arrow. This coding will guarantee that the number of times that states arise on the tree (in programs Mix, Move, Penny and Boot) or the number of polymorphic states in a tree segment (in the Polymorphism option of Dollop, Dolmove, Dolpenny and Dolboot) will correctly correspond to what would have been the case had our programs been able to take multistate characters into account. Although I have shown the above character state tree as rooted, the recoding method works equally well on unrooted multistate characters as long as the connections between the states are known and contain no loops.
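The recoding can be sketched mechanically as follows (a hypothetical helper written for illustration, not part of PHYLIP; the program Factor, described below, does this on real data). The `parent` mapping and `columns` ordering are taken from the character-state tree above:

```python
def binary_recode(parent, columns):
    """Additive binary coding of a character-state tree (Kluge and
    Farris, 1969).  `parent` maps each derived state to its ancestor;
    `columns` lists, in output order, the state each arrow points into.
    Each new character is 1 exactly when that arrow lies on the path
    from the ancestral state to the old state."""
    def arrows(s):
        seen = set()
        while s in parent:          # walk up to the ancestral state
            seen.add(s)             # the arrow *into* s is on the path
            s = parent[s]
        return seen
    states = set(parent) | set(parent.values())
    return {s: "".join("1" if c in arrows(s) else "0" for c in columns)
            for s in states}

# the character-state tree from the text: 1 ---> 0 ---> 2, with 0 ---> 3;
# column order: bottommost arrow (into 3), rightmost (into 2), leftmost (into 0)
coding = binary_recode({0: 1, 2: 0, 3: 0}, columns=[3, 2, 0])
```

This reproduces the Old State / New States table above.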
However, in the default option of programs Dollop, Dolmove, Dolpenny and Dolboot the multistate recoding does not necessarily work properly, as it may lead the program to reconstruct nonexistent state combinations such as 010. An example of this problem is given in my paper on alternative phylogenetic methods (1979).
If you have multistate character data where the states are connected in a branching "character state tree" you may want to do the binary recoding yourself. Thanks to Christopher Meacham, the package contains a program, Factor, which will do the recoding itself. For details see the documentation file for Factor.
We now also have the program Pars, which can do parsimony for unordered character states.
The methods used in these programs make different assumptions about evolutionary rates, probabilities of different kinds of events, and our knowledge about the characters or about the character state trees. Basic references on these assumptions are my 1979, 1981b and 1983b papers, particularly the latter. The assumptions of each method are briefly described in the documentation file for the corresponding program. In most cases my assertions about what are the assumptions of these methods are challenged by others, whose papers I also cite at that point. Personally, I believe that they are wrong and I am right. I must emphasize the importance of understanding the assumptions underlying the methods you are using. No matter how fancy the algorithms, how maximum the likelihood or how minimum the number of steps, your results can only be as good as the correspondence between biological reality and your assumptions!
The input format is as described in the general documentation file. The input starts with a line containing the number of species and the number of characters.
In Pars, each character can have up to 8 states plus a "?" state. In any character, the first 8 symbols encountered will be taken to represent these states. Any of the digits 0-9, letters A-Z and a-z, and even symbols such as + and -, can be used (and in fact which 8 symbols are used can be different in different characters).
In the other discrete characters programs the allowable states are 0, 1, P, B, and ?. Blanks may be included between the states (i.e., you can have a species whose data is DISCOGLOSS0 1 1 0 1 1 1). It is possible for extraneous information to follow the end of the character state data on the same line. For example, if there were 7 characters in the data set, a line of species data could read "DISCOGLOSS0110111 Hello there".
The discrete character data can continue to a new line whenever needed. The characters are not in the "aligned" or "interleaved" format used by the molecular sequence programs: they have the name and entire set of characters for one species, then the name and entire set of characters for the next one, and so on. This is known as the sequential format. Be particularly careful when you use restriction sites data, which can be in either the aligned or the sequential format for use in Restml but must be in the sequential format for these discrete character programs.
For Pars the discrete character data can be in either Sequential or Interleaved format; the latter is the default.
Errors in the input data will often be detected by the programs, and this will cause them to issue an error message such as 'BAD OUTGROUP NUMBER: ' together with information as to which species, character, or in this case outgroup number is the incorrect one. The program will then terminate; you will have to look at the data and figure out what went wrong and fix it. Often an error in the data causes a lack of synchronization between what is in the data file and what the program thinks is to be there. Thus a missing character may cause the program to read part of the next species name as a character and complain about its value. In this type of case you should look for the error earlier in the data file than the point about which the program is complaining.
Specific information on options will be given in the documentation file associated with each program. However, some options occur in many programs. Options are selected from the menu in each program.
An example is:
001??11
The ancestor information can be continued to a new line and can have blanks between any of the characters in the same way that species character data can. In the program Clique the ancestor is instead to be included as a regular species and no A option is available.
For example, if there were 20 binary characters that had been generated by nine multistate characters having respectively 4, 3, 3, 2, 1, 2, 2, 2, and 1 binary factors you would make the factors file be:
11112223334456677889
although it could equivalently be:
aaaabbbaaabbabbaabba
All that is important is that the symbol for each binary character change only when adjacent binary characters correspond to different multistate characters. The factors file contents can continue to a new line at any time except during the initial characters filling out the length of a species name.
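A sketch of how such a factors string can be interpreted, grouping adjacent binary characters whenever the symbol repeats (a hypothetical helper for illustration, not PHYLIP's own code):

```python
def factor_groups(factors):
    """Count how many adjacent binary characters belong to each
    multistate character: a new group starts whenever the symbol
    differs from the one immediately before it."""
    groups = [1]
    for prev, cur in zip(factors, factors[1:]):
        if cur == prev:
            groups[-1] += 1     # same multistate character continues
        else:
            groups.append(1)    # symbol changed: next multistate character
    return groups

# both spellings from the text describe the same 4,3,3,2,1,2,2,2,1 grouping
groups_digits = factor_groups("11112223334456677889")
groups_letters = factor_groups("aaaabbbaaabbabbaabba")
```

This makes the equivalence of the two example factors strings concrete: only the points of symbol change matter, not the symbols themselves.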
WWWCC WWCWC
Note that blanks in the sequence of characters (after the first ones that are as long as the species names) will be ignored, and the information can go on to a new line at any point. So this could equally well have been specified by
WW WCCWWCWC
If you select the proper menu option, a table of the number of events required in each character can also be printed, to help in reconstructing the placement of changes on the tree.
This table may not be obvious at first. A typical example looks like this:
steps in each character:

         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1

The numbers across the top and down the side indicate which character is being referred to. Thus character 23 is column "3" of row "20" and has 2 steps in this case.
I cannot emphasize too strongly that the fact that the tree diagram which the program prints out contains a particular branch DOES NOT MEAN THAT WE HAVE EVIDENCE THAT THE BRANCH IS OF NONZERO LENGTH. In some of the older programs, the procedure which prints out the tree cannot cope with a trifurcation, nor can the internal data structures used in some of my programs. Therefore, even when we have no resolution and a multifurcation, successive bifurcations may be printed out, although some of the branches shown will in fact actually be of zero length. To find out which, you will have to work out character by character where the placements of the changes on the tree are, under all possible ways that the changes can be placed on that tree.
In Pars, Mix, Penny, Dollop, and Dolpenny the trees will be (if the user selects the option to see them) accompanied by tables showing the reconstructed states of the characters in the hypothetical ancestral nodes in the interior of the tree. This will enable you to reconstruct where the changes were in each of the characters. In some cases the state shown in an interior node will be "?", which means that either 0 or 1 would be possible at that point. In such cases you have to work out the ambiguity by hand. A unique assignment of locations of changes is often not possible in the case of the Wagner parsimony method. There may be multiple ways of assigning changes to segments of the tree with that method. Printing only one would be misleading, as it might imply that certain segments of the tree had no change, when another equally valid assignment would put changes there. It must be emphasized that all these multiple assignments have exactly equal numbers of total changes, so that none is preferred over any other.
I have followed the convention of having a "." printed out in the table of character states of the hypothetical ancestral nodes whenever a state is 0 or 1 and its immediate ancestor is the same. This has the effect of highlighting the places where changes might have occurred and making it easy for the user to reconstruct all the alternative patterns of the characters states in the hypothetical ancestral nodes. In Pars you can, using the menu, turn off this dot-differencing convention and see all states at all hypothetical ancestral nodes of the tree.
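The convention itself can be sketched in a few lines (a hypothetical helper for illustration; the state strings are invented):

```python
def dot_differences(ancestor_states, node_states):
    """Render a hypothetical ancestral node's states with '.' wherever
    the state is 0 or 1 and equals the state of the immediate ancestor,
    so only the possible changes (and any ambiguous '?') stand out."""
    return "".join("." if n == a and n in "01" else n
                   for a, n in zip(ancestor_states, node_states))

# invented example: one change at character 3, one ambiguity at character 5
line = dot_differences("0101?", "0111?")
```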
On the line in that table corresponding to each branch of the tree will also be printed "yes", "no" or "maybe" as an answer to the question of whether this branch is of nonzero length. If there is no evidence that any character has changed in that branch, then "no" will be printed. If there is definite evidence that one has changed, then "yes" will be printed. If the matter is ambiguous, then "maybe" will be printed. You should keep in mind that all of these conclusions assume that we are only interested in the assignment of states that requires the least amount of change. In reality, the confidence limit on tree topology usually includes many different topologies, and presumably also then the confidence limits on amounts of change in branches are also very broad.
In addition to the table showing numbers of events, a table may be printed out showing which ancestral state causes the fewest events for each character. This will not always be done, but only when the tree is rooted and some ancestral states are unknown. This can be used to infer states of ancestors. For example, if you use the O (Outgroup) and A (Ancestral states) options together, with at least some of the ancestral states being given as "?", then inferences will be made for those characters, as the outgroup makes the tree rooted if it was not already.
In programs Mix and Penny, if you are using the Camin-Sokal parsimony option with ancestral state "?" and it turns out that the program cannot decide between ancestral states 0 and 1, it will fail to even attempt reconstruction of states of the hypothetical ancestors, printing them all out as "." for those characters. This is done for internal bookkeeping reasons -- to reconstruct their changes would require a fair amount of additional code and additional data structures. It is not too hard to reconstruct the internal states by hand, trying the two possible ancestral states one after the other. A similar comment applies to the use of ancestral state "?" in the Dollo or Polymorphism parsimony methods (programs Dollop and Dolpenny) which also can result in a similar hesitancy to print the estimate of the states of the hypothetical ancestors. In all of these cases the program will print "?" rather than "no" when it describes whether there are any changes in a branch, since there might or might not be changes in those characters which are not reconstructed.
For further information see the documentation files for the individual programs.
The programs Fitch, Kitsch, and Neighbor are for dealing with data which comes in the form of a matrix of pairwise distances between all pairs of taxa, such as distances based on molecular sequence data, gene frequency genetic distances, amounts of DNA hybridization, or immunological distances. In analyzing these data, distance matrix programs implicitly assume that:

(1) each distance is measured independently of the others: no item of data contributes to more than one distance; and
(2) the expected distance between each pair of taxa is the sum of the branch lengths along the path connecting them on the true tree (additivity).
These assumptions can be traced in the least squares methods of programs Fitch and Kitsch but it is not quite so easy to see them in operation in the Neighbor-Joining method of Neighbor, where the independence assumption is less obvious.
THESE TWO ASSUMPTIONS ARE DUBIOUS IN MOST CASES: independence will not be expected to be true in most kinds of data, such as genetic distances from gene frequency data. For genetic distance data in which pure genetic drift without mutation can be assumed to be the mechanism of change Contml may be more appropriate. However, Fitch, Kitsch, and Neighbor will not give positively misleading results (they will not make a statistically inconsistent estimate) provided that additivity holds, which it will if the distance is computed from the original data by a method which corrects for reversals and parallelisms in evolution. If additivity is not expected to hold, problems are more severe. A short discussion of these matters will be found in a review article of mine (1984a). For detailed, if sometimes irrelevant, controversy see the papers by Farris (1981, 1985, 1986) and myself (1986, 1988b).
For genetic distances from gene frequencies, Fitch, Kitsch, and Neighbor may be appropriate if a neutral mutation model can be assumed and Nei's genetic distance is used, or if pure drift can be assumed and either Cavalli-Sforza's chord measure or Reynolds, Weir, and Cockerham's (1983) genetic distance is used. However, in the latter case (pure drift) Contml should be better.
Restriction site and restriction fragment data can be treated by distance matrix methods if a distance such as that of Nei and Li (1979) is used. Distances of this sort can be computed in PHYLIP by the program Restdist.
For nucleic acid sequences, the distances computed in Dnadist allow correction for multiple hits (in different ways) and should allow one to analyse the data under the presumption of additivity. In all of these cases independence will not be expected to hold. DNA hybridization and immunological distances may be additive and independent if transformed properly and if (and only if) the standards against which each value is measured are independent. (This is rarely exactly true).
Fitch and the Neighbor-Joining option of Neighbor fit a tree which has the branch lengths unconstrained. Kitsch and the UPGMA option of Neighbor, by contrast, assume that an "evolutionary clock" is valid, according to which the true branch lengths from the root of the tree to each tip are the same: the expected amount of evolution in any lineage is proportional to elapsed time.
The input format for distance data is straightforward. The first line of the input file contains the number of species. There follows species data, starting, as with all other programs, with a species name. The species name is ten characters long, and must be padded out with blanks if shorter. For each species there then follows a set of distances to all the other species (options selected in the programs' menus allow the distance matrix to be upper or lower triangular or square). The distances can continue to a new line after any of them. If the matrix is lower-triangular, the diagonal entries (the distances from a species to itself) will not be read by the programs. If they are included anyway, they will be ignored by the programs, except for the case where one of them starts a new line, in which case the program will mistake it for a species name and get very confused.
For example, here is a sample input matrix, with a square matrix:
    5
Alpha      0.000  1.000  2.000  3.000  3.000
Beta       1.000  0.000  2.000  3.000  3.000
Gamma      2.000  2.000  0.000  3.000  3.000
Delta      3.000  3.000  3.000  0.000  1.000
Epsilon    3.000  3.000  3.000  1.000  0.000
and here is a sample lower-triangular input matrix with distances continuing to new lines as needed:
   14
Mouse     
Bovine      1.7043
Lemur       2.0235  1.1901
Tarsier     2.1378  1.3287  1.2905
Squir Monk  1.5232  1.2423  1.3199  1.7878
Jpn Macaq   1.8261  1.2508  1.3887  1.3137  1.0642
Rhesus Mac  1.9182  1.2536  1.4658  1.3788  1.1124  0.1022
Crab-E.Mac  2.0039  1.3066  1.4826  1.3826  0.9832  0.2061  0.2681
BarbMacaq   1.9431  1.2827  1.4502  1.4543  1.0629  0.3895  0.3930  0.3665
Gibbon      1.9663  1.3296  1.8708  1.6683  0.9228  0.8035  0.7109  0.8132
    0.7858
Orang       2.0593  1.2005  1.5356  1.6606  1.0681  0.7239  0.7290  0.7894
    0.7140  0.7095
Gorilla     1.6664  1.3460  1.4577  1.5935  0.9127  0.7278  0.7412  0.8763
    0.7966  0.5959  0.4604
Chimp       1.7320  1.3757  1.7803  1.7119  1.0635  0.7899  0.8742  0.8868
    0.8288  0.6213  0.5065  0.3502
Human       1.7101  1.3956  1.6661  1.7599  1.0557  0.6933  0.7118  0.7589
    0.8542  0.5612  0.4700  0.3097  0.2712
Note that the name "Mouse" in this matrix must be padded out by blanks to the full length of 10 characters.
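The fixed-width name field and continuation-line rules described above can be sketched in a short reader. This is a hypothetical illustration (the function name and spacing are mine, and only the square layout is handled, not the triangular ones):

```python
def read_square_distances(text):
    """Minimal reader for a square PHYLIP-style distance matrix.

    Assumes a 10-character blank-padded name field; distances for a
    species may continue onto following lines.
    """
    lines = iter(text.strip().splitlines())
    n = int(next(lines).split()[0])          # first line: number of species
    names, matrix = [], []
    for _ in range(n):
        line = next(lines)
        names.append(line[:10].rstrip())     # fixed-width name field
        values = [float(v) for v in line[10:].split()]
        while len(values) < n:               # distances continue on new lines
            values.extend(float(v) for v in next(lines).split())
        matrix.append(values)
    return names, matrix
```

Feeding it the square example above would return the five names and a 5x5 list of floats.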
In general the distances are assumed to all be present: at the moment there is only one way we can have missing entries in the distance matrix. If the S option (which allows the user to specify the degree of replication of each distance) is invoked, with some of the entries having degree of replication zero, if the U (User Tree) option is in effect, and if the tree being examined is such that every branch length can be estimated from the data, it will be possible to solve for the branch lengths and sum of squares when there is some missing data. You may not get away with this if the U option is not in effect, as a tree may be tried on which the program will calculate a branch length by dividing zero by zero, and get upset.
The present version of Neighbor does allow the Subreplication option to be used and the number of replicates to be in the input file, but it actually does nothing with this information except read it in. It makes use of the average distances in the cells of the input data matrix. This means that you cannot use the S option to treat zero cells. We hope to modify Neighbor in the future to allow Subreplication. Of course the U (User tree) option is not available in Neighbor in any case.
The present versions of Fitch and Kitsch will do much better on missing values than did previous versions, but you will still have to be careful about them. Nevertheless you might (just) be able to explore relevant alternative tree topologies one at a time using the U option when there is missing data.
Alternatively, if the missing values in one cell always correspond to a cell with non-missing values on the opposite side of the main diagonal (i.e., if D(i,j) missing implies that D(j,i) is not missing), then use of the S option will always be sufficient to cope with missing values. When it is used, the missing distances should be entered as if present (any number can be used) and the degree of replication for them should be given as 0.
Note that the algorithm for searching among topologies in Fitch and Kitsch is the same one used in other programs, so that it is necessary to try different orders of species in the input data. The J (Jumble) menu option may be sufficient for most purposes.
The programs Fitch and Kitsch carry out the method of Fitch and Margoliash (1967) for fitting trees to distance matrices. They also are able to carry out the least squares method of Cavalli-Sforza and Edwards (1967), plus a variety of other methods of the same family (see the discussion of the P option below). They can also be set to use the Minimum Evolution method (Nei and Rzhetsky, 1993; Kidd and Sgaramella-Zonta, 1971).
The objective of these methods is to find that tree which minimizes
                         __  __   nij (Dij - dij)2
     Sum of squares  =   \   \    ----------------
                         /_  /_            p
                          i   j         Dij
(the symbol made up of \, / and _ characters is of course a summation sign) where D is the observed distance between species i and j and d is the expected distance, computed as the sum of the lengths (amounts of evolution) of the segments of the tree from species i to species j. The quantity n is the number of times each distance has been replicated. In simple cases this is taken to be one, but the user can, as an option, specify the degree of replication for each distance. The distance is then assumed to be a mean of those replicates. The power P is what distinguishes the various methods. For the Fitch-Margoliash method, which is the default method with this program, P is 2.0. For the Cavalli-Sforza and Edwards least squares method it should be set to 0 (so that the denominator is always 1). An intermediate method is also available in which P is 1.0, and any other value of P, such as 4.0 or -2.3, can also be used. This generates a whole family of methods.
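In code, the objective reads roughly as follows. This is a sketch only; the function and argument names are mine, not PHYLIP's:

```python
def weighted_sum_of_squares(D, d, n=None, P=2.0):
    """Evaluate the weighted least squares criterion for one tree.

    D: observed distances, d: tree-implied distances,
    n: replication counts (all taken as 1 if None),
    P: the power (2.0 gives Fitch-Margoliash, 0.0 ordinary least squares).
    """
    total = 0.0
    size = len(D)
    for i in range(size):
        for j in range(size):
            if i == j:
                continue                      # skip the diagonal
            nij = 1.0 if n is None else n[i][j]
            total += nij * (D[i][j] - d[i][j]) ** 2 / D[i][j] ** P
    return total
```

With P = 0 every denominator is 1, recovering the Cavalli-Sforza and Edwards criterion.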
The P (Power) option is not available in the Neighbor-Joining program Neighbor. Implicitly, in this program P is 0.0 (though it is hard to prove this). The UPGMA option of Neighbor will assign the same branch lengths to the particular tree topology that it finds as will Kitsch when given the same tree and Power = 0.0.
All these methods make the assumptions of additivity and independent errors. The difference between the methods is how they weight departures of observed from expected. In effect, these methods differ in how they assume that the variance of measurement of a distance will rise as a function of the expected value of the distance.
These methods assume that the variance of the measurement error is proportional to the P-th power of the expectation (hence the standard deviation will be proportional to the P/2-th power of the expectation). If you have reason to think that the measurement error of a distance is the same for small distances as it is for large, then you should set P=0 and use the least squares method, but if you have reason to think that the relative (percentage) error is more nearly constant than the absolute error, you should use P=2, the Fitch-Margoliash method. In between, P=1 would be appropriate if the sizes of the errors were proportional to the square roots of the expected distance.
One question which arises frequently is what the units of branch length are in the resulting trees. In general, they are not time but units of distance. Thus if two species have a distance 0.3 between them, they will tend to be separated by branches whose total length is about 0.3. In the case of DNA distances, for example, the unit of branch length will be substitutions per base. (In the case of protein distances, it will be amino acid substitutions per amino acid position.)
Here are the options available in all three programs. They are selected using the menu of options.
In Fitch, user trees (the U option) are to be regarded as unrooted, with a trifurcation at their base:

((A,B),C,(D,E));
while in Kitsch they are to be regarded as rooted and have a bifurcation at the base:
((A,B),(C,(D,E)));
Be careful not to move User trees from Fitch to Kitsch without changing their form appropriately (you can use Retree to do this). User trees are not available in Neighbor. In Fitch if you specify the branch lengths on one or more branches, you can select the L (use branch Lengths) option to avoid having those branches iterated, so that the tree is evaluated with their lengths fixed.
When the Subreplication (S) option is in effect, each distance in the input data is followed by its number of replicates, as in this line for one species:

Delta      3.00 5 3.21 3 1.84 9
the 5, 3, and 9 being the number of times the measurement was replicated. When the number of replicates is zero, a distance value must still be provided, although its value will not affect the result. This option is not available in Neighbor.
The numerical options are the usual ones and should be clear from the menu.
Note that when the options L or R are used one of the species, the first or last one, will have its name on an otherwise empty line. Even so, the name should be padded out to full length with blanks. Here is a sample lower-triangular data set.
    5
Alpha          <--- note: five blanks should follow the name "Alpha"
Beta       1.00
Gamma      3.00  3.00
Delta      3.00  3.00  2.00
Epsilon    3.00  3.00  2.00  1.00
Be careful if you are using lower- or upper-triangular matrices to make the corresponding selection from the menu (L or R), as the program may otherwise become horribly confused; it may still produce a result, but that result will be meaningless. With the correct menu option selected all should be well.
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program implements the compatibility method for DNA sequence data. For a four-state character without a character-state tree, as in DNA sequences, the usual clique theorems cannot be applied. The approach taken in this program is to directly evaluate each tree topology by counting how many substitutions are needed in each site, comparing this to the minimum number that might be needed (one less than the number of bases observed at that site), and then evaluating the number of sites which achieve the minimum number. This is the evaluation of the tree (the number of compatible sites), and the topology is chosen so as to maximize that number.
Compatibility methods originated with Le Quesne's (1969) suggestion that one ought to look for trees supported by the largest number of perfectly fitting (compatible) characters. Fitch (1975) showed by counterexample that one could not use the pairwise compatibility methods used in Clique to discover the largest clique of jointly compatible characters.
The assumptions of this method are similar to those of Clique. In a paper in the Biological Journal of the Linnean Society (1981b) I discuss this matter extensively. In effect, the assumptions are that:
That these are the assumptions of compatibility methods has been documented in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that arguments such as mine are invalid and that parsimony (and perhaps compatibility) methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
There is, however, some reason to believe that the present criterion is not the proper way to correct for the presence of some sites with high rates of change in nucleotide sequence data. It can be argued that sites showing more than two nucleotide states, even if those are compatible with the other sites, are also candidates for sites with high rates of change. It might then be more proper to use Dnapars with the Threshold option with a threshold value of 2.
Change from an occupied site to a gap is counted as one change. Reversion from a gap to an occupied site is allowed and is also counted as one change. Note that this in effect assumes that a gap N bases long is N separate events. This may be an overcorrection. When we have nonoverlapping gaps, we could instead code a gap as a single event by changing all but the first "-" in the gap into "?" characters. In this way only the first base of the gap causes the program to infer a change.
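The recoding workaround described above (count each gap as a single event by keeping only its first "-") can be sketched as follows. The helper name is hypothetical, not part of PHYLIP:

```python
import re

def recode_gaps(sequence):
    """Turn every gap character after the first in each run of "-" into
    "?", so that only the first position of the gap is counted as a change."""
    return re.sub(r"-+",
                  lambda gap: "-" + "?" * (len(gap.group(0)) - 1),
                  sequence)
```

For example, a three-base gap "---" becomes "-??", so the tree inference charges one event for the whole gap rather than three.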
The input data is standard. The first line of the input file contains the number of species and the number of sites.
Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.
The options are selected using an interactive menu. The menu looks like this:
DNA compatibility algorithm, version 3.69

Settings for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4  Print steps & compatibility at sites  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.
The options U, J, O, W, M, and 0 are the usual ones. They are described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.
The O (outgroup) option has no effect if the U (user-defined tree) option is in effect. The user-defined trees (option U) fed in must be strictly bifurcating, with a two-way split at their base.
The interpretation of weights (option W) in the case of a compatibility method is that they count how many times the character (in this case the site) is counted in the analysis. Thus a character can be dropped from the analysis by assigning it zero weight. On the other hand, giving it a weight of 5 means that in any clique it is in, it is counted as 5 characters when the size of the clique is evaluated. Generally, weights other than 0 or 1 do not have much meaning when dealing with DNA sequences.
Output is standard: if option 1 is toggled on, the data is printed out, with the convention that "." means "the same as in the first species". Then comes a list of equally parsimonious trees, and (if option 2 is toggled on) a table of the number of changes of state required in each character. If option 5 is toggled on, a table is printed out after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" or one of the IUB ambiguity symbols, there will be multiple equally-parsimonious assignments of states; the user must work these out for themselves by hand. A "?" in the reconstructed states means that in addition to one or more bases, a gap may or may not be present. If option 6 is left in its default state the trees found will be written to a tree file, so that they are available to be used in other programs. If the program finds multiple trees tied for best, all of these are written out onto the output tree file. Each is followed by a numerical weight in square brackets (such as [0.25000]). This is needed when we use the trees to make a consensus tree of the results of bootstrapping or jackknifing, to avoid overrepresenting replicates that find many tied trees.
If the U (User Tree) option is used and more than one tree is supplied, the program also performs a statistical test of each of these trees against the best (most compatible) one. If there are two user trees, the test done is one which is due to Kishino and Hasegawa (1989), a version of a test originally introduced by Templeton (1983). In this implementation it uses the mean and variance of weighted compatibility differences between trees, taken across sites. If the two trees' compatibilities are more than 1.96 standard deviations different then the trees are declared significantly different.
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. In the version used here the variances and covariances of the sum of weighted compatibilities of sites are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected compatibility, compatibilities for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the highest compatibility exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.
In either the KHT or the SH test the program prints out a table of the compatibility of each tree, the differences of each from the highest one, the variance of that quantity as determined by the compatibility differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one.
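The resampling step of the SH test can be sketched roughly as below. This is my own illustrative code, not PHYLIP's implementation; the function name and the use of a multivariate normal approximation (rather than PHYLIP's exact resampling scheme) are assumptions:

```python
import numpy as np

def sh_test_sketch(site_scores, n_reps=10000, rng=None):
    """Rough sketch of a Shimodaira-Hasegawa-style test.

    site_scores: array of shape (n_trees, n_sites) holding each tree's
    per-site compatibility (or log likelihood) contributions.
    Returns one P value per tree for its shortfall from the best tree.
    """
    rng = np.random.default_rng(rng)
    totals = site_scores.sum(axis=1)
    obs_diff = totals.max() - totals          # observed shortfall per tree
    # covariance of the per-tree totals, estimated from the per-site values
    cov = np.cov(site_scores) * site_scores.shape[1]
    # "least favorable hypothesis": equal means, estimated covariances
    draws = rng.multivariate_normal(np.zeros(len(totals)), cov, size=n_reps)
    sim_diff = draws.max(axis=1, keepdims=True) - draws
    # P value: how often so large a shortfall arises when all trees are equal
    return (sim_diff >= obs_diff).mean(axis=0)
```

By construction the best tree always gets a P value of 1.0, since its observed shortfall is zero.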
The algorithm is a straightforward modification of Dnapars, but with some extra machinery added to calculate, as each species is added, how many base changes are the minimum which could be required at that site. The program runs fairly quickly.
The constants which can be changed at the beginning of the program are: the name length "nmlngth", "maxtrees", the maximum number of trees which the program will store for output, and "maxuser", the maximum number of user trees that can be used in the paired sites test.
   5   13
Alpha     AACGUGGCCAAAU
Beta      AAGGUCGCCAAAC
Gamma     CAUUUCGUCACAA
Delta     GGUAUUUCGGCCU
Epsilon   GGGAUCUCGGCCC
DNA compatibility algorithm, version 3.69

 5 species,  13  sites

Name            Sequences
----            ---------

Alpha        AACGUGGCCA AAU
Beta         ..G..C.... ..C
Gamma        C.UU.C.U.. C.A
Delta        GGUA.UU.GG CC.
Epsilon      GGGA.CU.GG CCC


One most parsimonious tree found:

              +--Epsilon
           +--4
        +--3  +--Delta
        !  !
     +--2  +-----Gamma
     !  !
     1  +--------Beta
     !
     +-----------Alpha

  remember: this is an unrooted tree!

total number of compatible sites is       11.0

steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0|       2   1   3   2   0   2   1   1   1
   10|   1   1   1   3

 compatibility (Y or N) of each site with this tree:

      0123456789
     *----------
    0 ! YYNYYYYYY
   10 !YYYN

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)

          1                AABGTSGCCA AAY
  1       2        maybe   .....C.... ...
  2       3        yes     V.KD...... C..
  3       4        yes     GG.A..T.GG .C.
  4    Epsilon     maybe   ..G....... ..C
  4    Delta       yes     ..T..T.... ..T
  3    Gamma       yes     C.TT...T.. ..A
  2    Beta        maybe   ..G....... ..C
  1    Alpha       maybe   ..C..G.... ..T
© Copyright 1986-2008 by the University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.
This program uses nucleotide sequences to compute a distance matrix, under four different models of nucleotide substitution. It can also compute a table of similarity between the nucleotide sequences. The distance for each pair of species estimates the total branch length between the two species, and can be used in the distance matrix programs Fitch, Kitsch or Neighbor. This is an alternative to using the sequence data itself in the maximum likelihood program Dnaml or the parsimony program Dnapars.
The program reads in nucleotide sequences and writes an output file containing the distance matrix, or else a table of similarity between sequences. The four models of nucleotide substitution are those of Jukes and Cantor (1969), Kimura (1980), the F84 model (Kishino and Hasegawa, 1989; Felsenstein and Churchill, 1996), and the model underlying the LogDet distance (Barry and Hartigan, 1987; Lake, 1994; Steel, 1994; Lockhart et al., 1994). All except the LogDet distance can be made to allow for unequal rates of substitution at different sites, as Jin and Nei (1990) did for the Jukes-Cantor model. The program correctly takes into account a variety of sequence ambiguities, although in cases where they exist it can be slow.
Jukes and Cantor's (1969) model assumes that there is independent change at all sites, with equal probability. Whether a base changes is independent of its identity, and when it changes there is an equal probability of ending up with each of the other three bases. Thus the transition probability matrix (this is a technical term from probability theory and has nothing to do with transitions as opposed to transversions) for a short period of time dt is:
           To:     A        G        C        T
                ---------------------------------
            A  |  1-3a      a        a        a
  From:     G  |   a       1-3a      a        a
            C  |   a        a       1-3a      a
            T  |   a        a        a       1-3a
where a is u dt, the product of the rate of substitution per unit time (u) and the length dt of the time interval. For longer periods of time this implies that the probability that two sequences will differ at a given site is:
p  =  3/4 ( 1 - e^(-(4/3) u t) )
and hence that if we observe p, we can compute an estimate of the branch length ut by inverting this to get
ut  =  - (3/4) log_e ( 1 - (4/3) p )
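Putting the two formulas together gives a simple estimator. This is a sketch (the function name is mine, and ambiguity codes beyond the fully-unknown symbols are simply skipped rather than handled as the program does):

```python
import math

def jukes_cantor_distance(seq1, seq2):
    """Estimate the branch length ut from two aligned sequences by
    inverting p = 3/4 (1 - e^(-(4/3) ut))."""
    pairs = [(a, b) for a, b in zip(seq1, seq2)
             if a in "ACGTU" and b in "ACGTU"]   # skip unknowns and gaps
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("p >= 3/4: the Jukes-Cantor estimate is undefined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

For identical sequences the estimate is 0; as p approaches 3/4 (the expected fraction of differences between random sequences) it diverges.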
The Kimura "2-parameter" model is almost as symmetric as this, but allows for a difference between transition and transversion rates. Its transition probability matrix for a short interval of time is:
           To:      A          G          C          T
                -----------------------------------------
            A  |  1-a-2b       a          b          b
  From:     G  |    a        1-a-2b       b          b
            C  |    b          b        1-a-2b       a
            T  |    b          b          a        1-a-2b
where a is u dt, the product of the rate of transitions per unit time (u) and the length dt of the time interval, and b is v dt, the product of half the rate of transversions (i.e., the rate of a specific transversion) and the length dt of the time interval.
The F84 model incorporates different rates of transition and transversion, but also allows for different frequencies of the four nucleotides. It is the model which is used in Dnaml, the maximum likelihood nucleotide sequence phylogenies program in this package. You will find the model described in the document for that program. The transition probabilities for this model are given by Kishino and Hasegawa (1989), and further explained in a paper by me and Gary Churchill (1996).
The LogDet distance allows a fairly general model of substitution. It computes the distance from the determinant of the empirically observed matrix of joint probabilities of nucleotides in the two species. An explanation of it is available in the chapter by Swofford et al. (1996).
The first three models are closely related. The Dnaml model reduces to Kimura's two-parameter model if we assume that the equilibrium frequencies of the four bases are equal. The Jukes-Cantor model in turn is a special case of the Kimura 2-parameter model where a = b. Thus each model is a special case of the ones that follow it, Jukes-Cantor being a special case of both of the others.
The Jin and Nei (1990) correction for variation in rate of evolution from site to site can be adapted to all of the first three models. It assumes that the rate of substitution varies from site to site according to a gamma distribution, with a coefficient of variation that is specified by the user. The user is asked for it when choosing this option in the menu.
Each distance that is calculated is an estimate, from that particular pair of species, of the divergence time between those two species. For the Jukes-Cantor model, the estimate is computed using the formula for ut given above, as long as the nucleotide symbols in the two sequences are all either A, C, G, T, U, N, X, ?, or - (the latter four indicate a deletion or an unknown nucleotide). This estimate is a maximum likelihood estimate for that model. For the Kimura 2-parameter model, with only these nucleotide symbols, formulas special to that estimate are also computed. These are also, in effect, computing the maximum likelihood estimate for that model. In the Kimura case it depends on the observed sequences only through the sequence length and the observed number of transition and transversion differences between those two sequences. The calculation in that case is a maximum likelihood estimate and will differ somewhat from the estimate obtained from the formulas in Kimura's original paper. That formula was also a maximum likelihood estimate, but with the transition/transversion ratio estimated empirically, separately for each pair of sequences. In the present case, one overall preset transition/transversion ratio is used, which makes the computations harder but achieves greater consistency between different comparisons.
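For comparison, Kimura's original closed-form estimate (which, as just noted, can differ somewhat from the maximum likelihood fit with a preset ratio that the program uses) can be sketched as follows. The function name is mine, T is assumed rather than U, and ambiguity codes are not handled:

```python
import math

# unordered pairs that count as transitions (purine<->purine, pyrimidine<->pyrimidine)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def kimura_two_parameter(seq1, seq2):
    """Kimura (1980) distance from the fractions of transition (p) and
    transversion (q) differences:
        d = -1/2 log_e( (1 - 2p - q) * sqrt(1 - 2q) )
    """
    n = len(seq1)
    p = sum((a, b) in TRANSITIONS for a, b in zip(seq1, seq2)) / n
    q = sum(a != b and (a, b) not in TRANSITIONS
            for a, b in zip(seq1, seq2)) / n
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))
```

When q = 0 this reduces to -1/2 log_e(1 - 2p), and for identical sequences it is 0.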
For the F84 model, or for any of the models where one or both sequences contain at least one of the other ambiguity codons such as Y, R, etc., a maximum likelihood calculation is also done using code which was originally written for Dnaml. Its disadvantage is that it is slow. The resulting distance is in effect a maximum likelihood estimate of the divergence time (the total branch length) between the two sequences. However the present program will be much faster than versions earlier than 3.5, because I have speeded up the iterations.
The LogDet model computes the distance from the determinant of the matrix of co-occurrence of nucleotides in the two species, according to the formula
D  =  - 1/4 ( log_e |F|  -  1/2 log_e ( fA1 fC1 fG1 fT1 fA2 fC2 fG2 fT2 ) )

where F is a matrix whose (i,j) element is the fraction of sites at which base i occurs in one species and base j occurs in the other, and fji is the fraction of sites at which species i has base j. The LogDet distance cannot cope with ambiguity codes. It must have completely defined sequences. One limitation of the LogDet distance is that it may be infinite sometimes, if there are too many changes between certain pairs of nucleotides. This can be particularly noticeable with distances computed from bootstrapped sequences.
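A direct transcription of this formula, using only the standard library, might look like the following. The determinant helper and function names are mine, and no handling of ambiguity codes is attempted (the LogDet distance requires fully resolved sequences anyway):

```python
import math

def det(m):
    """Determinant by cofactor expansion (fine for a 4x4 matrix)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def logdet_distance(seq1, seq2):
    """LogDet distance from the joint base-pair frequency matrix F."""
    bases = "ACGT"
    n = len(seq1)
    # F[i][j]: fraction of sites with base i in species 1 and base j in species 2
    F = [[sum(a == x and b == y for a, b in zip(seq1, seq2)) / n
          for y in bases] for x in bases]
    f1 = [sum(row) for row in F]                              # base freqs, species 1
    f2 = [sum(F[i][j] for i in range(4)) for j in range(4)]   # base freqs, species 2
    return -0.25 * (math.log(det(F))
                    - 0.5 * math.log(math.prod(f1) * math.prod(f2)))
```

Note that if any required frequency is zero the logarithm blows up, which is the "may be infinite" behavior mentioned above.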
Note that there is an assumption that we are looking at all sites, including those that have not changed at all. It is important not to restrict attention to some sites based on whether or not they have changed; doing that would bias the distances by making them too large, and that in turn would cause the information in the sites that had changed to be misinterpreted.
For all of these distance methods, the program allows us to specify that "third position" bases have a different rate of substitution than first and second positions, that introns have a different rate than exons, and so on. The Categories option which does this allows us to make up to 9 categories of sites and specify different rates of change for them.
In addition to the four distance calculations, the program can also compute a table of similarities between nucleotide sequences. These values are the fractions of sites identical between the sequences. The diagonal values are 1.0000. No attempt is made to count similarity of nonidentical nucleotides, so that no credit is given for having (for example) different purines at corresponding sites in the two sequences. This option has been requested by many users, who need it for descriptive purposes. It is not intended that the table be used for inferring the tree.
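Each entry of the similarity table is just this fraction of identical sites, as in the following sketch (the function name is mine):

```python
def similarity(seq1, seq2):
    """Fraction of sites at which two aligned sequences carry identical
    symbols; no partial credit for different purines, etc."""
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)
```

The diagonal entries, each sequence compared with itself, are always 1.0000.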
Input is fairly standard, with one addition. As usual the first line of the file gives the number of species and the number of sites.
Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion -- neither is dot (".").
The options are selected using an interactive menu. The menu looks like this:
Nucleic acid sequence Distance Matrix program, version 3.69

Settings for this run:
  D  Distance (F84, Kimura, Jukes-Cantor, LogDet)?  F84
  G          Gamma distributed rates across sites?  No
  T                 Transition/transversion ratio?  2.0
  C            One category of substitution rates?  Yes
  W                         Use weights for sites?  No
  F                Use empirical base frequencies?  Yes
  L                       Form of distance matrix?  Square
  M                    Analyze multiple data sets?  No
  I                   Input sequences interleaved?  Yes
  0            Terminal type (IBM PC, ANSI, none)?  ANSI
  1             Print out the data at start of run  No
  2           Print indications of progress of run  Yes

  Y to accept these or type the letter for one to change
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.
The D option selects one of the four distance methods, or the similarity table. It toggles among the five methods. The default method, if none is specified, is the F84 model.
If the G (Gamma distribution) option is selected, the user will be asked to supply the coefficient of variation of the rate of substitution among sites. This is different from the parameters used by Nei and Jin but related to them: their parameter a is also known as "alpha", the shape parameter of the Gamma distribution. It is related to the coefficient of variation by
CV  =  1 / a^(1/2)
or
a  =  1 / CV^2
(their parameter b is absorbed here by the requirement that time is scaled so that the mean rate of evolution is 1 per unit time, which means that a = b). As we consider cases in which the rates are less variable we should set a larger and larger, as CV gets smaller and smaller.
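The conversion, and the way such a coefficient of variation enters a gamma-corrected distance, can be sketched as follows. The corrected Jukes-Cantor formula here, in the style of Jin and Nei (1990), is my own addition for illustration and is not stated in this text:

```python
import math

def alpha_from_cv(cv):
    """Gamma shape parameter a from the coefficient of variation (a = 1/CV^2)."""
    return 1.0 / cv ** 2

def jukes_cantor_gamma(p, cv):
    """Gamma-corrected Jukes-Cantor distance for a fraction p of
    differing sites (assumed formula in the style of Jin and Nei, 1990)."""
    a = alpha_from_cv(cv)
    return 0.75 * a * ((1.0 - (4.0 / 3.0) * p) ** (-1.0 / a) - 1.0)
```

As CV shrinks toward 0 (a grows), the corrected distance approaches the plain Jukes-Cantor value; larger CV inflates the estimate, reflecting hidden multiple hits at fast sites.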
The F (Frequencies) option appears when the Maximum Likelihood distance is selected. This distance requires that the program be provided with the equilibrium frequencies of the four bases A, C, G, and T (or U). Its default setting is one which may save users much time. If you want to use the empirical frequencies of the bases, observed in the input sequences, as the base frequencies, you simply use the default setting of the F option. These empirical frequencies are not really the maximum likelihood estimates of the base frequencies, but they will often be close to those values (what they are is maximum likelihood estimates under a "star" or "explosion" phylogeny). If you change the setting of the F option you will be prompted for the frequencies of the four bases. These must add to 1 and are to be typed on one line separated by blanks, not commas.
The T option in this program does not stand for Threshold, but instead is the Transition/transversion option. The user is prompted for a real number greater than 0.0, as the expected ratio of transitions to transversions. Note that this is not the ratio of the first to the second kinds of events, but the resulting expected ratio of transitions to transversions. The exact relationship between these two quantities depends on the frequencies in the base pools. The default value of the T parameter if you do not use the T option is 2.0.
The C option allows user-defined rate categories. The user is prompted for the number of user-defined rates, and for the rates themselves, which cannot be negative but can be zero. These numbers, which must be nonnegative (some could be 0), are defined relative to each other, so that if rates for three categories are set to 1 : 3 : 2.5 this would have the same meaning as setting them to 2 : 6 : 5. The assignment of rates to sites is then made by reading a file whose default name is "categories". It should contain a string of digits 1 through 9. A new line or a blank can occur after any character in this string. Thus the categories file might look like this:
122231111122411155 1155333333444
If both user-assigned rate categories and Gamma-distributed rates are allowed, the program assumes that the actual rate at a site is the product of the user-assigned category rate and the Gamma-distributed rate. This allows you to specify that certain sites have higher or lower rates of change while also allowing the program to allow variation of rates in addition to that. (This may not always make perfect biological sense: it would be more natural to assume some upper bound to the rate, as we have discussed in the Felsenstein and Churchill paper). Nevertheless you may want to use both types of rate variation.
The L option specifies that the output file is to have the distance matrix in lower triangular form.
The W (Weights) option is invoked in the usual way, with only weights 0 and 1 allowed. It selects a set of sites to be analyzed, ignoring the others. The sites selected are those with weight 1. If the W option is not invoked, all sites are analyzed. The Weights (W) option takes the weights from a file whose default name is "weights". The weights follow the format described in the main documentation file.
The M (multiple data sets) option will ask you whether you want to use multiple sets of weights (from the weights file) or multiple data sets from the input file. The ability to use a single data set with multiple weights means that much less disk space will be used for this input data. The bootstrapping and jackknifing tool Seqboot has the ability to create a weights file with multiple weights. Note also that when we use multiple weights for bootstrapping we can also then maintain different rate categories for different sites in a meaningful way. If you use the multiple data sets option rather than multiple weights, you should not at the same time use the user-defined rate categories option (option C), because the user-defined rate categories could then be associated with the wrong sites. This is not a concern when the M option is used with multiple weights.
The option 0 is the usual one. It is described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.
As the distances are computed, the program prints on your screen or terminal the names of the species in turn, followed by one dot (".") for each other species for which the distance to that species has been computed. Thus if there are ten species, the first species name is printed out, followed by nine dots, then on the next line the next species name is printed out followed by eight dots, then the next followed by seven dots, and so on. The pattern of dots should form a triangle. When the distance matrix has been written out to the output file, the user is notified of that.
The output file contains on its first line the number of species. The distance matrix is then printed in standard form, with each species starting on a new line with the species name, followed by the distances to the species in order. These continue onto a new line after every nine distances. If the L option is used, the matrix of distances is in lower triangular form, so that only the distances to the other species that precede each species are printed. Otherwise the distance matrix is square with zero distances on the diagonal. In general the format of the distance matrix is such that it can serve as input to any of the distance matrix programs.
If the option to print out the data is selected, the output file will precede the data by more complete information on the input and the menu selections. The output file begins by giving the number of species and the number of characters, and the identity of the distance measure that is being used.
If the C (Categories) option is used a table of the relative rates of expected substitution at each category of sites is printed, and a listing of the categories each site is in.
There will then follow the equilibrium frequencies of the four bases. If the Jukes-Cantor or Kimura distances are used, these will necessarily be 0.25 : 0.25 : 0.25 : 0.25. The output then shows the transition/transversion ratio that was specified or used by default. In the case of the Jukes-Cantor distance this will always be 0.5. The transition-transversion parameter (as opposed to the ratio) is also printed out: this is used within the program and can be ignored. There then follow the data sequences, with the base sequences printed in groups of ten bases along the lines of the Genbank and EMBL formats.
The distances printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change, averaged over all sites analyzed, is set to 1.0 if there are multiple categories of sites. This means that whether or not there are multiple categories of sites, the expected fraction of change for very small branches is equal to the branch length. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes may occur in the same site and overlie or even reverse each other. The branch length estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the nucleotide sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.
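To make the overlaying concrete, here is the relation between branch length and expected sequence difference under the simpler Jukes-Cantor model (chosen only for illustration; the program's default distance is F84, whose formula differs):

```python
from math import exp

def expected_difference(d):
    """Expected fraction of differing sites after a branch of length d
    (in expected substitutions per site) under the Jukes-Cantor model.
    An illustrative sketch, not Dnadist's own code."""
    return 0.75 * (1.0 - exp(-4.0 * d / 3.0))
```

For d = 0.26 this gives an expected difference of about 22%, not 26%, because some changes overlie or reverse earlier ones; for very small d the two quantities coincide.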
One problem that can arise is that two or more of the species can be so dissimilar that the distance between them would have to be infinite, as the likelihood rises indefinitely as the estimated divergence time increases. For example, with the Jukes-Cantor model, if the two sequences differ in 75% or more of their positions then the estimate of divergence time would be infinite. Since there is no way to represent an infinite distance in the output file, the program regards this as an error, issues an error message indicating which pair of species are causing the problem, and stops. It might be that, had it continued running, it would have run into the same problem with other pairs of species as well. If the Kimura distance is being used there may be no error message; the program may simply give a large distance value (it is iterating towards infinity and the value is just where the iteration stopped). Likewise some maximum likelihood estimates may also become large for the same reason (the sequences showing more divergence than is expected even with infinite branch length). I hope in the future to add more warning messages that would alert the user to this.
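The Jukes-Cantor case illustrates this directly: the distance estimate diverges as the observed fraction of differing sites approaches 0.75 (an illustrative sketch, not Dnadist's own code):

```python
from math import log

def jukes_cantor_distance(p):
    """Jukes-Cantor distance from an observed fraction p of differing
    sites.  The argument of the logarithm reaches zero at p = 0.75, so
    the distance is undefined (infinite) at or beyond that point,
    mirroring the error condition described above."""
    if p >= 0.75:
        raise ValueError("sequences too dissimilar: distance is infinite")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)
```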
If the similarity table is selected, the table that is produced is not in a format that can be used as input to the distance matrix programs. It has a heading, and the species names are also put at the tops of the columns of the table (or rather, the first 8 characters of each species name are there, the other two characters omitted to save space). There is not an option to put the table into a format that can be read by the distance matrix programs, nor is there one to make it into a table of fractions of difference by subtracting the similarity values from 1. This is done deliberately to make it more difficult to use these values to construct trees. The similarity values are not corrected for multiple changes, and their use to construct trees (even after converting them to fractions of difference) would be wrong, as it would lead to severe conflict between the distant pairs of sequences and the close pairs of sequences.
The constants that are available to be changed by the user at the beginning of the program include "maxcategories", the maximum number of site categories, "iterations", which controls the number of times the program iterates the EM algorithm that is used to do the maximum likelihood distance, "namelength", the length of species names in characters, and "epsilon", a parameter which controls the accuracy of the results of the iterations which estimate the distances. Making "epsilon" smaller will increase run times but result in more decimal places of accuracy. This should not be necessary.
The program spends most of its time doing real arithmetic. The algorithm, with separate and independent computations occurring for each pattern, lends itself readily to parallel processing.
   5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
(Note that when the options for displaying the input data are turned off, the output is in a form suitable for use as an input file in the distance matrix programs).
Nucleic acid sequence Distance Matrix program, version 3.69

 5 species, 13 sites

  F84 Distance

Transition/transversion ratio =   2.000000

Name            Sequences
----            ---------

Alpha        AACGTGGCCA CAT
Beta         ..G..C.... ..C
Gamma        C.GT.C.... ..A
Delta        G.GA.TT..G .C.
Epsilon      G.GA.CT..G .CC

Empirical Base Frequencies:

   A       0.24615
   C       0.36923
   G       0.21538
  T(U)     0.16923

    5
Alpha       0.000000  0.303900  0.857544  1.158927  1.542899
Beta        0.303900  0.000000  0.339727  0.913522  0.619671
Gamma       0.857544  0.339727  0.000000  1.631729  1.293713
Delta       1.158927  0.913522  1.631729  0.000000  0.165882
Epsilon     1.542899  0.619671  1.293713  0.165882  0.000000
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program reads in nucleotide sequences for four species and computes the phylogenetic invariants discovered by James Cavender (Cavender and Felsenstein, 1987) and James Lake (1987). Lake's method is also called by him "evolutionary parsimony". I prefer Cavender's more mathematically precise term "invariants", as the method bears somewhat more relationship to likelihood methods than to parsimony. The invariants are mathematical formulas (in the present case linear or quadratic) in the EXPECTED frequencies of site patterns which are zero for all trees of a given tree topology, irrespective of branch lengths. Consider a given site: if there are no ambiguities, the four species could show any of the nucleotide patterns (reading the same site across all four species) AAAA, AAAC, AAAG, ... through TTTT, 256 patterns in all.
The invariants are formulas in the expected pattern frequencies, not the observed pattern frequencies. When they are computed using the observed pattern frequencies, we will usually find that they are not precisely zero even when the model is correct and we have the correct tree topology. Only as the number of nucleotides scored becomes infinite will the observed pattern frequencies approach their expectations; otherwise, we must do a statistical test of the invariants.
Some explanation of invariants will be found in the above papers, and also in my review article on statistical aspects of inferring phylogenies (Felsenstein, 1988b). Although invariants have some important advantages, their validity also depends on symmetry assumptions that may not be satisfied. In the discussion below suppose that the possible unrooted phylogenies are I: ((A,B),(C,D)), II: ((A,C),(B,D)), and III: ((A,D),(B,C)).
Lake's invariants are fairly simple to describe: the patterns involved are only those in which there are two purines and two pyrimidines at a site. Thus a site with AACT would affect the invariants, but a site with AAGG would not. Let us use (as Lake does) the symbols 1, 2, 3, and 4, with the proviso that 1 and 2 are either both of the purines or both of the pyrimidines; 3 and 4 are the other two nucleotides. Thus 1 and 2 always differ by a transition; so do 3 and 4. Lake's invariants, expressed in terms of expected frequencies, are the three quantities:
(1) P(1133) + P(1234) - P(1134) - P(1233),
(2) P(1313) + P(1324) - P(1314) - P(1323),
(3) P(1331) + P(1342) - P(1341) - P(1332),
He showed that invariants (2) and (3) are zero under Topology I, (1) and (3) are zero under Topology II, and (1) and (2) are zero under Topology III. If, for example, we see a site with pattern ACGC, we can start by setting 1=A. Then 2 must be G. We can then set 3=C (so that 4 is T). Thus its pattern type, making those substitutions, is 1323. P(1323) is the expected probability of the type of pattern which includes ACGC, TGCG, GTAT, etc.
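The substitution rule just described can be sketched as a small function (illustrative only; Dnainvar's internal coding may differ):

```python
def pattern_type(pattern):
    """Symmetrize a four-base site pattern into Lake's 1234 notation:
    1 is the first base, 2 its transition partner, 3 the first base of
    the other class encountered, 4 that base's partner.  Returns None
    unless the site has exactly two purines and two pyrimidines, since
    only such sites enter Lake's invariants."""
    if sum(b in "AG" for b in pattern) != 2:
        return None
    partner = {"A": "G", "G": "A", "C": "T", "T": "C"}
    code = {}

    def assign(base, low):
        code[base] = low
        code[partner[base]] = low + 1

    assign(pattern[0], 1)
    for b in pattern:
        if b not in code:
            assign(b, 3)
            break
    return "".join(str(code[b]) for b in pattern)
```

For AACT this yields type 1134, while a site such as AAGG (all purines) is rejected, matching the rule that only two-purine, two-pyrimidine sites affect the invariants.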
Lake's invariants are easily tested with observed frequencies. For example, the first of them is a test of whether there are as many sites of types 1133 and 1234 as there are of types 1134 and 1233; this is easily tested with a chi-square test or, as in this program, with an exact binomial test. Note that with several invariants to test, we risk overestimating the significance of results if we simply accept the nominal 95% levels of significance (Li and Gouy, 1990).
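The exact binomial test mentioned here asks how often a fair coin would give at least as lopsided a split between the two site counts. A sketch of such a test (not PHYLIP's own implementation):

```python
from math import comb

def lake_binomial_p(n1, n2):
    """One-sided exact binomial test, p = 1/2, of whether the first
    count (e.g. sites of types 1133 and 1234) exceeds the second
    (types 1134 and 1233) by more than chance alone would produce."""
    n = n1 + n2
    # probability of n1 or more successes in n fair coin tosses
    return sum(comb(n, k) for k in range(n1, n + 1)) / 2 ** n
```

With counts 1 and 0 this gives P = 0.5, as in the sample output later in this document; with counts 0 and 0 it gives P = 1.0.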
Lake's invariants assume that each site is evolving independently, and that starting from any base a transversion is equally likely to end up at each of the two possible bases (thus, an A undergoing a transversion is equally likely to end up as a C or a T, and similarly for the other bases from which one could start). Interestingly, Lake's results do not assume that rates of evolution are the same at all sites. The result that the total of 1133 and 1234 is expected to be the same as the total of 1134 and 1233 is unaffected by the fact that we may have aggregated the counts over classes of sites evolving at different rates.
Cavender's invariants (Cavender and Felsenstein, 1987) are for the case of a character with two states. In the nucleic acid case we can classify nucleotides into two states, R and Y (Purine and Pyrimidine) and then use the two-state results. Cavender starts, as before, with the pattern frequencies. Coding purines as R and pyrimidines as Y, the pattern types are RRRR, RRRY, and so on until YYYY, a total of 16 types. Cavender found quadratic functions of the expected frequencies of these 16 types that were expected to be zero under a given phylogeny, irrespective of branch lengths. Two invariants (called K and L) were found for each tree topology. The L invariants are particularly easy to understand. If we have the tree topology ((A,B),(C,D)), then in the case of two symmetric states, the event that A and B have the same state should be independent of whether C and D have the same state, as the events determining these happen in different parts of the tree. We can set up a contingency table:
                   C = D           C =/= D
           ------------------------------------
           |               |                  |
   A = B   |  YYYY, YYRR,  |   YYYR, YYRY,    |
           |  RRRR, RRYY   |   RRYR, RRRY     |
           |               |                  |
  A =/= B  |  YRYY, YRRR,  |   YRYR, YRRY,    |
           |  RYYY, RYRR   |   RYYR, RYRY     |
           ------------------------------------
where "=/=" means "is not equal to." We expect that the events C = D and A = B will be independent. Cavender's L invariant for this tree topology is simply the negative of the crossproduct difference,
P(A=/=B and C=D) P(A=B and C=/=D) - P(A=B and C=D) P(A=/=B and C=/=D).
One of these L invariants is defined for each of the three tree topologies. They can obviously be tested simply by doing a chi-square test on the contingency table. The one corresponding to the correct topology should be statistically indistinguishable from zero. Again, there is a possible multiple tests problem if all three are tested at a nominal value of 95%.
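The tally of the contingency table and the resulting L invariant can be sketched as follows (an illustrative helper, not PHYLIP's own code):

```python
from collections import Counter

def cavender_L(patterns):
    """For topology ((A,B),(C,D)): recode each four-base site pattern
    as purine (R) / pyrimidine (Y), tally the 2x2 table of (A == B)
    against (C == D), and return the L invariant, i.e. the crossproduct
    difference given above, from observed pattern frequencies."""
    table = Counter()
    for p in patterns:
        ry = ["R" if b in "AG" else "Y" for b in p]
        table[(ry[0] == ry[1], ry[2] == ry[3])] += 1
    n = sum(table.values())

    def f(same_ab, same_cd):
        return table[(same_ab, same_cd)] / n

    return f(False, True) * f(True, False) - f(True, True) * f(False, False)
```

Under the correct topology the observed value should be statistically indistinguishable from zero; the chi-square test on the table is the corresponding significance test.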
The K invariants are differences between the L invariants. When one of the tables is expected to have crossproduct difference zero, the other two are expected to be nonzero, and also to be equal. So the difference of their crossproduct differences can be taken; this is the K invariant. It is not so easily tested.
The assumptions of Cavender's invariants are different from those of Lake's. One obviously need not assume anything about the frequencies of, or transitions among, the two different purines or the two different pyrimidines. However one does need to assume independent events at each site, and one needs to assume that the Y and R states are symmetric, that the probability per unit time that a Y changes into an R is the same as the probability that an R changes into a Y, so that we expect equal frequencies of the two states. There is also an assumption that all sites are changing between these two states at the same expected rate. This assumption is not needed for Lake's invariants, since expectations of sums are equal to sums of expectations, but for Cavender's it is, since products of expectations are not equal to expectations of products.
It is helpful to have both sorts of invariants available; with further work we may come to appreciate what other invariants there are for various models of nucleic acid change.
The input data for Dnainvar is standard. The first line of the input file contains the number of species (which must always be 4 for this version of Dnainvar) and the number of sites.
Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.
The options are selected using an interactive menu. The menu looks like this:
Nucleic acid sequence Invariants method, version 3.69

Settings for this run:
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3      Print out the counts of patterns  Yes
  4              Print out the invariants  Yes

Y to accept these or type the letter for one to change
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.
The options W, M and 0 are the usual ones. They are described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.
The output consists first (if option 1 is selected) of a reprinting of the input data, then (if option 3 is on) tables of observed pattern frequencies and pattern type frequencies. A table will be printed out, in alphabetic order AAAA through TTTT, of all the patterns that appear among the sites and the number of times each appears. This table will be invaluable for computation of any other invariants. There follows another table, of pattern types, using the 1234 notation, in numerical order 1111 through 1234, of the number of times each type of pattern appears. In this computation all sites at which there are any ambiguities or deletions are omitted. Cavender's invariants could actually be computed from sites that have only Y or R ambiguities; this will be done in the next release of this program.
If option 4 is on the invariants are then printed out, together with their statistical tests. For Lake's invariants the two sums which are expected to be equal are printed out, and then the result of a one-tailed exact binomial test which tests whether the difference is expected to be this positive or more. The P level is given (but remember the multiple-tests problem!).
For Cavender's L invariants the contingency tables are given. Each is tested with a one-tailed chi-square test. It is possible that the expected numbers in some categories could be too small for valid use of this test; the program does not check for this. It is also possible that the chi-square could be significant but in the wrong direction; this is not tested in the current version of the program. To check for this, beware of a chi-square greater than 3.841 combined with a positive invariant. The invariants themselves are computed as the difference of cross-products. Their absolute magnitudes are not important, but which one is closest to zero may be indicative. Significantly nonzero invariants should be negative if the model is valid. The K invariants, which are simply differences among the L invariants, are also printed out without any test being conducted on them. Note that it is possible to use the bootstrap utility Seqboot to create multiple data sets, and by summing the output over all of these to get the empirical variability of these quadratic invariants.
The constants that are defined at the beginning of the program include "maxsp", which must always be 4 and should not be changed.
The program is very fast, as it has rather little work to do; these methods are just a little bit beyond the reach of hand tabulation. Execution speed should never be a limiting factor.
In a future version I hope to allow for Y and R codes in the calculation of the Cavender invariants, and to check for significantly negative cross-product differences in them, which would indicate violation of the model. By then there should be more known about invariants for larger number of species, and any such advances will also be incorporated.
   4   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT
Nucleic acid sequence Invariants method, version 3.69

 4 species, 13 sites

Name            Sequences
----            ---------

Alpha        AACGTGGCCA AAT
Beta         ..G..C.... ..C
Gamma        C.TT.C.T.. C.A
Delta        GGTA.TT.GG CC.

   Pattern   Number of times

     AAAC         1
     AAAG         2
     AACC         1
     AACG         1
     CCCG         1
     CCTC         1
     CGTT         1
     GCCT         1
     GGGT         1
     GGTA         1
     TCAT         1
     TTTT         1

Symmetrized patterns (1, 2 = the two purines  and  3, 4 = the two pyrimidines
                  or  1, 2 = the two pyrimidines  and  3, 4 = the two purines)

     1111         1
     1112         2
     1113         3
     1121         1
     1132         2
     1133         1
     1231         1
     1322         1
     1334         1

Tree topologies (unrooted):

   I:   ((Alpha,Beta),(Gamma,Delta))
   II:  ((Alpha,Gamma),(Beta,Delta))
   III: ((Alpha,Delta),(Beta,Gamma))

Lake's linear invariants
 (these are expected to be zero for the two incorrect tree topologies.
 This is tested by testing the equality of the two parts
 of each expression using a one-sided exact binomial test.
 The null hypothesis is that the first part is no larger than the second.)

 Tree                  Exact test P value    Significant?

   I     1 - 0 = 1          0.5000               no
   II    0 - 0 = 0          1.0000               no
   III   0 - 0 = 0          1.0000               no

Cavender's quadratic invariants (type L) using purines vs. pyrimidines
 (these are expected to be zero, and thus have a nonsignificant chi-square,
 for the correct tree topology)
They will be misled if there are substantially
different evolutionary rate between sites, or
different purine:pyrimidine ratios from 1:1.

  Tree I:

   Contingency Table

      2     8
      1     2

   Quadratic invariant =             4.0

   Chi-square =    0.23111 (not significant)

  Tree II:

   Contingency Table

      1     5
      1     6

   Quadratic invariant =            -1.0

   Chi-square =    0.01407 (not significant)

  Tree III:

   Contingency Table

      1     2
      6     4

   Quadratic invariant =             8.0

   Chi-square =    0.66032 (not significant)

Cavender's quadratic invariants (type K) using purines vs. pyrimidines
 (these are expected to be zero for the correct tree topology)
They will be misled if there are substantially
different evolutionary rate between sites, or
different purine:pyrimidine ratios from 1:1.
No statistical test is done on them here.

  Tree I:    -9.0
  Tree II:    4.0
  Tree III:   5.0
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program implements the maximum likelihood method for DNA sequences. The present version is faster than earlier versions of Dnaml. Details of the algorithm are published in the paper by Felsenstein and Churchill (1996). The model of base substitution allows the expected frequencies of the four bases to be unequal, allows the expected frequencies of transitions and transversions to be unequal, and has several ways of allowing different rates of evolution at different sites.
The assumptions of the present model are:
The ratio of the two purines in the purine replacement pool is the same as their ratio in the overall pool, and similarly for the pyrimidines.
The ratios of transitions to transversions can be set by the user. The substitution process can be diagrammed as follows: Suppose that you specified A, C, G, and T base frequencies of 0.24, 0.28, 0.27, and 0.21.
       Purine pool:               Pyrimidine pool:

     _______________            _______________
    |               |          |               |
    |   0.4706  A   |          |   0.5714  C   |
    |   0.5294  G   |          |   0.4286  T   |
    |  (ratio is    |          |  (ratio is    |
    |  0.24 : 0.27) |          |  0.28 : 0.21) |
    |_______________|          |_______________|
Draw from the overall pool:
     __________________
    |                  |
    |     0.24  A      |
    |     0.28  C      |
    |     0.27  G      |
    |     0.21  T      |
    |__________________|
Note that if the existing base is, say, an A, the first kind of event has a 0.4706 probability of "replacing" it by another A. The second kind of event has a 0.24 chance of replacing it by another A. This rather disconcerting model is used because it has nice mathematical properties that make likelihood calculations far easier. A closely similar, but not precisely identical, model having different rates of transitions and transversions has been used by Hasegawa et al. (1985b). The transition probability formulas for the current model were given (with my permission) by Kishino and Hasegawa (1989). Another explanation is available in the paper by Felsenstein and Churchill (1996).
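The within-class pool frequencies in the diagrams above follow directly from the overall base frequencies; a minimal sketch (the function name is mine, not PHYLIP's):

```python
def pool_frequencies(pi):
    """Given overall base frequencies pi = {'A': .., 'C': .., 'G': ..,
    'T': ..}, return the purine and pyrimidine pool frequencies used
    by the substitution model sketched above: each base's share of
    its own class."""
    pur = pi["A"] + pi["G"]
    pyr = pi["C"] + pi["T"]
    return ({"A": pi["A"] / pur, "G": pi["G"] / pur},
            {"C": pi["C"] / pyr, "T": pi["T"] / pyr})
```

With the frequencies 0.24, 0.28, 0.27, 0.21 from the example, this reproduces the pool values 0.4706 for A and 0.5714 for C shown in the diagram.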
Note the assumption that we are looking at all sites, including those that have not changed at all. It is important not to restrict attention to some sites based on whether or not they have changed; doing that would bias branch lengths by making them too long, and that in turn would cause the method to misinterpret the meaning of those sites that had changed.
This program uses a Hidden Markov Model (HMM) method of inferring different rates of evolution at different sites. This was described in a paper by me and Gary Churchill (1996). It allows us to specify to the program a number of different possible evolutionary rates, the prior probability with which each occurs, and the average length of a patch of sites all having the same rate. The rates can also be chosen by the program to approximate a Gamma distribution of rates, or a Gamma distribution plus a class of invariant sites. The program computes the likelihood by summing it over all possible assignments of rates to sites, weighting each by its prior probability of occurrence.
For example, if we have used the C and A options (described below) to specify that there are three possible rates of evolution, 1.0, 2.4, and 0.0, that the prior probabilities of a site having these rates are 0.4, 0.3, and 0.3, and that the average patch length (number of consecutive sites with the same rate) is 2.0, the program will sum the likelihood over all possibilities, but give less weight to those that (say) assign all sites to rate 2.4, or that fail to have consecutive sites that have the same rate.
The Hidden Markov Model framework for rate variation among sites was independently developed by Yang (1993, 1994, 1995). We have implemented a general scheme for a Hidden Markov Model of rates; we allow the rates and their prior probabilities to be specified arbitrarily by the user, or by a discrete approximation to a Gamma distribution of rates (Yang, 1995), or by a mixture of a Gamma distribution and a class of invariant sites.
This feature effectively removes the artificial assumption that all sites have the same rate, and also means that we need not know in advance the identities of the sites that have a particular rate of evolution.
Another layer of rate variation also is available. The user can assign categories of rates to each site (for example, we might want first, second, and third codon positions in a protein coding sequence to be three different categories). This is done with the categories input file and the C option. We then specify (using the menu) the relative rates of evolution of sites in the different categories. For example, we might specify that first, second, and third positions evolve at relative rates of 1.0, 0.8, and 2.7.
If both user-assigned rate categories and Hidden Markov Model rates are allowed, the program assumes that the actual rate at a site is the product of the user-assigned category rate and the Hidden Markov Model regional rate. (This may not always make perfect biological sense: it would be more natural to assume some upper bound to the rate, as we have discussed in the Felsenstein and Churchill paper). Nevertheless you may want to use both types of rate variation.
Subject to these assumptions, the program is a correct maximum likelihood method. The input is fairly standard, with one addition. As usual the first line of the file gives the number of species and the number of sites.
Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.
The options are selected using an interactive menu. The menu looks like this:
Nucleic acid sequence Maximum Likelihood method, version 3.69

Settings for this run:
  U                 Search for best tree?  Yes
  T        Transition/transversion ratio:  2.0000
  F       Use empirical base frequencies?  Yes
  C                One category of sites?  Yes
  R           Rate variation among sites?  constant rate
  W                       Sites weighted?  No
  S        Speedier but rougher analysis?  Yes
  G                Global rearrangements?  No
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes
  5   Reconstruct hypothetical sequences?  No

  Y to accept these or type the letter for one to change
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.
The options U, W, J, O, M, and 0 are the usual ones. They are described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.
The T option in this program does not stand for Threshold, but instead is the Transition/transversion option. The user is prompted for a real number greater than 0.0, as the expected ratio of transitions to transversions. Note that this is not the ratio of the first to the second kinds of events, but the resulting expected ratio of transitions to transversions. The exact relationship between these two quantities depends on the frequencies in the base pools. The default value of the T parameter if you do not use the T option is 2.0.
The F (Frequencies) option is one which may save users much time. If you want to use the empirical frequencies of the bases, observed in the input sequences, as the base frequencies, you simply use the default setting of the F option. These empirical frequencies are not really the maximum likelihood estimates of the base frequencies, but they will often be close to those values (they are the maximum likelihood estimates under a "star" or "explosion" phylogeny). If you change the setting of the F option you will be prompted for the frequencies of the four bases. These must add to 1 and are to be typed on one line separated by blanks, not commas.
The R (Hidden Markov Model rates) option allows the user to approximate a Gamma distribution of rates among sites, or a Gamma distribution plus a class of invariant sites, or to specify how many categories of substitution rates there will be in a Hidden Markov Model of rate variation, and what are the rates and probabilities for each. By repeatedly selecting the R option one toggles among no rate variation, the Gamma, Gamma+I, and general HMM possibilities.
If you choose Gamma or Gamma+I the program will ask how many rate categories you want. If you have chosen Gamma+I, keep in mind that one rate category will be set aside for the invariant class and only the remaining ones used to approximate the Gamma distribution. For the approximation we do not use the quantile method of Yang (1995) but instead use a quadrature method using generalized Laguerre polynomials. This should give a good approximation to the Gamma distribution with as few as 5 or 6 categories.
In the Gamma and Gamma+I cases, the user will be asked to supply the coefficient of variation of the rate of substitution among sites. This is different from the parameters used by Nei and Jin (1990) but related to them: their parameter a is also known as "alpha", the shape parameter of the Gamma distribution. It is related to the coefficient of variation by
CV = 1 / a^(1/2)
or
a = 1 / CV^2
(their parameter b is absorbed here by the requirement that time is scaled so that the mean rate of evolution is 1 per unit time, which means that a = b). As we consider cases in which the rates are less variable we should set a larger and larger, as CV gets smaller and smaller.
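As a quick sketch (in Python, not part of PHYLIP), the two parameterizations convert as follows:

```python
import math

def alpha_from_cv(cv):
    """Gamma shape parameter a ("alpha") from the coefficient of
    variation of rates among sites: a = 1 / CV**2."""
    return 1.0 / cv ** 2

def cv_from_alpha(a):
    """Inverse conversion: CV = 1 / sqrt(a)."""
    return 1.0 / math.sqrt(a)

# CV = 1 corresponds to alpha = 1; less variable rates (smaller CV)
# correspond to a larger and larger alpha.
print(alpha_from_cv(1.0))   # 1.0
print(alpha_from_cv(0.5))   # 4.0
```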
If the user instead chooses the general Hidden Markov Model option, they are first asked how many HMM rate categories there will be (for the moment there is an upper limit of 9, which should not be restrictive). Then the program asks for the rates for each category. These rates are only meaningful relative to each other, so that rates 1.0, 2.0, and 2.4 have the exact same effect as rates 2.0, 4.0, and 4.8. Note that an HMM rate category can have rate of change 0, so that this allows us to take into account that there may be a category of sites that are invariant. Note that the run time of the program will be proportional to the number of HMM rate categories: twice as many categories means twice as long a run. Finally the program will ask for the probabilities of a random site falling into each of these regional rate categories. These probabilities must be nonnegative and sum to 1. The default for the program is one category, with rate 1.0 and probability 1.0 (actually the rate does not matter in that case).
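Because only the relative rates matter, one can think of the category rates as being rescaled so that the mean rate, weighted by the prior probabilities, is 1. A sketch of the equivalence claimed above (the helper function is hypothetical, not part of the program):

```python
def normalize_rates(rates, probs):
    """Rescale HMM category rates so the mean rate (weighted by the
    prior probabilities of the categories) is 1; only the relative
    values of the rates matter."""
    mean = sum(r * p for r, p in zip(rates, probs))
    return [r / mean for r in rates]

probs = [0.5, 0.3, 0.2]
a = normalize_rates([1.0, 2.0, 2.4], probs)
b = normalize_rates([2.0, 4.0, 4.8], probs)
# The two specifications differ only by a constant factor, so they
# normalize to the same relative rates.
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))   # True
```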
If more than one HMM rate category is specified, then another option, A, becomes visible in the menu. This allows us to specify that we want to assume that sites that have the same HMM rate category are expected to be clustered so that there is autocorrelation of rates. The program asks for the value of the average patch length. This is an expected length of patches that have the same rate. If it is 1, the rates of successive sites will be independent. If it is, say, 10.25, then the chance of change to a new rate will be 1/10.25 after every site. However the "new rate" is randomly drawn from the mix of rates, and hence could even be the same. So the actual observed length of patches with the same rate will be a bit larger than 10.25. Note below that if you choose multiple patches, there will be an estimate in the output file as to which combination of rate categories contributed most to the likelihood.
Note that the autocorrelation scheme we use is somewhat different from Yang's (1995) autocorrelated Gamma distribution. I am unsure whether this difference is of any importance -- our scheme is chosen for the ease with which it can be implemented.
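The effect of the patch-length parameter can be checked with a small simulation of the redraw scheme described above (an illustration only; the helper names are made up):

```python
import random

def simulate_patches(n_sites, probs, patch_len, seed=1):
    """Draw a sequence of rate-category indices: after each site, with
    probability 1/patch_len a *new* category is drawn from the prior
    probabilities (and may happen to equal the old one)."""
    random.seed(seed)
    cats = [random.choices(range(len(probs)), probs)[0]]
    for _ in range(n_sites - 1):
        if random.random() < 1.0 / patch_len:
            cats.append(random.choices(range(len(probs)), probs)[0])
        else:
            cats.append(cats[-1])
    return cats

def mean_patch_length(cats):
    """Average length of runs of identical categories."""
    runs, length = [], 1
    for prev, cur in zip(cats, cats[1:]):
        if cur == prev:
            length += 1
        else:
            runs.append(length)
            length = 1
    runs.append(length)
    return sum(runs) / len(runs)

cats = simulate_patches(200000, [0.4, 0.3, 0.3], 10.25)
# The observed runs come out somewhat longer than 10.25, because a
# "new" draw can repeat the old category.
print(mean_patch_length(cats))
```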
The C option allows user-defined rate categories. The user is prompted for the number of user-defined rates, and for the rates themselves, which must be nonnegative (some can be zero). These rates are defined relative to each other, so that if rates for three categories are set to 1 : 3 : 2.5 this would have the same meaning as setting them to 2 : 6 : 5. The assignment of rates to sites is then made by reading a file whose default name is "categories". It should contain a string of digits 1 through 9. A new line or a blank can occur after any character in this string. Thus the categories file might look like this:
122231111122411155 1155333333444
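Reading such a file amounts to collecting the digits and ignoring blanks and new lines. A sketch in Python (the helper function is hypothetical, not the program's own reader):

```python
def read_categories(text, n_sites):
    """Collect the site-category digits 1 through 9 from a categories
    string, ignoring blanks and newlines."""
    cats = [int(ch) for ch in text if ch.isdigit()]
    if len(cats) != n_sites:
        raise ValueError("expected %d categories, got %d"
                         % (n_sites, len(cats)))
    return cats

# the example categories line above, for 31 sites
print(read_categories("122231111122411155 1155333333444", 31))
```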
With the current options R, A, and C the program has gained greatly in its ability to infer different rates at different sites and estimate phylogenies under a more realistic model. Note that Likelihood Ratio Tests can be used to test whether one combination of rates is significantly better than another, provided one rate scheme represents a restriction of another with fewer parameters. The number of parameters needed for rate variation is the number of regional rate categories, plus the number of user-defined rate categories less 2, plus one if the regional rate categories have a nonzero autocorrelation.
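The parameter count works out like this (a sketch of the counting rule just stated, not program code):

```python
def rate_variation_params(n_hmm_categories, n_user_categories, autocorrelated):
    """Parameters used for rate variation: the number of HMM (regional)
    rate categories, plus the number of user-defined categories less 2,
    plus one if there is nonzero autocorrelation."""
    return (n_hmm_categories + (n_user_categories - 2)
            + (1 if autocorrelated else 0))

# e.g. 5 HMM rate categories, 2 user-defined categories, autocorrelation on:
print(rate_variation_params(5, 2, True))   # 6
```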
The G (global search) option causes, after the last species is added to the tree, each possible group to be removed and re-added. This improves the result, since the position of every species is reconsidered. It approximately triples the run-time of the program.
The User tree (option U) is read from a file whose default name is intree. The trees can be multifurcating.
If the U (user tree) option is chosen another option appears in the menu, the L option. If it is selected, it signals the program that it should take any branch lengths that are in the user tree and simply evaluate the likelihood of that tree, without further altering those branch lengths. This means that if some branches have lengths and others do not, the program will estimate the lengths of those that do not have lengths given in the user tree. Note that the program Retree can be used to add and remove lengths from a tree.
The U option can read a multifurcating tree. This allows us to test the hypothesis that a certain branch has zero length (we can also do this by using Retree to set the length of that branch to 0.0 when it is present in the tree). By doing a series of runs with different specified lengths for a branch we can plot a likelihood curve for its branch length while allowing all other branches to adjust their lengths to it. If all branches have lengths specified, none of them will be iterated. This is useful to allow a tree produced by another method to have its likelihood evaluated. The L option has no effect and does not appear in the menu if the U option is not used.
The W (Weights) option is invoked in the usual way, with only weights 0 and 1 allowed. It selects a set of sites to be analyzed, ignoring the others. The sites selected are those with weight 1. If the W option is not invoked, all sites are analyzed. The Weights (W) option takes the weights from a file whose default name is "weights". The weights follow the format described in the main documentation file.
The M (multiple data sets) option will ask you whether you want to use multiple sets of weights (from the weights file) or multiple data sets from the input file. The ability to use a single data set with multiple weights means that much less disk space will be used for this input data. The bootstrapping and jackknifing tool Seqboot has the ability to create a weights file with multiple weights. Note also that when we use multiple weights for bootstrapping we can also then maintain different rate categories for different sites in a meaningful way. If you use the multiple data sets option rather than multiple weights, you should not at the same time use the user-defined rate categories option (option C), because the user-defined rate categories could then be associated with the wrong sites. This is not a concern when the M option is used by using multiple weights.
The algorithm used for searching among trees is faster than it was in version 3.5, thanks to a technique invented by David Swofford and J. S. Rogers. This involves not iterating most branch lengths on most trees when searching among tree topologies. This is of necessity a "quick-and-dirty" search, but it saves much time. There is a menu option (option S) which can turn off this search and revert to the earlier search method, which iterated branch lengths in all topologies. This will be substantially slower but will also be a bit more likely to find the tree topology of highest likelihood.
The output starts by giving the number of species, the number of sites, and the base frequencies for A, C, G, and T that have been specified. It then prints out the transition/transversion ratio that was specified or used by default. It also uses the base frequencies to compute the actual transition/transversion ratio implied by the parameter.
If the R (HMM rates) option is used a table of the relative rates of expected substitution at each category of sites is printed, as well as the probabilities of each of those rates.
There then follow the data sequences, if the user has selected the menu option to print them out, with the base sequences printed in groups of ten bases along the lines of the Genbank and EMBL formats. The trees found are printed as an unrooted tree topology (possibly rooted by outgroup if so requested). The internal nodes are numbered arbitrarily for the sake of identification. The number of trees evaluated so far and the log likelihood of the tree are also given. Note that the trees printed out have a trifurcation at the base. The branch lengths in the diagram are roughly proportional to the estimated branch lengths, except that very short branches are printed out at least three characters in length so that the connections can be seen.
A table is printed showing the length of each tree segment (in units of expected nucleotide substitutions per site), as well as (very) rough confidence limits on their lengths. If a confidence limit is negative, this indicates that rearrangement of the tree in that region is not excluded, while if both limits are positive, rearrangement is still not necessarily excluded because the variance calculation on which the confidence limits are based results in an underestimate, which makes the confidence limits too narrow.
In addition to the confidence limits, the program performs a crude Likelihood Ratio Test (LRT) for each branch of the tree. The program computes the ratio of likelihoods with and without this branch length forced to zero length. This is done by comparing the likelihoods changing only that branch length. A truly correct LRT would force that branch length to zero and also allow the other branch lengths to adjust to that. The result would be a likelihood ratio closer to 1. Therefore the present LRT will err on the side of being too significant. YOU ARE WARNED AGAINST TAKING IT TOO SERIOUSLY. If you want to get a better likelihood curve for a branch length you can do multiple runs with different prespecified lengths for that branch, as discussed above in the discussion of the L option.
One should also realize that if you are looking not at a previously-chosen branch but at all branches, that you are seeing the results of multiple tests. With 20 tests, one is expected to reach significance at the P = .05 level purely by chance. You should therefore use a much more conservative significance level, such as .05 divided by the number of tests. The significance of these tests is shown by printing asterisks next to the confidence interval on each branch length. It is important to keep in mind that both the confidence limits and the tests are very rough and approximate, and probably indicate more significance than they should. Nevertheless, maximum likelihood is one of the few methods that can give you any indication of its own error; most other methods simply fail to warn the user that there is any error! (In fact, whole philosophical schools of taxonomists exist whose main point seems to be that there isn't any error, that the "most parsimonious" tree is the best tree by definition and that's that).
The log likelihood printed out with the final tree can be used to perform various likelihood ratio tests. One can, for example, compare runs with different values of the expected transition/transversion ratio to determine which value is the maximum likelihood estimate, and what is the allowable range of values (using a likelihood ratio test, which you will find described in mathematical statistics books). One could also estimate the base frequencies in the same way. Both of these, particularly the latter, require multiple runs of the program to evaluate different possible values, and this might get expensive.
If the U (User Tree) option is used and more than one tree is supplied, and the program is not told to assume autocorrelation between the rates at different sites, the program also performs a statistical test of each of these trees against the one with highest likelihood. If there are two user trees, the test done is one which is due to Kishino and Hasegawa (1989), a version of a test originally introduced by Templeton (1983). In this implementation it uses the mean and variance of log-likelihood differences between trees, taken across sites. If the two trees' means are more than 1.96 standard deviations different then the trees are declared significantly different. This use of the empirical variance of log-likelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for any lack of realism in the model underlying this program.
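The core of the KHT calculation can be sketched from per-site log-likelihoods (an illustration of the idea only, not PHYLIP's exact code; the example numbers are made up):

```python
import math

def kht_test(site_lnl_a, site_lnl_b):
    """Kishino-Hasegawa test sketch: compare two trees using the mean
    and variance, across sites, of their log-likelihood differences."""
    d = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    z = (n * mean) / math.sqrt(n * var)   # summed difference / its s.d.
    return z, abs(z) > 1.96               # significant at about 5% if True

site_lnl_tree1 = [-2.1, -1.9, -2.0, -2.2, -1.8, -2.0]
site_lnl_tree2 = [-2.3, -2.0, -2.1, -2.4, -1.9, -2.2]
z, sig = kht_test(site_lnl_tree1, site_lnl_tree2)
print(round(z, 2), sig)
```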
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. In the version used here the variances and covariances of the sum of log likelihoods across sites are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected log-likelihood, log-likelihoods for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the highest log-likelihood exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.
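A rough sketch of the resampling idea, using site resampling under the "least favorable hypothesis" (this is an approximation in the spirit of the RELL method, not the program's exact procedure; all names and example numbers are made up):

```python
import random

def sh_test(site_lnls, n_reps=1000, seed=7):
    """Shimodaira-Hasegawa-style test sketch: site_lnls[k] holds the
    per-site log-likelihoods for tree k.  Per-site values are centered
    so every tree has the same total (equal means), sites are then
    resampled, and we count how often the resampled shortfall from the
    best tree exceeds the observed shortfall."""
    random.seed(seed)
    n_trees, n_sites = len(site_lnls), len(site_lnls[0])
    totals = [sum(s) for s in site_lnls]
    best = max(totals)
    observed = [best - t for t in totals]
    centered = [[x - totals[k] / n_sites for x in site_lnls[k]]
                for k in range(n_trees)]
    counts = [0] * n_trees
    for _ in range(n_reps):
        idx = [random.randrange(n_sites) for _ in range(n_sites)]
        resampled = [sum(centered[k][i] for i in idx)
                     for k in range(n_trees)]
        top = max(resampled)
        for k in range(n_trees):
            if top - resampled[k] >= observed[k]:
                counts[k] += 1
    return [c / n_reps for c in counts]   # one P value per tree

t0 = [-2.0, -1.9, -2.1, -2.0, -1.8, -2.2]
t1 = [-2.0, -1.9, -2.1, -2.0, -1.8, -2.2]   # identical to the best tree
t2 = [-2.5, -2.4, -2.6, -2.3, -2.2, -2.7]   # clearly worse
pvals = sh_test([t0, t1, t2])
print(pvals[0], pvals[1])   # the best tree (and its twin) get P = 1.0
```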
In either the KHT or the SH test the program prints out a table of the log-likelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the log-likelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one. However the test is not available if we assume that there is autocorrelation of rates at neighboring sites (option A) and is not done in those cases.
The branch lengths printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change, averaged over all sites analyzed, is set to 1.0 if there are multiple categories of sites. This means that whether or not there are multiple categories of sites, the expected fraction of change for very small branches is equal to the branch length. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes occur in the same site and overlie or even reverse each other. The branch length estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the nucleotide sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.
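The saturation effect can be illustrated with the simpler Jukes-Cantor formula (used here only as an illustration; the program's own model is more general):

```python
import math

def expected_difference_jc(t):
    """Expected fraction of sites differing at the two ends of a branch
    of length t, under the Jukes-Cantor model."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

# For very small branches the expected difference is essentially the
# branch length itself.
print(round(expected_difference_jc(0.01), 4))   # 0.0099
# A branch of length 0.26 shows noticeably less than 26% difference,
# because some changes overlie or reverse earlier ones.
print(round(expected_difference_jc(0.26), 3))   # 0.22
```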
Confidence limits on the branch lengths are also given. Of course a negative value of the branch length is meaningless, and a confidence limit overlapping zero simply means that the branch length is not necessarily significantly different from zero. Because of limitations of the numerical algorithm, branch length estimates of zero will often print out as small numbers such as 0.00001. If you see a branch length that small, it is really estimated to be of zero length. Note that versions 2.7 and earlier of this program printed out the branch lengths in terms of expected probability of change, so that they were scaled differently.
Another possible source of confusion is the existence of negative values for the log likelihood. This is not really a problem; the log likelihood is not a probability but the logarithm of a probability. When it is negative it simply means that the corresponding probability is less than one (since we are seeing its logarithm). The log likelihood is maximized by being made more positive: -30.23 is worse than -29.14.
At the end of the output, if the R option is in effect with multiple HMM rates, the program will print a list of what site categories contributed the most to the final likelihood. This combination of HMM rate categories need not have contributed a majority of the likelihood, just a plurality. Still, it will be helpful as a view of where the program infers that the higher and lower rates are. Note that the use in this calculation of the prior probabilities of different rates, and the average patch length, gives this inference a "smoothed" appearance: some other combination of rates might make a greater contribution to the likelihood, but be discounted because it conflicts with this prior information. See the example output below to see what this printout of rate categories looks like. A second list will also be printed out, showing for each site which rate accounted for the highest fraction of the likelihood. If the fraction of the likelihood accounted for is less than 95%, a dot is printed instead.
Option 3 in the menu controls whether the tree is printed out into the output file. This is on by default, and usually you will want to leave it this way. However for runs with multiple data sets such as bootstrapping runs, you will primarily be interested in the trees which are written onto the output tree file, rather than the trees printed on the output file. To keep the output file from becoming too large, it may be wisest to use option 3 to prevent trees being printed onto the output file.
Option 4 in the menu controls whether the tree estimated by the program is written onto a tree file. The default name of this output tree file is "outtree". If the U option is in effect, all the user-defined trees are written to the output tree file.
Option 5 in the menu controls whether ancestral states are estimated at each node in the tree. If it is in effect, a table of ancestral sequences is printed out (including the sequences in the tip species which are the input sequences). In that table, if a site has a base which accounts for more than 95% of the likelihood, it is printed in capital letters (A rather than a). If the best nucleotide accounts for less than 50% of the likelihood, the program prints out an ambiguity code (such as M for "A or C") for the set of nucleotides which, taken together, account for more than half of the likelihood. The ambiguity codes are listed in the sequence programs documentation file. One limitation of the current version of the program is that when there are multiple HMM rates (option R) the reconstructed nucleotides are based on only the single assignment of rates to sites which accounts for the largest amount of the likelihood. Thus the assessment of 95% of the likelihood, in tabulating the ancestral states, refers to 95% of the likelihood that is accounted for by that particular combination of rates.
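A sketch of this printing rule (the helper is hypothetical; the ambiguity codes are the standard IUPAC ones listed in the sequence programs documentation):

```python
IUPAC = {frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
         frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
         frozenset("ACG"): "V", frozenset("ACT"): "H",
         frozenset("AGT"): "D", frozenset("CGT"): "B",
         frozenset("ACGT"): "N"}

def ancestral_symbol(likelihoods):
    """Printed symbol for one site from per-base likelihood fractions:
    capital if the best base has > 0.95, lower case if it has > 0.50,
    otherwise the ambiguity code for the smallest set of bases that
    together account for more than half of the likelihood."""
    ranked = sorted(likelihoods.items(), key=lambda kv: -kv[1])
    base, frac = ranked[0]
    if frac > 0.95:
        return base.upper()
    if frac > 0.50:
        return base.lower()
    chosen, total = [], 0.0
    for b, f in ranked:
        chosen.append(b)
        total += f
        if total > 0.50:
            break
    return IUPAC[frozenset(chosen)]

print(ancestral_symbol({"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}))  # A
print(ancestral_symbol({"A": 0.60, "C": 0.30, "G": 0.05, "T": 0.05}))  # a
print(ancestral_symbol({"A": 0.35, "C": 0.33, "G": 0.20, "T": 0.12}))  # M
```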
The constants defined at the beginning of the program include "maxtrees", the maximum number of user trees that can be processed. It is small (100) at present to save some further memory but the cost of increasing it is not very great. Other constants include "maxcategories", the maximum number of site categories, "namelength", the length of species names in characters, and three others, "smoothings", "iterations", and "epsilon", that help "tune" the algorithm and define the compromise between execution speed and the quality of the branch lengths found by iteratively maximizing the likelihood. Reducing iterations and smoothings, and increasing epsilon, will result in faster execution but a worse result. These values will not usually have to be changed.
The program spends most of its time doing real arithmetic. The algorithm, with separate and independent computations occurring for each pattern, lends itself readily to parallel processing.
This program, which in version 2.6 replaced the old version of Dnaml, is not derived directly from it but instead was developed by modifying Contml, with which it shares many of its data structures and much of its strategy. It was speeded up by two major developments, the use of aliasing of nucleotide sites (version 3.1) and pretabulation of some exponentials (added by Akiko Fuseki in version 3.4). In version 3.5 the Hidden Markov Model code was added and the method of iterating branch lengths was changed from an EM algorithm to direct search. The Hidden Markov Model code slows things down, especially if there is autocorrelation between sites, so this version is slower than version 3.4. Nevertheless we hope that the sacrifice is worth it.
One change that is needed in the future is to put in some way of allowing for base composition of nucleotide sequences in different parts of the phylogeny.
   5   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT
Epsilon   GGGATCTCGGCCC
(It was run with HMM rates having gamma-distributed rates approximated by 5 rate categories, with coefficient of variation of rates 1.0, and with patch length parameter = 1.5. Two user-defined rate categories were used, one for the first 6 sites, the other for the last 7, with rates 1.0 : 2.0. Weights were used, with sites 1 and 13 given weight 0, and all others weight 1.)
Nucleic acid sequence Maximum Likelihood method, version 3.69

 5 species,  13  sites

Site categories are:

            1111112222 222

Sites are weighted as follows:

            01111 11111 110

Name            Sequences
----            ---------

Alpha        AACGTGGCCA AAT
Beta         ..G..C.... ..C
Gamma        C.TT.C.T.. C.A
Delta        GGTA.TT.GG CC.
Epsilon      GGGA.CT.GG CCC

Empirical Base Frequencies:

   A       0.23636
   C       0.29091
   G       0.25455
  T(U)     0.21818

Transition/transversion ratio =   2.000000

Discrete approximation to gamma distributed rates
 Coefficient of variation of rates = 1.000000  (alpha = 1.000000)

State in HMM   Rate of change    Probability

        1           0.264             0.522
        2           1.413             0.399
        3           3.596             0.076
        4           7.086             0.0036
        5          12.641             0.000023

Expected length of a patch of sites having the same rate =    1.500

Site category   Rate of change

    1               1.000
    2               2.000

  +Beta
  |
  |                                               +Epsilon
  |   +----------------------------------------------3
  1---2                                           +-Delta
  |   |
  |   +--Gamma
  |
  +-Alpha

remember: this is an unrooted tree!

Ln Likelihood =   -58.41388

 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------

     1          Alpha          0.32320     (     zero,     0.93246) **
     1          Beta           0.02699     (     zero,     0.49959)
     1             2           0.65789     (     zero,     2.29501)
     2             3           7.11637     (     zero,    20.73855) **
     3          Epsilon        0.00006     (     zero,     0.52703)
     3          Delta          0.30602     (     zero,     0.83268) **
     2          Gamma          0.43465     (     zero,     2.10073)

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01

Combination of categories that contributes the most to the likelihood:

             1122121111 111

Most probable category at each site if > 0.95 probability ("." otherwise)

             .......... ...

Probable sequences at interior nodes:

  node       Reconstructed sequence (caps if > 0.95)

    1        .AgGTCGCCA AA.
 Beta        AAGGTCGCCA AAC
    2        .AkkTcGtCA cA.
    3        .GGATCTCGG CC.
 Epsilon     GGGATCTCGG CCC
 Delta       GGTATTTCGG CCT
 Gamma       CATTTCGTCA CAA
 Alpha       AACGTGGCCA AAT
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program implements the maximum likelihood method for DNA sequences under the constraint that the trees estimated must be consistent with a molecular clock. The molecular clock is the assumption that the tips of the tree are all equidistant, in branch length, from its root. This program is indirectly related to Dnaml. Details of the algorithm are not yet published, but many aspects of it are similar to Dnaml, and these are published in the paper by Felsenstein and Churchill (1996). The model of base substitution allows the expected frequencies of the four bases to be unequal, allows the expected frequencies of transitions and transversions to be unequal, and has several ways of allowing different rates of evolution at different sites.
The assumptions of the model are:
The ratio of the two purines in the purine replacement pool is the same as their ratio in the overall pool, and similarly for the pyrimidines.
The ratios of transitions to transversions can be set by the user. The substitution process can be diagrammed as follows: Suppose that you specified A, C, G, and T base frequencies of 0.24, 0.28, 0.27, and 0.21.
      Purine pool:                   Pyrimidine pool:

     _______________                _______________
    |               |              |               |
    |   0.4706 A    |              |   0.5714 C    |
    |   0.5294 G    |              |   0.4286 T    |
    |   (ratio is   |              |   (ratio is   |
    |  0.24 : 0.27) |              |  0.28 : 0.21) |
    |_______________|              |_______________|
Draw from the overall pool:
     __________________
    |                  |
    |      0.24 A      |
    |      0.28 C      |
    |      0.27 G      |
    |      0.21 T      |
    |__________________|
Note that if the existing base is, say, an A, the first kind of event has a 0.4706 probability of "replacing" it by another A. The second kind of event has a 0.24 chance of replacing it by another A. This rather disconcerting model is used because it has nice mathematical properties that make likelihood calculations far easier. A closely similar, but not precisely identical model having different rates of transitions and transversions has been used by Hasegawa et al. (1985b). The transition probability formulas for the current model were given (with my permission) by Kishino and Hasegawa (1989). Another explanation is available in the paper by Felsenstein and Churchill (1996).
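For the frequencies in the diagram above, the pool probabilities follow directly from the overall base frequencies:

```python
def substitution_pools(freq_a, freq_c, freq_g, freq_t):
    """Conditional probabilities within the purine and pyrimidine
    replacement pools, given the overall base frequencies."""
    purines = {"A": freq_a / (freq_a + freq_g),
               "G": freq_g / (freq_a + freq_g)}
    pyrimidines = {"C": freq_c / (freq_c + freq_t),
                   "T": freq_t / (freq_c + freq_t)}
    return purines, pyrimidines

pur, pyr = substitution_pools(0.24, 0.28, 0.27, 0.21)
print(round(pur["A"], 4), round(pur["G"], 4))   # 0.4706 0.5294
print(round(pyr["C"], 4), round(pyr["T"], 4))   # 0.5714 0.4286
```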
Note the assumption that we are looking at all sites, including those that have not changed at all. It is important not to restrict attention to some sites based on whether or not they have changed; doing that would bias branch lengths by making them too long, and that in turn would cause the method to misinterpret the meaning of those sites that had changed.
This program uses a Hidden Markov Model (HMM) method of inferring different rates of evolution at different sites. This was described in a paper by me and Gary Churchill (1996). It allows us to specify to the program that there will be a number of different possible evolutionary rates, what the prior probabilities of occurrence of each is, and what the average length of a patch of sites all having the same rate is. The rates can also be chosen by the program to approximate a Gamma distribution of rates, or a Gamma distribution plus a class of invariant sites. The program computes the likelihood by summing it over all possible assignments of rates to sites, weighting each by its prior probability of occurrence.
For example, if we have used the C and A options (described below) to specify that there are three possible rates of evolution, 1.0, 2.4, and 0.0, that the prior probabilities of a site having these rates are 0.4, 0.3, and 0.3, and that the average patch length (number of consecutive sites with the same rate) is 2.0, the program will sum the likelihood over all possibilities, but give less weight to those that (say) assign all sites to rate 2.4, or that fail to have consecutive sites that have the same rate.
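For a handful of sites the sum can be written out by brute force, to show how each assignment of rates to sites is weighted by its prior probability (an illustration of the weighting only, assuming the switch-with-probability-1/patch-length scheme described earlier; the per-site likelihood factors themselves are omitted):

```python
from itertools import product

rates = [1.0, 2.4, 0.0]
priors = [0.4, 0.3, 0.3]
patch_len = 2.0

def assignment_prob(assign):
    """Prior probability of one assignment of rate categories to
    consecutive sites: after each site, with probability 1/patch_len a
    new category is drawn from the priors (and may repeat the old one),
    otherwise the category is kept."""
    p = priors[assign[0]]
    switch = 1.0 / patch_len
    for prev, cur in zip(assign, assign[1:]):
        stay = (1.0 - switch) if cur == prev else 0.0
        p *= stay + switch * priors[cur]
    return p

# All 3**4 assignments for 4 sites; these prior weights sum to 1, and
# the likelihood is a sum of per-assignment likelihoods weighted by them.
total = sum(assignment_prob(a) for a in product(range(3), repeat=4))
print(round(total, 10))   # 1.0
```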
The Hidden Markov Model framework for rate variation among sites was independently developed by Yang (1993, 1994, 1995). We have implemented a general scheme for a Hidden Markov Model of rates; we allow the rates and their prior probabilities to be specified arbitrarily by the user, or by a discrete approximation to a Gamma distribution of rates (Yang, 1995), or by a mixture of a Gamma distribution and a class of invariant sites.
This feature effectively removes the artificial assumption that all sites have the same rate, and also means that we need not know in advance the identities of the sites that have a particular rate of evolution.
Another layer of rate variation also is available. The user can assign categories of rates to each site (for example, we might want first, second, and third codon positions in a protein coding sequence to be three different categories). This is done with the categories input file and the C option. We then specify (using the menu) the relative rates of evolution of sites in the different categories. For example, we might specify that first, second, and third positions evolve at relative rates of 1.0, 0.8, and 2.7.
If both user-assigned rate categories and Hidden Markov Model rates are allowed, the program assumes that the actual rate at a site is the product of the user-assigned category rate and the Hidden Markov Model regional rate. (This may not always make perfect biological sense: it would be more natural to assume some upper bound to the rate, as we have discussed in the Felsenstein and Churchill paper). Nevertheless you may want to use both types of rate variation.
Subject to these assumptions, the program is a correct maximum likelihood method. The input is fairly standard, with one addition. As usual the first line of the file gives the number of species and the number of sites.
Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.
The options are selected using an interactive menu. The menu looks like this:
Nucleic acid sequence Maximum Likelihood method with molecular clock, version 3.69

Settings for this run:
  U                 Search for best tree?  Yes
  T        Transition/transversion ratio:  2.0
  F       Use empirical base frequencies?  Yes
  C   One category of substitution rates?  Yes
  R           Rate variation among sites?  constant rate
  G                Global rearrangements?  No
  W                       Sites weighted?  No
  J   Randomize input order of sequences?  No. Use input order
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes
  5   Reconstruct hypothetical sequences?  No

Are these settings correct? (type Y or the letter for one to change)
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.
The options U, W, J, O, M, and 0 are the usual ones. They are described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.
The T option in this program does not stand for Threshold, but instead is the Transition/transversion option. The user is prompted for a real number greater than 0.0, as the expected ratio of transitions to transversions. Note that this is not the ratio of the first to the second kinds of events, but the resulting expected ratio of transitions to transversions. The exact relationship between these two quantities depends on the frequencies in the base pools. The default value of the T parameter if you do not use the T option is 2.0.
The F (Frequencies) option is one which may save users much time. If you want to use the empirical frequencies of the bases, observed in the input sequences, as the base frequencies, you simply use the default setting of the F option. These empirical frequencies are not really the maximum likelihood estimates of the base frequencies, but they will often be close to those values (they are maximum likelihood estimates under a "star" or "explosion" phylogeny). If you change the setting of the F option you will be prompted for the frequencies of the four bases. These must add to 1 and are to be typed on one line separated by blanks, not commas.
The R (Hidden Markov Model rates) option allows the user to approximate a Gamma distribution of rates among sites, or a Gamma distribution plus a class of invariant sites, or to specify how many categories of substitution rates there will be in a Hidden Markov Model of rate variation, and what are the rates and probabilities for each. By repeatedly selecting the R option one toggles among no rate variation, the Gamma, Gamma+I, and general HMM possibilities.
If you choose Gamma or Gamma+I the program will ask how many rate categories you want. If you have chosen Gamma+I, keep in mind that one rate category will be set aside for the invariant class and only the remaining ones used to approximate the Gamma distribution. For the approximation we do not use the quantile method of Yang (1995) but instead use a quadrature method using generalized Laguerre polynomials. This should give a good approximation to the Gamma distribution with as few as 5 or 6 categories.
In the Gamma and Gamma+I cases, the user will be asked to supply the coefficient of variation of the rate of substitution among sites. This is different from the parameters used by Nei and Jin (1990) but related to them: their parameter a is also known as "alpha", the shape parameter of the Gamma distribution. It is related to the coefficient of variation by
CV = 1 / a^(1/2)
or
a = 1 / (CV)^2
(their parameter b is absorbed here by the requirement that time is scaled so that the mean rate of evolution is 1 per unit time, which means that a = b). As we consider cases in which the rates are less variable we should set a larger and larger, as CV gets smaller and smaller.
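The relation between the shape parameter and the coefficient of variation can be checked with a couple of lines of Python (a sketch; the function names are illustrative, not part of PHYLIP):

```python
def cv_from_alpha(alpha):
    # CV = 1 / sqrt(alpha) for a Gamma distribution of rates with mean 1
    return 1.0 / alpha ** 0.5

def alpha_from_cv(cv):
    # the inverse relation: alpha = 1 / CV^2
    return 1.0 / cv ** 2

# a CV of 1.0 corresponds to alpha = 1.0 (the exponential distribution);
# less variable rates mean a smaller CV and a larger alpha
print(alpha_from_cv(1.0))   # 1.0
print(cv_from_alpha(4.0))   # 0.5
```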
If the user instead chooses the general Hidden Markov Model option, they are first asked how many HMM rate categories there will be (for the moment there is an upper limit of 9, which should not be restrictive). Then the program asks for the rates for each category. These rates are only meaningful relative to each other, so that rates 1.0, 2.0, and 2.4 have the exact same effect as rates 2.0, 4.0, and 4.8. Note that an HMM rate category can have rate of change 0, so that this allows us to take into account that there may be a category of sites that are invariant. Note that the run time of the program will be proportional to the number of HMM rate categories: twice as many categories means twice as long a run. Finally the program will ask for the probabilities of a random site falling into each of these regional rate categories. These probabilities must be nonnegative and sum to 1. The default for the program is one category, with rate 1.0 and probability 1.0 (actually the rate does not matter in that case).
If more than one category is specified, then another option, A, becomes visible in the menu. This allows us to specify that we want to assume that sites that have the same HMM rate category are expected to be clustered so that there is autocorrelation of rates. The program asks for the value of the average patch length. This is an expected length of patches that have the same rate. If it is 1, the rates of successive sites will be independent. If it is, say, 10.25, then the chance of change to a new rate will be 1/10.25 after every site. However the "new rate" is randomly drawn from the mix of rates, and hence could even be the same. So the actual observed length of patches with the same rate will be a bit larger than 10.25. Note below that if you choose multiple patches, there will be an estimate in the output file as to which combination of rate categories contributed most to the likelihood.
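The patch process described above can be simulated in a few lines, which may help in choosing a patch length. This is a sketch of the scheme as described, not PHYLIP code; the function name and defaults are illustrative:

```python
import random

def simulate_hmm_rates(nsites, rates, probs, patch_length, seed=1):
    """Sketch of the autocorrelated-rates scheme described above: after each
    site, with probability 1/patch_length a 'new' rate is drawn from the
    prior mix of rates -- so the new rate can equal the old one, and observed
    patches are on average a bit longer than patch_length."""
    rng = random.Random(seed)
    draw = lambda: rng.choices(rates, weights=probs)[0]
    assigned = [draw()]
    for _ in range(nsites - 1):
        if rng.random() < 1.0 / patch_length:
            assigned.append(draw())        # switch (possibly to the same rate)
        else:
            assigned.append(assigned[-1])  # the patch continues
    return assigned

sites = simulate_hmm_rates(50, [0.5, 1.0, 2.0], [0.25, 0.5, 0.25], 10.25)
```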
Note that the autocorrelation scheme we use is somewhat different from Yang's (1995) autocorrelated Gamma distribution. I am unsure whether this difference is of any importance -- our scheme is chosen for the ease with which it can be implemented.
The C option allows user-defined rate categories. The user is prompted for the number of user-defined rates, and for the rates themselves, which must be nonnegative (some could be 0). They are defined relative to each other, so that if rates for three categories are set to 1 : 3 : 2.5 this would have the same meaning as setting them to 2 : 6 : 5. The assignment of rates to sites is then made by reading a file whose default name is "categories". It should contain a string of digits 1 through 9. A new line or a blank can occur after any character in this string. Thus the categories file might look like this:
122231111122411155
1155333333444
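Reading such a categories file amounts to mapping each digit to a user-defined rate, ignoring blanks and newlines. A minimal sketch (the function name is illustrative, not part of PHYLIP):

```python
def site_rates(categories_text, rates):
    """Map the digit string from a 'categories' file to per-site rates.
    Blanks and newlines may occur anywhere in the string, as the
    documentation says; category k uses rates[k-1]."""
    digits = [c for c in categories_text if not c.isspace()]
    return [rates[int(d) - 1] for d in digits]

# rates 1 : 3 : 2.5 would mean the same as 2 : 6 : 5 -- only ratios matter
r = site_rates("122231111122411155\n1155333333444", [1.0, 3.0, 2.5, 0.5, 2.0])
```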
With the current options R, A, and C the program has gained greatly in its ability to infer different rates at different sites and estimate phylogenies under a more realistic model. Note that Likelihood Ratio Tests can be used to test whether one combination of rates is significantly better than another, provided one rate scheme represents a restriction of another with fewer parameters. The number of parameters needed for rate variation is the number of regional rate categories, plus the number of user-defined rate categories less 2, plus one if the regional rate categories have a nonzero autocorrelation.
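The parameter count given in the preceding paragraph, written out as a small function (a sketch; the name and argument names are illustrative):

```python
def rate_variation_params(hmm_categories, user_categories, autocorrelated):
    # regional (HMM) rate categories, plus user-defined rate categories,
    # less 2, plus one if the HMM rates have nonzero autocorrelation,
    # as described in the text above
    return hmm_categories + user_categories - 2 + (1 if autocorrelated else 0)

print(rate_variation_params(5, 2, True))   # 5 + 2 - 2 + 1 = 6
```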
The G (global search) option causes, after the last species is added to the tree, each possible group to be removed and re-added. This improves the result, since the position of every species is reconsidered. It approximately triples the run-time of the program.
The User tree (option U) is read from a file whose default name is intree. The trees can be multifurcating. This allows us to test the hypothesis that a given branch has zero length.
If the U (user tree) option is chosen another option appears in the menu, the L option. If it is selected, it signals the program that it should take any branch lengths that are in the user tree and simply evaluate the likelihood of that tree, without further altering those branch lengths. In the case of a clock, the program cannot hold some branch lengths fixed while estimating others: if any of the branches do not have lengths given in the user tree, the program re-estimates the lengths of all of them. This is done because estimating some and not others is hard in the case of a clock.
The W (Weights) option is invoked in the usual way, with only weights 0 and 1 allowed. It selects a set of sites to be analyzed, ignoring the others. The sites selected are those with weight 1. If the W option is not invoked, all sites are analyzed. The Weights (W) option takes the weights from a file whose default name is "weights". The weights follow the format described in the main documentation file.
The M (multiple data sets) option will ask you whether you want to use multiple sets of weights (from the weights file) or multiple data sets from the input file. The ability to use a single data set with multiple weights means that much less disk space will be used for this input data. The bootstrapping and jackknifing tool Seqboot has the ability to create a weights file with multiple weights. Note also that when we use multiple weights for bootstrapping we can also then maintain different rate categories for different sites in a meaningful way. If you use the multiple data sets option rather than multiple weights, you should not at the same time use the user-defined rate categories option (option C), because the user-defined rate categories could then be associated with the wrong sites. This is not a concern when the M option is used with multiple weights.
The algorithm used for searching among trees is faster than it was in version 3.5, thanks to a technique invented by David Swofford and J. S. Rogers. This involves not iterating most branch lengths on most trees when searching among tree topologies. This is of necessity a "quick-and-dirty" search, but it saves much time.
The output starts by giving the number of species, the number of sites, and the base frequencies for A, C, G, and T that have been specified. It then prints out the transition/transversion ratio that was specified or used by default. It also uses the base frequencies to compute the actual transition/transversion ratio implied by the parameter.
If the R (HMM rates) option is used a table of the relative rates of expected substitution at each category of sites is printed, as well as the probabilities of each of those rates.
There then follow the data sequences, if the user has selected the menu option to print them out, with the base sequences printed in groups of ten bases along the lines of the Genbank and EMBL formats. The trees found are printed as a rooted tree topology. The internal nodes are numbered arbitrarily for the sake of identification. The number of trees evaluated so far and the log likelihood of the tree are also given. The branch lengths in the diagram are roughly proportional to the estimated branch lengths, except that very short branches are printed out at least three characters in length so that the connections can be seen.
A table is printed showing the length of each tree segment, and the time (in units of expected nucleotide substitutions per site) of each fork in the tree, measured from the root of the tree. I have not attempted to include code for approximate confidence limits on branch points, as I have done for branch lengths in Dnaml, both because of the extreme crudeness of that test, and because the variation of times for different forks would be highly correlated.
The log likelihood printed out with the final tree can be used to perform various likelihood ratio tests. One can, for example, compare runs with different values of the expected transition/transversion ratio to determine which value is the maximum likelihood estimate, and what is the allowable range of values (using a likelihood ratio test, which you will find described in mathematical statistics books). One could also estimate the base frequencies in the same way. Both of these, particularly the latter, require multiple runs of the program to evaluate different possible values, and this might get expensive.
This program makes possible a (reasonably) legitimate statistical test of the molecular clock. To do such a test, run Dnaml and Dnamlk on the same data. If the trees obtained are of the same topology (when considered as unrooted), it is legitimate to compare their likelihoods by the likelihood ratio test. In Dnaml the likelihood has been computed by estimating 2n-3 branch lengths, if there are n tips on the tree. In Dnamlk it has been computed by estimating n-1 branching times (in effect, n-1 branch lengths). The difference in the number of parameters is (2n-3)-(n-1) = n-2. To perform the test, take the difference in log likelihoods between the two runs (Dnaml should be the higher of the two, barring numerical iteration difficulties) and double it. Look this up on a chi-square distribution with n-2 degrees of freedom. If the result is significant, the log likelihood has been significantly increased by allowing all 2n-3 branch lengths to be estimated instead of just n-1, and the molecular clock may be rejected.
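The arithmetic of this clock test can be sketched in a few lines of Python. This is only an illustration of the recipe above, with hypothetical log-likelihood values; the critical values are the standard 5% chi-square cutoffs for small degrees of freedom:

```python
def clock_lrt(lnl_dnaml, lnl_dnamlk, n_tips):
    """Sketch of the molecular-clock test described above: the statistic is
    twice the log-likelihood difference, with (2n-3)-(n-1) = n-2 degrees of
    freedom (function and argument names are illustrative)."""
    statistic = 2.0 * (lnl_dnaml - lnl_dnamlk)
    df = n_tips - 2
    # 5% chi-square critical values for small df, from standard tables
    critical = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}
    return statistic, df, statistic > critical[df]

# hypothetical log likelihoods for a 5-species data set
stat, df, reject = clock_lrt(-1227.4, -1232.1, 5)
```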
If the U (User Tree) option is used and more than one tree is supplied, and the program is not told to assume autocorrelation between the rates at different sites, the program also performs a statistical test of each of these trees against the one with highest likelihood. If there are two user trees, the test done is one which is due to Kishino and Hasegawa (1989), a version of a test originally introduced by Templeton (1983). In this implementation it uses the mean and variance of log-likelihood differences between trees, taken across sites. If the two trees' means are more than 1.96 standard deviations different, then the trees are declared significantly different. This use of the empirical variance of log-likelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for any lack of realism in the model underlying this program.
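The core of the Kishino-Hasegawa calculation can be sketched as follows. This is an illustration of the idea described above, not PHYLIP's own code, and the function name is illustrative; the inputs are per-site log likelihoods for the two trees:

```python
import math

def kht_test(site_lnl_a, site_lnl_b):
    """Sketch of the KHT idea above: the total log-likelihood difference is
    compared to its standard deviation, estimated from the variance of the
    per-site differences across sites."""
    d = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    z = (mean * n) / math.sqrt(n * var)   # total difference / its std. dev.
    return z, abs(z) > 1.96               # significant at the 5% level

# per-site log likelihoods for two hypothetical trees
z, significant = kht_test([-1.0, -1.2] * 10, [-1.1] * 20)
```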
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. In the version used here the variances and covariances of the sum of log likelihoods across sites are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected log-likelihood, log-likelihoods for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the highest log-likelihood exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.
In either the KHT or the SH test the program prints out a table of the log-likelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the log-likelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one. However the test is not available if we assume that there is autocorrelation of rates at neighboring sites (option A) and is not done in those cases.
The branch lengths printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change, averaged over all sites analyzed, is set to 1.0 if there are multiple categories of sites. This means that whether or not there are multiple categories of sites, the expected fraction of change for very small branches is equal to the branch length. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes occur in the same site and overlie or even reverse each other. The branch length estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the nucleotide sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.
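The gap between a branch length and the observed sequence difference can be made concrete under the Jukes-Cantor model. That model is a simplification of the one this program uses, so the numbers below are only illustrative of the general point, not exactly what this program would compute:

```python
import math

def expected_difference_jc(branch_length):
    # Under the Jukes-Cantor model, the expected fraction of sites that
    # differ across a branch of length t is (3/4) * (1 - exp(-4t/3)):
    # shorter than t itself, because some changes overlie or reverse others.
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

# a branch of length 0.26 shows clearly less than 26% observed difference,
# while for very short branches the observed difference is nearly t
print(expected_difference_jc(0.26))
print(expected_difference_jc(0.01))
```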
Because of limitations of the numerical algorithm, branch length estimates of zero will often print out as small numbers such as 0.00001. If you see a branch length that small, it is really estimated to be of zero length.
Another possible source of confusion is the existence of negative values for the log likelihood. This is not really a problem; the log likelihood is not a probability but the logarithm of a probability. When it is negative it simply means that the corresponding probability is less than one (since we are seeing its logarithm). The log likelihood is maximized by being made more positive: -30.23 is worse than -29.14.
At the end of the output, if the R option is in effect with multiple HMM rates, the program will print a list of what site categories contributed the most to the final likelihood. This combination of HMM rate categories need not have contributed a majority of the likelihood, just a plurality. Still, it will be helpful as a view of where the program infers that the higher and lower rates are. Note that the use in this calculation of the prior probabilities of different rates, and the average patch length, gives this inference a "smoothed" appearance: some other combination of rates might make a greater contribution to the likelihood, but be discounted because it conflicts with this prior information. See the example output below to see what this printout of rate categories looks like.
A second list will also be printed out, showing for each site which rate accounted for the highest fraction of the likelihood. If the fraction of the likelihood accounted for is less than 95%, a dot is printed instead.
Option 3 in the menu controls whether the tree is printed out into the output file. This is on by default, and usually you will want to leave it this way. However for runs with multiple data sets such as bootstrapping runs, you will primarily be interested in the trees which are written onto the output tree file, rather than the trees printed on the output file. To keep the output file from becoming too large, it may be wisest to use option 3 to prevent trees being printed onto the output file.
Option 4 in the menu controls whether the tree estimated by the program is written onto a tree file. The default name of this output tree file is "outtree". If the U option is in effect, all the user-defined trees are written to the output tree file.
Option 5 in the menu controls whether ancestral states are estimated at each node in the tree. If it is in effect, a table of ancestral sequences is printed out (including the sequences in the tip species which are the input sequences). In that table, if a site has a base which accounts for more than 95% of the likelihood, it is printed in capital letters (A rather than a). If the best nucleotide accounts for less than 50% of the likelihood, the program prints out an ambiguity code (such as M for "A or C") for the set of nucleotides which, taken together, account for more than half of the likelihood. The ambiguity codes are listed in the sequence programs documentation file. One limitation of the current version of the program is that when there are multiple HMM rates (option R) the reconstructed nucleotides are based on only the single assignment of rates to sites which accounts for the largest amount of the likelihood. Thus the assessment of 95% of the likelihood, in tabulating the ancestral states, refers to 95% of the likelihood that is accounted for by that particular combination of rates.
The constants defined at the beginning of the program include "maxtrees", the maximum number of user trees that can be processed. It is small (100) at present to save some further memory but the cost of increasing it is not very great. Other constants include "maxcategories", the maximum number of site categories, "namelength", the length of species names in characters, and three others, "smoothings", "iterations", and "epsilon", that help "tune" the algorithm and define the compromise between execution speed and the quality of the branch lengths found by iteratively maximizing the likelihood. Reducing iterations and smoothings, and increasing epsilon, will result in faster execution but a worse result. These values will not usually have to be changed.
The program spends most of its time doing real arithmetic. The algorithm, with separate and independent computations occurring for each pattern, lends itself readily to parallel processing.
This program was developed in 1989 by combining code from Dnapars and from Dnaml. It was speeded up by two major developments, the use of aliasing of nucleotide sites (version 3.1) and pretabulation of some exponentials (added by Akiko Fuseki in version 3.4). In version 3.5 the Hidden Markov Model code was added and the method of iterating branch lengths was changed from an EM algorithm to direct search. The Hidden Markov Model code slows things down, especially if there is autocorrelation between sites, so this version is slower than version 3.4. Nevertheless we hope that the sacrifice is worth it.
One change that is needed in the future is to put in some way of allowing for base composition of nucleotide sequences in different parts of the phylogeny.
   5   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT
Epsilon   GGGATCTCGGCCC
(It was run with HMM rates having gamma-distributed rates approximated by 5 rate categories, with coefficient of variation of rates 1.0, and with patch length parameter = 1.5. Two user-defined rate categories were used, one for the first 6 sites, the other for the last 7, with rates 1.0 : 2.0. Weights were used, with sites 1 and 13 given weight 0, and all others weight 1.)
Nucleic acid sequence
   Maximum Likelihood method with molecular clock, version 3.69

 5 species,  13  sites

Site categories are:

             1111112222 222

Sites are weighted as follows:

             01111 11111 110

Name            Sequences
----            ---------

Alpha        AACGTGGCCA AAT
Beta         ..G..C.... ..C
Gamma        C.TT.C.T.. C.A
Delta        GGTA.TT.GG CC.
Epsilon      GGGA.CT.GG CCC

Empirical Base Frequencies:

   A       0.23636
   C       0.29091
   G       0.25455
  T(U)     0.21818

Transition/transversion ratio =   2.000000

Discrete approximation to gamma distributed rates
 Coefficient of variation of rates = 1.000000  (alpha = 1.000000)

State in HMM    Rate of change    Probability

        1           0.264            0.522
        2           1.413            0.399
        3           3.596            0.076
        4           7.086            0.0036
        5          12.641            0.000023

Expected length of a patch of sites having the same rate =  1.500

Site category   Rate of change

        1           1.000
        2           2.000

                                                     +-Epsilon
  +--------------------------------------------------4
  !                                                  +-Delta
--3
  !                                            +-------Gamma
  +--------------------------------------------2
                                               !     +-Beta
                                               +-----1
                                                     +-Alpha

Ln Likelihood =   -58.51728

 Ancestor      Node      Node Height     Length
 --------      ----      -----------     ------
 root            3
   3             4          4.14820     4.14820
   4          Epsilon       4.29769     0.14949
   4          Delta         4.29769     0.14949
   3             2          3.67522     3.67522
   2          Gamma         4.29769     0.62247
   2             1          4.12429     0.44907
   1          Beta          4.29769     0.17340
   1          Alpha         4.29769     0.17340

Combination of categories that contributes the most to the likelihood:

             1122121111 111

Most probable category at each site if > 0.95 probability ("." otherwise)

             .......... ...

Probable sequences at interior nodes:

  node       Reconstructed sequence (caps if > 0.95)

    3        .ayrtykcsr cm.
    4        .GkaTctCgg Cc.
 Epsilon     GGGATCTCGG CCC
 Delta       GGTATTTCGG CCT
    2        .AykTcgtcA ca.
 Gamma       CATTTCGTCA CAA
    1        .AcgTcGCCA AA.
 Beta        AAGGTCGCCA AAC
 Alpha       AACGTGGCCA AAT
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
Dnamove is an interactive DNA parsimony program, inspired by David and Wayne Maddison's marvellous program MacClade, which is written for Macintosh computers. Dnamove reads in a data set which is prepared in almost the same format as one for the DNA parsimony program Dnapars. It allows the user to choose an initial tree, and displays this tree on the screen. The user can look at different sites and the way the nucleotide states are distributed on that tree, given the most parsimonious reconstruction of state changes for that particular tree. The user can then specify how the tree is to be rearranged, rerooted, or written out to a file. By looking at different rearrangements of the tree the user can manually search for the most parsimonious tree, and can get a feel for how different sites are affected by changes in the tree topology.
This program uses graphic characters that show the tree to best advantage on some computer systems. Its graphic characters will work best on MSDOS systems or MSDOS windows in Windows, and on any system whose screen or terminals emulate ANSI standard terminals, such as old Digital VT100 terminals, Telnet programs, or VT100-compatible windows in the X windowing system. For any other screen types (such as Macintosh windows) there is a generic option which does not make use of screen graphics characters. The program will work well in those cases, but the tree it displays will look a bit uglier.
The input data file is set up almost identically to the data files for Dnapars. The code for nucleotide sequences is the standard one, as described in the molecular sequence programs document. The user trees are contained in the input tree file which is used for input of the starting tree (if desired). The output tree file is used for the final tree.
The user interaction starts with the program presenting a menu. The menu looks like this:
Interactive DNA parsimony, version 3.69

Settings for this run:
  O                             Outgroup root?  No, use as outgroup species  1
  W                            Sites weighted?  No
  T                   Use Threshold parsimony?  No, use ordinary parsimony
  I              Input sequences interleaved?  Yes
  U  Initial tree (arbitrary, user, specify)?  Arbitrary
  0        Graphics type (IBM PC, ANSI, none)?  ANSI
  S                  Width of terminal screen?  80
  L                 Number of lines on screen?  24

Are these settings correct? (type Y or the letter for one to change)
The O (Outgroup), W (Weights), T (Threshold), and 0 (Graphics type) options are the usual ones and are described in the main documentation file. The I (Interleaved) option is the usual one and is described in the main documentation file and the molecular sequences programs documentation file. The U (initial tree) option allows the user to choose whether the initial tree is to be arbitrary, interactively specified by the user, or read from a tree file. Typing U causes the program to change among the three possibilities in turn. I would recommend that for a first run, you allow the tree to be set up arbitrarily (the default), as the "specify" choice is difficult to use and the "user tree" choice requires that you have available a tree file with the tree topology of the initial tree, which must be a rooted tree. Its default name is intree. If the program looks for the input tree file and does not find one of this name, it will ask you for its name. If you wish to set up some particular tree you can also do that by the rearrangement commands specified below.
The W (Weights) option allows only weights of 0 or 1.
The T (threshold) option allows a continuum of methods between parsimony and compatibility. Thresholds less than or equal to 1.0 do not have any meaning and should not be used: they will result in a tree dependent only on the input order of species and not at all on the data!
The S (Screen width) option allows the width in characters of the display to be adjusted when more than 80 characters can be displayed on the user's screen.
The L (screen Lines) option allows the user to change the height of the screen (in lines of characters) that is assumed to be available on the display. This may be particularly helpful when displaying large trees on terminals that have more than 24 lines per screen, or on workstation or X-terminal screens that can emulate the ANSI terminals with more than 24 lines.
After the initial menu is displayed and the choices are made, the program then sets up an initial tree and displays it. Below it will be a one-line menu of possible commands, which looks like this:
NEXT? (Options: R # + - S . T U W O F C H ? X Q) (H or ? for Help)
If you type H or ? you will get a single screen showing a description of each of these commands in a few words. Here are slightly more detailed descriptions:
As we have seen, the initial menu of the program allows you to choose among three screen types (PCDOS, Ansi, and none). If you want to avoid having to make this choice every time, you can change some of the constants in the file phylip.h to have the terminal type initialize itself in the proper way, and recompile. We have tried to have the default values be correct for PC, Macintosh, and Unix screens. If the setting is "none" (which is necessary on Macintosh MacOS 9 screens), the special graphics characters will not be used to indicate nucleotide states, but only letters will be used for the four nucleotides. This is less easy to look at.
The constants that need attention are ANSICRT and IBMCRT. Currently these are both set to "false" on Macintosh MacOS 9 systems, to "true" on MacOS X and on Unix/Linux systems, and IBMCRT is set to "true" on Windows systems. If your system has an ANSI compatible terminal, you might want to find the definition of ANSICRT in phylip.h and set it to "true", and IBMCRT to "false".
This program carries out unrooted parsimony (analogous to Wagner trees) (Eck and Dayhoff, 1966; Kluge and Farris, 1969) on DNA sequences. The method of Fitch (1971) is used to count the number of changes of base needed on a given tree. The assumptions of this method are exactly analogous to those of MIX:
That these are the assumptions of parsimony methods has been documented in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b), but also read the exchange between Felsenstein and Sober (1986).
Change from an occupied site to a deletion is counted as one change. Reversion from a deletion to an occupied site is allowed and is also counted as one change.
Below is a test data set, but we cannot show the output it generates because of the interactive nature of the program.
   5   13
Alpha     AACGUGGCCA AAU
Beta      AAGGUCGCCA AAC
Gamma     CAUUUCGUCA CAA
Delta     GGUAUUUCGG CCU
Epsilon   GGGAUCUCGG CCC
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program carries out unrooted parsimony (analogous to Wagner trees) (Eck and Dayhoff, 1966; Kluge and Farris, 1969) on DNA sequences. The method of Fitch (1971) is used to count the number of changes of base needed on a given tree. The assumptions of this method are analogous to those of MIX:
That these are the assumptions of parsimony methods has been documented in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
Change from an occupied site to a deletion is counted as one change. Reversion from a deletion to an occupied site is allowed and is also counted as one change. Note that this in effect assumes that a deletion N bases long is N separate events.
Dnapars can handle both bifurcating and multifurcating trees. In doing its search for most parsimonious trees, it adds species not only by creating new forks in the middle of existing branches, but it also tries putting them at the end of new branches which are added to existing forks. Thus it searches among both bifurcating and multifurcating trees. If a branch in a tree does not have any characters which might change in that branch in the most parsimonious tree, it does not save that tree. Thus in any tree that results, a branch exists only if some character has a most parsimonious reconstruction that would involve change in that branch.
It also saves a number of trees tied for best (you can alter the number it saves using the V option in the menu). When rearranging trees, it tries rearrangements of all of the saved trees. This makes the algorithm slower than earlier versions of Dnapars.
The input data is standard. The first line of the input file contains the number of species and the number of sites.
Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.
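A minimal reader for the interleaved form of this format can be sketched as follows. This is an illustration of the layout described above (ten-character blank-filled names, continuation blocks carrying sequence only), not PHYLIP's own parser, and it does no error checking:

```python
def read_interleaved(text):
    """Sketch of a reader for the interleaved format described above.
    The first line gives the species and site counts; the first block of
    lines carries a ten-character name plus the first sequence fragment;
    later blocks carry sequence fragments only, in the same species order."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    nspecies, nsites = (int(x) for x in lines[0].split())
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < nspecies:                 # first block: name + first fragment
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        else:                            # continuation block: sequence only
            seqs[i % nspecies] += line.replace(" ", "")
    return dict(zip(names, seqs))

data = """ 5   13
Alpha     AACGUGGCCA AAU
Beta      AAGGUCGCCA AAC
Gamma     CAUUUCGUCA CAA
Delta     GGUAUUUCGG CCU
Epsilon   GGGAUCUCGG CCC
"""
aln = read_interleaved(data)
```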
The options are selected using an interactive menu. The menu looks like this:
DNA parsimony algorithm, version 3.69

Setting for this run:
  U                 Search for best tree?  Yes
  S                        Search option?  More thorough search
  V              Number of trees to save?  10000
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  N           Use Transversion parsimony?  No, count all steps
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Y to accept these or type the letter for one to change
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.
The S (search) option controls how, and how much, rearrangement is done on the tied trees that are saved by the program. If the "More thorough search" option (the default) is chosen, the program will save multiple tied trees, without collapsing internal branches that have no evidence of change on them. It will subsequently rearrange on all parts of each of those trees. If the "Less thorough search" option is chosen, before saving, the program will collapse all branches that have no evidence that there is any change on that branch. This leads to less attempted rearrangement. If the "Rearrange on one best tree" option is chosen, only the first of the tied trees is used for rearrangement. This is faster but less thorough. If your trees are likely to have large multifurcations, do not use the default "More thorough search" option as it could result in too large a number of trees being saved.
The N option allows you to choose transversion parsimony, which counts only transversions (changes between one of the purines A or G and one of the pyrimidines C or T). This setting is turned off by default.
The Weights (W) option takes the weights from a file whose default name is "weights". The weights follow the format described in the main documentation file, with integer weights from 0 to 35 allowed by using the characters 0, 1, 2, ..., 9 and A, B, ... Z.
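The 0-9, A-Z encoding can be decoded with a one-liner; this sketch is my own illustration of the mapping described above, not PHYLIP code:

```python
def weight_value(ch):
    """Map a weight character to its integer value:
    '0'-'9' give 0-9, and 'A'-'Z' give 10-35."""
    if ch.isdigit():
        return int(ch)
    return ord(ch) - ord("A") + 10
```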
The User tree (option U) is read from a file whose default name is "intree". The trees can be multifurcating. They must be preceded in the file by a line giving the number of trees in the file.
The options J, O, T, M, and 0 are the usual ones. They are described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.
The M (multiple data sets option) will ask you whether you want to use multiple sets of weights (from the weights file) or multiple data sets. The ability to use a single data set with multiple weights means that much less disk space will be used for this input data. The bootstrapping and jackknifing tool Seqboot has the ability to create a weights file with multiple weights.
The O (outgroup) option will have no effect if the U (user-defined tree) option is in effect. The T (threshold) option allows a continuum of methods between parsimony and compatibility. Thresholds less than or equal to 1.0 do not have any meaning and should not be used: they will result in a tree dependent only on the input order of species and not at all on the data!
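The effect of the threshold can be sketched as follows. This is an illustrative summary of my own, assuming each site's contribution to the tree score is simply capped at the threshold value; see the main documentation file for the exact definition:

```python
def threshold_score(steps_per_site, threshold):
    # Each site contributes its number of steps,
    # but never more than the threshold.
    return sum(min(steps, threshold) for steps in steps_per_site)

# With a threshold of 2.0, a site needing 5 steps contributes only 2:
threshold_score([1, 3, 2, 5], 2.0)  # == 7.0
```

As the threshold grows beyond the largest per-site step count, the score reduces to ordinary parsimony.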
Output is standard: if option 1 is toggled on, the data is printed out, with the convention that "." means "the same as in the first species". Then comes a list of equally parsimonious trees. Each tree has branch lengths. These are computed using an algorithm published by Hochbaum and Pathria (1997) which I first heard of from Wayne Maddison who invented it independently of them. This algorithm averages the number of reconstructed changes of state over all sites over all possible most parsimonious placements of the changes of state among branches. Note that it does not correct in any way for multiple changes that overlay each other.
If option 4 is toggled on, a table of the number of changes of state required in each site is also printed. If option 5 is toggled on, a table is printed out after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. This is a reconstruction of the ancestral sequences in the tree. If you choose option 5, a menu item "." appears which gives you the opportunity to turn off dot-differencing so that complete ancestral sequences are shown. If the inferred state is a "?" or one of the IUB ambiguity symbols, there will be multiple equally parsimonious assignments of states; the user must work these out for themselves by hand. A "?" in the reconstructed states means that in addition to one or more bases, a deletion may or may not be present. If option 6 is left in its default state, the trees found will be written to a tree file, so that they are available to be used in other programs.
If the U (User Tree) option is used and more than one tree is supplied, the program also performs a statistical test of each of these trees against the one with the lowest number of steps. If there are two user trees, this is a version of the test proposed by Alan Templeton (1983) and evaluated in a test case by me (1985a). It is closely parallel to a test using log likelihood differences due to Kishino and Hasegawa (1989). It uses the mean and variance of the differences in the number of steps between trees, taken across sites. If the two trees' means are more than 1.96 standard deviations different, then the trees are declared significantly different.
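The arithmetic of this two-tree test can be sketched as follows (an illustration of the idea just described, not PHYLIP's code):

```python
import math

def kht_test(steps1, steps2):
    """steps1, steps2: per-site step counts for the two trees.
    Returns (total difference, its standard deviation, significant?)."""
    d = [a - b for a, b in zip(steps1, steps2)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # per-site sample variance
    sd_total = math.sqrt(n * var)                    # SD of the summed difference
    total = sum(d)
    return total, sd_total, abs(total) > 1.96 * sd_total
```

If one tree needs more steps at nearly every site, the summed difference is many standard deviations from zero and the trees are declared significantly different.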
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. In the version used here the variances and covariances of the sums of steps across sites are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected number of steps, numbers of steps for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the lowest number of steps exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.
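A much-simplified sketch of that resampling idea follows. This is my own illustration with NumPy, not the program's implementation, and it omits the per-tree centering step of the full Shimodaira-Hasegawa procedure:

```python
import numpy as np

def sh_pvalues(steps, nreps=10000, seed=1):
    """steps: (ntrees, nsites) per-site step counts, one row per tree."""
    steps = np.asarray(steps, dtype=float)
    nsites = steps.shape[1]
    totals = steps.sum(axis=1)
    cov = np.cov(steps) * nsites              # covariances of the summed steps
    obs_diff = totals - totals.min()          # each tree versus the best one
    rng = np.random.default_rng(seed)
    # Equal means (the "least favorable hypothesis"), observed covariances:
    sims = rng.multivariate_normal(np.zeros(len(totals)), cov, size=nreps)
    sim_diff = sims - sims.min(axis=1, keepdims=True)
    # Fraction of samples in which the difference reaches the observed one:
    return (sim_diff >= obs_diff).mean(axis=0)
```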
In either the KHT or the SH test the program prints out a table of the number of steps for each tree, the differences of each from the lowest one, the variance of that quantity as determined by the differences of the numbers of steps at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one.
Option 6 in the menu controls whether the tree estimated by the program is written onto a tree file. The default name of this output tree file is "outtree". If the U option is in effect, all the user-defined trees are written to the output tree file. If the program finds multiple trees tied for best, all of these are written out onto the output tree file. Each is followed by a numerical weight in square brackets (such as [0.25000]). This is needed when we use the trees to make a consensus tree of the results of bootstrapping or jackknifing, to avoid overrepresenting replicates that find many tied trees.
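Writing tied trees with equal bracketed weights in that style can be sketched like this (a hypothetical helper of my own; the placement of the weight after each tree follows the description above):

```python
def write_tied_trees(path, newicks):
    """newicks: tied trees as Newick strings without the trailing semicolon."""
    weight = 1.0 / len(newicks)               # equal weights summing to 1
    with open(path, "w") as f:
        for tree in newicks:
            f.write(f"{tree}[{weight:.5f}];\n")

write_tied_trees("outtree", ["((A,B),C)", "((A,C),B)",
                             "(A,(B,C))", "(A,B,C)"])
```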
The program is a straightforward relative of MIX and runs reasonably quickly, especially with many sites and few species.
   5   13
Alpha     AACGUGGCCAAAU
Beta      AAGGUCGCCAAAC
Gamma     CAUUUCGUCACAA
Delta     GGUAUUUCGGCCU
Epsilon   GGGAUCUCGGCCC
DNA parsimony algorithm, version 3.69 5 species, 13 sites Name Sequences ---- --------- Alpha AACGUGGCCA AAU Beta ..G..C.... ..C Gamma C.UU.C.U.. C.A Delta GGUA.UU.GG CC. Epsilon GGGA.CU.GG CCC One most parsimonious tree found: +-----Epsilon +----------------------------3 +------------2 +-------Delta | | | +----------------Gamma | 1----Beta | +---------Alpha requires a total of 19.000 between and length ------- --- ------ 1 2 0.217949 2 3 0.487179 3 Epsilon 0.096154 3 Delta 0.134615 2 Gamma 0.275641 1 Beta 0.076923 1 Alpha 0.173077 steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 2 1 3 2 0 2 1 1 1 10| 1 1 1 3 From To Any Steps? State at upper node ( . means same as in the node below it on tree) 1 AABGTCGCCA AAY 1 2 yes V.KD...... C.. 2 3 yes GG.A..T.GG .C. 3 Epsilon maybe ..G....... ..C 3 Delta yes ..T..T.... ..T 2 Gamma yes C.TT...T.. ..A 1 Beta maybe ..G....... ..C 1 Alpha yes ..C..G.... ..T
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
Dnapenny is a program that will find all of the most parsimonious trees implied by your data when the nucleic acid sequence parsimony criterion is employed. It does so not by examining all possible trees, but by using the more sophisticated "branch and bound" algorithm, a standard computer science search strategy first applied to phylogenetic inference by Hendy and Penny (1982). (J. S. Farris [personal communication, 1975] had also suggested that this strategy, which is well-known in computer science, might be applied to phylogenies, but he did not publish this suggestion).
There is, however, a price to be paid for the certainty that one has found all members of the set of most parsimonious trees. The problem of finding these has been shown (Graham and Foulds, 1982; Day, 1983) to be NP-complete, which is equivalent to saying that there is no fast algorithm that is guaranteed to solve the problem in all cases (for a discussion of NP-completeness, see the Scientific American article by Lewis and Papadimitriou, 1978). The result is that this program, despite its algorithmic sophistication, is VERY SLOW.
The program should be slower than the other tree-building programs in the package, but usable up to about ten species. Above this it will bog down rapidly, but exactly when depends on the data and on how much computer time you have. IT IS VERY IMPORTANT FOR YOU TO GET A FEEL FOR HOW LONG THE PROGRAM WILL TAKE ON YOUR DATA. This can be done by running it on subsets of the species, increasing the number of species in the run until you either are able to treat the full data set or know that the program will take unacceptably long on it. (Making a plot of the logarithm of run time against species number may help to project run times).
The search strategy used by Dnapenny starts by making a tree consisting of the first two species (the first three if the tree is to be unrooted). Then it tries to add the next species in all possible places (there are three of these). For each of the resulting trees it evaluates the number of base substitutions. It adds the next species to each of these, again in all possible places. If this process were to continue, it would simply generate all possible trees, of which there are a very large number even when the number of species is moderate (34,459,425 with 10 species). Actually it does not do this, because the trees are generated in a particular order and some of them are never generated.
This is because the order in which trees are generated is not quite as implied above, but is a "depth-first search". This means that first one adds the third species in the first possible place, then the fourth species in its first possible place, then the fifth and so on until the first possible tree has been produced. For each tree the number of steps is evaluated. Then one "backtracks" by trying the alternative placements of the last species. When these are exhausted one tries the next placement of the next-to-last species. The order of placement in a depth-first search is like this for a four-species case (parentheses enclose monophyletic groups):
Make tree of first two species: (A,B)
Add C in first place: ((A,B),C)
Add D in first place: (((A,D),B),C)
Add D in second place: ((A,(B,D)),C)
Add D in third place: (((A,B),D),C)
Add D in fourth place: ((A,B),(C,D))
Add D in fifth place: (((A,B),C),D)
Add C in second place: ((A,C),B)
Add D in first place: (((A,D),C),B)
Add D in second place: ((A,(C,D)),B)
Add D in third place: (((A,C),D),B)
Add D in fourth place: ((A,C),(B,D))
Add D in fifth place: (((A,C),B),D)
Add C in third place: (A,(B,C))
Add D in first place: ((A,D),(B,C))
Add D in second place: (A,((B,D),C))
Add D in third place: (A,(B,(C,D)))
Add D in fourth place: (A,((B,C),D))
Add D in fifth place: ((A,(B,C)),D)
Among these fifteen trees you will find all of the four-species rooted trees, each exactly once (the parentheses each enclose a monophyletic group). As displayed above, the backtracking depth-first search algorithm is just another way of producing all possible trees one at a time. The branch and bound algorithm consists of this with one change. As each tree is constructed, including the partial trees such as (A,(B,C)), its number of steps is evaluated. In addition a prediction is made as to how many steps will be added, at a minimum, as further species are added.
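The sequential-addition scheme just listed can be sketched in a few lines (an illustration of my own, not Dnapenny's code): attach each new species on every branch of every partial tree, and above the root.

```python
def placements(tree, sp):
    """Yield every tree made by attaching sp on one branch of `tree`
    or above its root (the five placements per three-species tree above)."""
    yield (tree, sp)                          # above the root
    if isinstance(tree, tuple):               # recurse into both subtrees
        left, right = tree
        for t in placements(left, sp):
            yield (t, right)
        for t in placements(right, sp):
            yield (left, t)

def all_rooted_trees(species):
    trees = [(species[0], species[1])]        # tree of the first two species
    for sp in species[2:]:
        trees = [t for partial in trees for t in placements(partial, sp)]
    return trees

len(all_rooted_trees(["A", "B", "C", "D"]))   # the fifteen four-species trees
```

A depth-first search visits the same trees one at a time instead of holding them all in a list, which is what makes the bounding step below possible.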
This is done by counting how many sites which are invariant in the data up to the most recent species added will ultimately show variation when further species are added. Thus if 20 sites vary among species A, B, and C and their root, and if tree ((A,C),B) requires 24 steps, then if there are 8 more sites which will be seen to vary when species D is added, we can immediately say that no matter how we add D, the resulting tree can have no less than 24 + 8 = 32 steps. The point of all this is that if a previously-found tree such as ((A,B),(C,D)) required only 30 steps, then we know that there is no point in even trying to add D to ((A,C),B). We have computed the bound that enables us to cut off a whole line of inquiry (in this case five trees) and avoid going down that particular branch any farther.
The branch-and-bound algorithm thus allows us to find all most parsimonious trees without generating all possible trees. How much of a saving this is depends strongly on the data. For very clean (nearly "Hennigian") data, it saves much time, but on very messy data it will still take a very long time.
The algorithm in the program differs from the one outlined here in some essential details: it investigates possibilities in the order of their apparent promise. This applies to the order of addition of species, and to the places where they are added to the tree. After the first two-species tree is constructed, the program tries adding each of the remaining species in turn, each in the best possible place it can find. Whichever of those species adds (at a minimum) the most additional steps is taken to be the one to be added next to the tree. When it is added, it is added in turn to places which cause the fewest additional steps to be added. This sounds a bit complex, but it is done with the intention of eliminating regions of the search of all possible trees as soon as possible, and lowering the bound on tree length as quickly as possible. This process of evaluating which species to add in which order goes on the first time the search makes a tree; thereafter it uses that order.
The program keeps a list of all the most parsimonious trees found so far. Whenever it finds one that has fewer steps than these, it clears out the list and restarts the list with that tree. In the process the bound tightens and fewer possibilities need be investigated. At the end the list contains all the shortest trees. These are then printed out. It should be mentioned that the program Clique for finding all largest cliques also works by branch-and-bound. Both problems are NP-complete, but for some reason Clique runs far faster. Although their worst-case behavior is bad for both programs, those worst cases occur far more frequently in parsimony problems than in compatibility problems.
Among the quantities available to be set from the menu of Dnapenny, two (howoften and howmany) are of particular importance. As Dnapenny goes along it will keep count of how many trees it has examined. Suppose that howoften is 100 and howmany is 1000, the default settings. Every time 100 trees have been examined, Dnapenny will print out a line saying how many multiples of 100 trees have now been examined, how many steps the most parsimonious tree found so far has, how many trees with that number of steps have been found, and a very rough estimate of what fraction of all trees have been looked at so far.
When the number of these multiples printed out reaches the number howmany (say 1000), the whole algorithm aborts and prints out that it has not found all most parsimonious trees, but prints out what it has found so far anyway. These trees need not be any of the most parsimonious trees: they are simply the most parsimonious ones found so far. By setting the product (howoften times howmany) large you can make the algorithm less likely to abort, but then you risk getting bogged down in a gigantic computation. You should adjust these constants so that the program cannot go beyond examining the number of trees you are reasonably willing to pay for (or wait for). At their initial settings the program will abort after looking at 100,000 trees. Obviously you may want to adjust howoften in order to get more or fewer lines of intermediate notice of how many trees have been looked at so far. Of course, in small cases you may never even reach the first multiple of howoften, and nothing will be printed out except some headings and then the final trees.
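The reporting and abort logic can be sketched as follows (a toy illustration of the description above, not PHYLIP's code); with the default settings of 100 and 1000 it gives up after 100,000 trees:

```python
def examine_trees(tree_stream, howoften=100, howmany=1000):
    """Report every `howoften` trees; abort after `howmany` reports.
    Returns True if the search ran to completion, False if it aborted."""
    reports = 0
    for count, _tree in enumerate(tree_stream, start=1):
        if count % howoften == 0:
            reports += 1
            print(f"examined {count} trees so far")
            if reports >= howmany:
                print("not all most parsimonious trees have been found")
                return False
    return True
```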
The indication of the approximate percentage of trees searched so far will be helpful in judging how much farther you would have to go to get the full search. Actually, since that fraction is the fraction of the set of all possible trees searched or ruled out so far, and since the search becomes progressively more efficient, the approximate fraction printed out will usually be an underestimate of how far along the program is, sometimes a serious underestimate.
A constant at the beginning of the program that affects the result is "maxtrees", which controls the maximum number of trees that can be stored. Thus if maxtrees is 25, and 32 most parsimonious trees are found, only the first 25 of these are stored and printed out. If maxtrees is increased, the program does not run any slower but requires a little more intermediate storage space. I recommend that maxtrees be kept as large as you can, provided you are willing to look at an output with that many trees on it! Initially, maxtrees is set to 100 in the distribution copy.
The counting of the length of trees is done by an algorithm nearly identical to the corresponding algorithms in Dnapars, and thus the remainder of this document will be nearly identical to the Dnapars document.
This program carries out unrooted parsimony (analogous to Wagner trees) (Eck and Dayhoff, 1966; Kluge and Farris, 1969) on DNA sequences. The method of Fitch (1971) is used to count the number of changes of base needed on a given tree. The assumptions of this method are exactly analogous to those of Dnapars:
Change from an occupied site to a deletion is counted as one change. Reversion from a deletion to an occupied site is allowed and is also counted as one change. Note that this in effect assumes that a deletion N bases long is N separate events.
That these are the assumptions of parsimony methods has been documented in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b), but also read the exchange between Felsenstein and Sober (1986).
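Fitch's (1971) counting method mentioned above can be sketched for a single site. This is my own illustration, with trees as nested tuples and leaves as species names: at each internal node take the intersection of the two child state sets if it is nonempty, otherwise take the union and count one change.

```python
def fitch_steps(tree, site):
    """Return (possible states, steps) at the root for one site.
    `site` maps each leaf name to its base at that site."""
    if isinstance(tree, str):                         # a leaf: one known state
        return {site[tree]}, 0
    (lset, lsteps), (rset, rsteps) = (fitch_steps(sub, site) for sub in tree)
    common = lset & rset
    if common:                                        # children agree: no step
        return common, lsteps + rsteps
    return lset | rset, lsteps + rsteps + 1           # disagreement: one step

# A site where A,B share one base and C,D share another: one change suffices.
fitch_steps((("A", "B"), ("C", "D")),
            {"A": "A", "B": "A", "C": "G", "D": "G"})  # ({'A', 'G'}, 1)
```

Summing the step counts over all sites gives the tree length that the branch-and-bound search compares against its bound.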
The input data is standard. The first line of the input file contains the number of species and the number of sites. If the Weights option is being used, there must also be a W in this first line to signal its presence. There are only two options requiring information to be present in the input file, W (Weights) and U (User tree). All options other than W (including U) are invoked using the menu.
Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks, but there must be no extra blanks at the end of a line. Note that a blank is not a valid symbol for a deletion.
The options are selected using an interactive menu. The menu looks like this:
Penny algorithm for DNA, version 3.69
 branch-and-bound to find all most parsimonious trees

Settings for this run:
  H        How many groups of 100 trees:  1000
  F        How often to report, in trees:  100
  S           Branch and bound is simple?  Yes
  O                        Outgroup root?  No, use as outgroup species 1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.
The options O, T, W, M, and 0 are the usual ones. They are described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.
The T (threshold) option allows a continuum of methods between parsimony and compatibility. Thresholds less than or equal to 1.0 do not have any meaning and should not be used: they will result in a tree dependent only on the input order of species and not at all on the data!
The W (Weights) option allows only weights of 0 or 1.
The options H, F, and S are not found in the other molecular sequence programs. H (How many) allows the user to set the quantity howmany, which we have already seen controls the number of times that the program will report on its progress. F allows the user to set the quantity howoften, which sets how often it will report -- after scanning how many trees.
The S (Simple) option alters a step in Dnapenny which reconsiders the order in which species are added to the tree. Normally the decision as to what species to add to the tree next is made as the first tree is being constructed; that ordering of species is not altered subsequently. The S option causes it to be continually reconsidered. This will probably result in a substantial increase in run time, but on some data sets of intermediate messiness it may help. It is included in case it might prove of use on some data sets. The Simple option, in which the ordering is kept the same after being established by trying alternatives during the construction of the first tree, is the default. Continual reconsideration can be selected as an alternative.
Output is standard: if option 1 is toggled on, the data is printed out, with the convention that "." means "the same as in the first species". Then comes a list of equally parsimonious trees, and (if option 4 is toggled on) a table of the number of changes of state required in each site. If option 5 is toggled on, a table is printed out after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" or one of the IUB ambiguity symbols, there will be multiple equally parsimonious assignments of states; the user must work these out for themselves by hand. A "?" in the reconstructed states means that in addition to one or more bases, a deletion may or may not be present. If option 6 is left in its default state, the trees found will be written to a tree file, so that they are available to be used in other programs. If the program finds multiple trees tied for best, all of these are written out onto the output tree file. Each is followed by a numerical weight in square brackets (such as [0.25000]). This is needed when we use the trees to make a consensus tree of the results of bootstrapping or jackknifing, to avoid overrepresenting replicates that find many tied trees.
   8    6
Alpha1    AAGAAG
Alpha2    AAGAAG
Beta1     AAGGGG
Beta2     AAGGGG
Gamma1    AGGAAG
Gamma2    AGGAAG
Delta     GGAGGA
Epsilon   GGAAAG
Penny algorithm for DNA, version 3.69 branch-and-bound to find all most parsimonious trees 8 species, 6 sites Name Sequences ---- --------- Alpha1 AAGAAG Alpha2 ...... Beta1 ...GG. Beta2 ...GG. Gamma1 .G.... Gamma2 .G.... Delta GGAGGA Epsilon GGA... requires a total of 8.000 9 trees in all found +--------------------Alpha1 ! ! +-----------Alpha2 ! ! 1 +-----4 +--Epsilon ! ! ! +-----6 ! ! ! ! +--Delta ! ! +--5 +--2 ! +--Gamma2 ! +-----7 ! +--Gamma1 ! ! +--Beta2 +--------------3 +--Beta1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? State at upper node ( . means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 2 no ...... 2 4 no ...... 4 Alpha2 no ...... 4 5 yes .G.... 5 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 7 no ...... 7 Gamma2 no ...... 7 Gamma1 no ...... 2 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... +--------------------Alpha1 ! ! +-----------Alpha2 ! ! 1 +-----4 +--------Gamma2 ! ! ! ! ! ! +--7 +--Epsilon ! ! ! +--6 +--2 +--5 +--Delta ! ! ! +-----Gamma1 ! ! +--Beta2 +--------------3 +--Beta1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? State at upper node ( . means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 2 no ...... 2 4 no ...... 4 Alpha2 no ...... 4 7 yes .G.... 7 Gamma2 no ...... 7 5 no ...... 5 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 Gamma1 no ...... 2 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... +--------------------Alpha1 ! ! +-----------Alpha2 ! ! 1 +-----4 +-----Gamma2 ! ! ! +--7 ! ! ! ! ! +--Epsilon ! ! +--5 +--6 +--2 ! +--Delta ! ! ! +--------Gamma1 ! ! +--Beta2 +--------------3 +--Beta1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? 
State at upper node ( . means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 2 no ...... 2 4 no ...... 4 Alpha2 no ...... 4 5 yes .G.... 5 7 no ...... 7 Gamma2 no ...... 7 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 Gamma1 no ...... 2 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... +--------------------Alpha1 ! 1 +-----------------Alpha2 ! ! ! ! +--------Gamma2 +--2 ! ! +-----7 +--Epsilon ! ! ! +--6 ! ! +--5 +--Delta +--4 ! ! +-----Gamma1 ! ! +--Beta2 +-----------3 +--Beta1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? State at upper node ( . means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 2 no ...... 2 Alpha2 no ...... 2 4 no ...... 4 7 yes .G.... 7 Gamma2 no ...... 7 5 no ...... 5 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 Gamma1 no ...... 4 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... +--------------------Alpha1 ! ! +-----------------Alpha2 1 ! ! ! +--Epsilon ! ! +-----6 +--2 ! +--Delta ! +-----5 ! ! ! +--Gamma2 ! ! +-----7 +--4 +--Gamma1 ! ! +--Beta2 +-----------3 +--Beta1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? State at upper node ( . means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 2 no ...... 2 Alpha2 no ...... 2 4 no ...... 4 5 yes .G.... 5 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 7 no ...... 7 Gamma2 no ...... 7 Gamma1 no ...... 4 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... +--------------------Alpha1 ! ! +-----------------Alpha2 1 ! ! ! +-----Gamma2 ! ! +--7 +--2 ! ! +--Epsilon ! +-----5 +--6 ! ! ! +--Delta ! ! ! +--4 +--------Gamma1 ! ! +--Beta2 +-----------3 +--Beta1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? 
State at upper node ( . means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 2 no ...... 2 Alpha2 no ...... 2 4 no ...... 4 5 yes .G.... 5 7 no ...... 7 Gamma2 no ...... 7 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 Gamma1 no ...... 4 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... +--------------------Alpha1 ! ! +-----Alpha2 1 +-----------2 ! ! ! +--Beta2 ! ! +--3 +--4 +--Beta1 ! ! +--------Gamma2 ! ! +--------7 +--Epsilon ! +--6 +--5 +--Delta ! +-----Gamma1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? State at upper node ( . means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 4 no ...... 4 2 no ...... 2 Alpha2 no ...... 2 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... 4 7 yes .G.... 7 Gamma2 no ...... 7 5 no ...... 5 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 Gamma1 no ...... +--------------------Alpha1 ! ! +-----Alpha2 1 +-----------2 ! ! ! +--Beta2 ! ! +--3 ! ! +--Beta1 +--4 ! +-----Gamma2 ! +--7 ! ! ! +--Epsilon +--------5 +--6 ! +--Delta ! +--------Gamma1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? State at upper node ( . means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 4 no ...... 4 2 no ...... 2 Alpha2 no ...... 2 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... 4 5 yes .G.... 5 7 no ...... 7 Gamma2 no ...... 7 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 Gamma1 no ...... +--------------------Alpha1 ! ! +-----Alpha2 1 +-----------2 ! ! ! +--Beta2 ! ! +--3 ! ! +--Beta1 +--4 ! +--Epsilon ! +-----6 ! ! +--Delta +--------5 ! +--Gamma2 +-----7 +--Gamma1 remember: this is an unrooted tree! steps in each site: 0 1 2 3 4 5 6 7 8 9 *----------------------------------------- 0| 1 1 1 2 2 1 From To Any Steps? State at upper node ( . 
means same as in the node below it on tree) 1 AAGAAG 1 Alpha1 no ...... 1 4 no ...... 4 2 no ...... 2 Alpha2 no ...... 2 3 yes ...GG. 3 Beta2 no ...... 3 Beta1 no ...... 4 5 yes .G.... 5 6 yes G.A... 6 Epsilon no ...... 6 Delta yes ...GGA 5 7 no ...... 7 Gamma2 no ...... 7 Gamma1 no ...... |
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
Penny is a program that will find all of the most parsimonious trees implied by your data. It does so not by examining all possible trees, but by using the more sophisticated "branch and bound" algorithm, a standard computer science search strategy first applied to phylogenetic inference by Hendy and Penny (1982). (J. S. Farris [personal communication, 1975] had also suggested that this strategy, which is well-known in computer science, might be applied to phylogenies, but he did not publish this suggestion).
There is, however, a price to be paid for the certainty that one has found all members of the set of most parsimonious trees. The problem of finding these has been shown (Graham and Foulds, 1982; Day, 1983) to be NP-complete, which is equivalent to saying that there is no fast algorithm that is guaranteed to solve the problem in all cases (for a discussion of NP-completeness, see the Scientific American article by Lewis and Papadimitriou, 1978). The result is that this program, despite its algorithmic sophistication, is VERY SLOW.
The program should be slower than the other tree-building programs in the package, but usable up to about ten species. Above this it will bog down rapidly, but exactly when depends on the data and on how much computer time you have. IT IS VERY IMPORTANT FOR YOU TO GET A FEEL FOR HOW LONG THE PROGRAM WILL TAKE ON YOUR DATA. This can be done by running it on subsets of the species, increasing the number of species in the run until you either are able to treat the full data set or know that the program will take unacceptably long on it. (Making a plot of the logarithm of run time against species number may help to project run times).
The search strategy used by Penny starts by making a tree consisting of the first two species (the first three if the tree is to be unrooted). Then it tries to add the next species in all possible places (there are three of these). For each of the resulting trees it evaluates the number of steps. It adds the next species to each of these, again in all possible places. If this process were continued it would simply generate all possible trees, of which there are a very large number even when the number of species is moderate (34,459,425 with 10 species). Actually it does not do this, because the trees are generated in a particular order and some of them are never generated.
Actually the order in which trees are generated is not quite as implied above, but is a "depth-first search". This means that first one adds the third species in the first possible place, then the fourth species in its first possible place, then the fifth and so on until the first possible tree has been produced. Its number of steps is evaluated. Then one "backtracks" by trying the alternative placements of the last species. When these are exhausted one tries the next placement of the next-to-last species. The order of placement in a depth-first search is like this for a four-species case (parentheses enclose monophyletic groups):
Make tree of first two species (A,B)
  Add C in first place ((A,B),C)
    Add D in first place (((A,D),B),C)
    Add D in second place ((A,(B,D)),C)
    Add D in third place (((A,B),D),C)
    Add D in fourth place ((A,B),(C,D))
    Add D in fifth place (((A,B),C),D)
  Add C in second place ((A,C),B)
    Add D in first place (((A,D),C),B)
    Add D in second place ((A,(C,D)),B)
    Add D in third place (((A,C),D),B)
    Add D in fourth place ((A,C),(B,D))
    Add D in fifth place (((A,C),B),D)
  Add C in third place (A,(B,C))
    Add D in first place ((A,D),(B,C))
    Add D in second place (A,((B,D),C))
    Add D in third place (A,(B,(C,D)))
    Add D in fourth place (A,((B,C),D))
    Add D in fifth place ((A,(B,C)),D)
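The enumeration just listed can be sketched in a few lines (an illustration, not Penny's code; trees are nested tuples, and the order in which placements are tried may differ slightly from the listing above):

```python
# Depth-first enumeration of rooted bifurcating trees: each new species
# is tried in every possible place on every tree built so far.

def insertions(tree, species):
    """Yield every tree obtainable by attaching `species` to one branch
    of `tree`, including the branch above the current root."""
    yield (tree, species)                      # attach above the root
    if isinstance(tree, tuple):
        left, right = tree
        for t in insertions(left, species):
            yield (t, right)                   # attach inside the left subtree
        for t in insertions(right, species):
            yield (left, t)                    # attach inside the right subtree

def all_trees(species):
    """Enumerate all rooted bifurcating trees for the given species."""
    trees = [(species[0], species[1])]         # tree of the first two species
    for s in species[2:]:
        trees = [t2 for t in trees for t2 in insertions(t, s)]
    return trees

print(len(all_trees(["A", "B", "C", "D"])))    # 15, as in the listing above
```

With a fifth species each of the fifteen trees accepts seven placements, giving 105 trees, and so on; this is the explosion that branch and bound avoids.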
Among these fifteen trees you will find all of the four-species rooted bifurcating trees, each exactly once (the parentheses each enclose a monophyletic group). As displayed above, the backtracking depth-first search algorithm is just another way of producing all possible trees one at a time. The branch and bound algorithm consists of this with one change. As each tree is constructed, including the partial trees such as (A,(B,C)), its number of steps is evaluated. In addition a prediction is made as to how many steps will be added, at a minimum, as further species are added.
This is done by counting how many binary characters which are invariant in the data up to the species most recently added will ultimately show variation when further species are added. Thus if 20 characters vary among species A, B, and C and their root, and if tree ((A,C),B) requires 24 steps, then if there are 8 more characters which will be seen to vary when species D is added, we can immediately say that no matter how we add D, the resulting tree can have no less than 24 + 8 = 32 steps. The point of all this is that if a previously-found tree such as ((A,B),(C,D)) required only 30 steps, then we know that there is no point in even trying to add D to ((A,C),B). We have computed the bound that enables us to cut off a whole line of inquiry (in this case five trees) and avoid going down that particular branch any farther.
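The arithmetic in this example reduces to a one-line test (function and argument names are illustrative). The inequality is strict, so that partial trees which could still tie the current best are kept, since the program wants all most parsimonious trees:

```python
# Prune a partial tree when even its best possible completion cannot
# beat or tie the best complete tree found so far.
def can_prune(steps_so_far, min_future_steps, best_so_far):
    return steps_so_far + min_future_steps > best_so_far

# ((A,C),B) needs 24 steps, 8 more characters must change once D is
# added, and a complete tree with 30 steps is already known:
print(can_prune(24, 8, 30))   # True: 24 + 8 = 32 > 30, skip all five placements
```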
The branch-and-bound algorithm thus allows us to find all most parsimonious trees without generating all possible trees. How much of a saving this is depends strongly on the data. For very clean (nearly "Hennigian") data, it saves much time, but on very messy data it will still take a very long time.
The algorithm in the program differs from the one outlined here in some essential details: it investigates possibilities in the order of their apparent promise. This applies to the order of addition of species, and to the places where they are added to the tree. After the first two-species tree is constructed, the program tries adding each of the remaining species in turn, each in the best possible place it can find. Whichever of those species adds (at a minimum) the most additional steps is taken to be the one to be added next to the tree. When it is added, it is added in turn to places which cause the fewest additional steps to be added. This sounds a bit complex, but it is done with the intention of eliminating regions of the search of all possible trees as soon as possible, and lowering the bound on tree length as quickly as possible.
The program keeps a list of all the most parsimonious trees found so far. Whenever it finds one that has fewer steps than these, it clears out the list and restarts the list with that tree. In the process the bound tightens and fewer possibilities need be investigated. At the end the list contains all the shortest trees. These are then printed out. It should be mentioned that the program Clique for finding all largest cliques also works by branch-and-bound. Both problems are NP-complete but for some reason Clique runs far faster. Although their worst-case behavior is bad for both programs, those worst cases occur far more frequently in parsimony problems than in compatibility problems.
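The bookkeeping described here can be sketched as follows (an illustration, not the program's code):

```python
# Maintain the list of tied best trees: a better tree clears the list
# and tightens the bound; a tie is appended.

def record(tree, steps, best_trees, best_steps):
    if steps < best_steps:
        return [tree], steps          # new optimum: restart the list
    if steps == best_steps:
        best_trees.append(tree)       # tie: keep this tree as well
    return best_trees, best_steps

best_trees, best_steps = [], float("inf")
for tree, steps in [("t1", 34), ("t2", 32), ("t3", 33), ("t4", 32)]:
    best_trees, best_steps = record(tree, steps, best_trees, best_steps)
print(best_trees, best_steps)   # ['t2', 't4'] 32
```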
Among the quantities available to be set at the beginning of a run of Penny, two (howoften and howmany) are of particular importance. As Penny goes along it will keep count of how many trees it has examined. Suppose that howoften is 100 and howmany is 1000, the default settings. Every time 100 trees have been examined, Penny will print out a line saying how many multiples of 100 trees have now been examined, how many steps the most parsimonious tree found so far has, how many trees with that number of steps have been found, and a very rough estimate of what fraction of all trees have been looked at so far.
When the number of these multiples printed out reaches the number howmany (say 1000), the whole algorithm aborts and prints out that it has not found all most parsimonious trees, but prints out what it has got so far anyway. These trees need not be any of the most parsimonious trees: they are simply the most parsimonious ones found so far. By setting the product (howoften times howmany) large you can make the algorithm less likely to abort, but then you risk getting bogged down in a gigantic computation. You should adjust these constants so that the program cannot go beyond examining the number of trees you are reasonably willing to wait for. In their initial setting the program will abort after looking at 100,000 trees. Obviously you may want to adjust howoften in order to get more or fewer lines of intermediate notice of how many trees have been looked at so far. Of course, in small cases you may never even reach the first multiple of howoften and nothing will be printed out except some headings and then the final trees.
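In outline the reporting and abort rule works as sketched below (the variable names follow the documentation; this is not the program's code):

```python
# One progress line per `howoften` trees; abort after `howmany` such lines,
# i.e. after howoften * howmany trees in total.

def make_counter(howoften=100, howmany=1000):
    state = {"trees": 0, "multiples": 0}
    def count_one_tree():
        """Call once per tree examined; False means 'abort the search'."""
        state["trees"] += 1
        if state["trees"] % howoften == 0:
            state["multiples"] += 1          # a progress line is printed here
            if state["multiples"] >= howmany:
                return False                 # report the best trees found so far
        return True
    return count_one_tree

count = make_counter()      # defaults: abort after 100 * 1000 = 100,000 trees
```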
The indication of the approximate percentage of trees searched so far will be helpful in judging how much farther you would have to go to get the full search. Actually, since that fraction is the fraction of the set of all possible trees searched or ruled out so far, and since the search becomes progressively more efficient, the approximate fraction printed out will usually be an underestimate of how far along the program is, sometimes a serious underestimate.
A constant at the beginning of the program that affects the result is "maxtrees", which controls the maximum number of trees that can be stored. Thus if "maxtrees" is 25, and 32 most parsimonious trees are found, only the first 25 of these are stored and printed out. If "maxtrees" is increased, the program does not run any slower but requires a little more intermediate storage space. I recommend that "maxtrees" be kept as large as you can, provided you are willing to look at an output with that many trees on it! Initially, "maxtrees" is set to 100 in the distribution copy.
The counting of the length of trees is done by an algorithm nearly identical to the corresponding algorithms in Mix, and thus the remainder of this document will be nearly identical to the Mix document. Mix is a general parsimony program which carries out the Wagner and Camin-Sokal parsimony methods in mixture, where each character can have its method specified. The program defaults to carrying out Wagner parsimony.
The Camin-Sokal parsimony method explains the data by assuming that changes 0 --> 1 are allowed but not changes 1 --> 0. Wagner parsimony allows both kinds of changes. (This under the assumption that 0 is the ancestral state, though the program allows reassignment of the ancestral state, in which case we must reverse the state numbers 0 and 1 throughout this discussion). The criterion is to find the tree which requires the minimum number of changes. The Camin-Sokal method is due to Camin and Sokal (1965) and the Wagner method to Eck and Dayhoff (1966) and to Kluge and Farris (1969).
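For a single binary character the two counting rules can be sketched as follows (an illustration, not the program's code), taking 0 as ancestral for Camin-Sokal and leaving the ancestral state unknown for Wagner, as the program does by default:

```python
# Count the minimum number of changes for one binary character on a
# rooted tree of nested 2-tuples with 0/1 leaf states.

def wagner_steps(tree):
    """Wagner parsimony: both 0->1 and 1->0 allowed (Fitch counting)."""
    def fitch(t):
        if not isinstance(t, tuple):
            return {t}, 0
        (l, cl), (r, cr) = fitch(t[0]), fitch(t[1])
        if l & r:
            return l & r, cl + cr
        return l | r, cl + cr + 1        # no shared state: one more change
    return fitch(tree)[1]

def camin_sokal_steps(tree):
    """Camin-Sokal: only 0->1 allowed, so a change can only be placed on
    a branch whose whole subtree shows state 1; the count is the number
    of such maximal all-1 subtrees."""
    def walk(t):
        if not isinstance(t, tuple):
            return t == 1, (1 if t == 1 else 0)
        (a1, g1), (a2, g2) = walk(t[0]), walk(t[1])
        if a1 and a2:
            return True, 1               # one change covers both sides
        return False, g1 + g2
    return walk(tree)[1]

# A character that arose once and then reverted in one lineage:
tree = ((1, 1), (1, 0))
print(wagner_steps(tree), camin_sokal_steps(tree))   # 1 2
```

The example shows the difference between the criteria: Wagner explains the pattern with one gain and one loss (or one loss from a state-1 ancestor), while Camin-Sokal, forbidden to reverse, must postulate two separate gains.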
Here are the assumptions of these two methods:
That these are the assumptions of parsimony methods has been documented in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b), but also read the exchange between Felsenstein and Sober (1986).
The input for Penny is the standard input for discrete characters programs, described above in the documentation file for the discrete-characters programs. States "?", "P", and "B" are allowed.
The options are selected using a menu:
Penny algorithm, version 3.696
 branch-and-bound to find all most parsimonious trees

Settings for this run:
  X                     Use Mixed method?  No
  P                     Parsimony method?  Wagner
  F        How often to report, in trees:  100
  H        How many groups of  100 trees:  1000
  O                        Outgroup root?  No, use as outgroup species  1
  S           Branch and bound is simple?  Yes
  T              Use Threshold parsimony?  No, use ordinary parsimony
  A   Use ancestral states in input file?  No
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4     Print out steps in each character  No
  5     Print states at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
The options X, O, T, A, and M are the usual miXed Methods, Outgroup, Threshold, Ancestral States, and Multiple Data Sets options. They are described in the Main documentation file and in the Discrete Characters Programs documentation file. The O option is only acted upon if the final tree is unrooted.
The option P toggles between the Camin-Sokal parsimony criterion and the Wagner parsimony criterion. Options F and H reset the variables howoften (F) and howmany (H). The user is prompted for the new values. By setting these larger the program will report its progress less often (howoften) and will run longer (howmany times howoften). These values default to 100 and 1000, so that the search is abandoned after examining 100,000 trees, but these can be changed. Note that option F in this program is not the Factors option available in some of the other programs in this section of the package.
The A (Ancestral states) option works in the usual way, described in the Discrete Characters Programs documentation file. If the A option is not used, then the program will assume 0 as the ancestral state for those characters following the Camin-Sokal method, and will assume that the ancestral state is unknown for those characters following Wagner parsimony. If any characters have unknown ancestral states, and if the resulting tree is rooted (even by outgroup), a table will be printed out showing the best guesses of which are the ancestral states in each character.
The S (Simple) option alters a step in Penny which reconsiders the order in which species are added to the tree. Normally the decision as to what species to add to the tree next is made as the first tree is being constructed; that ordering of species is not altered subsequently. The S option causes it to be continually reconsidered. This will probably result in a substantial increase in run time, but on some data sets of intermediate messiness it may help. It is included in case it might prove of use on some data sets. The Simple option, in which the ordering is kept the same after being established by trying alternatives during the construction of the first tree, is the default. Continual reconsideration can be selected as an alternative.
The F (Factors) option is not available in this program, as it would have no effect on the result even if that information were provided in the input file.
The final output is standard: a set of trees, which will be printed as rooted or unrooted depending on which is appropriate, and if the user elects to see them, tables of the number of changes of state required in each character. If the Wagner option is in force for a character, it may not be possible to unambiguously locate the places on the tree where the changes occur, as there may be multiple possibilities. A table is available to be printed out after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" there will be multiple equally-parsimonious assignments of states; the user must work these out for themselves by hand.
If the Camin-Sokal parsimony method (selected with the P option) is invoked and the A option is also used, then the program will infer, for any character whose ancestral state is unknown ("?"), whether the ancestral state 0 or 1 will give the fewest state changes. If these are tied, then it may not be possible for the program to infer the state in the internal nodes, and these will all be printed as ".". If this has happened and you want to know more about the states at the internal nodes, you will find it helpful to use Move to display the tree and examine its interior states, as the algorithm in Move shows all that can be known in this case about the interior states, including where there is and is not ambiguity. The algorithm in Penny gives up more easily on displaying these states.
If the A option is not used, then the program will assume 0 as the ancestral state for those characters following the Camin-Sokal method, and will assume that the ancestral state is unknown for those characters following Wagner parsimony. If any characters have unknown ancestral states, and if the resulting tree is rooted (even by outgroup), a table will be printed out showing the best guesses of which are the ancestral states in each character. You will find it useful to understand the difference between the Camin-Sokal parsimony criterion with unknown ancestral state and the Wagner parsimony criterion.
If option 6 is left in its default state the trees found will be written to a tree file, so that they are available to be used in other programs. If the program finds multiple trees tied for best, all of these are written out onto the output tree file. Each is followed by a numerical weight in square brackets (such as [0.25000]). This is needed when we use the trees to make a consensus tree of the results of bootstrapping or jackknifing, to avoid overrepresenting replicates that find many tied trees.
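The weights can be generated as sketched below; the function name and exact line layout are illustrative, but each of k tied trees gets weight 1/k:

```python
# Write each of a set of tied trees with an equal bracketed weight, so
# that the set as a whole counts as one tree in a later consensus.

def weighted_tree_lines(newicks):
    w = 1.0 / len(newicks)
    return ["%s[%.5f];" % (t, w) for t in newicks]

for line in weighted_tree_lines(["(A,(B,C))", "((A,B),C)",
                                 "(B,(A,C))", "((A,C),B)"]):
    print(line)          # each of the four trees is written with [0.25000]
```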
At the beginning of the program are a series of constants, which can be changed to help adapt the program to different computer systems. Two are the initial values of howmany and howoften, constants "often" and "many". Constant "maxtrees" is the maximum number of tied trees that will be stored.
    7    6
Alpha1    110110
Alpha2    110110
Beta1     110000
Beta2     110000
Gamma1    100110
Delta     001001
Epsilon   001110
Penny algorithm, version 3.69
 branch-and-bound to find all most parsimonious trees

 7 species,   6 characters

Wagner parsimony method


Name          Characters
----          ----------

Alpha1        11011 0
Alpha2        11011 0
Beta1         11000 0
Beta2         11000 0
Gamma1        10011 0
Delta         00100 1
Epsilon       00111 0


requires a total of      8.000

     3 trees in all found


  [tree drawing: (Alpha1,((Alpha2,((Epsilon,Delta),Gamma1)),(Beta2,Beta1)))]

remember: this is an unrooted tree!

steps in each character:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       1   1   1   2   2   1

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)

          1                11011 0
   1   Alpha1         no   ..... .
   1      2           no   ..... .
   2      4           no   ..... .
   4   Alpha2         no   ..... .
   4      5           yes  .0... .
   5      6           yes  0.1.. .
   6   Epsilon        no   ..... .
   6   Delta          yes  ...00 1
   5   Gamma1         no   ..... .
   2      3           yes  ...00 .
   3   Beta2          no   ..... .
   3   Beta1          no   ..... .


  [tree drawing: (Alpha1,(Alpha2,(((Epsilon,Delta),Gamma1),(Beta2,Beta1))))]

remember: this is an unrooted tree!

steps in each character:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       1   1   1   2   2   1

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)

          1                11011 0
   1   Alpha1         no   ..... .
   1      2           no   ..... .
   2   Alpha2         no   ..... .
   2      4           no   ..... .
   4      5           yes  .0... .
   5      6           yes  0.1.. .
   6   Epsilon        no   ..... .
   6   Delta          yes  ...00 1
   5   Gamma1         no   ..... .
   4      3           yes  ...00 .
   3   Beta2          no   ..... .
   3   Beta1          no   ..... .


  [tree drawing: (Alpha1,((Alpha2,(Beta2,Beta1)),((Epsilon,Delta),Gamma1)))]

remember: this is an unrooted tree!

steps in each character:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       1   1   1   2   2   1

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)

          1                11011 0
   1   Alpha1         no   ..... .
   1      4           no   ..... .
   4      2           no   ..... .
   2   Alpha2         no   ..... .
   2      3           yes  ...00 .
   3   Beta2          no   ..... .
   3   Beta1          no   ..... .
   4      5           yes  .0... .
   5      6           yes  0.1.. .
   6   Epsilon        no   ..... .
   6   Delta          yes  ...00 1
   5   Gamma1         no   ..... .
This program carries out the Dollo and polymorphism parsimony methods. The Dollo parsimony method was first suggested in print by Le Quesne (1974) and was first well-specified by Farris (1977). The method is named after Louis Dollo since he was one of the first to assert that in evolution it is harder to gain a complex feature than to lose it. The algorithm explains the presence of the state 1 by allowing up to one forward change 0-->1 and as many reversions 1-->0 as are necessary to explain the pattern of states seen. The program attempts to minimize the number of 1-->0 reversions necessary.
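For a single binary character the Dollo count can be sketched as follows. This is only an illustration (Dollop itself uses a two-pass algorithm over the tree, and handles ancestral-state and polymorphism options not shown here): the single 0-->1 origination is placed just above the deepest node spanning every leaf with state 1, and each maximal all-0 subtree within that region then costs one 1-->0 reversion.

```python
# Minimum number of 1->0 reversions for one binary character under
# Dollo parsimony, on a rooted tree of nested 2-tuples with 0/1 leaves.

def dollo_reversions(tree):
    def walk(t):
        # (any state-1 leaf below?, reversions needed if this node is 1)
        if not isinstance(t, tuple):
            return t == 1, 0
        (h1, z1), (h2, z2) = walk(t[0]), walk(t[1])
        if not (h1 or h2):
            return False, 0
        # a side with no 1s at all needs one reversion at its base
        return True, (z1 if h1 else 1) + (z2 if h2 else 1)

    def at_mrca(t):
        if not isinstance(t, tuple):
            return 0
        h1, h2 = walk(t[0])[0], walk(t[1])[0]
        if h1 and h2:
            return walk(t)[1]       # the origination goes just above here
        if h1 or h2:
            return at_mrca(t[0] if h1 else t[1])
        return 0                    # state 1 never occurs: no origination

    return at_mrca(tree)

# The sample data set and most parsimonious tree from the Dollop example
# later in this document (total of 3.000 reversions):
data = {"Alpha": "110110", "Beta": "110000", "Gamma": "100110",
        "Delta": "001001", "Epsilon": "001110"}
tree = ("Delta", ("Epsilon", ("Gamma", ("Beta", "Alpha"))))

def states(t, i):
    return (states(t[0], i), states(t[1], i)) if isinstance(t, tuple) \
        else int(data[t][i])

print(sum(dollo_reversions(states(tree, i)) for i in range(6)))   # 3
```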
The assumptions of this method are in effect:
One problem can arise when using additive binary recoding to represent a multistate character as a series of two-state characters. Unlike the Camin-Sokal, Wagner, and Polymorphism methods, the Dollo method can reconstruct ancestral states which do not exist. An example is given in my 1979 paper. It will be necessary to check the output to make sure that this has not occurred.
The polymorphism parsimony method was first used by me, and the results published (without a clear specification of the method) by Inger (1967). The method was independently published by Farris (1978a) and by me (1979). The method assumes that we can explain the pattern of states by no more than one origination (0-->1) of state 1, followed by retention of polymorphism along as many segments of the tree as are necessary, followed by loss of state 0 or of state 1 where necessary. The program tries to minimize the total number of polymorphic characters, where each polymorphism is counted once for each segment of the tree in which it is retained.
The assumptions of the polymorphism parsimony method are in effect:
That these are the assumptions of parsimony methods has been documented in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b), but also read the exchange between Felsenstein and Sober (1986).
The input format is the standard one, with "?", "P", "B" states allowed. The options are selected using a menu:
Dollo and polymorphism parsimony algorithm, version 3.69

Settings for this run:
  U                 Search for best tree?  Yes
  P                     Parsimony method?  Dollo
  J     Randomize input order of species?  No. Use input order
  T              Use Threshold parsimony?  No, use ordinary parsimony
  A   Use ancestral states in input file?  No
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4     Print out steps in each character  No
  5     Print states at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
The options U, J, T, A, and M are the usual User Tree, Jumble, Threshold, Ancestral States, and Multiple Data Sets options, described either in the main documentation file or in the Discrete Characters Programs documentation file. The A (Ancestral States) option allows implementation of the unordered Dollo parsimony and unordered polymorphism parsimony methods which I have described elsewhere (1984b). When the A option is used the ancestor is not to be counted as one of the species. The O (outgroup) option is not available since the tree produced is already rooted. Since the Dollo and polymorphism methods produce a rooted tree, the user-defined trees required by the U option have two-way forks at each level.
The P (Parsimony Method) option is the one that toggles between polymorphism parsimony and Dollo parsimony. The program defaults to Dollo parsimony.
The T (Threshold) option has already been described in the Discrete Characters programs documentation file. Setting T at or below 1.0 but above 0 causes the criterion to become compatibility rather than polymorphism parsimony, although there is no advantage to using this program instead of MIX to do a compatibility method. Setting the threshold value higher brings about an intermediate between the Dollo or polymorphism parsimony methods and the compatibility method, so that there is some rationale for doing that.
Using a threshold value of 1.0 or lower, but above 0, one can obtain a rooted (or, if the A option is used with ancestral states of "?", unrooted) compatibility criterion, but there is no particular advantage to using this program for that instead of MIX. Higher threshold values are of course meaningful and provide intermediates between Dollo and compatibility methods.
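In outline, the threshold criterion caps each character's contribution (for this program, its count of reversions) at the threshold value T; a sketch, not the program's code:

```python
# Threshold parsimony: each character contributes at most T to the score.

def threshold_score(per_character_counts, T):
    return sum(min(c, T) for c in per_character_counts)

# With T = 1.0 any character needing one or more reversions contributes
# exactly 1, so minimizing the score maximizes the number of characters
# needing none -- which is why the criterion becomes compatibility:
print(threshold_score([0, 0, 3, 1], 1.0))   # 0 + 0 + 1 + 1 = 2.0
```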
The X (Mixed parsimony methods) option is not available in this program. The Factors option is also not available in this program, as it would have no effect on the result even if that information were provided in the input file.
Output is standard: a list of equally parsimonious trees, and, if the user selects menu option 4, a table of the numbers of reversions or retentions of polymorphism necessary in each character. If any of the ancestral states has been specified to be unknown, a table of reconstructed ancestral states is also provided. When reconstructing the placement of forward changes and reversions under the Dollo method, keep in mind that each polymorphic state in the input data will require one "last minute" reversion. This is included in the tabulated counts. Thus if we have both states 0 and 1 at a tip of the tree the program will assume that the lineage had state 1 up to the last minute, and then state 0 arose in that population by reversion, without loss of state 1.
If the user selects menu option 5, a table is printed out after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" there may be multiple equally-parsimonious assignments of states; the user must work these out for themselves by hand.
If the A option is used, then the program will infer, for any character whose ancestral state is unknown ("?") whether the ancestral state 0 or 1 will give the best tree. If these are tied, then it may not be possible for the program to infer the state in the internal nodes, and these will all be printed as ".". If this has happened and you want to know more about the states at the internal nodes, you will find it helpful to use Dolmove to display the tree and examine its interior states, as the algorithm in Dolmove shows all that can be known in this case about the interior states, including where there is and is not ambiguity. The algorithm in Dollop gives up more easily on displaying these states.
If the U (User Tree) option is used and more than one tree is supplied, the program also performs a statistical test of each of these trees against the best tree. This test is a version of the test proposed by Alan Templeton (1983), evaluated in a test case by me (1985a). It is closely parallel to a test using log likelihood differences invented by Kishino and Hasegawa (1989), and uses the mean and variance of step differences between trees, taken across characters. If the mean is more than 1.96 standard deviations different then the trees are declared significantly different. The program prints out a table of the steps for each tree, the differences of each from the highest one, the variance of that quantity as determined by the step differences at individual characters, and a conclusion as to whether that tree is or is not significantly worse than the best one. It is important to understand that the test assumes that all the binary characters are evolving independently, which is unlikely to be true for many suites of morphological characters.
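In outline the two-tree test works as sketched below; this is an illustration of the idea, and the program's own computation may differ in detail:

```python
# KHT-style test: compare the total step difference between two trees
# with 1.96 standard deviations, the variance being estimated from the
# per-character step differences.
import math

def kht_significantly_worse(steps_a, steps_b):
    """steps_a, steps_b: steps required in each character on two trees."""
    diffs = [a - b for a, b in zip(steps_a, steps_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # per-character variance
    sd_total = math.sqrt(n * var)                        # sd of the summed difference
    return abs(sum(diffs)) > 1.96 * sd_total

print(kht_significantly_worse([2, 3, 2, 4], [2, 3, 2, 4]))   # False
```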
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. In the version used here the variances and covariances of the sums of steps across characters are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected number of steps, numbers of steps for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the lowest number of steps exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.
In either the KHT or the SH test the program prints out a table of the number of steps for each tree, the differences of each from the lowest one, the variance of that quantity as determined by the differences of the numbers of steps at individual characters, and a conclusion as to whether that tree is or is not significantly worse than the best one.
If option 6 is left in its default state the trees found will be written to a tree file, so that they are available to be used in other programs. If the program finds multiple trees tied for best, all of these are written out onto the output tree file. Each is followed by a numerical weight in square brackets (such as [0.25000]). This is needed when we use the trees to make a consensus tree of the results of bootstrapping or jackknifing, to avoid overrepresenting replicates that find many tied trees.
The algorithm is a fairly simple adaptation of the one used in the program Sokal, which was formerly in this package and has been superseded by Mix. It requires two passes through each tree to count the numbers of reversions.
    5    6
Alpha     110110
Beta      110000
Gamma     100110
Delta     001001
Epsilon   001110
Dollo and polymorphism parsimony algorithm, version 3.69

Dollo parsimony method

 5 species,   6 characters

Name         Characters
----         ----------

Alpha        11011 0
Beta         11000 0
Gamma        10011 0
Delta        00100 1
Epsilon      00111 0

One most parsimonious tree found:

  +-----------Delta
--3
  !  +--------Epsilon
  +--4
     !  +-----Gamma
     +--2
        !  +--Beta
        +--1
           +--Alpha

requires a total of      3.000

reversions in each character:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       0   0   1   1   1   0

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)

root      3           yes  ..1.. .
   3   Delta          yes  ..... 1
   3      4           yes  ...11 .
   4   Epsilon        no   ..... .
   4      2           yes  1.0.. .
   2   Gamma          no   ..... .
   2      1           yes  .1... .
   1   Beta           yes  ...00 .
   1   Alpha          no   ..... .
Dolmove is an interactive parsimony program which uses the Dollo and Polymorphism parsimony criteria. It was inspired by Wayne Maddison and David Maddison's marvellous program MacClade, which was written for Macintosh computers. Dolmove reads in a data set which is prepared in almost the same format as one for the Dollo and polymorphism parsimony program Dollop. It allows the user to choose an initial tree, and displays this tree on the screen. The user can look at different characters and the way their states are distributed on that tree, given the most parsimonious reconstruction of state changes for that particular tree. The user then can specify how the tree is to be rearranged, rerooted or written out to a file. By looking at different rearrangements of the tree the user can manually search for the most parsimonious tree, and can get a feel for how different characters are affected by changes in the tree topology.
This program is compatible with fewer computer systems than the other programs in PHYLIP. It can be adapted to PCDOS systems or to any system whose screen or terminals emulate DEC VT100 terminals (such as Telnet programs for logging in to remote computers over a TCP/IP network, VT100-compatible windows in the X windowing system, and any terminal compatible with ANSI standard terminals). For any other screen types, there is a generic option which does not make use of screen graphics characters to display the character states. This will be less effective, as the states will be less easy to see when displayed.
The input data file is set up almost identically to the input file for Dollop.
The user interaction starts with the program presenting a menu. The menu looks like this:
Interactive Dollo or polymorphism parsimony, version 3.69

Settings for this run:
  P                          Parsimony method?  Dollo
  A                      Use ancestral states?  No
  F                   Use factors information?  No
  W                            Sites weighted?  No
  T                   Use Threshold parsimony?  No, use ordinary parsimony
  A        Use ancestral states in input file?  No
  U  Initial tree (arbitrary, user, specify)?   Arbitrary
  0        Graphics type (IBM PC, ANSI, none)?  ANSI
  L                Number of lines on screen?   24
  S                 Width of terminal screen?   80

Are these settings correct? (type Y or the letter for one to change)
The P (Parsimony Method) option is the one that toggles between polymorphism parsimony and Dollo parsimony. The program defaults to Dollo parsimony.
The T (Threshold), F (Factors), A (Ancestors), and 0 (Graphics type) options are the usual ones and are described in the main documentation page and in the Discrete Characters Program documentation page.
The F (Factors) option is used to inform the program which groups of characters are to be counted together in computing the number of characters compatible with the tree. Thus if three binary characters are all factors of the same multistate character, the multistate character will be counted as compatible with the tree only if all three factors are compatible with it.
The X (miXed methods) option is not available in Dolmove.
The usual W (Weights) option is available in Dolmove. It allows integer weights up to 36, using the symbols 0-9 and A-Z. Increased weight on a step increases both the number of parsimony steps on the character and the contribution it makes to the number of compatibilities.
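The weight symbols can be decoded with a simple mapping. Here is a sketch in Python (an illustration, not part of PHYLIP itself), assuming the conventional encoding in which "0" through "9" stand for weights 0-9 and "A" through "Z" for 10-35:

```python
def weight_value(symbol):
    """Decode a PHYLIP weight symbol: '0'-'9' map to 0-9, 'A'-'Z' to 10-35."""
    if symbol.isdigit():
        return int(symbol)
    if 'A' <= symbol <= 'Z':
        return ord(symbol) - ord('A') + 10
    raise ValueError("invalid weight symbol: %r" % symbol)

# decode a weights line, one symbol per character in the data
weights = [weight_value(c) for c in "019AZ"]   # [0, 1, 9, 10, 35]
```

A weight of 0 effectively drops a character from the analysis; any larger value multiplies both the step count and the compatibility contribution of that character.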
The T (threshold) option allows a continuum of methods between parsimony and compatibility. Thresholds less than or equal to 0 do not have any meaning and should not be used: they will result in a tree dependent only on the input order of species and not at all on the data!
The U (initial tree) option allows the user to choose whether the initial tree is to be arbitrary, interactively specified by the user, or read from a tree file. Typing U causes the program to change among the three possibilities in turn. I would recommend that for a first run, you allow the tree to be set up arbitrarily (the default), as the "specify" choice is difficult to use and the "user tree" choice requires that you have available a tree file with the tree topology of the initial tree. Its default name is intree. The program will ask you for its name if it looks for the input tree file and does not find one of this name. If you wish to set up some particular tree you can also do that by the rearrangement commands specified below.
The S (Screen width) option allows the width in characters of the display to be adjusted when more than 80 characters can be displayed on the user's screen.
The L (screen Lines) option allows the user to change the height of the screen (in lines of characters) that is assumed to be available on the display. This may be particularly helpful when displaying large trees on terminals that have more than 24 lines per screen, or on workstation or X-terminal screens that can emulate the ANSI terminals with more than 24 lines.
After the initial menu is displayed and the choices are made, the program then sets up an initial tree and displays it. Below it will be a one-line menu of possible commands, which looks like this:
NEXT? (Options: R # + - S . T U W O F C H ? X Q) (H or ? for Help)
If you type H or ? you will get a single screen showing a description of each of these commands in a few words. Here are slightly more detailed descriptions:
If the A option is used, then the program will infer, for any character whose ancestral state is unknown ("?") whether the ancestral state 0 or 1 will give the fewest changes (according to the criterion in use). If these are tied, then it may not be possible for the program to infer the state in the internal nodes, and many of these will be shown as "?". If the A option is not used, then the program will assume 0 as the ancestral state.
When reconstructing the placement of forward changes and reversions under the Dollo method, keep in mind that each polymorphic state in the input data will require one "last minute" reversion. This is included in the counts. Thus if we have both states 0 and 1 at a tip of the tree the program will assume that the lineage had state 1 up to the last minute, and then state 0 arose in that population by reversion, without loss of state 1.
When Dolmove calculates the number of characters compatible with the tree, it will take the F option into account and count the multistate characters as units, counting a character as compatible with the tree only when all of the binary characters corresponding to it are compatible with the tree.
As we have seen, the initial menu of the program allows you to choose among three screen types (PCDOS, ANSI, and none). If you want to avoid having to make this choice every time, you can change some of the constants in the file phylip.h to have the terminal type initialize itself in the proper way, and recompile. We have tried to have the default values be correct for PC, Macintosh, and Unix screens. If the setting is "none" (which is necessary on Macintosh MacOS 9 screens), the special graphics characters will not be used to display the character states; ordinary printable characters will be used instead. This is less easy to look at.
The constants that need attention are ANSICRT and IBMCRT. Currently these are both set to "false" on Macintosh MacOS 9 systems, to "true" on MacOS X and on Unix/Linux systems, and IBMCRT is set to "true" on Windows systems. If your system has an ANSI compatible terminal, you might want to find the definition of ANSICRT in phylip.h and set it to "true", and IBMCRT to "false".
Dolmove uses as its numerical criterion the Dollo and polymorphism parsimony methods. The program defaults to carrying out Dollo parsimony.
The Dollo parsimony method was first suggested in print by Le Quesne (1974) and was first well-specified by Farris (1977). The method is named after Louis Dollo since he was one of the first to assert that in evolution it is harder to gain a complex feature than to lose it. The algorithm explains the presence of the state 1 by allowing up to one forward change 0-->1 and as many reversions 1-->0 as are necessary to explain the pattern of states seen. The program attempts to minimize the number of 1-->0 reversions necessary.
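The counting rule can be sketched for a single binary character on a fixed rooted tree. The Python function below is an illustration only (not PHYLIP's own code): it assumes ancestral state 0, places the single 0-->1 origin on the branch above the smallest clade containing every 1-tip, and then charges one reversion for each maximal all-0 subtree inside that clade.

```python
def dollo_reversions(tree, states):
    """Count the 1->0 reversions a single binary character needs under
    Dollo parsimony on a fixed rooted tree (nested tuples of tip names),
    assuming ancestral state 0.  `states` maps tip name -> 0 or 1."""

    def tips(node):
        if isinstance(node, str):
            return [node]
        return [t for child in node for t in tips(child)]

    def has_one(node):
        return any(states[t] for t in tips(node))

    def origin_clade(node):
        # smallest subtree containing every tip in state 1: the single
        # 0->1 origin is placed on the branch above it
        while isinstance(node, tuple):
            kids = [c for c in node if has_one(c)]
            if len(kids) != 1:
                break
            node = kids[0]
        return node

    def losses(node):
        if not has_one(node):
            return 1          # a maximal all-0 subtree: one reversion
        if isinstance(node, str):
            return 0          # a 1-tip needs no reversion
        return sum(losses(child) for child in node)

    if not has_one(tree):
        return 0              # state 1 never arises at all
    return losses(origin_clade(tree))

# Alpha and Gamma have the state; Beta, inside their smallest clade, lost it:
cost = dollo_reversions((("Alpha", ("Beta", "Gamma")), "Delta"),
                        {"Alpha": 1, "Beta": 0, "Gamma": 1, "Delta": 0})  # 1
```

The tree names and tuple representation here are hypothetical conveniences; the program itself works on its internal tree structures and sums this count over all characters, with the A (Ancestors) option further allowing the ancestral state itself to be chosen to minimize the count.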
The assumptions of this method are in effect:
One problem can arise when using additive binary recoding to represent a multistate character as a series of two-state characters. Unlike the Camin-Sokal, Wagner, and Polymorphism methods, the Dollo method can reconstruct ancestral states which do not exist. An example is given in my 1979 paper. It will be necessary to check the output to make sure that this has not occurred.
The polymorphism parsimony method was first used by me, and the results published (without a clear specification of the method) by Inger (1967). The method was independently published by Farris (1978a) and by me (1979). The method assumes that we can explain the pattern of states by no more than one origination (0-->1) of state 1, followed by retention of polymorphism along as many segments of the tree as are necessary, followed by loss of state 0 or of state 1 where necessary. The program tries to minimize the total number of polymorphic characters, where each polymorphism is counted once for each segment of the tree in which it is retained.
The assumptions of the polymorphism parsimony method are in effect:
That these are the assumptions of parsimony methods has been documented in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b), but also read the exchange between Felsenstein and Sober (1986).
Below is a test data set, but we cannot show the output it generates because of the interactive nature of the program.
     5     6
Alpha     110110
Beta      110000
Gamma     100110
Delta     001001
Epsilon   001110
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
Dolpenny is a program that will find all of the most parsimonious trees implied by your data when the Dollo or polymorphism parsimony criteria are employed. It does so not by examining all possible trees, but by using the more sophisticated "branch and bound" algorithm, a standard computer science search strategy first applied to phylogenetic inference by Hendy and Penny (1982). (J. S. Farris [personal communication, 1975] had also suggested that this strategy, which is well-known in computer science, might be applied to phylogenies, but he did not publish this suggestion).
There is, however, a price to be paid for the certainty that one has found all members of the set of most parsimonious trees. The problem of finding these has been shown (Graham and Foulds, 1982; Day, 1983) to be NP-complete, which is equivalent to saying that there is no fast algorithm that is guaranteed to solve the problem in all cases (for a discussion of NP-completeness, see the Scientific American article by Lewis and Papadimitriou, 1978). The result is that this program, despite its algorithmic sophistication, is VERY SLOW.
The program should be slower than the other tree-building programs in the package, but useable up to about ten species. Above this it will bog down rapidly, but exactly when depends on the data and on how much computer time you have (it may be more effective in the hands of someone who can let a microcomputer grind all night than for someone who has the "benefit" of paying for time on the campus mainframe computer). IT IS VERY IMPORTANT FOR YOU TO GET A FEEL FOR HOW LONG THE PROGRAM WILL TAKE ON YOUR DATA. This can be done by running it on subsets of the species, increasing the number of species in the run until you either are able to treat the full data set or know that the program will take unacceptably long on it. (Making a plot of the logarithm of run time against species number may help to project run times).
The search strategy used by Dolpenny starts by making a tree consisting of the first two species (the first three if the tree is to be unrooted). Then it tries to add the next species in all possible places (there are three of these). For each of the resulting trees it evaluates the number of losses. It adds the next species to each of these, again in all possible places. If this process were continued it would simply generate all possible trees, of which there are a very large number even when the number of species is moderate (34,459,425 with 10 species). Actually it does not do this, because the trees are generated in a particular order and some of them are never generated.
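That count of all possible rooted bifurcating trees follows from the fact that when the k-th species is added, the growing tree offers 2k-3 branches to attach it to. A quick check in Python (illustrative only):

```python
def num_rooted_trees(n):
    """Number of rooted bifurcating trees for n labeled species: the k-th
    species can be attached on any of 2k-3 branches of the growing tree,
    so the count is the double factorial 1 * 3 * 5 * ... * (2n-3)."""
    count = 1
    for k in range(3, n + 1):
        count *= 2 * k - 3
    return count

num_rooted_trees(10)   # 34459425, the figure quoted above
```

With only two species there is a single tree; each added species multiplies the count by the next odd number, which is why the total explodes so quickly.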
Actually the order in which trees are generated is not quite as implied above, but is a "depth-first search". This means that first one adds the third species in the first possible place, then the fourth species in its first possible place, then the fifth and so on until the first possible tree has been produced. Its number of steps is evaluated. Then one "backtracks" by trying the alternative placements of the last species. When these are exhausted one tries the next placement of the next-to-last species. The order of placement in a depth-first search is like this for a four-species case (parentheses enclose monophyletic groups):
Make tree of first two species (A,B)
  Add C in first place ((A,B),C)
    Add D in first place (((A,D),B),C)
    Add D in second place ((A,(B,D)),C)
    Add D in third place (((A,B),D),C)
    Add D in fourth place ((A,B),(C,D))
    Add D in fifth place (((A,B),C),D)
  Add C in second place: ((A,C),B)
    Add D in first place (((A,D),C),B)
    Add D in second place ((A,(C,D)),B)
    Add D in third place (((A,C),D),B)
    Add D in fourth place ((A,C),(B,D))
    Add D in fifth place (((A,C),B),D)
  Add C in third place (A,(B,C))
    Add D in first place ((A,D),(B,C))
    Add D in second place (A,((B,D),C))
    Add D in third place (A,(B,(C,D)))
    Add D in fourth place (A,((B,C),D))
    Add D in fifth place ((A,(B,C)),D)
Among these fifteen trees you will find all of the four-species rooted bifurcating trees, each exactly once (the parentheses each enclose a monophyletic group). As displayed above, the backtracking depth-first search algorithm is just another way of producing all possible trees one at a time. The branch and bound algorithm consists of this with one change. As each tree is constructed, including the partial trees such as (A,(B,C)), its number of losses (or retentions of polymorphism) is evaluated.
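The sequential-addition scheme can be sketched in a few lines of Python (an illustration only, not PHYLIP's code; the order of generation may differ from the listing above). For four species it produces exactly the fifteen rooted trees, each once; the branch-and-bound algorithm differs only in pruning placements whose partial score already exceeds the bound.

```python
def placements(tree, name):
    """Yield every tree formed by attaching `name` to each branch of a
    rooted bifurcating tree (nested tuples), including above the root."""
    yield (tree, name)                    # attach on the branch below the root
    if isinstance(tree, tuple):
        left, right = tree
        for t in placements(left, name):
            yield (t, right)
        for t in placements(right, name):
            yield (left, t)

def all_rooted_trees(names):
    """Build all rooted bifurcating trees by sequential addition of species."""
    trees = [(names[0], names[1])]
    for name in names[2:]:
        trees = [t for tree in trees for t in placements(tree, name)]
    return trees

four_taxon_trees = all_rooted_trees(["A", "B", "C", "D"])   # 15 distinct trees
```

A depth-first version would recurse on each placement immediately instead of collecting whole generations of trees, which is what makes the bound test possible before a subtree of the search is expanded.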
The point of this is that if a previously-found tree such as ((A,B),(C,D)) required fewer losses, then we know that there is no point in even trying to add D to ((A,C),B). We have computed the bound that enables us to cut off a whole line of inquiry (in this case five trees) and avoid going down that particular branch any farther.
The branch-and-bound algorithm thus allows us to find all most parsimonious trees without generating all possible trees. How much of a saving this is depends strongly on the data. For very clean (nearly "Hennigian") data, it saves much time, but on very messy data it will still take a very long time.
The algorithm in the program differs from the one outlined here in some essential details: it investigates possibilities in the order of their apparent promise. This applies to the order of addition of species, and to the places where they are added to the tree. After the first two-species tree is constructed, the program tries adding each of the remaining species in turn, each in the best possible place it can find. Whichever of those species adds (at a minimum) the most additional steps is taken to be the one to be added next to the tree. When it is added, it is added in turn to places which cause the fewest additional steps to be added. This sounds a bit complex, but it is done with the intention of eliminating regions of the search of all possible trees as soon as possible, and lowering the bound on tree length as quickly as possible.
The program keeps a list of all the most parsimonious trees found so far. Whenever it finds one that has fewer losses than these, it clears out the list and restarts the list with that tree. In the process the bound tightens and fewer possibilities need be investigated. At the end the list contains all the shortest trees. These are then printed out. It should be mentioned that the program Clique for finding all largest cliques also works by branch-and-bound. Both problems are NP-complete but for some reason Clique runs far faster. Although their worst-case behavior is bad for both programs, those worst cases occur far more frequently in parsimony problems than in compatibility problems.
Among the quantities available to be set at the beginning of a run of Dolpenny, two (howoften and howmany) are of particular importance. As Dolpenny goes along it will keep count of how many trees it has examined. Suppose that howoften is 100 and howmany is 1000, the default settings. Every time 100 trees have been examined, Dolpenny will print out a line saying how many multiples of 100 trees have now been examined, how many steps the most parsimonious tree found so far has, how many trees with that number of steps have been found, and a very rough estimate of what fraction of all trees have been looked at so far.
When the number of these multiples printed out reaches the number howmany (say 1000), the whole algorithm aborts and prints out that it has not found all most parsimonious trees, but prints out what it has got so far anyway. These trees need not be any of the most parsimonious trees: they are simply the most parsimonious ones found so far. By setting the product (howoften X howmany) large you can make the algorithm less likely to abort, but then you risk getting bogged down in a gigantic computation. You should adjust these constants so that the program cannot go beyond examining the number of trees you are reasonably willing to pay for (or wait for). In their initial setting the program will abort after looking at 100,000 trees. Obviously you may want to adjust howoften in order to get more or fewer lines of intermediate notice of how many trees have been looked at so far. Of course, in small cases you may never even reach the first multiple of howoften and nothing will be printed out except some headings and then the final trees.
The indication of the approximate percentage of trees searched so far will be helpful in judging how much farther you would have to go to get the full search. Actually, since that fraction is the fraction of the set of all possible trees searched or ruled out so far, and since the search becomes progressively more efficient, the approximate fraction printed out will usually be an underestimate of how far along the program is, sometimes a serious underestimate.
A constant that affects the result is "maxtrees", which controls the maximum number of trees that can be stored. Thus if "maxtrees" is 25, and 32 most parsimonious trees are found, only the first 25 of these are stored and printed out. If "maxtrees" is increased, the program does not run any slower but requires a little more intermediate storage space. I recommend that "maxtrees" be kept as large as you can, provided you are willing to look at an output with that many trees on it! Initially, "maxtrees" is set to 100 in the distribution copy.
The counting of the length of trees is done by an algorithm nearly identical to the corresponding algorithms in Dollop, and thus the remainder of this document will be nearly identical to the Dollop document. The Dollo parsimony method was first suggested in print by Le Quesne (1974) and was first well-specified by Farris (1977). The method is named after Louis Dollo since he was one of the first to assert that in evolution it is harder to gain a complex feature than to lose it. The algorithm explains the presence of the state 1 by allowing up to one forward change 0-->1 and as many reversions 1-->0 as are necessary to explain the pattern of states seen. The program attempts to minimize the number of 1-->0 reversions necessary.
The assumptions of this method are in effect:
That these are the assumptions is established in several of my papers (1973a, 1978b, 1979, 1981b, 1983). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b), but also read the exchange between Felsenstein and Sober (1986).
One problem can arise when using additive binary recoding to represent a multistate character as a series of two-state characters. Unlike the Camin-Sokal, Wagner, and Polymorphism methods, the Dollo method can reconstruct ancestral states which do not exist. An example is given in my 1979 paper. It will be necessary to check the output to make sure that this has not occurred.
The polymorphism parsimony method was first used by me, and the results published (without a clear specification of the method) by Inger (1967). The method was independently published by Farris (1978a) and by me (1979). The method assumes that we can explain the pattern of states by no more than one origination (0-->1) of state 1, followed by retention of polymorphism along as many segments of the tree as are necessary, followed by loss of state 0 or of state 1 where necessary. The program tries to minimize the total number of polymorphic characters, where each polymorphism is counted once for each segment of the tree in which it is retained.
The assumptions of the polymorphism parsimony method are in effect:
That these are the assumptions of parsimony methods has been documented in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For an opposing view arguing that the parsimony methods make no substantive assumptions such as these, see the papers by Farris (1983) and Sober (1983a, 1983b), but also read the exchange between Felsenstein and Sober (1986).
The input format is the standard one, with "?", "P", "B" states allowed. Most of the options are selected using a menu:
Penny algorithm for Dollo or polymorphism parsimony, version 3.69
 branch-and-bound to find all most parsimonious trees

Settings for this run:
  P                      Parsimony method?  Dollo
  H         How many groups of  100 trees:  1000
  F         How often to report, in trees:   100
  S            Branch and bound is simple?  Yes
  T               Use Threshold parsimony?  No, use ordinary parsimony
  A                  Use ancestral states?  No
  W                        Sites weighted?  No
  M            Analyze multiple data sets?  No
  0    Terminal type (IBM PC, ANSI, none)?  ANSI
  1     Print out the data at start of run  No
  2   Print indications of progress of run  Yes
  3                         Print out tree  Yes
  4      Print out steps in each character  No
  5      Print states at all nodes of tree  No
  6        Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
The P option toggles between the Polymorphism parsimony method and the default Dollo parsimony method.
The options T, A, and M are the usual Threshold, Ancestral States, and Multiple Data Sets options. They are described in the Main documentation file and in the Discrete Characters Programs documentation file.
Options F and H reset the variables howoften (F) and howmany (H). The user is prompted for the new values. By setting these larger the program will report its progress less often (howoften) and will run longer (howmany times howoften trees). These values default to 100 and 1000, which allows a search of up to 100,000 trees before the program gives up, but they can be changed. Note that option F in this program is not the Factors option available in some of the other programs in this section of the package.
The use of the A option allows implementation of the unordered Dollo parsimony and unordered polymorphism parsimony methods which I have described elsewhere (1984b). When the A option is used the ancestor is not to be counted as one of the species. The O (outgroup) option is not available since the tree produced is already rooted.
Setting T at or below 1.0 but above 0 causes the criterion to become compatibility rather than polymorphism parsimony, although there is no advantage to using this program instead of Penny to do a compatibility method. Setting the threshold value higher brings about an intermediate between the Dollo or polymorphism parsimony methods and the compatibility method, so that there is some rationale for doing that.
Using a threshold value of 1.0 or lower, but above 0, one can obtain a rooted (or, if the A option is used with ancestral states of "?", unrooted) compatibility criterion, but there is no particular advantage to using this program for that instead of MIX. Higher threshold values are of course meaningful and provide intermediates between Dollo and compatibility methods.
The S (Simple) option alters a step in Dolpenny which reconsiders the order in which species are added to the tree. Normally the decision as to which species to add to the tree next is made as the first tree is being constructed; that ordering of species is not altered subsequently. Turning the Simple option off causes the ordering to be continually reconsidered. This will probably result in a substantial increase in run time, but on some data sets of intermediate messiness it may help; it is included in case it proves of use on some data sets. The Simple option, in which the ordering is kept the same after being established by trying alternatives during the construction of the first tree, is the default.
The Factors option is not available in this program, as it would have no effect on the result even if that information were provided in the input file.
The output format is also standard. It includes a rooted tree and, if the user selects option 4, a table of the numbers of reversions or retentions of polymorphism necessary in each character. If any of the ancestral states has been specified to be unknown, a table of reconstructed ancestral states is also provided. When reconstructing the placement of forward changes and reversions under the Dollo method, keep in mind that each polymorphic state in the input data will require one "last minute" reversion. This is included in the tabulated counts. Thus if we have both states 0 and 1 at a tip of the tree the program will assume that the lineage had state 1 up to the last minute, and then state 0 arose in that population by reversion, without loss of state 1.
A table is available to be printed out after each tree, showing for each branch whether there are known to be changes in the branch, and what the states are inferred to have been at the top end of the branch. If the inferred state is a "?" there will be multiple equally-parsimonious assignments of states; the user must work these out for themselves by hand.
If the A option is used, then the program will infer, for any character whose ancestral state is unknown ("?") whether the ancestral state 0 or 1 will give the best tree. If these are tied, then it may not be possible for the program to infer the state in the internal nodes, and these will all be printed as ".". If this has happened and you want to know more about the states at the internal nodes, you will find it helpful to use Dolmove to display the tree and examine its interior states, as the algorithm in Dolmove shows all that can be known in this case about the interior states, including where there is and is not ambiguity. The algorithm in Dolpenny gives up more easily on displaying these states.
If option 6 is left in its default state the trees found will be written to a tree file, so that they are available to be used in other programs. If the program finds multiple trees tied for best, all of these are written out onto the output tree file. Each is followed by a numerical weight in square brackets (such as [0.25000]). This is needed when we use the trees to make a consensus tree of the results of bootstrapping or jackknifing, to avoid overrepresenting replicates that find many tied trees.
At the beginning of the program are a series of constants, which can be changed to help adapt the program to different computer systems. Two are the initial values of howmany and howoften, constants "often" and "many". Constant "maxtrees" is the maximum number of tied trees that will be stored.
     7     6
Alpha1    110110
Alpha2    110110
Beta1     110000
Beta2     110000
Gamma1    100110
Delta     001001
Epsilon   001110
Penny algorithm for Dollo or polymorphism parsimony, version 3.69
 branch-and-bound to find all most parsimonious trees

 7 species,   6 characters

Dollo parsimony method


Name         Characters
----         ----------

Alpha1       11011 0
Alpha2       11011 0
Beta1        11000 0
Beta2        11000 0
Gamma1       10011 0
Delta        00100 1
Epsilon      00111 0



requires a total of      3.000

     3 trees in all found




  +-----------------Delta
  !
--2  +--------------Epsilon
  !  !
  +--3  +-----------Gamma1
     !  !
     +--6  +--------Alpha2
        !  !
        +--1     +--Beta2
           !  +--5
           +--4  +--Beta1
              !
              +-----Alpha1

reversions in each character:
       0   1   2   3   4   5   6   7   8   9
    *-----------------------------------------
  0 !      0   0   1   1   1   0

From    To       Any Steps?    State at upper node
                               ( . means same as in the node below it on tree)

root     2        yes       ..1.. .
  2      Delta    yes       ..... 1
  2      3        yes       ...11 .
  3      Epsilon  no        ..... .
  3      6        yes       1.0.. .
  6      Gamma1   no        ..... .
  6      1        yes       .1... .
  1      Alpha2   no        ..... .
  1      4        no        ..... .
  4      5        yes       ...00 .
  5      Beta2    no        ..... .
  5      Beta1    no        ..... .
  4      Alpha1   no        ..... .


  +-----------------Delta
  !
--2  +--------------Epsilon
  !  !
  +--3  +-----------Gamma1
     !  !
     +--6        +--Beta2
        !  +-----5
        !  !     +--Beta1
        +--4
           !     +--Alpha2
           +-----1
                 +--Alpha1

reversions in each character:
       0   1   2   3   4   5   6   7   8   9
    *-----------------------------------------
  0 !      0   0   1   1   1   0

From    To       Any Steps?    State at upper node
                               ( . means same as in the node below it on tree)

root     2        yes       ..1.. .
  2      Delta    yes       ..... 1
  2      3        yes       ...11 .
  3      Epsilon  no        ..... .
  3      6        yes       1.0.. .
  6      Gamma1   no        ..... .
  6      4        yes       .1... .
  4      5        yes       ...00 .
  5      Beta2    no        ..... .
  5      Beta1    no        ..... .
  4      1        no        ..... .
  1      Alpha2   no        ..... .
  1      Alpha1   no        ..... .


  +-----------------Delta
  !
--2  +--------------Epsilon
  !  !
  +--3  +-----------Gamma1
     !  !
     !  !        +--Beta2
     +--6     +--5
        !  +--4  +--Beta1
        !  !  !
        +--1  +-----Alpha2
           !
           +--------Alpha1

reversions in each character:
       0   1   2   3   4   5   6   7   8   9
    *-----------------------------------------
  0 !      0   0   1   1   1   0

From    To       Any Steps?    State at upper node
                               ( . means same as in the node below it on tree)

root     2        yes       ..1.. .
  2      Delta    yes       ..... 1
  2      3        yes       ...11 .
  3      Epsilon  no        ..... .
  3      6        yes       1.0.. .
  6      Gamma1   no        ..... .
  6      1        yes       .1... .
  1      4        no        ..... .
  4      5        yes       ...00 .
  5      Beta2    no        ..... .
  5      Beta1    no        ..... .
  4      Alpha2   no        ..... .
  1      Alpha1   no        ..... .
Written by Joseph Felsenstein and James McGill.
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved.
License terms here.
Drawtree and Drawgram are interactive tree-plotting programs that take a tree description in a file and read it, and then let you interactively make various settings and then make a plot of the tree in a file in some graphical file format, or plot the tree on a laser printer, plotter, or dot matrix printer. In most cases you can preview the resulting tree. This allows you to modify the tree until you like the result, then plot the result. Drawtree plots unrooted trees and Drawgram plots rooted cladograms and phenograms. With a plot to a file whose format is one acceptable to publishers both programs can produce fully publishable results.
These programs are descended from PLOTGRAM and PLOTREE, written by Christopher Meacham in 1984 and contributed to PHYLIP. I have incorporated his code for fonts and his plotter drivers, and in Drawtree have used some of his code for drawing unrooted trees. In both programs I have also included some plotter driver code by David Swofford, Julian Humphries and George D.F. "Buz" Wilson, to all of whom I am very grateful. Mostly, however, they consist of my own code and that of my programmers. The font files are printable-character recodings of the public-domain Hershey fonts, recoded by Christopher Meacham. The Java interface for the programs was created by Jim McGill.
This document will describe the features common to both programs. The documents for Drawtree and Drawgram describe the particular choices you can make in each of those programs. The Appendix to this documentation file contains some pieces of C code that can be inserted to make the program handle another plotting device -- the plotters by Calcomp.
To use Drawtree and Drawgram, you must have
Once you have all these pieces, the programs should be fairly self explanatory, particular if you can preview your plots so that you can discover the meaning of the different options by trying them out.
Once you have an executable version of the appropriate program (say Drawgram), and a file called (say) intree with the tree in it, and if necessary a font file (say font2 which you have copied as a file called fontfile), all you do is run the Drawgram program. It should automatically read the tree file and any font file needed, and will allow you to change the graphics device. Then it will let you see the options it has chosen, and ask you if you want to change these. Once you have modified those that you want to, you can tell it to accept those. The version of the program that has a Java interface will then allow you to preview the tree on the computer screen.
After you are done previewing the tree, the program will want to know whether you are ready to plot the tree. In the Java GUI version of the programs, you press on the Create Plot File button when you want to produce the final plot.
In the character-mode menu-driven versions of the programs, options can be changed but previewing does not occur. Plotting will occur after you close the menu by making the Y (yes) choice when you are asked whether you want to accept the plot as is. If you say no, it will once again allow you to change options, as many times as you want. If you say yes, then it will write a file called (say) plotfile. If you then copy this file to your printer or plotter, it should result in a beautifully plotted tree. You may need to change the filename to have the file format recognized by your operating system (for example, you may want to change plotfile to plotfile.ps if the file is in Postscript format).
If you don't want to print the file immediately, but want to edit the figure first, you should have chosen an output format that is readable by a draw program. Postscript format is readable by drawing programs such as Adobe Illustrator, Canvas, Freehand, and Coreldraw, and can be displayed by the Unix utilities Ghostscript and Ghostview. It can also be imported into word processors such as Microsoft Word as a figure. The PICT format was created for earlier Macintosh drawing programs such as MacDraw, and can be read by some other drawing programs and word processors. A widely-available bitmap drawing editor is GIMP (the Gnu Image Manipulation Program). On Windows systems bitmap drawing editors such as Paint can read Windows Bitmap files. We have provided output formats here for Xfig and Idraw drawing programs available on Linux or Unix systems.
Drawing programs can be used to add branch length numbers (something too hard for us to do automatically in these programs) and to make scale bars. Another use is as a way of printing out the trees, as most drawing programs are set up to print out their figures.
Having read the above, you may be ready to run the program. Below you will find more information about representation of trees in the tree file, on the different kinds of graphics devices supported by this program, and on how to recompile these programs.
The Newick Standard for representing trees in computer-readable form makes use of the correspondence between trees and nested parentheses, noticed in 1857 by the famous English mathematician Arthur Cayley. If we have this rooted tree:
          A                 D
           \         E     /
            \   C   /     /
             \  !  /     /
              \ ! /     /
         B     \!/     /
          \     o     /
           \    !    /
            \   !   /
             \  !  /
              \ ! /
               \!/
                o
                !
                !
then in the tree file it is represented by the following sequence of printable characters, starting at the beginning of the file:
(B,(A,C,E),D);
The tree ends with a semicolon. Everything after the semicolon in the input file is ignored, including any other trees. The bottommost node in the tree is an interior node, not a tip. Interior nodes are represented by a pair of matched parentheses. Between them are representations of the nodes that are immediately descended from that node, separated by commas. In the above tree, the immediate descendants are B, another interior node, and D. The other interior node is represented by a pair of parentheses, enclosing representations of its immediate descendants, A, C, and E.
Tips are represented by their names. A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets. In the programs a maximum of 20 characters is allowed for names: this limit can easily be increased by recompiling the program and changing the constant declaration for "MAXNCH" in phylip.h.
Because you may want to include a blank in a name, it is assumed that an underscore character ("_") stands for a blank; any of these in a name will be converted to a blank when it is read in. Any name may also be empty: a tree like
(,(,,),);
is allowed. Trees can be multifurcating at any level (while in many of the programs multifurcations of user-defined trees are not allowed or are restricted to a trifurcation at the bottommost level, these programs do not make any such restriction).
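The underscore convention is simple enough to show in a few lines of C. This sketch is illustrative only, and is not PHYLIP's actual input routine:

```c
/* Convert the underscores in a Newick name to blanks, in place,
   following the convention that "_" stands for a blank.
   (Illustrative sketch only -- not PHYLIP's own input code.) */
static void underscores_to_blanks(char *name)
{
  for (; *name != '\0'; name++)
    if (*name == '_')
      *name = ' ';
}
```

Applied to a name such as "sea_lion", this yields "sea lion".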
Branch lengths can be incorporated into a tree by putting a real number, with or without decimal point, after a node and preceded by a colon. This represents the length of the branch immediately below that node. Thus the above tree might have lengths represented as:
(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);
These programs will be able to make use of this information only if lengths exist for every branch, except the one at the bottom of the tree.
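As a worked example of how little machinery is needed to read this representation, here is a hedged sketch in C that counts the tips of a Newick string. It assumes well-formed input, skips over branch lengths after a colon, and is not PHYLIP's real tree reader:

```c
/* Count the tips (leaf names) in a well-formed Newick string by
   counting name tokens that directly follow a '(' or ','.
   Branch lengths after ':' are skipped.  A sketch only; PHYLIP's
   actual tree-reading code does much more (and handles errors). */
static int count_tips(const char *tree)
{
  int tips = 0;
  char prev = '(';                 /* treat start of string like '(' */
  for (; *tree != '\0' && *tree != ';'; tree++) {
    char c = *tree;
    if (c == ' ' || c == '\n')
      continue;                    /* blanks may appear between tokens */
    if (c == '(' || c == ')' || c == ',' || c == ':') {
      prev = c;
    } else {                       /* part of a name or a branch length */
      if (prev == '(' || prev == ',')
        tips++;                    /* first character of a tip name */
      if (prev != ':')             /* stay in ':' state through a length */
        prev = 'n';
    }
  }
  return tips;
}
```

For the example tree above, with or without branch lengths, this counts five tips (A, B, C, D, E).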
The tree starts on the first line of the file, and can continue to subsequent lines. It is best to proceed to a new line, if at all, immediately after a comma. Blanks can be inserted at any point except in the middle of a species name or a branch length.
The above description is of a subset of the Newick Standard. For example, interior nodes can have names in that standard, but if any are included the present programs will omit them.
To help you understand this tree representation, here are some trees in the above form:
((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,
seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,
weasel:18.87953):2.09460):3.87382,dog:25.46154);

(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,
Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460);
(Bovine:0.69395,(Hylobates:0.36079,(Pongo:0.33636,(G._Gorilla:0.17147, (P._paniscus:0.19268,H._sapiens:0.11927):0.08386):0.06124):0.15057):0.54939, Rodent:1.21460);
();
((A,B),(C,D));
(Alpha,Beta,Gamma,Delta,,Epsilon,,,);
The Newick standard is based on a standard invented by Christopher Meacham for his programs PLOTREE and PLOTGRAM. The Newick Standard was adopted June 26, 1986 by an informal committee that met during the Society for the Study of Evolution meetings in Durham, New Hampshire and consisted of James Archie, William H.E. Day, Wayne Maddison, Christopher Meacham, F. James Rohlf, David Swofford, and myself. A web page describing it will be found at http://evolution.gs.washington.edu/phylip/newicktree.html.
When the programs run they have a menu which allows you to set (on its option P) the final plotting device, and another menu which allows you to set the type of preview screen. The choices for previewing are a subset of those available for plotting, and they can be different (the most useful combination will usually be a previewing graphics screen with a hard-copy plotter or a drawing program graphics file format).
In the Java interface the "Final plot file type" menu gives you the choices
Which plotter or printer will the tree be drawn on?
(many other brands or models are compatible with these)

   type:       to choose one compatible with:
        L         Postscript printer file format
        M         PICT format (for drawing programs)
        J         HP Laserjet PCL file format
        W         MS-Windows Bitmap
        F         FIG 2.0 drawing program format
        A         Idraw drawing program format
        Z         VRML Virtual Reality Markup Language file
        P         PCX file format (for drawing programs)
        K         TeKtronix 4010 graphics terminal
        X         X Bitmap format
        V         POVRAY 3D rendering program file
        R         Rayshade 3D rendering program file
        H         Hewlett-Packard pen plotter (HPGL file format)
        D         DEC ReGIS graphics (VT240 terminal)
        E         Epson MX-80 dot-matrix printer
        C         Prowriter/Imagewriter dot-matrix printer
        O         Okidata dot-matrix printer
        B         Houston Instruments plotter
        U         other: one you have inserted code for

 Choose one:
Here are the choices, with some comments on each:
Postscript printer file format. This means that the program will generate a file containing Postscript commands as its plot file. This can be printed on any Postscript-compatible laser printer, and can be incorporated into Microsoft Word documents or into PDF documents. The page size is assumed to be 8.5 by 11 inches, but as the plot stays within this limit, A4 metric paper should work well too. This is the best quality output option. For this printer the menu options in Drawgram and Drawtree that allow you to select one of the built-in fonts will work. The programs default to Times-Roman when this plotting option is in effect. I have been able to use the fonts Courier, Times-Roman, and Helvetica. The others have eluded me for some reason known only to those who really understand Postscript. The font name is written into the file, so any name that works there is possible.
PICT format (for drawing programs). This file format is read by many drawing programs (an early example was MacDraw). It has support for some fonts, though if fonts are used the species names can only be drawn horizontally or vertically, not at other angles in between. The control over line widths is a bit rough also, so that some lines at different angles may turn out to be different widths when you do not want them to be. If you are working on a Mac OS X system and have not been able to persuade it to print a Postscript file, even after adding a .ps extension to the file name, this option may be the best solution, as you could then read the file into a drawing program and then order it to print the resulting screen. The PICT file format has font support, and the default font for this plotting option is set to Times. You can also choose font attributes for the labels such as Bold, Italic, Outline, and Shadowed. PICT files can be read by various drawing programs, but Adobe Photoshop has recently dropped support for the format. It has been replaced by PDF format as the default graphics file format in Mac OS X.
HP Laserjet PCL file format. Hewlett-Packard's extremely popular line of laser printers has been emulated by many other brands of laser printer, so that for many years this format was compatible with many printers. It was also the default format for many inkjet printers. More recently almost all of these printers have support for the Postscript format, and their support for the PCL format may ultimately disappear. One limitation of the early versions of the PCL command language for these printers was that they did not have primitive operations for drawing arbitrary diagonal lines. This means that they must be treated by these programs as if they were dot matrix printers with a great many dots. This makes output files large, and output can be slow. The user will be asked to choose the dot resolution (75, 150, or 300 dots per inch). The 300 dot per inch setting should not be used if the laser printer's memory is less than 512k bytes. The quality of output is also not as good as it might be so that the Postscript file format will usually produce better results even at the same resolution. I am grateful to Kevin Nixon for inadvertently assisting me by pointing out that on Laserjets one does not have to dump the complete bitmap of a page to plot a tree.
MS-Windows Bitmap. This file format is used by most Windows drawing and paint programs, including Windows Paint which comes with the Windows operating system. It asks you to choose the height and width of the graphic image in pixels. For the moment, the image is set to be a monochrome image which can only be black or white. We hope to change that soon, but note that by pasting the image into a copy of Paint that is set to have a color image of the appropriate size, one can get a version whose color can be changed. Note also that large enough Windows Bitmap files can be used as "wallpaper" images for the background of a desktop.
FIG 2.0 drawing program format. This is the file format of the free drawing program Xfig, available for X-windows systems on Unix or Linux systems. Xfig can be downloaded from these places:
Note that Xfig may not use the fonts but may draw the names with lines. This often makes the names look rather bumpy. We hope to change this soon.
Idraw drawing program format. Idraw is a free drawing program for X windows systems (such as Unix and Linux systems). Its interface is loosely based on MacDraw, and I find it much more usable than Xfig (almost no one else seems to agree with me). Though it was unsupported for a number of years, it has more recently been actively supported by Scott Johnston, of Vectaport, Inc. (http://www.vectaport.com). He has produced, in his ivtools package, a number of specialized versions of Idraw, and he also distributes the original Idraw as part of it. ivtools is available as a package on the Debian family of Linux distributions, as packages ivtools-bin, libiv-unidraw1, libiv1 and (for development) ivtools-dev. Thus on a Debian-family Linux system such as Ubuntu Linux or Linux Mint, you may simply need to type:
sudo apt-get install ivtools-bin
sudo apt-get install libiv-unidraw1
sudo apt-get install libiv1

in order to install Ivtools.
The Idraw file format that our programs produce can be read into Idraw, and also can be imported into the other Ivtools programs such as Drawtool. The file format saved from Idraw (or which can be exported from the other Ivtools programs) is Postscript, and if one does not print directly from Idraw one can simply send the file to the printer. But the format that we produce is missing some of the header information and will not work directly as a Postscript file. However if you read it into Idraw and then save it (or import it into one of the other Ivtools drawing programs such as Drawtool, and then export it) you will get a Postscript version that is fully useable.
Drawgram and Drawtree have font support in their Idraw file format options. The default font is Times-Bold but you can also enter the name of any other font that is supported by your Postscript printer. Idraw labels can be rotated to any angle. Some of these fonts are directly supported by the Idraw program. There is also a way to install new Postscript Type 1 fonts in the Ivtools programs.
Note that the Idraw drawing program from Ivtools is not related to the drawing program iDraw, which is produced by Indeeo, Inc.
VRML Virtual Reality Markup Language file. This is by far the most interesting plotting file format. VRML files describe objects in 3-dimensional space with lighting on them. A number of freely available "virtual reality browsers" or browser plugins can read VRML files. A list of available virtual reality browsers and browser plugins can be found at http://cic.nist.gov/vrml/vbdetect.html, a site that also automatically detects which VRML plugins are appropriate for your web browser. VRML plugins for your web browser or standalone browsers allow you to wander around looking at the tree from various angles, including from behind! I found VRMLView particularly easy to download -- it is distributed as an executable. It is not particularly fast and is somewhat mysterious to use (try your mouse buttons). At the moment our VRML output is unsophisticated. The branches are made of tubes, with spheres at their joints. The tree is made of three-dimensional tubes but is basically flat. Names are made of connected tubes (to get this make sure you use a simple default font such as the Hershey font in file font1). This has the interesting effect that if you (virtually) move around and look at the tree from behind, the names will be backwards. VRML itself has been superseded by a standard called X3D (see http://www.web3d.org/), and we will be moving toward X3D support. Fortunately X3D is backwards compatible with VRML. What's next? Trees whose branches stick out in three dimensions? Animated trees whose forks rotate slowly? A video game involving combat among schools of systematists?
PCX file format (for drawing programs). A bitmap format that was formerly much used on the PC platform, this has been largely superseded by the Windows Bitmap (BMP) format, but it is still useful. This file format is simple and is read by many other programs as well. The user must choose one of three resolutions for the file, 640x480, 800x600, or 1024x768. The file is a monochrome paint file. Our PCX format is correct but is not read correctly by versions of Microsoft Paint (PBrush) that are running on systems that have loaded Word97. The version of the Paint utility provided with Windows 7 also does not support the PCX format. The free image manipulation program GIMP (Gnu Image Manipulation Program) is able to read the PCX format.
The plot devices from here on are only available in the non-Java-interface version of the programs:
Tektronix 4010 graphics terminal. The plot file will contain commands for driving the Tektronix series of graphics terminals. Other graphics terminals were compatible with the Tektronix 4010 and its immediate descendants. There are terminal emulation programs for Macintoshes that emulate Tektronix graphics. On workstations with X windows you can use one option of the "xterm" utility to create a Tektronix-compatible window. On Sun workstations there used to be a Tektronix emulator called "tektool" which could be used to view the trees.
X Bitmap format. This produces an X-bitmap for the X Windows system on Unix or Linux systems, which can be displayed on X screens. You will be asked for the size of the bitmap (e.g., 16x16, or 256x256, etc.). This format cannot be printed out without further format conversion but is usable for backgrounds of windows ("wallpaper"). This can be a very bulky format if you choose a large bitmap. The bitmap is a structure that can actually be compiled into a C program (and thus built in to it), if you should have some reason for doing that.
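Since an X bitmap is literally C source, a tiny example shows what the format looks like. This hand-made 8x8 "Y" shape and its names are invented for illustration; it is not output from Drawtree:

```c
/* An X Bitmap (XBM) file is itself valid C code: a width, a height,
   and a byte array holding one bit per pixel, with the least
   significant bit of each byte being the leftmost pixel of its row.
   Hypothetical 8x8 example only, not produced by these programs. */
#define tiny_width 8
#define tiny_height 8
static unsigned char tiny_bits[] = {
  0x42, 0x24, 0x18, 0x18, 0x18, 0x18, 0x18, 0x18
};
```

This is why such a file can be compiled straight into a C program: including it declares the `tiny_bits` array directly.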
POVRAY 3D rendering program file. This produces a file for the free ray-tracing program POVRay (Persistence of Vision Raytracer), which is available at http://www.povray.org/. It shows a tree floating above a flat landscape. The tree is flat but made out of tubes (as are the letters of the species names). It casts a realistic shadow across the landscape, and is lit from over the left shoulder of the viewer. You will be asked to confirm the colors of the tree branches, the species names, the background, and the bottom plane. These default to Blue, Yellow, White, and White respectively.
Rayshade 3D rendering program file. The input format for the free ray-tracing program "rayshade", which is available at http://www-graphics.stanford.edu/~cek/rayshade/rayshade.html for many kinds of systems. Rayshade takes files of this format and turns them into color scenes in "raw" raster format (also called "MTV" format after a raytracing program of that name). If you get the Netpbm package (available from http://netpbm.sourceforge.net/projects/netpbm/) and compile it on your system, you can use the "mtvtoppm" and "ppmtogif" programs to convert this into the widely-used GIF raster format. (The Netpbm package will also allow you to convert into TIFF, PCX, and many other formats.) The resultant image will show a tree floating above a landscape, rendered in a real-looking 3-dimensional scene with shadows and illumination. It is possible to use Rayshade to make two scenes that together form a stereo pair. When producing output for Rayshade you will be asked by Drawgram or Drawtree whether you want to reset the values for the colors you want for the tree, the species names, the background, and the desired resolution.
Hewlett-Packard pen plotter (HPGL file format). This means that the program will generate a file as its plot file which uses the HPGL graphics language. Hewlett-Packard 7470, 7475, and many other plotters are compatible with this. The paper size is again assumed to be 8.5 by 11 inches (again, A4 should work well too). It is assumed that there are two pens, a thicker one for drawing the tree and a finer one for drawing names, and the HPGL commands will call for switching between these. Few people have HP plotters these days, but the PCL printer control language found in recent Hewlett-Packard printers can emulate an HP plotter, as this feature is included in its PCL5 command language (but not in the PCL4 command language of earlier Hewlett-Packard models).
DEC ReGIS graphics (VT240 terminal). The DEC ReGIS standard is used by the VT240 and VT340 series terminals by DEC (Digital Equipment Corporation). There used to be many graphics terminals that emulate the VT240 or VT340 as well. The DECTerm windows in many versions of Digital's (now Compaq's) DECWindows windowing system also did so. These days DEC ReGIS graphics is rarely seen: it is most likely to be encountered as an option in X11 Xterm windows.
Epson MX-80 dot-matrix printer. This file format is for the dot-matrix printers by Epson (starting with the MX80 and continuing on to many other models), as well as the IBM Graphics printers. The code here plots in double-density graphics mode. Many of the later models are capable of higher-density graphics but not with every dot printed. This density was chosen for reasonably wide compatibility. Many other dot-matrix printers on the market have graphics modes compatible with the Epson printers. I cannot guarantee that the plot files generated by these programs would be compatible with all of these, but they do work on Epsons. They have also worked, in our hands, on IBM Graphics Printers. There used to be many printers that claimed compatibility with these too, but I do not know whether it will work on all of them. If you have trouble with any of these you might consider modifying the epson option of procedure initplotter to put in an fprintf statement that writes to plotfile an escape sequence that changes line spacing. As dot matrix printers are rare these days, used mostly to print multipart receipts in business, I suspect this option will not get much testing.
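If you do experiment with that, the standard ESC/P command for line spacing is, to the best of my knowledge, ESC '3' n, which sets the spacing to n/216 inch. This sketch just builds those three bytes; check your printer manual before relying on it:

```c
/* Build the ESC/P "set line spacing to n/216 inch" escape sequence,
   ESC '3' n, into buf (3 bytes); returns the byte count.  One could
   then fwrite() these bytes to plotfile inside initplotter.
   Hedged: ESC '3' is standard ESC/P, but individual printers vary. */
static int epson_line_spacing_seq(unsigned char *buf, unsigned char n216ths)
{
  buf[0] = 0x1B;        /* ESC */
  buf[1] = '3';
  buf[2] = n216ths;     /* spacing in 216ths of an inch */
  return 3;
}
```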
Prowriter/Imagewriter dot-matrix printer. The trading firm C. Itoh distributed this line of dot-matrix printers, which was made by Tokyo Electric (TEC), now a subsidiary of Toshiba. It was also sold by NEC under the product number PC8023. These were 9-pin dot matrix printers. In a slightly modified form they were also the Imagewriter printer sold by Apple for their Macintosh line. The same escape codes seem to work on both machines, the Apple version being a serial interface version. They are not related to the IBM Proprinter, despite the name.
Okidata dot-matrix printer. The ML81, 82, 83 and ML181, 182, 183 line of dot-matrix printers from Okidata had their own graphics codes and those are dealt with by this option. The later Okidata ML190 series emulated IBM Graphics Printers so that you would not want to use this option for them but the option for that printer.
Houston Instruments plotter. The Houston Instruments line of plotters were also known as Bausch and Lomb plotters. The code in the programs for these has not been tested recently; I would appreciate anyone who tries it out telling me whether it works. I do not have access to such a plotter myself, and doubt most users will come across one.
Conversion from these formats to others is also possible. There is a free program NetPBM that interconverts many bitmap formats (see above under Rayshade). A more accessible option will be the free image manipulation program GIMP (Gnu Image Manipulation Program) which can read our Postscript, Windows Bitmap (.BMP), PCX, and X Bitmap formats and can write many raster and vector formats.
In the Java GUI version of Drawgram and Drawtree, the graphics capabilities of Java are used for previewing. The programs actually write a Postscript file called JavaPreview.ps, and each time the preview is displayed this is read in and displayed.
Another problem is adding labels (such as vertical scales and branch lengths) to the plots produced by this program. This may require you to use the Postscript, BMP, PICT, Idraw, Xfig, or PCX file format and use a draw or paint program to add them. GIMP and Adobe Illustrator can do this.
I would like to add more built-in fonts. The fontfiles now have recoded versions of the Hershey fonts. They are legally publicly distributable. Most other font families on the market are not public domain and I cannot afford to license them for distribution. Some people have noticed that the Hershey fonts, which are drawn by a series of straight lines, have noticeable angles in what are supposed to be curves, when they are printed on modern laser printers and looked at closely. This is less a problem than one might think since, fortunately, when scientific journals print a tree it is usually shrunk so small that these imperfections (and often the tree itself) are hard to see!
One more font that could be added from the Hershey font collection would be a Greek font. If Greek users would find that useful I could add it, but my impression is that they publish mostly in English anyway.
The C code of these programs consists of two C programs, "drawgram.c" and "drawtree.c". Each of these uses two other pieces of C code "draw.c", "draw2.c", plus a common header file, "draw.h". All of the graphics commands that are common to both programs will be found in "draw.c" and "draw2.c". The following instructions for writing your own code to drive a different kind of printer, plotter, or graphics file format, require you only to make changes in "draw.c" and "draw2.c". The two programs can then be recompiled.
If you want to write code for other printers, plotters, or vector file formats, this is not too hard. The plotter option "U" is provided as a place for you to insert your own code. Chris Meacham's system was to draw everything, including the characters in the names and all curves, by drawing a series of straight lines. Thus you need only master your plotter's commands for drawing straight lines. In function "plotrparms" you must set up the values of variables "xunitspercm" and "yunitspercm", which are the number of units in the x and y directions per centimeter, as well as variables "xsize" and "ysize" which are the size of the plotting area in centimeters in the x direction and the y direction. A variable "penchange" of a user-defined type is set to "yes" or "no" depending on whether the commands to change the pen must be issued when switching between plotting lines and drawing characters. Even though dot-matrix printers do not have pens, penchange should be set to "yes" for them. In function "plot" you must issue commands to draw a line from the current position (which is at (xnow, ynow) in the plotter's units) to the position (xabs, yabs), under the convention that the lower-left corner of the plotting area is (0.0, 0.0). In functions "initplotter" and "finishplotter" you must issue commands to initialize the plotter and to finish plotting, respectively. If the pen is to be changed an appropriate piece of code must be inserted in function "penchange". The code to print the text needs to be added to the "plottext" function.
For dot matrix printers and raster graphics matters are a bit more complex. The functions "plotrparms", "initplotter", "finishplotter" and "plot" still respectively set up the parameters for the plotter, initialize it, finish a plot, and plot one line. But now the plotting consists of drawing dots into a two-dimensional array called "stripe". Once the plot is finished this array is printed out. In most cases the array is not as tall as a full plot: instead it is a rectangular strip across it. When the program has finished drawing in the strip, it prints it out and then moves down the plot to the next strip. For example, for Hewlett-Packard Laserjets we have defined the strip as 2550 dots wide and 20 dots deep. When the program goes to draw a line, it draws it into the strip and ignores any part of it that falls outside the strip. Thus the program does a complete plotting into the strip, then prints it, then moves down the diagram by (in this case) 20 dots, then does a complete plot into that strip, and so on.
To work with a new raster or dot matrix format, you will have to define the desired width of a strip ("strpwide"), the desired depth ("strpdeep"), and how many lines of dots must be printed out at a time to print a strip ("strpdiv"). Procedure "striprint" is the one that prints out a strip, and has special-case code for the different printers and file formats. For file formats, all of which print out a single row of dots at a time, the variable "strpdiv" is not used. The variable "dotmatrix" is set to "true" or "false" in function "plotrparms" according to whether or not "strpdiv" is to be used. Procedure "plotdot" sets a single dot in the array "strip" to 1 at position (xabs, yabs). The coordinates run from 1 at the top of the plot to larger numbers as we proceed down the page. Again, there is special-case code for different printers and file formats in that function. You will probably want to read the code for some of the dot matrix or file format options if you want to write code for one of them. Many of them have provision for printing only part of a line, ignoring parts of it that have no dots to print.
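The strip discipline can be sketched in a few lines. The sizes follow the Laserjet example above (2550 dots wide, 20 deep), but the one-byte-per-dot array and the function name here are simplifications invented for illustration; the real code packs the bits:

```c
#define STRPWIDE 2550   /* dots across a strip (Laserjet example) */
#define STRPDEEP 20     /* dots down a strip                      */

static unsigned char strip[STRPDEEP][STRPWIDE];
static long striptop = 1;   /* plot y-coordinate of the strip's top row */

/* Set one dot; anything outside the current strip is silently
   ignored, since it will be drawn on some other pass.  Plot
   coordinates run from 1 at the top, increasing down the page. */
static void toy_plotdot(long xabs, long yabs)
{
  long row = yabs - striptop;
  if (row < 0 || row >= STRPDEEP)
    return;                        /* outside the current strip */
  if (xabs < 1 || xabs > STRPWIDE)
    return;                        /* off the edge of the page  */
  strip[row][xabs - 1] = 1;
}
```

After each strip is printed, striptop is advanced by STRPDEEP and the whole plot is redrawn into the next strip, exactly as the text describes.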
I would be happy to obtain the resulting code from you to consider adding it to this listing so we can cover more kinds of plotters, printers, and file formats.
These pieces of code are to be inserted in the places reserved for the "Y" plotter option. The variables necessary to run this have already been incorporated into the programs.
Calcomp's industrial-strength plotters were once a fixture of university computer centers but are rarely found now. Just in case you need to use one, this code should work:
A global declaration needed near the front of drawtree.c:
Char cchexbuf[16];
Char *cchex = cchexbuf + 1;   /* shifted so the indices -1 through 14 used below stay in bounds */
Code to be inserted into function plotrparms:
case 'Y':
  plotter = other;
  xunitspercm = 39.37;
  yunitspercm = 39.37;
  xsize = 25.0;
  ysize = 25.0;
  xposition = 12.5;
  yposition = 0.0;
  xoption = center;
  yoption = above;
  rotation = 0.0;
  break;
Code to be inserted into function plot:
Declare these variables at the beginning of the function:
long n, inc, xinc, yinc, xlast, ylast,
     xrel, yrel, xhigh, yhigh, xlow, ylow;
Char quadrant;
and insert this into the switch statement:
case other:
  if (penstatus == pendown)
    putc('H', plotfile);
  else
    putc('D', plotfile);
  xrel = (long)floor(xabs + 0.5) - xnow;
  yrel = (long)floor(yabs + 0.5) - ynow;
  xnow = (long)floor(xabs + 0.5);
  ynow = (long)floor(yabs + 0.5);
  if (xrel > 0) {
    if (yrel > 0)
      quadrant = 'P';
    else
      quadrant = 'T';
  } else if (yrel > 0)
    quadrant = 'X';
  else
    quadrant = '1';
  xrel = labs(xrel);
  yrel = labs(yrel);
  if (xrel > yrel)
    n = xrel / 255 + 1;
  else
    n = yrel / 255 + 1;
  xinc = xrel / n;
  yinc = yrel / n;
  xlast = xrel % n;
  ylast = yrel % n;
  xhigh = xinc / 16;
  yhigh = yinc / 16;
  xlow = xinc & 15;
  ylow = yinc & 15;
  for (i = 1; i <= n; i++)
    fprintf(plotfile, "%c%c%c%c%c",
            quadrant, cchex[xhigh - 1], cchex[xlow - 1],
            cchex[yhigh - 1], cchex[ylow - 1]);
  if (xlast != 0 || ylast != 0)
    fprintf(plotfile, "%c%c%c%c%c",
            quadrant, cchex[-1], cchex[xlast - 1],
            cchex[-1], cchex[ylast - 1]);
  break;
Code to be inserted into function initplotter:
case other:
  cchex[-1] = 'C';   cchex[0] = 'D';    cchex[1] = 'H';    cchex[2] = 'L';
  cchex[3] = 'P';    cchex[4] = 'T';    cchex[5] = 'X';    cchex[6] = '1';
  cchex[7] = '5';    cchex[8] = '9';    cchex[9] = '/';    cchex[10] = '=';
  cchex[11] = '#';   cchex[12] = '"';   cchex[13] = '\'';  cchex[14] = '^';
  xnow = 0.0;
  ynow = 0.0;
  fprintf(plotfile, "CCCCCCCCCC");
  break;
Code to be inserted into function finishplotter:
case other:
  plot(penup, 0.0, yrange + 50.0);
  break;
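One way to convince yourself that the move-splitting arithmetic in the plot code above is right: the n equal increments plus the remainder must add back up to the original relative move, and each increment must fit in the format's 0-255 range. Here is a small self-contained check, mirroring the statements above but not itself part of the programs:

```c
/* Verify the Calcomp move-splitting arithmetic in isolation: a
   relative move (xrel, yrel) is sent as n increments of (xinc, yinc)
   units plus remainders (xlast, ylast).  Returns nonzero if the
   pieces reassemble the move and each increment fits in 0..255. */
static int calcomp_split_ok(long xrel, long yrel)
{
  long n, xinc, yinc, xlast, ylast;
  if (xrel > yrel)
    n = xrel / 255 + 1;
  else
    n = yrel / 255 + 1;
  xinc = xrel / n;
  yinc = yrel / n;
  xlast = xrel % n;
  ylast = yrel % n;
  return n * xinc + xlast == xrel
      && n * yinc + ylast == yrel
      && xinc <= 255 && yinc <= 255;
}
```

This holds because n * (r / n) + r % n == r for nonnegative longs, and n is chosen just large enough that each quotient stays below 256.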
The Hershey fonts were digitized fonts created by Dr. A. V. Hershey in the late 1960s when he was working at the U. S. Naval Weapons Laboratory. They were published in U. S. National Bureau of Standards Special Publication No. 424, distributed by the U. S. National Technical Information Service. Legally, it is possible to freely distribute these fonts in any encoding system except the original one used by the U. S. National Technical Information Service, provided that you acknowledge that the original fonts were produced by Dr. Hershey and published by NBS. This is a somewhat odd restriction, but convenient for us. Chris Meacham developed the software we use to read the Hershey fonts, and it uses a simple coding system that he developed. The original Hershey fonts were transformed by him into this encoding system. Six of them are distributed with PHYLIP: three Roman fonts, one unserifed and two serifed, two Italic fonts, one unserifed and one serifed, and a Russian Cyrillic font.
Each font file consists of groups of lines, one for each character. Here are the lines for character "h" in the font #1 in this encoding:
Ch 608  21 19 28
 -1456 1435 -1445 1748 1949 2249 2448 2545 2535 -12935
The group of lines starts with the letter C (for Character). Then follows the character that this font will draw (in this case "h"). It is the byte which, when read by the computer, signals that character. Then there is the number of this character in the original Hershey fonts (608). This is not used by our software.
The Hershey fonts are drawn on a grid of points as a series of lines. The next three numbers (21, 19, and 28) are the height (21), and two widths (19, and 28, which we don't use). Then comes a new line which shows the individual pen moves. When these are negative, they indicate that the pen is to be up when moving; when they are positive, the pen is to be down. They are integers. The last of them is greater than 10,000, and that is the signal to end after that move.
Each number has a final four digits that give the coordinate to which the pen is to move. These are given as (x,y) coordinates. Thus the first number (-1456) indicates the pen is to be up and the plotting is to move to coordinate (14, 56), which is x = 14, y = 56. Then the pen is put down and moved to (14, 35). This draws a line from (14, 56) to (14, 35), in fact the vertical line that forms the back of the "h". Then the pen is picked up and moved to (14, 45). Then there follow a series of moves with pen down to (17, 48), (19, 49), (22, 49), (24, 48), (25, 45), and finally (25, 35). This draws a series of connected line segments that make the arch and right-hand vertical, ending up at the bottom-right of the character. -12935 then signals a pen-up move to (29, 35). This moves to a point where the next character can start, putting in a little "white space".
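The decoding rule just described fits in a few lines of C. This is an illustrative sketch, not PHYLIP's own font-reading code, and the struct and function names are invented:

```c
/* Decode one pen move from Chris Meacham's Hershey-font encoding:
   the sign gives the pen state (positive = pen down), the low four
   digits give the (x, y) coordinate, and a magnitude above 10000
   marks the final move of the character.  Sketch only. */
struct fontmove { int pendown, x, y, last; };

static struct fontmove decode_move(long code)
{
  struct fontmove m;
  long v = code < 0 ? -code : code;
  m.pendown = code > 0;
  m.last = v > 10000;      /* e.g. -12935 ends the character */
  v %= 10000;              /* keep the final four digits      */
  m.x = (int)(v / 100);
  m.y = (int)(v % 100);
  return m;
}
```

For instance, -1456 decodes to a pen-up move to (14, 56), and -12935 to a final pen-up move to (29, 35), matching the walkthrough above.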
As you can see, the coding system is quite simple. Does anyone want to draw us some new fonts to add to our repertoire? I have spared you the Gothic, Old English, and Greek Hershey fonts, but perhaps there are some other nice ones people might want to use.
Drawgram
Written by Joseph Felsenstein and James McGill.
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved.
License terms here.
Drawgram interactively plots a cladogram- or phenogram-like rooted tree diagram, with many options including orientation of tree and branches, style of tree, label sizes and angles, tree depth, margin sizes, stem lengths, and placement of nodes in the tree. Particularly if you can use your computer to preview the plot, you can very effectively adjust the details of the plotting to get just the kind of plot you want.
To understand the working of Drawgram you should first read the Tree Drawing Programs web page in this documentation.
Java Interface
All Phylip programs will get Java interfaces in the 4.0 release. But under some operating systems there are currently serious problems with Drawgram, so it has received its Java interface early as part of the 3.695 bug fix release. We do not anticipate changing this Java interface substantially in the 3.7 release, but don't be surprised if we do.
This new Java interface supersedes the old character-mode menu interface. PHYLIP also contains versions of Drawgram and Drawtree that have the character-mode menu interface. We have kept these available because PHYLIP is used in many places as part of pipelines driven by scripts. Since these scripts do not usually invoke previewing, we have disabled previewing in the character-mode version of Drawgram in this release. Previewing is available in the version of Drawgram that has the interactive Java interface.
The Java interface is different from the previous character-mode menu interface; it calls the C code of Drawgram, which is in a dynamic library. Thus, after the previewing is done, the code producing the final plot file should make plots that are indistinguishable from those produced by previous versions of Drawgram.
The Java Drawgram Interface is a modern GUI. It will run only on a machine that has a recent version of Oracle Java installed. This is not a serious limitation because Java is freeware that is universally available.
When you start the Drawgram Java interface it looks similar to the following, which has been edited to generate the plot which follows:
It has all the usual GUI functionality: input and output file selectors, drop down menu options, data entry boxes and toggles. "Preview" brings up a nearly WYSIWYG preview window that displays the Postscript plot created by the current settings (the fonts used in the previewing window are not the same, but use Serif, SansSerif, and Monospaced fonts that approximate the PostScript fonts that are used in the output plot):
Each time you select "Preview" another preview window is generated, so that multiple previews can be visible. This allows you to compare various display options. When the plot has been fine tuned, clicking "Create Plot File" writes the Postscript file that generated the last Preview to the plot file specified. Note that if there are multiple preview windows open, the most recent one is the one that shows how the tree in the final plot file will look, since it will be plotted using the most recent settings.
All the functionality in the Java GUI is the same as in the equivalent menu item in the character-mode menu interface. To ease the transition, we have kept the text in the Java GUI as close as possible to the description in the character-mode menu interface. So, for example, "S" in the old interface, which has the description "Tree style", has the counterpart "Tree style" in the new interface. The detailed explanations of each label are found below.
To understand the working of Drawgram and Drawtree, you should first read the Tree Drawing Programs web page in this documentation.
The Command Line Interface gives the user access to a huge collection of both display systems and output formats (some of them are historical curiosities at this point, but they still work so there is no reason to remove them). It can also be driven by scripting because it is a command line interface. But, as most users have little experience with command line systems, it is a bit daunting.
As with Drawtree, to run Drawgram you need a compiled copy of the program, a font file, and a tree file. The tree file has a default name of intree. The font file has a default name of "fontfile". If there is no file of that name, the program will ask you for the name of a font file (we provide ones that have the names font1 through font6). Once you decide on a favorite one of these, you could make a copy of it and call it fontfile, and it will then be used by default. Note that the program will get confused if the input tree file has the number of trees on the first line of the file, so that number may have to be removed.
Once these choices have been made you will see the central menu of the program, which looks like this:
Rooted tree plotting program version 3.695

Here are the settings:
 0  Screen type (IBM PC, ANSI):  ANSI
 P       Final plotting device:  Postscript printer
 V           Previewing device:  X Windows display
 H                  Tree grows:  Horizontally
 S                  Tree style:  Phenogram
 B          Use branch lengths:  (no branch lengths available)
 L             Angle of labels:  90.0
 R      Scale of branch length:  Automatically rescaled
 D       Depth/Breadth of tree:  0.53
 T      Stem-length/tree-depth:  0.05
 C    Character ht / tip space:  0.3333
 A             Ancestral nodes:  Centered
 F                        Font:  Times-Roman
 M          Horizontal margins:  1.65 cm
 M            Vertical margins:  2.16 cm

 Y to accept these or type the letter for one to change
These are the settings that control the appearance of the tree, which has already been read in. You can either accept these as is, in which case you would answer Y to the question and press the Return or Enter key, or you can answer N if you want to change one, or simply type the character corresponding to the one you want to change (if you answer N it will simply ask you which one you want to change anyway).
For a first run in the Java interface version, you might accept these default values and see what the result looks like.
You can resize the preview window, though you may have to ask the system to redraw the preview to see it at the new window size.
Once you are finished looking at the preview, you will want to specify whether the program should make the final plot or change some of the settings. The possible settings are listed below.
When you are ready to produce the final plot file, you should use the button "Create Plot File" (if you are using the Java interface) or you should type Y (if you are using the character-mode menu). In the Java-interface version, the name of the plot file has been set in the dialog box near the top of the Java window. It defaults to plotfile.ps. In the character-mode menu, the file name defaults to plotfile.
If there is already a file of that name, the program will ask you whether you want to Overwrite the file, Append to the file, or Quit (in the character-mode menu version it also gives the option of writing to a new file whose name you will be asked to supply).
Below I will describe the options one by one; you may prefer to skip reading this unless you are puzzled about one of them.
In spite of the words "cladogram" and "phenogram", there is no implication of the extent to which you consider these diagrams as being genealogies or phenetic clustering diagrams. The names refer to pictorial style, not your own intended final use for the diagram. The six styles can be described as follows (assuming a vertically growing tree):
You should experiment with these and decide which you want -- it depends very much on the effect you want.
Should interior node positions:
 be Intermediate between their immediate descendants,
    Weighted average of tip positions
    Centered among their ultimate descendants
    iNnermost of immediate descendants
 or so that tree is V-shaped
 (type I, W, C, N or V):
The five methods (Intermediate, Weighted, Centered, Innermost, and V-shaped) are different horizontal positionings of the interior nodes. It will be helpful to you to try these out and see which you like best. Intermediate places the node halfway between its immediate descendants (horizontally), Weighted places it closer to that descendant who is closer vertically as well, and Centered centers the node below the horizontal positions of the tips that are descended from that node. You may want to choose that option that prevents lines from crossing each other.
V-shaped is another option, one designed, if there are no branch lengths being used, to yield a v-shaped tree of regular appearance. At the moment it can give somewhat weird trees; we intend to make it better in the next release. With branch lengths it will not necessarily make the tree perfectly V-shaped. "Innermost" is the most unusual option: it chooses a center for the tree, and always places interior nodes below the innermost of their immediate descendants. This leads to a tree that has vertical lines in the center, like a tree with a trunk.
If the tree you are plotting has a full set of lengths, then when it is read in, the node position option is automatically set to "intermediate", which is the setting with the least likelihood of lines in the tree crossing. If it does not have lengths the option is set to "V-shaped". If you change the option which tells the program whether to try to use the branch lengths, then the node position option will automatically be reset to the appropriate one of these defaults. This may be confusing if you do not realise that it is happening.
I recommend that you try all of these options (particularly if you can preview the trees). It is of particular use to try combinations of the style of tree (option S) with the different methods of placing interior nodes (option A). You will find that a wide variety of effects can be achieved.
Afterword

I would appreciate suggestions for improvements in Drawgram, but please be aware that the source code is already very large and I may not be able to implement all suggestions.
Drawtree
Written by Joseph Felsenstein and James McGill.
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved.
License terms here.
Drawtree interactively plots an unrooted tree diagram, with many options including orientation of tree and branches, label sizes and angles, and margin sizes. Particularly if you can use your computer screen to preview the plot, you can very effectively adjust the details of the plotting to get just the kind of plot you want.
To understand the working of Drawtree you should first read the Tree Drawing Programs web page in this documentation.
Java Interface
All Phylip programs will get Java interfaces in the 4.0 release. But under some operating systems there are currently serious problems with Drawtree, so it has received its Java interface early as part of the 3.695 bug fix release. We do not anticipate changing this Java interface substantially in the 4.0 release, but don't be surprised if we do.
This new Java interface supersedes the old character-mode menu interface. PHYLIP also contains versions of Drawgram and Drawtree that have the character-mode menu interface. We have kept these available because PHYLIP is used in many places as part of pipelines driven by scripts. Since these scripts do not usually invoke previewing, we have disabled previewing in the character-mode version of Drawtree in this release. Previewing is available in the version of Drawtree that has the interactive Java interface.
The Java interface is different from the previous character-mode menu interface; it calls the C code of Drawtree, which is in a dynamic library. Thus, after the previewing is done, the code producing the final plot file should make plots that are indistinguishable from those produced by previous versions of Drawtree.
The Java Drawtree Interface is a modern GUI. It will run only on a machine that has a recent version of Oracle Java installed. This is not a serious limitation because Java is freeware that is universally available.
When you start the Drawtree Java interface it looks similar to the following, which has been edited to generate the plot which follows:
It has all the usual GUI functionality: input and output file selectors, drop down menu options, data entry boxes and toggles. "Preview" brings up a nearly WYSIWYG preview window that displays the Postscript plot created by the current settings:
Each time you select "Preview" another preview window is generated, so that multiple previews can be visible. This allows you to compare various display options. When the plot has been fine tuned, clicking "Create Plot File" writes the Postscript file that generated the last Preview to the plot file specified. Note that if there are multiple preview windows open, the most recent one is the one that shows how the tree in the final plot file will look, since it will be plotted using the most recent settings.
All the functionality in the Java GUI is the same as in the equivalent menu item in the character-mode menu interface. To ease the transition, we have kept the text in the Java GUI as close as possible to the description in the character-mode menu interface. So, for example, "L" in the old interface, which has the helper message "Angle of labels", maps to "Angle of labels" in the new interface. All the detailed explanations of each label are found below.
The Command Line Interface gives the user access to a huge collection of both display systems and output formats (some of them are historical curiosities at this point, but they still work so there is no reason to remove them). It can also be driven by scripting because it is a command line interface. But, as most users have little experience with command line systems, it is a bit daunting.
As with Drawgram, to run Drawtree you need a compiled copy of the program, a font file, and a tree file. The tree file has a default name of intree. The font file has a default name of "fontfile". If there is no file of that name, the program will ask you for the name of a font file (we provide ones that have the names font1 through font5). Once you decide on a favorite one of these, you could make a copy of it and call it fontfile, and it will then be used by default.
Once these choices have been made you will see the central menu of the program, which looks like this:
Unrooted tree plotting program version 3.695

Here are the settings:
 0  Screen type (IBM PC, ANSI)?  ANSI
 P       Final plotting device:  Postscript printer
 B          Use branch lengths:  (no branch lengths available)
 L             Angle of labels:  branch points to Middle of label
 R            Rotation of tree:  90.0
 I     Iterate to improve tree:  Equal-Daylight algorithm
 D  Try to avoid label overlap?  No
 S      Scale of branch length:  Automatically rescaled
 C   Relative character height:  0.3333
 F                        Font:  Times-Roman
 M          Horizontal margins:  1.65 cm
 M            Vertical margins:  2.16 cm
 #           Page size submenu:  one page per tree

 Y to accept these or type the letter for one to change
These are the settings that control the appearance of the tree, which has already been read in. You can either accept these as is, in which case you would answer Y to the question and press the Return or Enter key, or you can answer N if you want to change one, or simply type the character corresponding to the one you want to change (if you answer N it will simply ask you which one you want to change anyway).
For a first run in the Java interface version you might accept these default values and see what the result looks like.
You can resize the preview window, though you may have to ask the system to redraw the preview to see it at the new window size.
Once you are finished looking at the preview, you will want to specify whether the program should make the final plot or change some of the settings. The possible settings are listed below.
When you are ready to produce the final plot file, you should use the button "Create Plot File" (if you are using the Java interface) or you should type Y (if you are using the character-mode menu). In the Java-interface version, the name of the plot file has been set in the dialog box near the top of the Java window. It defaults to plotfile.ps. In the character-mode menu, the file name defaults to plotfile.
If there is already a file of that name, the program will ask you whether you want to Overwrite the file, Append to the file, or Quit (in the character-mode menu version it also gives the option of writing to a new file whose name you will be asked to supply).
Below I will describe the options one by one; you may prefer to skip reading this unless you are puzzled about one of them.
I recommend that you try all of these options (particularly if you can preview the trees). It is of particular use to try trees with different iteration methods (option I) and with regularization (option G). You will find that a variety of effects can be achieved.
Afterword

I would appreciate suggestions for improvements in Drawtree, but please be aware that the source code is already very large and I may not be able to implement all suggestions.
Factor
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program factors a data set that contains multistate characters, creating a data set consisting entirely of binary (0,1) characters that, in turn, can be used as input to any of the other discrete character programs in this package, except for PARS. Besides this primary function, Factor also provides an easy way of deleting characters from a data set. The input format for Factor is very similar to the input format for the other discrete character programs except for the addition of character-state tree descriptions.
Note that this program has no way of converting an unordered multistate character into binary characters. Fortunately, PARS has joined the package, and it enables unordered multistate characters, in which any state can change to any other in one step, to be analyzed with parsimony.
Factor is really for a different case, that in which there are multiple states related on a "character state tree", which specifies for each state which other states it can change to. That graph of states is assumed to be a tree, with no loops in it.
The first line of the input file should contain the number of species and the number of multistate characters. This first line is followed by the lines describing the character-state trees, one description per line. The species information constitutes the last part of the file. Any number of lines may be used for a single species.
The first line is free format with the number of species first, separated by at least one blank (space) from the number of multistate characters, which in turn is separated by at least one blank from the options, if present.
The options are selected from a menu that looks like this:
Factor -- multistate to binary recoding program, version 3.69

Settings for this run:
 A                  put ancestral states in output file?  No
 F               put factors information in output file?  No
 0                  Terminal type (IBM PC, ANSI, none)?  (none)
 1                  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)
The options particular to this program are:
The character-state trees are described in free format. The character number of the multistate character is given first followed by the description of the tree itself. Each description must be completed on a single line. Each character that is to be factored must have a description, and the characters must be described in the order that they occur in the input, that is, in numerical order.
The tree is described by listing the pairs of character states that are adjacent to each other in the character-state tree. The two character states in each adjacent pair are separated by a colon (":"). If character fifteen has this character state tree for possible states "A", "B", "C", and "D":
A ---- B ---- C
       |
       |
       |
       D
then the character-state tree description would be
15 A:B B:C D:B
Note that either symbol may appear first. The ancestral state is identified, if desired, by putting it "adjacent" to a period. If we wanted to root character fifteen at state C:
A <--- B <--- C
       |
       |
       V
       D
we could write
15 B:D A:B C:B .:C
Both the order in which the pairs are listed and the order of the symbols in each pair are arbitrary. However, each pair may only appear once in the list. Any symbols may be used for a character state in the input except the character that signals the connection between two states (in the distribution copy this is set to ":"), ".", and, of course, a blank. Blanks are ignored completely in the tree description so that even B:DA:BC:B.:C or B : DA : BC : B. : C would be equivalent to the above example. However, at least one blank must separate the character number from the tree description.
If no description line appears in the input for a particular character, then that character will be omitted from the output. If the character number is given on the line, but no character-state tree is provided, then the symbol for the character in the input will be copied directly to the output without change. This is useful for characters that are already coded "0" and "1". Characters can be deleted from a data set simply by listing only those that are to appear in the output.
The last character-state tree description should be followed by a line containing the number "999". This terminates processing of the trees and indicates the beginning of the species information.
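As a concrete illustration of the free-format rules above, here is a small Python sketch of a parser for one character-state tree description line (parse_state_tree is a hypothetical helper written for this documentation, not code taken from Factor itself):

```python
def parse_state_tree(line, factchar=":"):
    """Parse one character-state tree description line.

    Blanks are ignored within the tree description, so "B:D A:B C:B .:C"
    and "B:DA:BC:B.:C" parse identically. A pair containing "." marks
    the ancestral state. Returns (character number, list of state
    pairs, ancestral state or None).
    """
    number, _, rest = line.strip().partition(" ")
    text = rest.replace(" ", "")   # blanks in the description are ignored
    pairs, root = [], None
    i = 0
    while i < len(text):
        a, sep, b = text[i], text[i + 1], text[i + 2]
        assert sep == factchar, "states must be joined by the connector"
        if a == ".":
            root = b
        elif b == ".":
            root = a
        else:
            pairs.append((a, b))
        i += 3                     # each pair occupies exactly 3 symbols
    return int(number), pairs, root

print(parse_state_tree("15 B:D A:B C:B .:C"))
```

Running this on the rooted example above reports character 15, the pairs (B,D), (A,B), (C,B), and ancestral state C, whether or not the blanks between pairs are present.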
The format for the species information is basically identical to the other discrete character programs. The first ten character positions are allotted to the species name (this value may be changed by altering the value of the constant nmlngth at the beginning of the program). The character states follow and may be continued to as many lines as desired. There is no current method for indicating polymorphisms. It is possible to either put blanks between characters or not.
There is a method for indicating uncertainty about states. There is one character value that stands for "unknown". If this appears in the input data then "?" is written out in all the corresponding positions in the output file. The character value that designates "unknown" is given in the constant unkchar at the beginning of the program, and can be changed by changing that constant. It is set to "?" in the distribution copy.
The first line of output will contain the number of species and the number of binary characters in the factored data set. The factored characters will be written for each species in the format required for input by the other discrete programs in the package. The maximum length of the output lines is 80 characters, but this maximum length can be changed prior to compilation.
If the A (Ancestors) option was chosen, an output file whose default name is ancestors will be written with the ancestors information. If F (Factors) was chosen in the menu, an output file whose default name is factors will be written containing the factors information.
ERRORS
The output should be checked for error messages. Errors will occur in the character-state tree descriptions if the format is incorrect (colons in the wrong place, etc.), if more than one root is specified, if the tree contains loops (and hence is not a tree), and if the tree is not connected, e.g.
A:B B:C D:E
describes
A ---- B ---- C          D ---- E
This "tree" is in two unconnected pieces. An error will also occur if a symbol appears in the data set that is not in the tree description for that character. Blanks at the end of lines when the species information is continued to a new line will cause this kind of error.
At the beginning of the program a number of constants are available to be changed to accommodate larger data sets. These are "maxstates", "maxoutput", "sizearray", "factchar" and "unkchar". The constant "maxstates" gives the maximum number of states per character (set at 20 in the distribution copy). The constant "maxoutput" gives the maximum width of a line in the output file (80 in the distribution copy). The constant "sizearray" must be at least as large as the sum of squares of the numbers of states in the characters. It is initially set to 2000, so that although 20 states are allowed (at the initial setting of maxstates) per character, there cannot be 20 states in every one of 100 characters.
Particularly important constants are "factchar" and "unkchar" which are not numerical values but a character. Initially set to the colon ":", "factchar" is the character that will be used to separate states in the input of character state trees. It can be changed by changing this constant. (We could have used a hyphen ("-") but didn't because that would make the minus-sign ("-") unavailable as a character state in +/- characters). The constant "unkchar" is the character value in the input data that indicates that the state is unknown. It is set to "?" in the distribution copy. If your computer is one that lacks the colon ":" in its character set or uses a nonstandard character code such as EBCDIC, you will want to change the constant "factchar".
The input file for the program has the default file name "infile" and the output file, the one that has the binary character state data, has the name "outfile".
----SAMPLE INPUT-----

4 6
1 A:B B:C
2 A:B B:.
4
5 0:1 1:2 .:0
6 .:# #:$ #:%
999
Alpha     CAW00#
Beta      BBX01%
Gamma     ABY12#
Epsilon   CAZ01$

-----Comments (not part of input file)-----

4 species; 6 characters
A ---- B ---- C
B ---> A
Character 3 deleted; 4 unchanged
0 ---> 1 ---> 2
% <--- # ---> $
Signals end of trees
Species information begins
---SAMPLE OUTPUT-----

4 8
Alpha     11100000
Beta      10001001
Gamma     00011100
Epsilon   11101010

-----Comments (not part of output file)-----

4 species; 8 factors
Chars. 1 and 2 come from old number 1
Char. 3 comes from old number 2
Char. 4 is old number 4
Chars. 5 and 6 come from old number 5
Chars. 7 and 8 come from old number 6
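The recoding in the sample above can be reproduced by treating each edge of a character-state tree as one binary factor: a state scores 1 for every edge on the path from the ancestral state to it. Here is a Python sketch of that idea (our reading of the method for illustration, not Factor's actual source code):

```python
from collections import deque

def factor_states(pairs, root):
    """Recode the states of one character-state tree as binary factors.

    pairs is a list of adjacent state pairs; root is the ancestral state.
    Each edge of the state tree becomes one binary character, ordered by
    discovery from the root; a state's code has a 1 for every edge on
    the path from the ancestral state to that state.
    """
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    edges = []                # edge order: breadth-first from the root
    path = {root: []}         # edges on the path from root to each state
    queue = deque([root])
    while queue:
        s = queue.popleft()
        for t in adj[s]:
            if t not in path:
                edges.append((s, t))
                path[t] = path[s] + [(s, t)]
                queue.append(t)
    return {s: "".join("1" if e in path[s] else "0" for e in edges)
            for s in path}

# Character 1 from the sample: A ---- B ---- C, rooted at A
print(factor_states([("A", "B"), ("B", "C")], "A"))
```

For character 1 this yields A -> 00, B -> 10, C -> 11, matching the first two columns of the sample output (Gamma has A, Beta has B, Alpha and Epsilon have C).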
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program carries out Fitch-Margoliash, Least Squares, and a number of similar methods as described in the documentation file for distance methods.
The options for Fitch are selected through the menu, which looks like this:
Fitch-Margoliash method version 3.69

Settings for this run:
  D      Method (F-M, Minimum Evolution)?  Fitch-Margoliash
  U                 Search for best tree?  Yes
  P                                Power?  2.00000
  -      Negative branch lengths allowed?  No
  O                        Outgroup root?  No, use as outgroup species  1
  L         Lower-triangular data matrix?  No
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  G                Global rearrangements?  No
  J     Randomize input order of species?  No. Use input order
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

  Y to accept these or type the letter for one to change
Most of the input options (U, P, -, O, L, R, S, J, and M) are as given in the documentation page for distance matrix programs, and their input format is the same as given there. The U (User Tree) option has one additional feature when the N (Lengths) option is used. This menu option will appear only if the U (User Tree) option is selected. If N (Lengths) is set to "Yes" then if any branch in the user tree has a branch length, that branch will not have its length iterated. Thus you can prevent all branches from having their lengths changed by giving them all lengths in the user tree, or hold only one length unchanged by giving only that branch a length (such as, for example, 0.00). You may find program Retree useful for adding and removing branch lengths from a tree. This option can also be used to compute the Average Percent Standard Deviation for a tree obtained from Neighbor, for comparison with trees obtained by Fitch or Kitsch.
The D (methods) option allows choice between the Fitch-Margoliash criterion and the Minimum Evolution method (Kidd and Sgaramella-Zonta, 1971; Rzhetsky and Nei, 1993). Minimum Evolution (not to be confused with parsimony) uses the Fitch-Margoliash criterion to fit branch lengths to each topology, but then chooses topologies based on their total branch length (rather than the goodness of fit sum of squares). There is no constraint on negative branch lengths in the Minimum Evolution method; it sometimes gives rather strange results, as it can like solutions that have large negative branch lengths, as these reduce the total sum of branch lengths!
Another input option available in Fitch that is not available in Kitsch or Neighbor is the G (Global) option. G is the Global search option. This causes, after the last species is added to the tree, each possible group to be removed and re-added. This improves the result, since the position of every species is reconsidered. It approximately triples the run-time of the program. It is not an option in Kitsch because it is the default and is always in force there. The O (Outgroup) option is described in the main documentation file of this package. The O option has no effect if the tree is a user-defined tree (if the U option is in effect). The U (User Tree) option requires an unrooted tree; that is, it requires that the tree have a trifurcation at its base:
((A,B),C,(D,E));
The output consists of an unrooted tree and the lengths of the interior segments. The sum of squares is printed out, and if P = 2.0 Fitch and Margoliash's "average percent standard deviation" is also computed and printed out. This is the sum of squares, divided by N-2, and then square-rooted and then multiplied by 100:
APSD = ( SSQ / (N-2) )^(1/2) x 100.
where N is the total number of off-diagonal distance measurements that are in the (square) distance matrix. If the S (subreplication) option is in force it is instead the sum of the numbers of replicates in all the non-diagonal cells of the distance matrix. But if the L or R option is also in effect, so that the distance matrix read in is lower- or upper-triangular, then the sum of replicates is only over those cells actually read in. If S is not in force, the number of replicates in each cell is assumed to be 1, so that N is n(n-1), where n is the number of species. The APSD gives an indication of the average percentage error. The number of trees examined is also printed out.
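As a worked illustration of the formula, here is a Python sketch that computes the weighted sum of squares and the APSD from full square observed and expected distance matrices, assuming one replicate per cell (an illustrative function of ours, not code from Fitch):

```python
import math

def average_percent_sd(obs, exp, power=2.0):
    """Fitch and Margoliash's average percent standard deviation.

    obs and exp are full square distance matrices (lists of lists).
    The sum of squares runs over off-diagonal cells, each term weighted
    by 1 / obs^power (power = 2 is the classic F-M criterion), and
    N = n(n-1) is the number of off-diagonal cells.
    """
    n = len(obs)
    ssq = sum((obs[i][j] - exp[i][j]) ** 2 / obs[i][j] ** power
              for i in range(n) for j in range(n) if i != j)
    N = n * (n - 1)
    return math.sqrt(ssq / (N - 2)) * 100

# A toy 3-species example with a small disagreement on one distance:
obs = [[0, 1.0, 2.0], [1.0, 0, 1.5], [2.0, 1.5, 0]]
exp = [[0, 1.0, 2.1], [1.0, 0, 1.5], [2.1, 1.5, 0]]
print(round(average_percent_sd(obs, exp), 4))  # → 3.5355
```

Here N = 6, the weighted sum of squares is 0.005, and the APSD of about 3.5 indicates roughly a 3.5 percent average error in fitting the tree to these distances.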
The constants available for modification at the beginning of the program are: "smoothings", which gives the number of passes through the algorithm which adjusts the lengths of the segments of the tree so as to minimize the sum of squares, "delta", which controls the size of improvement in sum of squares that is used to control the number of iterations improving branch lengths, and "epsilonf", which defines a small quantity needed in some of the calculations. There is no feature saving multiple trees tied for best, partly because we do not expect exact ties except in cases where the branch lengths make the nature of the tie obvious, as when a branch is of zero length.
The algorithm can be slow. As the number of species rises, so does the number of distances from each species to the others. The speed of this algorithm will thus rise as the fourth power of the number of species, rather than as the third power as do most of the others. Hence it is expected to get very slow as the number of species is made larger.
    7
Bovine      0.0000  1.6866  1.7198  1.6606  1.5243  1.6043  1.5905
Mouse       1.6866  0.0000  1.5232  1.4841  1.4465  1.4389  1.4629
Gibbon      1.7198  1.5232  0.0000  0.7115  0.5958  0.6179  0.5583
Orang       1.6606  1.4841  0.7115  0.0000  0.4631  0.5061  0.4710
Gorilla     1.5243  1.4465  0.5958  0.4631  0.0000  0.3484  0.3083
Chimp       1.6043  1.4389  0.6179  0.5061  0.3484  0.0000  0.2692
Human       1.5905  1.4629  0.5583  0.4710  0.3083  0.2692  0.0000
   7 Populations

Fitch-Margoliash method version 3.69

                  __ __              2
                  \  \    (Obs - Exp)
Sum of squares =  /_ /_   ------------
                                  2
                   i  j        Obs

Negative branch lengths not allowed

Name                       Distances
----                       ---------
Bovine      0.00000  1.68660  1.71980  1.66060  1.52430  1.60430  1.59050
Mouse       1.68660  0.00000  1.52320  1.48410  1.44650  1.43890  1.46290
Gibbon      1.71980  1.52320  0.00000  0.71150  0.59580  0.61790  0.55830
Orang       1.66060  1.48410  0.71150  0.00000  0.46310  0.50610  0.47100
Gorilla     1.52430  1.44650  0.59580  0.46310  0.00000  0.34840  0.30830
Chimp       1.60430  1.43890  0.61790  0.50610  0.34840  0.00000  0.26920
Human       1.59050  1.46290  0.55830  0.47100  0.30830  0.26920  0.00000

[tree diagram: an unrooted tree with interior nodes 1-5, joining Mouse and Bovine at node 1, Gibbon at node 2, Orang at node 3, Gorilla at node 4, and Human and Chimp at node 5; the drawing cannot be reproduced here]

remember: this is an unrooted tree!

Sum of squares =      0.01375

Average percent standard deviation =   1.85418

Between        And            Length
-------        ---            ------
   1          Mouse             0.76985
   1             2              0.41983
   2             3              0.04986
   3             4              0.02121
   4             5              0.03695
   5          Human             0.11449
   5          Chimp             0.15471
   4          Gorilla           0.15680
   3          Orang             0.29209
   2          Gibbon            0.35537
   1          Bovine            0.91675
phylip-3.697/doc/gendist.html
© Copyright 1986-2014 by Joseph Felsenstein. All rights reserved. License terms here.
This program computes any one of three measures of genetic distance from a set of gene frequencies in different populations (or species). The three are Nei's genetic distance (Nei, 1972), Cavalli-Sforza's chord measure (Cavalli-Sforza and Edwards, 1967), and Reynolds, Weir, and Cockerham's (1983) genetic distance. These are written to an output file in a format that can be read by the distance matrix phylogeny programs Fitch and Kitsch.
The three measures have somewhat different assumptions. All assume that all differences between populations arise from genetic drift. Nei's distance is formulated for an infinite isoalleles model of mutation, in which there is a rate of neutral mutation and each mutation creates a completely new allele. It is assumed that all loci have the same rate of neutral mutation, and that the genetic variability initially in the population is at equilibrium between mutation and genetic drift, with the effective population size of each population remaining constant.
Nei's distance is:
\[
D \;=\; -\ln\!\left(
\frac{\displaystyle\sum_{m}\sum_{i} p_{1mi}\,p_{2mi}}
     {\Bigl[\displaystyle\sum_{m}\sum_{i} p_{1mi}^{2}\Bigr]^{1/2}
      \Bigl[\displaystyle\sum_{m}\sum_{i} p_{2mi}^{2}\Bigr]^{1/2}}
\right)
\]
where m is summed over loci, i over alleles at the m-th locus, and where p_{1mi} is the frequency of the i-th allele at the m-th locus in population 1. Subject to the above assumptions, Nei's genetic distance is expected, for a sample of sufficiently many equivalent loci, to rise linearly with time.
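The formula translates directly into code. This Python sketch is an illustration under the stated definitions (the function name and the list-of-loci data layout are assumptions, not Gendist's own source):

```python
import math

def nei_distance(pop1, pop2):
    """Nei's (1972) standard genetic distance between two populations.
    pop1 and pop2 are lists of loci; each locus is a list of allele
    frequencies, so pop1[m][i] is p_1mi in the formula above."""
    j12 = j11 = j22 = 0.0
    for locus1, locus2 in zip(pop1, pop2):
        for p1, p2 in zip(locus1, locus2):
            j12 += p1 * p2   # sum over m and i of p1mi * p2mi
            j11 += p1 * p1
            j22 += p2 * p2
    return -math.log(j12 / math.sqrt(j11 * j22))
```

Two populations with identical frequencies are at distance zero, and the distance grows without bound as the populations diverge.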
The other two genetic distances assume that there is no mutation, and that all gene frequency changes are by genetic drift alone. However they do not assume that population sizes have remained constant and equal in all populations. They cope with changing population size by having expectations that rise linearly not with time, but with the sum over time of 1/N, where N is the effective population size. Thus if population size doubles, genetic drift will be taking place more slowly, and the genetic distance will be expected to be rising only half as fast with respect to time. Both genetic distances are different estimators of the same quantity under the same model.
Cavalli-Sforza's chord distance is given by
\[
D^{2} \;=\; 4\sum_{m}\Bigl[\,1 \;-\; \sum_{i} p_{1mi}^{1/2}\,p_{2mi}^{1/2}\Bigr]
\Big/ \sum_{m}\,(a_{m}-1)
\]
where m indexes the loci, i is summed over the alleles at the m-th locus, and a_m is the number of alleles at the m-th locus. It can be shown that this distance always satisfies the triangle inequality. Note that as given here it is divided by the number of degrees of freedom, the sum over loci of the number of alleles minus one. The quantity which is expected to rise linearly with the amount of genetic drift (the sum of 1/N over time) is D squared, the quantity computed above, and that is what is written out into the distance matrix.
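A hedged sketch of the chord measure in Python (illustrative only; the helper name and data layout are assumptions):

```python
import math

def chord_distance_squared(pop1, pop2):
    """Cavalli-Sforza chord measure, squared and divided by the degrees
    of freedom, as in the formula above:
    D^2 = 4 * sum_m [1 - sum_i sqrt(p1mi * p2mi)] / sum_m (a_m - 1)"""
    numerator = 0.0
    dof = 0
    for locus1, locus2 in zip(pop1, pop2):
        numerator += 1.0 - sum(math.sqrt(p1 * p2)
                               for p1, p2 in zip(locus1, locus2))
        dof += len(locus1) - 1   # a_m - 1 degrees of freedom per locus
    return 4.0 * numerator / dof
```

The returned value is D squared, the quantity written into the distance matrix.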
Reynolds, Weir, and Cockerham's (1983) genetic distance is
\[
D^{2} \;=\;
\frac{\displaystyle\sum_{m}\sum_{i}\bigl[\,p_{1mi}-p_{2mi}\,\bigr]^{2}}
     {\displaystyle 2\sum_{m}\Bigl[\,1-\sum_{i} p_{1mi}\,p_{2mi}\Bigr]}
\]
where the notation is as before and D^2 is the quantity that is expected to rise linearly with cumulated genetic drift.
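For comparison with the two sketches above, the Reynolds, Weir, and Cockerham distance can be coded the same way (again an illustration with an assumed function name, not the program's own source):

```python
def reynolds_distance_squared(pop1, pop2):
    """Reynolds, Weir & Cockerham (1983) distance, as in the formula
    above:  D^2 = sum_m sum_i (p1mi - p2mi)^2
                  / (2 * sum_m [1 - sum_i p1mi * p2mi])"""
    numerator = 0.0
    denominator = 0.0
    for locus1, locus2 in zip(pop1, pop2):
        numerator += sum((p1 - p2) ** 2 for p1, p2 in zip(locus1, locus2))
        denominator += 1.0 - sum(p1 * p2 for p1, p2 in zip(locus1, locus2))
    return numerator / (2.0 * denominator)
```

Like the chord measure, it returns D squared, which rises linearly with cumulated 1/N rather than with time.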
Having computed one of these genetic distances, one which you feel is appropriate to the biology of the situation, you can use it as the input to the programs Fitch, Kitsch, or Neighbor. Keep in mind that the statistical model in those programs implicitly assumes that the distances in the input table have independent errors. For any measure of genetic distance this will not be true, as bursts of random genetic drift, or sampling events in drawing the sample of individuals from each population, cause fluctuations of gene frequency that affect many distances simultaneously. While this is not expected to bias the estimate of the phylogeny, it does mean that the weighing of evidence from all the different distances in the table will not be done with maximal efficiency. One issue is which value of the P (Power) parameter should be used. This depends on how the variance of a distance rises with its expectation. For Cavalli-Sforza's chord distance, and for the Reynolds et al. distance, it can be shown that the variance of the distance will be proportional to the square of its expectation; this suggests a value of 2 for P, which is the default value for Fitch and Kitsch (there is no P option in Neighbor).
If you think that the pure genetic drift model is appropriate, and are thus tempted to use the Cavalli-Sforza or Reynolds et al. distances, you might consider using the maximum likelihood program Contml instead. It will correctly weigh the evidence in that case. Like those genetic distances, it uses approximations that break down as loci start to drift all the way to fixation. Although Nei's distance will not break down in that case, it makes other assumptions about equality of substitution rates at all loci and constancy of population sizes.
The most important thing to remember is that genetic distance is not an abstract, idealized measure of "differentness". It is an estimate of a parameter (time or cumulated inverse effective population size) of the model which is thought to have generated the differences we see. As an estimate, it has statistical properties that can be assessed, and we should never have to choose between genetic distances based on their aesthetic properties, or on the personal prestige of their originators. Considering them as estimates focuses us on the questions which genetic distances are intended to answer, for if there are none there is no reason to compute them. For further perspective on genetic distances, I recommend my own paper evaluating different genetic distances (Felsenstein, 1985c), Reynolds, Weir, and Cockerham (1983), and the material in Nei's book (Nei, 1987).
The input to this program is standard and is as described in the Gene Frequencies and Continuous Characters Programs documentation file above. It consists of the number of populations (or species), the number of loci, and after that a line containing the numbers of alleles at each of the loci. Then the gene frequencies follow in standard format.
The options are selected using a menu:
Genetic Distance Matrix program, version 3.69

Settings for this run:
  A   Input file contains all alleles at each locus?  One omitted at each locus
  N                       Use Nei genetic distance?  Yes
  C               Use Cavalli-Sforza chord measure?  No
  R                  Use Reynolds genetic distance?  No
  L                       Form of distance matrix?  Square
  M                     Analyze multiple data sets?  No
  0             Terminal type (IBM PC, ANSI, none)?  ANSI
  1           Print indications of progress of run?  Yes

Y to accept these or type the letter for one to change
The A (All alleles) option is described in the Gene Frequencies and Continuous Characters Programs documentation file. As with Contml, it is the signal that all alleles are represented in the gene frequency input, without one being left out per locus. C, N, and R are the signals to use the Cavalli-Sforza, Nei, or Reynolds et al. genetic distances, respectively. The Nei distance is the default, and it will be computed if none of these options is explicitly invoked. The L option is the signal that the distance matrix is to be written out in Lower triangular form. The M option is the usual Multiple Data Sets option, useful for doing bootstrap analyses with the distance matrix programs. It allows multiple data sets, but does not allow multiple sets of weights (since there is no provision for weighting in this program).
The output file simply contains on its first line the number of species (or populations). Each species (or population) starts a new line, with its name printed out first, and then there are up to nine genetic distances printed on each line, in the standard format used as input by the distance matrix programs. The output, in its default form, is ready to be used in the distance matrix programs.
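That layout can be sketched in Python as follows (the helper name and the exact column widths are assumptions for illustration; Gendist's C source may space the columns differently):

```python
def write_distance_matrix(names, dist, lower_triangular=False):
    """Write a distance matrix in the layout described above: the
    species count on the first line, then one row per species starting
    with its name, wrapping after nine distances per line."""
    lines = ["%5d" % len(names)]
    for i, name in enumerate(names):
        # With the L option only the cells below the diagonal are written.
        values = dist[i][:i] if lower_triangular else dist[i]
        row = "%-10s" % name[:10]
        for k, v in enumerate(values):
            if k > 0 and k % 9 == 0:   # at most nine distances per line
                lines.append(row.rstrip())
                row = " " * 10
            row += "%9.6f " % v
        lines.append(row.rstrip())
    return "\n".join(lines)
```

Run on the sample data below, this produces a square matrix that Fitch, Kitsch, or Neighbor can read directly.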
A constant "epsilong", which can be changed by the user if the program is recompiled, defines a small quantity used when checking whether the allele frequencies at a locus sum to more than one: if all alleles are input (option A) and the sum differs from 1 by more than epsilong, or if not all alleles are input and the sum is greater than 1 by more than epsilong, the program will see this as an error and stop. You may find that this causes difficulties if your gene frequencies have been rounded. I have tried to keep epsilong from being so small that it causes such problems.
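The check just described can be sketched as follows. The value of EPSILONG here is an arbitrary placeholder and the function is hypothetical; the real constant and error handling live in the Gendist C source:

```python
EPSILONG = 0.02  # placeholder value, not the constant from the C source

def check_frequency_sum(freqs, all_alleles_given):
    """Sanity check on one locus, as described above.  freqs holds the
    allele frequencies read for the locus; all_alleles_given mirrors
    the A (All alleles) option."""
    total = sum(freqs)
    if all_alleles_given:
        # All alleles present: the sum must be 1 within tolerance.
        if abs(total - 1.0) > EPSILONG:
            raise ValueError("allele frequencies do not sum to 1")
    elif total > 1.0 + EPSILONG:
        # One allele omitted: the sum may be below 1, but not above it.
        raise ValueError("allele frequencies sum to more than 1")
```

The tolerance exists precisely so that rounded input frequencies (say, 0.33 + 0.33 + 0.33) are not rejected.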
The program is quite fast and the user should effectively never be limited by the amount of time it takes. All that the program has to do is read in the gene frequency data and then evaluate a genetic distance formula for each pair of species. This should require an amount of effort proportional to the total number of alleles summed over loci, and to the square of the number of populations.
The main change that will be made to this program in the future is to add provisions for taking into account the sample size for each population. The genetic distance formulas have been modified by their inventors to correct for the inaccuracy of the estimate of the genetic distances, which on the whole should artificially increase the distance between populations by a small amount dependent on the sample sizes. The main difficulty with doing this is that I have not yet settled on a format for putting the sample size in the input data along with the gene frequency data for a species or population.
I may also include other distance measures, but only if I think their use is justified. There are many very arbitrary genetic distances, and I am reluctant to include most of them.
    5    10
2 2 2 2 2 2 2 2 2 2
European    0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205 0.8055 0.5043
African     0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600 0.7582 0.6207
Chinese     0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726 0.7482 0.7334
American    0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000 0.8086 0.8636
Australian  0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396 0.9097 0.2976
    5
European    0.000000  0.078002  0.080749  0.066805  0.103014
African     0.078002  0.000000  0.234698  0.104975  0.227281
Chinese     0.080749  0.234698  0.000000  0.053879  0.063275
American    0.066805  0.104975  0.053879  0.000000  0.134756
Australian  0.103014  0.227281  0.063275  0.134756  0.000000