DupTree is a program for phylogenetic analyses using gene tree parsimony. That is, given a collection of binary gene trees, DupTree searches for a species supertree that implies the fewest gene duplication events. The user input is a list of rooted or unrooted binary gene trees in Newick format. DupTree first builds an initial species tree using a stepwise addition algorithm. Next, starting from the initial species tree, DupTree searches for a better species tree using a standard search heuristic of the user's choice. The output is the species tree(s) with the lowest reconciliation cost among all trees encountered during the heuristic search, along with the reconciliation cost for each individual gene tree.

General input/output
--------------------

All input and output is handled as plain ASCII text. White space is allowed wherever it does not affect the syntax.

All trees (gene trees or species trees) must be expressed in Newick format, terminated by a semicolon, and they must be fully binary (fully resolved). Multifurcations in the gene trees will trigger a warning or error message. Species names in the species tree must be unique. In the gene trees, the leaf names must correspond to the species from which the gene was sequenced. Multiple genes corresponding to a single species can occur in a gene tree.

    Species Tree
    e.g. ((speciesA,speciesB),speciesC);

    Gene Tree
    e.g. (((speciesA,speciesB),speciesC),speciesA);

Labels with non-alphabetic characters (e.g. spaces) have to be enclosed in apostrophes or quotation marks.

    e.g. speciesA, "species A" or 'species A'

Trees can span multiple lines and contain comments. Comments have to be enclosed in square brackets.

    e.g.
    (
      (speciesA,speciesB), [example comment text]
      speciesC
    );

Gene trees can be unrooted or rooted. An unrooted gene tree will still be stored with an artificial root, but this root is ignored during analyses.
The rooting of a tree can be specified by the prefix [&U] or [&R] for unrooted or rooted, respectively. A gene tree with no prefix is assumed to be rooted. Species trees are always handled as rooted and need no prefix.

    e.g. unrooted gene tree
    [&U]((speciesA,speciesB),speciesC);

    e.g. rooted gene tree
    [&R]((speciesA,speciesB),speciesC);
    OR
    ((speciesA,speciesB),speciesC);

All tools use the standard input and output of the operating system unless the user specifies otherwise. Using the standard input and output allows file-less communication between tools using pipes. It is always possible to specify files and store the input and output in them.

    e.g. file input/output
    duptree -i -o

    e.g. standard input/output example on Unix
    cat | duptree >

DupTree
----------

SYNOPSIS
    duptree [OPTIONS]

DESCRIPTION

Given a collection of rooted and/or unrooted binary gene trees, the default behaviour of duptree is to quickly construct an initial species tree using a stepwise addition algorithm. The resulting species tree will be rooted, fully binary, and will contain each species in the gene trees exactly once. This initial species tree serves as a starting point for the heuristic search that follows.

DupTree can also accept user-given starting species trees. This can be done by using the argument "--generator 0". In this case, the first tree in the input will be interpreted as the starting species tree.

The starting species tree for duptree can contain topological constraints. In the tree search, the species in a constrained subtree will always form a clade. In other words, no species inside the constrained subtree can be moved outside it, and no species outside the constrained subtree can be moved inside it. However, relationships within the constrained subtree can change. Constraints can be specified by using the "--constraints" argument.
For a user-specified species tree, constraints can be specified by attaching [&CONSTRAINT] to the root node of the constrained subtree in the species tree. Note that if constraints are used to build the starting species tree, the resulting starting tree is itself a constrained tree, and the constraints remain in force during the duptree search.

    e.g. (((cat,turtle),dog)[&CONSTRAINT],(rice,arabidopsis));

In this example, cat, turtle, and dog will always form a clade. However, the final species tree might not maintain the relationships within this clade. Thus, the resulting species tree might be

    (((cat,dog),turtle),(rice,arabidopsis));

Next, duptree searches for a species tree with the lowest reconciliation cost using a local search heuristic based on rSPR (rooted subtree pruning and regrafting) operations. The resulting species tree is guaranteed to be locally optimal. However, multiple runs can lead to different species trees depending on the search path taken by the heuristic.

The gene trees can also be weighted. If a gene tree is weighted, its reconciliation cost (number of gene duplications) is multiplied by its weight. Thus, gene trees with a higher weight have a larger impact in the tree search. The weight must be a number between 0 and 1. A weight of 1.0 is equivalent to no weighting, and if the weight of a gene tree is 0.0, the gene tree has no effect on the gene tree parsimony analysis (i.e., it is equivalent to not including the gene tree in the analysis).

A weight can be specified by the prefix [&WEIGHT=]. The value is a floating point number from 0.0 to 1.0. By default a tree has weight 1.0.

    e.g. [&WEIGHT=0.5]((speciesA,speciesB),speciesC);
    e.g. ((speciesA,speciesB),speciesC); [No specified weight is equivalent to a weight of 1]

OPTIONS

    -i, --input
        Input file

    -o, --output
        Output file

    --oformat newick | nexus
        By default the output format is a list of trees in Newick format separated by semicolons. The output can be changed to the NEXUS format.
        The output consists of the best species tree(s) identified by the rSPR heuristic followed by a list of the gene trees with their reconciliation cost.

    --nogenetree
        This option will only output the species tree. By default the input gene trees are listed along with the species trees.

    -g, --generator 0|1|2
        The algorithm used to build the starting species tree.

        0 - don't generate

        1 - leaf adding heuristic [default]
            First we consider all possible triplets of species in all possible topologies. The best triplets, and their corresponding topologies, are stored. Among these best triplets (triplets with the lowest reconciliation cost), we randomly pick a triplet and make that our starting species tree. Next, we consider all possible leaves that have yet to be added, and try to add each of them at all possible positions in the current species tree. Again, all the trees with the best score (lowest reconciliation cost) are stored, and one of these is randomly picked to be the next species tree. In this way, we incrementally add all species to the species tree. All gene trees are assumed to be rooted.

        2 - random tree
            The species tree obtains a random topology.

    -f, --fast
        This option implements a much faster version of the leaf adding heuristic to build a starting tree. Instead of trying out all possible triplets, we randomly pick three leaves from the species leaf set, and then consider all the triplets that these three leaves can form. We randomly pick a triplet with the best score as the starting species tree. Then, we randomly pick another leaf to add to the current species tree. We try to add this leaf in all possible positions in the current species tree, and store all those positions that give the best score. Then we randomly pick a position from among these best locations and add the leaf at that location. This option is much faster than the original leaf adding heuristic, but the starting trees may have higher reconciliation costs.
    --constraints
        A file containing groupings of species for the leaf adding heuristic. Each constraint is a comma separated list of species terminated with a semicolon. In the resulting species tree, species in a constraint will form a clade. Multiple constraints can be specified, but any single species can be part of at most one constraint. In other words, the constraints cannot be nested.

        e.g. example of constraints in a constraint file
            cat, dog, turtle;
            rice, arabidopsis;
        [Note: if these are the two constraints, the additional constraint "cat, dog" is not valid because both species already are part of the first constraint.]

    -e, --heuristic 1 | 2 | 3
        Heuristic used in the search

        1 - Randomized hill climbing [default]
            This heuristic seeks to find a best tree in the rSPR neighborhood of the currently best species tree. This procedure is called a local search step. If there are multiple best species trees in the neighborhood, then one is chosen randomly and the local search step is repeated with this new tree. The heuristic terminates when no better trees can be found in the local search step.

        2 - Partial queue based
            This heuristic uses randomized hill climbing, where all the best trees found in the current rSPR neighborhood are enqueued. Each queued tree serves as the starting tree for the hill climbing heuristic search until a better tree is found. The heuristic terminates when none of the trees found during the hill climbing searches has a lower reconciliation cost.

        3 - Full queue based [default: --queue=500 --trees=1]
            This heuristic uses randomized hill climbing, where all the best trees found so far are enqueued. Each queued tree serves as the starting tree for the hill climbing heuristic search until a better tree is found. The heuristic terminates when none of the trees found during the hill climbing searches has a lower reconciliation cost. Note that this heuristic is a more general and thorough version of the partial queue based heuristic discussed above.
    --queue
        The queue size for the full queue based heuristic. [default=500]

    -t, --trees | all
        Maximum number of species trees to be output. This only applies to the queue based heuristic.
        - up to a specific number of species trees are output [default=1]
        all - output all optimal species trees found

    -r, --reroot opt | all
        Rerooting of unrooted gene trees.
        opt - Rerooting occurs when an optimal species tree for the current rooting is found. [default]
        all - Rerooting occurs during every rSPR operation.

    -q, --quiet
        No processing output.

    --seed
        Set a user defined random number generator seed. By default the seed is generated from the local wall clock.

    -v, --version
        Output the version number.

    -h, --help
        Output a brief help message with information about the various arguments.

NOTES AND RECOMMENDATIONS:

The initial species tree generated by the leaf adding heuristic is NOT expected to be optimal in terms of gene duplications, and in many cases it may be quite bad. However, we have found that starting the rSPR heuristic from one of these initial starting trees often leads to much better solutions than starting from a random species tree. Also, because of the random steps, the initial species trees from multiple runs of the leaf adding heuristic will likely differ.

The starting tree in a duptree analysis can greatly affect the resulting species tree. Thus, we strongly recommend running duptree multiple times using different starting trees.

In many cases, using the -r (reroot) option can greatly decrease the overall reconciliation cost (number of gene duplications). It can be very difficult to root gene trees accurately, and this option searches for the root that will minimize the number of gene duplications. While the "-r all" option is the most thorough option for examining alternate rootings of gene trees, it can be extremely computationally burdensome, if not entirely intractable, for large data sets.
The "-r opt" option is much faster, and we have successfully tested it with some very large data sets.

In addition to the program "duptree", we have included tools to convert files of trees from NEXUS format to Newick format or from Newick format to NEXUS format.

Tool nexus2newick and newick2nexus
----------------------------------

SYNOPSIS
    nexus2newick [OPTIONS]
    newick2nexus [OPTIONS]

DESCRIPTION

Convert a NEXUS file to a list of trees in Newick format separated by semicolons, and vice versa. The NEXUS format is compatible with the tool GeneTree by Rod Page.

OPTIONS

    -i, --input
        Input file

    -o, --output
        Output file

    -q, --quiet
        No processing output.

    -v, --version
        Output the version number.

    -h, --help
        Output a brief help message with information about the various arguments.

Examples
--------

We provide several example files using a vertebrate data set from Rod Page.

The file "vertebrates01.newick" contains a set of 9 rooted gene trees with NO starting species tree. The file "vertebrates01.UNR.newick" contains the same 9 gene trees, denoted as unrooted. These files can be used to infer a phylogeny using the program "duptree".

    e.g.
    duptree -i vertebrates01.newick -o output.newick
    duptree -i vertebrates01.UNR.newick -o output.newick

The files "vertebrates02.newick" and "vertebrates02.UNR.newick" contain a species tree followed by the 9 gene trees. The gene trees are rooted in "vertebrates02.newick" and unrooted in "vertebrates02.UNR.newick". These files can be used to infer a phylogeny using the program "duptree", with the first tree in each file taken as the starting species tree ("-g 0").

    e.g.
    duptree -g 0 -i vertebrates02.newick -o output.newick
    duptree -g 0 -i vertebrates02.UNR.newick -o output.newick
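
The rules for a constraint file described under "--constraints" (each constraint is a comma separated list of species terminated with a semicolon, and a species may belong to at most one constraint) can be checked before running duptree. The following Python sketch is not part of the DupTree distribution. The function name format_constraints is hypothetical, and writing one constraint per line is an assumption about layout, not a documented requirement.

```python
from typing import Dict, Iterable, List


def format_constraints(groups: Iterable[Iterable[str]]) -> str:
    """Render species groupings as constraint-file text.

    Each group becomes one comma separated list terminated by a semicolon.
    Raises ValueError if a species is listed more than once, because a
    species may be part of at most one constraint.
    """
    seen: Dict[str, int] = {}  # species name -> index of the group that used it
    lines: List[str] = []
    for index, group in enumerate(groups):
        names = [name.strip() for name in group]
        for name in names:
            if not name:
                raise ValueError(f"constraint {index} contains an empty species name")
            if name in seen:
                raise ValueError(
                    f"species {name!r} is listed in constraint {seen[name]} "
                    f"and again in constraint {index}"
                )
            seen[name] = index
        # Assumption: one constraint per line; only the ", " and ";" syntax
        # comes from the documentation above.
        lines.append(", ".join(names) + ";")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The two valid constraints from the --constraints example.
    print(format_constraints([["cat", "dog", "turtle"], ["rice", "arabidopsis"]]), end="")
```

Adding a third group such as ["cat", "dog"] makes the function raise ValueError, which mirrors the note under "--constraints" that such an additional constraint is not valid.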