DupTree is a program for phylogenetic analyses using gene tree parsimony. That is, given a collection
of binary gene trees, DupTree searches for a species supertree that implies the fewest gene
duplication events. The user input is a list of rooted or unrooted, binary gene trees in Newick format.
DupTree first builds an initial species tree using a stepwise addition algorithm. Next, DupTree
searches for a better species tree using a standard search heuristic of choice starting from the initial
species tree. The output is the species tree(s) with the lowest reconciliation cost among all trees
encountered during the heuristic search, along with the reconciliation cost for each individual gene
tree.
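As a rough illustration of what "reconciliation cost" means here, the following Python sketch counts
gene duplications for one rooted gene tree using the usual LCA mapping. It is not DupTree's
implementation: the nested-tuple tree representation and the names clusters and count_duplications are
assumptions made for this example only.

```python
def clusters(tree):
    """Return the leaf sets of all nodes of a nested-tuple tree, root last."""
    if isinstance(tree, str):            # a leaf is just a species name
        return [frozenset([tree])]
    left, right = (clusters(child) for child in tree)
    return left + right + [left[-1] | right[-1]]


def count_duplications(gene_tree, species_tree):
    """Count duplication nodes of a rooted, binary gene tree via LCA mapping."""
    species_clusters = clusters(species_tree)

    def lca(species):
        # Smallest clade of the species tree that contains all given species.
        return min((c for c in species_clusters if species <= c), key=len)

    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return frozenset([node])
        left, right = (visit(child) for child in node)
        here = lca(left | right)
        # A gene node is a duplication if it maps to the same species-tree
        # node as at least one of its children.
        if here in (lca(left), lca(right)):
            duplications += 1
        return left | right

    visit(gene_tree)
    return duplications


species_tree = (("speciesA", "speciesB"), "speciesC")
gene_tree = ((("speciesA", "speciesB"), "speciesC"), "speciesA")
print(count_duplications(gene_tree, species_tree))  # prints 1
```

In this example the root of the gene tree maps to the root of the species tree, as does one of its
children, so exactly one duplication is implied.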
General input/output
--------------------
All input and output is handled as plain ASCII text. Whitespace is allowed wherever it does not affect
the syntax.
All trees (gene trees or species trees) must be expressed using the Newick format terminated by a
semicolon, and they must be fully binary (fully resolved). Multifurcations in the gene trees will
trigger a warning or error message. Species names in the species tree must be unique. In the gene
trees, the leaf names must correspond to the species from which the gene was sequenced. Multiple
genes corresponding to a single species can occur in a gene tree.
Species Tree e.g. ((speciesA,speciesB),speciesC);
Gene Tree e.g. (((speciesA,speciesB),speciesC),speciesA);
Labels with non-alphabetic characters (e.g. spaces) have to be enclosed in apostrophes or quotation
marks.
e.g. speciesA, "species A" or 'species A'
Trees can span multiple lines and contain comments. Comments have to be encapsulated in square brackets.
e.g.
(
(speciesA,speciesB), [example comment text]
speciesC
);
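The input rules above (quoted labels, bracketed comments, trees spanning several lines, terminating
semicolon) can be sketched as a small Python reader. This is only an illustration, not the parser used
by DupTree; the function name parse_newick and the nested-tuple output are assumptions. The sketch
ignores branch lengths and internal node labels, and it removes every bracketed block, including
prefixes such as [&U], before parsing.

```python
import re

# Quoted label, structural character, or unquoted label.
TOKEN = re.compile(r"""'[^']*'|"[^"]*"|[(),;]|[^\s(),;'"]+""")


def parse_newick(text):
    """Parse one Newick tree into nested tuples of leaf labels."""
    text = re.sub(r"\[[^\]]*\]", "", text)       # drop [comments]
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def subtree():
        nonlocal pos
        token = peek()
        if token is None or token in {",", ")", ";"}:
            raise ValueError(f"unexpected token: {token!r}")
        pos += 1
        if token != "(":
            return token.strip("'\"")             # leaf label, quotes removed
        children = [subtree()]
        while peek() == ",":
            pos += 1
            children.append(subtree())
        if peek() != ")":
            raise ValueError("missing closing parenthesis")
        pos += 1
        return tuple(children)

    tree = subtree()
    if peek() != ";":
        raise ValueError("tree must be terminated by a semicolon")
    return tree


example = """(
(speciesA,speciesB), [example comment text]
speciesC
);"""
print(parse_newick(example))
```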
Gene trees can be unrooted or rooted. An unrooted gene tree will still be stored with an artificial
root, but this root is ignored during analyses. The rooting of a tree can be specified by the prefix
[&U] or [&R] for unrooted or rooted respectively. A gene tree with no prefix is assumed to be rooted.
Species trees are always handled rooted and need no prefix.
e.g. unrooted gene tree
[&U]((speciesA,speciesB),speciesC);
e.g. rooted gene tree
[&R]((speciesA,speciesB),speciesC); OR ((speciesA,speciesB),speciesC);
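The prefix rule can be expressed in a few lines of Python. This is a sketch under the assumption that
the prefix is written in upper case exactly as shown above; the function name split_rooting is
hypothetical.

```python
import re

ROOTING_PREFIX = re.compile(r"\s*\[&([UR])\]")


def split_rooting(tree_text):
    """Return (is_rooted, remaining_text) for one gene tree string."""
    match = ROOTING_PREFIX.match(tree_text)
    if match is None:
        # No prefix: the gene tree is assumed to be rooted.
        return True, tree_text.strip()
    return match.group(1) == "R", tree_text[match.end():].strip()


print(split_rooting("[&U]((speciesA,speciesB),speciesC);"))
print(split_rooting("((speciesA,speciesB),speciesC);"))
```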
The standard input and output of the operating system are used by all tools unless files are specified
by the user. Using the standard input and output allows file-less communication between tools using
pipes. It is always possible to specify files instead and store the input and output in them.
e.g. file input/output
duptree -i -o
e.g. standard input/output example on Unix
cat | duptree >
DupTree
-------
SYNOPSIS
duptree [OPTIONS]
DESCRIPTION
Given a collection of rooted and/or unrooted, binary gene trees, the default behaviour of duptree is
to quickly construct an initial species tree using a stepwise addition algorithm. The resulting
species tree will be rooted, fully binary, and will contain all species in the gene trees exactly
once. This initial species tree serves as a starting point for the heuristic search that follows.
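One step of a stepwise addition can be sketched in Python as follows: a new leaf is tried at every
possible position of the current rooted tree, and one of the best-scoring results is picked at random.
This is a generic illustration rather than DupTree's code; the nested-tuple trees, the names insertions
and add_leaf, and the user-supplied cost function are assumptions.

```python
import random


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` at one position of `tree`."""
    yield (tree, leaf)                    # attach above the root of this subtree
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def add_leaf(tree, leaf, cost, rng=random):
    """Attach `leaf` at a position of minimum cost, breaking ties randomly."""
    scored = [(cost(candidate), candidate) for candidate in insertions(tree, leaf)]
    best = min(score for score, _ in scored)
    return rng.choice([candidate for score, candidate in scored if score == best])


start = (("speciesA", "speciesB"), "speciesC")
for candidate in insertions(start, "speciesD"):
    print(candidate)
```

A rooted binary tree with n leaves has 2n-1 insertion positions, so the three-leaf tree above yields
five candidate trees.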
DupTree can also accept user-given starting species trees. This can be done by using the argument
"--generator 0". In this case, the first tree in the input will be interpreted as the starting species
tree.
The starting species tree for duptree can contain topological constraints. In the tree search, the
species in the constrained subtree will always form a clade. In other words, no species inside the
constrained subtree can be moved outside the constrained subtree and no species outside the
constrained subtree can be moved inside the constrained subtree. However, relationships within the
constrained subtree can change. Constraints can be specified by using the "--constraints" argument.
For a user-specified species tree, constraints can be specified by attaching [&CONSTRAINT] to the root
node of the constrained subtree. Note that if constraints are used to build the starting species tree,
the resulting starting tree is a constraint tree, and its constraints also apply during the search.
e.g. (((cat,turtle),dog)[&CONSTRAINT],(rice,arabidopsis));
In this example, cat, turtle, and dog will always form a clade. However, the final species
tree might not maintain the relationships within this clade. Thus, the resulting species tree
might be (((cat,dog),turtle),(rice,arabidopsis));
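The meaning of a constraint can be checked with a short Python sketch: a constraint is respected if
the constrained species are exactly the leaves below some node of the species tree. The nested-tuple
representation and the names leaf_sets and forms_clade are assumptions for illustration.

```python
def leaf_sets(tree):
    """Return the leaf set of every node in a nested-tuple tree, root last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    parts = [leaf_sets(child) for child in tree]
    merged = frozenset().union(*(part[-1] for part in parts))
    return [s for part in parts for s in part] + [merged]


def forms_clade(tree, species):
    """True if `species` is exactly the leaf set of some node of `tree`."""
    return frozenset(species) in leaf_sets(tree)


result = ((("cat", "dog"), "turtle"), ("rice", "arabidopsis"))
print(forms_clade(result, {"cat", "turtle", "dog"}))   # True: constraint kept
print(forms_clade(result, {"cat", "turtle"}))          # False: inner grouping changed
```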
Next, duptree searches for a better species tree using a local search heuristic based on rSPR (rooted
subtree pruning and regrafting) operations. The resulting species tree is guaranteed to be locally
optimal, but not necessarily globally optimal. Moreover, multiple runs can lead to different species
trees depending on the search path taken by the heuristic.
Gene trees can also be weighted. If a gene tree is weighted, its reconciliation cost (number of gene
duplications) is multiplied by its weight. Thus, gene trees with a higher weight have a larger impact
on the tree search. The weight must be a floating point number from 0.0 to 1.0 inclusive. A weight of
1.0 is equivalent to no weighting, and a gene tree with weight 0.0 has no effect on the gene tree
parsimony analysis (i.e., it is equivalent to not including the gene tree in the analysis). A weight is
specified by the prefix [&WEIGHT=], with the value following the equals sign. By default a tree has
weight 1.0.
e.g. [&WEIGHT=0.5]((speciesA,speciesB),speciesC);
e.g. ((speciesA,speciesB),speciesC); [No specified weight is equivalent to a weight of 1]
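The effect of weights on the total score can be sketched in Python as follows. The function names
tree_weight and weighted_cost are hypothetical, and the sketch assumes the weight is written as a plain
decimal number (no exponent notation).

```python
import re

WEIGHT_PREFIX = re.compile(r"\[&WEIGHT=([0-9.]+)\]")


def tree_weight(tree_text):
    """Return the weight given by a [&WEIGHT=...] prefix, or 1.0 if absent."""
    match = WEIGHT_PREFIX.search(tree_text)
    weight = float(match.group(1)) if match else 1.0
    if not 0.0 <= weight <= 1.0:
        raise ValueError(f"weight must be between 0.0 and 1.0, got {weight}")
    return weight


def weighted_cost(costs_and_weights):
    """Sum the per-gene-tree duplication counts, each multiplied by its weight."""
    return sum(cost * weight for cost, weight in costs_and_weights)


w1 = tree_weight("[&WEIGHT=0.5]((speciesA,speciesB),speciesC);")
w2 = tree_weight("((speciesA,speciesB),speciesC);")
# Hypothetical duplication counts of 4 and 3 for the two gene trees.
print(weighted_cost([(4, w1), (3, w2)]))  # prints 5.0
```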
OPTIONS
-i, --input Input file
-o, --output Output file
--oformat newick | nexus By default the output format is a list of trees in Newick format
separated by semicolons. The output can be changed to the NEXUS format.
The output consists of the best species tree(s) identified by the rSPR
heuristic followed by a list of the gene trees with their reconciliation
cost.
--nogenetree This option will only output the species tree. By default the input gene
trees are listed along with the species trees.
-g, --generator 0|1|2 The algorithm used to build the starting species tree.
0 - don't generate
1 - leaf adding heuristic [default]
First we consider all possible triplets of species in all possible
topologies. The best triplets and their corresponding topologies,
are stored. Among these best triplets (triplets with the lowest
reconciliation cost), we randomly pick a triplet and make that our
starting species tree. Next, we consider all possible leaves that
have yet to be added, and try to add each of them at all possible
positions in the current species tree. Again, all the trees with the
best score (lowest reconciliation cost) are stored, and one of these
is randomly picked to be the next species tree. In this way, we
incrementally add all species to the species tree. All gene trees
are assumed to be rooted.
2 - random tree
The species tree is assigned a random topology.
-f, --fast This option implements a much faster version of the leaf adding
heuristic to build a starting tree. Instead of trying out all possible
triplets, we randomly pick three leaves from the species leaf set, and
then consider all the triplets that these three leaves can form. We
randomly pick a triplet with the best score as the starting species
tree. Then, we randomly pick another leaf to add to the current species
tree. We try to add this leaf in all possible positions in the current
species tree, and store all those positions that give the best score.
Then we randomly pick a position from among these best locations and add
the leaf at that location. This option is much faster than the original
leaf adding heuristic, but the starting trees may have higher
reconciliation costs.
--constraints A file containing groupings of species for the leaf adding heuristic.
Each constraint is a comma separated list of species terminated with a
semicolon. In the resulting species tree, species in a constraint will
form a clade. Multiple constraints can be specified, but any single
species can be part of at most one constraint. In other words, the
constraints cannot be nested.
e.g. example of constraints in a constraint file
cat, dog, turtle;
rice, arabidopsis;
[Note: if these are the two constraints, the additional constraint
"cat, dog" is not valid because both species already are part of
the first constraint.]
-e, --heuristic 1 | 2 | 3 Heuristic used in the search
1 - Randomized hill climbing [default]
This heuristic seeks to find a best tree in the rSPR neighborhood of
the currently best species tree. This procedure is called a local
search step. If there are multiple best species trees in the
neighborhood, then one is chosen randomly and the local search step
is repeated with this new tree. The heuristic terminates when no
better trees can be found in the local search step.
2 - Partial queue based
This heuristic uses randomized hill climbing, where all the best
trees found in the current rSPR neighborhood are enqueued. Each
queued tree serves as the starting tree for the hill climbing
heuristic search until a better tree is found. The heuristic
terminates when none of the trees found during the hill climbing
searches has a lower reconciliation cost.
3 - Full queue based [default: --queue=500 --trees=1]
This heuristic uses randomized hill climbing, where all the best
trees found so far are enqueued. Each queued tree serves as the
starting tree for the hill climbing heuristic search until a better
tree is found. The heuristic terminates when none of the trees found
during the hill climbing searches has a lower reconciliation cost.
Note that this heuristic is a more general and thorough version of
the partial queue based heuristic discussed above.
--queue The queue size for the full queue based heuristic. [default=500]
-t, --trees | all Maximum number of species trees to be output. This only applies to the
queue based heuristic.
- up to a specific number of species trees are output [default=1]
all - output all optimal species trees found
-r, --reroot opt | all Rerooting of unrooted gene trees.
opt - Rerooting occurs when an optimal species tree for the current
rooting is found. [default]
all - Rerooting occurs during every rSPR operation.
-q, --quiet No processing output.
--seed Set a user defined random number generator seed. By default the seed is
generated from the local wall clock.
-v, --version Output the version number.
-h, --help Output a brief help message with information about the various arguments.
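The control flow of the randomized hill climbing heuristic (-e 1) described in the options above can be
sketched generically in Python. This is not DupTree's implementation: the neighbors and cost functions
are placeholders supplied by the caller (in DupTree the neighborhood would be the rSPR neighborhood and
the cost the reconciliation cost), and the toy example below uses integers instead of trees.

```python
import random


def hill_climb(start, neighbors, cost, rng=random):
    """Repeat local search steps until no neighbor is strictly better."""
    current, current_cost = start, cost(start)
    while True:
        scored = [(cost(candidate), candidate) for candidate in neighbors(current)]
        if not scored:
            return current, current_cost
        best = min(score for score, _ in scored)
        if best >= current_cost:
            # No better solution in the neighborhood: local optimum reached.
            return current, current_cost
        # Several neighbors may share the best score; pick one at random.
        current = rng.choice([c for score, c in scored if score == best])
        current_cost = best


# Toy problem: walk along the integers towards the minimum of (x - 3)^2.
print(hill_climb(10, lambda x: [x - 1, x + 1], lambda x: (x - 3) ** 2))
```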
NOTES AND RECOMMENDATIONS:
The initial species tree generated by the leaf adding heuristic is NOT expected to be optimal in terms
of gene duplications, and in many cases it may be quite bad. However, we have found that starting
the rSPR heuristic from one of these initial starting trees often leads to much better solutions than
starting from a random species tree. Also, because of the random steps, the initial species trees
from multiple runs of the leaf adding heuristic likely will be different. The starting tree in a
duptree analysis can greatly affect the resulting species tree. Thus, we strongly recommend running
duptree multiple times using different starting trees.
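One way to script such repeated runs is sketched below in Python. It uses only the -i, -o and --seed
options documented above; passing the seed as a separate argument, the output file naming scheme, and
the function names are assumptions of this sketch.

```python
import subprocess


def duptree_commands(input_file, seeds):
    """Build one duptree command line per random seed."""
    return [
        ["duptree", "-i", input_file, "-o", f"run_seed{seed}.newick", "--seed", str(seed)]
        for seed in seeds
    ]


def run_all(commands):
    """Run each command in turn; requires duptree to be on the PATH."""
    for command in commands:
        subprocess.run(command, check=True)


for command in duptree_commands("vertebrates01.newick", [1, 2, 3]):
    print(" ".join(command))
```

Calling run_all(duptree_commands("vertebrates01.newick", range(1, 11))) would then perform ten
independent runs, whose output files can be compared by reconciliation cost.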
In many cases, using the -r (reroot) option can greatly decrease the overall reconciliation cost
(number of gene duplications). It can be very difficult to root gene trees accurately, and this
option searches for the root that will minimize the number of gene duplications. While the "-r all"
option is the most thorough option for examining alternate rootings of gene trees, it can be extremely
computationally burdensome, if not entirely intractable, for large data sets. The "-r opt" option is
much faster, and we have successfully tested it with some very large data sets.
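The idea behind rerooting can be illustrated with a brute-force Python sketch: enumerate every rooting
of an unrooted gene tree and keep the one with the fewest duplications for a fixed species tree. This
is a simple illustration, not the algorithm used by duptree; the nested-tuple trees and all function
names are assumptions.

```python
from itertools import count


def leaf_sets(tree):
    """Return the leaf set of every node in a nested-tuple tree, root last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    left, right = (leaf_sets(child) for child in tree)
    return left + right + [left[-1] | right[-1]]


def count_duplications(gene_tree, species_tree):
    """Count duplication nodes of a rooted, binary gene tree via LCA mapping."""
    species_clusters = leaf_sets(species_tree)

    def lca(species):
        return min((c for c in species_clusters if species <= c), key=len)

    def visit(node):
        if isinstance(node, str):
            return frozenset([node]), 0
        (left, dups_left), (right, dups_right) = (visit(child) for child in node)
        is_dup = lca(left | right) in (lca(left), lca(right))
        return left | right, dups_left + dups_right + int(is_dup)

    return visit(gene_tree)[1]


def all_rootings(tree):
    """Yield one rooted tree for every edge of the underlying unrooted tree."""
    ids, label, adj = count(), {}, {}

    def build(node):
        nid = next(ids)
        adj[nid] = []
        if isinstance(node, str):
            label[nid] = node
        else:
            for child in node:
                cid = build(child)
                adj[nid].append(cid)
                adj[cid].append(nid)
        return nid

    left, right = tree
    a, b = build(left), build(right)
    adj[a].append(b)                      # join the two children of the
    adj[b].append(a)                      # artificial root by a single edge

    def hang(nid, parent):
        if nid in label:
            return label[nid]
        return tuple(hang(n, nid) for n in adj[nid] if n != parent)

    for u in adj:
        for v in adj[u]:
            if u < v:                     # visit each undirected edge once
                yield (hang(u, v), hang(v, u))


def best_rooting(gene_tree, species_tree):
    """Return a rooting of `gene_tree` with the fewest duplications."""
    return min(all_rootings(gene_tree),
               key=lambda rooted: count_duplications(rooted, species_tree))


species_tree = (("speciesA", "speciesB"), "speciesC")
gene_tree = (("speciesA", "speciesC"), "speciesB")      # stored with an artificial root
best = best_rooting(gene_tree, species_tree)
print(count_duplications(gene_tree, species_tree), count_duplications(best, species_tree))
```

With the artificial rooting the gene tree implies one duplication, while its best rooting implies
none.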
In addition to the program "duptree", we have included tools to convert files of trees from NEXUS
format to Newick format or from Newick format to NEXUS format.
Tools nexus2newick and newick2nexus
-----------------------------------
SYNOPSIS
nexus2newick [OPTIONS]
newick2nexus [OPTIONS]
DESCRIPTION
Convert a NEXUS file to a list of trees in Newick format separated by semicolons, and vice versa. The NEXUS
format is compatible with the tool GeneTree by Rod Page.
OPTIONS
-i, --input Input file
-o, --output Output file
-q, --quiet No processing output.
-v, --version Output the version number.
-h, --help Output a brief help message with information about the various arguments.
Examples
--------
We provide several example files using a vertebrate data set from Rod Page. The file
"vertebrates01.newick" contains a set of 9 rooted gene trees with NO starting species tree. The file
"vertebrates01.UNR.newick" contains the same 9 gene trees that are denoted as unrooted. These files
can be used to infer a phylogeny using the program "duptree".
e.g. running duptree on the rooted and on the unrooted gene trees
duptree -i vertebrates01.newick -o output.newick
duptree -i vertebrates01.UNR.newick -o output.newick
The files "vertebrates02.newick" and "vertebrates02.UNR.newick" contain a species tree followed by the
9 gene trees. The gene trees are rooted in "vertebrates02.newick" and unrooted in
"vertebrates02.UNR.newick".
e.g. These files can be used to infer a phylogeny using the program "duptree", with "-g 0" so that the
first tree in the input is read as the starting species tree.
duptree -g 0 -i vertebrates02.newick -o output.newick
duptree -g 0 -i vertebrates02.UNR.newick -o output.newick