SciSurf: Index of 'Reconstructible Phylogenetic Networks: Do Not Distinguish the Indistinguishable'

Reconstructible Phylogenetic Networks: Do Not Distinguish the Indistinguishable

Fabio Pardi, Celine Scornavacca

Published in PLOS Comp. Biol., April 2015

Abstract

An important way to interpret a phylogenetic network is in terms of the trees it displays, which represent all the possible histories of the characters carried by the organisms in the network. Interestingly, however, different networks may display exactly the same set of trees, an observation that poses a problem for network reconstruction: from the perspective of many inference methods such networks are “indistinguishable”. This is true for all methods that evaluate a phylo-genetic network solely on the basis of how well the displayed trees fit the available data, including all methods based on input data consisting of clades, triples, quartets, or trees with any number of taxa, and also sequence-based approaches such as popular formalisa—tions of maximum parsimony and maximum likelihood for networks. This identifiability problem is partially solved by accounting for branch lengths, although this merely reduces the frequency of the problem. Here we propose that network inference methods should only attempt to reconstruct what they can uniquely identify. To this end, we introduce a novel definition of what constitutes a uniquely reconstructible network. For any given set of indistinguishable networks, we define a canonical network that, under mild assumptions, is unique and thus representative of the entire set. Given data that undenNent reticulate evolution, only the canonical form of the underlying phylogenetic network can be uniquely reconstructed. While on the methodological side this will imply a drastic reduction of the solution space in network inference, for the study of reticulate evolution this is a fundamental limitation that will require an important change of perspective when interpreting

Author Summary

We consider here an elementary question for the inference of phylogenetic networks: what networks can be reconstructed. Indeed, whereas in theory it is always possible to reconstruct a phylogenetic tree, given sufficient data for this task, the same does not hold for phylogenetic networks: most notably, the relative order of consecutive reticulate events cannot be determined by standard network inference methods. This problem has been described before, but no solutions to deal with it have been put forward. Here we propose limiting the space of reconstructible phylogenetic networks to what we call “canonical net-works”. We formally prove that each network has a (usually unique) canonical form—where a number of nodes and branches are merged—representing all that can be uniquely reconstructed about the original network. Once a canonical network N is inferred, it must be kept in mind that—even with perfect and unlimited data—the true phylogenetic network is just one of the potentially many networks having N as canonical form. This is an important difference to what biologists are used to for phylogenetic trees, where in principle it is always possible to resolve uncertainties, given enough data.

Introduction

This may be caused by events such as hybrid speciation (e.g. in plants and animals [4, 5]), horizontal gene transfer (e.g. in bacteria [6, 7]), viral reassortment [8] , or recombination (e.g. in viruses [9, 10] or in the genomes of sexually reproducing species [11—13]). They are called “explicit” to distinguish them from “implicit” [14], “abstract” [1] or “data-display” [3] phylogenetic networks, which are used to display collections of alternative evolutionary hypotheses supported by conflicting signals in the data. In explicit networks, multiple-inheritance events are represented as reticulations, that is, nodes where two or more lineages converge to give rise to a new lineage, whose genetic material is a combination of that of its direct ancestors.

This observation gives rise to the notion of trees displayed by a network, which are all the possible single-character histories implied by a phylogenetic network. (See, e.g., Fig. 1, where T1, T2 and T3 are the trees displayed by networks N 1 and N2. Formal definitions are in the Results section.)

There remains, however, a strong demand for automatic reconstruction of networks that not only display conflicting signals in the data, but also seek to explain these signals with explicit inferences of past reticulation events (see, e.g., [18—20] ). This is evidenced, for example, by the abundance of manually reconstructed networks in the literature [8, 21—27]. As a result of this demand, the inference of explicit networks is now a rapidly growing field of research [1].

Not surprisingly, the notion of trees displayed by a phylogenetic network plays a central role: the general idea is to evaluate the fit of a network N with the data indirectly—on the basis of how well the trees displayed by N explain the data. In the following, we describe how this applies to the two main approaches for explicit network reconstruction: consistency-based approaches (see [28] for a survey)—seeking a network consistent with a number of prior evolutionary inferences (typically trees or groupings of taxa)—and sequence-based approaches, such as standard formulations of maximum parsimony and maximum likelihood for networks [2, 29—33]. Although evaluating a network via the trees it displays is evolutionarily meaningful, it has a problematic consequence: from the perspective of these reconstruction methods, all networks displaying the same set of trees are “indistinguishable”, as the function that these methods seek to optimize will always assign the same score to all networks displaying the same set of trees, regardless of the input data. In other words, the central parameter of phylogenetic network inference, the network itself, is in some cases not identifiable.

An Identifiability Problem

(In the following, T(N) denotes the set of trees displayed by N.) By displaying the same trees, these networks display the same clades, the same triples, the same quartets (triples and quartets are rooted subtrees with 3 leaves and unrooted subtrees with 4 leaves, respectively) and in general the same subtrees with an arbitrary number of leaves. Therefore, any method that reconstructs a network based on its consistency with collections of such data will not be able to distinguish between networks N1 and N2. This includes all the methods whose data consists of clusters of taxa (e.g., [34]), triples (e.g., [35]), quartets (e.g., [36]), or any trees (e.g., [37]).

For maximum parsimony, a practical approach [2, 29—31] is to consider that the input is partitioned in a number of alignments A1, A2, . . ., Am, each from a different non-recombining genomic region (possibly consisting of just one site each), and then take, for each of these alignments, the best parsimony score Ps(T|A,-) among all those of the trees displayed by a network N. The parsimony score of N is then the sum of all the parsimony scores thus obtained. Formally, we have It is clear that if two networks display the same set of trees (as in Fig. 1), then their parsimony score With respect to any input alignments Will be the same—because they take the minimum Fig 1. The network topologies N1 and N2 are indistinguishable to most current approaches for network reconstruction, as they display the same tree topologies T1, T2 and T3. value over the same set T(N) —and thus they are indistinguishable to any method based on the maximum parsimony principle above.

The likelihood function with respect to a set of alignments A1, A2, . . ., Am, each from a different non-recombining genomic region, is then given by: Note that an important difference with the consistency-based and parsimony methods described above is that any tree T displayed by a network has now edge lengths and an associated probability Pr(T|N).

For example, it does not allow us to distinguish between networks with topologies N1 and N2 in Fig. 1: for every assignment of edge lengths and inheritance probabilities to N1, there exist corresponding assignments to N2 that make the resulting networks indistinguishable, that is, displaying the same trees, with the same edge lengths and the same probabilities of being observed (see the last section in the Supporting Information, 81 Text). As a result, the likelihoods of these two networks will be identical, regardless of the data, and no method based on this definition of likelihood will be able to favour one of them over the other. We refer to 81 Text for a more detailed discussion about networks with inheritance probabilities and likelihood-based reconstruction. In general, we believe that these identif1ability problems affect all network inference methods which seek consistency with unordered collections of sequence alignments or pre-inferred attributes such as clusters, triples, quartets or trees.

The Importance of Edge Lengths

The primary motivation for this is that this choice makes our results directly relevant to the statistical approaches for network inference, all of which need edge lengths to measure the fit of a phylogeny with the available data. In addition to ML, these approaches include distance-based and Bayesian methods [39], which are also promising for future work.

For example, consider the three network topologies in Fig. 2 (top), where taxon o is an out-group used to identify the root of the phylogeny for a, b and c. These networks show three very different evolutionary histories: in N1 taxon b is the only one issued of a reticulation event—in other words the genome of b is recombinant—whereas in N2 and N3, it is a and c, respectively, that are recombinant. However, N1, N2 and N3 display the same tree topologies—those of T1 and Tz—and thus would be indistinguishable to any approach that does not model edge lengths.

It is easy to check that N2 now displays T1 and T2 With the shown edge lengths, Whereas no edge length assignment to N1 or N3 can make these networks display T1 and T2. We note that, throughout this paper, as in classical likelihood approaches, edge lengths measure evolutionary divergence, for example in terms of eXpected number of substitutions per site. No molecular clock is assumed, meaning that we do not eXpect edge lengths to be proportional to time.

Remaining Identifiability Problems, and a Proposed Solution

Consider networks N1 and N2 in Fig. 3: for any set of edge lengths for N1, there exist an infinity of edge length assignments for N2 that make these two networks display exactly the same set of trees with the same edge lengths. In the following, we say that networks such as N1 and N2 are indistinguishable.

Note that this operation, which we refer to as “unzipping” reticulation v, can result in v coinciding with a specia-tion node or a leaf when M is taken to equal the length of the edge going out of v. For example in Fig. 3, one may fully unzip the two reticulation nodes in N1, thus obtaining the network N of Fig. 4. As expected, N1 and N’ display the same set of trees ({T1, T2, T3}) and are thus indistinguishable. What is most interesting in this example is that, if we fully unzip the two reticula-tions in N2 (the other network in Fig. 3, also displaying {T1, T2, T3}), then we eventually end up obtaining N’ again. As we shall see in the following, this is not a coincidence: the unzipping transformations described above lead to what we call the canonical form of a network; under mild assumptions, two networks are indistinguishable if and only if they have the same

Canonical form of N1 and N2 in Fig. 3. doi:10.1371/journal.pcbi.1004135.9004 canonical form (e.g. N1, N2 in Fig. 3 have the same canonical form N; formal definitions and statements in the Results section). Here, we propose to deal with the identif1ability issues for phylogenetic networks in the following way: since no data will ever enable any of the standard inference methods described above to prefer a network over all of its indistinguishable equivalents, we propose that these methods should only attempt to reconstruct what they can uniquely identify, that is, networks in canonical form. This is a radical shift, not only for the developers of phylogenetic inference methods, Who Will see a drastic reduction of the solution space of their algorithms, but also for evolutionary biologists, Who should abandon their hopes of seeing a network such as N1 or N2 in Fig. 3 being reconstructed by these inference methods.

Previous Work and Comparison

Examples of such classes include galled trees [40, 41], galled networks [42], level-k networks [43], tree-child networks [44], tree-sibling networks [45], networks with visible reticulations [1]. Although the ultimate goal should be to establish what can be inferred from biological data, most of the proposed definitions are computationally-motivated: in general the rationale behind these classes is the possibility of devising an efficient algorithm to solve some formalization of the reconstruction problem. None of these definitions claims to have

biological significance.

From our standpoint, the computationally-motivated definitions above are at the same time too broad and too restrictive. Too broad, because they determine a set of networks that includes many pairs of indistinguishable networks: for example the three indistinguishable networks in Fig. 2 are all galled trees—and thus belong to every single one of the classes mentioned above (which are all generalizations of galled trees). Too restrictive, because these classes of networks do not include simple networks that it should be possible to reconstruct from real data. For example, Fig. 5a shows a network N with edge lengths that is not tree-sib-ling, nor has the visible property, and thus is not galled, nor tree-child (for definitions, see [1] ), but which in practice should be reconstructible: apart from the lengths of three edges (x, y, z), N is uniquely determined by the trees that it displays (a consequence of the formal results that we will show in the following), meaning that, given large amounts of data strongly supporting each of these (seven) trees with their correct edge lengths, any method for network inference properly accounting for edge lengths (e.g. based on ML) should be able to reconstruct N, or its canonical form N’.

These approaches bear some resemblances to ours, but do not include edge lengths in the definition of a network. Moreover, we argue that these classes of networks are still too narrow to be biologically relevant. We briefly describe and comment these previous works below.

[46] defined notions of reconstructible, indistinguishable and reduced networks that resemble concepts that we will introduce here. Although some of their results were flawed [47, 50], some of the arguments in this introduction are inspired by their paper. Particularly relevant to the current paper is a reduction algorithm to transform a network into its reduced version. (However, the exact definition of the reduced version is unclear: as one of the authors later pointed out [47] , “the reduction procedure of Moret et al. [46] is, in fact, inaccurate” and “in this paper we do not attempt to fix the procedure”.) The concept of reduced version is analogous to that of canonical form here, as the authors claim that networks displaying the same tree topologies have the same reduced version (up to isomorphism; Theorem 2 in [46] ). This is somehow a weaker analogue of one of our results (Corollary 1); weaker, because it does not claim that, conversely, networks with the same reduced version display the same tree topologies. To have an idea of the difference between our canonical form and the reduced version of Moret and colleagues, in Fig. 6 we compare the canonical form and the reduced version of the same network N1. (N 1 and its reduced version are taken from Fig. 15 of [46] to avoid possible issues with the reduction algorithm.) As one can see, the canonical form retains more of the complexity of the original network.

Again, this is analogous to, but somehow weaker than our results, since it only provides a sufficient condition for networks to be indistinguishable (which in their context means to display the same tree topologies). This means that there can be irreducible networks that are indistinguishable (e.g. those in Fig. 2) thus failing to achieve the distinguishability goal. Moreover, Gambette and Huber [49] show that a particular class of network topologies (binary galled trees with no gall containing exactly 4 nodes) are uniquely identified by the tree topologies they display. It is clear that this class is too small to achieve the existence goal.

This requirement implies, among other things, that N cannot contain any reticulation v with outdegree 1 (v and its direct descendant would have the same descendant leaves), which in turn implies that regular networks are special cases of our canonical networks (the latter however also specify edge lengths). In fact regular networks satisfy a property that is analogous to the one we prove here for canonical networks: a regular network N is uniquely determined by the tree topologies that it displays [51], meaning that there can be no other regular network N’ displaying exactly the same set of tree topologies. Willson [51] shows this constructively by providing an algorithm that, given the (exponentially large) set of tree topologies displayed by a regular network R, reconstructs R itself. However, unlike for our canonical forms, for a given network there may exist no regular network displaying the same set of trees (e.g. consider the topology of N” in Fig. 5b), thus failing to meet the existence goal. Regularity is in fact a very restrictive constraint for a network. For example, none of the networks in Fig. 5 and Fig. 7 is regular, despite the fact that their topologies are uniquely determined by the trees with edge lengths that they display (a consequence of our results further below). Finally, going back to Fig. 6, collapsing the edge above taxa c and d in R(N1) yields the regular network displaying the same tree topologies as N1 and N2. Again, this shows that the canonical form retains more of the complexity of the original network than its regular counterpart.

Results

Our main result consists of formally proving that for every network N there exists a network N’ in canonical form, indistinguishable from N; moreover, if we restrict ourselves to networks satisfying a mild condition (the NELP property below), such canonical form N is unique (see Theorem 1). In other words, although in general a phylogenetic network N is not uniquely recoverable from the data it generates, there always exists a canonical version N’ of N that is indeed determined by the data. Informally, N’ is all that can be reconstructed about N.

A directed acyclic graph (DAG) is a simple directed graph that is free of directed cycles. A DAG is rooted if it contains precisely one node of indegree 0, called the root. All nodes of outdegree 0 in a DAG are called leaves. A weighted rooted phylogenetic network N = (V, E, (p, A) on X (in this paper also called a network for simplicity) consists of a rooted DAG (V, E) whose leaves are bijectively labeled (via (p:X —> V) with the elements of X (called taxa). Moreover, each edge e E E is associated to a set of positive weights, called lengths, A(e) C R+. Figs. 3, 4, 5 contain examples of networks. A reticulation of a network N is a node v E Vwith indegree greater than 1. A weighted phylogenetic tree on X (a tree for simplicity) is a network on X with no reticulations and such that each edge e has a unique length (|A(e)| = 1), which we denote by A(e). Below, we discuss the biological justification of various aspects of the definitions above.

A rooted network Ng, and the trees it displays (T; and T5), obtained by removing a segment of length 0.5 from the outgroup lineage of N2 in Fig. 2. In our formal setting, a network such as N2 in Fig. 2 can either be represented as N’2 (by omitting the outgroup lineage, or part of it), or by rooting it in its outgroup display the same set of trees, that is T(N1) = T(N2). For example, N1 and N2 in Fig. 3 are indistinguishable, as they display the same set of trees (T1, T2 and T3, up to isomorphism).

Given a network N, a funnel is a node with indegree greater than 0 and outde-gree 1. A funnel-free network, or canonical network, is a network that does not contain funnels. A canonical form of a network N is a network that is funnel-free and indistinguishable from N.

The network N’ in Fig. 4 is a canonical form of N1 and N2 in Fig. 3, as N is funnel-free and indistinguishable from N1 and N2. Similarly, N; in Fig. 6 is a canonical form of N2. Note that nodes with indegree 1 and outdegree 1 are funnels. This implies that for trees the funnel-free condition coincides with the exclusion of suppressible nodes, which is a standard requirement in the definition of phylogenetic trees. It is thus appropriate to view the funnel-free condition as a natural extension of this requisite to networks.

A weighted path in a network N = (V, E, (p, A) is a pair (71, l), where 71 is a directed path in the graph (V, E) and l is a function that associates each edge e in 71 with a length Me) 6 A(e). The length of a weighted path is the sum of the lengths assigned to its edges. A network satisfies the NELP (no equally long paths) property if no pair of distinct weighted paths having the same endpoints have the same length.

The following result states that if we restrict ourselves to networks satisfying the NELP property, then every network has exactly one canonical form. An outline of its proof can be found in the Methods section, including an algorithm showing how to reduce a network to canonical form. The detailed proof is presented in 81 Text.

(i) Every network N has a canonical form. Moreover, (ii) N has the NELP property, then there exists a unique canonical form of N among networks satisfying the NELP property (up to isomorphism). (The notion of isomorphism between networks is only used for mathematical rigor and is defined in 81 Text.) The following result provides a necessary and sufficient condition for two networks satisfying the NELP property to be indistinguishable.

Let N1 and N2 be networks with the NELP property and let N1 and N; be their unique canonical forms satisfying the NELP property. Then N1 and N2 are indistinguishable and only N1 and N; are the same network (up to isomorphism). The following result states that a canonical network with the NELP property is uniquely determined by the trees it displays:

Let N be a canonical network satisfying the NELP property. Then N is the unique (up to isomorphism) canonical network satisfying the NELP property that displays (all and only) the trees in T(N). We now discuss the biological significance of a number of technical aspects of our framework.

Definition of Networks and Trees Displayed by a Network

This is because we assume that the analysis uses an outgroup (possibly consisting of multiple taxa, and With no reticula-tions) for rooting. For simplicity, outgroup lineages are not included in our phylogenies (an eXception to this is in Fig. 2). Note however that, because our phylogenies have edge lengths, and because omitting the outgroup is just a convention, the omitted lineages must have the same lengths for a network and all the trees it displays. For example, if we Wish to omit the outgroup from N2 in Fig. 7 and from the trees that it displays (T1 and T2 in Fig. 2), then What we obtain are Ng, T1 and T; in Fig. 7. This has a notable consequence: the trees displayed by a rooted network with edge lengths may have a root with outdegree 1 (e.g. T1 in Fig. 7). For flexibility, we also allow a network to have a root with outdegree 1.

For example, in Fig. 6, network N; has an edge with two lengths (17 + 112 + 114 and 17 + in + 113 + 114). The motivation behind multiple lengths lies in the observation that, whereas each edge in a phylo-genetic tree describing the evolution of non-reticulating organisms trivially corresponds to a unique evolutionary path in the underlying real evolutionary history, when reticulate events have occurred this is not necessarily true: Fig. 8 and Fig. 9 show that some evolutionary scenarios can either be represented using multiedges (multiple edges with the same endpoints) or edges with multiple lengths. Although these two options are mathematically equivalent, graphically the second one leads to more compact representations, and this is why we choose to allow multiple lengths rather than multiedges. For our purposes we only need to consider the case where e has a finite set of lengths (A(e) = {11(e), . . ., lk(e)}).

(See, e.g., the last common ancestor of c and d in N; in Fig. 6.) Traditionally, the internal nodes in a phylogenetic network are constrained to belong to one of two different categories: reticulate nodes, with more than one incoming edge and just one outgoing edge, and speciation (or coalescence) nodes, with one incoming edge and multiple outgoing edges. Because reticulate and speciation events are clearly distinct, it is

A non-reticulating evolutionary history (left) and a reticulating evolutionary history (right). The black lineages are those leading to a sampled set of taxa X. The horizontal jagged lines represent reticulation events. Note that, whereas representing the scenario on the left with a phylogenetic tree on X is straightforward, for the one on the right several options are possible. We show three alternative representations in Fig. 9.

In our framework, this requirement is dropped, and some networks, notably those in canonical form, may have nodes that both represent reticulate and speciation events. In this case, it is important to understand that these nodes represent a potentially compleX (and unrecoverable) reticulate scenario, followed by one or more speciation events. Compare, for example, network N and its canonical form N’ in Fig. 5, or N2 and N; in Fig. 6. (In the latter, it is especially instructive to consider the reticulate history above the direct ancestor of taxon e.)

The NELP Property

In N1 there are three distinct weighted paths having as endpoints the root of N1 and the leaf labelled by b. The lengths of these paths are 81 = 11 + 16, £2 = 12 + 13 + 15 + 18 and B3 = 12 + 110 + 19 + 18. Moreover, there is another pair of paths having the same endpoints: those of lengths B4 = 13 + 15 and £5 = 110 + 19. Thus N1 has the NELP property if and only if the three numbers 81, 82 and B3 are all different (note that this implies that also 84 and 85 are different). If edge lengths are taken to represent evolutionary change, rather than time, this is a very mild requirement: when edge lengths are drawn at random from a continuous distribution, the probability that two paths get exactly the same length is zero.

For these networks, canonical forms may not be unique (see Fig. 10 for an example of this). Even in this case, we believe that inference methods should only consider phylogenetic networks in their canonical form, as this allows to reduce the solution space without any loss in “expressive power”: since every network N has (at least one) canonical form that displays exactly the same set of trees—and therefore has the same fit with the data as N—restricting the solution space to canonical forms always leaves at least one optimal network within this space. The real weakness of using canonical forms in a molecular clock context is that if a canonical form is not unique, then it cannot be considered representative of all the networks indistinguishable from it. As an example of this, consider the indistinguishable networks in Fig. 10: none of these is representative of all the others.

Different (non-isomorphic) but indistinguishable funnel-free networks. All edges are assumed to have the (unique) length 1 unless otherwise displayed. These networks do not satisfy the NELP property, showing that this is a necessary condition for the uniqueness of canonical forms (Theorem 1(ii)). The ellipsis at the end represents the fact that an infinite number of such networks can be obtained by adding any number of copies of the subgraph in grey in the last network.

Discussion

The bad news is that any method that scores the fit between a network N and the available data—which may be sequences, distances, splits, trees (with or without edge lengths)—based on the set of trees displayed by N must face an important theoretical limitation: regardless of the amount of available data from the taxa under consideration, some parts of the network representing their evolutionary history may be impossible to recover—most notably the relative order of consecutive reticulate events (see, e.g., Fig. 3). The good news is that, when edge lengths are taken into account, we can set precise limits to what is recoverable: the canonical form of a network N is a simplified version of N that excludes all the unrecoverable aspects of N. In a canonical form, reticulate events are brought as forward in time as possible, causing the collapse of multiple consecutive nodes. (Compare again network N2 and its canonical form N; in Fig. 6.) The importance of the canonical form N’ of a network N lies in the fact that, if we restrict our consideration to networks with the NELP property, N’ is the unique canonical network consistent with perfect and unlimited data from the taxa in N.

Both represent lack of knowledge about the order of evolutionary events: speciations or more generally lineage splits in the first case, and reticulate events in the second. However, there is also an important difference between them: whereas in principle polytomies can be resolved by collecting further data from the taxa in the tree (for example, by extensive sequencing of their genomes [52] ), the standard network inference methods considered here cannot resolve collapsed nodes in a canonical network, irrespective of the amount of data from the taxa under consideration. This difference is mitigated by the observation that increased taxon sampling may indeed permit to resolve the collapsed nodes, when the new lineages break adjacencies between reticulate nodes. However, such lineages may not always exist or they may be difficult to sample.

We illustrate these consequences starting from a well-known problem of network inference methods, that of multiple optima. It has been noted before that many of the inference methods that have been recently proposed—especially those solely based on topological features—often return multiple optimal networks: Huson and Scor-navacca show a striking example of this (Fig. 2 in [53]), where the problem of finding the simplest network displaying two given tree topologies admits at least 486 optimal solutions.

For the example of 486 optimal solutions, this large number may be partly due to the fact that the goal was to achieve consistency with only two tree topologies. More data may enable to discriminate among the 486 returned networks. Non-identifiability, which occurs when none of the allowed data can discriminate between two or more networks, is a more serious problem than insufficient data, as it cannot be solved by simply increasing the size of the input sample. Another interesting example appears in a paper by Albrecht et al. [54] , which we reproduce here in Fig. 11. Here, there are only three optimal networks, essentially differing for which of the three clades {A.bicornis, A.longissima, A.sharo-nensis}, {A.aniaristata, A.comosa} and {A.taaschii} is considered as an hybrid (in this example reticulations represent hybridizations). This pattern is entirely analogous to that of the three networks in Fig. 2 (with a, b and c replaced by the three clades above), meaning that these three networks are indistinguishable to methods not accounting for edge lengths. Therefore, in this example, the existence of multiple optimal solutions is entirely due to non-identif1ability. All this motivates three recommendations: 1. It is important to use data in a way that causes non-identifiability to be as limited as possible. For example, as we have seen, accounting for edge lengths solves some cases of non-identifiability (e.g., in Fig. 2) although it does not eliminate this problem altogether (e.g., in Fig. 3).

Given an inferred network N , it is important to know the set of networks that are theoretically impossible to distinguish from N : no matter the amount of data, they Will all receive the same support as N. We may call this set the indistinguishable class of N . The biologist using an inference method must be aware that N is not the only network supported by the data.

It would be highly useful to devise inference methods that instead of searching for (or directly constructing) solutions in the space of all possible networks, only considers one element per indistinguishable class. This has the potential to significantly speed up the inference. Correspondingly, we recommend that edge lengths should be accounted for in the analyses (point 1) and, for each of the indistinguishable classes resulting from this choice, we identify a canonical network that, for all practical purposes, can be considered to be unique. Most important to the end users, we propose that a canonical network N is what should be given as the result of the inference, with the caveat that N is a way to represent a class of networks that are all equally supported (point 2). In a canonical form N, the aspects that are not common to all networks in this class are collapsed, as described above. This will help the evolutionary biologist to locate the uncertainties in the phylogeny, and possibly to choose further taxa to resolve them. Finally, we propose that inference methods only attempt to search among—or construct—phy-logenetic networks in their canonical form (point 3).

In the case of sequence-based methods, one may take into account the natural order of sites within a sequence [11—13, 55, 56]. Similarly, for reconstruction methods based on collections of subtrees, one could observe and use the relative position of the different genomic regions supporting the input trees. However, these relative positions must be conserved across the genomes being analyzed, a condition which may hold for recombining organisms (e.g. individuals within a population or different viral strains), but which is not obvious when studying a group of taxa that have undergone reticulate events (e.g., hybridization) at some point in a distant past. The main conclusion of the present study is the following: unless one abandons any optimization criterion that scores a network solely based on the trees it displays, the reconstruction should be carried out in a reduced space of networks: that of the canonical forms defined here. The motivation for this lies in the fact that canonical networks are guaranteed to be uniquely determined, if sufficient data are available. Once a canonical form N is inferred, it must be kept in mind that even assuming that the inference is free of statistical error, the true phylogenetic network is just one of the many networks having N as canonical form. Compared to what biologists are used to for phylogenetic trees—where in principle it is always possible to resolve un-certainties—it is clear that this requires an important change of perspective.

Methods

In the case of Theorem 1 part (ii), only the gist of the proof is provided here. The proof in full detail is deferred to 81 Text.

Reduction Algorithm

In order to prove that any network N has a canonical form, we describe an algorithm to transform N into a canonical network indistinguishable from N. The algorithm simply consists of repeatedly applying to N = (V, E, (p, A) one of the following two reduction rules, until neither can be executed (see Fig. 12):

Note that, even if the algorithm may temporarily produce multi-edges, the network produced in the end obviously does not have any multi-edge (otherwise we could still apply rule R2).

We must prove that any network N = (V, E, (p, A) has a canonical form. For this, we apply the reduction algorithm described above, thus obtaining a sequence N0 = N, N1, . . ., Nm, where each Ni+1 is obtained from N,- by applying either R1 or R2. Neither R1 nor R2 can be applied to Nm. We prove that Nm is a canonical form of N. Although, strictly speaking, N,- may not be a network (as it potentially contains multi-edges), the notion of trees displayed by N,, and thus that of indistinguishability, trivially extends to these multigraphs. Fig 12. The two rules at the basis of the canonical reduction algorithm.

This is true because at each iteration the size of E is reduced by at least one. Moreover, the resulting network Nm is funnel-free, since no reduction of type R1 can be applied to it.

To this end we prove that, at each iteration, N,- and NM are indistinguishable, i.e. T(N,-) = T(N,-+1). In other words any tree T is displayed by N,- if and only if T is displayed by NM.

Then T can be obtained by suppressing all suppressible nodes from a tree T,- contained in Ni. We consider three cases. (1) If none of the edges in T,- is involved in the reduction transforming N,- into NM, then clearly T,- is still contained in NM and thus T is still displayed by NM. (2) If T,- is involved in a R1 reduction, then it contains a funnel v and it contains one of the in-edges of the funnel, say (u, v), with length lj E Aj = A((u-, 12)), along with the out-edge (v, w), with length 10 6 A0 = A((v, w)). Now, let T,-+1 be the tree obtained from T,- by suppressing the suppressible node v and thus creating a new edge (uj, w) with length lj + 10. Because the R1 reduction creates a new edge (uj, w) with length set Aj + A0, containing the value lj + 10, then Tm is contained in NM. Moreover, it easy to see that T can still be obtained by suppressing all suppressible nodes from TM. Thus T is still displayed by NM. (3) If T,- is involved in a R2 reduction, then it contains one of the edges of a multi-edge (u, w), with a length l belonging to one of the length sets A'1,/1'2, . . . ,A; associated to the k In order to prove that, conversely, T(N,-+1) Q T(N,-), one can proceed in a similar way as above: if T is displayed by NIH, then T can be obtained by suppressing all suppressible nodes from a tree TM contained in Ni. By considering three cases analogous to the ones above regarding the involvement of TM in the reduction transforming N,- into NM, we can prove that in all these cases T is already displayed by N. Thus N,- and NM are indistinguishable, which concludes our proof.

To see this, it suffices to show that if two different reductions are applicable to a network, then the result of applying them is the same irrespective of the order of application. As we do not need this remark for the other results in this paper, we do not give a formal proof of it.

Let N be a network and N’ a canonical form of N obtained by applying the reduction algorithm. If N satisfies the NELP property, then N’ satisfies the NELP property.

Suppose the contrary; then, NM contains two distinct weighted paths p1, p2 with the same endpoints u and v and same lengths. Because R1 /R2 cannot create new nodes, u and v are also nodes in Ni. Moreover, it is easy to see that each weighted path p in N,- from u to 1/ gives rise to exactly one weighted path f(p) in NM from u to v, with exactly the same length as p. Now take two weighted paths in N, one in the preimagef1(p1) and the other in the preimagef1(p2). These two weighted paths in N,- are distinct (as p1 7E p2), have the same endpoints (u and v) and the same length. But then N,- violates the NELP property, leading to a contradiction. We thus have that if N,- satisfies the NELP property, then NM also satisfies it. By iterating the argument above for each step in the reduction algorithm, the lemma follows.

Uniqueness of the Canonical Form for Networks Satisfying the NELP

In this section, we introduce a number of new concepts and state the main intermediate results that are necessary to obtain this result. We leave their detailed proofs to 81 Text, together with the obvious definitions of basic concepts such as that of isomorphic networks, subnetwork and union of two networks.

Illustration of Definition 3. P (edges in black) is a root-leaf path of N and thus both a wishbone and a crack of N. R and 8 (black) are cracks of N. Q (black) is a wishbone of N. All edges are assumed to have the (unique) length 1 unless otherwise displayed. path P and a node v belonging to it, any weighted path formed by all the ancestors [descendants] of v in P is a prefix [suffix] of P. Note that a prefix [suffix] only consists of one node when 1/ is the root [leaf] of P. A wishbone of N is any subnetwork of N formed by taking the union of two root-leaf paths that have in common only a prefix. A crack of N is any sub-net-work of N formed by taking the union of two root-leaf paths that have in common only a prefix and a suffix.

14 illustrates the definitions above. Note that any root-leaf path P is both a wishbone and a crack, as P is the result of the union of P with itself, and P has a common prefix and a common suffix with P. Moreover, any subnetwork R that can be obtained from a root-leaf path by attributing two lengths to one of its edges e is a crack. Finally, note that wishbones and cracks are networks, and thus the notion of isomorphism (Definition 5 in 81 Text) can be applied to them. The proof of part (ii) in Theorem 1 depends on two important results (Propositions 1 and 2 below), whose proofs can be found in 81 Text. The first states that a network with the NELP property is uniquely determined by the wishbones and cracks it contains. Proposition 1. Two networks N1 and N2 with the NELP property are isomorphic if and only if they contain the same wishbones and cracks (up to isomorphism).

Unfortunately this algorithm would be impractical, as the number of wishbones (or cracks) in a network is not polynomial in the size of the network. Also note that we require N1 and N2 to satisfy the NELP property because there exist non-isomorphic networks containing the same wishbones and cracks: for example the networks in the bottom line of Fig. 10. The second result that we need is the following:

Let N1 and N2 be two indistinguishable funnel-free networks, satisfying the NELP property. Then they contain the same wishbones and cracks (up to isomorphism).

By Lemma 1, N satisfies the NELP property. Now suppose that there exists another canonical form of N, called N",

By transitiVity, N’ and N” are indistinguishable. Because N’ and N" are indistinguishable, funnel-free and with the NELP property, N’ and N” must contain the same wishbones and cracks (because of Proposition 2). But then, because of Proposition 1, N’ and N” are isomorphic.

This claim would allow us to simplify the statement of Theorem 1: networks with the NELP property would be guaranteed to have a unique canonical form (not just among networks with the NELP property, but among all networks). Unfortunately, to this date, we were unable to prove this conjecture. Nonetheless, note that the reduction algorithm returns, for any network with the NELP property, its unique canonical form with the NELP

Corollaries

The first one states that two networks N1 and N2 satisfying the NELP property are indistinguishable if and only if their unique canonical forms with the NELP property, N; and N; respectively, are isomorphic. By Lemma 1, N1 and N; can be obtained by applying the reduction algorithm to N1 and N2.

The part trivially follows from the transitivity of indistinguishabili-ty. As for the only part, note that (again by transitivity) N; is indistinguishable from N2. As it is also funnel-free, N; is a canonical form of N2. Because N2 can only have one canonical form satisfying the NELP property (by Theorem 1 (ii)), N; and N; must be the same network (up to isomorphism). As for Corollary 2, we recall that it states that a canonical network N with the NELP property is uniquely determined by the trees it displays.

Let N and N’ be indistinguishable canonical networks satisfying the NELP property. Then, N and N’ are both canonical forms of N satisfying the NELP. But then, by Theorem 1(ii), N and N must be the same network (up to isomorphism).

Supporting Information

Supporting Information: a mathematical theory of explicit phylogenetic networks with edge lengths. This document provides an introduction to the mathematical theory of eXplicit phylogenetic networks With edge lengths, leading in particular to the proofs of Propositions 1 and 2, Which are necessary for the proof of Theorem 1, part (ii). In the last section, we consider networks With inheritance probabilities and their relevance for likelihood-based reconstruction. (PDF)

Acknowledgments

We are grateful to O.Gascuel for advice on the structure of the paper.

Author Contributions

Conceived the question: FP CS. Proved the main formal results: FP.

Topics

phylogenetic

Appears in 36 sentences as: Phylogenetic (1) phylogenetic (36) phylogenetics (1)

In Reconstructible Phylogenetic Networks: Do Not Distinguish the Indistinguishable

Phylogenetic networks represent the evolution of organisms that have undergone reticulate events, such as recombination, hybrid speciation or lateral gene transfer.
Page 1, “Abstract”
An important way to interpret a phylogenetic network is in terms of the trees it displays, which represent all the possible histories of the characters carried by the organisms in the network.
Page 1, “Abstract”
Given data that undenNent reticulate evolution, only the canonical form of the underlying phylogenetic network can be uniquely reconstructed.
Page 1, “Abstract”
We consider here an elementary question for the inference of phylogenetic networks: what networks can be reconstructed.
Page 1, “Author Summary”
Indeed, whereas in theory it is always possible to reconstruct a phylogenetic tree, given sufficient data for this task, the same does not hold for phylogenetic networks: most notably, the relative order of consecutive reticulate events cannot be determined by standard network inference methods.
Page 1, “Author Summary”
Here we propose limiting the space of reconstructible phylogenetic networks to what we call “canonical net-works”.
Page 2, “Author Summary”
Once a canonical network N is inferred, it must be kept in mind that—even with perfect and unlimited data—the true phylogenetic network is just one of the potentially many networks having N as canonical form.
Page 2, “Author Summary”
This is an important difference to what biologists are used to for phylogenetic trees, where in principle it is always possible to resolve uncertainties, given enough data.
Page 2, “Author Summary”
Explicit [1] or evolutionary [2, 3] phylogenetic networks are used to represent the evolution of organisms or genes that may inherit genetic material from more than one source.
Page 2, “Introduction”
They are called “explicit” to distinguish them from “implicit” [14], “abstract” [1] or “data-display” [3] phylogenetic networks, which are used to display collections of alternative evolutionary hypotheses supported by conflicting signals in the data.
Page 2, “Introduction”
This observation gives rise to the notion of trees displayed by a network, which are all the possible single-character histories implied by a phylogenetic network.
Page 2, “Introduction”

See all papers in April 2015 that mention phylogenetic.

See all papers in PLOS Comp. Biol. that mention phylogenetic.

maximum likelihood

Appears in 4 sentences as: maximum likelihood (4)

In Reconstructible Phylogenetic Networks: Do Not Distinguish the Indistinguishable

This is true for all methods that evaluate a phylo-genetic network solely on the basis of how well the displayed trees fit the available data, including all methods based on input data consisting of clades, triples, quartets, or trees with any number of taxa, and also sequence-based approaches such as popular formalisa—tions of maximum parsimony and maximum likelihood for networks.
Page 1, “Abstract”
In the following, we describe how this applies to the two main approaches for explicit network reconstruction: consistency-based approaches (see [28] for a survey)—seeking a network consistent with a number of prior evolutionary inferences (typically trees or groupings of taxa)—and sequence-based approaches, such as standard formulations of maximum parsimony and maximum likelihood for networks [2, 29—33].
Page 2, “Introduction”
The same holds for many, sequence-based, maximum parsimony and maximum likelihood approaches proposed in recent papers.
Page 3, “An Identifiability Problem”
As for maximum likelihood (ML), Nakhleh and collaborators [2, 32, 33, 38] have proposed an elegant framework whereby a phylogenetic network N is not only described by a network topology, but also edge lengths and inheritance probabilities associated to the reticulations of N. As a result, any tree T displayed by N has edge lengths—allowing the calculation of its likelihood Pr(A| T) with respect to any alignment A—and an associated probability of being observed Pr(T|N).
Page 4, “An Identifiability Problem”

See all papers in April 2015 that mention maximum likelihood.

See all papers in PLOS Comp. Biol. that mention maximum likelihood.