Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting
Davide Albanese, Carlotta De Filippo, Duccio Cavalieri, Claudio Donati

Abstract

Despite the flourishing of new analysis methods, current approaches based on statistical comparisons between high-level taxonomic classes often fail to identify the microbial taxa that are differentially distributed between sets of samples, since in many cases the taxonomic schema do not allow an adequate description of the structure of the microbiota. This constitutes a severe limitation to the use of metagenomic data in therapeutic and diagnostic applications. To provide a more robust statistical framework, we introduce a class of feature-weighting algorithms that discriminate the taxa responsible for the classification of metagenomic samples. The method unambiguously groups the relevant taxa into clades without relying on predefined taxonomic categories, thus including in the analysis also those sequences for which a taxonomic classification is difficult. The phylogenetic clades are weighted and ranked according to their abundance measuring their contribution to the differentiation of the classes of samples, and a criterion is provided to define a reduced set of most relevant clades. Applying the method to public datasets, we show that the data-driven definition of relevant phylogenetic clades accomplished by our ranking strategy identifies features in the samples that are lost if phylogenetic relationships are not considered, improving our ability to mine metagenomic datasets. Comparison with supervised classification methods currently used in metagenomic data analysis highlights the advantages of using phylogenetic information.

Author Summary

Thanks to the decreasing cost of sequencing, large amounts of data have been generated for environmental samples and for a variety of health-associated conditions. In parallel there has been a flourishing of statistical methods to analyze metagenomic datasets, concentrating mainly on the problem of assessing the existence of significant differences between microbial communities in different conditions. However, for a large number of therapeutic and diagnostic applications it would be essential to identify and rank the microbial taxa that are most relevant in these comparisons. Here we present PhyloRelief, a novel feature-ranking algorithm that fills this gap by integrating the phylogenetic relationships amongst the taxa into a statistical feature weighting procedure. Without relying on a precompiled taxonomy, PhyloRelief determines the lineages most relevant to the diversification of the samples guided by the data. As such, PhyloRelief can be applied both to cases in which sequences can be classified according to a known taxonomy, and to cases in which this is not feasible, a common occurrence in metagenomic data analysis given the increasing number of new and uncultivable taxa that are discovered using these technologies.

Introduction

Extensive explorative studies of the microbiota colonizing several districts of the human body have been conducted, highlighting the large variability from site to site, as well as the interpersonal differences in the same body site [1]. The more extensively studied district is the human gastrointestinal tract (GI), whose metagenomics composition appears to be influenced by several factors [2], including age [3,4], geography [5], diet [6], and lifestyle [7]. In addition, a correlation between imbalances or abnormal composition of the gut microbiota and a number of pathologic conditions has been proposed. These alterations might be due to therapeutic interventions, like antibiotic treatment [8] , or different lifestyle [9].

Fecal transplantation has shown its effectiveness and safety in the treatment of recurrent Clostridium difiicile infections [10], which are known to correlate with altered microbiomes following antibiotic treatment [11]. Alternatives for bioremediation of microbiota alterations is the supplementation of pro or prebiotics, while it has been suggested that antibiotic treatment and vaccination can be used to guide the structure of the gut micro-biota towards a status that is compatible with health [12,13]. Most of these intervention strategies would greatly increase their efficacy using a precise definition of the microbial species that are differentially distributed in health and disease conditions. This task faces several difficulties. On one hand, most of the microorganisms composing the human and environmental micro-biota are poorly characterized, difficult to cultivate, and lack a precise taxonomic classification. On the other hand, methods to unambiguously define the microbial taxa that are responsible for these differences are still lacking, and their identification usually relies on a small number of arbitrarily chosen association tests with high-level taxonomic classes, or on statistical learning methods, both evaluating only taxa for which a taxonomic classification is possible [14]. In addition, the low abundance of most microbial taxa in metagenomic samples poses additional challenges only recently tackled with statistical methods [15]. In amplicon metagenomics, the composition in term of microbial genera of a sample is inferred from the high throughput sequencing of a small number of diagnostic genomic loci, the most popular being the V1—V6 variable regions of the 16S rDNA gene for bacteria [16] and the ITS spacer for fungi [17] , selectively amplified using broadly conserved PCR primers. As a proxy for species, Operational Taxonomic Units (OTUs) are determined by the clustering of the sequences up to a given level of similarity, usually 97%. Using the OTUs abundances, the differentiation between samples or classes of samples is accomplished by measuring their fi-diversity, i.e. the variations in community membership across the different groups [18]. Given that the sequences of marker genes are available, phylogenetic measures of diversity such as UniFrac [19,20] have proven to be able to identify subtle differences in the structures of microbial communities by weighting species abundances with the phylogenetic relationships amongst taxa.

By integrating the phy-logenetic relationships amongst taxa into the framework of statistical learning, the method is able to unambiguously group the taxa into clades without relying on a precompiled taxonomy, and accomplishes a ranking of the clades according to their contribution to the sample differentiation. We applied the method to a meta-analysis of two recent datasets of comparative studies of the gut microbiota of European, USA, African and South American healthy individuals, identifying bacterial taxa that are differentially distributed with geography and age. Comparison of the performances of the method to popular feature selection and classification algorithms shows that or strategy is effective in identifying microbial clades associated to the different sample groups, providing a novel analysis method for targeted metagenomic datasets.

Results

healthy vs. disease, lean vs. obese, population A vs. population B, etc.) in a metagenomic dataset. The method is designed to analyze any set of samples that has been characterized via high throughput sequencing of one or more marker genomic loci, whose sequences have been clustered into OTUs. The process requires that the samples are unambiguously classified into cases and controls according to the description provided by the study design, and that a phylogenetic tree of the OTUs has been obtained by molecular phylogenetic analysis. The algorithm is composed by two main conceptual steps: i) a scoring scheme that ranks the branches of the OTU tree according to their contribution to the differentiation of the classes, and ii) a merging step that merges nested subtrees into independent clades. At the end of this procedure, PhyloRelief ranks the clades according to their discriminant power between cases and controls. Definition of the scores

. . ), PhyloRelief ranks the internal branches in the OTU tree by assigning them a score w that reflects their importance in the differentiation of the classes. In its simplest form the procedure is as follows. First, one sample 8 is randomly chosen and its nearest hit H (i.e. the nearest sample of the same class) and miss M (Le. the nearest sample of a different class) are individuated (Fig. 1). Next, the score w of each clade is increased by an amount proportional to the contribution of the clade to the distance between S and M, and decreased by an amount proportional to its contribution to the distance between S and H. In this way, the score of those clades that support the fact that S is more distant from M than from H is increased, while the score of those that support the contrary is decreased. A detailed description of the update rules is given in the Methods section. After that the procedure has been repeated over all possible choices of 8, each clade has a score w that is high if the clade supports the partitioning of the samples into classes, {C1}, {C2}, and low if it does not (Fig. 1). The critical step of the procedure is the choice of the update function, for which different definitions are possible. Here we define (see Methods): a) an unweighted update function, that, for each clade, is proportional to the fraction of the clade that is unique to one of the classes, Le. the fraction of the phylogenetic tree from which descend only OTUs belonging to one of the classes; b) a weighted update function, in which each branch of the tree is weighted by a quantity proportional to its unbalance between the classes, Le. the difference between the number of sequences in samples from one class and from the other. Analogously to the Relief-F extension of the Relief algorithm, PhyloRelief can be applied to multi-class problems and can use k-nearest neighbors in the score computation, becoming robust in the case of noisy or unbalanced data sets [21].

Correlation between the lineages and identification of the clades

subtrees in a tree) introduces a correlation that needs to be taken into account when analyzing the data, and that can be exploited to define a set of independent clades ranking them according to their relevance. If a given branch is heavier, due to unbalanced OTUs distribution between the different classes, its weight will propagate to the parent branches, where it is either reinforced by coalescing with branches sharing a similar unbalance, or diluted if the coalescing branches have contrasting or no unbalances. Exploiting this property, individual lineages can be clustered into taxonomic clades by inspecting the profile of the weights along the tree and identifying the branch where this has a local maximum. This rule, exemplified in 81 Fig, (see Methods) naturally defines a set of independent taxonomic clades and ranks them according to their contribution to the diversification between the classes. Using this ranking, the minimal set of clades necessary to describe the classes to a certain level of accuracy is determined by running nonparametric tests of class diversification, such as PERMANOVA[23] and ANOSIM[24] , as a function of the number of clades.

Applications

To allow joint analysis of these two datasets, OTUs were picked using a reference database (see Materials and Methods) and the OTU tables were merged and raref1ed to the same number of reads. A PCoA analysis of the weighted UniFrac distances (Fig. 2A) shows that the samples segregate by geographical origin, with the USA and Italian samples clearly distinct from the African (Malawi and Burkina Faso) samples, and the Venezuelan occupying an intermediate position between the two groups. Previous meta-analyses of these data have shown differences in microbiota composition correlating to the “Western” (USA and Italy) or “non-Western” (Malawi, Burkina Faso and Venezuela) origin of the samples, and it has been suggested that these differences are related to the different balance between protein-rich and f1ber-rich diet in these communities [2,5,6]. Stratifying the data by age of the subjects shows (SZ Fig) that the age is also an important factor in the variability of the human gut microbiota, and that this variability seems to be highest at younger age.

To identify the number of clades that were more relevant to differentiate the two classes, we performed ANOSIM and PERMANOVA analysis with increasing number of clades ranked according to the PhyloRelief weights (Table 1). This procedure showed that both ANOSIM and PERMANOVA had a maximum comprised between 20 and 30 clades, indicating that using this number of clades the separation between the groups is largest. In Fig. 2B we show a phylogenetic tree of the OTUs present in the samples, with those included in the 30 most relevant clades identified by PhyloRelief highlighted (in red OTUs more prevalent in Malawi, Burkina Faso and Venezuela, in green OTUs more prevalent in the USA and Italy). It is worth noting that most of these clades were specifically more represented in the non-Western samples, while only few were specific of the Western individuals, and that much of the differences were confined within the order Bacteroidales. In particular, the Malawi, Burkina Faso and Venezuelan samples were rich in Prevotellaceae, while the Western samples were rich in Ruminococcaceae. In Fig. 2C the PCoA of the weighted UniFrac distances computed on the 30 most relevant clades is shown. Although the Western samples were distinct from the rest, they showed a large degree of variability, with a small fraction of samples from the USA closely related to the Malawi, Burkina Faso and Venezuelan samples. In addition, age was still a major factor, being closely associated to the second component of the PCoA (S3 Fig). To further investigate the individual distribution of the 30 most relevant clades, we Table 1. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between Western (USA and Italy) and non-Western (Ma-show in Fig. 2D a heatmap of the logarithm of the prevalences of the OTUs within these clades. These data confirmed that there was a group of individuals from the USA that were closely related to the non-Western individuals, sharing three clades of Prevotellaceae with most of the Malawi, Burkina Faso and Venezuelan subjects. Stratifying the subjects by age, we found that in both classes young subjects (below 2 years of age) were clearly distinct from older subjects (Fig. 2D, upper panel). In addition, while we found clear separation between Western and non-Western adult subjects, some of the Western young subjects were classified by hierarchical clustering together with the non-Western young subjects and vice-versa, suggesting that at young age cultural or geographical differences are less important in determining the structure of the gut microbiota probably related to the instability of the gut microbiota, a phenomenon typical of childhood[5].

We found (S4 Fig and 81 table) that the differentiation between young and older subjects was largest when the age threshold was set to two years, and that above 14 years of age, there was no difference between the microbiome of young and adult subjects. However, running the PhyloRelief analysis on the complete dataset, we could not unambiguously identify a minimal set of bacterial clades associated to this differentiation (82 Table). This result was likely due to the different gut microbiota of Western and non-Western adult subjects. For this reason, we repeated the analysis separately for Western and non-Western samples. ANOSIM and PERMANOVA showed that the maXimum Table 2. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between young (below two years of age) and older (above two years of age) Western individuals (USA and Italy).

The differentiation between young and adults was much sharper in Western subjects, with a prominent role played by Lachnospiraceae and Ruminococcaceae in the adults (Figs. 3A and SS). In non-Western subjects (Figs. 3B and S6), there was also a contribution of the presence of five clades of Prevotella-ceae to the differentiation of the adult gut microbiota. In both Western and non-Western samples the younger subjects have higher abundance of Bifidobacteriaceae (Fig. 3), probably due to breast-feeding in infants [5]. Bifidobacteriaceae were present at lower prevalence in most adult subjects, except for the adults from Burkina Faso probably due to the absence of dairy food in adult age in this African population [25]. Table 3. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between young (below two years of age) and older (above two years of age) non-Western individuals (Malawi, Burkina Faso and Venezuela).

Predictivity of the ranked features in supervised classification problems

Identifying a ranking strategy to reduce the dimensionality of the dataset can improve the effectiveness of classification algorithms in metagenomic datasets, where correlations between the variables are introduced both by the phylogenetic relationships between the clades and by the fact that relative abundances are measured. The Random Forest (RF) classifier was recently proven to be the most effective in this class of problems [26,27] , both for feature selection and classification. Although the main goal of this work is to define a phylogeny-based OTUs ranking method, it is interesting to assess the predictive power of the ranked taxa for the classification of samples into predefined categories in comparison to other state of the art algorithms. For this purpose, we selected four publicly available datasets, including data from four body sites (forehead vs. external nose and volar forearm vs. popliteal fossa) [1], from fecal samples (IBD vs. healthy) [28] and skin [29] (using both subject identifi-cation—3 classes—and subject/hand identification—6 classes—as target) that have recently been used as benchmark in comparative evaluations of classification methods for metagenomic data [26,27,28].

The performances were assessed in terms of average predictive accuracy using the K-category correlation coefficient (KCCC), a multiclass eXtension of the Matthews Correlation Coefficient (MCC) [32] (see Materials and Methods for details on the procedure). The results are reported in Table 4. We found that While in one case

Discussion

The relatively low cost of sequencing has triggered an exponential increase in the amount of data generated, that have highlighted correlations between the structure of the microbiota and important human pathologies for which conventional intervention strategies were not effective. This suggests that a precise definition of the structure of the healthy vs. disease microbiota could allow early diagnosis and the definition of effective intervention strategies in a number of pathologies. To become a viable diagnostic and therapeutic tool, the evolution of sequencing technologies needs to be paralleled by progress in computational tools enabling to significantly correlate phenotypes to the smallest possible number of microbial taxa. This would allow, on one hand, to develop relatively cheap and easy to use diagnostic tools, and on the other hand to design focused and personalized intervention strategies.

The improvement of this method over existing ones consists in its ability to accomplish a ranking of the microbial clades, defined on the basis of the taxa distribution amongst the samples weighted by phylogenetic information, discovering those that contribute to the differentiation between two or more classes of samples. Importantly, this result is obtained without relying on a predefined set of taxonomic categories that are often hard pressed to describe the complexity of the evolutionary relationships between microorganisms.

Of particular interest were the results obtained when comparing infants vs. adults in the different geographies, showing that age has a much greater influence in the USA and Italy than in the African and South American samples, with a much larger fraction of the OTU differentially distributed between young children and adults in the former than in the latter. Comparing the performances of the algorithm to LEfSe, MetaPhyl and to Random Forest in a classical supervised classification schema using cross validation, we found that the taxa ranked by PhyloRelief had also a high predictive value, performing as well as—and in some cases outperforming—current gold standard methods.

The method presented here is technology agnostic since it can be used to interpret data generated by the targeted amplification of marker genomic loci, such as the variable regions of the 16S rDNA gene for bacteria, or the ITS sequences for fungi as well as complete metagenome sequencing data, such as those obtained with Illumina technologies. In addition, the algorithm can readily be extended to regression problems to include cases where a continuous variable differentiate the individual samples using the RReliefF extension of Relief [21,22]. The PhyloRelief class of algorithms fills a significant gap in the growing array of computational methods that are currently used for the analysis of metagenomic data, and will impact importantly on the application of metagenomics to the development of novel diagnostic markers, leading the application of these approaches from the bench to the bedside.

Materials and Methods

The PhyloRelief algorithm

Given the availability of phylogenetic data, [3 diversity measures incorporating phylogenetic information, such as weighted and unweighted UniFrac [19,20] have become popular in the context of metagenomic research, but other measures, such as Bray-Curtis dissimilarity index could also be used. Let us define a partitioning of S into sample class {C1} and {C2}. Usually, this partitioning is obtained either by exploratory analysis of the distance matrix DS, or by the study design (e.g. according to the origin of the samples, health status or age of the donor in the case of human samples, etc. ).

To accomplish this result, we developed a modified version of the Relief-F procedure that takes into account the phylogenetic information contained in the tree T. To this purpose, the algorithm does not work directly with the OTUs, but with the clades (or sub-trees) T,- of the tree T. Below we report the two main steps of the algorithm, i.e. i) the scoring scheme ranking the sub-trees of the tree T, and ii) the merging step that identifies the independent clades.

Two different scoring function have been define: a unweighted update function and a weighted update function. Being closely related to the unweighted Uni-Frac distance, the unweighted update function is the natural choice when exploratory analysis of the samples has been performed using unweighted UniFrac, while the weighted update function is the natural choice when weighted UniFrac has been used. end; end ° where (9: is equal to 1 if the branch Bq contains OTUs from sample 8, 0 otherwise. In this way, the contribution of each clade does not take into account the prevalence of the OTUs in the different classes, but just their presence. Consequently, the algorithm identifies those lineages that are specific to one of the classes of samples. Weighted PhyloRelief. In weighted PhyloRelief we use the same iterative procedure defined above, but the update function d(T,-,A,B) is defined as:

p2 and p? are the fraction of the taxa descending from the branch Bq that are from samples A and B, respectively. bq is the length of the branch Bq. In other words, d(T,-,A,B) is the weighted UniFrac distance between samples A and B due to the OTUs in the subtree T,- of the tree T defined by the branch 3,. By iterating over 8-, the procedure positively weights those sub-trees T,- that support the partitioning of S into {C1} and {C2}, and negatively those that do not support the partitioning. m is a user defined parameter that, in practice, can be set to the number of samples in S.

Analogously to the Relief-F algorithm, PhyloRelief can work with multi-class classification problems. Moreover, in its generalized form, the algorithm again randomly selects a sample 8, but then identifies k of its nearest neighbors from the same class, called nearest hits H1, and k nearest neighbors from each of the different classes, called randomly select a sample Sj; find k nearest hits H1; for each class Cyéclass (Sj) do Hz)

HQ

end; end; end; where P(C) is the fraction of samples in class C. The factor $256)) is required for ensuring J appropriate normalization and to guarantee that the contribution of hits and of each class of misses is between 0 and 1.

Let be a subtree of T,. The correlation between the values w[T,-], w[Tj] is illustrated by the example in SI Fig, where the weights for four samples (“cir-cles”, “squares”, “triangles” and “stars”) partitioned into two classes (“red” and “blue”) are shown. The high weights in the bottom clade (Clade b, containing OTUs only from the “red” class of samples) propagates up from the terminal branches until it merges with Clade a, that contains OTUs from the “blue” class of samples. In the parent branch of Clades a and b, the unbalance between the two samples is diluted, and consequently the weight decreases. In this example, Clade a and clade b separately, but not their parent clade, would be responsible for the differentiation between “red” and “blue” samples.

To accomplish this, we identify the subtree Twith the highest weight. In the case of ties, we randomly start from one of the subtrees if these are independent, or take the one closest to the root if these are nested. This defines the first clade. Next, we prune the tree from all the branches descending from T, and from those ascending from T. This last step is needed to avoid the possibility to iteratively enlarge the same clades, given the correlation between the weights of nested sub-trees discussed above. Iteratively applying this rule, we define a set of independent clades ranked according to their weight. Using this ranking, the minimal set of clades necessary to describe the classes to a certain level of accuracy is determined by running nonparametric tests of class diversification, such as PERMANOVA[23] and ANOSIM[24] , as a function of the number of clades. Alternatively, univariate nonparametric tests such as the Kruskal—Wallis could be applied for testing whether samples originate from the same distribution in each clade.

Definition of the OTU table. USA, Venezuela and Malawi dataset. Reads were downloaded from the MG-RAST web server (http://metagenomics.anl.gov/ metagenomics.cgi?page=MetagenomeProject&project=401). II) Burkina vs. European children. Raw data were obtained from the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/ sra), study number ERP000133. Forward and reverse primers were removed using cutadapt [33] and reads missing the forward primer were discarded. Quality trimming was performed by sickle (https:/ / github.com/ najoshi/ sickle), with quality threshold 20. Reads with length < 100 were discarded. Chimeras were removed by uchime[34] (reference database mode) using the Greengenes[35] database (Release 13_5) clustered at 85% identity threshold as reference.

The OTU table was obtained using custom scripts, and rarefied using the rare-fy_even_depth() function in the phyloseq package (version 1.6.1) of the R (version 3.0.2) statistical software. Taxonomy was assigned to the representative sequences using the RDP classifier [36] (version 2.5) with a confidence threshold of 0.8. UniFrac distances were computed and the PhyloRelief analysis was performed using the reference phylogenetic tree from the Greengenes release.

Body Habitats (CBH) and Fierer et al. Subject (FS) Datasets. Preprocessed reads and metadata were downloaded from the Qiime repository http://www.microbio. me/qiime/ under the study ids 1290, 449 and 232 respectively. OTU tables and phylogenetic trees were inferred using the standard QIIME pipeline with default settings, where OTUs were picked with UCLUST [37] at a sequence similarity threshold of 0.97%. Taxonomy was assigned using the Greengenes database (version 2013/05). Rarefactions were performed at the depth of the shallowest sample. Phylogenetic tree were computed using PyNAST [38] (using the Green-genes 2013/05 database as template) and FastTree [39]. Non parametric tests. Permutational ANOVA (PERMANOVA) and ANOSIM tests were performed with 999 permutations.

Predictive classification pipeline

We compared the predictive performance of PhyloRelief with the Random Forest classifier (PhyloRelief +RF) to LEfSe + RF, MetaPhyl (without feature selection) and Random Forest used as both classifier and feature selection method (RF + RF).

Data are partitioned into a training set and a testing set (75% and 25% of the samples respectively). In order to avoid overfitting and selection bias effects, the feature selection procedure was included in the cross validation loop [40,41]. For each training set, the number of ranked features no that provides the smallest average KCCC is found by a nested 10X random subsampling CV. Later, the features are ranked using the entire training set and the model is trained using the top ranked no features. The model is finally tested on the independent testing set and a KCCC is computed. In the case of LEfSe + RF, LEfSe was treated as feature selection method using the common p-value threshold of 0.05. For MetaPhyl, no feature selection was performed and the nested CV was used to find the optimal model parameters (parameters grid: 2 = {100000, 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001} and w = {0, 0.2, 0.4, 0.6, 0.8, 1}). For the Random Forest classifier, the number of trees was set to 500 and the weights were computed as in [42]. In PhyloRe-lief OTUs were ranked using the weights computed on the related clades. The pipeline was developed in Python using the scikit-learn module (http://scikit-learn. org).

Supporting Information

The Bottom clades (Clade a and Clade b) have high weight since they contain OTUs only from the “blue” and “red” class of samples, respectively. The higher weights of the branches in clade b take into account the more even distribution of the “blue” class of samples. The weights propagate up from the terminal branches until the two clades merge. In the parent branch of Clade a and Clade b the unbalance between the two samples is diluted, and consequently the weight decreases. (TIF)

While the first component is correlated to geography (see Fig. 2) the second is related to the age of the subjects. (TIF)

Colors indicate the age of the subjects. Above two years (red); below two years (blue). (TIF)

The differentiation is maximum for two years of age, and there is no difference between the

Colors indicate the age of the subjects. Above two years (red); below two years (blue). (TIF) S6 Fig. PCoA of the weighted UniFrac distances between non-Western (Malawi, Burkina Faso and Venezuela) individuals using only the OT Us included in the 30 most relevant clades. Colors indicate the age of the subjects. Above two years (red); below two years (blue). (TIF)

Permutational ANOVA and ANOSIM tests on the effect of age. The sample has been partitioned into two as function of an age threshold and the Permutational ANOVA and ANOSIM tests have been computed using the weighted UniFrac distance. (DOCX) 82 Table. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the UniFrac distance between young (below two years of age) and older (above two years of age) individuals. (DOCX)

Acknowledgments

The authors would like to thank S. Riccadonna, P. Fontana, P. Franceschi and V. Goremykin for useful discussions.

Author Contributions

Performed the experiments: DA CDF CD. Analyzed the data: DA CD. Wrote the paper: DA CDF DC CD.

Topics

phylogenetic

Appears in 19 sentences as: Phylogenetic (1) phylogenetic (22)
In Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting
  1. The phylogenetic clades are weighted and ranked according to their abundance measuring their contribution to the differentiation of the classes of samples, and a criterion is provided to define a reduced set of most relevant clades.
    Page 1, “Abstract”
  2. Applying the method to public datasets, we show that the data-driven definition of relevant phylogenetic clades accomplished by our ranking strategy identifies features in the samples that are lost if phylogenetic relationships are not considered, improving our ability to mine metagenomic datasets.
    Page 1, “Abstract”
  3. Comparison with supervised classification methods currently used in metagenomic data analysis highlights the advantages of using phylogenetic information.
    Page 1, “Abstract”
  4. Here we present PhyloRelief, a novel feature-ranking algorithm that fills this gap by integrating the phylogenetic relationships amongst the taxa into a statistical feature weighting procedure.
    Page 2, “Author Summary”
  5. Given that the sequences of marker genes are available, phylogenetic measures of diversity such as UniFrac [19,20] have proven to be able to identify subtle differences in the structures of microbial communities by weighting species abundances with the phylogenetic relationships amongst taxa.
    Page 3, “Introduction”
  6. PhyloRelief is an algorithm that introduces the Relief [21,22] strategy of feature weighting in a phylogenetic context to identify those OTUs or groups of OTUs that are responsible for the differentiation between classes of samples (Le.
    Page 3, “Results”
  7. The process requires that the samples are unambiguously classified into cases and controls according to the description provided by the study design, and that a phylogenetic tree of the OTUs has been obtained by molecular phylogenetic analysis.
    Page 3, “Results”
  8. the fraction of the phylogenetic tree from which descend only OTUs belonging to one of the classes; b) a weighted update function, in which each branch of the tree is weighted by a quantity proportional to its unbalance between the classes, Le.
    Page 4, “Results”
  9. 2B we show a phylogenetic tree of the OTUs present in the samples, with those included in the 30 most relevant clades identified by PhyloRelief highlighted (in red OTUs more prevalent in Malawi, Burkina Faso and Venezuela, in green OTUs more prevalent in the USA and Italy).
    Page 5, “Applications”
  10. Identifying a ranking strategy to reduce the dimensionality of the dataset can improve the effectiveness of classification algorithms in metagenomic datasets, where correlations between the variables are introduced both by the phylogenetic relationships between the clades and by the fact that relative abundances are measured.
    Page 9, “Predictivity of the ranked features in supervised classification problems”
  11. PhyloRelief is an algorithm that resolves the problem of relevant taxa identification by applying the Relief strategy of feature ranking in a phylogenetic context.
    Page 11, “Discussion”

See all papers in March 2015 that mention phylogenetic.

See all papers in PLOS Comp. Biol. that mention phylogenetic.

Back to top.

ANOVA

Appears in 8 sentences as: ANOVA (8)
In Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting
  1. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between Western (USA and Italy) and non-Western (Ma-show in Fig.
    Page 7, “Applications”
  2. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between young (below two years of age) and older (above two years of age) Western individuals (USA and Italy).
    Page 8, “Applications”
  3. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between young (below two years of age) and older
    Page 9, “Applications”
  4. Permutational ANOVA (PERMANOVA) and ANOSIM tests were performed with 999 permutations.
    Page 15, “HQ”
  5. R2 of the Permutational ANOVA obtained by partitioning the individuals into two samples using an increasing age threshold, and the pairwise weighted UniFrac distance.
    Page 16, “Supporting Information”
  6. Permutational ANOVA and ANOSIM tests on the effect of age.
    Page 16, “Supporting Information”
  7. The sample has been partitioned into two as function of an age threshold and the Permutational ANOVA and
    Page 16, “Supporting Information”
  8. Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the UniFrac distance between young (below two years of age) and older (above two years of age) individuals.
    Page 16, “Supporting Information”

See all papers in March 2015 that mention ANOVA.

See all papers in PLOS Comp. Biol. that mention ANOVA.

Back to top.

feature selection

Appears in 7 sentences as: feature selection (8)
In Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting
  1. Comparison of the performances of the method to popular feature selection and classification algorithms shows that or strategy is effective in identifying microbial clades associated to the different sample groups, providing a novel analysis method for targeted metagenomic datasets.
    Page 3, “Introduction”
  2. The Random Forest (RF) classifier was recently proven to be the most effective in this class of problems [26,27] , both for feature selection and classification.
    Page 9, “Predictivity of the ranked features in supervised classification problems”
  3. We compared the performance of PhyloRelief coupled With the RF classifier to LEfSe [30], an algorithm that uses statistical tests for biomarker discovery, to MetaPhyl, a recent phylogeny-based method for the classification of microbial communities [31] and to Random Forest, used both as classifier and feature selection method.
    Page 10, “Predictivity of the ranked features in supervised classification problems”
  4. We compared the predictive performance of PhyloRelief with the Random Forest classifier (PhyloRelief +RF) to LEfSe + RF, MetaPhyl (without feature selection) and Random Forest used as both classifier and feature selection method (RF + RF).
    Page 15, “Predictive classification pipeline”
  5. In order to avoid overfitting and selection bias effects, the feature selection procedure was included in the cross validation loop [40,41].
    Page 15, “Predictive classification pipeline”
  6. In the case of LEfSe + RF, LEfSe was treated as feature selection method using the common p-value threshold of 0.05.
    Page 15, “Predictive classification pipeline”
  7. For MetaPhyl, no feature selection was performed and the nested CV was used to find the optimal model parameters (parameters grid: 2 = {100000, 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001} and w = {0, 0.2, 0.4, 0.6, 0.8, 1}).
    Page 15, “Predictive classification pipeline”

See all papers in March 2015 that mention feature selection.

See all papers in PLOS Comp. Biol. that mention feature selection.

Back to top.

Sequencing technologies

Appears in 3 sentences as: Sequencing technologies (1) sequencing technologies (1) sequencing technology (1)
In Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting
  1. In metagenomics, the composition of complex microbial communities is characterized using Next Generation Sequencing technologies .
    Page 1, “Author Summary”
  2. To become a viable diagnostic and therapeutic tool, the evolution of sequencing technologies needs to be paralleled by progress in computational tools enabling to significantly correlate phenotypes to the smallest possible number of microbial taxa.
    Page 11, “Discussion”
  3. The algorithm is general and does not rely on any specific sequencing technology , as long as a phylogenetic tree of the OTUs and the distribution of the OTUs in the different samples are available.
    Page 12, “Discussion”

See all papers in March 2015 that mention Sequencing technologies.

See all papers in PLOS Comp. Biol. that mention Sequencing technologies.

Back to top.

training set

Appears in 3 sentences as: training set (3)
In Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting
  1. Data are partitioned into a training set and a testing set (75% and 25% of the samples respectively).
    Page 15, “Predictive classification pipeline”
  2. For each training set , the number of ranked features no that provides the smallest average KCCC is found by a nested 10X random subsampling CV.
    Page 15, “Predictive classification pipeline”
  3. Later, the features are ranked using the entire training set and the model is trained using the top ranked no features.
    Page 15, “Predictive classification pipeline”

See all papers in March 2015 that mention training set.

See all papers in PLOS Comp. Biol. that mention training set.

Back to top.