Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery

To lower cost and reduce the time to obtain promising peptides, machine learning approaches can greatly assist in the process and even partly replace expensive laboratory experiments by learning a predictor with existing data or with a smaller amount of data generation. Unfortunately, once the model is learned, selecting peptides having the greatest predicted bioactivity often requires a prohibitive amount of computational time. For this combinatorial problem, heuristics and stochastic optimization methods are not guaranteed to find adequate solutions. We focused on recent advances in kernel methods and machine learning to learn a predictive model with proven success. For this type of model, we propose an efficient algorithm based on graph theory, that is guaranteed to find the peptides for which the model predicts maximal bioactivity. We also present a second algorithm capable of sorting the peptides of maximal bioactivity. Extensive analyses demonstrate how these algorithms can be part of an iterative combinatorial chemistry procedure to speed up the discovery and the validation of peptide leads. Moreover, the proposed approach does not require the use of known ligands for the target protein since it can leverage recent multi-target machine learning predictors where ligands for similar targets can serve as initial training data. Finally, we validated the proposed approach in vitro with the discovery of new cationic antimicrobial peptides. Source code freely available at http://graal.ift.ulaval.ca/peptide-design/.

Hence, it makes sense to automate this chemical exploration endeavor in a Wise, informed, and efficient fashion. Here, we focused on peptides as they have properties that make them excellent drug starting points. Machine learning techniques may replace expensive in-vitro laboratory experiments by learning an accurate model of it. However, computational models also suffer from the combinatorial explosion due to the enormous chemical diversity. Indeed, applying the model to every peptides would take an astronomical amount of computer time. Therefore, given a model, is it possible to determine, using reasonable computational time, the peptide that has the best properties and chance for success? This exact question is What motivated our work. We focused on recent advances in kernel methods and machine learning to learn a model that already had excellent results. We demonstrate that this class of model has mathematical properties that makes it possible to rapidly identify and sort the best peptides. Finally, in-vitro and in-silico results are provided to support and validate this theoretical discovery.

To avoid side effects, a valuable drug precursor must have high affinity with the target protein while minimizing interactions with other proteins. Unfortunately, only a few have such properties and these have to be identified from an astronomical number of candidate compounds. Other factors, such as bioavailability and stability have to be considered; but this combinatorial search problem, by itself, is very challenging [1].

To fully exploit the great conformational and functional diversity, combinatorial peptide chemistry is certainly a powerful tool [2—4]. A major advantage of using combinatorial peptide libraries over classic combinatorial libraries, where the scaffold is fixed, is the possibility of generating enormous conformational and functional diversity using a randomized synthesis procedure. This chemical diversity and functionality can be further enhanced by the inclusion of nonnatural amino acids [5]. Furthermore, having a peptide scaffold can be very informative to screen for similarities in peptidomimetic libraries [6]. For these reasons, this work will focus on using peptides as drug precursors.

For example, 2g of a one-bead one-compound (OBOC) combinatorial library [7] composed of randomly-generated peptides of nine residues will generate a maximum of six million compounds, representing a vanishingly small fraction (less than 0.0016%) of the set of all 209 pep-tides. Consequently, it is almost certain that the best peptides will not be present and most synthesized peptides will have low bioactivity. Hence, drug discovery is a combinatorial problem which, unfortunately, cannot be solved using combinatorial chemistry alone. The process of discovering novel compounds with both high bioactivity and low toxicity must therefore be optimized.

These algorithms are extremely effective at providing accurate models for a wide range of biological and chemical problems: anticancer activity of small molecules [9] , protein-ligand interactions [10] and protein-protein interactions [11]. The inclusion of similarity functions, known as kernels [8] , provides a novel way to find patterns in biological and chemical data. By incorporating valuable biological and chemical knowledge, kernels provide an efficient way to improve the accuracy of learning algorithms. This work explores the use of learning algorithms to design and enhance the pharmaceutical properties of compounds [12, 13]. By starting with a training set containing approximately 100 peptides with their corresponding validated bioactivity (binding affinity, IC50, etc), we expect that a state-of-the-art kernel method will give a bioactivity model which is sufficiently accurate to find new peptides with activities higher than the 100 used to learn the model. This is possible because each peptide that possesses a small binding affinity contains information about subsequences of residues that can bind to the target. Learning a model can accelerate, but not solve, this costly process. I n-silico predictions are faster and cheaper than in-vitro assays, however, predicting the bioactivity of all possible peptide to select the most bioactive ones would require a prohibitive amount of computational time. Indeed, this transforms the combinatorial drug discovery problem into an equally hard computational task.

This algorithm makes use of graph theory and recent work [14] on the prediction of the bioactivity and the binding affinity between peptides and a target protein. This algorithm can be part of an iterative combinatorial chemistry procedure that could speed up the discovery and the validation of peptide leads. Moreover, the proposed approach can be employed without known ligands for the target protein because it can leverage recent multi-target machine learning predictors [10, 14] where ligands for similar targets can serve as an initial training set. Finally, we demonstrate the effectiveness and validate our approach in vitro by providing an example of how antimicrobial peptides with proven activity were designed.

In our context, strings are sequences of amino acids. Such kernels have been widely used in applications of machine learning to biology. For example, the local-alignment kernel [15] , closely related to the well-known Smith-Waterman alignment algorithm, was used for protein homology detection. It was however observed that kernels for large molecules such as proteins were not suitable for smaller amino acid sequences such as peptides [14]. Indeed, the idea of gaps in the local-alignment kernel or in the Smith-Waterman algorithm is well suited for protein homology, but a gap of only a few amino acids in a peptide would have important consequences on its ability to bind with a target protein. Many recently proposed string kernels have emerged from the original idea of the spectrum kernel [16] where each string is represented by the set of all its constituent k-mers. For example, the string PALI can be represented by the set of 2-mers {PA, AL, LI}. As defined by the k-spectrum kernel, the similarity score between two strings is simply the number of k-mers that they have in common. For example, the 2-spectrum similarity between PALI and LIPAT is 2, because they have two 2-mers in common (PA and LI).

First, two k-mers should only contribute to the similarity if they are in similar positions in the two peptides [17]. Second, the two k-mers should share common physicochemical properties [18] . Meinicke and colleagues [17] proposed to weight the contribution of identical k-mers with a term that decays exponentially with the distance between their positions. If i and j denote the positions of the k-mers in their respective strings, the contribution to the similarity is given by where 01, is a parameter that controls the length of the decay.

This was motivated by the fact that amino acids with similar physicochemical properties can be substituted in a peptide while maintaining the binding characteristics. To capture the physicochemical properties of amino acids, they proposed to use an encoding function 1p : A —> Rd where 1/1(a) = (1/11(a), 1/12(a), . . ., 1/1d(a)), to map every amino acid a E A to a vector where each component fill(a) encodes one of the d properties of amino acid by concatenating k physicochemical property vectors, each having d components. Throughout this study, the BLOSUM62 matrix was used in such a way that 1/1(a) is the line associated to the amino acid a in the matrix. It is now possible to weight the contribution of any two k-mers a1, . . . , ak and a’l, . . . ,a’k according to their properties: Where - denotes the Euclidean distance. More recently, the Generic String (GS) kernel was proposed for small biological sequences and pseudo-sequences of binding interfaces [14]. The GS kernel similarity between an arbitrary pair (x, x’) of biological sequences is defined to be

. .up to their k-mers, with the position penalizing term of Equation (1) and the physicochemical contribution term of Equation (3). The hyper-parameters k, 0P, 0C are chosen by cross-validation. This GS kernel is very versatile since, depending on the chosen hyper-parameters, it can be specialized to eight known kernels [14]: the Hamming kernel, the Dirac delta, the Blended Spectrum [8], the Radial Basis Function (RBF), the Blended Spectrum RBF [18], the Oligo [17], the Weighted degree [19] , and the Weighted degree RBF [18]. It thus follows that the proposed method, based on the GS kernel, is also valid for all of these kernels.

The GS kernel has also outperformed current state-of-the-art methods for predicting peptide-protein binding affinities on single-target and pan-specific Major Histocompatibility Complex (MHC) class II benchmark datasets and three Quantitative Structure Affinity Model benchmark data-sets. The GS kernel was also part of a method that won the 2012 Machine Learning Competition in Immunology [20]. External validation showed that the SVM classifier with the GS kernel was the overall best method to identify, given unpublished experimental data, new pep-tides naturally processed by the MHC Class I pathway. The proven effectiveness of this kernel made it ideal to tackle the present problem.

In the regression setting, the learning task is to predict a real value that quantifies the quality of a peptide, for example, its bioactiVity, inhibitory concentration, binding affinity, or bioavailability. In contrast to classification and regression, the task we consider here (described in the next section) is ultimately to predict a string of amino acids.

In classification, e 6 {+1, —1} denotes whether (X, y) has the desired property or not. Since predicting real values is strictly more general than predicting binary values, we focused on the more general case of real-valued predictors. Those learning examples are obtained from in vitro or in vivo experiments. The learning task is therefore to infer the value of e given new examples (x, y) that would not have been tested through experiments.

In our setting, the output h(x, y) is a real number that estimates the “true” bioactivity e between x and y. Such a predictor is said to be multi-target since its output depends on the ligand x and the target y. A multi-target predictor is generally obtained by learning from numerous peptides, binding to various proteins, for example, a protein family. For this reason, it can predict the bioactivity of any peptide with any protein of the family even if some proteins are not present in the training data [10, 14].

A target-specific predictor is obtained by learning only from peptides binding to a specific protein or from a multi-target predictor [10, 14]. For simplicity, we will focus on target-specific predictor but let us demonstrate how a tar-get-specific predictor is obtained from a multi-target one. proteins and peptides, and aq is the weight on the q-th training example. Since we use the GS kernel for k X, we obtain the target-specific predictor with peptides binding to y, we simply use fiq(y) = aq. The remainder of this manuscript will focus on target-specific predictor in the form of Equation 6. This makes the proposed solution compatible for both target-specific and multi-target predictors. Also, since the weights on eXamples are given by My), we will see that the approach is valid regardless of the choice of kernel for the target protein.

Note that all these learning methods require both kernels to be symmetric and positive semidefinite. This is the case for the GS kernel. The proposed solution for drug design is thus compatible with these popular bioinformatics learning algorithms [21]. However, some machine learning methods such as neural networks and its derivatives (deep neural networks) are not compatible with the proposed methodology. For the sake of comparison, we would like to highlight that when fiq(y) = 1/ m, k = 1, UP = O, and ac = 0 the predictor hy(x) in Equation (6) reduces to predict the probability of sequence X given the position-specific weight matriX (PSWM) obtained from the training set. Since fiq(y), k, 01,, and ac can be arbitrary, the class of predictors we consider here is much more general.

Since a PSWM assumes statistical independence between positions in the pattern, the probability that a sequence belongs to a certain pattern is given by summing the corresponding entries in M. PSWM are simple but have, however, been surpassed by modern machine learning algorithms [22, 23] since they assume independence between positions in the pattern. Moreover, they do not take into account the quantified bioactivity nor the similarities between amino acids. In addition, they require peptides to be aligned or of the same length. The method we present here have none of these serious limitations by allowing more sophisticated predictors to be learned.

It is true that replacing or reducing the number of expensive laboratory experiments by an in silico prediction will reduce costs. However, peptides having a low bioactivity do not qualify as drug precursors. Instead, we should focus on identifying the most bioactive ones. The computational problem is thus to identify and sort peptides according to a specific biological function. Let A be the set of all amino acids, and Al be the set of all possible peptides of length I. Then, finding the peptide x* 6 Al that, according to hy, has the maximal bioactivity with y, amounts at solving

This is the case since string kernel use k-mers to compare sequences. For that reason, each amino acid of the peptide cannot be optimized independently, but globally. Moreover, since the number of possible peptides grows exponentially with l (the length of the peptide), a brute force algorithm plexity for computing hy(x), the output of the predictor on peptide a X. Such an algorithm becomes impractical for any peptide exceeding 6 amino acids.

However, these methods often require prohibitive CPU time and are not guaranteed to find the optimal solution. In addition, these approaches are not capable of sorting the best solutions since they are designed to find a single maximum.

We also present a second algorithm capable of sorting in decreasing order the peptides maximizing Equation (7). Both algorithms have low asymptotic computational complexity, yielding tractable applications for the design and screening of peptides.

Here, we assume that we have, for a fixed target y, a prediction function hy(x) given by Equation (6). In this case, we show how the problem of finding, the peptide X; 6 Al of maximal bioactivity reduces to the problem of finding the longest path in a directed acyclic graph (DAG). Note that, throughout this manuscript, we will assume that the length of a path is given by the sum of the weights on its edges. To solve this problem, we construct a DAG with a source and a sink vertex such that for all possible peptides x 6 Al, there exists only one path associated to X that goes from the source to the sink. Moreover, the length of the path associated to X is exactly hy(x). Thus, if the size of the constructed graph is polynomial in 1, any algorithm that efficiently solves the longest path problem in a DAG will also efficiently find the peptide of maXimal bioactivity. A simplification of the graph is shown in Fig. 1 to assist in the comprehension of the formal definition that follows. A directed bipartite graph is a graph whose vertices can be divided into two disjoint sets such that every directed edge connects a verteX of the first set to the second set. The construction of the graph will proceed as follows.

Let U1. dzefAk X {i}, in other words, the set Ul contains all tuples (s, i) where sis a k-mer and i an integer. Let Gz- = ((Ui, Um), E) be the i-th directed bipartite graph of some set where the set of directed edges Ez- is defined as follows. Similarly as in the de Bruijn graph, there is a directed edge ((5, i), (s’, i+1)) from (s, i) in U,- to (s’, i+1) in Ui+1 if and only if the last k — 1 amino acids of s are the same as the first k — 1 amino acids of s’. For example, in the graph of Fig. 1, there is an edge from (ABA, 1) to (BAA, 2) with k = 3. Note that Vi E N, directed edges in Gz- only go from vertices in U,- to vertices in UM. There are exactly |A| edges that leave each vertex in U,- and there are exactly IAI edges that point to each vertex in U141. Moreover, for any chosen integer k, |Ui| = |Ui+1| = IAkI and 2 IA“1 Note that there is a one-to-one correspondence between a sequence in Ak+1 and a single edge path from a vertex in Ul to a vertex in UM. We define a n-partite graph as the union of n — 1 bipartite graphs:

There is a directed edge from A to all nodes of U1 and from all nodes of U, to t. For example, the graph illustrated in Fig. 1 is a 3-partite graph with a source and a sink node when the k-mer are of size 3 and the alphabet has two letters: A and B. Throughout this manuscript, we will only focus on paths starting at A, the source node, and ending at t, the sink node. For this reason, by choosing n = l — k + 1 we obtain the one-to-one Let us now describe how edges in G]1y are weighted in order for the length of a path associated to X to be exactly hy(x), the predicted bioactiVity of x. Using the definition of the GS kernel, given at Equation (4), and the general class of predictors, given by Equation (6), we can rewrite Therefore, every path from the source to the sink in Gqu represents a unique peptide x E A1 and its estimated bioactivity hy(x) is given by the length of the path. The problem of finding the peptide of highest predicted activity thus reduces to the problem of finding the longest path in Ghy. Despite being NP-hard in the general case, the longest path problem can be solved by dynamic programming in (9(l V(Ghy)| —|— |E(Ghy) for a directed acyclic graph, given a topological ordering of its vertices. By construction, Ghy is clearly acyclic and its vertices can always be topologically ordered by visiting them in the following order: 2», U1, has a complexity of O(n|A|k+1). Recall that k is a constant and l is the length of the best peptide we are trying to identify. Thus, 11 must be equal to l —k + 1.

The dynamic programming algorithm proposed for the computation of the GS kernel [14] can easily be adapted to efficiently evaluate this equation. In that case, the complexity of the weight function is reduced to (9(m - l - k).

Long k-mers will thus have negligible influence on the estimated bioactivity, explaining why small values of k g 6 << I are empirically chosen by cross-validation. Therefore, the time complexity of the proposed algorithm is orders of magnitude lower than the brute force algorithm which is in 0(IAIZ) since k g 6 << I in practice. The pseudo-code to find the longest path in Ghy is given in Box 1. Box 1. Algorithm for finding the longest path between the source node A and the sink node t in G”y end for

By using the same arguments, finding the peptide with the second greatest predicted activity reduces to the problem of finding the second longest path in Ghy. By induction, it follows that the problem of finding the K peptides of maximal predicted activity reduces to the problem of finding the K—longest paths in GhV. The closely-related K—shortest paths problem has been studied since 1957 and attracted considerable attention following the work of

Yen’s algorithm was later improved by Lawler [27]. Both algorithms make use of a shortest path algorithms to solve the K—shortest paths problem. By exploiting some restrictive properties of Ghy, Yen’s algorithm for the K—shortest paths was adapted, shown in Box 2, to find the K—longest paths in Ghy. It uses a variant of the longest path algorithm presented in the previous section, that allows a path to start from any node of the graph. Lawler improvement to the algorithm is not part of the presented algorithm to avoid unnecessary confusion but is part of the implementation we provide. The time complexity of the resulting algorithm is competitive with the latest work on K—shortest paths algorithms [28, 29]. The algorithm of Box 2 was implemented in a combination of both C and Python, the source code is freely available at http://graal.ift.ulaval.ca/peptide-design/. To validate the implementation and prevent potential flaws, it was successfully used to exhaustively sort all possible peptides of length 1 to 5 with various values of k, 01,, and ac.

Indeed, the best peptide candidates can be synthesized by an automated peptide synthesizer and tested in vitro. Such a procedure will allow rapid in vitro feedback and minimize turnaround time.

Also, in the next section, we Will describe how the K best predicted peptides can be utilized to predict a binding motif for a neW, unstudied protein. Such a motif should assist researchers in the early study of a target and for the design of peptidomimetic compounds by providing residue preferences.

In this case, the motif is a property of the learned model hy(x) as opposed to a consensus among known binding sequences. When hy(x) is obtained from a multi-target model h(x, y), it is then possible to predict affinities for proteins with no known ligand by exploiting similarities with related proteins. It is therefore feasible to predict a binding motif for a target with no known binders. To our knowledge, this has never been realized

It has been used for the discovery of ligands for receptors [30, 31], for proteins [32—35] and for transcription factors [36, 37]. To synthesize several pep-tides of length I using the 20 natural amino acids, the standard approach is to use one reactor per natural amino acid and a pooling reactor. At every step of the experiment, all reactors are pooled into the pooling reactor Which is then split, in equal proportions, back into the 20 amino acid reactors. Within this standard approach, each peptide in Al has an equal probability of being synthesized. Since the number of polystyrene beads (used to anchor every peptide) is generally orders of magnitude smaller than |A|l, only a vanishingly small fraction of the pep-tides in Al can be synthesized in each combinatorial experiment.

More restrictive protocols have been proposed to increase the hit ratio of this combinatorial experiment. For example, one could f1x certain amino acids at specific positions or limit the set of possible amino acids at this position (for example, only use hydrophobic amino acids). Such practice Will impact the outcome of the combinatorial experiment. One can probably increase the hit ratio by modifying (Wisely) the proportion of amino acids that can be found at different positions in the peptides. To explore more thoroughly this possibility, let us define a (combinatorial chemistry) protocol P by a l-tuple containing, for each position i in the peptide of length I, an independent distribution 2(a) over the 20 amino acids 61 E .4. Hence, we define a protocol P by

. . X 731. Hence, the probability of synthesizing a peptide X of size I is given

Moreover, this family of protocols is easy to implement in the laboratory since, at each step i, it only requires splitting the content of the pooling reactor in proportions equal to the distribution 73,. over amino acids. For example, if at position i, we wish to sample uniformly over each amino acid, then we will use 73,. (a) = 1/20 for all a E A. If on the other hand, we

We present a method for efficiently computing exact statistics on the screening outcome of a peptide library synthesized according to a protocol P. Specifically, we present an algorithm to compute the average predicted bioactivity and its variance over all peptides that a protocol can synthesize. Note that it is intractable to compute these statistics by predicting the activity of each peptide.

Furthermore, we will demonstrate in the next section that the computation of these statistics can be part of an iterative procedure to accelerate the discovery of bioactive peptides. Indeed, having the average predicted bioactivity data will help with the design of a protocol that synthesizes as many potential active candidates as possible. In addition, the predicted bioactivity variance will allow for better control of the exploration/ exploitation trade off of the experiment. Finally, as described in the previous section, a widely used practice for optimizing peptides is to assign residues at certain positions or restrict them to those that have specific properties such as charge or hydrophobicity. It is now possible to quantify how such procedure will impact the bioactivity of combinatorially synthesized peptides.

This allows for the efficient computation of the first and second moment of 11,, when peptides are drawn according to the distribution P. Then, the average and variance can easily be obtained from the first two moments. Details of the approach and the algorithm are given in supplementary material (see 81 Text).

The procedure is illustrated in Fig. 2. First, an initial set of random peptides is synthesized, typically using a split and pool approach. The peptides are assayed in laboratory to measure their bioactivities. At this point, most peptides are poor candidates. They are then used as a training set to produce a predictor hy. Next, 11,, is used for the generation of K bioactive peptides by finding the K—longest paths in Gqu as described previously. A protocol P is constructed from these K bioactive peptides to assist the next round of combinatorial chemistry. Then, the algorithm described in the previous section is used to predict statistics on the protocol P. This ensures that the protocol meets expectations in terms of quality (average predicted bioactivity) and diversity (predicted bioactivity variance). To lower costs, one should proceed to synthesize and test the library only if expectations are met. This process can be repeated until the desired bioactivity is achieved.

The first dataset consisted of 101 cationic antimicrobial pentadecapeptides (CAMPs) from the synthetic antibiotic peptides database [38]. Peptide antibacterial actiVities are eXpressed as the logarithm of bactericidal potency Which is the average potency over 24 bacteria such as Escherichia coli, Bacteroi’desfra-gilis, and Staphylococcus aureus. The average antibacterial activity of the CAMPs dataset was 0.39 and the best peptide had an activity of 0.824.

The bioactivities are eXpressed as the logarithm of the relative activity indeX compared to the peptide VESSK. The average bioactivity of the BPPs dataset was 0.71 and the best peptide had an activity of 2.73.

For both experiments, a predictor of biological activity was learned by kernel ridge regression (KRR) for the each data-sets: hCAMp and thp. Hyper-parameters for the GS kernel (k, ac, Up) and the kernel ridge regression (l) were chosen by standard cross-validation: k = 2, ac = 6.4, Up = 0.8, and l = 6.4 for hCAMP and k = 3, ac = 0.8, Up = 0.2, and)» = 0.4 for thP. In silica validation Using the K—longest path algorithm and the learned predictors, we generated the K peptides (of the same length as those of the training data) having the greatest predicted biological activity.

The antimicrobial activity of the top 100,000 peptides are showed in Fig. 3. We observe a smooth power law with only a few peptides having outstanding biological activity, as expected. As we will see in the next section, peptides at the top of the curve, hence having the best bioactivities, are very unlikely to be found by chance.

However, the predicted activity of IEWAK is much better than the average peptide activity of the dataset, which is 0.71. One may ask why IEWAK has a lower predicted biological activity than VEWAK, which was part of the training data. It is common for machine learning algorithms to sacrifice accuracy on the training data to prevent overfitting. Despite this small discrepancy, the model is very accurate on the training data (correlation coefficient of 0.97). Another possible explanation for this discrepancy is that the biological activity of VEWAK could be slightly erroneous as the learning algorithm could not find a simple model given such an outlier. It seems that the predicted activity of VEWAK is more coherent with the whole data than its measured activity. Hence, our proposed learning algorithm predicts new peptides having biological activities equivalent to the best of the training set and, in some cases, substantially improved activities. The next section present an in vitro experiment that clearly demonstrate that in a real world test, our approach can generate bioactive peptides.

Antibacterial activity Figure 3. The 100,000 peptides with highest antimicrobial activity found by the K-longest path algorithm. In vitro validation

Their antimicrobial activity against Escherichia coli and Staphylococcus aureus were measured in a growth inhibitory assay. Details on the synthesis and assay are given in the supplementary material (see 81 Text). The peptides were obtained using hCAMp, the same predictor used during the previous validation.

We also synthesized one peptide with poor activity (Peptides #7) as a control. We used the proposed approach with the predictor hCAMP to generate a list of K = 1,000 peptide candidates with the highest predicted activity. From this list, we greedily selected three peptides such that they all differed by at least 4 amino acids from each others. This was done to maximize the chemical diversity among them. We then tested these peptides (Peptides #2, #3, #4) in a growth inhibitory assay. Results from the minimal inhibitory concentration assay are shown in Table 2. Two of the three candidates had activities equal to the best peptide of the CAMPs dataset. We were intrigued by the failure of Peptide #4 and after investigation, the weak activity was due to poor water solubility. In a second series, we ensured that a filter for water solubility was employed. In this second series of tests, Peptide #1 showed (at least against E. coli) better activity than any of the original candidates from the CAMPs dataset, demonstrating that, in this limited biological experiment, we could improve the putative candidates using the proposed machine learning methodology. Finally, all predicted antimicrobial peptides are significantly different from those of the training set, sharing only 40% similarity with their most similar peptide in the CAMPs dataset.

However, before conducting such an expensive and time-consuming experiment, it is reasonable to first investigate, in silico, if the proposed methodology could find peptides having high bioactivity.

We choose to use hCAMP and thp as oracle as they represent, so far, the best understanding of the studied phenomena. These oracles will be used to quantify the bioactivity level of randomly generated pep-tides and those proposed by our methodology. Note that, examples used to learn the oracles are not available to our algorithm during the validation. Consequently, the validation method used was the following. 1. We randomly generated R peptides on a computer instead of using combinatorial chemistry. 2. To measure the bioactivities, we replaced the laboratory experiments by the oracle. 3. We used these random peptides of 10W bioactivities to learn a second predictor hmndom.

The predictor hmndom is used to initiate the graph-based approach. We then obtained the K potentially best peptides. 5. The new peptides bioactivities are validated by the oracle (instead of performing laboratory experiments). 6. Finally, we compared the bioactiVities of the initial set of peptides (randomly generated) and those proposed by our approach.

Once by generating R = 100 peptides at Step 1 and considering the K = 100 best predicted peptides at Step 4 of the methodology, and then by starting over the validation with R = 1,000 and K = 1,000. Statistics on the random peptides and those proposed by our approach are shown in Table 1.

Also, on both datasets, increasing R, the number of random peptides, had no significant influence on the bioactivity of the best peptide found. This support the main hypothesis upon which this work is based, random peptides will consistently be of low activity. This also indicates that combinatorial chemistry alone does not allow one to find the best peptides. It requires hints to orient its search. The next paragraph points out that our machine learning approach can provide such hints.

Such antimicrobial potency is similar to the best peptide of the (unseen) CAMPs dataset and much better than the best of the R = 100 random peptides. By increasing to R to 1,000, we found a peptide having a potency of 1.09 according to the oracle. This peptide surpasses the best known peptide of the CAMPs dataset and is also far superior to the best of the R = 1,000 random peptides. On the BPPs dataset, the proposed approach also considerably outperformed the random approach on both the best peptide found and the average bioactivity. Finally, on both datasets, increasing the number of initial peptides from R = 100 to R = 1,000 was more beneficial on the bioactivity measures than the random approach.

The Pearson correlation coefficient (PCC, also known as the Pearson’s r) was computed between hmndam predictions and the values in both databases. Since, in this simulation, hmndom was learned only with random peptides that, as pointed out above, have low bioactivity, it is interesting to evaluate its accuracy on these databases.

When initiated with R = 1,000 random peptides, it achieves a correlation coefficient of 0.90 (CAMPs) and 0.93 (BPP). In comparison, the oracle achieved a correlation coefficient of 0.91 (CAMPs) and 0.97 (BPP) on the same peptides. These were however used to train the oracle. Given that hmndam is bound to be less accurate than the oracle, these results demonstrate the capability of our approach to learn a predictor using low bioactivity peptides to obtain highly active ones.

4 shows the correlation coefficient of hmndom on the CAMPs data When varying R, the number of random peptides used for training. Near optimal accuracy is reached When hmndom is initiated With approximately R = 300 peptides. This suggests that the proposed method can achieve excellent performance With a database of modest size. Figure 4. Correlation coefficient of hmndom predictions on the CAMPs data while varying R, the number of random peptides used as training set.

The results presented here serve to demonstrate the ability of the proposed approach to predict potential functional motifs and to compare to position-specific weight matrix (PSWM) as they can be illustrated as a motif.

Using the oracle, we predicted the best K = 1,000 peptides and generated a bioactivity motif using these candidates (top panel of Fig. 5). Our goal was to assess how much of that reference motif we could rediscover if we were to hide all the CAMPs dataset during the validation.

The motif is shown in middle panel of Fig. 5. We were able to recover all the reference motif signal using only weakly active peptides and hmndom. To push the analysis even further, we also computed the motif when hmndam is trained with only R = 100 random peptides. Even then (motif not shown), for 12 of the 15 residue positions, we were able to correctly identify the dominant amino acid property (polar, neutral, basic, acidic, hydrophobic). This can be achieved since the GS kernel encodes amino acids physicochemical properties.

For example, one could learn a multi-target predictor for peptides binding to the major histocompatibility compleX [14]. Since these molecules are highly polymorphic, it would be interesting to predict antigen binding motifs for a specific segment of a population or even a single patient. This would have applications in the design of epitope based vaccines [40] and provide additional insight into autoimmune diseases.

The signal in PSWM motif was very poor, generating a meaningless motif (not shown). We increased the number of random peptides to R = 1,000,000 and only selected the best K = 1,000 to produce a PSWM whose motif is shown in the bottom panel of Fig. 5. Despite this big advantage, the motif of the PSWM shows minimal information. This clearly illustrates the potential of the proposed approach for accelerating the discovery of potential peptidic effectors and, possibly, for achieving a better understanding of the binding mechanisms of polymorphic molecules.

We proposed an efficient graph-based algorithm to predict peptides with the highest biological activity for machine learning predictors using the GS kernel. Combined with a multi-target model, it can be used to predict binding motifs for targets with no known ligands.

This allowed us to compute the expected predicted bioactivity and its variance that can be exploited in combinatorial chemistry. These steps can be part of an iterative drug discovery process that will have immediate use in both the pharmaceutical industry and academia. This methodology will reduce costs and the time to obtain lead peptides as well as facilitating their optimization. Finally, the proposed approach was validated in a real world test for the discovery of new antimicrobial peptides. These in vitro experiments confirmed the effectiveness of the new peptides uncovered.

However, in such libraries, it is unclear how we should prioritize high activity candidates (average) over the chemical diversity (variance). This exploration/ exploitation tradeoff warrants further investigation. The fast computation of the bioactivity average and variance given a combinatorial chemistry protocol will certainly help to exploit this tradeoff. Moreover, the method could easily be adapted to optimize multiple objectives simultaneously, for example, the bioactivity at the expense of mammalian cell toxicity or bioavailability when such data are available. In addition, the method could be expanded to cyclic peptides and chemical entities commonly found in clinical compounds. Finally, this method shows great promise in immunology, where antigen binding motifs for unstudied major histocompatibility complexes could be uncovered using a multi-target predictor.

The authors thank Pascal Germain for his insightful comments.

Performed the experiments: SG FL MM DT SM EB XL IC. Analyzed the data: SG FL MM DT SM EB XL IC. Contributed reagents/materials/analysis tools: SG FL MM DT SM EB XL IC. Wrote the paper: SG FL MM DT SM EB XL IC.

Appears in 110 sentences as: Peptides (2) peptides (113) peptidic (1)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- The discovery of peptides possessing high biological activity is very challenging due to the enormous diversity for which only a minority have the desired properties.Page 1, “Abstract”
- To lower cost and reduce the time to obtain promising peptides , machine learning approaches can greatly assist in the process and even partly replace expensive laboratory experiments by learning a predictor with existing data or with a smaller amount of data generation.Page 1, “Abstract”
- Unfortunately, once the model is learned, selecting peptides having the greatest predicted bioactivity often requires a prohibitive amount of computational time.Page 1, “Abstract”
- For this type of model, we propose an efficient algorithm based on graph theory, that is guaranteed to find the peptides for which the model predicts maximal bioactivity.Page 1, “Abstract”
- We also present a second algorithm capable of sorting the peptides of maximal bioactivity.Page 1, “Abstract”
- Finally, we validated the proposed approach in vitro with the discovery of new cationic antimicrobial peptides .Page 1, “Abstract”
- Here, we focused on peptides as they have properties that make them excellent drug starting points.Page 1, “Author Summary”
- Indeed, applying the model to every peptides would take an astronomical amount of computer time.Page 2, “Author Summary”
- We demonstrate that this class of model has mathematical properties that makes it possible to rapidly identify and sort the best peptides .Page 2, “Author Summary”
- For these reasons, this work will focus on using peptides as drug precursors.Page 2, “Introduction”
- However, it is important to note that combinatorial peptide chemistry cannot cover a significant part of the peptide diversity when peptides are longer than a few amino acids.Page 2, “Introduction”

See all papers in *April 2015* that mention peptides.

See all papers in *PLOS Comp. Biol.* that mention peptides.

Back to top.

Appears in 28 sentences as: amino acid (11) amino acids (23)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- This chemical diversity and functionality can be further enhanced by the inclusion of nonnatural amino acids [5].Page 2, “Introduction”
- However, it is important to note that combinatorial peptide chemistry cannot cover a significant part of the peptide diversity when peptides are longer than a few amino acids .Page 2, “Introduction”
- In our context, strings are sequences of amino acids .Page 3, “The Generic String kernel”
- It was however observed that kernels for large molecules such as proteins were not suitable for smaller amino acid sequences such as peptides [14].Page 3, “The Generic String kernel”
- Indeed, the idea of gaps in the local-alignment kernel or in the Smith-Waterman algorithm is well suited for protein homology, but a gap of only a few amino acids in a peptide would have important consequences on its ability to bind with a target protein.Page 3, “The Generic String kernel”
- Toussaint and colleagues [18] proposed to consider properties of amino acids when comparing similar k-mers.Page 4, “The Generic String kernel”
- This was motivated by the fact that amino acids with similar physicochemical properties can be substituted in a peptide while maintaining the binding characteristics.Page 4, “The Generic String kernel”
- To capture the physicochemical properties of amino acids , they proposed to use an encoding function 1p : A —> Rd where 1/1(a) = (1/11(a), 1/12(a), .Page 4, “The Generic String kernel”
- ., 1/1d(a)), to map every amino acid a E A to a vector where each component fill(a) encodes one of the d properties of amino acid by concatenating k physicochemical property vectors, each having d components.Page 4, “The Generic String kernel”
- Throughout this study, the BLOSUM62 matrix was used in such a way that 1/1(a) is the line associated to the amino acid a in the matrix.Page 4, “The Generic String kernel”
- In contrast to classification and regression, the task we consider here (described in the next section) is ultimately to predict a string of amino acids .Page 4, “The machine learning approach”

See all papers in *April 2015* that mention amino acids.

See all papers in *PLOS Comp. Biol.* that mention amino acids.

Back to top.

Appears in 17 sentences as: Machine Learning (1) Machine learning (2) machine learning (14)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- To lower cost and reduce the time to obtain promising peptides, machine learning approaches can greatly assist in the process and even partly replace expensive laboratory experiments by learning a predictor with existing data or with a smaller amount of data generation.Page 1, “Abstract”
- We focused on recent advances in kernel methods and machine learning to learn a predictive model with proven success.Page 1, “Abstract”
- Moreover, the proposed approach does not require the use of known ligands for the target protein since it can leverage recent multi-target machine learning predictors where ligands for similar targets can serve as initial training data.Page 1, “Abstract”
- Machine learning techniques may replace expensive in-vitro laboratory experiments by learning an accurate model of it.Page 1, “Author Summary”
- We focused on recent advances in kernel methods and machine learning to learn a model that already had excellent results.Page 2, “Author Summary”
- Machine learning and kernel methods [8] have the potential to help with this endeavour.Page 2, “Introduction”
- Moreover, the proposed approach can be employed without known ligands for the target protein because it can leverage recent multi-target machine learning predictors [10, 14] where ligands for similar targets can serve as an initial training set.Page 3, “Introduction”
- Such kernels have been widely used in applications of machine learning to biology.Page 3, “The Generic String kernel”
- The GS kernel was also part of a method that won the 2012 Machine Learning Competition in Immunology [20].Page 4, “The Generic String kernel”
- The machine learning approachPage 4, “The machine learning approach”
- However, some machine learning methods such as neural networks and its derivatives (deep neural networks) are not compatible with the proposed methodology.Page 5, “The machine learning approach”

See all papers in *April 2015* that mention machine learning.

See all papers in *PLOS Comp. Biol.* that mention machine learning.

Back to top.

Appears in 9 sentences as: training set (9)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- By starting with a training set containing approximately 100 peptides with their corresponding validated bioactivity (binding affinity, IC50, etc), we expect that a state-of-the-art kernel method will give a bioactivity model which is sufficiently accurate to find new peptides with activities higher than the 100 used to learn the model.Page 2, “Introduction”
- Moreover, the proposed approach can be employed without known ligands for the target protein because it can leverage recent multi-target machine learning predictors [10, 14] where ligands for similar targets can serve as an initial training set .Page 3, “Introduction”
- For the sake of comparison, we would like to highlight that when fiq(y) = 1/ m, k = 1, UP = O, and ac = 0 the predictor hy(x) in Equation (6) reduces to predict the probability of sequence X given the position-specific weight matriX (PSWM) obtained from the training set .Page 6, “The machine learning approach”
- They are then used as a training set to produce a predictor hy.Page 12, “Application in combinatorial drug discovery”
- For the CAMPS dataset, the proposed approach predicted that peptide WWKWWKRLRRLFLLV should have an antibacterial potency of 1.09, a logarithmic improvement of 0.266 over the best peptide in the training set (GWRLIKKILRVFKGL, 0.824), and a substantial improvement over the average potency of that dataset (average of 0.39).Page 14, “Improving the bioactivity of peptides”
- On the BPPs dataset, the proposed approach predicted that the pentapeptide IEWAK should have an activity of 2.195, slightly less than the best peptide of the training set (VEWAK, 2.73, predicted as 2.192).Page 14, “Improving the bioactivity of peptides”
- Hence, our proposed learning algorithm predicts new peptides having biological activities equivalent to the best of the training set and, in some cases, substantially improved activities.Page 14, “Improving the bioactivity of peptides”
- Finally, all predicted antimicrobial peptides are significantly different from those of the training set , sharing only 40% similarity with their most similar peptide in the CAMPs dataset.Page 15, “E" o \l”
- Correlation coefficient of hmndom predictions on the CAMPs data while varying R, the number of random peptides used as training set .Page 17, “Simulation of a drug discovery”

See all papers in *April 2015* that mention training set.

See all papers in *PLOS Comp. Biol.* that mention training set.

Back to top.

Appears in 7 sentences as: Correlation coefficient (1) correlation coefficient (5) Correlation coefficients (1)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- Despite this small discrepancy, the model is very accurate on the training data ( correlation coefficient of 0.97).Page 14, “Improving the bioactivity of peptides”
- The Pearson correlation coefficient (PCC, also known as the Pearson’s r) was computed between hmndam predictions and the values in both databases.Page 16, “Simulation of a drug discovery”
- Correlation coefficients are shown in the last column of Table 1.Page 16, “Simulation of a drug discovery”
- When initiated with R = 1,000 random peptides, it achieves a correlation coefficient of 0.90 (CAMPs) and 0.93 (BPP).Page 16, “Simulation of a drug discovery”
- In comparison, the oracle achieved a correlation coefficient of 0.91 (CAMPs) and 0.97 (BPP) on the same peptides.Page 16, “Simulation of a drug discovery”
- 4 shows the correlation coefficient of hmndom on the CAMPs data When varying R, the number of random peptides used for training.Page 17, “Simulation of a drug discovery”
- Correlation coefficient of hmndom predictions on the CAMPs data while varying R, the number of random peptides used as training set.Page 17, “Simulation of a drug discovery”

See all papers in *April 2015* that mention correlation coefficient.

See all papers in *PLOS Comp. Biol.* that mention correlation coefficient.

Back to top.

Appears in 7 sentences as: Drug discovery (1) drug discovery (6)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- Part of the complexity of drug discovery is the sheer chemical diversity to explore combined to all requirements a compound must meet to become a commercial drug.Page 1, “Author Summary”
- Drug discovery faces important challenges in terms of cost, complexity and the amount of time required to yield promising compounds.Page 2, “Introduction”
- Hence, drug discovery is a combinatorial problem which, unfortunately, cannot be solved using combinatorial chemistry alone.Page 2, “Introduction”
- Indeed, this transforms the combinatorial drug discovery problem into an equally hard computational task.Page 3, “Introduction”
- Application in combinatorial drug discoveryPage 12, “Application in combinatorial drug discovery”
- Simulation of a drug discoveryPage 15, “Simulation of a drug discovery”
- These steps can be part of an iterative drug discovery process that will have immediate use in both the pharmaceutical industry and academia.Page 19, “Conclusion and Outlook”

See all papers in *April 2015* that mention drug discovery.

See all papers in *PLOS Comp. Biol.* that mention drug discovery.

Back to top.

Appears in 7 sentences as: learning algorithm (2) learning algorithms (5)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- By incorporating valuable biological and chemical knowledge, kernels provide an efficient way to improve the accuracy of learning algorithms .Page 2, “Introduction”
- This work explores the use of learning algorithms to design and enhance the pharmaceutical properties of compounds [12, 13].Page 2, “Introduction”
- The proposed solution for drug design is thus compatible with these popular bioinformatics learning algorithms [21].Page 5, “The machine learning approach”
- Since a PSWM assumes statistical independence between positions in the pattern, the probability that a sequence belongs to a certain pattern is given by summing the corresponding entries in M. PSWM are simple but have, however, been surpassed by modern machine learning algorithms [22, 23] since they assume independence between positions in the pattern.Page 6, “The machine learning approach”
- It is common for machine learning algorithms to sacrifice accuracy on the training data to prevent overfitting.Page 14, “Improving the bioactivity of peptides”
- Another possible explanation for this discrepancy is that the biological activity of VEWAK could be slightly erroneous as the learning algorithm could not find a simple model given such an outlier.Page 14, “Improving the bioactivity of peptides”
- Hence, our proposed learning algorithm predicts new peptides having biological activities equivalent to the best of the training set and, in some cases, substantially improved activities.Page 14, “Improving the bioactivity of peptides”

See all papers in *April 2015* that mention learning algorithms.

See all papers in *PLOS Comp. Biol.* that mention learning algorithms.

Back to top.

Appears in 6 sentences as: binding affinities (1) binding affinity (5)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- By starting with a training set containing approximately 100 peptides with their corresponding validated bioactivity ( binding affinity , IC50, etc), we expect that a state-of-the-art kernel method will give a bioactivity model which is sufficiently accurate to find new peptides with activities higher than the 100 used to learn the model.Page 2, “Introduction”
- This is possible because each peptide that possesses a small binding affinity contains information about subsequences of residues that can bind to the target.Page 3, “Introduction”
- This algorithm makes use of graph theory and recent work [14] on the prediction of the bioactivity and the binding affinity between peptides and a target protein.Page 3, “Introduction”
- Recently [14], the GS kernel was used to learn a predictor capable of predicting, With reasonable accuracy, the binding affinity of any peptide to any protein on the Per database.Page 4, “The Generic String kernel”
- The GS kernel has also outperformed current state-of-the-art methods for predicting peptide-protein binding affinities on single-target and pan-specific Major Histocompatibility Complex (MHC) class II benchmark datasets and three Quantitative Structure Affinity Model benchmark data-sets.Page 4, “The Generic String kernel”
- In the regression setting, the learning task is to predict a real value that quantifies the quality of a peptide, for example, its bioactiVity, inhibitory concentration, binding affinity , or bioavailability.Page 4, “The machine learning approach”

See all papers in *April 2015* that mention binding affinity.

See all papers in *PLOS Comp. Biol.* that mention binding affinity.

Back to top.

Appears in 5 sentences as: ligands (7)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- Moreover, the proposed approach does not require the use of known ligands for the target protein since it can leverage recent multi-target machine learning predictors where ligands for similar targets can serve as initial training data.Page 1, “Abstract”
- Moreover, the proposed approach can be employed without known ligands for the target protein because it can leverage recent multi-target machine learning predictors [10, 14] where ligands for similar targets can serve as an initial training set.Page 3, “Introduction”
- Split and pool combinatorial peptide synthesis is a simple but efficient way to synthesize a very Wide spectrum of peptide ligands .Page 11, “Protocol for split and pool peptide synthesis”
- It has been used for the discovery of ligands for receptors [30, 31], for proteins [32—35] and for transcription factors [36, 37].Page 11, “Protocol for split and pool peptide synthesis”
- Combined with a multi-target model, it can be used to predict binding motifs for targets with no known ligands .Page 19, “Conclusion and Outlook”

See all papers in *April 2015* that mention ligands.

See all papers in *PLOS Comp. Biol.* that mention ligands.

Back to top.

Appears in 4 sentences as: computational time (3) computer time (1)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- Unfortunately, once the model is learned, selecting peptides having the greatest predicted bioactivity often requires a prohibitive amount of computational time .Page 1, “Abstract”
- Indeed, applying the model to every peptides would take an astronomical amount of computer time .Page 2, “Author Summary”
- Therefore, given a model, is it possible to determine, using reasonable computational time , the peptide that has the best properties and chance for success?Page 2, “Author Summary”
- I n-silico predictions are faster and cheaper than in-vitro assays, however, predicting the bioactivity of all possible peptide to select the most bioactive ones would require a prohibitive amount of computational time .Page 3, “Introduction”

See all papers in *April 2015* that mention computational time.

See all papers in *PLOS Comp. Biol.* that mention computational time.

Back to top.

Appears in 4 sentences as: in-vitro (4)

In *Machine Learning Assisted Design of Highly Active Peptides for Drug Discovery*

- Machine learning techniques may replace expensive in-vitro laboratory experiments by learning an accurate model of it.Page 1, “Author Summary”
- Finally, in-vitro and in-silico results are provided to support and validate this theoretical discovery.Page 2, “Author Summary”
- I n-silico predictions are faster and cheaper than in-vitro assays, however, predicting the bioactivity of all possible peptide to select the most bioactive ones would require a prohibitive amount of computational time.Page 3, “Introduction”
- Comparing hmndom and the oracle accuracies on the CAMPs and BPPs databases To provide additional support for its accuracy, predictor hmndom was used to predict the bioactivity values of unseen but in-vitro validated peptides of the CAMPs and BPPs databases.Page 16, “Simulation of a drug discovery”

See all papers in *April 2015* that mention in-vitro.

See all papers in *PLOS Comp. Biol.* that mention in-vitro.

Back to top.