In the artificial intelligence subfield of neural networks, a barrier to that goal is that when agents learn a new skill they typically do so by losing previously acquired skills, a problem called catastrophic forgetting. That occurs because, to learn the new task, neural learning algorithms change connections that encode previously acquired skills. How networks are organized critically affects their learning dynamics. In this paper, we test whether catastrophic forgetting can be reduced by evolving modular neural networks. Modularity intuitively should reduce learning interference between tasks by separating functionality into physically distinct modules in which learning can be selectively turned on or off. Modularity can further improve learning by having a reinforcement learning module separate from sensory processing modules, allowing learning to happen only in response to a positive or negative reward. In this paper, learning takes place via neuromodulation, which allows agents to selectively change the rate of learning for each neural connection based on environmental stimuli (e.g. to alter learning in specific locations based on the task at hand). To produce modularity, we evolve neural networks with a cost for neural connections. We show that this connection cost technique causes modularity, confirming a previous result, and that such sparsely connected, modular networks have higher overall performance because they learn new skills faster while retaining old skills more and because they have a separate reinforcement learning module. Our results suggest (1) that encouraging modularity in neural networks may help us overcome the longstanding barrier of networks that cannot learn new skills without forgetting old ones, and (2) that one benefit of the modularity ubiquitous in the brains of natural animals might be to alleviate the problem of catastrophic forgetting.
An obstacle is that agents typically learn new skills only by losing previously acquired skills. Here we test whether such forgetting is reduced by evolving modular neural networks, meaning networks with many distinct subgroups of neurons. Modularity intuitively should help because learning can be selectively turned on only in the module learning the new task. We confirm this hypothesis: modular networks have higher overall performance because they learn new skills faster while retaining old skills more. Our results suggest that one benefit of modularity in natural animal brains may be allowing learning without forgetting.
A longstanding scientific challenge is to create agents that can learn, meaning they can adapt to novel situations and environments within their lifetime. The world is too complex, dynamic, and unpredictable to program all beneficial strategies ahead of time, which is why robots, like natural animals, need to be able to continuously learn new skills on the fly.
Such forgetting is especially problematic in fields that attempt to create artificial intelligence in brain models called artificial neural networks [1, 4, 5]. To learn new skills, neural network learning algorithms change the weights of neural connections [6–8], but old skills are lost because the weights that encoded old skills are changed to improve performance on new tasks. This problem is known as catastrophic forgetting [9, 10] to emphasize that it contrasts with biological animals (including humans), where there is gradual forgetting of old skills as new skills are learned. While robots and artificially intelligent software agents have the potential to significantly help society [12–14], their benefits will be extremely limited until we can solve the problem of catastrophic forgetting [1, 15]. To advance our goal of producing sophisticated, functional artificial intelligence in neural networks and make progress in our long-term quest to create general artificial intelligence with them, we need to develop algorithms that can learn how to handle more than a few different problems. Additionally, the difference between computational brain models and natural brains with respect to catastrophic forgetting limits the usefulness of such models as tools to study neurological pathologies.
Modular networks are those that have many clusters (modules) of highly connected neurons that are only sparsely connected to neurons in other modules [19, 22, 23]. The intuition behind this hypothesis is that modularity could allow learning new skills without forgetting old skills because learning can be selectively turned on only in modules learning a new task (Fig. 1, top). Selective regulation of learning occurs in natural brains via neuromodulation, and we incorporate an abstraction of it in our model. We also investigate a second hypothesis: that modularity can improve skill learning by separating networks into a skill module and a reward module, resulting in more precise control of learning (Fig. 1, bottom).
In nature, there are many costs associated with neural connections (e.g. building them, maintaining them, and housing them) [26–28], and it was recently demonstrated that incorporating a cost for such connections encourages the evolution of modularity in networks. Our results support the hypothesis that modularity does mitigate catastrophic forgetting: modular networks have higher overall performance because they learn new skills faster while retaining old skills more. Additional research into this area, including investigating the generality of our results, will catalyze research on creating artificial intelligence, improve models of neural learning, and shed light on whether one benefit of modularity in natural animal brains is an improved ability to learn without forgetting.
Catastrophic forgetting (also called catastrophic interference) has been identified as a problem for artificial neural networks (ANNs) for over two decades: when learning multiple tasks in a sequence, previous skills are forgotten rapidly as new information is learned [9, 10]. The problem occurs because learning algorithms only focus on solving the current problem and change any connections that will help solve that problem, even if those connections encoded skills appropriate to previously encountered problems.
Novelty vectors modify the backpropagation learning algorithm to limit the number of connections that are changed in the network based on how novel, or unexpected, the input pattern is. This technique is only applicable for auto-encoder networks (networks whose target output is identical to their input), thus limiting its value as a general solution to catastrophic forgetting. Orthogonalization techniques mitigate interference between tasks by reducing their representational overlap in input neurons (via manually designed preprocessing) and by encouraging sparse hidden-neuron activations [30–32]. Interleaved learning avoids catastrophic forgetting by training on both old and new data when learning, although this method cannot scale and does not work for realistic environments because in the real world not all challenges are faced concurrently [33, 34]. This problem with interleaved learning can be reduced with pseudo rehearsal, wherein input-output associations from old tasks are remembered and rehearsed.
However, scaling remains an issue with pseudo rehearsal because such associations still must be stored and choosing which associations to store is an unsolved problem. These techniques are all engineered approaches to reducing the problem of catastrophic forgetting and are not proposed as methods by which natural evolution solved the problem of catastrophic forgetting [1, 10, 29–32, 34].
The technique, inspired by theories on how human brains separate and subsequently integrate old and new knowledge, partitions early processing and long-term storage into different subnetworks. Similar to interleaved learning techniques, dual-net architectures enable both new knowledge and input history (in the form of current network state) to affect learning.
In this paper, we study a new hypothesis, which is that modularity can help avoid catastrophic forgetting. Unlike the techniques mentioned so far, our solution does not require human design, but is automatically generated by evolution. Evolving our solution under biologically realistic constraints has the added benefit of suggesting how such a mechanism may have originated in nature.
One method for setting the connection weights of neural networks is to evolve them, meaning that an evolutionary algorithm specifies each weight, and the weight does not change within an organism’s “lifetime” [5, 37–39]. Evolutionary algorithms abstract Darwinian evolution: in each generation a population of “organisms” is subjected to selection (for high performance) and then mutation (and possibly crossover).
Some learning algorithms, such as backpropagation [6, 7], require a correct output (e.g. action) for each input. Other learning algorithms are considered more biologically plausible in that they involve only information local to each neuron (e.g. Hebb’s rule ) or infrequent reward signals [8, 46, 47].
Compared to behaviors defined solely by evolution, evolving agents that learn leads to better solutions in fewer generations [48, 50, 51] and improved adaptability to changing environments [48, 49], and it enables evolving solutions for larger neural networks. Computational studies of evolving agents that learn have also shed light on open biological questions regarding the interactions between evolution and learning [50, 52, 53].
In one relevant paper, evolution optimized certain parameters of a neural network to mitigate catastrophic forgetting. Such parameters included the number of hidden (internal) neurons, learning rates, patterns of connectivity, initial weights, and output error tolerances. That paper did show that there is a potential for evolution to generate a stronger resistance to catastrophic forgetting, but did not investigate the role of modularity in helping produce such a resistance.
Evolutionary experiments on artificial neural networks typically model only the classic excitatory and inhibitory actions of neurons in the brain. In addition to these processes, biological brains employ a number of different neuromodulators, which are chemical signals that can locally modify learning [24, 54, 55]. By allowing evolution to design neuromodulatory dynamics, learning rates for particular synapses can be upregulated and downregulated in response to certain inputs from the environment. These additional degrees of freedom greatly increase the possible complexity of reward-based learning strategies. This type of plasticity-controlling neuromodulation has been successfully applied when evolving neural networks that solve reinforcement learning problems [25, 46], and a comparison found that evolution was able to solve more complex tasks with neuromodulated Hebbian learning than with Hebbian learning alone. Our experiments include this form of neuromodulation (Methods).
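To make the idea of plasticity-controlling neuromodulation concrete, the following is a minimal sketch of a neuromodulated Hebbian weight update. It is an illustration under stated assumptions, not the update rule from Methods: the multiplicative form, the learning rate `eta`, the weight bound `w_max`, and the function name are all hypothetical choices.

```python
def modulated_hebbian_update(weight, pre, post, modulation, eta=0.1, w_max=1.0):
    """Return the connection weight after one neuromodulated Hebbian step.

    Assumptions (not taken from the paper's Methods):
      - `modulation` is the summed signal that modulatory neurons send to
        the postsynaptic neuron; 0 means "learning switched off".
      - The change is eta * modulation * pre * post, clamped to [-w_max, w_max].
    """
    delta = eta * modulation * pre * post
    return max(-w_max, min(w_max, weight + delta))


# With zero modulation the weight is frozen, however active the neurons are.
frozen = modulated_hebbian_update(0.2, pre=1.0, post=1.0, modulation=0.0)
# Positive modulation strengthens a co-active connection; negative weakens it.
stronger = modulated_hebbian_update(0.2, pre=1.0, post=1.0, modulation=1.0)
weaker = modulated_hebbian_update(0.2, pre=1.0, post=1.0, modulation=-1.0)
```

The point of the sketch is that the modulatory signal gates learning per connection, which is what lets learning be switched on in one part of a network while the rest stays fixed.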
Modularity is ubiquitous in biological networks, including neural networks, genetic regulatory networks, and protein interaction networks [17–21]. Why modularity evolved in such networks has been a longstanding area of research [18–20, 56–59]. Researchers have also long studied how to encourage the evolution of modularity in artificial neural networks, usually by creating the conditions that are thought to promote modularity in natural evolution [19, 57–61]. Several different hypotheses have been suggested for the evolutionary origins of modularity.
These environments are said to have modularly varying goals. While such environments can promote modularity, the effect only appears for certain frequencies of environmental change and can fail to appear with different types of networks [58, 60, 61]. Moreover, it is unclear how many natural environments change modularly and how to design training problems for artificial neural networks that have modularly varying goals. Other experiments have shown that modularity may arise from gene duplication and differentiation, or that it may evolve to make networks more robust to noise in the genotype-phenotype mapping or to reduce interference between network activity patterns.
This explanation for the evolutionary origins of modularity is biologically plausible because biological networks have connection costs (e.g. to build connections, maintain them, and house them) and there is evidence that natural selection optimally arranges neurons to minimize these connection costs [26, 27]. Moreover, the modularity-inducing effects of adding a connection cost were shown to occur in a wide range of environments, suggesting that adding a selection pressure to reduce connection costs is a robust, general way to encourage modularity. We apply this technique in our paper because of its efficacy and because it may be a main reason that modularity evolves in natural networks.
The environment is an abstraction of a world in which an organism performs a daily routine of trying to eat nutritious food while avoiding eating poisonous food. Every day the organism observes every food item one time: half of the food items are nutritious and half are poisonous. To achieve maximum fitness, the individual needs to eat all the nutritious items and avoid eating the poisonous ones. After a number of days, the season changes abruptly from a summer season to a winter season. In the new season, there is a new set of food sources, half of them nutritious and half poisonous, and the organism has to learn which is which. After this winter season, the environment changes back to the
The environment for one individual’s lifetime. A lifetime lasts 3 years. Each year has 2 seasons: winter and summer. Each season consists of 5 days. In each day, each individual sees all food items available in that season (only two are shown) in a random order.
The environment switches back and forth between these two seasons multiple times in the organism’s lifetime. Individuals that remember each season’s food associations perform better by avoiding poisonous items without having to try them first.
Every season lasts for five days, and in each day an individual encounters all four food items for that season in a random order. A lifetime is three years (Fig. 2). To ensure that individuals must learn associations within their lifetimes instead of having genetically hardcoded associations [47, 62], in each lifetime two food items are randomly assigned as nutritious and the other two food items are assigned as poisonous (Fig. 3). To select for general learners rather than individuals that by chance do well in a specific environment, performance is averaged over four random environments (lifetimes) for each individual during evolution, and over 80 random environments (lifetimes) when assessing the performance of final, end-of-experiment individuals (Methods).
Randomizing food associations between generations. To ensure that agents learn associations within their lifetimes instead of genetically hardcoding associations, whether each food item is nutritious or poisonous is randomized each generation. There are four food items per season (two are depicted).
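The structure of one lifetime (three years, two seasons per year, five days per season, four food items per season shown in random order, half of them randomly assigned as nutritious) can be sketched as a small schedule generator. The function and variable names are hypothetical, and the choice to start each year with summer is an assumption based on the description of the season changing from summer to winter.

```python
import random

SEASONS = ("summer", "winter")  # assumed order within a year


def make_lifetime(rng, years=3, days_per_season=5, items_per_season=4):
    """Build one random environment (lifetime).

    Returns (nutritious, schedule) where `nutritious[season]` is the set of
    item indices that are nutritious for this lifetime, and `schedule` is a
    list of (year, season, day, presentation_order) tuples.
    """
    # Half of each season's items are randomly assigned as nutritious.
    nutritious = {
        season: set(rng.sample(range(items_per_season), items_per_season // 2))
        for season in SEASONS
    }
    schedule = []
    for year in range(years):
        for season in SEASONS:
            for day in range(days_per_season):
                order = list(range(items_per_season))
                rng.shuffle(order)  # every item is seen once per day
                schedule.append((year, season, day, order))
    return nutritious, schedule


nutritious, schedule = make_lifetime(random.Random(0))
```

Drawing a fresh `nutritious` assignment for every lifetime is what prevents the associations from being hardcoded genetically: an agent can only score well by learning them.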
For instance, if an agent is able to avoid forgetting the summer associations during the winter season, it will immediately perform well when summer returns, thus outcompeting agents that have to relearn summer associations. Agents that forget, especially catastrophically, are therefore at a selective disadvantage. Our main results were found to be robust to variations in several of our experimental parameters, including changes to the number of years in the organism’s lifetime, the number of different seasons per year, the number of different edible items, and different representations of the inputs (the presence of items being represented either by a single input or distributed across all inputs for a season). We also observed that our results are robust to lengthening the number of days per season: networks in the experimental treatment (called “P&CC” for reasons described below) significantly outperform the networks in the control (“PA”) treatment (p < 0.05) even when doubling or quadrupling the number of days per season, although the size of the difference diminished in longer seasons.
The model of the organism’s brain is a neural network with 10 input neurons (Supp. S1 Fig). From left to right, inputs 1-4 and 5-8 encode which summer and winter food item is present, respectively. During summer, the winter inputs are never active and vice versa. Catastrophic forgetting may appear in these networks because a non-modular neural network is likely to use the same hidden neurons for both seasons (Fig. 1, top). We segmented the summer and winter items into separate input neurons to abstract a neural network responsible for an intermediate phase of cognition, where early visual processing and object recognition have already occurred, but before decisions have been made about what to do in response to the recognized visual stimuli. Such disentangled representations of objects have been identified in animal brains and are common at intermediate layers of neural network models. The final two inputs are for reinforcement learning: inputs 9 and 10 are reward and punishment signals that fire when a nutritious or poisonous food item is eaten, respectively. The network has a single output that determines if the agent will eat (output > 0) or ignore (output ≤ 0) the presented food item. Associations can be learned by properly connecting reward signals through neuromodulatory neurons to non-modulatory neurons that determine which actions to take in response to food items (Methods). Evolution determines the neural wiring that produces learning dynamics, as described next.
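The input layout described above can be illustrated with a short encoding sketch. The zero-based indexing, the activation value of 1.0, and the function names are assumptions made for illustration; only the layout (four summer inputs, four winter inputs, one reward input, one punishment input) and the eat/ignore threshold come from the text.

```python
def encode_input(season, item, reward=0.0, punishment=0.0):
    """Return the 10-element input vector for one food presentation.

    Indices 0-3 are summer items, 4-7 winter items, 8 is the reward
    signal and 9 the punishment signal (zero-based, by assumption).
    """
    if season not in ("summer", "winter"):
        raise ValueError("season must be 'summer' or 'winter'")
    if not 0 <= item < 4:
        raise ValueError("item must be in 0..3")
    x = [0.0] * 10
    offset = 0 if season == "summer" else 4
    x[offset + item] = 1.0  # only the current season's inputs are ever active
    x[8] = reward
    x[9] = punishment
    return x


def decides_to_eat(output):
    """The agent eats when the single output is positive, otherwise ignores."""
    return output > 0


winter_item = encode_input("winter", 2)
```

Because the two seasons never share an input neuron, any interference between them must arise downstream, in hidden neurons and connections that both seasons use.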
Evolution begins with a randomly generated population of neural networks. The performance of each network is evaluated as described above. More fit networks tend to have more offspring, with fitness being determined differently in each treatment, as explained below. Offspring are generated by copying a parent genome and mutating it by adding or removing connections, changing the strength of connections, and switching neurons from being modulatory to non-modulatory or vice versa. The process repeats for 20,000 generations.
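The mutation step described above (adding or removing connections, changing connection strengths, and switching neurons between modulatory and non-modulatory) might look roughly like the sketch below. The genome representation, the mutation probabilities, the Gaussian perturbation, and the weight bounds are all hypothetical; the actual operators and rates are specified in Methods.

```python
import random


def mutate(genome, rng, n_neurons, p_add=0.2, p_remove=0.2,
           p_weight=0.1, p_toggle=0.05, sigma=0.3):
    """Return a mutated copy of `genome`; the parent is left untouched.

    Assumed genome layout: {"connections": {(src, dst): weight},
    "modulatory": set of neuron ids, "hidden": list of neuron ids}.
    """
    conns = dict(genome["connections"])
    modulatory = set(genome["modulatory"])

    # Possibly add one new connection with a random weight.
    if rng.random() < p_add:
        src, dst = rng.randrange(n_neurons), rng.randrange(n_neurons)
        conns.setdefault((src, dst), rng.uniform(-1.0, 1.0))

    # Possibly remove one existing connection.
    if conns and rng.random() < p_remove:
        del conns[rng.choice(sorted(conns))]

    # Perturb the strength of some connections, keeping weights in [-1, 1].
    for key in conns:
        if rng.random() < p_weight:
            conns[key] = max(-1.0, min(1.0, conns[key] + rng.gauss(0.0, sigma)))

    # Switch some hidden neurons between modulatory and non-modulatory.
    for neuron in genome["hidden"]:
        if rng.random() < p_toggle:
            modulatory ^= {neuron}

    return {"connections": conns, "modulatory": modulatory,
            "hidden": list(genome["hidden"])}


parent = {"connections": {(0, 2): 0.5, (1, 2): -0.5},
          "modulatory": {2}, "hidden": [2, 3]}
child = mutate(parent, random.Random(0), n_neurons=4)
```

Copying before mutating matters in an evolutionary loop, because a parent may produce several offspring and may itself survive into the next generation.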
We compared a treatment where the fitness of individuals was based on performance alone (PA) to one based on both maximizing performance and minimizing connection costs (P&CC). Specifically, evolution proceeds according to a multi-objective evolutionary algorithm with one (PA) or two (P&CC) primary objectives. A network’s connection cost equals its number of connections, following previous work. More details on the evolutionary algorithm can be found in Methods.
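The difference between the two treatments can be illustrated with plain Pareto dominance over one or two objectives. This is a simplified sketch: the specific multi-objective algorithm and any additional details are given in Methods and are not reproduced here, and the function names and example numbers are hypothetical.

```python
def objectives(performance, n_connections, use_connection_cost):
    """Objective tuple to maximize: PA uses performance only, P&CC also
    rewards having fewer connections (cost = number of connections)."""
    if use_connection_cost:
        return (performance, -n_connections)
    return (performance,)


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(population):
    """Keep the objective tuples that no other member dominates."""
    return [p for p in population
            if not any(dominates(q, p) for q in population)]


# Two networks with equal performance but different numbers of connections.
sparse_pcc = objectives(0.9, 10, use_connection_cost=True)
dense_pcc = objectives(0.9, 20, use_connection_cost=True)
sparse_pa = objectives(0.9, 10, use_connection_cost=False)
dense_pa = objectives(0.9, 20, use_connection_cost=False)
```

Under P&CC the sparser network dominates its equally performing rival, whereas under PA the two are indistinguishable, which is the selection pressure that favors sparse wiring.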
4). In addition to overall performance across generations, we looked at the day-to-day performance of final, evolved individuals (Fig. 5). P&CC networks learn associations faster in their first summer and winter, and maintain higher performance over multiple years (pairs of seasons).
4), confirming the finding of Clune et al. in this different context of networks with within-life learning. Networks evolved in the P&CC treatment tend to create a separate reinforcement learning module that contains the reward and punishment inputs and most or all neuromodulatory neurons (Fig. 6). One of our hypotheses (Fig. 1, bottom) suggested that such a separation could improve the efficiency of learning, by regulating learning (via neuromodulatory neurons) in response to whether the network performed a correct or incorrect action, and applying that learning to downstream neurons that determine which action should be taken in response to input stimuli.
We then measured the frequency with which the reinforcement inputs (reward/punishment signals) were placed into a different module from the remaining food-item inputs. This measure reveals that P&CC networks have a separate module for learning in 31% of evolutionary trials, whereas only 4% of the PA trials do, which is a significant difference (p = 2.71 × 10^-7), in agreement with our hypothesis (Fig. 1, bottom). Analyses also reveal that the networks from both treatments that have a separate module for learning perform significantly better than networks without this decomposition (median performance of modular networks in 80 randomly generated environments (Methods): 0.87 [95% CI: 0.83, 0.88] vs. non-modular networks: 0.80 [0.71, 0.84], p = 0.02). Even though only 31% of the P&CC networks are deemed modular in this particular way, the remaining P&CC networks are still significantly more modular on average than PA networks (median Q scores are 0.25 [0.23, 0.28] and 0.2 [0.19, 0.22] respectively, p = 4.37 × 10^-6), suggesting additional ways in which modularity improves the performance of P&CC networks.
Performance each day for evolved agents from both treatments. Plotted is median performance per day (± 95% bootstrapped confidence intervals of the median) measured across 100 organisms (the highest-performing organism from each experiment per treatment) tested in 80 new environments (lifetimes). P&CC networks significantly outperform PA networks on every day (asterisks). Eating no items or all items produces a score of 0.5; eating all and only nutritious food items achieves the maximum score of 1.0.
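One scoring rule consistent with the three reference points given for the performance figure (0.5 for eating nothing, 0.5 for eating everything, 1.0 for eating exactly the nutritious items) is the fraction of correct eat/ignore decisions. The sketch below uses that rule as an assumption; the exact fitness formula is defined in Methods and may differ in form.

```python
def daily_score(eaten, nutritious, all_items):
    """Fraction of items handled correctly on one day.

    A decision is correct when a nutritious item is eaten or a poisonous
    item is ignored. This linear form is an assumption chosen to match the
    reference scores quoted in the text.
    """
    correct = sum((item in eaten) == (item in nutritious) for item in all_items)
    return correct / len(all_items)


items = [0, 1, 2, 3]
good = {0, 1}  # hypothetical assignment: items 0 and 1 are nutritious

eat_nothing = daily_score(set(), good, items)
eat_everything = daily_score(set(items), good, items)
eat_only_good = daily_score(good, good, items)
```

Under this rule an agent that has learned nothing and acts indiscriminately sits at the 0.5 baseline, so performance above 0.5 reflects learned associations.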
Both sparsity and modularity are correlated with the performance of networks (Fig. 7). Sparsity also correlates with modularity (p = 5.15 × 10^-40 as calculated by a t-test of the hypothesis that the correlation is zero), as previously shown [23, 66]. Our interpretation of the data is that the pressure for both functionality and sparsity causes modularity, which in turn helps evolve learners that are more resistant to catastrophic forgetting. However, it cannot be ruled out that sparsity itself mitigates catastrophic forgetting, or that the general learning abilities of the network have been improved due to the separation into a skill module and a learning module. Either way, the data support our hypothesis that a connection cost promotes the evolution of sparsity, modularity, and increased performance on learning tasks.
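For readers unfamiliar with the Q score, the sketch below computes the standard Newman modularity of a given partition for an undirected, unweighted graph. It is an illustration only: the variant used in this paper and the procedure for finding the best partition are described in Methods, and the example graph and function name are hypothetical.

```python
def modularity_q(edges, module_of):
    """Newman modularity Q of a fixed partition.

    `edges` is a list of (u, v) pairs with no self-loops or duplicates;
    `module_of` maps each node to its module label. Computes
    Q = sum over modules of [e_c / m - (d_c / (2 m))^2], where e_c is the
    number of edges inside module c, d_c the summed degree of its nodes,
    and m the total number of edges.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    within = sum(1 for u, v in edges if module_of[u] == module_of[v])
    module_degree = {}
    for node, k in degree.items():
        label = module_of[node]
        module_degree[label] = module_degree.get(label, 0) + k
    return within / m - sum((d / (2 * m)) ** 2 for d in module_degree.values())


# Two triangles joined by a single edge: a clearly modular toy graph.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}

q_split = modularity_q(edges, two_modules)
q_merged = modularity_q(edges, one_module)
```

Q compares the density of within-module connections with what would be expected by chance, so placing every node in a single module yields zero and a good split of a modular graph yields a positive score.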
Measuring the percent of information a network retains can be misleading, because networks that never learn anything are reported as never forgetting anything. In many PA experiments, networks did not learn in one or both seasons, which looks like perfect retention, but for the wrong reason: they do not forget anything because they never knew anything to begin with. To prevent such pathological, non-learning networks from clouding this analysis, we compared only the 50 highest-performing experiments from each treatment, instead of all 100 experiments. For both treatments, we then measured retention and forgetting in the highest-performing network from each of these 50 experiments.
Specifically, we allowed individuals to learn for 50 winter days (to allow even poor learners time to learn the winter associations) before exposing them to 20 summer days, during which we measured how rapidly they forgot winter associations and learned summer associations (Methods). Notice that individuals were evolved in seasons lasting only 5 days, but we measure learning and forgetting for 20 days in this analysis to study the longer-term consequences of the evolved learning architectures. Thus, the key result relevant to catastrophic forgetting is what occurs during the first five days. We included the remaining 15 days to show that the differences in performance persist if the seasons are extended.
8, left). They also learn the new task better (Fig. 8, center). The combined effect significantly improves performance (Fig. 8, right), meaning P&CC networks are significantly better at learning associations in a new season while retaining associations from a previous one.
If we regard performance in each season as a skill, this experiment measures whether the individuals can retain a previously-learned skill (perfect summer performance) after learning a new skill (perfect winter performance). We tested the knowledge of the individuals in the following way: at the end of each season, we counted the number of sets of associations (summer or winter) that individuals knew perfectly, which required them knowing the correct response for each food item in that season. We formulated four metrics that quantify how well individuals knew and retained associations.
P&CC networks significantly outperform PA networks in both learning and retention. P&CC individuals learn significantly more associations, whether counting only when the associations for both seasons are known (“Perfect” knowledge) or separately counting knowledge of either season’s association (total “Known”). P&CC networks also forget fewer associations, defined as associations known in one season and then forgotten in the next, which is significant when looking at the percent of known associations forgotten (“% Forgotten”). P&CC networks also retain significantly more associations, meaning they did not forget one season’s association when learning the next season’s association. See text for more information about the “Perfect”, “Known”, “Forgotten,” and “Retained” metrics. During all performance measurements, learning was disabled to prevent such measurements from changing an individual’s known associations (Methods). Bars show median performance, whiskers show the 95% bootstrapped confidence interval of the median. Two asterisks indicate p < 0.01, three asterisks indicate p < 0.001.
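A bootstrapped confidence interval of the median, as used for the bars and whiskers in the figures, can be computed with the percentile method sketched below. The number of resamples, the percentile method itself, and the example scores are assumptions for illustration; the exact procedure used in this paper is described in Methods.

```python
import random
import statistics


def bootstrap_median_ci(values, rng, n_resamples=2000, level=0.95):
    """Return (median, lower, upper) using the percentile bootstrap.

    Each resample draws len(values) observations with replacement and
    records its median; the interval is read off the sorted medians.
    """
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return statistics.median(values), medians[lower_index], medians[upper_index]


# Hypothetical per-organism performance scores, for illustration only.
scores = [0.78, 0.81, 0.94, 0.92, 0.88, 0.79, 0.93, 0.90, 0.85, 0.91]
median, lower, upper = bootstrap_median_ci(scores, random.Random(0))
```

The median and a bootstrapped interval make no assumption that performance is normally distributed, which suits scores that are bounded between 0.5 and 1.0 and often skewed.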
Doing well on this metric indicates reduced catastrophic forgetting because it requires retaining an old skill even after a new one is learned. P&CC individuals learned significantly more Perfect associations (Fig. 9, Perfect).
In other words, it counts knowing either season in a year and doubly counts knowing both. P&CC individuals learned significantly more of these Known associations (Fig. 9, Known).
There is no significant difference between treatments on this metric when measured in absolute numbers (Fig. 9, Forgotten). However, measured as a percentage of Known items, P&CC individuals forgot significantly fewer associations (Fig. 9, % Forgotten). The modular P&CC networks thus learned more and forgot less, leading to a significantly lower percentage of forgotten associations.
P&CC individuals retained significantly more than PA individuals, both in absolute numbers (Fig. 9, Retained) and as a percentage of the total number of known items (Fig. 9, % Retained).
2), 80 random environments). The agent can retain or forget two associations each season except the first, making the maximum score for these metrics 5 × 80 × 2 = 800. However, the agent can only score one perfect association (meaning both summer and winter are known) each season, leading to a maximum score of 6 × 80 = 480 for that metric.
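The four metrics and their maximum scores can be checked with a small counting sketch. The representation (one pair of booleans per season, recording whether the summer and winter association sets are known at the end of that season) and the exact counting rules are an interpretation of the descriptions in the text and figure caption, not code from the paper.

```python
def retention_metrics(known_per_season):
    """Count Perfect, Known, Forgotten and Retained for one lifetime.

    `known_per_season` holds one (summer_known, winter_known) pair per
    season, measured at the end of that season.
    """
    perfect = sum(1 for s, w in known_per_season if s and w)
    known = sum(int(s) + int(w) for s, w in known_per_season)
    forgotten = retained = 0
    # Compare each season with the one before it, per association set.
    for previous, current in zip(known_per_season, known_per_season[1:]):
        for was_known, is_known in zip(previous, current):
            if was_known and is_known:
                retained += 1
            elif was_known:
                forgotten += 1
    return {"perfect": perfect, "known": known,
            "forgotten": forgotten, "retained": retained}


n_seasons = 3 * 2        # three years of two seasons
n_environments = 80

# An ideal agent knows both sets at the end of every season.
ideal = retention_metrics([(True, True)] * n_seasons)
max_retained = ideal["retained"] * n_environments   # 5 * 2 * 80
max_perfect = ideal["perfect"] * n_environments     # 6 * 80

# A catastrophic forgetter knows only the current season's set.
forgetter = retention_metrics([(True, False), (False, True)] * 3)
```

The ideal agent reproduces the maxima stated in the text (800 and 480), while the forgetter scores zero on Perfect and Retained despite learning every season.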
In other words, adding a connection cost mitigated catastrophic forgetting. That, in turn, enabled an increase in the total number of associations P&CC individuals learned in their lifetimes.
To further test whether the improved performance in the P&CC treatment results from it mitigating catastrophic forgetting, we conducted experiments in a regime where retaining skills between tasks is impossible. Under such a regime, if the P&CC treatment does not outperform the PA treatment, that is evidence for our hypothesis that the ability of P&CC networks to outperform PA networks in the normal regime is because P&CC networks retain previously learned skills more when learning new skills.
This forced forgetting was implemented by resetting all neuromodulated weights in the network to random values between each season change. The experimental setup was otherwise identical to the main experiment. In this treatment, evolution cannot evolve individuals to handle forgetting better, and can focus only on evolving good learning abilities for each season. With forced forgetting, the P&CC treatment no longer significantly outperforms the PA treatment (Fig. 10). This result indicates that the connection cost specifically helps evolution in optimizing the parts of learning related to resistance against forgetting old associations while learning new ones.
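The forced-forgetting manipulation can be sketched as a reset applied at every season change. The weight range, the uniform distribution, and the data layout are assumptions; the text states only that all neuromodulated weights are reset to random values.

```python
import random


def reset_modulated_weights(weights, is_modulated, rng, w_max=1.0):
    """Return a copy of `weights` with every neuromodulated connection
    re-drawn uniformly from [-w_max, w_max] (assumed range).

    `weights` maps (src, dst) to a weight; `is_modulated` maps the same
    keys to True when learning can change that connection.
    """
    return {
        conn: rng.uniform(-w_max, w_max) if is_modulated[conn] else weight
        for conn, weight in weights.items()
    }


weights = {(0, 10): 0.7, (4, 10): -0.3, (8, 11): 1.0}
is_modulated = {(0, 10): True, (4, 10): True, (8, 11): False}
after_season_change = reset_modulated_weights(weights, is_modulated, random.Random(0))
```

Only learned (modulated) connections are scrambled, so the evolved wiring that implements the learning mechanism itself is left intact while everything learned in the previous season is erased.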
10, p = 2.5 × 10^-5 via bootstrap sampling with randomization). Forcing forgetting likely removes some of the interference between learning the two separate tasks. With the connection cost, however, forced forgetting leads to worse results, indicating that the modular networks in the P&CC treatment have found solutions that benefit from remembering what they have learned in the past, and thus are worse off when not allowed to remember that information.
To test whether neuromodulation is essential to evolving a resistance to forgetting in our experiments, we evolved neural networks with and without neuromodulation. When we evolve without neuromodulation, the Hebbian learning dynamics of each connection are constant throughout the lifetime of the organism: this is accomplished by disallowing neuromodulatory neurons from being included in the networks (Methods).
11). This finding is in line with previous work demonstrating that neuromodulation allows evolution to solve more complex reinforcement learning problems than purely Hebbian learning. While the non-modulatory P&CC networks perform slightly better than non-modulatory PA networks, the differences, while significant (P&CC performance 0.72 [95% CI: 0.71, 0.72] vs. PA 0.70 [0.69, 0.71],
Because networks in neither treatment learn much, studying whether they suffer from catastrophic forgetting is uninformative. These results reveal that neuromodulation is essential to perform well in these environments, and its presence is effectively a prerequisite for testing the hypothesis that modularity mitigates catastrophic forgetting. Moreover, neuromodulation is ubiquitous in animal brains, justifying its inclusion in our default model. One can think of neuromodulation, like the presence of neurons, as a necessary, but not sufficient, ingredient for learning without forgetting. Including it in the experimental backdrop allows us to isolate whether modularity further improves learning and helps mitigate catastrophic forgetting.
The resultant networks have a separate learning module and exhibit significantly higher performance, learning, and retention. We further found three lines of evidence that modularity improves performance and helps prevent catastrophic forgetting: (1) networks with a separate learning module performed significantly
The effect of neuromodulation and connection costs when evolving solutions for catastrophic forgetting. Connection costs and neuromodulatory dynamics interact to evolve forgetting-resistant solutions. Without neuromodulation, neither treatment performs well, suggesting that neuromodulation is a prerequisite for solving these types of problems, a result that is consistent with previous research showing that neuromodulation is required to solve challenging learning tasks. However, even in the non-neuromodulatory (pure Hebbian) experiments, P&CC is more modular (0.33 [95% CI: 0.33, 0.33] vs. PA 0.26 [0.22, 0.31], p = 1.16 × 10^-12) and performs significantly better (0.72 [95% CI: 0.71, 0.72] vs. PA 0.70 [0.69, 0.71], p = 0.003). That said, because both treatments perform poorly without neuromodulation, and because natural animal brains contain neuromodulated learning, it is most interesting to see the additional impact of modularity against the backdrop of neuromodulation. Against that backdrop, neural modularity improves performance to a much larger degree (P&CC 0.94 [0.92, 0.94] vs. PA 0.78 [0.78, 0.81], p = 8.08 × 10^-6), in part by reducing catastrophic forgetting (see text).
These findings support the idea that neural modularity can improve learning performance both for tasks with the potential for catastrophic forgetting, by reducing the overlap in how separate skills are stored (Fig. 1, top), and in general, by modularly separating learned skills from reward signals (Fig. 1, bottom).
In the presence of neuromodulatory learning dynamics, which occur in the brains of natural animals [24, 54], a connection cost could thus significantly mitigate catastrophic forgetting. This work thus provides a new candidate technique for improving learning and reducing catastrophic forgetting, which is essential for advancing our goal of making sophisticated robots and intelligent software based on neural networks. It also suggests that one benefit of the modularity ubiquitous in natural networks may be improved learning via reduced catastrophic forgetting.
Future work on different types of problems and experimental setups is needed to confirm or refute the hypotheses suggested in this paper. Specific studies that can investigate the generality of our hypothesis include studying whether the connection cost technique still reduces interference when inputs cannot be as easily disentangled (for instance, if certain inputs are shared between several skills), investigating the effect of more complex learning tasks that may not be learned at all if the agent forgets between training episodes, and further exploring the effect of experimental parameters, such as the length of training episodes, the number of tasks, and different neural network sizes and architectures.
In fact, there may have been evolutionary pressure to create learning dynamics that result in neural modularity: whether such “modular plasticity” rules exist, how they mechanistically cause modularity, and the role of evolution in producing them, is a ripe area for future study. More generally, exploring the degree to which evolution encodes learning rules that lead to modular architectures, as opposed to hard coding modular architectures, is an interesting area for future research. The experiments in this paper are meant to invigorate the conversation about how evolution and learning produce brains that avoid catastrophic forgetting. While the results of these experiments shed light on that question, the importance, magnitude, and complexity of the question will yield fascinating research for decades, if not centuries, to come.
The network has five layers (Supp. S1 Fig) and is feed-forward, meaning each node receives inputs only from nodes in the previous layer and sends outputs only to nodes in the next layer. The number of neurons is 10/4/2 for the three hidden layers. The weights (connection strengths) and biases (activation thresholds) in the network take values in the range [-1, 1]. Following the paper that introduced the connection cost technique, networks are directly encoded [70, 71]. Information flows through the network from the input layer towards the output layer, with one layer per time step. The output of each node is a function of its inputs, as described in the next section.
, and adapted for the Sferes software package by Tonelli and Mouret. It differs from standard ANN models by employing two types of neurons: non-modulatory neurons, which are regular, activity-propagating neurons, and modulatory neurons. Inputs into each neuron consist of two types of connections: modulatory connections $C_m$ and non-modulatory connections $C_n$ (normal neural network connections). The output of a neuron is decided by the weighted sum of its non-modulatory input connections, as follows:
$$a_i = \varphi\left(\sum_{j \in C_n} w_{ij} a_j + b_i\right) \qquad (1)$$
where i and j are neurons, $a_j$ is the output of neuron j, $b_i$ is the bias of neuron i, $w_{ij}$ is the weight of the connection between neuron i and j, and $\varphi$ is a sigmoid function that maps its input to a value in the range [-1, 1], allowing both positive and negative outputs.
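The activation rule described above (a sigmoid applied to the bias plus the weighted sum of non-modulatory inputs) can be sketched in a few lines of Python. This is an illustrative sketch, not the Sferes implementation: the function names are hypothetical, and `tanh` is an assumed stand-in for the sigmoid, chosen only because it maps onto (-1, 1).

```python
import math


def phi(x):
    """Assumed sigmoid: tanh squashes any real input into (-1, 1)."""
    return math.tanh(x)


def neuron_output(inputs, weights, bias):
    """a_i = phi(sum_j w_ij * a_j + b_i) over non-modulatory inputs only."""
    weighted_sum = sum(w * a for w, a in zip(weights, inputs))
    return phi(weighted_sum + bias)


def layer_outputs(prev_activations, weight_rows, biases):
    """Compute one layer; weight_rows[i] holds the incoming weights of neuron i.

    A missing connection is represented here by a weight of 0.0 (assumption).
    """
    return [neuron_output(prev_activations, row, b)
            for row, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Propagate activations layer by layer through a feed-forward network."""
    activations = list(inputs)
    for weight_rows, biases in layers:
        activations = layer_outputs(activations, weight_rows, biases)
    return activations
```

For example, `forward(x, layers)` with `layers` given as a list of `(weight_rows, biases)` pairs returns the activations of the final layer.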
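Their weight modification depends on the sum of modulatory inputs to the downstream neurons they connect to and a constant learning rate $\eta$. Their weight change is calculated by the following two equations:
$$m_i = \varphi\left(\sum_{j \in C_m} w_{ij} a_j\right) \qquad (2)$$
$$\Delta w_{ij} = \eta \cdot m_i \cdot a_i \cdot a_j \qquad (3)$$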
$\varphi$ is a sigmoid function that maps its input to the interval [-1, 1] (thus allowing both positive and negative modulation). The sum includes weighted contributions from all modulatory connections.
$\eta$ is a constant learning rate that is set to 0.04 in our experiments. The $a_i \cdot a_j$ component is a regular Hebbian learning term that is high when the activities of the pre- and post-synaptic neurons of a connection are correlated. The result is a Hebbian learning rule that is regulated by the inputs from neuromodulatory neurons, allowing the learning rate of specific connections to be increased or decreased in specific circumstances. In control experiments without the potential for neuromodulation, all neurons were non-modulatory. Updates to the weights of their incoming connections were calculated via Equation 3 with $m_i$ set to a constant value of 1.
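To make the learning rule concrete, the sketch below assumes the update is the product of the learning rate, the modulatory signal of the downstream neuron, and the Hebbian term (pre- times post-synaptic activity), as the description above suggests. The function names, the use of `tanh` as the sigmoid, and the clamping of weights to [-1, 1] are assumptions made for illustration, not details confirmed by the text.

```python
import math

ETA = 0.04  # constant learning rate stated in the text


def modulation(mod_inputs, mod_weights):
    """m_i: sigmoid (tanh assumed) of the weighted sum of modulatory inputs."""
    return math.tanh(sum(w * a for w, a in zip(mod_weights, mod_inputs)))


def hebbian_delta(a_post, a_pre, m_post=1.0, eta=ETA):
    """Weight change eta * m_i * a_i * a_j.

    m_post=1.0 reproduces the non-modulated (pure Hebbian) control.
    """
    return eta * m_post * a_post * a_pre


def update_weight(w, a_post, a_pre, m_post=1.0, eta=ETA):
    """Apply the update; clamping to [-1, 1] is an assumption based on the
    weight range stated for the network."""
    return max(-1.0, min(1.0, w + hebbian_delta(a_post, a_pre, m_post, eta)))
```

A modulatory signal of zero freezes the connection, a positive signal strengthens correlated connections, and a negative signal weakens them, which is how learning can be switched on or off locally.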
Specifically, it is a modification of the widely used Non-dominated Sorting Genetic Algorithm (NSGA-II). However, NSGA-II does not take into account that one objective may be more important than others. In our case, network performance is essential to survival, and minimizing the sum of connection costs is a secondary priority. To capture this difference, we follow earlier work in having a stochastic version of Pareto dominance, in which the secondary objective (connection cost) only factors into selection for an individual with a given probability p. In the experiments reported here, the value of p was 0.75, but preliminary runs demonstrated that values of p of 0.25 and 0.5 led to qualitatively similar results, indicating that the results are robust to substantial changes to this value. However, a p value of 1 was found to overemphasize connection costs at the expense of performance, leading to pathological solutions that perform worse than the PA networks.
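One way to read the stochastic Pareto dominance described above is sketched here: each objective takes part in a dominance comparison with its own probability, so performance (probability 1) always counts while connection cost (probability 0.75) is sometimes ignored. The function names are hypothetical, and drawing the random subset once per comparison is an assumption; the actual Sferes implementation may differ.

```python
import random


def dominates(a, b):
    """Standard Pareto dominance for maximization: a is no worse than b on
    every objective and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def stochastic_dominates(a, b, probs, rng=random):
    """Compare a and b using only the objectives that are 'switched on'.

    probs[k] is the probability that objective k factors into this
    comparison (e.g. 1.0 for performance, 0.75 for connection cost).
    """
    active = [k for k, p in enumerate(probs) if rng.random() < p]
    return dominates([a[k] for k in active], [b[k] for k in active])


# Objectives are (performance, -connection_cost) so that both are maximized.
well_performing_but_costly = (0.9, -10.0)
cheap_but_worse = (0.8, -2.0)
print(stochastic_dominates(well_performing_but_costly, cheap_but_worse,
                           probs=(1.0, 0.75)))
```

When the cost objective is switched off, the better-performing network dominates outright; when it is switched on, the two networks are mutually non-dominated.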
To better capture the power of larger populations, which contain more diversity and thus are less likely to get trapped on local optima, we adopted the common technique of encouraging phenotypic diversity in the population [5, 73, 74]. Diversity was encouraged by adding a diversity objective to the multi-objective algorithm that selected for organisms whose network outputs were different from those of others in the population. As with performance, the diversity objective factors into selection 100% of the time (i.e. the probability p for PNSGA was 1). Technically, we register every choice (to eat or not) each individual makes and determine how different its sequence of choices is from the choices of other individuals: differences are calculated via a normalized bitwise XOR of the binary choice vectors of two individuals. For each individual, this difference is measured with regard to all other individuals, summed and normalized, resulting in a value between 0 and 1, which measures how different the behavior of this individual is from that of all other individuals. Preliminary experiments demonstrated that, for the problems in this paper, this diversity-promoting technique is necessary to reliably obtain functional networks in either treatment, and is thus a necessary prerequisite to conduct our study. This finding is in line with previous experiments that have shown that diversity is especially necessary for problems that involve learning, because learning problems are especially laden with local optima. All experiments were implemented in the Sferes evolutionary algorithm software package. The exact source code and experimental configuration files used in our experiments, along with data from all our experiments, are freely available in the online Dryad scientific archive at http://dx.doi.org/10.5061/dryad.s38n5.
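The behavioral diversity measure described above can be sketched as follows. The normalized bitwise XOR of two binary choice vectors is their Hamming distance divided by their length; averaging over the other individuals (dividing by the population size minus one) is an assumed normalization that keeps the score between 0 and 1. Function names are hypothetical.

```python
def behavioral_distance(choices_a, choices_b):
    """Normalized bitwise XOR of two equal-length binary choice vectors
    (1 = eat, 0 = ignore): the fraction of decisions that differ."""
    if len(choices_a) != len(choices_b):
        raise ValueError("choice vectors must have the same length")
    return sum(x ^ y for x, y in zip(choices_a, choices_b)) / len(choices_a)


def diversity_scores(population_choices):
    """Diversity of each individual: its mean distance to all others.

    Requires at least two individuals; the result lies in [0, 1].
    """
    n = len(population_choices)
    scores = []
    for i, mine in enumerate(population_choices):
        total = sum(behavioral_distance(mine, other)
                    for j, other in enumerate(population_choices) if j != i)
        scores.append(total / (n - 1))
    return scores


population = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
print(diversity_scores(population))  # [0.5, 0.5, 1.0]
```

The third individual behaves differently from both others and therefore receives the highest diversity score.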
In each generation, every new offspring network is a copy of its parent that is randomly mutated. Mutations can add a connection, remove a connection, change the strength of connections, move connections and change the type of neurons (switching between modulatory and non-modulatory). Probabilities and details for each mutational event are given in Supp. S1 Table. We chose these evolutionary parameters, including keeping things simple by not adding crossover, to maintain similarity with related experiments on evolving modularity and neuromodulated
During evolution, each individual is tested in four randomly generated environments (i.e. for four “lifetimes”, Fig. 2) that vary in which items are designated as food and poison, and in which order individuals encounter the items. Because there is variance in the difficulty of these random worlds, we test in 4 environments (lifetimes), instead of 1, to increase the sample size. We further increase the sample size to 80 environments (lifetimes) when measuring the performance of final, evolved, end-of-experiment individuals (e.g. Figs. 8 and 9). Individuals within the same generation are all subjected to the same four environments, but across generations the environments are randomized to select for learning, rather than genetically hard-coded solutions (Fig. 3). To start each environment (note: not season) from a clean slate, before being inserted in an environment the modulated weights of individuals are randomly initialized, which follows previous work with this neuromodulatory learning model. Modulatory connections never change, and thus do not need to be altered between environments. In the runs without neuromodulation, all connections are reset to their genetically specified weights.
2). Fitness is proportional to the number of food items consumed minus the number of poison items consumed across all environments (Supp. S7 Fig). Individuals that can successfully learn which items to eat and which to avoid are thus rewarded, and the best fitness scores are obtained by individuals that are able to retain this information across the fluctuating seasons (i.e. individuals that do not exhibit catastrophic forgetting).
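As a small illustration of the fitness measure described above, the sketch below counts food eaten minus poison eaten over a sequence of encounters. The text only states that fitness is proportional to this difference, so the function returns the raw difference; the function name and the list-based encoding of decisions are assumptions.

```python
def raw_fitness(ate_item, item_is_food):
    """Number of food items eaten minus number of poison items eaten.

    ate_item[k] is True if the agent ate the k-th encountered item;
    item_is_food[k] is True if that item was nutritious in that season.
    """
    food_eaten = sum(1 for ate, food in zip(ate_item, item_is_food)
                     if ate and food)
    poison_eaten = sum(1 for ate, food in zip(ate_item, item_is_food)
                       if ate and not food)
    return food_eaten - poison_eaten


# Ate food, ate poison, ignored poison, ate food: 2 - 1 = 1
print(raw_fitness([True, True, False, True], [True, False, False, True]))
```

An agent that forgets the summer associations during winter eats more poison and misses more food when summer returns, which lowers this score.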
That modularity optimization method relies on the maximization of a benefit function Q, which measures the difference between the number of connections within each module and the expected fraction of such connections given a “null model”, that is, a statistical model of random networks. High values of Q indicate an “unexpectedly modular” network.
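Leicht and Newman extend this model to directed networks by distinguishing between the in-degree and out-degree of each node in the degree sequence. The probability that the analyzed network has a connection between node i and j is therefore $k_i^{in} k_j^{out}/m$, where $k_i^{in}$ and $k_j^{out}$ are the in- and out-degrees of node i and j, respectively, and m is the total number of edges in the network. The modularity of a given decomposition for directed networks is as follows:
$$Q = \frac{1}{m} \sum_{ij} \left[ A_{ij} - \frac{k_i^{in} k_j^{out}}{m} \right] \delta_{c_i, c_j}$$
where $A_{ij}$ is 1 if there is an edge from node i to node j and 0 otherwise, and $\delta_{c_i, c_j}$ is 1 if nodes i and j belong to the same module and 0 otherwise.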
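The directed modularity score can be computed directly from its definition: sum, over all ordered pairs of nodes in the same module, the difference between the actual edge indicator and the null-model expectation (in-degree of one node times out-degree of the other, divided by the number of edges), then divide by the number of edges. The sketch below evaluates Q for a given decomposition only; it does not search for the best decomposition. The function name and the edge-list representation are assumptions.

```python
def directed_modularity(edges, module_of):
    """Q for a directed network under the degree-preserving null model.

    edges: iterable of (source, target) pairs (duplicates are ignored).
    module_of: dict mapping every node to its module label.
    """
    edge_set = set(edges)
    m = len(edge_set)
    nodes = list(module_of)
    k_in = dict.fromkeys(nodes, 0)
    k_out = dict.fromkeys(nodes, 0)
    for src, dst in edge_set:
        k_out[src] += 1
        k_in[dst] += 1

    q = 0.0
    for i in nodes:
        for j in nodes:
            if module_of[i] != module_of[j]:
                continue  # the delta term is 0 for nodes in different modules
            a_ij = 1.0 if (i, j) in edge_set else 0.0
            q += a_ij - k_in[i] * k_out[j] / m
    return q / m


edges = [(0, 1), (2, 3)]
print(directed_modularity(edges, {0: "A", 1: "A", 2: "B", 3: "B"}))  # 0.5
print(directed_modularity(edges, {0: "A", 1: "A", 2: "A", 3: "A"}))  # 0.0
```

Splitting the two unconnected edges into separate modules scores higher than lumping all nodes together, matching the intuition that Q rewards decompositions with more within-module edges than expected by chance.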
Our results are qualitatively unchanged when using layered, feed-forward networks as the “null model” to compute and optimize Q (Supp. S2 Table).
Here we applied the spectral optimization method, which gives good results in practice at a low computational cost [67, 78]. As suggested by Leicht and Newman, each module is split in two until the next split stops increasing the modularity score.
Analyses are based on the highest-performing network from each trial. The experiments lasted 20,000 generations and had a population size of 400.
Each season lasted 5 days, and the seasons cycled through 3 years (Fig. 2). In each season, 2 poisonous items and 2 nutritious items were available, each item encoded by a separate input neuron (i.e. a “one-hot encoding”).
Each day we randomize the order in which food items are presented, yielding 4! = 24 different possibilities per day. There are in total 5 days per season, and an individual lives for 6 seasons, resulting in 5 × 6 = 30 days per lifetime (Fig. 2), and thus 24 × 30 = 720 different ways to visit the items in a single lifetime. In addition to randomizing the order in which items are visited, the edibility associations agents are supposed to learn are randomized between environments. We randomly designate 2 of the 4 items as nutritious food, giving $\binom{4}{2} = 6$ different possibilities for summer and 6 different possibilities for winter. There are thus a total of 6 × 6 = 36 different ways to organize edibility associations across both seasons. In total, we have 720 × 36 = 25,920 unique environments, reflecting the 720 different ways food items can be presented and the 36 possible edibility associations. As mentioned in the previous section, four of these environments were seen by each individual during evolution, and 80 of them were seen in the final performance tests. In both cases they were selected at random from the set of 25,920.
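The environment count above can be reproduced step by step. The sketch simply follows the arithmetic as stated in the text (24 orderings per day, 30 days, 6 food designations per season); the variable names are illustrative.

```python
import math

items = 4
orders_per_day = math.factorial(items)             # 4! = 24 presentation orders
days_per_lifetime = 5 * 6                          # 5 days x 6 seasons = 30
orderings = orders_per_day * days_per_lifetime     # 24 x 30 = 720, as counted in the text

food_designations = math.comb(items, 2)            # choose 2 of 4 items as food = 6
edibility_associations = food_designations ** 2    # summer x winter = 36

total_environments = orderings * edibility_associations
print(total_environments)  # 25920
```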
95% bootstrapped confidence intervals of the median are calculated by re-sampling the data 5,000 times. In Fig. 4, we smooth the plotted values with a median filter to remove sampling noise. The median filter has a window size of 11, and we plot each 10 generations, meaning the median spans a total of 110 generations.
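Both statistical procedures mentioned above are short enough to sketch with the Python standard library. The percentile method for reading the interval off the resampled medians and the truncation of the filter window at the ends of the series are assumptions; the text specifies only the number of resamples (5,000) and the window size (11).

```python
import random
import statistics


def bootstrap_median_ci(data, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval of the median.

    Resamples the data with replacement n_resamples times and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled medians.
    """
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def median_filter(values, window=11):
    """Replace each value with the median of the window centered on it.

    Near the ends of the series the window is truncated (assumption).
    """
    half = window // 2
    return [statistics.median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


print(median_filter([1, 1, 9, 1, 1], window=3))  # the spike at 9 is removed
```

With values plotted every 10 generations, a window of 11 points covers the 110-generation span mentioned above.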
8 and 9), further learning was disabled. The process is thus (1) learn food associations, (2) measure what was learned and forgotten without further learning, and (3) repeat. Disabling learning allows measurements of what has been learned without the evaluation changing that learned information.
Inputs provide information about the environment. The output is interpreted as the decision to eat a food item or ignore it. (TIFF)
Dark blue nodes are inputs that encode which type of food has been encountered. Light blue nodes indicate internal, non-modulatory neurons. Red nodes are reward or punishment inputs that indicate if a nutritious or poisonous item has been eaten. Orange nodes are neuromodulatory neurons that regulate learning. In the cases where an input neuron was modulatory, we indicate this with an orange circle around the neuron. In each panel, the left number reports performance and the right number reports modularity. We follow the convention of placing nodes in the way that minimizes the total connection length. (TIFF)
S3 Fig. The highest-performing networks from all of the 100 experiments in the PA treatment (part 2 of 2). See the previous figure caption for more details. (TIFF)
Dark blue nodes are inputs that encode which type of food has been encountered. Light blue nodes indicate internal, non-modulatory neurons. Red nodes are reward or punishment inputs that indicate if a nutritious or poisonous item has been eaten. Orange nodes are neuromodulatory neurons that regulate learning. In the cases where an input neuron was modulatory, we indicate this with an orange circle around the neuron. In each panel, the left number reports performance and the right number reports modularity. We follow the convention of placing nodes in the way that minimizes the total connection length. (TIFF)
S5 Fig. The highest-performing networks from all of the 100 experiments in the P&CC
See the previous figure caption for more details. (TIFF)
Shows how old associations are forgotten as new ones are learned for the two experimental treatments. The treatment with a connection cost (P&CC) was able to learn the associations better and shows a more gradual forgetting in the first timesteps. Together, this leads it to outperform the regular treatment (PA) significantly when measuring how fast individuals forget. Note that networks were evolved with five days per season, so the results during those first five days are the most informative regarding the evolutionary mitigation of catastrophic forgetting: we show additional days to reveal longer-term consequences of the evolved architectures. (TIFF)
The example describes what happens when an agent encounters a food item during summer. For the winter season, the process is the same, but with winter inputs active instead of summer inputs. (TIFF)
The mutation operators along with their probabilities of affecting an individual. (TIFF)
Two different null models for calculating the modularity score. The conventional way to calculate modularity is inherently relative: one computes the modularity of network N by searching for the modular decomposition (assigning N’s nodes to different modules) that maximizes the number of edges within the modules compared to the number of expected edges given by a statistical model of random, but similar, networks called the “null model”. There are different ways to model random networks, depending on the type of networks being measured and their topological constraints. Here, we calculated the modularity Q-score with two different null models, one modeling random, directed networks and the other modeling random, layered, feed-forward networks. When calculating modularity with either null model, P&CC networks are significantly more modular than PA networks. $A_{ij}$ is 1 if there is an edge from node i to node j, and 0 otherwise, $k_i^{in}$ and $k_j^{out}$ are the in- and out-degrees of node i and j, respectively, m is the total number of edges in the network, $m_{ij}$ is the number of edges between the layer containing node i and the layer containing node j, and $\delta_{c_i, c_j}$ is a function that is 1 if i and j belong to the same module, and 0 otherwise. (TIFF)
Performed the experiments: KOE. Analyzed the data: KOE JBM JC. Contributed reagents/materials/analysis tools: KOE JBM. Wrote the paper: KOE JBM JC. Developed the software used in experiments: JBM.
See all papers in April 2015 that mention neural networks.
See all papers in PLOS Comp. Biol. that mention neural networks.