How Attention Can Create Synaptic Tags for the Learning of Working Memories in Sequential Tasks
Jaldert O. Rombouts, Sander M. Bohte, Pieter R. Roelfsema

Abstract

Neurons in association cortex are thought to be essential for the ability to learn appropriate responses to new stimuli and situations. During learning these neurons become tuned to relevant features and start to represent them with persistent activity during memory delays. This learning process is not well understood. Here we develop a biologically plausible learning scheme that explains how trial-and-error learning induces neuronal selectivity and working memory representations for task-relevant information. We propose that the response selection stage sends attentional feedback signals to earlier processing levels, forming synaptic tags at those connections responsible for the stimulus-response mapping. Globally released neuromodulators then interact with tagged synapses to determine their plasticity. The resulting learning rule endows neural networks with the capacity to create new working memory representations of task-relevant information as persistent activity. It is remarkably generic: it explains how association neurons learn to store task-relevant information for linear as well as nonlinear stimulus-response mappings, how they become tuned to category boundaries or analog variables, depending on the task demands, and how they learn to integrate probabilistic evidence for perceptual decisions.

Author Summary

Most, if not all, tasks that one can imagine require some form of working memory. The optimal solution of a working memory task depends on information that was presented in the past, for example choosing the right direction at an intersection based on a road-sign some hundreds of meters before. Interestingly, animals like monkeys readily learn difficult working memory tasks, just by receiving rewards such as fruit juice when they perform the desired behavior. Neurons in association areas in the brain play an important role in this process; these areas integrate perceptual and memory information to support decision-making. Some of these association neurons become tuned to relevant features and memorize the information that is required later as a persistent elevation of their activity. It is, however, not well understood how these neurons acquire their task-relevant tuning. Here we formulate a simple biologically plausible learning mechanism that can explain how a network of neurons can learn a wide variety of working memory tasks by trial-and-error learning. We also show that the solutions learned by the model are comparable to those found in animals when they are trained on similar tasks.

Introduction

Animals such as monkeys can learn to map sensory stimuli onto responses, to store task-relevant information, and to integrate and combine unreliable sensory evidence. Training induces new stimulus and memory representations in ‘multiple-demand’ regions of the cortex [1]. For example, if monkeys are trained to memorize the location of a visual stimulus, neurons in the lateral intraparietal cortex (LIP) represent this location as a persistent increase of their firing rate [2,3]. However, if the animals learn a visual categorization task, persistent activity of LIP cells becomes tuned to the boundary between categories [4], whereas the neurons integrate probabilistic evidence if the task is sensory decision making [5]. Similar effects of training on persistent activity have been observed in the somatosensory system. If monkeys are trained to compare the frequencies of successive vibrotactile stimuli, working memory representations of analog variables are formed in somatosensory, prefrontal and motor cortex [6].

We here outline AuGMEnT (Attention-Gated MEmory Tagging), a new reinforcement learning [7] scheme that explains the formation of working memories during trial-and-error learning and that is inspired by the role of attention and neuromodulatory systems in the gating of neuronal plasticity. AuGMEnT addresses two well-known problems in learning theory: temporal and structural credit-assignment [7,8]. The temporal credit-assignment problem arises if an agent has to learn actions that are only rewarded after a sequence of intervening actions, so that it is difficult to assign credit to the appropriate ones. AuGMEnT solves this problem like previous temporal-difference reinforcement learning (RL) theories [7]. It learns action-values (known as Q-values [7]), i.e. the amount of reward that is predicted for a particular action when executed in a particular state of the world. If the outcome deviates from the reward-prediction, a neuromodulatory signal that codes the global reward-prediction error (RPE) gates synaptic plasticity in order to change the Q-value, in accordance with experimental findings [9-12]. The key new property of AuGMEnT is that it can also learn tasks that require working memory, thus going beyond standard RL models [7,13].

Which synapses should change to improve performance? This is the structural credit-assignment problem, and AuGMEnT solves it with an ‘attentional’ feedback mechanism. The output layer has feedback connections to units at earlier levels that provide feedback to those units that were responsible for the action that was selected [14]. We propose that this feedback signal tags [15] relevant synapses and that the persistence of tags (known as eligibility traces [7,16]) permits learning if time passes between the action and the RPE [see 17]. We will here demonstrate the neuroscientific plausibility of AuGMEnT. A preliminary and more technical version of these results has been presented at a conference [18].

Model

Model architecture

The model is a feedforward neural network with three layers: a sensory input layer, an association layer and a Q-value output layer (Fig. 1). Time was modeled in discrete steps.

Input layer

The sensory layer represents stimuli with instantaneous and transient units (Fig. 1). Instantaneous units represent the current sensory stimulus x(t) and are active as long as the stimulus is present. Transient units represent changes in the stimulus and behave like ‘on (+)’ and ‘off (−)’ cells in sensory cortices.

They encode positive and negative changes in the sensory input with respect to the previous time step t−1:

x_i^+(t) = [x_i(t) − x_i(t−1)]^+,    x_i^−(t) = [x_i(t−1) − x_i(t)]^+,

where [·]^+ is a threshold operation that returns 0 for all negative values, but leaves positive values unchanged. Every input is therefore represented by three sensory units. We assume that all units have zero activity at the start of the trial (t = 0), and that t = 1 at the first time-step of the trial.
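For concreteness, the following sketch (ours, not the authors' code) shows how an input vector could be expanded into instantaneous, on (+) and off (−) units with NumPy; the function name and representation are illustrative assumptions.

```python
import numpy as np

def encode_input(x_t, x_prev):
    """Expand a stimulus vector into instantaneous, on (+) and off (-) units.

    x_t    : current stimulus vector x(t)
    x_prev : stimulus vector at the previous time step x(t-1)
    """
    x_t, x_prev = np.asarray(x_t, float), np.asarray(x_prev, float)
    on = np.maximum(x_t - x_prev, 0.0)    # [x(t) - x(t-1)]^+
    off = np.maximum(x_prev - x_t, 0.0)   # [x(t-1) - x(t)]^+
    return x_t, on, off

# A stimulus that appears on the current time step drives the on-units;
# one that disappears drives the off-units.
print(encode_input([1.0, 0.0], [0.0, 1.0]))
```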

Association layer

We use the term ‘regular unit’ to reflect the fact that these are regular sigmoidal units that do not exhibit persistent activity in the absence of input. Regular units j are fully connected to the instantaneous units i in the sensory layer by connections v_ij^R (the superscript R indexes synapses onto regular units, and v_0j^R is a bias weight):

inp_j^R(t) = Σ_i v_ij^R x_i(t),    y_j^R(t) = σ(inp_j^R(t)),

where inp_j^R(t) denotes the synaptic input and σ a sigmoidal activation function; our results do not depend on this particular choice of σ. For the standard logistic function, the derivative of y_j^R(t) can be conveniently expressed as σ'(inp_j^R) = y_j^R(1 − y_j^R).

Memory units m (diamonds in Fig. 1) are fully connected to the transient (+/−) units in the sensory layer by connections v_lm^M (the superscript M indexes synapses onto memory units) and they integrate their input over the duration of the trial:

inp_m^M(t) = inp_m^M(t−1) + Σ_l v_lm^M x'_l(t),    y_m^M(t) = σ(inp_m^M(t)),

where we use the shorthand x'_l that stands for both + and − cells, so Σ_l v_lm^M x'_l(t) should be read as Σ_l v_lm^M+ x_l^+(t) + Σ_l v_lm^M− x_l^−(t).

The selective connectivity between the transient input units and the memory cells is advantageous. We found that the learning scheme is less stable when memory units also receive input from the instantaneous input units, because in that case even weak constant input becomes integrated across time as an activity ramp. We note, however, that there are other neuronal mechanisms that could prevent the integration of constant inputs. For example, the synapses between instantaneous input units and memory units could be rapidly adapting, so that the memory units only integrate variations in their input.

The simulated integration process causes persistent changes in the activity of memory units. It is easy to see that the activity of a memory unit equals the activity of a hypothetical regular unit that would receive input from all previous time-steps of the trial at the same time. To keep the model simple, we do not simulate the mechanisms responsible for persistent activity, which have been addressed in previous work [20-22]. Although the perfect integration assumed in Eqn. (7) does not exist in reality, we suggest that it is an acceptable approximation for trials with a relatively short duration, as in the tasks described below. Indeed, there are reports of single neuron integrators in entorhinal cortex with stable firing rates that persist for ten minutes or more [23], which is orders of magnitude longer than the trials modeled here. In neurophysiological studies in behaving animals, the neurons that behave like regular and memory units in e.g. LIP [2,3] and frontal cortex [24] would be classified as visual cells and memory cells, respectively.
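The activations of the two types of association units could be computed as in the sketch below, a minimal NumPy transcription under the assumption of a logistic activation function and omitted bias terms; all variable names are ours.

```python
import numpy as np

def sigma(a):
    """Sigmoidal activation; the logistic function is one possible choice."""
    return 1.0 / (1.0 + np.exp(-a))

def association_step(x_inst, x_on, x_off, v_R, v_M, inp_M_prev):
    """One time step of the association layer.

    x_inst      : instantaneous sensory units (input to regular units)
    x_on, x_off : transient on/off sensory units (input to memory units)
    v_R         : weights from instantaneous units onto regular units
    v_M         : weights from the concatenated on/off units onto memory units
    inp_M_prev  : synaptic input integrated by the memory units so far this trial
    """
    inp_R = x_inst @ v_R                     # regular units see the current input only
    y_R = sigma(inp_R)

    x_trans = np.concatenate([x_on, x_off])  # shorthand x' for both + and - cells
    inp_M = inp_M_prev + x_trans @ v_M       # memory units integrate over the trial
    y_M = sigma(inp_M)
    return y_R, y_M, inp_R, inp_M
```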

Q-value layer

The third layer receives input from the association layer through plastic connections w_jk (Fig. 1). Its task is to compute action-values (i.e. Q-values [7]) for every possible action. Specifically, a Q-value unit aims to represent the (discounted) expected reward for the remainder of a trial if the network selects action a in the current state s [7]:

Q^π(s,a) = E_π[R_t | s_t = s, a_t = a],

where E_π[R_t | s_t = s, a_t = a] is the expected discounted future reward R_t given a and s, under action-selection policy π, and γ ∈ [0,1] determines the discounting of future rewards r. It is informative to explicitly write out the above expectation to see that Q-values are recursively defined as:

Q^π(s,a) = Σ_{s'∈S} P^a_{ss'} [ R^a_{ss'} + γ Σ_{a'∈A} π(s',a') Q^π(s',a') ],

where P^a_{ss'} is a transition matrix, containing the probabilities that executing action a in state s will move the agent to state s', R^a_{ss'} is the expected reward for this transition, and S and A are the sets of states and actions, respectively. Note that the action selection policy π is assumed to be stochastic in general. By executing the policy π, an agent samples trajectories according to the probability distributions π, P^a_{ss'} and R^a_{ss'}, where every observed transition can be used to update the original prediction Q(s_t, a_t). Importantly, temporal difference learning schemes such as AuGMEnT are model-free, which means that they do not need explicit access to these probability distributions while improving their Q-values. Q-value units k are fully connected to the association layer by connections w^R_jk (from regular units, with w^R_0k as bias weight) and w^M_mk (from memory units). The action value q_k(t) is estimated as:

q_k(t) = Σ_j w^R_jk y^R_j(t) + Σ_m w^M_mk y^M_m(t).

Association layer units must therefore learn to represent and memorize information about the environment to compute the value of all possible actions a. They transform a so-called partially observable Markov decision process (POMDP), in which the optimal decision depends on information presented in the past, into a simpler Markov decision process (MDP) by storing relevant information as persistent activity, making it available for the next decision.
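Given the association-layer activities, the Q-values are a linear readout; a short sketch (bias weights omitted, names ours):

```python
import numpy as np

def q_values(y_R, y_M, w_R, w_M):
    """q_k(t) = sum_j w_R[j,k] * y_R[j] + sum_m w_M[m,k] * y_M[m]."""
    return y_R @ w_R + y_M @ w_M
```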

Action selection

The network usually chooses the action a with the highest value, but occasionally explores other actions to improve its value estimates. We used a Max-Boltzmann controller [25] to implement the action selection policy π. It selects the greedy action (highest q_k(t), ties broken randomly) with probability 1 − ε, and with small probability ε it selects a random action k sampled from the Boltzmann distribution P_B(k) = exp(q_k(t)) / Σ_{k'} exp(q_{k'}(t)).

We assume that the controller is implemented downstream, e.g. in the motor cortex or basal ganglia, but do not simulate the details of action selection, which have been addressed previously [26-30]. After selecting an action a, the activity in the third layer becomes z_k = δ_ka, where δ_ka is the Kronecker delta function (1 if k = a and 0 otherwise). In other words, the selected action is the only one active after the selection process, and it then provides an “attentional” feedback signal to the association cortex (orange feedback connections in Fig. 1A).
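A possible implementation of the Max-Boltzmann controller and the winner-take-all activity z_k; the epsilon default follows Table 1, but the function itself is our illustration, not the authors' code.

```python
import numpy as np

def select_action(q, epsilon=0.025, rng=None):
    """Greedy action with probability 1-epsilon; otherwise sample from the
    Boltzmann distribution over the Q-values."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q, float)
    if rng.random() > epsilon:
        a = rng.choice(np.flatnonzero(q == q.max()))  # ties broken randomly
    else:
        p = np.exp(q - q.max())                       # Boltzmann probabilities
        a = rng.choice(len(q), p=p / p.sum())
    z = np.zeros_like(q)
    z[a] = 1.0                                        # z_k = delta_ka after selection
    return a, z
```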

Learning

Once an action is selected, the unit that codes the winning action a feeds back to earlier processing levels to create synaptic tags [31,32], also known as eligibility traces [7,16], on the responsible synapses (orange pentagons in Fig. 1). Tagging of connections from the association layer to the motor layer follows a form of Hebbian plasticity: the tag strength depends on presynaptic activity (y_j) and postsynaptic activity after action selection (z_k), and tags thus only form at synapses w_ja onto the winning (i.e. selected) motor unit a:

ΔTag_jk = −α Tag_jk + y_j z_k.

Here, Δ denotes the change in one time-step, i.e. Tag(t+1) = Tag(t) + ΔTag(t). The formation of tags on the feedback connections w'_kj follows the same rule, so that the strength of feedforward and feedback connections becomes similar during learning, in accordance with neurophysiological findings [33]. Thus, the association units that provided strong input to the winning action a also receive the strongest feedback (Fig. 1, middle panel): they will be held responsible for the outcome of a. Importantly, the attentional feedback signal also guides the formation of tags on the connections v_ij, so that synapses from the input layer onto responsible association units j (strong w_ja) are most strongly tagged (Fig. 1B). For regular units we propose:

ΔTag_ij = −α Tag_ij + x_i σ'(inp_j^R) w'_aj,

where σ'(inp_j^R) is the derivative of the activation function (Eq. (5)), which determines the influence that a change in the input inp_j has on the activity of unit j. The idea is illustrated in Fig. 1B. Feedback from the winning action (lower synapse in Fig. 1B) enables the formation of tags on the feedforward connections onto the regular unit. These tags can interact with globally released neuromodulators that inform all synapses about the RPE (green cloud ‘δ’ in Fig. 1). Note that feedback connections only influence the plasticity of representations in the association layer but do not influence activity in the present version of the model. We will come back to this point in the Discussion.
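The tag updates for the synapses onto the Q-value units and for the sensory-to-regular-unit synapses might look as follows (a sketch assuming the logistic derivative y(1 − y); matrix layouts and names are ours):

```python
import numpy as np

def update_tags_regular(tag_w, tag_v, y_assoc, z, x_inst, y_R, w_fb, alpha):
    """Tag formation after action selection (cf. Eqs. (13)-(14)).

    tag_w   : tags on association -> Q-value synapses (n_assoc x n_actions)
    tag_v   : tags on instantaneous-input -> regular-unit synapses (n_in x n_reg)
    y_assoc : activity of all association units projecting to the Q-value layer
    z       : one-hot vector of the selected action
    x_inst  : instantaneous sensory activity; y_R : regular-unit activity
    w_fb    : feedback weights from Q-value units onto regular units (n_actions x n_reg)
    """
    tag_w = (1.0 - alpha) * tag_w + np.outer(y_assoc, z)   # only the selected column grows
    fb = z @ w_fb                                          # attentional feedback to unit j
    y_prime = y_R * (1.0 - y_R)                            # derivative of the logistic
    tag_v = (1.0 - alpha) * tag_v + np.outer(x_inst, y_prime * fb)
    return tag_w, tag_v
```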

For the memory units we introduce synaptic traces. These traces are located on the synapses from the sensory units onto the memory cells. Any presynaptic activity at these synapses leaves a trace that persists for the duration of a trial:

sTrace_lm(t) = sTrace_lm(t−1) + x'_l(t).

If one of the selected actions provides a feedback signal (panel iv in Fig. 1C) to the postsynaptic memory unit, the trace gives rise to a tag, making the synapse plastic as it can now interact with globally released neuromodulators:

ΔTag_lm = −α Tag_lm + sTrace_lm σ'(inp_m^M) w'_am.

We assume that the time scale of trace updates is fast compared to the tag updates, so that tags are updated with the latest traces. The traces persist for the duration of the trial, but all tags decay exponentially (0 < α < 1).
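For the synapses onto the memory units, traces accumulate presynaptic activity and feedback converts them into tags; a corresponding sketch under the same assumptions and naming as above:

```python
import numpy as np

def update_memory_traces_and_tags(trace, tag_vm, x_trans, y_M, z, w_fb_M, alpha):
    """Traces and tags on transient-input -> memory-unit synapses (cf. Eqs. (15)-(16)).

    trace   : accumulated presynaptic activity this trial (n_trans x n_mem)
    tag_vm  : tags on the same synapses
    x_trans : concatenated on/off sensory activity at this time step
    y_M     : memory-unit activity; z : one-hot vector of the selected action
    w_fb_M  : feedback weights from Q-value units onto memory units (n_actions x n_mem)
    """
    trace = trace + x_trans[:, None]              # traces persist for the whole trial
    fb = z @ w_fb_M                               # feedback converts traces into tags
    y_prime = y_M * (1.0 - y_M)
    tag_vm = (1.0 - alpha) * tag_vm + trace * (y_prime * fb)[None, :]
    return trace, tag_vm
```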

Moreover, an action a taken at time step t−1 may have caused a change in the sensory stimulus. For example, in most studies of monkey vision, a visual stimulus appears if the animal directs gaze to a fixation point. In the model, the new stimulus causes feedforward processing on the next time step t, which results in another set of Q-values. To evaluate whether a was better or worse than expected, the model compares the predicted outcome q_a(t−1), which has to be temporarily stored in the system, to the sum of the reward r(t) and the discounted action-value q_a'(t) of the unit a' that wins the subsequent stochastic WTA-competition. This temporal difference learning rule is known as SARSA [7,34]:

δ(t) = r(t) + γ q_a'(t) − q_a(t−1).

Neurons representing action values have been found in the frontal cortex, basal ganglia and midbrain [12,35,36], and some orbitofrontal neurons specifically code the chosen value, q_a [37]. Moreover, dopamine neurons in the ventral tegmental area and substantia nigra represent δ [9,10,38]. In the model, the release of neuromodulators makes δ available throughout the brain (green cloud in Fig. 1). Plasticity of all synapses depends on the product of δ and tag strength:

Δv_ij = β δ(t) Tag_ij,    Δw_jk = β δ(t) Tag_jk,

where β is the learning rate, and where the latter equation also holds for the feedback weights w'_kj. These equations capture the key idea of AuGMEnT: tagged synapses are held accountable for the RPE and change their strength accordingly. Note that AuGMEnT uses a four-factor learning rule for the synapses v_ij. The first two factors are the pre- and postsynaptic activity that determine the formation of tags (Eqns. (14)-(16)). The third factor is the “attentional” feedback from the motor selection stage, which ensures that tags are only formed in the circuit that is responsible for the selected action. The fourth factor is the RPE δ, which reflects whether the outcome of an action was better or worse than expected and determines if the tagged synapses increase or decrease in strength. The computation of the RPE demands the comparison of Q-values in different time-steps. The RPE at time t depends on the action that the network selected at t−1 (see Eqn. (17) and the next section), but the activities of the units that gave rise to this selection have typically changed at time t. The synaptic tags solve this problem because they labeled the synapses that were responsible for the selection of the previous action.
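The global RPE and the tag-gated weight updates then amount to a few lines. The β and γ defaults below follow Table 1; the function is our sketch of Eqs. (17)-(18), not the authors' code.

```python
import numpy as np

def rpe_and_update(r_t, q_next, q_prev, weights, tags, beta=0.15, gamma=0.90):
    """SARSA reward-prediction error and plasticity of all tagged synapses.

    r_t     : reward received at time t
    q_next  : Q-value of the action selected at time t (use 0 at a terminal state)
    q_prev  : Q-value of the action selected at time t-1
    weights : list of weight matrices; tags : matching list of tag matrices
    """
    delta = r_t + gamma * q_next - q_prev                        # globally broadcast RPE
    weights = [w + beta * delta * tag for w, tag in zip(weights, tags)]
    return delta, weights
```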

The equations that govern the updates of the tags (Eqns. (13), (14), (16)) and traces (Eq. (15)) and the equations that govern plasticity (Eq. (18)) rely only on information that is available locally, at the synapse. Furthermore, the hypothesis that a neuromodulatory signal, like dopamine, broadcasts the RPE to all synapses in the network is supported by neurobiological findings [9,10,38].

Results

AuGMEnT adjusts the synaptic weights to minimize the reward-prediction errors (Eq. (17)) of the transitions that are experienced by the network by online gradient descent. Although AuGMEnT is not guaranteed to find optimal solutions (we cannot provide a proof of convergence), we found that it reliably learns difficult nonlinear working memory problems, as will be illustrated below.

AuGMEnT minimizes the reward-prediction error (RPE)

The RPE δ(t) implies a comparison between two quantities: the predicted Q-value before the transition, q_a(t−1), and a target Q-value r(t) + γq_a'(t), which consists of the actually observed reward and the next predicted Q-value [7]. If the two terms cancel, the prediction was correct. SARSA aims to minimize the prediction error by adjusting the network weights w to improve the prediction q_a(t−1), bringing it closer to the observed value r(t) + γq_a'(t). It is convenient to do this through online gradient descent on the squared prediction error

E(q_a(t−1)) = ½ (r(t) + γ q_a'(t) − q_a(t−1))²,

so that the weight update is proportional to the gradient of the predicted Q-value q_a(t−1) with respect to the parameters w. Note that E is defined with regard to the sampled transition only, so that the definition typically differs between successive transitions experienced by the network. For notational convenience we will abbreviate E(q_a(t−1)) to E_qa in the remainder of this paper.
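In standard notation (our transcription of Eq. (19) and of the resulting update step), the error and the gradient-descent step read:

```latex
E\big(q_a(t-1)\big) = \tfrac{1}{2}\big(r(t) + \gamma\, q_{a'}(t) - q_a(t-1)\big)^2
                    = \tfrac{1}{2}\,\delta(t)^2 ,
\qquad
\Delta w = -\beta \frac{\partial E}{\partial w}
         = \beta\, \delta(t)\, \frac{\partial q_a(t-1)}{\partial w}.
```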

The RPE is high if the sum of the reward r(t) and the discounted q_a'(t) deviates strongly from the prediction q_a(t−1) made on the previous time step. As in other SARSA methods, the updating of synaptic weights is only performed for the transitions that the network actually experiences. In other words, AuGMEnT is a so-called “on-policy” learning method [7]. We will first establish the equivalence of the online gradient descent defined in Equation (19) and the AuGMEnT learning rule for the synaptic weights w_jk from the regular units onto the Q-value units. According to Equation (19), the weight w_ja onto the unit coding the action a that was selected at time step t−1 should change as

Δw_ja = β δ(t) ∂q_a(t−1)/∂w_ja,

leaving the other weights w_jk, k ≠ a, unchanged. We will now show that AuGMEnT causes equivalent changes in synaptic strength. It follows from the definition of q_k(t) that ∂q_a(t−1)/∂w_ja equals y^R_j(t−1), the activity of association unit j on the previous time step. This result allows us to rewrite (20) as

Δw_ja = β δ(t) y^R_j(t−1).

In the special case α = 1, it follows that on time step t, Tag_ja(t) = y^R_j(t−1), whereas the tags on synapses onto output units k ≠ a equal zero. The first equality holds for the synapses onto the selected action a, and the second, generalized, statement follows from the fact that z_k = 0 for output units k ≠ a that were not selected and that therefore do not contribute to the RPE. Inspection of Eqns. (18) and (23) reveals that AuGMEnT indeed takes a step of size β in the direction opposite to the error gradient of Equation (19) (provided α = 1; we discuss the case α ≠ 1 below). The updates for synapses between memory units m and Q-value units k are equivalent to those between regular units and the Q-value units. The plasticity of the feedback connections w'_kj and w'_km from the Q-value layer to the association layer follows the same rule as the updates of the connections w^R_jk and w^M_mk, and the feedforward and feedback connections between two units therefore become proportional during learning [14].

We will now show that the synapses v^R_ij between the input layer and the regular association units (Fig. 1) also change according to the negative gradient of the error function defined above. Applying the chain rule to compute the influence of v^R_ij on q_a(t−1) results in

∂q_a(t−1)/∂v^R_ij = x_i(t−1) σ'(inp^R_j(t−1)) w^R_ja.

The amount of attentional feedback that was received by unit j from the selected Q-value unit a at time t−1 is equal to w'_aj, because the activity of unit a equals 1 once it has been selected. As indicated above, learning makes the strengths of feedforward and feedback connections similar, so that w^R_ja can be estimated as the amount of feedback w'_aj that unit j receives from the selected action. A comparison to Eq. (18) demonstrates that AuGMEnT also takes a step of size β in the direction opposite to the error gradient for these synapses.

The final set of synapses that needs to be considered are those between the transient sensory units and the memory units. We approximate the total input inp^M_m(t) of a memory unit m as

inp^M_m(t) ≈ Σ_{t'≤t} Σ_l v^M_lm x'_l(t').

The approximation is good if the synapses v^M_lm change slowly during a trial. According to Equation (19), the update for these synapses can be written as

Δv^M_lm = β δ(t) Σ_{t'≤t−1} x'_l(t') σ'(inp^M_m(t−1)) w^M_ma.

Eq. (16) states that ΔTag_lm = −α Tag_lm + sTrace_lm σ'(inp^M_m) w'_am, because the feedback from the winning action a converts the trace into a tag (panel iv in Fig. 1C). Thus, if α = 1 then

Tag_lm(t) = sTrace_lm(t−1) σ'(inp^M_m(t−1)) w'_am.

Comparison of Eqns. (31) and (18) shows that AuGMEnT takes a step of size β in the direction opposite to the error gradient, just as is the case for all other categories of synapses. We conclude that AuGMEnT causes an online gradient descent on all synaptic weights to minimize the temporal difference error if α = 1. AuGMEnT provides a biological implementation of the well-known RL method called SARSA, although it also goes beyond traditional SARSA [7] by (i) including memory units, (ii) representing the current state of the external world as a vector of activity at the input layer, (iii) providing an association layer that aids in computing Q-values that depend non-linearly on the input, thus providing a biologically plausible equivalent of the error-backpropagation learning rule [8], and (iv) using synaptic tags and traces (Fig. 1B,C) so that all the information necessary for plasticity is available locally at every synapse.

If a memory unit j receives input from input unit i, then a trace of this input is maintained at synapse v_ij for the remainder of the trial (blue circle in Fig. 1C). Suppose that j, in turn, is connected to action a, which is selected at a later point in time. Now unit j receives feedback from a so that the trace on synapse v_ij becomes a tag, making it sensitive to the globally released neuromodulator that codes the RPE δ (panel iv in Fig. 1C). If the outcome of a was better than expected (δ > 0) (green cloud in panel v), v_ij strengthens (thicker synapse in panel vi). When the stimulus that activated unit i reappears on a later trial, the larger v_ij increases unit j’s persistent activity, which, in turn, enhances the activity of the Q-value unit representing a, thereby decreasing the RPE.

In SARSA, learning speeds up if the eligibility traces do not fully decay on every time step but decay exponentially with parameter λ ∈ [0,1] [7]; the resulting rule is called SARSA(λ). In AuGMEnT, the parameter α plays an equivalent role, and precise equivalence can be obtained by setting α = 1 − λγ, as can be verified by making this substitution in Eqns. (13), (14) and (16) (noting that Tag(t+1) = Tag(t) + ΔTag(t)). It follows that tags decay exponentially as Tag(t+1) = λγ·Tag(t) (in the absence of new tag formation), equivalent to the decay of eligibility traces in SARSA(λ). These results establish the correspondence between the biologically inspired AuGMEnT learning scheme and the RL method SARSA(λ). A special condition occurs at the end of a trial. The activities of memory units, traces, tags, and Q-values are set to zero (see [7]), after updating of the weights with a δ that reflects the transition to the terminal state. In the remainder of the Results section we will illustrate how AuGMEnT can train multi-layered networks of the form shown in Fig. 1 to perform a large variety of tasks that have been used to study neuronal representations in the association cortex of monkeys.

Using AuGMEnT to simulate animal learning experiments

The first three tasks have been used to study the influence of learning on neuronal activity in area LIP, and the fourth task has been used to study vibrotactile working memory in multiple cortical regions. All tasks have a similar overall structure: the monkey starts a trial by directing gaze to a fixation point or by touching a response key. Then stimuli are presented to the monkey and it has to respond with the correct action after a memory delay. At the end of a trial, the model could choose between two possible actions. The full task reward (r_f, 1.5 units) was given if this choice was correct, while we aborted trials and gave no reward if the model made the wrong choice or broke fixation (released the key) before a go signal. The model parameters are listed in Table 1.

Table 1. Model parameters.
Parameter   Description              Value
β           Learning rate            0.15
λ           Tag/trace decay rate     0.20
γ           Discount factor          0.90
α           Tag persistence          1 − λγ
ε           Exploration rate         0.025

The monkey starts with simple tasks and the complexity is then gradually increased. It is also common to give small rewards for reaching intermediate goals in the task, such as attaining fixation. We encouraged fixation (or touching the key in the vibrotactile task below) by giving a small shaping reward (r_i, 0.2 units) if the model directed gaze to the fixation point (touched the key). In the next section we will demonstrate that the training of networks with AuGMEnT is facilitated by shaping. Shaping was not strictly necessary for learning in any of the tasks, but it enhanced learning speed and increased the proportion of networks that learned the task within the allotted number of training trials.

The number of input units varied across tasks as the complexity of the sensory stimuli differed. We note, however, that the results described below would have been identical had we simulated a fixed, large input layer with silent input units in some of the tasks, because silent input units have no influence on activity in the rest of the network.

Saccade/antisaccade task

The first task (Fig. 2A) is a memory saccade/antisaccade task modeled after Gottlieb and Goldberg [3]. Every trial started with an empty screen, shown for one time step. Then a fixation mark was shown that was either black or white, indicating whether a pro- or antisaccade would be required. The model had to fixate within 10 time-steps, otherwise the trial was terminated without reward. If the model fixated for two time-steps, we presented a cue on the left or the right side of the screen for one time-step and gave the fixation reward r_i. This was followed by a memory delay of two time steps during which only the fixation point was visible. At the end of the memory delay the fixation mark turned off. To collect the final reward r_f in the pro-saccade condition, the model had to make an eye-movement to the remembered location of the cue, and to the opposite location on anti-saccade trials. The trial was aborted if the model failed to respond within eight time steps. The network architecture parameters are listed in Table 2.

Table 2. Network architecture parameters.
Architecture      Value
Input units       Task dependent
Memory units      N = 4
Initial weights   Uniform over [−0.25, 0.25]

The sensory units (Fig. 2B) represented the color of the fixation point and the presence of the peripheral cues. The three Q-value units had to represent the value of directing gaze to the centre, left and right side of the screen. This task can only be solved by storing the cue location in working memory and, in addition, requires a nonlinear transformation, so it cannot be solved by a linear mapping from the sensory units to the Q-value units. We trained the models for maximally 25,000 trials, or until they learned the task. We kept track of accuracy for all four trial types as the proportion of correct responses in the last 50 trials. When all accuracies reached 0.9 or higher, learning and exploration were disabled (i.e. β and ε were set to zero) and we considered learning successful if the model then performed all trial types accurately.

We distinguished three points along the learning trajectory: learning to obtain the fixation reward (‘Fix’), learning to fixate until fixation-mark offset (‘Go’) and finally learning to correctly solve the task (‘Task’). To determine the ‘Fix’-learn trial, we determined the time point when the model attained fixation in 90 out of 100 consecutive trials. The model learned to fixate after 224 trials (median) (Fig. 2C). The model learned to maintain gaze until the go signal after ~1,300 trials and it successfully learned the complete task after ~4,100 trials. Thus, the learning process was at least an order of magnitude faster than in monkeys, which typically learn such a task only after months of training with more than 1,000 trials per day.

Networks that received fixation rewards were more likely to learn the task than networks that did not (99.45% versus 76.41%; χ² = 2,498, p < 10⁻⁶). Thus, shaping strategies facilitate training with AuGMEnT, similar to their beneficial effect in animal learning [39].

One of the association units (grey in Fig. 2D) and the Q-unit for fixating at the centre of the display (blue in Fig. 2B,D) had strongest activity at fixation onset and throughout the fixation and memory delays. If recorded in a macaque monkey, these neurons would be classified as fixation cells. After the go-signal the Q-unit for the appropriate eye movement became more active. The activity of the Q-units also depended on cue-location during the memory delay as is observed, for example, in the frontal eye fields (* in Fig. 2D) [40]. This activity is caused by the input from memory units in the association layer that memorized cue location as a persistent increase in their activity (green and orange in Fig. 2D). Memory units were also tuned to the color of the fixation mark which differentiated pro-saccade trials from anti-saccade trials, a conjoined selectivity necessary to solve this nonlinear task [41]. There was an interesting division of labor between regular and memory units in the association layer. Memory units learned to remember the cue location. In contrast, regular units learned to encode the presence of task-relevant sensory information on the screen. Specifically, the fixation unit in Fig. 2D (upper row) was active as long as the fixation point was present and switched off when it disappeared, thus cueing the model to make an eye movement. Interestingly, these two classes of memory neurons and regular (“light sensitive”) neurons are also found in areas of the parietal and frontal cortex of monkeys [2,40] where they appear to have equivalent roles.

Fig. 2D provides a first, casual impression of the representations that the network learns. To gain a deeper understanding of the representation in the association layer that supports the nonlinear mapping from the sensory units to the Q-value units, we performed a principal component analysis (PCA) on the activations of the association units. We constructed a single (32×7) observation matrix from the association layer activations for each time-step (there were seven association units and eight time-points in each of the four trial-types), with the learning rate β and exploration rate ε of the network set to zero. Fig. 2E shows the projection of the activation vectors onto the first two principal components for an example network. It can be seen that activity in the association layer reflects the important events in the task. The color of the fixation point and the cue location provide information about the correct action and lead to a ‘split’ in the 2D principal component (PC) space. In the ‘Go’ phase, there are only two possible correct actions: ‘left’ for the Pro-Left and Anti-Right trials and ‘right’ otherwise. The 2D PC plot shows that the network splits the space into three parts based on the optimal action: here the ‘left’ action is clustered in the middle, and the two trial types with target action ‘right’ are adjacent to this cluster. This pattern (or its inversion with the ‘right’ action in the middle) was typical for the trained networks. Fig. 2F shows how the explained variance in the activity of association units increases with the number of PCs, averaged over 100 simulated networks; most variance was captured by the first two PCs.

We also determined, at every time step, which pairs of trial types evoked the most similar activity patterns (Fig. 2G). For every network we entered a ‘1’ in the matrix for the pair of trial types with the smallest distance and a ‘0’ for all other pairs of trials, and we then aggregated the results over all networks by averaging the resulting matrices. Initially the patterns of activity in the association layer are similar for all trial types, but they diverge after the presentation of the fixation point and the cue. The regular units convey a strong representation of the color of the fixation point (e.g. activity in pro-saccade trials with a left cue is similar to activity in pro-saccade trials with a right cue; PL and PR in Fig. 2G), which is visible at all times. Memory units have a clear representation of the previous cue location during the delay (e.g. AL trials are similar to PL trials and AR to PR trials in Fig. 2G). At the go-cue their activity became similar for trials requiring the same action (e.g. AL trials became similar to PR trials), and the same was true for the units in the Q-value layer.

In a control simulation we used the same stimuli, but now only required pro-saccades, so that the color of the fixation point became irrelevant. We trained 100 networks, of which 96 learned the task, and we investigated the similarities of the activation patterns. In these networks, the memory units became tuned to cue location but not to the color of the fixation point (Fig. 2H; note the similar activity patterns for trials with a differently colored fixation point, e.g. AL and PL trials). Thus, AuGMEnT specifically induces selectivity for task-relevant features in the association layer.

Delayed match-to-category task

After training, neurons in frontal [42] and parietal cortex [4] respond similarly to stimuli from the same category and discriminate between stimuli from different categories. In one study [4], monkeys had to group motion stimuli into two categories in a delayed match-to-category task (Fig. 3A). They first had to look at a fixation point, then a motion stimulus appeared and, after a delay, a second motion stimulus was presented. The monkeys’ response depended on whether the two stimuli came from the same category or from different categories. We investigated if AuGMEnT could train a network with an architecture identical to that of the delayed saccade/antisaccade task (with 3 regular and 4 memory units in the association layer) to perform this categorization task. We used an input layer with a unit for the fixation point and 20 units with circular Gaussian tuning curves of the form r(x) = exp(−(x − θ_c)²/(2σ²)) (s.d. σ of 12 deg.), with preferred directions θ_c evenly distributed over the unit circle (Fig. 3B).

Fig 3. Match-to-category task. A, When the network directed gaze to the fixation point, we presented a motion stimulus (cue-1), and after a delay a second motion stimulus (cue-2). The network had to make a saccade to the left when the two stimuli belonged to the same category (match) and to the right otherwise. There were twelve motion directions, which were divided into two categories (right). B, The sensory layer had a unit representing the fixation point and 20 units with circular Gaussian tuning curves (s.d. 12 deg.) with preferred directions evenly distributed over the unit circle. C, Activity of two example memory units in a trained network evoked by the twelve cue-1 directions. Each line represents one trial, and color represents cue category. Responses to cues closest to the categorization boundary are drawn with a dashed line of lighter color. F, fixation mark onset; C, cue-1 presentation; D, delay; G, cue-2 presentation (go signal); S, saccade. D, Activity of the same two example memory units as in (C) in the ‘go’ phase of the task for all 12×12 combinations of cues. Colors of labels and axes indicate cue category. E, Left, Motion tuning of the memory units (in C) at the end of the memory delay. Error bars show s.d. across trials and the dotted vertical line indicates the category boundary. Right, Tuning of a typical LIP neuron (from [4]); error bars show s.e.m. F, Left, Distribution of the direction change that evoked the largest difference in response across memory units from 100 networks. Right, Distribution of direction changes that evoked the largest response differences in LIP neurons (from [4]).

The two categories were defined by a boundary that separated the twelve motion directions (adjacent motion directions were separated by 30 deg.) into two sets of six directions each.
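A sketch of the motion-direction input encoding (our NumPy illustration; the circular-distance computation is an implementation choice, not specified in the text):

```python
import numpy as np

def motion_input(direction_deg, n_units=20, sd_deg=12.0):
    """Circular Gaussian tuning curves with preferred directions spread over the circle."""
    preferred = np.arange(n_units) * 360.0 / n_units
    # signed angular distance in (-180, 180] between stimulus and preferred direction
    d = np.angle(np.exp(1j * np.deg2rad(direction_deg - preferred)), deg=True)
    return np.exp(-d**2 / (2.0 * sd_deg**2))

# Noisy cue-1 as in the simulations (direction noise s.d. 5 deg.)
rng = np.random.default_rng(0)
activity = motion_input(30.0 + rng.normal(0.0, 5.0))
```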

Two time-steps after fixation we presented one of twelve motion cues (cue-1) for one time step and gave the fixation reward r_i (Fig. 3A). We added Gaussian noise to the motion direction (s.d. 5 deg.) to simulate noise in the sensory system. The model had to maintain fixation during the ensuing memory delay, which lasted two time steps. We then presented a second motion stimulus (cue-2) and the model had to make an eye-movement (either left or right; the fixation mark did not turn off in this task) that depended on the match between the categories of the cues. We required an eye movement to the left if both stimuli belonged to the same category and to the right otherwise, within eight time-steps after cue-2. We trained 100 models and measured accuracy over the preceding 50 trials with the same cue-1. We defined the duration of the learning phase as the trial at which accuracy had reached 80% for all cue-1 types.

Fig. 3C illustrates motion tuning of two example memory neurons in a trained network. Both units had become category selective, from cue onset onwards and throughout the delay period. Fig. 3D shows the activity of these units at ‘Go’ time (i.e. after presentation of cue-2) for all 144 combinations of the two cues. Fig. 3E shows the tuning of the memory units during the delay period. For every memory unit of the simulations (N = 400), we determined the direction change eliciting the largest difference in activity (Fig. 3F) and found that the units exhibited the largest changes in activity for differences in the motion direction that crossed a category boundary, as do neurons in LIP [4] (Fig. 3E,F, right). Thus, AuGMEnT can train networks to perform a delayed match-to-category task and it induces memory tuning for those feature variations that matter.

Probabilistic decision making task

Persistent activity in area LIP has also been related to perceptual decision making, because LIP neurons integrate sensory information over time in decision making tasks [43]. Can AuGMEnT train the very same network to integrate evidence for a perceptual decision? We focused on a recent study [5] in which monkeys saw a red and a green saccade target and then four symbols that were presented successively. The four symbols provided probabilistic evidence about whether a red or green eye-movement target was baited with reward (Fig. 4A). Some of the symbols provided strong evidence in favor of the red target (e.g. the triangle in the inset of Fig. 4A), others strong evidence for the green target (heptagon) and other symbols provided weaker evidence. The pattern of choices revealed that the monkeys assigned high weights to symbols carrying strong evidence and lower weights to less informative ones. A previous model with only one layer of modifiable synapses could learn a simplified, linear version of this task where the symbols provided direct evidence for one of two actions [44]. This model used a pre-wired memory and it did not simulate the full task where symbols only carry evidence about red and green choices while the position of the red and green targets varied across trials. Here we tested if AuGMEnT could train our network with three regular and four memory units to perform the full nonlinear task.

We will first describe the most complex version of the task. In this version, the model (Fig. 4B) first had to direct gaze to the fixation point. After fixating for two time-steps, we gave the fixation reward r_i and presented the colored targets together with one of the 10 symbols at one of four locations around the fixation mark. In the subsequent three time-steps we presented the additional symbols. We randomized the location of the red and green targets, the positions of the successively presented symbols as well as the symbol sequence over trials. There was a memory delay of two time steps after all symbols (s_1, ..., s_4) had been presented and we then removed the fixation point, as a cue to make a saccade to one of the colored targets. The target with the higher reward probability was the red one if W = Σ_i w_i > 0 (the weights w_i are specified in Fig. 4A, inset) and the green one otherwise. The model’s choice was considered correct if it selected the target with the highest reward probability, or either target if the reward probabilities were equal. However, r_f was only given if the model selected the baited target, irrespective of whether it had the highest reward probability.

We increased the difficulty of the task, up to the full sequence length of four symbols, in eight steps (Table 3). Training started with the two ‘trump’ shapes, which guarantee reward for the correct decision (triangle and heptagon, see Fig. 4A, inset). We judged that the task had been learned when the success rate over the last n trials was 85%. As the number of possible input patterns grew, we increased n to ensure that a significant fraction of the possible input patterns had been presented before we determined convergence (see Table 3). Difficulty was first increased by adding the pair of symbols with the next smaller absolute weight, until all shapes had been introduced (levels 1-5), and then by increasing the sequence length (levels 6-8).

Training of the model to criterion (85% correct in the final task) took a median total of 55,234 trials across the eight difficulty levels, which is faster than the monkeys learned. After the training procedure, the memory units had learned to integrate information for either the red or green choice over the symbol sequence and maintained information about the value of this choice as persistent activity during the memory delay. Fig. 4C shows the activity of two memory units and the Q-value units of an example network during a trial where the shield symbol was presented four times, providing strong evidence that the green target was baited with reward. The memory units became sensitive to the context determined by the position of the red and green saccade targets. The unit in the first row of Fig. 4C integrated evidence for the green target if it appeared on the right side and the unit in the second row if the green target appeared on the left. Furthermore, the activity of these memory units ramped up gradually as more evidence accumulated.

To investigate the influence of the log likelihood on the activity of the memory units, we computed log likelihood ratio (logLR) quintiles as follows. We enumerated all 10,000 length-4 symbol combinations s ∈ S and computed the probability of reward for a saccade to the red target, P(R|s), for every combination. We next computed the conditional probabilities of reward P(R|s_l) and P(G|s_l) = 1 − P(R|s_l) for sequences s_l of length l ∈ {1,...,4} (marginalizing over the unobserved symbols). We then computed logLR(s_l) as log10(P(R|s_l)/P(G|s_l)) for each specific sequence of length l and divided those into quintiles.
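The quintile computation can be reproduced along the following lines. The sketch assumes, purely for illustration, a weight-of-evidence rule P(R|s) = 10^W/(1+10^W) with W the summed symbol weights, and uses made-up finite weights; the true weights w_i and reward schedule are those of Fig. 4A.

```python
import numpy as np
from itertools import product

# Illustrative weights; the actual symbol weights are specified in Fig. 4A (inset).
w = np.array([-2.0, -0.9, -0.7, -0.5, -0.3, 0.3, 0.5, 0.7, 0.9, 2.0])

def p_red(seq):
    """P(reward at red | full symbol sequence), under the assumed evidence rule."""
    W = w[list(seq)].sum()
    return 10.0**W / (1.0 + 10.0**W)

all_seqs = np.array(list(product(range(10), repeat=4)))   # all 10,000 length-4 sequences
p_full = np.array([p_red(s) for s in all_seqs])

def log_lr(prefix):
    """log10 P(R|s_l)/P(G|s_l) for an observed prefix, marginalizing over the rest."""
    mask = np.all(all_seqs[:, :len(prefix)] == prefix, axis=1)
    p_r = p_full[mask].mean()                              # unobserved symbols equally likely
    return np.log10(p_r / (1.0 - p_r))

# Quintile membership for, e.g., all length-2 prefixes
lrs = np.array([log_lr(list(p)) for p in product(range(10), repeat=2)])
edges = np.quantile(lrs, [0.2, 0.4, 0.6, 0.8])
quintile = np.digitize(lrs, edges)                         # values 0..4
```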

We then computed the average within-quintile activities over the aligned population. The upper panel of Fig. 5A shows how the average activity of the four memory units of an example network depended on the log likelihood that the targets were baited, and the lower panel shows the corresponding activity of LIP neurons [5].

It can be seen that the memory units’ activity became correlated with the log likelihood, just like the activity of LIP neurons. Importantly, after learning the synaptic weights from the input neurons onto the memory cells depended on the true weights of the symbols (Fig. 5B). This correlation was also strong at the population level, as can be seen in Fig. 5C, which shows the distribution of all the correlation coefficients (N = 396). Thus, plasticity of the synapses onto the memory neurons can explain how the monkeys learn to value the symbols, and AuGMEnT explains how these neurons learn to integrate the most relevant information. Furthermore, our results illustrate that AuGMEnT not only trains the association units to integrate stochastic sensory evidence but also endows them with the mixed selectivity for target color and symbol sequence that is required to solve this nonlinear task [41].

Vibrotactile discrimination task

Our last simulation investigated a task that has been used to study vibrotactile working memory [6,45]. In this task, the monkey touches a key with one hand and then two vibration stimuli are applied sequentially to a fingertip of the other hand (Fig. 6A). The monkey has to indicate whether the frequency of the first vibration stimulus (F1) is higher or lower than the frequency of the second one (F2). At the end of the trial the animal indicates its choice by releasing the key and pressing one of two buttons. The overall structure of the task is similar to that of the visual tasks described above, but the feature of interest here is that it requires a comparison between two scalar values: F2, which is sensed on the finger, and F1, which has to be maintained in working memory.

Several previous studies addressed how neural network models can store F1 and compare it to F2 [46-48]. More recently, Barak et al. [49] investigated the dynamics of the memory states in networks trained with three different supervised learning methods and compared them to the neuronal data. However, these previous studies did not address trial-and-error learning of the vibrotactile discrimination task with a biologically plausible learning rule. We therefore investigated if AuGMEnT could train the same network that had been used for LIP, with three regular units and four memory units, to solve this task.

Neurons in somatosensory cortex have broad tuning curves and either monotonically increase or decrease their firing rate as a function of the frequency of the vibrotactile stimulus [50]. The input units of the model therefore had sigmoidal tuning curves r(x) = 1/(1 + exp(w(θ_c − x))), with 10 centre points θ_c evenly distributed over the interval between 5.5 Hz and 49.5 Hz. We used a pair of units at every θ_c, with one unit increasing its activity with stimulus frequency and the other one decreasing, so that there were a total of 20 input units. The parameter w determines the steepness of the tuning curve and was +/−5. We modeled sensory noise by adding independent zero-mean Gaussian noise (s.d. 7.5%) to the firing rates of the input units. We also included a binary input unit that signaled skin contact with the stimulation device (unit S in Fig. 6B). The association and Q-value layers were identical to those of the other simulations (Fig. 6B).
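The vibrotactile input encoding could be written as follows (our sketch; the clipping of the exponent and the interpretation of the 7.5% noise as additive noise relative to the maximum rate are assumptions):

```python
import numpy as np

def _sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -500.0, 500.0)))

def vibrotactile_input(freq_hz, slope=5.0, noise_sd=0.075, rng=None):
    """Sigmoidal frequency tuning r(x) = 1/(1 + exp(w*(theta_c - x))) for 20 input units.

    Ten centre points between 5.5 and 49.5 Hz; at each centre one unit increases
    and one decreases with stimulus frequency (w = +/-5).
    """
    rng = rng or np.random.default_rng()
    centres = np.linspace(5.5, 49.5, 10)
    rising = _sigmoid(slope * (freq_hz - centres))    # w = +5
    falling = _sigmoid(-slope * (freq_hz - centres))  # w = -5
    r = np.concatenate([rising, falling])
    return r + rng.normal(0.0, noise_sd, r.shape)     # additive sensory noise
```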

A trial started when the input unit indicating skin contact with the vibrating probe became active, and the model had to select the hold-key within ten time-steps, or else the trial was terminated. When the model had held the key for two time-steps, a vibration stimulus (F1, uniformly random between 5 and 50 Hz) was presented to the network for one time-step and the small shaping reward (r_i) was given. This was followed by a memory delay after which we presented the second vibration stimulus (F2), drawn from a uniform distribution between 5 and 50 Hz, but with a minimal separation of 2 Hz from F1. If F2 was lower than F1 the model had to select the left button (green Q-value unit in Fig. 6B), and the right button (red) otherwise, within eight time steps after the presentation of F2 to obtain the reward r_f.

When the model reached a performance of 80% for every F1 we disabled learning and exploration (setting β and ε to zero) and checked the performance of the model for F1 stimuli of 20, 30 and 40 Hz and F2 stimuli with offsets of [−10, −8, . . ., −2, 2, . . ., 8, 10] Hz, repeating each test 20 times. We considered learning to be successful if the model classified the nearest F2 frequencies (2 Hz distance) with a minimal accuracy of 50% and all other F2 frequencies with an accuracy better than 75%, for every F1 bin.

Fig. 6C illustrates the average (±s.d.) choices of these 100 trained models as a function of F2, for three values of F1, as well as a logistic function fitted to the data [as in 6]. It can be seen that the model correctly indicates whether F1 is higher or lower than F2 and that the criterion depends on the value of F1, implying that the model has learned to store this analog scalar value in its working memory. What are the memory representations that emerged during learning?

Fig. 6D shows the F1 tuning of two memory units in an example network; the tuning curves are typically broad and can be increasing or decreasing functions of F1, similar to what has been found in experiments in the frontal cortex of monkeys [51]. Fig. 6E shows the distribution of linear correlations between the activity of 400 memory units in 100 trained networks and the F1 frequency; most units exhibit a strong positive or negative correlation, indicating that the networks learned to code the memory of F1 as the level of persistent firing of the memory units.

This comparison process depends critically on the order of presentation of the two stimuli, yet it involves information that comes in via the same sensory inputs and association units [48]. We found that the memory units were indeed sensitive to both F1 and F2 in the comparison period. Fig. 6F shows the responses of two example memory units and the three Q-value units for trials with an F1 of 20 or 30 Hz, followed by an F2 with a frequency that was either 5 Hz higher (solid lines) or lower (dashed lines) than F1. The activity of the memory units encodes F1 during the memory delay, but these units also respond to F2, so that the activity after the presentation of F2 depends on both frequencies. The lower panel illustrates the activity of the Q-value units. The activity of the Hold Q-value unit (H, blue) is highest until the presentation of F2, causing the model to hold the key until the go-signal. This unit did not distinguish between trials that required a right or left button press. The activities of the Q-value units for the left and right button press (red and green traces) explain how the network made correct decisions at the go-signal, because the Q-value of the appropriate action became highest (the solid lines in Fig. 6F show activity if F2>F1 and the dashed lines if F2<F1). It can be seen, for example, how the response elicited in the Q-value layer by an F2 of 25 Hz depended on whether the preceding F1 was 20 Hz (continuous curves in the left panel of Fig. 6F) or 30 Hz (dashed curves in the right panel). We next quantified how the activity of the memory, regular and Q-value units from 100 networks (N = 400, 300 and 300 units, respectively) depended on F1 and F2 during the comparison phase with a linear regression of unit activity on the two frequencies [see 52], using all trials in which the F2 stimulus was presented and all combinations of the two frequencies between 5 and 50 Hz (step size 1 Hz).

The activity of many memory units depended on F1 and also on F2 (Fig. 6G, left), and the overall negative correlation between the coefficients (r = −0.81, p < 10⁻⁶) indicates that units that tended to respond more strongly for increasing F1 tended to decrease their response for increasing F2 and vice versa, just as is observed in area S2, the prefrontal cortex and the medial premotor cortex of monkeys [45,51,52]. In other words, many memory units became tuned to the difference between F1 and F2 in the comparison phase, as is required by this task. In spite of the fact that F1 and F2 activate the memory units through the same synapses, the inverse tuning is possible because the F1 stimulus has turned off before the comparison phase and has thereby activated the off-cells in the sensory layer. In contrast, the F2 stimulus is still ‘on’ in this phase of the task, so the off-units coding F2 have not yet provided their input to the memory cells. As a result, the memory units’ final activity can reflect the difference between F1 and F2, as is required by the task. Regular units only have access to the current stimulus and are therefore only tuned to F2 in the comparison phase (Fig. 6G, middle). Q-value units reflect the outcome of the comparison process (Fig. 6G, right): their regression coefficients with F1 and F2 fall into three clusters, as predicted by the required action.
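The regression underlying Fig. 6G can be reproduced per unit with an ordinary least-squares fit; a sketch assuming the planar model activity ≈ a1·F1 + a2·F2 + a0 (our notation):

```python
import numpy as np

def f1_f2_coefficients(activity, f1, f2):
    """Least-squares coefficients of unit activity regressed on F1 and F2.

    activity : responses of one unit during the comparison phase, one per trial
    f1, f2   : the two vibration frequencies (Hz) on those trials
    Returns (a1, a2), the coefficients for F1 and F2.
    """
    X = np.column_stack([np.asarray(f1, float), np.asarray(f2, float),
                         np.ones(len(f1))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(activity, float), rcond=None)
    return coef[0], coef[1]
```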

Hernandez et al. [6] also studied a version of the task in which F1 was fixed for a block of trials. In this version, the monkeys based their response on F2 only and did not memorize F1. As a result, their performance deteriorated at the start of a new block of trials with a different F1. Networks trained with AuGMEnT also only memorize task-relevant information. Do networks trained with AuGMEnT therefore also fail to memorize F1 if it is fixed during training? To investigate this question, we trained models with a fixed F1 of 30 Hz [6] and presented F2 stimuli in the range between 5 and 50 Hz (2.5 Hz spacing) with a minimal distance from F1 of 10 Hz. We estimated convergence as the trial at which accuracy reached 90% (running average of 50 trials).

After learning the fixed-F1 task, we subjected the networks to block training with F1 stimuli of 20, 30 and 40 Hz as in [6], while we presented F2 stimuli with frequencies of [−10, −8, . . ., −2, 2, . . ., 8, 10] Hz relative to F1 (10 in total, each shown 150 times). These blocks of trials had a pseudorandom ordering, but we always presented a 30 Hz F1 in the last block. When we tested immediately after every block, we found that the models were well able to adapt to a specific F1. However, the models were not able to solve the variable-F1 task after this extensive block training, even though they had had significant exposure to different F1 stimuli.

Fig. 6I shows the average psychometric curves for 100 networks after the last block, with F1 = 30 Hz. Colors represent trials with different F1 stimuli (as in Fig. 6C). It can be seen that the models disregarded F1 and only determined whether F2 was higher or lower than 30 Hz, just like monkeys that are trained with a blocked procedure [6]. Thus, the model can explain why the monkeys do not learn to compare the two stimuli if F1 is fixed for longer blocks of trials. The memory units and the Q-value units now had similar rather than opposite tuning for F1 and F2 (positive correlations in the left and right panels of Fig. 6H; compare to Fig. 6G), which indicates that blocked training causes a failure to learn to subtract the memory trace of F1 from the representation of F2.

Memory units learn to represent the analog value that needs to be memorized as a graded level of persistent activity. However, if F1 is fixed for blocks of trials, the network does not memorize F1 but learns to base its decision on F2 only, in accordance with experimental findings.

Varying the learning parameters and the size of the network

In the simulations described above we fixed the number of units in the association layer and Q-value layer and used a single set of learning parameters. To examine the stability of the learning scheme, we also evaluated learning speed and convergence rate for various values of the learning rate β and the SARSA learning parameter λ (which determines the tag-decay parameter α because α = 1 − λγ, as was explained above; γ was kept at the default value). For the saccade/antisaccade, match-to-category and vibrotactile discrimination tasks we tested β ∈ {0.05, 0.10, ..., 1.0} and λ ∈ {0.0, 0.1, ..., 0.9} while the other parameters remained the same (Tables 1, 2), and we ran 100 simulations for every combination. Fig. 7 shows the proportion of networks that converged and the median convergence trial. Training in the probabilistic classification task required a number of different training stages and a longer overall training time, and we evaluated this task with a smaller set of parameters (Fig. 7, right). There was a wide range of learning parameters for which most of the networks converged, and these ranges overlapped for the four tasks, implying that the AuGMEnT learning scheme is relatively robust and stable.

Can AuGMEnT also train networks with a larger association layer? To further investigate the generality of the learning scheme, we ran a series of simulations with increasing numbers of association units, multiplying the number of association units in the network described above by 2, 4, . . ., 128 and training 100 networks of each size in the saccade/antisaccade task. We first evaluated these larger networks without changing the learning parameters and found that learning was largely unaffected within a limited range of network sizes, whereas performance deteriorated for networks that were 32-128 fold larger (Fig. 8A). The decrease in performance is likely caused by the larger number of synapses, causing larger adjustments of the Q-values after each time step than in the smaller networks. It is possible to compensate for this effect by choosing a smaller β (learning rate) and λ. We jointly scaled these parameters by factors of 1/2, 1/4 and 1/8 and selected the parameter combination that resulted in the highest convergence rate and the fastest median convergence speed for every network size (Fig. 8B). The performance of the larger networks was at least as good as that of the network with 7 association units if the learning parameters were scaled. Thus, AuGMEnT can also successfully train networks with a much larger association layer.

Discussion

The scheme uses units inspired by transient and sustained neurons in sensory cortices [19], action-value coding neurons in frontal cortex, basal ganglia and midbrain [12,35,36], and neurons with mnemonic activity that integrate input in association cortex. To the best of our knowledge, AuGMEnT is the first biologically plausible learning scheme that implements SARSA in a multilayer neural network equipped with working memory. The model is simple, yet able to learn a wide range of difficult tasks requiring nonlinear sensory-motor transformations, decision making, categorization, and working memory. AuGMEnT can train the very same network to perform any of these tasks by presenting the appropriate sensory inputs and reward contingencies, and the representations it learns are similar to those found in animals trained on these tasks. AuGMEnT is a so-called on-policy method because it only relies on the Q-values that the network experiences during learning. Such on-policy methods appear to be more stable than off-policy algorithms (such as Q-learning, which considers transitions not experienced by the network) when combined with neural networks (see e.g. [53,54]).

In the delayed saccade/anti-saccade task, training induced persistent neuronal activity tuned to the cue location and to the color of the fixation point, but only if it was relevant. In the categorization task, units became sensitive to category boundaries and in the decision making task, units integrated sensory evidence with stronger weights for the more reliable inputs. These properties resemble those of neurons in LIP [2—5] and the frontal cortex [24] of monkeys. Finally, the memory units learned to memorize and compare analog values in the vibrotactile task, just as has been observed in the frontal cortex of monkeys [6,45].

Our learning scheme makes a number of testable predictions. The first and foremost is that feedback connections gate the plasticity of feedforward connections by inducing synaptic tags. Specifically, the learning scheme predicts that feedback connections are important for the induction of tags on feedforward connections from sensory cortices to the association cortex (Fig. 1B). A second prediction is the existence of traces in synapses onto neurons with persistent activity (i.e. memory units) that are transformed into tags upon the arrival of feedback from the response selection stage, which may occur at a later point in time. The third prediction is that these tags interact with globally released neuromodulators (e.g. dopamine, acetylcholine or serotonin), which determine the strength and sign of the synaptic changes (potentiation or depression). Neurobiological evidence for the existence of these tags and their interaction with neuromodulatory substances will be discussed below. A final prediction is that stationary stimuli provide transient input to neurons with persistent activity. As a result, stimuli that are visible for a longer time do not necessarily cause a ramping of activity. In our network, ramping was prevented because memory units received input from “on” and “off” input units only. We note, however, that other mechanisms, such as rapidly adapting synapses onto memory cells, could achieve the same effect. In contrast, neurons in association cortex without persistent activity are predicted to receive continuous input for as long as a stimulus is present. These specific predictions could all be tested in future neuroscientific work.

Role of attentional feedback and neuromodulators in learning

AuGMEnT implements a four-factor learning rule. The first two factors are the pre- and postsynaptic activity of the units, and there are two additional “gating factors” that enable synaptic plasticity. The first gating factor is the feedback from units in the motor layer that code the selected action. These units send an attentional signal back to earlier processing levels to tag the synapses responsible for selecting this action. The importance of selective attention for learning is supported by experiments in cognitive psychology. If observers select a stimulus for an action, attention invariably shifts to this stimulus [55], and this selective attention signal gates perceptual learning so that attended objects have a larger impact on future behavior [56–58]. Moreover, neurophysiological studies demonstrated that such a feedback signal exists, because neurons in the motor cortex that code an action enhance the activity of upstream neurons providing input for this action [59,60].
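In schematic form, the four factors combine multiplicatively: the pre- and postsynaptic activity and the attentional feedback determine the tag, and the globally released neuromodulator determines how tagged synapses change. The expression below is a generic summary with illustrative symbols (x_i for presynaptic activity, y_j for postsynaptic activity, fb_j for feedback from the selected action, δ for the reward-prediction error and β for the learning rate), not the exact equations of the Methods:

    \mathrm{Tag}_{ij} \;\propto\; x_i \cdot y_j \cdot fb_j, \qquad \Delta v_{ij} \;=\; \beta \,\delta\, \mathrm{Tag}_{ij}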

The second gating factor is a globally released neuromodulatory signal that codes the reward-prediction error. Dopamine is often implicated because it is released if reward expectancy increases and it influences synaptic plasticity [10,38]. There is also a potential role for acetylcholine, because cholinergic cells project diffusely to cortex, respond to rewards [61–63] and influence synaptic plasticity [61,64]. Furthermore, a recent study demonstrated that serotonergic neurons also carry a reward-predicting signal and that optogenetic activation of serotonergic neurons acts as a positive reinforcer [65]. Guidance of synaptic plasticity by the combination of neuromodulatory signals and cortico-cortical feedback connections is biologically plausible because all the information for the synaptic update is available at the synapse.

Synaptic tags and synaptic traces

The first step in the plasticity of a synapse onto a memory cell is the formation of a synaptic trace that persists until the end of the trial (Fig. 1C). The second step is the conversion of the trace into a tag when a selected motor unit feeds back to the memory cell. The final step is the release of the neuromodulator that modifies tagged synapses. The learning rule for the synapses onto the regular (i.e. non-memory) association units is similar (Fig. 1B), but tags form directly at active synapses, skipping the first step. We note, however, that the same learning rule is obtained if these synapses also have traces that decay within one time step. The hypothesis that synaptic plasticity requires a sequence of events [66,67] is supported by the synapses’ complex biochemical machinery. There is evidence for synaptic tags [15,31,32] and recent studies have started to elucidate their identity [32]. Neuromodulatory signals influence synaptic plasticity even if they are released seconds or minutes later than the plasticity-inducing event [15,17,32], which supports the hypothesis that they interact with some form of tag.
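The three-step sequence for synapses onto memory units can be made concrete with a minimal sketch. The variable names, the default parameter values and the omission of the activation-derivative term in the feedback factor are illustrative simplifications, not the exact simulation code:

    def update_memory_synapse(trace, tag, weight, x_pre, fb_post, delta,
                              alpha=0.25, beta=0.10):
        """Illustrative three-step plasticity for one synapse onto a memory unit.

        x_pre   : presynaptic input activity on this time step
        fb_post : attentional feedback reaching the memory unit from the
                  selected action (0 when no feedback arrives)
        delta   : reward-prediction error carried by the neuromodulator
        alpha   : tag-decay parameter; beta : learning rate
        """
        trace = trace + x_pre                          # step 1: trace persists until trial end
        tag = (1.0 - alpha) * tag + fb_post * trace    # step 2: feedback converts the trace into a tag
        weight = weight + beta * delta * tag           # step 3: neuromodulator modifies the tagged synapse
        return trace, tag, weight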

Comparison to previous modeling approaches

There has been substantial progress in biologically inspired reinforcement learning models with spiking neurons [68–71] and with models that approximate population activity with continuous variables [14,16,21,44,67,72–74]. Many of these models rely either on Actor-Critic learning [7] or on policy gradient learning [75]. An advantage of Actor-Critic models is that their components can be related to brain regions [16,71,73]. AuGMEnT has features in common with these models; for example, it uses the change in Q-value to compute the RPE (Eqn. (17)). Another widely used class of models is formed by policy gradient methods [68,75], in which units (or synapses [68]) act as local agents that try to increase the global reward. An advantage of these models is that learning does not require knowledge about the influence of units on other units in the network, but a disadvantage is that the learning process does not scale well to larger networks, where the correlation between local activity and the global reward is weak [70]. AuGMEnT uses ‘attentional’ feedback from the selected action to improve learning [14] and it also generalizes to multilayer networks. It thereby alleviates a limitation of many previous biologically plausible RL models, which can only train a single layer of modifiable synaptic weights and solve linear tasks [16,21,44,67,70,71,73,76] and binary decisions [21,44,67,70].
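For reference, the reward-prediction error that drives these updates has the standard SARSA form, restated here in generic notation consistent with the description in the Results (q_a(t−1) is the value of the previously selected action, q_{a'}(t) the value of the newly selected one, r(t) the reward and γ the discount factor):

    \delta(t) \;=\; r(t) \;+\; \gamma\, q_{a'}(t) \;-\; q_{a}(t-1)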

AuGMEnT differs from many previous models in its ability to train task-relevant working memory representations without pre-wiring. We modeled memory units as integrators, because neurons that act as integrators and maintain their activity during memory delays have been found in many cortical regions [2–5,23,24]. To keep the model simple, we did not specify the mechanisms causing persistent activity, which could derive from intracellular processes, local circuit reverberations or recurrent activity in larger networks spanning cortex, thalamus and basal ganglia [20–22].

Earlier neural network models used “backpropagation-through-time”, but its mechanisms are biologically implausible [77]. The long short-term memory model (LSTM) [78] is a more recent and popular approach. Working memories in LSTM rely on the persistent activity of memory units, which resemble the ones used by AuGMEnT. However, LSTM relies on the biologically implausible error-backpropagation rule. To our knowledge, only one previous model addressed the creation of working memories with a neurobiologically inspired learning scheme: the prefrontal basal ganglia working memory model (PBWM) [72], which is part of the Leabra cognitive architecture [79,80]. Although a detailed comparison of AuGMEnT and Leabra is beyond the scope of this article, it is useful to mention a few key differences. First, the complexity and level of detail of the Leabra/PBWM framework is greater than that of AuGMEnT. The PBWM framework uses more than ten modules, each with its own dynamics and learning rules, making formal analysis difficult. We chose to keep the models trained with AuGMEnT as simple as possible, so that learning is easier to understand. AuGMEnT’s simplicity comes at a cost because many functions remain abstract (see next section). Second, the PBWM model uses a teacher that informs the model about the correct decision, i.e. it uses more information than just reward feedback. Third, PBWM is an actor-critic architecture that learns the value of states, whereas AuGMEnT learns the value of actions. Fourth and finally, there are important differences in the mechanisms for working memory. In PBWM, memory units are bistable and the model is equipped with a system to gate information into prefrontal cortex via the basal ganglia. In AuGMEnT, memory units are directly activated by on- and off-units in the input layer and they have continuous activity levels. The activity profile of memory units is task-dependent in AuGMEnT: it can train memory units to integrate evidence for probabilistic decision making, to memorize analog values as graded levels of persistent activity, but also to store categories with almost binary responses in a delayed match-to-category task.

Biological plausibility, biological detail and future work

Our aim was to propose a learning rule based on Hebbian plasticity that is gated by two factors known to gate plasticity: a neuromodulatory signal that is released globally and codes the reward-prediction error and an attentional feedback signal that highlights the part of the network that is accountable for the outcome of an action. We showed that the combination of these two factors, which are indeed available at the level of the individual synapses, can cause changes in synaptic strength that follow gradient descent on the reward-prediction error for the transitions that the network experiences. At the same time, the present model provides only a limited degree of detail. The advantage of such a more abstract model is that it remains mathematically tractable. The downside is that more work will be needed to map the proposed mechanisms onto specific brain structures. We pointed out the correspondence between the tuning that developed in the association layer and tuning in the association cortex of monkeys. We now list a number of simplifying assumptions that we made and that will need to be alleviated by future models that incorporate more biological detail.

Future modeling studies could include brain structures for storing the Q-value of the previously selected action while new action-values are computed. Although we do not know the set of brain structures that store action values, previous studies implicated the medial and lateral prefrontal cortex in storing the outcome that is associated with an action [81,82]. Prefrontal neurons even update the predicted outcome as new information comes in during the trial [83]. An alternative to storing Q-values is provided by Actor-Critic architectures that assign values to the various states instead of state-action combinations. They use one network to estimate state-values and another network to select actions [16]. Interestingly, [16] proposed that the basal ganglia could compute temporal difference errors by comparing activity in the indirect pathway, which might store the predicted value of the previous time-step, and the direct pathway, which could code the predicted value of the next state. We hypothesize that a similar circuit could be used to compute SARSA temporal difference errors. In addition, we also did not model the action-selection process itself, which has been suggested to take place in the basal ganglia (see [30]).
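To make this contrast explicit, the state-value (actor-critic) and action-value (SARSA) temporal-difference errors can be written side by side. These are standard textbook definitions, not equations taken from this paper:

    \delta_{\mathrm{critic}}(t) \;=\; r(t) + \gamma\, V(s_t) - V(s_{t-1})
    \qquad \text{versus} \qquad
    \delta_{\mathrm{SARSA}}(t) \;=\; r(t) + \gamma\, Q(s_t, a_t) - Q(s_{t-1}, a_{t-1})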

Future studies could specify more detailed network architectures with constrained weights (e.g. as in [72]). Indeed, it is possible to convert networks into functionally equivalent ones with excitatory and inhibitory units that have only positive weights [84], but the necessary generalization of AuGMEnT-like learning rules would require additional work.

Future studies could also include feedback connections that influence the activity of units in lower layers, and develop learning rules for the plasticity of these activity-propagating feedback connections. Such connections might further expand the set of tasks that neural networks can master when trained by trial-and-error. In this context it is of interest that previous studies demonstrated that feedforward propagation of activity to higher cortical areas mainly utilizes the AMPA receptor, whereas feedback effects rely more on the NMDA receptor [85], which plays an important role in synaptic plasticity. NMDA receptors also modify neuronal activity in lower areas, and another class of candidate receptors that could have a specific role in the influence of feedback connections on plasticity is the metabotropic glutamate receptors, which are prominent in feedback pathways [86,87] and known to influence synaptic plasticity [88].

In addition, implementations of AuGMEnT using populations of spiking neurons in continuous time deserve to be studied. We leave the integration of the necessary biological detail in AuGMEnT-like networks, which would alleviate all these simplifications, for future work.

Conclusions

Here we have shown that interactions between synaptic tags and neuromodulatory signals can explain how neurons in ‘multiple-demand’ association areas acquire mnemonic signals for apparently disparate tasks that require working memory, categorization or decision making. The finding that a single network can be trained by trial and error to perform these diverse tasks implies that these learning problems now fit into a unified reinforcement learning framework.

Acknowledgments

We thank John Assad and Ariel Zylberberg for helpful comments.

Author Contributions
