On the Number of Neurons and Time Scale of Integration Underlying the Formation of Percepts in the Brain
Adrien Wohrer, Christian K. Machens

Abstract

Here we study the formation of such percepts under the assumption that they emerge from a linear readout, i.e., a weighted sum of the neurons’ firing rates. We show that this assumption constrains the trial-to-trial covariance structure of neural activities and animal behavior. The predicted covariance structure depends on the readout parameters, and in particular on the temporal integration window w and typical number of neurons K used in the formation of the percept. Using these predictions, we show how to infer the readout parameters from joint measurements of a subject’s behavior and neural activities. We consider three such scenarios: (1) recordings from the complete neural population, (2) recordings of neuronal sub-ensembles whose size exceeds K, and (3) recordings of neuronal sub-ensembles that are smaller than K. Using theoretical arguments and artificially generated data, we show that the first two scenarios allow us to recover the typical spatial and temporal scales of the readout. In the third scenario, we show that the readout parameters can only be recovered by making additional assumptions about the structure of the full population activity. Our work provides the first thorough interpretation of (feed-forward) percept formation from a population of sensory neurons. We discuss applications to experimental recordings in classic sensory decision-making tasks, which will hopefully provide new insights into the nature of perceptual integration.

Author Summary

A “standard model” for these tasks has progressively emerged, in which the animal’s percept and subsequent choice on each trial are obtained from a linear integration of the activity of sensory neurons. However, to date, there has been no principled method to estimate the parameters of this model: mainly, the typical number of neurons K from the population involved in conveying the percept, and the typical time scale w during which these neurons’ activities are integrated. In this article, we propose a novel method to estimate these quantities from experimental data, and thus assess the validity of the standard model of percept formation. In the process, we clarify the predictions of the standard model regarding two classic experimental measures in these tasks: sensitivity, which is the animal’s ability to distinguish nearby stimulus values, and choice signals, which assess the amount of correlation between the activity of single neurons and the animal’s ultimate choice on each trial.

Introduction

When we record the responses of sensory neurons to well-controlled stimuli, their spike patterns vary from trial to trial. Does this variability reflect the uncertainties of the measurement process, or does it have a direct impact on behavior? These questions are central to our understanding of percept formation and decision-making in the brain and have been the focus of much previous work [1]. Many studies have sought to address these problems by studying animals that perform simple, perceptual decision-making tasks [2, 3]. In such tasks, an animal is typically presented with different stimuli s and trained to categorize them through a simple behavioral report. When this perceptual report is monitored simultaneously with the animal’s neural activity, one can try to find a causal link between the two.

In terms of signal detection theory, the hypothesis predicts a quantitative match between (1) the animal’s ability to discriminate nearby stimulus values—known as psychometric sensitivity, and (2) an ideal observer’s ability to discriminate nearby stimulus values based on the activities of the underlying neural population—known as neurometric sensitivity. Both types of sensitivities can be quantified as signal-to-noise ratios (SNR). With this idea in mind, several studies have compared the neurometric and psychometric sensitivities in various sensory systems and behavioral tasks (see [6, 7] for reviews).

For example, if neurons in a population behave independently of one another, then the SNR of the population is simply the sum of the individual SNRs. Consequently, any estimate of neurometric sensitivity will grow linearly with the number of recorded neurons K. However, if neurons in a population do not behave independently, the precise growth of neural sensitivity with K is determined by the correlation structure of noise in the population [8–10]. In addition, the neurometric sensitivities also depend on the time scale w that is used to integrate each neuron’s spike train in a given trial [3, 11–13]. Indeed, the more spikes are incorporated in the readout, the more accurate that readout will be. Adding extra neurons by increasing K, or adding extra spikes by increasing w, are two dual ways of increasing the readout’s overall SNR.
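This scaling argument can be sketched numerically. The snippet below computes the SNR of an optimal linear readout, b̄ᵀC̄⁻¹b̄, for a toy homogeneous population with uniform pairwise noise correlations; all numbers (slope b, variance v, correlation rho) are illustrative assumptions, not values from the article.

```python
import numpy as np

# Toy population: K neurons, identical tuning slope b and noise variance v.
# (Hypothetical numbers, chosen only to illustrate the scaling with K.)
b, v = 1.0, 4.0          # per-neuron signal slope and noise variance
rho = 0.1                # uniform pairwise noise correlation

def population_snr(K, rho):
    # SNR of the optimal linear readout, b^T C^-1 b, with uniform
    # correlations: C = v * [(1 - rho) I + rho * ones].
    C = v * ((1 - rho) * np.eye(K) + rho * np.ones((K, K)))
    bvec = b * np.ones(K)
    return bvec @ np.linalg.solve(C, bvec)

snr_indep = [population_snr(K, 0.0) for K in (10, 100, 1000)]
snr_corr = [population_snr(K, rho) for K in (10, 100, 1000)]

print(snr_indep)  # grows linearly: K * b**2 / v
print(snr_corr)   # saturates near b**2 / (rho * v) = 2.5
```

With independent noise the SNR grows without bound, whereas with uniform positive correlations it saturates, which is why the correlation structure, and not just the number of recorded neurons, governs neurometric sensitivity.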

In other words, how many sensory neurons K, and what integration time scale w, provide a relevant description of the animal’s percept formation? Given the “K–w” duality mentioned above, we cannot answer that question based solely on sensitivity (SNR). Another experimental measure should also be included in the analysis.

These signals, weak but often significant, arise from the unknown process by which each neuron’s activity influences—or is influenced by—the animal’s perceptual decision. In two-alternative forced choice (2AFC) discrimination tasks, they have generally been computed in the form of choice probabilities (CP) [14, 15]. The temporal evolution of CPs has been used to find the instants in time when a given population covaries with the animal’s percept [13, 16]. In a seminal study, Shadlen et al. (1996) proposed to jointly use sensitivity and choice signals, as two independent constraints characterizing the underlying neural code [17]. They derived a feed-forward model of perceptual integration in visual area MT, and studied numerically how the population’s sensitivity and CPs vary as a function of various model parameters. They acknowledged the existence of a link between CPs and pairwise noise correlations—both measures being (partial) reflections of how information is embedded in the neural population as a whole (see also [12, 18]). However, the quantitative nature of this link was only revealed recently, when Haefner et al. (2013) derived the analytical expression of CPs in the standard model of perceptual integration [19] (see Methods).

These equations depend on the brain’s readout policy across neurons and time, and hold for any noise correlation structure in the neural population. In accordance with the intuition of Shadlen et al. (1996), we show that sensitivity and choice signals correspond to two distinct, characteristic properties of the readout. The equation describing choice signals is equivalent to the one derived by Haefner et al. (2013), but stripped of the nonlinear complications inherent to the CP formulation. We use a linear formulation instead, which gives us a particularly simple prediction of choice signals at every instant in time.

A quantitative analysis of choice signals allows us to overcome the “K–w trade-off” inherent to neurometric sensitivity. We specifically focus on situations in which only a finite sample of neurons has been measured from a large, unknown population. We show how to recover the typical number of neurons K, provided that the experimenter could record at least K neurons simultaneously. Finally, we discuss the scope and the limitations of our method, and how it can be applied to real experimental data.

Results

Experimental measures of behavior and neural activities

We first summarize the relevant experimental measures (Fig. 1; see Methods and Tables 1–3 for the corresponding formulas). In these experiments, an animal is typically confronted with a stimulus, s, and must then make a behavioral choice, c, according to the rules of the task. A specific example is the classic discrimination task in which the animal’s choice c is binary, and the animal must report whether it perceived s to be higher (c = 1) or lower (c = 0) than a fixed reference s0 (Fig. 1A, top and middle panels). While the animal is performing the task, the neural activity in a given brain area can be monitored (Fig. 1A, bottom panel). Typical examples from the literature include area MT in the context of a motion discrimination task [3], area MT or V2 in the context of a depth discrimination task [11, 20], or area S1 in the context of a tactile discrimination task [21]. For concreteness, we will mostly focus on these discrimination tasks, although the general framework can be applied to arbitrary perceptual decision-making tasks.

Table 1. Variables and notations: typography.

Table 2. Variables and notations: experimental data.
Raw experimental data:
  s — stimulus, a varying scalar value on each trial
  s0 — threshold stimulus value in the 2AFC task
  c* — animal choice, binary report on each trial
  r_i(t) — spike train from neuron i in a given trial
  σ_s² — stimulus variance across trials
Animal psychometry.
Individual statistics for the neurons (linear equivalents of choice probabilities).

The animal’s behavior is traditionally summarized by the psychometric curve, ψ(s), which measures the animal’s distribution of responses at each stimulus value s (Fig. 1B). If the animal is unbiased, it will choose randomly whenever the stimulus s is equal to the threshold value s0, so that ψ(s0) = 1/2. The slope of the psychometric curve at s = s0 determines the animal’s ability to distinguish near-threshold values of the stimulus, i.e., its psychometric sensitivity. We assess this sensitivity through the just noticeable difference (JND) or difference limen, denoted Z. The more sensitive the animal, the smaller Z, and the steeper its psychometric curve.

Table 3. Variables and notations: model and methods.
Linear readout and decision model:
  w — readout window, duration of the temporal integration
  tR — extraction time, time at which the percept is formed
  a_i — readout weight, contribution of neuron i to the percept
  σ_d — decision noise, added to the percept at decision time
Model predictions (characteristic equations):
  ℰ — readout ensemble, neurons used for the readout
  K — readout size, number of neurons in ℰ
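The relation between the psychometric slope at s0 and the JND can be illustrated with a small simulation. The sketch below assumes a cumulative-Gaussian psychometric curve and recovers the observer's noise level from a finite-difference estimate of the slope; the observer model and the particular slope-to-JND conversion are illustrative assumptions, not the article's exact convention.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated observer: percept = s + Gaussian noise of std sigma; the
# choice is c = 1 whenever the percept exceeds s0. (Illustrative values.)
s0, sigma = 0.0, 2.0
stimuli = np.linspace(-4, 4, 9)
n_trials = 20000

# Empirical psychometric curve: fraction of c = 1 choices per stimulus.
psi = np.array([
    np.mean(s + sigma * rng.standard_normal(n_trials) > s0)
    for s in stimuli
])

# Psychometric slope at s = s0, by central finite difference.
i0 = int(np.argmin(np.abs(stimuli - s0)))
ds = stimuli[1] - stimuli[0]
slope = (psi[i0 + 1] - psi[i0 - 1]) / (2 * ds)

# For a cumulative-Gaussian curve, the slope at s0 is 1/(sigma*sqrt(2*pi)),
# so the noise std (and hence the JND, up to convention) follows as:
sigma_hat = 1.0 / (slope * np.sqrt(2 * np.pi))
print(sigma_hat)  # close to the true sigma = 2
```

A steeper curve yields a larger slope and hence a smaller recovered noise level, matching the statement that a more sensitive animal has a smaller Z.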

We assume that the animal’s percept is based on the activity of a large population of sensory neurons (Fig. 1A, bottom). We describe the activity of this neural population on every trial as a multivariate point process r(t) = {r_i(t)}_{i = 1...N_tot}, where each r_i(t) is the spike train for neuron i, and N_tot denotes the full population size, a very large and unknown number. (The number of neurons actually recorded is generally much smaller.) As is common in electrophysiological recordings, we will quantify the raw spike trains by their first- and second-order statistics. First, neuron i’s trial-averaged activity in response to each tested stimulus s is given by the peri-stimulus time histogram (PSTH) or time-varying firing rate, m_i(t; s) (Fig. 1D). In so-called “fine” discrimination tasks, the stimuli s display only moderate variations around the central value s0, so that the PSTH at each instant in time can often be approximated by a linear function of s: m_i(t; s) ≈ m_i⁰(t) + b_i(t) s. The slope b_i(t), defined at every instant in time, summarizes neuron i’s tuning properties (Fig. 1E). Second, we assume that several neurons can be recorded simultaneously, so that we can access samples from the trial-to-trial covariance structure of the population activity (Fig. 1C). For every pair of neurons (i, j) and instants in time (t, u), the joint peri-stimulus time histogram (JPSTH, [22]) C_ij(t, u) summarizes the pairwise noise correlations between the two neurons (eq. 25). For simplicity, we furthermore assume that the JPSTHs do not depend on the exact stimulus value s. Finally, we can measure a choice signal for each neuron, which captures the trial-to-trial covariation of neural activity r_i(t) with the animal’s choice (Fig. 1F). Traditionally, this signal is measured in the form of choice probability (CP) curves. We consider here a simpler linear equivalent, which we term choice covariance (CC) curves [3].
The CC curve for neuron i, denoted by d_i(t), measures the difference in firing rate (at each instant in time) between trials where the animal chose c = 1 and trials where it chose c = 0—all experimental features (including stimulus value) being fixed.
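A CC curve of this kind can be computed directly from binned spike counts. The sketch below builds toy data for one neuron at a fixed stimulus, injects a choice-related rate modulation in a known time interval, and recovers it as the per-bin rate difference between c = 1 and c = 0 trials; all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data at a single (fixed) stimulus: binned spike counts for one
# neuron over n trials, plus the animal's binary choice on each trial.
n_trials, n_bins = 4000, 50
base_rate = 0.2                          # expected spikes per bin

# Inject a choice-related modulation in the middle of the trial:
choice = rng.integers(0, 2, n_trials)
rates = np.full((n_trials, n_bins), base_rate)
rates[:, 20:35] += 0.05 * choice[:, None]
spikes = rng.poisson(rates)

# Choice covariance curve: mean rate difference (c=1 minus c=0) per bin,
# computed with the stimulus held fixed, as in the definition above.
cc = spikes[choice == 1].mean(axis=0) - spikes[choice == 0].mean(axis=0)
print(cc[:5].round(3))   # near zero before the modulation
print(cc[25].round(3))   # elevated inside bins 20-34
```

The curve hovers around zero outside the modulated interval and rises inside it, which is exactly the kind of temporal localization that the article later exploits to recover the integration window.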

Exact formulas for these statistical measures are provided in the Methods. By keeping track of time, we will be able to predict when, and for how long, perceptual integration takes place in an organism.

From the neural activities to the animal’s choice

Our goal is to quantify the mapping from the neural activities, r(t), to the animal’s choice, c. This can be done if we assume (1) how the stimulus information is extracted from the neural activities and (2) how the animal’s decision is formed. For (1) we assume the common linear readout model (Fig. 2A). Here, each neuron’s spike train r_i(t) is first integrated into a single number describing the neuron’s activity over the trial. We write

$$\bar{r}_i = \int r_i(t)\, h\!\left(\frac{t_R - t}{w}\right) dt, \qquad (1)$$

where the kernel h(·) defines the shape of the integration window (e.g., square window, decreasing exponential, etc.), the parameter w controls the length of temporal integration, and the parameter t_R specifies the time at which the percept is built or read out. Second, the actual percept is given by a weighted sum over the neurons’ activities,

$$\hat{s} = \sum_{i=1}^{N_{\mathrm{tot}}} a_i\, \bar{r}_i. \qquad (2)$$

This readout has sometimes been referred to as the “standard” model of perceptual integration [17, 19].

Most often, r̄_i is taken to be the total spike count for neuron i, in which case t_R = w coincides with the end of the stimulation period, and h(·) in eq. 1 is a square kernel. However, this readout is likely incorrect: the length of the integration window w influences the neurometric sensitivity, and experiments suggest that animals do not always use the full stimulation period to build their judgment [23]. Similarly, vector a is often defined over an arbitrary set of neurons, typically those recorded by the experimenter. Again, this choice is arbitrary, and it has a direct influence on the predicted sensitivities.

If a neuron does not contribute to the percept, it simply corresponds to a zero entry in a. For conceptual and implementation simplicity, we take h(·) to be a simple square window (see Discussion for a generalization). Our goal is now to understand whether the readout ŝ can be a good model for the animal’s true percept formation and, if yes, for what set of parameters.
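The two readout stages described above can be sketched in a few lines: temporal integration of each spike train with a square window of length w ending at tR, followed by a weighted sum over neurons. The population size, rates, and the uniform readout weights are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical spike trains: N neurons with Poisson spikes over a 1 s trial.
N, rate, T = 100, 20.0, 1.0
spike_trains = [np.sort(rng.uniform(0, T, rng.poisson(rate * T)))
                for _ in range(N)]

# Square integration window of length w ending at the extraction time tR:
w, tR = 0.05, 0.1   # 50 ms window, percept read out at t = 100 ms

def integrated_activity(spikes, w, tR):
    # Square kernel h: count spikes falling in (tR - w, tR], scaled by 1/w.
    return np.sum((spikes > tR - w) & (spikes <= tR)) / w

rbar = np.array([integrated_activity(st, w, tR) for st in spike_trains])

# Percept: weighted sum of integrated activities (readout vector a).
# Here a is uniform over a subset of K neurons and zero elsewhere, so
# neurons outside the readout ensemble contribute nothing.
K = 30
a = np.zeros(N)
a[:K] = 1.0 / K
percept = a @ rbar
print(percept)  # an estimate of the underlying rate, around 20 Hz
```

Zero entries in a implement exactly the statement above: a neuron that does not contribute to the percept simply drops out of the weighted sum.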

The linear model builds a continuous-valued, internal percept ŝ of the stimulus value by the animal on each trial. To emulate the discrimination tasks, we also need to model the animal’s decision policy, which converts the continuous percept ŝ into a binary choice c. While the linear model is rather universal, the decision model will depend on the specifics of each experimental task. To ground our argument, we model here the required decision in a classic random dot motion discrimination task [3]. However, the ideas herein could also be transposed to other types of behavioral tasks (see Discussion).

At the decision stage, we add a noise term ξ_d to the percept (Fig. 2C). Known in the literature as ‘decision noise’ or ‘pooling noise’, ξ_d encompasses all extra-sensory sources of variation which may influence the animal’s decision. We assume that ξ_d is a Gaussian variable with variance σ_d², which we take as an additional model parameter. Then, the animal’s choice on each trial is built deterministically, by comparing ŝ + ξ_d to the threshold value s0 (Fig. 2C), so that

$$c = H(\hat{s} + \xi_d - s_0), \qquad (3)$$

where H(·) is the Heaviside function. We note that the decision noise is negligible in the classic “sensory noise hypothesis”, in which case σ_d → 0.
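The decision stage is a one-liner once the percept distribution is given. The sketch below draws percepts with an arbitrary mean and sensory noise (illustrative values), adds Gaussian decision noise, and thresholds; the resulting choice fraction matches the closed-form Gaussian prediction.

```python
import numpy as np

rng = np.random.default_rng(4)

# Decision stage: compare the noisy percept to the reference s0.
# sigma_d is the decision ('pooling') noise std; all values illustrative.
s0, sigma_d = 0.0, 0.5
n_trials = 100000

percepts = 0.2 + 1.0 * rng.standard_normal(n_trials)  # mean 0.2, sensory std 1
xi_d = sigma_d * rng.standard_normal(n_trials)        # extra-sensory noise

# c = H(percept + xi_d - s0), with H the Heaviside step function.
choice = (percepts + xi_d > s0).astype(int)

# Expected fraction of c = 1: Phi(0.2 / sqrt(1 + sigma_d**2)), about 0.57.
print(choice.mean())
```

Setting sigma_d to zero recovers the pure “sensory noise hypothesis”, in which all choice variability is inherited from the percept itself.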

The characteristic equations of the standard model

If we had recorded the activities of the entire neural population together with the animal’s behavior, then the parameters of this model could be estimated from the data using any standard regression method. However, this is generally not a realistic experimental situation. Instead, we take here a statistical approach to the problem, which (1) allows us to deal with incomplete recordings and (2) relates the estimation problem to the standard experimental measures described above.

Thanks to its linear structure, the readout defined in eq. 2 induces a simple covariance between the neural activities, r(t), and the resulting percept, ŝ (Fig. 2B). Since the linear readout relies on the integrated spike trains, eq. 1, we need similarly integrated versions of the neural tuning and noise covariances in order to express the respective covariance relations. In general, we will denote these time-integrated quantities by an overhead bar, and alert the reader that the respective quantities depend implicitly on the readout window w and the extraction time t_R. We will write b̄_i for the integrated version of the neural tuning, b_i(t), we will write c̄_ij(t) for the once-integrated noise covariances, and C̄_ij for the doubly integrated noise covariance. This latter quantity, known in the literature as the ‘noise covariance matrix’, measures how the spike counts of two neurons, r̄_i and r̄_j, covary due to shared random fluctuations across trials (stimulus s being held fixed). We can then summarize the covariances between neural activities and the resulting percepts by three characteristic equations (see Methods):

$$\frac{\partial}{\partial s} E[\hat{s} \mid s] = \sum_i a_i\, \bar{b}_i, \qquad (4)$$

$$\mathrm{Var}(\hat{s} \mid s) = \sum_{i,j} a_i\, a_j\, \bar{C}_{ij}, \qquad (5)$$

$$\mathrm{Cov}(r_i(t), \hat{s} \mid s) = \sum_j a_j\, \bar{c}_{ij}(t). \qquad (6)$$

On the left-hand sides of eqs. 4–6, we find statistical quantities related to the percept ŝ. On the right-hand sides of these equations, we find the model’s predictions, which are based on the neurons’ (measurable) response statistics, b̄ and C̄. More specifically, the first line describes the average dependency of ŝ on stimulus s, the second line expresses the resulting variance for the percept, and the third line expresses the linear covariance between each neuron’s spike train and the animal’s percept ŝ on the trial.
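These covariance identities are elementary properties of any linear readout and can be checked numerically. The sketch below draws zero-mean integrated activities with a known covariance matrix (toy numbers), forms the percept as a weighted sum, and compares the empirical variance and neuron-percept covariances with the corresponding matrix expressions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic integrated activities with a known noise covariance Cbar.
N, n_trials = 5, 200000
Cbar = np.diag([1.0, 2.0, 1.5, 1.0, 0.5]) + 0.2   # positive definite
L = np.linalg.cholesky(Cbar)
rbar = rng.standard_normal((n_trials, N)) @ L.T    # zero-mean, cov ~ Cbar

# Percept: weighted sum of the integrated activities (toy weights).
a = np.array([0.4, 0.1, 0.2, 0.2, 0.1])
percept = rbar @ a

# Identities: Var(percept) = a^T Cbar a; Cov(rbar_i, percept) = (Cbar a)_i.
print(percept.var(), a @ Cbar @ a)
print(np.cov(rbar.T, percept)[:N, N])
print(Cbar @ a)
```

The empirical values agree with the matrix expressions up to sampling noise, which is precisely why the characteristic equations let one trade recorded covariances for readout parameters.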

To produce a binary choice, the continuous percept ŝ is fed into the decision model (Fig. 2C). From the output of this decision model, we obtain a second set of characteristic equations (see Methods),

The second equation relates the JND, Z, extracted from the psychometric curve, to the variance in the percept, ŝ. The third equation restates the definition of choice covariance, except for the scaling factor, κ(Z), which will be constant for most practical purposes, and is described in detail in the Methods (eq. 46). Hence, in our full model of the task, we are able to predict both the psychometric sensitivity and the individual neurons’ choice signals from the first- and second-order statistics of the neural responses. Specifically, by combining the characteristic equations for the linear readout and the decision policy, we obtain predictions for Z and for the CC curves d_i(t) (eqs. 7–9). Importantly, since these equations deal with integrated versions of the raw neural signals, they depend on both the readout time window, w, and the extraction time, t_R. We note that the choice covariance equation (eq. 9) can also be derived in a simpler, time-integrated form, which provides the linear covariance between each neuron’s spike count r̄_i on the trial, and the animal’s choice. This is essentially the relationship already revealed by Haefner et al. (2013) [19], that choice probabilities are related to readout weights through the noise covariance matrix. The simpler linear measure of choice covariance, used in this article, allows us (1) to get rid of some non-linearities inherent to the choice probability formulation, and (2) to easily extend the interpretation of choice signals in the time domain, with eq. 9.

Estimating the parameters of sensory integration

This naturally raises the reverse question: can we estimate the parameters of the standard model (a, w, t_R, and σ_d) from actual measurements? From here on, we will denote the true (and unknown) values of these parameters, i.e., the values used in the animal’s actual percept formation, with a star (a*, w*, t_R*, and σ_d*).

Thus, we assume that the animal’s percept is constructed from a specific sub-ensemble ℰ* of neurons, of size K* (Fig. 3A). Neurons inside ℰ* correspond to nonzero entries in the readout vector a*, while neurons outside ℰ* have zero entries. Since only a subset of neurons within a cortical area will project to a downstream area, we can generally assume that K* < N_tot.

For any candidate set of parameters, a, w, t_R, and σ_d, the characteristic equations 7–9 lead to predictions for Z and d_i(t) (note the absence of star when referring to predictions). In turn, the experimenter can measure the animal’s actual choice c* on each trial, from which they can estimate the JND Z*, and the CC curves d_i*(t) for all recorded neurons. In the next three sections, we study whether this information is sufficient to retrieve the true readout parameters, depending on the amount of data available.

In most experimental recordings, however, we only measure the activities of a small subset of that population (Fig. 3A). If this subset is representative of the full population, we may want to retrieve the readout parameters through extrapolation. Unfortunately, any such extrapolation is fraught with additional assumptions—whether implicit or explicit—as it requires replacing the missing data with some form of (generative) model. In Case 2, we impose a generative model for the readout vector a. Coupled with a statistical principle, it allows us to estimate the true size K* of the readout ensemble, provided that the number of neurons recorded simultaneously, N, is larger: N > K*. In Case 3, we study the scenario in which N ≤ K*. Here, we need to assume a generative model for the neural activities themselves. Since the noise covariance structure assumed by that model exerts a strong influence on the predicted JND and CC curves, a direct inference of the readout scales becomes impossible.

Case 1: all cells recorded

For fixed parameters w and t_R, eqs. 7 and 9 impose linear constraints on vector a. These constraints are generally over-complete, since a is N_tot-dimensional, while each time t in eq. 9 provides N_tot additional linear constraints. Thus, in general, a solution a will only exist if one has targeted the true parameters w* and t_R*, and it will then be unique. (If no choice of the readout parameters approximately fulfills the characteristic equations, we would have to conclude that the linear readout model is fundamentally wrong.) In practice, we can find the best solution to the characteristic equations by simply combining them and then minimizing a mean-square error, L (eq. 11), where the parameters λ and γ trade off the importance of the errors in the different characteristic equations. Note that the loss function L depends not only on the readout weights a and the decision noise σ_d, but also on the parameters w and t_R, both of which enter all the time integrations that are denoted by an overhead bar. Once vector a* is estimated, the readout ensemble ℰ* will correspond to the set of neurons with nonzero readout weights.

Case 2: more than K* cells recorded

Unfortunately, measuring the neural activity of a full population is essentially impossible, although optogenetic techniques are coming ever closer to this goal [24–26]. Nevertheless, if the activity patterns of the recorded cells are statistically similar to those of the readout ensemble, and if the number of simultaneously recorded cells exceeds the number of cells in the readout ensemble, we can still retrieve the readout parameters by making specific assumptions about the true readout vector a*.

Our central assumption will be that the system uses the principle of restricted optimality: we assume that the readout vector a* extracts as much information as possible from the neurons within the readout ensemble, ℰ*, and no information from all other neurons. Since most of the neurons contributing to the readout were probably not recorded, we cannot directly estimate the true readout vector, a*. However, we can form candidate ensembles from the recorded pool of neurons, ℰ, compute their optimal readout vector, a^r(ℰ), and then test to what extent these candidate ensembles can predict the JND or the CC curves (Fig. 3C). By changing the size of the candidate ensembles, K, we can in turn infer the number of neurons involved in the readout. For an arbitrary candidate ensemble ℰ, we can express its optimal readout vector, a^r(ℰ) := {a_i}_{i ∈ ℰ}, on the basis of the neurons’ tuning and noise covariance, through a formula known as Fisher’s linear discriminant [27]:

$$a^r(\mathcal{E}) = \frac{\bar{C}_{\mathcal{E}}^{-1}\, \bar{b}_{\mathcal{E}}}{\bar{b}_{\mathcal{E}}^{\top}\, \bar{C}_{\mathcal{E}}^{-1}\, \bar{b}_{\mathcal{E}}}. \qquad (12)$$

The remaining neurons in the population do not participate in the readout. The resulting readout vector verifies eq. 7, and minimizes the just noticeable difference Z under the given constraints. Specifically, by entering the optimal readout into eq. 8, we obtain a prediction for the JND, Z (Fig. 3D, eq. 13).
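A restricted-optimal readout of this form is easy to compute numerically. The sketch below builds a toy population (random tuning slopes and a random positive-definite noise covariance, purely illustrative), computes the Fisher-discriminant readout restricted to a candidate ensemble with the normalization Σ_i a_i b̄_i = 1, and checks that enlarging the ensemble can only reduce the percept's noise variance.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy population statistics (illustrative, not from the article).
N = 50
bbar = rng.uniform(0.5, 1.5, N)                   # integrated tuning slopes
A = rng.standard_normal((N, N)) / np.sqrt(N)
Cbar = A @ A.T + 0.5 * np.eye(N)                  # integrated noise covariance

def optimal_readout(idx):
    # Fisher-discriminant readout restricted to ensemble idx, zero outside,
    # normalized so that sum_i a_i * bbar_i = 1 (an unbiased percept).
    b_e, C_e = bbar[idx], Cbar[np.ix_(idx, idx)]
    a_e = np.linalg.solve(C_e, b_e)
    a = np.zeros(N)
    a[idx] = a_e / (b_e @ a_e)
    return a

a_small = optimal_readout(np.arange(10))   # candidate ensemble of K = 10
a_full = optimal_readout(np.arange(N))     # readout over all N neurons

# Percept noise variance a^T Cbar a: the larger ensemble can only do better.
print(a_small @ Cbar @ a_small, a_full @ Cbar @ a_full)
```

This monotonic improvement with ensemble size is the numerical face of the K-versus-sensitivity trade-off exploited in the following sections.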

Instead, we re-express this equation in terms of two population-wide indicators that summarize the CC signals of the individual neurons. The first indicator, q(u, t) (eq. 14), assesses the population-wide link between a neuron’s tuning at each time u and its choice covariance at each time t. The second indicator, V (eq. 15), measures the average deviation from this link (see also Methods). Here, the angular brackets denote averaging over the full neural population—or, in practice, over a representative ensemble of neurons (see Methods on how to construct this from actual data).

Empirically, tuning and choice covariance tend to be positively correlated across neurons (Fig. 3E)—likely due to the fact that positively-tuned neurons contribute positively to stimulus estimation, and negatively-tuned neurons negatively. This correlation can be quantified under the assumption of restricted optimality (see Methods). The indicator q(u, t) has a simple interpretation which we will illustrate by focusing on its doubly time-integrated version, q̄ = ⟨b̄_i d̄_i⟩_i. When we seek to predict a neuron’s choice covariance d̄_i from its tuning b̄_i, then q̄ is the best regression coefficient, so that d̄_i ≈ q̄ b̄_i across the population (Fig. 3F). A similar relation holds for the time-dependent indicator q(u, t).
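The regression interpretation can be sketched as follows. The snippet generates toy tuning values and choice covariances that are mostly proportional, then computes the best through-the-origin regression coefficient and the mean squared residual; the through-origin normalization ⟨b̄d̄⟩/⟨b̄²⟩ and the residual-based deviation measure are common conventions used here for illustration, not the article's exact definitions of q̄ and V.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy population: each neuron's (integrated) choice covariance dbar is
# roughly proportional to its tuning bbar, plus idiosyncratic scatter.
N = 200
bbar = rng.standard_normal(N)
dbar = 0.3 * bbar + 0.05 * rng.standard_normal(N)

# Best through-the-origin regression coefficient of dbar on bbar:
qbar = np.mean(bbar * dbar) / np.mean(bbar ** 2)

# Residual deviation from exact proportionality (a V-like indicator):
V = np.mean((dbar - qbar * bbar) ** 2)
print(qbar, V)  # qbar near 0.3; V near 0.05**2
```

When the readout is optimal over the whole population, the scatter vanishes and the V-like residual tends to zero, which is the property used below to pin down K.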

We set a number of potential values for the parameters K, w, t_R, and σ_d, and systematically explore all their possible combinations. For each tested value of the readout ensemble size, K, we repeatedly pick a random neural ensemble ℰ of size K from the pool of neurons recorded by the experimenter, and propose it as the source of the animal’s percept (Fig. 3C). Then, we compute the average indicators across ensembles of similar size (see Methods), which we will denote by ⟨Z⟩_ℰ, ⟨q⟩_ℰ, and ⟨V⟩_ℰ. Note that all of these indicators depend on the parameters w, t_R, K, and σ_d. Finally, we replace the loss function of Case 1 (eq. 11) by a “statistical” loss function (eq. 16). The minimum of the loss function then indicates what values of the readout parameters agree best with the recorded data.
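The overall procedure is a grid search over candidate parameters. The sketch below shows only the search scaffolding: the loss is a toy quadratic surrogate with a known optimum standing in for the real data-driven indicators (JND, q, V), so the grids and penalty scales are illustrative assumptions.

```python
# Schematic grid search over candidate readout parameters, in the spirit
# of the statistical loss of Case 2. The true values below play the role
# of the "measured" indicators; the loss is a toy stand-in.
true = dict(K=80, w=50, tR=100, sd=1.0)

def loss(K, w, tR, sd):
    # Toy loss: squared distance of each candidate from its "measured"
    # counterpart, each term scaled by its grid resolution.
    return ((K - true['K']) / 10) ** 2 + ((w - true['w']) / 10) ** 2 \
         + ((tR - true['tR']) / 10) ** 2 + ((sd - true['sd']) / 0.25) ** 2

grid_K = range(10, 210, 10)
grid_w = range(10, 210, 10)
grid_tR = range(10, 310, 10)
grid_sd = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]

best = min((loss(K, w, tR, sd), K, w, tR, sd)
           for K in grid_K for w in grid_w
           for tR in grid_tR for sd in grid_sd)
print(best[1:])  # -> (80, 50, 100, 1.0)
```

With real data the loss surface is noisy rather than exactly quadratic, which is why the article averages the indicators over many random ensembles of each size before minimizing.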

To validate these claims, we have tested our method on synthetic data—which are the only way to control the true parameters of integration, and thus to test our predictions. We implemented a recurrent neural network with N_tot = 5000 integrate-and-fire neurons that encodes some input stimulus s in the spiking activity of its neurons, and we built a perceptual readout from that network according to our model, with parameters K* = 80 neurons, w* = 50 ms, t_R* = 100 ms, and σ_d* = 1 stimulus unit (see Methods for a description of the network, and supporting S1 Text).

Each indicator plays a specific role in recovering some of the parameters. First, indicator q(u, t) allows us to recover the temporal parameters of integration (w, t_R). Indeed, it characterizes the time interval during which the population—as a whole—shows the strongest choice covariance (Fig. 5A), and the bounds of this interval are essentially governed by parameters (w, t_R) (Fig. 5B). As a result, the match between true measure and prediction—the second term in eq. 16—shows a clear optimum near the true values (w*, t_R*) (Fig. 5C). The bi-temporal structure of q(u, t) in eq. 14, with time index u corresponding to the neurons’ tuning b_i(u), stabilizes the results by ensuring that q(u, t) is globally positive.

Fig. 5D depicts the predicted value for Z as a function of w and K. The mark of the ‘K–w trade-off’ is visible: higher sensitivity to the stimulus can be achieved either through longer temporal integration (w), or through larger readout ensembles (K). Analytically, the JND Z depends on w because the covariance matrix C̄ will generally scale with w⁻¹ (under mild assumptions, supporting S1 Text). The red curve marks the pairs (K, w) for which the prediction matches the measured JND Z*—thereby minimizing the first loss term in eq. 16. The true parameters (K*, w*) lie along that curve (white square in Fig. 5D). Since w* is recovered independently thanks to indicator q(u, t), this in turn allows us to recover parameter K*.

But in the general case, the observed JND Z* can also be influenced by extraneous sources of noise in the animal’s decision, which bias the comparison between Z* and its prediction. To account for this potential effect, our model includes the decision-noise term σ_d. For a fixed value of w, the JND Z is influenced both by parameters K and σ_d (eq. 13, Fig. 5E). However, both parameters can be disentangled thanks to the third indicator, V, which depends mostly on K (Fig. 5F). The significance of V hinges on the following result, first shown in [19]: when the readout is truly optimal over the full population (K = N_tot), then each neuron’s choice covariance d̄_i is simply proportional to its tuning b̄_i (see Methods). Since the indicator V quantifies the deviations from perfect proportionality between b̄_i and d̄_i (eq. 15, Fig. 3F), it becomes a marker of the readout’s global optimality, and decreases to zero as K grows to large populations. At the same time, the dependency of V on parameter σ_d is minimal, and limited to the influence of the scaling factor κ(Z) in eq. 9 (see Methods).

In our simulation, the minimum was achieved for the following values: w = 50 ms, t_R = 100 ms, K = 60 neurons, σ_d = 0.25 stimulus units (with the following levels of discretization: 10 neurons for K, 0.25 stimulus units for σ_d, 10 ms for w and t_R).

The temporal parameters (w, t_R) are recovered with good precision (panel A). Conversely, parameters K and σ_d are somewhat underestimated (panels B and C) compared to their true values (black square). Indeed, the values of K and σ_d are disentangled thanks to indicator V which, of the three indicators introduced, is the most subject to measurement noise. As a result, the match between V* and its prediction V is not as precise as for the other two indicators: see Fig. 5F. Nevertheless, the true values are rather close to the final estimates, lying within the 1-standard-deviation confidence region (Fig. 6C). Importantly, only a reasonable amount of data is required to produce these estimates. Network activity was monitored on 15 independent runs, each run consisting of 180 repetitions for each of the 3 stimuli. On each run, a different set of N = 170 random neurons was simultaneously recorded—out of a total population of N_tot = 5000. As a result, (i) individual neuron statistics such as C_ij(t, u) or d_i(t) display an important amount of measurement noise, and (ii) population statistics such as indicator V are computed from relatively few neurons i. Numerically, this noisiness introduces a number of biases in the above indicators, such as overfitting, which require counteracting with specific corrections (see Methods and supplementary material for details). Naturally, the width of the confidence intervals in Fig. 6 is directly related to the amount of data available. In conclusion, if the data conform to a number of hypotheses (optimal linear readout from a neural ensemble typical of the full population, and smaller than the recording pool size), then it is possible to estimate the underlying readout’s parameters from a plausible amount of experimental samples.

Case 3: less than K* cells recorded

If N is smaller than the true size K*, the method will provide biased estimates. In current-day experiments, N can range from a few tens to a few hundred neurons. While it is not excluded that typical readout sizes K* are of that magnitude in real neural populations (as suggested, e.g., by [8]), it is also possible that they are larger. In this case, the only way to estimate the readout parameters is to make specific assumptions about the nature of the full population activity. In turn, the extrapolated results will depend on these assumptions.

To investigate the underlying issues, and to explain why there is no “natural” extrapolation, we will study how the indicators Z, d̄, and V defined above evolve as a function of the number of neurons K used for the readout. For simplicity, we assume a fixed choice of (w, tR) and focus on the time-integrated neural activities r̄i (eq. 1). We also suppose that the decision noise σd = 0 is negligible. Finally, we consider alternative definitions for the indicators Z and d̄ that simplify the following analysis. We define the sensitivity Y := σs²/(σs² + Z²), where σs² is the variance of the tested stimuli, i.e., σs² := E[s²] − E[s]². The sensitivity Y is simply an inverse reparametrization of the JND, Z. More specifically, Y is the ratio between the signal-related variance and the total variance (see Methods), which grows from zero (if Z = ∞) to one (if Z = 0) as the readout's sensitivity to the stimulus increases. As for Q, it is simply a convenient linear rescaling of d̄. Then, we reexpress the population activity through a singular value decomposition (SVD) (see Methods for details). Specifically, we write the time-averaged activity of neuron i for the q-th presentation of stimulus s as

is the trial-averaged activity of each neuron. This decomposition is best interpreted as a change of variables, which reexpresses the neural activities {r̄i} in terms of a new set of variables, {vm}m=1...M, which we will call the activations of the population's modes. These modes can be viewed as the underlying “patterns of activity” that shape the population on each trial. Each mode m has a strength λm > 0 which describes the mode's overall impact on population activity. We assume λ1 ≥ ... ≥ λM, so we progressively include modes with lower strengths (Fig. 7A). The vector um is the “shape” of mode m and describes how the mode affects the individual neurons. Finally, vm(sq) is the mode's activation variable, which takes a different (random) value on every trial q for a given stimulus s. The number of modes M is the intrinsic dimensionality of the neural population's activity. In real populations we may expect M < Ntot, because neural activities are largely correlated. Since the singular value decomposition is simply a linear coordinate transform, we can redefine all quantities with respect to the activity modes. Of particular interest is the sensitivity ym of each mode, which is the square of its respective tuning parameter (see Methods). If the readout vector a is chosen optimally over the full population, the resulting percept's sensitivity will be a simple sum over the modes: Ytot = Σm ym.
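As a concrete illustration, this decomposition can be computed with a standard SVD routine; the sketch below uses invented sizes and mode strengths, not those of the paper's network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented sizes: Ntot neurons, Q trials, activity built from M_true latent
# modes so that the centered data matrix R has low rank.
Ntot, Q, M_true = 50, 5000, 4
strengths = np.array([5.0, 3.0, 2.0, 1.0])
U_true = np.linalg.qr(rng.normal(size=(Ntot, M_true)))[0]    # orthonormal shapes
activations = rng.normal(size=(M_true, Q)) * strengths[:, None]
R = U_true @ activations                 # centered activities, Ntot x Q

# Compact SVD: R = U @ diag(lam) @ Vt. Columns of U play the role of the
# mode shapes u_m; the singular values lam are the mode strengths lambda_m.
U, lam, Vt = np.linalg.svd(R, full_matrices=False)

# lam comes out sorted (lambda_1 >= lambda_2 >= ...), and only ~M_true values
# are appreciably nonzero: the population activity is low-dimensional.
```

In practice the number of retained modes M would be set by thresholding the singular values against the measurement noise floor.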

Note the presence of a “dominant” mode for the sensitivity. This seems to be a rather systematic effect, which arises because the definition of total covariance (Methods, eq. 50) favors the appearance of a mode almost collinear with b. Even so, this dominant mode accounted only for 71% of the population's total sensitivity, so the residual sensitivity in the other modes is generally not negligible.

However, we wish to study the more general case where the readout is built from sub-ensembles of size K. In such a case, not all modes are equally observable, and we rather need to introduce a set of fractions, {em(K)}m=1...M, that express to what extent each mode m is “observed”, on average, in sub-ensembles ℰ of size K (see Methods for a precise definition). Modes with larger power λm tend to be observed more, so em(K) globally decreases with m. Conversely, em(K) naturally increases with K. For the full population, em(Ntot) = 1 for all modes m, meaning that all modes are fully observed (see Fig. 7C; here, the mode observation fractions were empirically computed by averaging over random neural sub-ensembles). Using these fractions, we can analytically approximate the values of Y and Q which are expected, on average, if the readout is based on ensembles of size K:

As em(K) progressively reveals modes with lower power λm, this average power is expected to decrease with K. Again, the minimum value is reached when all nonzero ym are revealed (Fig. 7E).
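The mode observation fractions em(K) can be estimated empirically exactly as described, by averaging the mode-space projector over many random sub-ensembles; a minimal sketch with an invented mode spectrum (the projector-based definition follows the Methods):

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented mode structure: Ntot neurons, M modes with decaying strengths.
Ntot, M = 60, 6
lam = np.array([8.0, 5.0, 3.0, 2.0, 1.0, 0.5])
U = np.linalg.qr(rng.normal(size=(Ntot, M)))[0]
X_full = U * lam                         # row i = neuron i's mode-space coordinates

def observation_fractions(K, n_samples=400):
    """Average the diagonal of the projector P(E) over random size-K ensembles."""
    diag = np.zeros(M)
    for _ in range(n_samples):
        idx = rng.choice(Ntot, size=K, replace=False)
        X = X_full[idx]                  # K x M design matrix of the sub-ensemble
        diag += np.diag(np.linalg.pinv(X) @ X)   # projector onto the row span
    return diag / n_samples

e_small = observation_fractions(2)       # e_m(K = 2)
e_full = observation_fractions(M)        # K >= M: every mode fully observed
```

The sketch reproduces the two properties stated in the text: the fractions sum to K, and they grow towards 1 as K increases.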

As for the third indicator used in Case 2, V, it can also be expressed in the SVD basis (see Methods). However, being a second-order variance term, its approximation based solely on the average fractions {em(K)}, as in eq. 21–22, is generally poor.

What do these results imply in terms of extrapolation to larger neural ensembles than those recorded by the experimenter? Arguably, eq. 21–22 constitute an interesting basis for principled extrapolations to larger sizes K. These equations show that the evolution of Y and Q in growing ensembles of size K is mostly related to the interplay between the modes' sensitivity spectrum {ym} and their power spectrum {λm}. (Empirically, the observation fractions {em(K)} seem primarily governed by the decay rate of {λm}, although the analytical link between the two remains elusive.) However, note that the spectra {ym}, {λm} and {em(K)} are generally not accessible to the experimenter; this would precisely require having recorded at least N > M neurons, and potentially the whole neural population if M = Ntot. To extrapolate sensitivity Y(K) in ensembles of size K larger than those monitored, one must (implicitly or explicitly) assume a model for {λm} and {ym}, which amounts to characterizing the relative embedding of signal and noise in the full population [28]. A number of reasonable heuristics could be used to produce such a model. For example, one may assume a simple distribution for {λm}, such as a power law, and estimate its parameters from recorded data. Alternatively, it is often assumed that the noise covariance matrix is “smooth” with respect to the signal covariance matrix, so that the former can be predicted on the basis of the latter [19, 29]. Finally, the extrapolation could rely on more specific assumptions about how neural activities evolve, e.g., through linear dynamics with additive noise [30]. In all cases, the additional assumptions impose (implicit) constraints on the structure of the spectra {λm} and {ym}.

For example, one can imagine scenarios in which the most sensitive modes (those with highest ym) correspond to very local circuits of neurons, independent from the rest of the population, and thus invisible to the experimenter (see also [19]). Another pathological situation could be a neural network specifically designed to dispatch information non-redundantly across the full population [31, 32], resulting in a few ‘global’ modes of activity with very large SNR (meaning high ym and low λm). As a result, extrapolation to neural populations larger than those recorded is never trivial, and always subject to some a priori assumptions. The most judicious assumptions, and the extent to which they are justified, will depend on each specific context.

Discussion

Our study describes percept formation within a full sensory population, and proposes novel methods to estimate its characteristic readout scales on the basis of realistic samples of experimental data. Here, we briefly discuss the underlying assumptions and their restrictions, the possibility of further extensions, and the applicability to real data.

The linear readout assumption

The linear readout (eq. 2) used to analyze sensitivity and choice signals is an instance of the “standard”, feed-forward model of percept formation [17, 19]. As such, it makes a number of hypotheses which should be understood when applying our methods to real experimental data. First, it assumes that the percept ŝ is built linearly from the activities of the neurons, a common assumption which greatly simplifies the overall formalism (but see, e.g., [33] for a recent example of nonlinear decoding). Even if the real percept formation departs from linearity, fitting a linear model will most likely retain meaningful estimates for the coarse information (temporal scales, number of neurons involved) that we seek to estimate in our work.

Theory does not prevent us from studying a more general integration, where each neuron i contributes with a different time course hi(t). The readout's characteristic equations are derived equally well in that case. Rather, assuming a separable form reflects our intuition that the time scale of integration is somewhat uniform across the population. This time scale, w, is then the one crucial parameter of the integration kernel. Although the shape h(t) of the kernel could also be fit from data in theory, it seems more fruitful to assume a simple shape from the start. We assumed a classic square kernel in our applications. Other shapes may be more plausible biologically, such as a decreasing exponential mimicking synaptic integration by downstream neurons. However, given that our goal is to estimate the (coarse) time scale of percept formation, our method will likely be robust to various simple choices for h. As a simple example, we tested our method, assuming a square kernel, on data produced by an exponential readout kernel, and still recovered the correct parameters w, tR and K (data not shown).
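To illustrate why the method is mostly sensitive to the coarse time scale w rather than the kernel's exact shape, here is a toy comparison (all numbers invented) of a square and an exponential kernel applied to the same spike trains:

```python
import numpy as np

rng = np.random.default_rng(11)

# Invented numbers: a 20 Hz Poisson spike train, integration scale w = 100 ms,
# readout time tR = 800 ms, 2000 independent trials.
dt, rate = 1e-3, 20.0
t = np.arange(0.0, 1.0, dt)
w, tR = 0.1, 0.8
spikes = (rng.random((2000, t.size)) < rate * dt).astype(float)

# Square kernel: count spikes in the window (tR - w, tR].
sq = (t > tR - w) & (t <= tR)
r_square = spikes[:, sq].sum(axis=1) / w          # rate estimates, in Hz

# Exponential kernel of time constant w, mimicking synaptic integration.
expo = np.exp(-(tR - t) / w) * (t <= tR)
r_expo = (spikes * expo).sum(axis=1) / (expo.sum() * dt)

# Both kernels recover the same underlying rate on average; only the fine
# weighting of recent versus older spikes differs.
```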

In our model, the animal's estimate at time tR serves as the basis for its behavioral report (Fig. 2A), and we designate this single number ŝ as the “percept”. A second strong assumption of our model is that this perceptual readout occurs at the same time tR on every stimulus presentation. In reality, there is indirect evidence that tR could vary from trial to trial, as suggested by the subjects' varying reaction times (RT) when they are allowed to react freely [34, 35]. In such tasks, we expect the variations in tR to be moderate, because subjects generally react as fast as they can, and we may even try to correct for fluctuations across trials by measuring RTs. On the other hand, when subjects are forced to wait for a long period of time before responding, there is room for ample variations in tR from trial to trial, and the model presented above may become insufficient.

The main impact of this modification is on CC curves, which become broader and flatter; essentially, the resulting curve resembles a convolution of the deterministic CC curve with g(t) (Fig. 8A). This means that if a behavioral task is built such that tR can display strong variations from trial to trial, the methods introduced above will produce biased estimates. In theory, this issue could be resolved by adding an additional parameter to the analysis, to describe g(t) (see supporting S1 Text).

The decision model

In principle, behavioral experiments could be set up such that the subject directly reports this percept, so that c = ŝ. Such experiments could be treated completely without a decision model. However, almost all experiments that have been studied in the past involve a more indirect report of the animal's percept. In these cases, some assumptions must be made about how the percept is transformed into the behavioral report c. In the choice of a decision model, we have followed the logic of the classic random dot motion discrimination task [3], in which a monkey observes a set of randomly moving dots whose overall motion is slightly biased towards the left (s < 0 in our notations) or towards the right (s > 0).

The monkey must then press either of two buttons depending on its judgment of the overall movement direction. The simplest decision model assumes a fixed integration time window, additive noise on the percept, §, and an optimal binary decision. A slightly more sophisticated model, the “integration-to-bound” model, assumes that the integration time is not fixed, but rather limited by a desired behavioral accuracy. This model requires variable readout windows, rather than the fixed readout window assumed here, and will require further investigation in the future.

They must press either of two buttons depending on whether they consider that s1 > s2 or not. In this task, the optimal behavioral model would be c = H(ŝ1 − ŝ2), with H the Heaviside step function. In reality, however, the monkey needs to memorize s1 for a few seconds before s2 is presented, so potential effects of memory loss may also come into play (see e.g. [36] for a study of these problems).

Choosing a relevant behavioral model is a connected problem that cannot be addressed here, and that will vary depending on the task and individual considered. For most tractable behavioral models, the predicted sensitivities and choice signals will ultimately rely on the quantities introduced in this article.

The feedforward assumption

The activities ri(t) of the sensory neurons are integrated to give rise to the percept ŝ and the animal's choice c, yet the formation of this decision does not affect the sensory neurons in return. Recent evidence suggests that reality is more complex. By looking at the temporal evolution of CP signals in V2 neurons during a depth discrimination task, Nienborg and Cumming (2009) revealed dynamics which are best explained by a top-down signal, biasing the activity of the neurons on each trial after the choice is formed [20]. In our notations, the population spikes ri(t) would thus display a choice-dependent signal which kicks in on every trial after time tR, resulting in CC signals that deviate from their prediction in the absence of feedback (Fig. 8B).

The answer depends on the nature of the putative feedback. If the feedback depends linearly on percept ŝ (and thus, on the spike trains), its effects are fully encompassed in our model. Indeed, this feedback signal will then be totally captured by the neurons' linear covariance structure Cij(t, u), so that our predictions will naturally take it into account. On the other hand, if the feedback depends directly on the choice c, which displays a nonlinear, “all-or-none” dependency on ŝ, then it will not be captured by our model, and may lead to biases. Even so, our model would still apply if percept and decision were essentially uncoupled before the putative extraction time tR, in which case one could simply compare true and predicted CC signals up to (candidate) time tR (see Fig. 8B).

Undersampled neural populations

Our solution relies on an assumption of restricted optimality, based on Fisher's linear discriminant formula (eq. 12). By assuming that the readout is made optimally from some unknown neural ensemble ℰ, we reformulated the problem of characterizing a into that of characterizing ℰ, and could in turn exploit the characteristic equations 4–6 statistically.

This potential discrepancy from the true readout is inescapable once we start representing a through a statistical model. However, note that our model uses two distinct sources of non-optimality: (1) the size K of the readout ensemble, which can be much smaller than the full population, and (2) the decision noise σd, which adds a ‘global’ non-optimality to the readout. Arguably, by combining both factors, our chosen model for a will be flexible enough to provide meaningful estimates when fit to real data.

Past work has often shown that small ensembles of neurons are sufficient to account for an animal's behavior [3, 37]. However, there is an inherent tradeoff between the number of neurons and the time scale of integration. One simple explanation for the small sizes of previously inferred readout ensembles is that the true readout time scales used by subjects are much shorter. Unfortunately, as detailed above (Case 3), extrapolations from a finite-size recording onto the whole population always come at the price of strong additional assumptions. However, as experimental techniques advance, and as the number of simultaneously recorded neurons reaches the number of neurons involved in the readout, we will eventually be able to directly infer the readout parameters from the data. In this case, our method can readily be tested on real data, and hopefully provide new insights into the nature of percept formation from populations of sensory neurons.

Methods

First, we set our basic notations and definitions. Second, we derive the characteristic equations of the model, both for the linear part and the decision part. Third, we detail the predictions in case of an optimal readout from some neural sub-ensemble ℰ. Fourth, we reexpress these predictions in the basis of the population's SVD modes. Finally, we detail our methodology to empirically estimate the quantities used in this article from limited amounts of experimental data. Tables 1–3 summarize the main variables and notations used in the article.

Statistical notation

An example is the spike count of a single neuron. Trials in turn can be grouped by stimulus s or choice c. We can make this explicit by writing xscq to denote the q-th trial in which the stimulus was s and the subject's choice was c. Given such a variable, we will write E[x] for its expectation value, i.e., for the hypothetical value this quantity would take if it could be averaged over infinitely many trials. We will write E[x|s] for the expectation value conditioned on stimulus s, i.e., for the expectation value computed over all trials in which the stimulus was s. A similar notation holds when conditioning on choices c. We note that for quantities that are already conditional expectations, for instance, y(s) = E[x|s], their expectation value E[y(s)] will average out the stimuli according to their relative probabilities, i.e., E[y(s)] = Σs p(s) y(s). Thereby, each stimulus s contributes to the expectation in proportion to the number of trials associated to it. Then the notations are coherent, since we have E[E[x|s]] = E[x]. Covariances are generically defined as Cov[x, y] = E[xy] − E[x]E[y], and variances as Var[x] = Cov[x, x]. For vectorial quantities, Cov[x, y] denotes the corresponding covariance matrix.

Experimental statistics of neural activity and choice

Classic measures in decision-making experiments can be interpreted as estimates of the first- and second-order statistics of choice c and recorded spike trains ri(t), across all trials with a fixed stimulus value s:

The choice covariance (CC) curve di(t; s) is our proposal for measuring each neuron's “choice signal”. Theoretically, the temporal signals in eq. 24–26 are well-defined quantities in the framework of continuous-time point processes [38]. In practice, they are estimated by binning spike trains ri(t) with a finite temporal precision, depending on the amount of data available. From the psychometric curve, we also derive two simpler quantities: the animal's just-noticeable difference (JND), Z, and decision bias γd. We obtain them as the best (MSE) fit to the following formula, where Φ is the standard cumulative normal distribution. Z measures the inverse slope of the psychometric curve (up to a scaling factor √(2π)). The decision bias γd, when nonzero, represents a bias towards one button when s = s0. This formula for the psychometric curve arises naturally when we model the decision task (see below).
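A minimal sketch of such a fit, assuming the cumulative-normal form above (the bias parametrization and all numbers are illustrative, and we use a plain grid scan for the MSE fit rather than the authors' actual procedure):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard cumulative normal

# Simulate choices from p(c=1|s) = Phi((s - s0 + bias) / Z); s0, Z_true,
# bias_true and the trial counts are all invented.
s0, Z_true, bias_true = 30.0, 3.0, 0.5
s_vals = np.array([25.0, 28.0, 30.0, 32.0, 35.0])
n_trials = 4000
p_hat = np.array([(rng.random(n_trials) < Phi((s - s0 + bias_true) / Z_true)).mean()
                  for s in s_vals])

# Best (MSE) fit of (Z, bias) over a simple parameter grid.
def mse(Z, b):
    pred = np.array([Phi((s - s0 + b) / Z) for s in s_vals])
    return float(((pred - p_hat) ** 2).sum())

Z_fit, bias_fit = min(((Z, b)
                       for Z in np.arange(0.5, 8.01, 0.05)
                       for b in np.arange(-2.0, 2.01, 0.05)),
                      key=lambda p: mse(*p))
```

With a few thousand trials per stimulus, the recovered (Z, bias) land close to the generating values.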

Choice covariance and choice probability

Throughout the article, we consider the special case of a binary choice c ∈ {0, 1}. In this case, the variance of the choice conditioned on s is given by Var[c|s] = E[c|s](1 − E[c|s]), and a straightforward computation relates the CC curve to these quantities. (These formulas, and all those below, assume that the choice takes values 0 and 1. Any other binary parametrization should first be remapped to {0, 1}.)
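A quick numerical check of the Bernoulli variance formula implied here, Var[c|s] = p(1 − p), together with the {0, 1} remapping convention (the value of p is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative choice probability p at some fixed stimulus s.
p = 0.7
c = (rng.random(200_000) < p).astype(float)   # choices coded as 0/1

var_c = c.var()                               # should match p * (1 - p) = 0.21

# A +/-1 coding must first be remapped to {0, 1} before applying the formulas:
c_pm = 2 * c - 1                              # hypothetical +/-1 coding
c01 = (c_pm + 1) / 2                          # back to {0, 1}
```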

This measure is sometimes used as a simpler alternative to choice probabilities [3]. In fact, CC curves and CP curves can be analytically related if one assumes Gaussian statistics: see [19] or supporting S1 Text.

Simplified dependencies on the stimulus

To ease the subsequent analysis, we assume that the activity of each neuron is well approximated by a time-varying, linear dependency on the stimulus s, and that Cij(t, u; s) is independent of s. Since we are modeling a discrimination task, in which stimuli s display only small variations around the central value s0, the linearity assumption seems reasonable. In turn, we can write the mean activity as a linear function of s. We will refer to bi(t) as the neural tuning. More precisely, it is the slope of the neuron's tuning curve at each time point. Naturally, actual data (even from a synthetic simulation) always somewhat deviate from this idealized situation. In practice, we obtain the best fits for bi(t) and Cij(t, u) using linear regression, so that

It is desirable to summarize the stimulus-dependent CC curves di(t; s) (eq. 26) into a single CC curve for each neuron, say di(t). There is no obvious choice for this simplification, because di(t; s) has to change with s. For example, the CC signal is nonzero only if stimulus s and threshold s0 are close enough for the animal to make occasional mistakes (this is reflected in eq. 29, since σc²(s) tends to zero when the animal makes no mistakes). In the experimental literature, a common choice is to focus only on the CC curve at threshold, that is di(t) = di(t; s0). In experiments with a limited number of trials, this has the inconvenience of losing the statistical power from nearby stimulus values s that were also tested. We thus propose an alternative definition, which exploits each stimulus s in proportion to the number of associated trials. In our model, this averaging also limits the influence of the JND Z on the magnitude of CC signals: see eq. 45–46.
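Returning to the tuning fit: the linear-regression estimate of a neuron's tuning slope bi can be sketched as follows (single neuron, time-averaged rate, synthetic numbers):

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented single-neuron data: time-averaged rate with linear tuning slope
# b_true around the central stimulus s0 = 30, plus Gaussian trial noise.
s = rng.choice([25.0, 30.0, 35.0], size=6000)
b_true, r0 = 0.4, 10.0
rate = r0 + b_true * (s - 30.0) + rng.normal(scale=2.0, size=s.size)

# Least-squares fit of the tuning slope (and offset) by linear regression.
b_fit, r0_fit = np.polyfit(s - 30.0, rate, 1)
```

In the full method this regression is done per time bin, yielding the curve bi(t) rather than a single slope.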

Derivation of the linear characteristic equations

To clarify the equations, let us introduce the temporal averaging kernel

24–25), and after differentiating the first line with respect to s (see eq. 30), we obtain: These are exactly the characteristic equations 4–6 from the main text, after introducing the following vectors and matrices: di = E[Cov[r̄i, c|s]] (choice covariance vector). Given our assumptions above, the resulting quantities are all independent of the stimulus s. Note, though, that all quantities depend on the readout parameters w and tR. Importantly, one can show that the noise covariance matrix C scales as w⁻¹, under mild assumptions (supporting S1 Text, section 2).

The decision model of a fine-discrimination task

The animal's choice is obtained by comparing the noisy percept to the decision threshold, and εd ~ N(γd, σd) is a Gaussian variable representing additional noise and biases. The mean γd implements a possible bias towards one button when s = s0. The standard deviation σd implements additional sources of noise in the animal's decision process.
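A forward simulation of this decision stage, under our reconstruction of the model (percept noise plus decision noise εd compared against threshold s0; all parameter values invented):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(12)

# Invented parameters: sigma_r is the percept's noise before the decision
# stage, eps_d ~ N(gamma_d, sigma_d) the decision noise, s0 the threshold.
s0, sigma_r, gamma_d, sigma_d = 30.0, 2.0, 0.3, 1.0

def p_choice(s, n=200_000):
    s_hat = s + sigma_r * rng.normal(size=n)      # noisy percept
    eps_d = gamma_d + sigma_d * rng.normal(size=n)
    return (s_hat + eps_d > s0).mean()            # c = 1 if above threshold

# The model predicts a cumulative-normal psychometric curve whose effective
# JND combines both noise sources: Z = sqrt(sigma_r^2 + sigma_d^2).
def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

Z = sqrt(sigma_r**2 + sigma_d**2)
p_sim = p_choice(31.0)
p_th = Phi((31.0 - s0 + gamma_d) / Z)
```

The simulated choice frequency matches the cumulative-normal prediction, which is how the psychometric formula of eq. 27 arises.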

First, we assume that E[ŝ|s] = s, meaning that ŝ follows s on average. (In statistical terminology, ŝ is an unbiased estimator of s.) Then, the left-hand side of eq. 35 is simply equal to

(In theory, this assumption is violated at small time scales due to the binary nature of ri(t). But in practice this is not an issue, as the spike trains always undergo some form of temporal integration afterwards.) Then, ŝ (given s) is normally distributed, and eq. 36 ensures that its variance Var[ŝ|s] is independent of s (see Fig. 2B). In these conditions, the predicted formula for the psychometric curve is exactly that of eq. 27, and the JND, Z, is given by the following expression: With di(t) defined as an average CC curve over tested stimuli (eq. 33), we finally obtain

Equation (9) is obtained by combining eq. 37 and 45.

Indeed, note the rough approximation κ(Z) ∝ ∫ ds g(s; s0 − γd, Z) ≃ 1, valid whenever the tested stimuli s are uniformly distributed over a range of values comparable to Z. This is another practical argument for considering the stimulus-averaged CC signal di(t) from eq. 33.

Signal, noise, and sensitivity

Here, we briefly review these relations. The variance of any scalar variable x that changes from trial to trial can be decomposed into a signal term σx² := Var[E[x|s]] and a noise term Zx² := E[Var[x|s]]. Then, note that Var[x] = σx² + Zx². The noise term Zx defines the minimal level past which fluctuations in x can be attributed to s rather than intrinsic noise, hence the term JND. When a decision is taken on the basis of variable x, the JND governs the inverse slope of the corresponding psychometric curve (see eq. 27). We also define the sensitivity of variable x as Yx := σx²/(σx² + Zx²), which is simply the ratio of the signal to the total variance. The sensitivity Yx takes values between 0 and 1. It thus avoids singularities which may occur when Zx tends to 0 or +∞. We can also distinguish between signal-related and noise-related variance for the (time-averaged) neural activities r̄. The signal covariance matrix Σ, noise covariance matrix C, and total covariance matrix A are given by Σ := Var[E[r̄|s]], C := E[Var[r̄|s]], and A := Var[r̄] = Σ + C.

Note that Σ is a rank-1 matrix, owing to the system's assumed linearity with respect to stimulus s. In turn, these matrices allow us to compute the signal and noise variances for any weighted sum of the neural activities. For our linear readout (with added decision noise εd), we have:

Optimal readout from a neural ensemble ℰ

We now assume that the readout vector a has support only on some neural ensemble ℰ. Formally, we introduce the K × Ntot projection matrix Π(ℰ), such that Πij(ℰ) = δij for i ∈ ℰ and every neuron j. Then, the restrictions of vectors and matrices in neuron space, such as b and C, to ensemble ℰ will be denoted by a subscript r (for restriction). Our principle of (restricted) optimality selects the readout vector a which maximizes the percept's sensitivity; this is achieved by minimizing the noise variance aᵀ Cr a (eq. 52), or equivalently, the total variance aᵀ Ar a (eq. 53). The solution, known as Fisher's linear discriminant, is easily found with Lagrange multipliers (based either on Cr or on Ar). The second formulation of ar, based on the total covariance matrix Ar, will prove more useful when we turn to the SVD analysis. It also has the advantage of avoiding the singularity which may occur when vector br lies outside the span of matrix Cr. In that case one simply replaces (Ar)⁻¹ by the (Moore-Penrose) pseudoinverse (Ar)⁺.
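The following sketch assembles these objects numerically for an invented ensemble: a rank-1 signal covariance Σr, a noise covariance Cr, the total covariance Ar, and the two equivalent formulations of the restricted-optimal readout:

```python
import numpy as np

rng = np.random.default_rng(6)

# Invented restricted quantities for an ensemble of K neurons.
K, sigma_s2 = 8, 4.0
b_r = rng.normal(size=K)                   # restricted tuning vector
G = rng.normal(size=(K, K))
C_r = G @ G.T + 0.1 * np.eye(K)            # positive-definite noise covariance
Sigma_r = sigma_s2 * np.outer(b_r, b_r)    # signal covariance: rank 1
A_r = Sigma_r + C_r                        # total covariance

# Fisher's linear discriminant, in both formulations: based on the noise
# covariance, or (up to a positive scalar) on the total covariance.
a_noise = np.linalg.solve(C_r, b_r)        # ~ C_r^{-1} b_r
a_total = np.linalg.pinv(A_r) @ b_r        # ~ A_r^{+} b_r

# By the Sherman-Morrison identity, the two readout vectors are collinear.
cos = a_noise @ a_total / (np.linalg.norm(a_noise) * np.linalg.norm(a_total))
```

The pseudoinverse form remains well defined even when Cr is singular, which is the practical advantage noted above.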

52), we obtain the JND predicted by the model: Equivalently, using the formulations based on total variance (eq. 47, 53, 56), we obtain the model's prediction for sensitivity:

CC signals for the optimal readout

9), we obtain the CC curves predicted by the model. Here, di(t) is the resulting, predicted CC curve for every neuron i in the population (not only in ℰ). If neuron i belongs to the readout ensemble ℰ, matrix Π simplifies away from eq. 60, yielding: This equation, first shown in [19], means that choice signals within the readout ensemble are simply proportional to tuning. This is not true, however, for neurons outside the readout ensemble.

First, it proves that choice signals are markedly different for neurons inside or outside the readout ensemble (an observation made empirically by [12]). Second, as we consider readout ensembles ℰ larger and larger, eq. 61 will become true for more and more neurons. As a result, the statistical indicator V (eq. 15), which measures the population-wide deviation from linearity between di and bi, is expected to decrease with the readout ensemble's size K. Finally, under the assumption of (restricted) optimality, the time-averaged statistical indicator d̄ is always positive. Indeed, averaging over all neurons i in the population amounts to a scalar product, which is always positive because both matrices C and Πᵀ(Cr)⁻¹Π are symmetric positive semidefinite.

Singular value decomposition

We denote the time-averaged activities of neuron i in the q-th presentation of stimulus s as r̄i(sq). We interpret these activities as a very large Ntot × Q matrix, where Ntot refers to the number of neurons and Q to an idealized, and essentially infinitely large, number of trials. Next, we consider the singular value decomposition (SVD) of the neural activities. The (compact) SVD is a standard decomposition which can be applied to any rectangular matrix R. It is given by R = U Λ Vᵀ, where Λ is an M × M diagonal matrix with strictly positive entries λm (the singular values), U is an Ntot × M matrix of orthogonal columns (meaning UᵀU = 1M), and V is a Q × M matrix of orthogonal columns (meaning VᵀV = 1M). Using the indices defined above, the SVD decomposition for the neural activities becomes

is the average activity of each cell over all trials and stimuli. The orthogonality of U implies that for all indices m and n, we have Σi (um)i (un)i = δmn, while the orthogonality of V similarly implies Σs,q vm(sq) vn(sq) = δmn.

Statistics of activity, in the space of modes

63) is best interpreted as a change of variables reexpressing the neural activities {r̄i}i=1...Ntot in terms of mode activation variables {vm}m=1...M. As a result, we can define the respective equivalents of all statistical quantities in the space of activity modes. Specifically, we can reinterpret sums over trials in the SVD as expectations, thus emphasizing the statistical interpretation of the SVD. First, we note that the activities entering the SVD have been “centered” by subtracting each neuron's grand mean, so the activation variables vm have zero mean. Each mode then has a tuning parameter ηm, just as bi was the tuning parameter for the i-th neuron. Grouping all mode activation variables in a vector v, we obtain the signal covariance and total covariance matrices in mode space as

The singular values λm and distribution vectors um then allow us to relate the statistics at the levels of neurons and modes. Using the SVD formula (eq. 63) yields (in matrix form):

Sensitivity of sub-ensembles, in the space of modes

We now wish to understand which factors govern the sensitivity embedded in a neural sub-ensemble ℰ of cardinality K. For simplicity, we will consider the case for which the decision noise σd is negligible.

To reexpress this sensitivity of finite sub-ensembles ℰ in mode space, we need to find the equivalent, restricted expressions of eq. 68–69. For that purpose, we introduce the design matrix associated to ensemble ℰ in mode space, where we have defined the M × M matrix P. The projector P = P(ℰ) spans more and more space as the size K of ensemble ℰ increases. In the limiting case, when K is larger than the number of modes M, then necessarily P = 1M, and we recover the full-population expression. In other words, all modes are available experimentally, and sensitivity estimates saturate to their maximum value, independently of ensemble ℰ. We can explicitly denote the sensitivity of each mode's activation variable vm by defining ym.

CC signals, in the space of modes

First, we reexpress the CC equation (eq. 10) as a function of the total covariance A (eq. 50) to obtain

Hence, up to a scaling and shift, the CC vector d can be replaced by the total percept covariance vector, which we denote e. In the case of an optimal readout, vector a is given by eq. 56, so that we obtain

73). Note again that e provides the CC signal for every neuron i in the population (not only in ensemble ℰ). As ℰ tends to the full population, P = P(ℰ) tends to 1M and we recover e(∞) = σs² Ytot⁻¹ b, the prediction for choice signals in the case of an optimal readout from the full population.

Using eq. 78, we can finally compute the analytical predictions for the two CC statistical indicators, d̄ and V. Precisely, we compute the following population-wide regression coefficient between e and b: Just as e is a linear rescaling of d, Q is a similar rescaling of indicator d̄, as pointed out in the main text (eq. 18). Finally, a very similar computation leads to the expression of indicator V (eq. 15) in the space of modes:

Sensitivity and CC signals as a function of K

We are mostly interested in averages of these quantities over very large numbers of randomly chosen ensembles ℰ of size K; we thus use the generic notation E[x|K] := E[x(ℰ) | Card(ℰ) = K] to denote the expected value of a variable x when averaging over ensembles of size K. Note that this notation is equivalent to the more explicit notation used in the main text, so that E[x|K] = ⟨x⟩ℰ(K). From eq. 72 we find: E[Y|K] = σs² ηᵀ E[P|K] η.

71) as a collection of K random vectors xi in mode space, viewing neuron identities i as the random variable. Thus, P(ℰ) is the orthogonal projector onto the linear span of the K sample vectors {xi}i∈ℰ. As a projector, its trace is equal to its rank, so we have Tr(E[P|K]) = K. Furthermore, since K + 1 samples span on average more space than K samples, we are ensured that E[P|K+1] ≥ E[P|K], in the sense of positive semidefinite matrices.

Indeed, as the various modes are linearly independent, there is no linear interplay between the different dimensions of xi across samples i. More precisely, the expectation value over neurons is

Then, the projection matrix P is simply equal to W Wᵀ, and the previous equation can be rewritten accordingly.

So, if we assume a form of independence between W and D, it is reasonable to suppose that E[W Wᵀ|K] = E[P|K] is close to diagonal as well. (Actually, we postulate that E[P|K] is exactly diagonal when the random vectors x_i follow a normal distribution. In the general case, small or moderate deviations from diagonality can be observed.) We denote these diagonal terms as e_m(K). The properties of E[P|K] stated above imply that Σ_m e_m(K) = K (trace property), and e_m(K+1) > e_m(K) (growth property). Finally, we can consider the resulting approximations of sensitivity. In this expression, we recognize the individual mode sensitivities γ_m = σ_d⁻² b_m², so that E[Y|K] ≈ Σ_m e_m(K) γ_m. A similar computation applies to CC signals.

Validation on a simulated neural network

The neural network used to test our methods is described in detail in supporting S1 Text (section 3). Briefly, on each trial, 2000 input Poisson neurons fire with rate s, taking one of three possible values: 25, 30 and 35 Hz (so in our simulation, stimulus units are Hz). The encoding population per se consists of 5000 leaky integrate-and-fire (LIF) neurons. 1000 of these neurons receive sparse excitatory projections from the input Poisson neurons, which naturally endows them with a positive tuning to stimulus s. Another 1000 neurons receive sparse inhibitory projections from the Poisson neurons, which naturally endows them with negative tuning. The remaining 3000 neurons receive no direct projections from the input. Instead, all neurons in the encoding population are coupled through a sparse connectivity with random delays up to 5 msec. Synaptic weights are random and balanced, leading to a mean firing rate of 21.8 Hz in the population. We implemented and simulated the network using Brian, a spiking neural network simulator in Python [39].
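The full Brian implementation accompanies the article (S1 Text, section 3). As a framework-free illustration of the architecture only, here is a miniature leaky integrate-and-fire loop in plain numpy; all sizes, weights and connection probabilities below are invented stand-ins, not the paper's parameters, and inhibitory projections and synaptic delays are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 1e-3                      # 1 msec time step
n_steps = 500                  # 500 msec trial
n_in, n_pop = 200, 500         # miniature stand-in for the 2000/5000 neurons of the text
s = 30.0                       # stimulus: input Poisson rate in Hz
tau = 0.02                     # 20 msec membrane time constant

# Sparse excitatory input weights onto the first 100 ("positively tuned") neurons.
W_in = np.zeros((n_pop, n_in))
tuned = rng.random((100, n_in)) < 0.05
W_in[:100][tuned] = 0.2

# Sparse, balanced (zero-mean) recurrent weights.
W_rec = (rng.random((n_pop, n_pop)) < 0.05) * rng.normal(0.0, 0.03, (n_pop, n_pop))

v = np.zeros(n_pop)            # membrane potentials (threshold 1, reset 0)
last = np.zeros(n_pop)         # spikes emitted on the previous step
spikes = np.zeros(n_pop)       # spike counts over the trial
for _ in range(n_steps):
    in_spk = (rng.random(n_in) < s * dt).astype(float)   # Poisson input spikes
    v += -v * dt / tau + W_in @ in_spk + W_rec @ last    # leaky integration
    last = (v > 1.0).astype(float)                       # threshold crossings
    v[v > 1.0] = 0.0                                     # reset
    spikes += last

tuned_rate = spikes[:100].mean() / (n_steps * dt)        # mean rate (Hz) of tuned subgroup
```

Raising s increases the drive to the tuned subgroup, mimicking the positive tuning described above.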

The readout vector a* was built optimally given these constraints (eq. 12). The trials used to learn a* were not used in the subsequent analysis. The resulting JND for the “animal” was Z* ≈ 3 stimulus units (Hz). Then, “experimentally”, neural activity was observed through 15 pools of 170 simultaneously recorded neurons, each pool being recorded on 3 × 180 trials. For the statistical inference method, we assumed a square integration kernel. We tested all combinations of the following readout parameters (in MATLAB notation): K = 10:10:150 neurons, w = 10:10:100 msec, tR = 10:10:200 msec, σ_d = 0:0.25:3 stimulus units (Hz). For each tested size K, we picked 2000 random candidate ensembles ℰ (always within one of the 15 simultaneous pools) to build the predictions. For each ensemble ℰ, another ensemble I of 20 neurons, segregated from ℰ, was used to predict CC signals outside the readout ensemble (this was always possible since recording pools had size 170, and K ≤ 150). The details of these predictions are explained in the following paragraph. Finally, the three terms in the “statistical” loss function (eq. 16) were weighted according to the power of the respective, true measures.
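The tested grid can be reconstructed directly from the colon notation above; `weighted_loss` below is only our reading of "weighted according to the power of the respective, true measures" (normalizing each squared mismatch by the mean square of the true measure):

```python
import numpy as np
from itertools import product

# The grid of readout parameters tested in the text (a:step:b notation).
K_vals  = np.arange(10, 151, 10)       # K  = 10:10:150 neurons
w_vals  = np.arange(10, 101, 10)       # w  = 10:10:100 msec
tR_vals = np.arange(10, 201, 10)       # tR = 10:10:200 msec
sd_vals = np.arange(0.0, 3.01, 0.25)   # sigma_d = 0:0.25:3 stimulus units (Hz)

grid = list(product(K_vals, w_vals, tR_vals, sd_vals))   # 39000 candidate settings

def weighted_loss(preds, trues):
    """Sum the squared mismatches of the loss terms, each normalized
    by the power (mean square) of the corresponding true measure."""
    return sum(np.mean((p - t) ** 2) / np.mean(t ** 2)
               for p, t in zip(preds, trues))
```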

Experimental predictions for CC indicators

First, one must compute the true indicators (eq. 14-15) from actual data. For the measured versions q*(u, t) and V*(w, tR), this is straightforward. One considers the true, measured CC signals d_i*(t), and computes the population averages in eq. 14-15 over as many neurons i as were recorded. Note however that the final indicators can be corrupted by noise, whenever each measure d_i*(t) comes from too few recording trials (this problem is addressed in the next section). Also note that, since the definition of V requires a temporal integration, we actually have to produce a different “true” V* for each tested set of temporal parameters w and tR.

Whenever a candidate ensemble ℰ is proposed as the source of the readout, eq. 59 predicts the resulting CC signal d_i(t|ℰ) for every neuron i in the population. However, in practice, the noise covariance term C_ir(t) is required in the computation, so neuron i and ensemble ℰ must have been recorded simultaneously during the same run. This limits the number of neurons i which can participate in the population averages.

As a result, the two following averages must be predicted separately, before one can recombine them in the correct proportions (and similarly for V(ℰ)). To compute q_out experimentally, each tested candidate ensemble ℰ (of size K) is associated to a complementary set of neurons I (of size I), which we use to approximate the average in eq. 85. All neurons in ensembles ℰ and I must have been recorded during the same run, which imposes that I + K ≤ N. Hence in our simulations, we chose a size I = 170 - 150 = 20 neurons.
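The recombination "in the correct proportions" presumably weights the within-ensemble and outside-ensemble averages by the corresponding neuron counts; a hypothetical helper (`combine_cc_average` is our name, not the paper's):

```python
def combine_cc_average(q_in, q_out, K, N):
    """Population-level CC average, assuming K of the N neurons belong to
    the readout ensemble (contributing q_in) and the remaining N - K
    neurons contribute q_out."""
    return (K * q_in + (N - K) * q_out) / N
```

For instance, with K = 10 readout neurons among N = 100, `combine_cc_average(2.0, 1.0, 10, 100)` returns 1.1.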

So in practice, we cannot reliably estimate each prediction q(u, t|ℰ) from eq. 87. Luckily, we are not interested in their value for each individual readout ensemble ℰ. We simply need to estimate their means across all tested ensembles ℰ of similar size, which will be reliable as soon as we test a sufficient number of candidate ensembles ℰ.

In the loss function (eq. 16), a match is sought between the true indicators q* and V*, which arise from a single readout ensemble ℰ*, and the predictions ⟨q⟩_ℰ and ⟨V⟩_ℰ, which are average values across all readout ensembles ℰ of size K. Thus, a prediction error can occur whenever the true readout ensemble ℰ* is not a “typical” representative of its size K*. To quantify these potential errors, one should also estimate the indicators’ variance across ensembles ℰ of the same size.

Correcting for the finite amounts of data

The computations of Z, q and V, as described above, can produce imprecise results when the data are overly limited. Generically, for any quantity X estimated from the data, we can write X* = X + ξ, where ξ represents the measurement error on X due to the finite amount of data. If we could recompute X from a different set of neurons and/or a different set of trials, variable ξ would take a different value, meaning that Var(ξ) > 0. This is an inescapable phenomenon for experimental measures.

Since the bias is generally different for the ‘true’ and ‘predicted’ versions, the comparison between the two (eq. 16) will be systematically flawed. To counteract this effect, we applied a number of correction procedures when computing indicators Z, q and V, to ensure that they are globally unbiased. We only provide an overview here, and refer to supporting S1 Text for a detailed description. First, when the optimal vector a is computed with Fisher’s linear discriminant, it systematically underestimates the JND Z (overestimates the sensitivity Y). Essentially, vector a, computed through eq. 12, finds artificial “holes” in matrix C which are only due to its imprecise measurement, a phenomenon known as statistical overfitting. The fewer the recording trials, the more overfitting there will be [40, 41]. We addressed this problem with a regularization technique inspired by Bayesian linear regression [42]. We replaced eq. 12 by the following, where parameter λ sets the degree of regularization. We chose λ according to an ‘empirical Bayes’ principle, to maximize the likelihood of the data under a given statistical model (supporting S1 Text, section 4). This largely mitigated the effects of overfitting, without totally suppressing them, as can be seen in Fig. 5D-E.
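A ridge-style sketch of such a regularized discriminant; the paper's exact regularizer and the empirical-Bayes selection of λ are detailed in S1 Text (section 4), so the formula and function name below are illustrative assumptions:

```python
import numpy as np

def regularized_readout(C_hat, dm, lam):
    """Shrunk discriminant a = (C_hat + lam * I)^(-1) dm: inflating the
    eigenvalues of the estimated covariance C_hat closes the artificial
    "holes" that cause overfitting."""
    n = C_hat.shape[0]
    return np.linalg.solve(C_hat + lam * np.eye(n), dm)

# lam = 0 recovers the plain (overfitting-prone) discriminant; increasing
# lam shrinks the readout toward the raw tuning vector dm.
```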

The indicator V (eq. 15) can also display substantial biases (E(ξ) ≠ 0 in the above discussion). Indeed, its computation relies on squared quantities, such as d_i², that systematically transform measurement errors into positive biases. The required corrections are very similar to the classic “N/(N-1)” correction for the naive variance estimator, with the additional difficulty that V is affected by two sources of noise: the finite number of recording trials, and the finite number of recorded neurons. The exact corrections to ensure an unbiased estimation of V are detailed in supporting S1 Text, section 5.
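For reference, the classic correction mentioned here, on a toy sample:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])          # toy sample of n = 4 trials
n = len(x)

naive = np.mean((x - x.mean()) ** 2)         # biased: divides squared errors by n
corrected = naive * n / (n - 1)              # the N/(N-1) correction

assert np.isclose(corrected, np.var(x, ddof=1))   # matches the unbiased estimator
```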

Yet, it can display an important level of measurement noise (Var(ξ) ≫ 0 in the above discussion) that may deteriorate the subsequent inference procedure. We mitigated this measurement noise by applying a bi-temporal Gaussian smoothing to the measures q*(u, t) and predictions q(u, t), with time constant 10 msec.
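A numpy-only sketch of the bi-temporal smoothing; the 10 msec constant is from the text, while the 1 msec bin size and the grid size are assumptions:

```python
import numpy as np

dt = 1.0                       # assumed bin size, msec
sigma = 10.0 / dt              # 10 msec smoothing constant, in bins

# Truncated, normalized Gaussian kernel.
taps = np.arange(-4 * int(sigma), 4 * int(sigma) + 1)
g = np.exp(-taps ** 2 / (2.0 * sigma ** 2))
g /= g.sum()

rng = np.random.default_rng(0)
q = rng.standard_normal((120, 120))          # stand-in for a noisy q*(u, t) grid

# Bi-temporal smoothing: convolve along the u axis, then along the t axis.
q_s = np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 0, q)
q_s = np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 1, q_s)

assert q_s.std() < q.std()                   # noise power is reduced
```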

These resamplings were used to derive some of the correction terms for V, and also to derive confidence intervals on our final estimators, as shown in Fig. 6. This departure from the statistical canon was imposed by the length of the whole inference procedure (see supporting S1 Text, section 5, for details).
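A percentile-interval sketch across resampled estimates (the 14 values below are invented for illustration; the article derives its intervals from resamplings of the actual data):

```python
import numpy as np

rng = np.random.default_rng(0)
K_hats = rng.normal(80.0, 8.0, size=14)      # hypothetical resampled estimates of K

# Percentile confidence interval across the resamplings.
lo, hi = np.percentile(K_hats, [2.5, 97.5])
```

With so few resamplings the interval is necessarily coarse, which is part of the "departure from the statistical canon" acknowledged above.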

Reproduction of our results and implementation

We also provide the Python code for the network simulation, and MATLAB scripts for the reproduction of the experimental figures in this article (Fig. 4-7).

Supporting Information

Supporting text. Contains additional information about Choice Probabilities (section 1), the influence of parameter w on stimulus sensitivity (section 2), the encoding neural network used for testing the method (section 3), the Bayesian regularization procedure on Fisher’s linear discriminant (section 4), unbiased computation of CC indicators in the presence of measurement noise (section 5), and an extended readout model with variable extraction time tR (section 6).

Supporting code for the article. (GZ)

Author Contributions

Performed the experiments: AW. Analyzed the data: AW. Wrote the paper: AW CM.
