arXiv:0712.4381v1 [q-bio.NC] 28 Dec 2007



Efficient representation as a design principle for neural coding and computation

William Bialek,^a Rob R. de Ruyter van Steveninck^b and Naftali Tishby^c

^a Joseph Henry Laboratories of Physics, Lewis–Sigler Institute for Integrative Genomics, and Princeton Center for Theoretical Physics, Princeton University, Princeton, NJ 08544 USA

^b Department of Physics and Program in Neuroscience, Indiana University, Bloomington, IN 47405, USA

^c School of Computer Science and Engineering, and Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

(Dated: December 28, 2007)

Does the brain construct an efficient representation of the sensory world? We review progress on this question, focusing on a series of experiments in the last decade which use fly vision as a model system in which theory and experiment can confront each other. Although the idea of efficient representation has been productive, clearly it is incomplete since it doesn’t tell us which bits of sensory information are most valuable to the organism. We suggest that an organism which maximizes the (biologically meaningful) adaptive value of its actions given fixed resources should have internal representations of the outside world that are optimal in a very specific information theoretic sense: they maximize the information about the future of sensory inputs at a fixed value of the information about their past. This principle contains as special cases computations which the brain seems to carry out, and it should be possible to test this optimization directly. We return to the fly visual system and report the results of preliminary experiments that are in encouraging agreement with theory.

I. INTRODUCTION

Since Shannon’s original work [1] there has been the hope that information theory would provide not only a guide to the design of engineered communication systems but also a framework for understanding information processing in biological systems. One of the most concrete implementations of this idea is the proposal that computations in the brain serve to construct an efficient (perhaps even maximally efficient) representation of incoming sensory data [2, 3, 4]. Since efficient coding schemes are matched, at least implicitly, to the distribution of input signals, this means that what the brain computes—perhaps down to the properties of individual neurons—should be predictable from the statistical structure of the sensory world. This is a very attractive picture, and points toward general theoretical principles rather than just a set of small models for different small pieces of the brain. More precisely, this picture suggests a research program that could lead to an experimentally testable theory.

Our research efforts, over several years, have been influenced by these ideas of efficient representation. On the one hand, we have found evidence for this sort of optimization in the responses of single neurons in the fly visual system, especially once we developed tools for exploring the responses to more naturalistic sensory inputs. On the other hand, we have been concerned that simple implementations of information theoretic optimization principles must be wrong, because they implicitly attach equal value to all possible bits of information about the world. In response to these concerns, we have been trying to develop alternative approaches, still grounded in information theory but not completely agnostic about the value of information. Guided by our earlier results, we also want to phrase these theoretical ideas in a way that suggests new experiments.

What we have outlined here is an ambitious program, and certainly we have not reached anything like completion. The invitation to speak at the International Symposium on Information Theory in 2006 seemed like a good occasion for a progress report, so that is what we present here. It is much easier to convey the sense of ‘work in progress’ when speaking than when writing, and we hope that the necessary formalities of text do not obscure the fact that we are still groping for the correct formulation of our ideas. We also hope that, incomplete as it is, others will find the current state of our understanding useful and perhaps even provocative.

II. SOME RESULTS FROM THE FLY VISUAL SYSTEM

The idea of efficient representation in the brain has motivated a considerable amount of work over several decades. We begin by reviewing some of what has been done along these lines, focusing on one experimental testing ground, the motion sensitive neurons in the fly visual system.

Many animals, in particular those that fly, rely on visual motion estimation to navigate through the world. The sensory–motor system responsible for this task, loosely referred to as the optomotor control loop, has been the subject of intense investigation in the fly, both in behavioral [5] and in electrophysiological studies. In particular, Bishop and Keehn [6] described wide field motion sensitive cells in the fly’s lobula plate, and some neurons of this class have been directly implicated in optomotor control [7]. The fly’s motion sensitive visual neurons thus are critical for behavior, and one can record the action potentials or spikes generated by individual motion sensitive cells (e.g., the cell H1, a lobula plate neuron selective for horizontal inward motion) using an extracellular tungsten microelectrode, and standard electrophysiological methods [8]; unlike most such recordings, in the fly one can record stably and continuously for days.

The extreme stability of the H1 recordings has made this system an attractive testing ground for a wide variety of issues in neural coding and computation. In particular, for H1 it has been possible to show that:

1. Sequences of action potentials provide large amounts of information about visual inputs, within a factor of two of the limit set by the entropy of these sequences even when we distinguish spike arrival times with millisecond resolution [9].

2. This efficiency of coding has significant contributions from temporal patterns of spikes that provide more information than expected by adding up the information carried by individual spikes [10].

3. Although many aspects of the neural response vary among individual flies, the efficiency of coding is nearly constant [11].

4. Information rates and coding efficiencies are higher, and the high efficiency extends to even higher time resolution, when we deliver stimulus ensembles that more closely approximate the stimuli which flies encounter in nature [12, 13].

5. The apparent input/output relation of these neurons changes in response to changes in the input distribution. For the simple case where we change the dynamic range of velocity signals, the input/output relation rescales so that the signal is encoded in relative units; the magnitude of the rescaling factor maximizes information transfer [16].

6. In order to adjust the input/output relation reliably, the system has to collect enough samples to be sure that the input distribution has changed. In fact the speed of adaptation is close to this theoretical limit [17].
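The rescaling in point 5 can be illustrated with a toy model (a hypothetical saturating nonlinearity, not the measured H1 input/output relation): if the response to input x is a fixed nonlinearity applied to x divided by the stimulus standard deviation, then the distribution of responses is invariant to the dynamic range of the input, so the signal is encoded in relative units.

```python
import numpy as np

rng = np.random.default_rng(0)

def response(x, sigma):
    # Hypothetical saturating input/output relation, rescaled by the
    # stimulus standard deviation (the adaptation described in point 5).
    return np.tanh(x / sigma)

# Two stimulus ensembles differing only in dynamic range (std 1 vs std 10).
r1 = response(rng.normal(0.0, 1.0, 200_000), 1.0)
r2 = response(rng.normal(0.0, 10.0, 200_000), 10.0)

# After rescaling, the two response distributions are statistically
# identical: the neuron encodes velocity in relative units.
print(np.mean(np.abs(r1)), np.mean(np.abs(r2)))
```

Without the rescaling (a fixed nonlinearity), the wide-range ensemble would saturate the response and information transfer would drop.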

All of these results point toward the utility of efficient representation as a hypothesis guiding the design of new experiments, and perhaps even as a real theory of the neural code. So much for the good news.

III. ON THE OTHER HAND ...

Despite the successes of information theoretic approaches to the neural code in fly vision and in other systems, we must be honest and consider the fundamental stumbling blocks in any effort to use information theoretic ideas in the analysis of biological systems. First, Shannon’s formulation of information theory has no place for the value or meaning of the information. This is not an accident. On the first page of his 1948 paper [1], Shannon remarked (italics in the original):

    Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of the communication are irrelevant to the engineering problem.

Yet surely organisms find some bits more valuable than others, and any theory that renders meaning irrelevant must miss something fundamental about how organisms work. Second, it is difficult to imagine that evolution can select for abstract quantities such as the number of bits that the brain extracts from its sensory inputs. Both of these problems point away from general mathematical structures toward biological details such as the fitness or adaptive value of particular actions, the costs of particular errors, and the resources needed to carry out specific computations. It would be attractive to have a theoretical framework that is faithful to these biological details but nonetheless derives predictions from more general principles.

To develop a biologically meaningful notion of optimization, we should start with the idea that there is some metric (‘adaptive value’ in Fig 1) for the quality or utility of the actions taken by an organism, and that there are resources that the organism needs to spend in order to take these actions and to maintain the apparatus that collects and processes the relevant sensory information. The ultimate metric is evolutionary fitness, but in more limited contexts one can think about the frequency or value of rewards and punishments, and in experiments one can manipulate these metrics directly. Costs often are measured in metabolic terms, but one also can measure the volume of neural circuitry devoted to a task. Presumably there also are costs associated with the development of complex structures, although these are difficult to quantify.

[Figure: the resources (horizontal) vs. adaptive value (vertical) plane; a curve of optimal performance separates the allowed region from the forbidden region.]

FIG. 1: Optimization from a biological point of view.

Given precise definitions of utility and cost for different strategies (whether represented by neurons or by genomes), the biologically meaningful optimum is to maximize utility at fixed cost: While there may be no global answer to the question of how much an organism should spend, there is a notion that it should receive the maximum return on this investment. Thus within a given setting there is a curve that describes the maximum possible utility as a function of the cost, and this curve divides the utility/cost plane into regions that are possible and impossible for organisms to achieve, as in Fig 1. This curve defines a notion of optimal performance that seems well grounded in the facts of life, even if we can’t compute it. The question is whether we can map this biological notion of optimization into something that has the generality and power of information theory.

IV. COSTS AND BENEFITS ARE RELATED TO INFORMATION

To begin, we note that taking actions which achieve a criterion level of fitness requires a minimum number of bits of information, as schematized in the upper left of Fig 2. Consider an experiment in which human subjects point at a target, and the reward or utility is dependent upon the positional error of the pointing. We can think of the motor neurons, muscles and kinematics of the arm together as a communication channel that transforms some central neural representation into mechanical displacements. If we had an accurate model of this communication channel we could calculate its rate–distortion function, which determines the minimum number of bits required in specifying the command to insure displacements of specified accuracy across a range of possible target locations. The rate–distortion function divides the utility/information plane into accessible and inaccessible regions.
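A concrete, deliberately oversimplified instance of this idea (our illustration, not a model from the text): if the target location is Gaussian with variance σ² and utility is determined by the mean-square pointing error D, the classic Gaussian rate–distortion function R(D) = (1/2) log₂(σ²/D) gives the minimum number of bits per pointing command.

```python
import math

def gaussian_rate_distortion(sigma2, d):
    """Minimum bits needed to specify a Gaussian source of variance sigma2
    with mean-square error at most d (zero bits suffice if d >= sigma2)."""
    return max(0.0, 0.5 * math.log2(sigma2 / d))

# Halving the RMS pointing error (quartering D) always costs one more bit.
print(gaussian_rate_distortion(1.0, 0.25))    # 1.0 bit
print(gaussian_rate_distortion(1.0, 0.0625))  # 2.0 bits
```

The same logarithmic tradeoff holds whatever the units of the pointing error, which is why the curve in the upper left of Fig 2 bends but never flattens: ever-finer accuracy demands ever more bits.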

It also is true that bits are not free. In the classical examples of communication channels, the signal–to–noise ratio (SNR) with which data can be transmitted is related directly to the power dissipation, and the SNR in turn sets the maximum number of bits that can be transmitted in a given amount of time; this is (almost) the concept of channel capacity. If we think about the bits that will be used to direct an action, then there are many costs—the cost of acquiring the information, of representing the information, and the more obvious physical costs of carrying out the resulting actions. Continuing with the example of motor control, we always can assign these costs to the symbols at the entrance to the communication channel formed by the motor neurons, muscles and arm kinematics. The channel capacity separates the information/cost plane into accessible and inaccessible regions, as in the lower right quadrant of Fig 2. Ideas about metabolically efficient neural codes [18, 19] can be seen as efforts to calculate this curve in specific models.

[Figure: four quadrants built from the axes adaptive value, resources, and information (bits); curves mark the rate–distortion limit (upper left), biologically optimal performance (upper right), and the channel coding limit (lower right); the remaining quadrant is labeled “What happens here? Are these information measures the same?”]

FIG. 2: Biological costs and benefits are connected to bits. The upper right quadrant redraws the biologically motivated notion of optimization from Fig 1, trading resources for adaptive value. In the upper left, we show schematically that achieving a given quality of performance requires a minimum number of bits, in the spirit of rate–distortion theory. In the lower right, we show that a given resource expenditure will suffice only to collect a certain maximal number of bits, in the spirit of channel coding. Through these connections, the quantities that govern biological optimization are translated into bits. But are these the same bits?
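For the classical band-limited Gaussian channel the tradeoff between power and bits is explicit: capacity grows only logarithmically with SNR, so each additional bit per unit time becomes progressively more expensive. A sketch of the textbook formula (not a model of any neural system):

```python
import math

def awgn_capacity_bits(snr, bandwidth_hz, t_seconds):
    """Shannon capacity of a band-limited Gaussian channel: the maximum
    number of bits transmissible in t_seconds at the given SNR."""
    return bandwidth_hz * t_seconds * math.log2(1.0 + snr)

# More than doubling the power (SNR 3 -> 7) buys only 50% more bits:
print(awgn_capacity_bits(snr=3.0, bandwidth_hz=100.0, t_seconds=1.0))  # 200.0 bits
print(awgn_capacity_bits(snr=7.0, bandwidth_hz=100.0, t_seconds=1.0))  # 300.0 bits
```

This diminishing return is the shape of the channel coding limit in the lower right quadrant of Fig 2, and it is the reason metabolically efficient codes [18, 19] favor cheap, low-SNR symbols.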

V. CAN WE CLOSE THE LOOP?

To complete the link between biological optimization and information theoretic ideas, we need to remember that there is a causal path from information about the outside world to internal representations to actions. Thus the adaptive value of actions always depends on the state of the world after the internal representation has been formed, simply because it takes time to transform representations into actions; the only bits that can contribute to fitness are those which have predictive power regarding the future state of the world. In contrast, because of causality, any internal representation necessarily is built out of information about the past.

The fact that representations are built from data about the past but are useful only to the extent that they provide information about the future means that, for the organism, the bits in the rate–distortion tradeoff are bits about the future, while the bits in the channel capacity tradeoff are bits about the past. Thus the different tradeoffs we have been discussing—the biologically relevant trade between resources and adaptive value, the rate–distortion relation between adaptive value and bits, and the channel capacity trading of resources for bits—form three quadrants in a plane (Fig 3). The fourth quadrant, which completes the picture, is a purely information theoretic tradeoff between bits about the past and bits about the future.

[Figure: four quadrants built from the axes adaptive value, resources, information about the past, and information about the future; curves mark the rate–distortion limit, biologically optimal performance, the channel coding limit, and the information bottleneck, each separating an allowed from a forbidden region.]

FIG. 3: Connecting the different optimization principles. Lines indicate curves of optimal performance, separating allowed from forbidden (hashed) regions of each quadrant. In the upper right quadrant is the biological principle, maximizing fitness or adaptive value at fixed resources. But actions that achieve a given level of adaptive value require a minimum number of bits, and since actions occur after plans these are bits about the future (upper left). On the other hand, the organism has to “pay” for bits, and hence there is a minimum resource cost for any representation of information (lower right). Finally, given some bits (necessarily obtained from observations on the past), there is some maximum number of bits of predictive power (lower left). To find a point on the biological optimum one can try to follow a path through the other three quadrants, as indicated by the arrows.

Assuming that the organism lives in a statistically stationary world, predictions ultimately are limited by the statistical structure of the data that the organism collects. More concretely, if we observe a time series through a window of duration T (that is, for times −T < t ≤ 0), then to represent the data Xpast we have collected requires S(T) bits, where S is the entropy, but the information that these data provide about the future Xfuture (i.e., at times t > 0) is given by some I(Xpast; Xfuture) ≡ Ipred(T) ≤ S(T). In particular, while for large T the entropy S(T) is expected to become extensive, the predictive information Ipred(T) always is subextensive [20]. Thus we expect that the data Xpast can be compressed significantly into some internal representation Xint without losing too much of the relevant information about Xfuture. This problem—mapping Xpast → Xint to minimize the information I(Xint; Xpast) that we keep about the past while maintaining information I(Xint; Xfuture) about the future—is an example of the “information bottleneck” problem [21]. Again there is a curve of optimal performance, separating the plane into allowed and forbidden regions. Formally, we can construct this optimum by solving

\[
\max_{X_{\rm past} \to X_{\rm int}} \left[ I(X_{\rm int}; X_{\rm future}) - \lambda\, I(X_{\rm int}; X_{\rm past}) \right], \tag{1}
\]

where Xpast → Xint is the rule for creating the internal representation and λ is a Lagrange multiplier.
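The subextensivity of the predictive information can be made concrete. Ref [20] distinguishes two broad classes of behavior, which we summarize schematically here (our paraphrase of that reference, not a derivation given in this text):

```latex
S(T) \propto T \quad (\text{extensive}), \qquad
I_{\rm pred}(T) \longrightarrow
\begin{cases}
I_\infty < \infty, & \text{processes with a finite correlation time } \tau_c,\\[4pt]
\dfrac{K}{2}\,\log T + \cdots, & \text{data generated by a model with } K \text{ unknown parameters,}
\end{cases}
```

so in either case Ipred(T)/S(T) → 0 and the past is highly compressible. These two classes anticipate the two examples worked out below: filtering (finite correlation time) and learning (K parameters).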

We see that there are several different optimization principles, all connected, as schematized in Fig 3. The biologically relevant principle is to maximize the fitness F given some resource cost C. But in order to take actions that achieve some mean fitness F in a potentially fluctuating environment, the organism must have an internal representation Xint that provides some minimum amount of information I(Xint; Xfuture) about the future states of that environment; the curve of I(Xint; Xfuture) vs. F is a version of the rate–distortion curve. Building and acting upon this internal representation, however, entails various costs, and these can all be assigned to the construction of the representation out of the (past) data as they are collected; the curve of I(Xint; Xpast) vs. C is an example of the channel capacity. Finally, the information bottleneck principle tells us that there is an optimum choice of internal representation which maximizes I(Xint; Xfuture) at fixed I(Xint; Xpast).

The four interconnected optimization principles certainly have to be consistent with one another. Thus, if an organism wants to achieve a certain mean fitness, it needs a minimum number of bits of predictive power, and this requires collecting a minimum number of bits about the past, which in turn necessitates some minimum cost. The possible combinations of cost and fitness—the accessible region of the biologically meaningful tradeoff in Fig 1—thus have a reflection in the “information plane” (the lower left quadrant of Fig 3) where we trade bits about the future against bits about the past.

The consistency of the different optimization principles means that the purely information theoretic tradeoff between bits about the future and bits about the past must constrain the biologically optimal tradeoff between resources and fitness. We would like to make “constrain” more precise, and conjecture that under reasonable conditions organisms which operate at the biological optimum (that is, along the bounding curve in the upper right quadrant of Fig 3) also operate along the information theoretic optimum (the bounding curve in the lower left quadrant of Fig 3). At the moment this is only a conjecture, but we hope that the relationships in Fig 3 open a path to some more rigorous connections between information theoretic and biological quantities.

VI. A UNIFYING PRINCIPLE?

The optimization principle in Eq (1) is very abstract; here we consider two concrete examples. First imagine that we observe a Gaussian stochastic process x(t) that consists of a correlated signal s(t) in a background of white noise η(t). For simplicity, let’s understand ‘correlated’ to mean that s(t) has an exponentially decaying correlation function with a correlation time τc. Thus, x(t) = s(t) + η(t), where

\[
\langle s(t) s(t') \rangle = \sigma^2 \exp\left(-|t-t'|/\tau_c\right), \tag{2}
\]
\[
\langle \eta(t) \eta(t') \rangle = N_0\, \delta(t-t'), \tag{3}
\]

and hence the power spectrum of x(t) is given by

\[
\langle x(t) x(t') \rangle \equiv \int \frac{d\omega}{2\pi}\, S_x(\omega) \exp\left[-i\omega(t-t')\right], \tag{4}
\]
\[
S_x(\omega) = \frac{2\sigma^2 \tau_c}{1 + (\omega\tau_c)^2} + N_0. \tag{5}
\]
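The Lorentzian form of the signal spectrum in Eq (5) is just the Fourier transform of the exponential correlation function in Eq (2), which is easy to verify numerically (our check, with arbitrary parameter values):

```python
import numpy as np

sigma2, tau_c = 2.0, 0.5

def s_signal(omega):
    # Signal part of the spectrum, Eq. (5): a Lorentzian of width 1/tau_c.
    return 2.0 * sigma2 * tau_c / (1.0 + (omega * tau_c) ** 2)

# Fourier-transform the exponential correlation function of Eq. (2) on a
# fine grid; the cosine transform should reproduce the Lorentzian above.
tau = np.linspace(-50.0 * tau_c, 50.0 * tau_c, 400_001)
dtau = tau[1] - tau[0]
corr = sigma2 * np.exp(-np.abs(tau) / tau_c)
for omega in (0.0, 1.0, 5.0):
    numeric = np.sum(corr * np.cos(omega * tau)) * dtau
    print(omega, numeric, s_signal(omega))
```

At ω = 0 this reduces to the sum rule S_x(0) − N₀ = 2σ²τc, the total signal power times the correlation time.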

The full probability distribution for the function x(t) is

\[
P[x(t)] = \frac{1}{Z} \exp\left[ -\frac{1}{2} \int dt \int dt'\, x(t)\, K(t-t')\, x(t') \right], \tag{6}
\]

where Z is a normalization constant and the kernel

\[
K(\tau) = \int \frac{d\omega}{2\pi}\, \frac{1}{S_x(\omega)} \exp(-i\omega\tau). \tag{7}
\]

If we sit at t = 0, then Xpast ≡ x(t < 0) and Xfuture ≡ x(t > 0). In the exponential of Eq (6), mixing between Xpast and Xfuture is confined to a term which can be written as

\[
\left[ \int_{-\infty}^{0} dt\, g(-t)\, x(t) \right] \times \left[ \int_{0}^{\infty} dt'\, g(t')\, x(t') \right], \tag{8}
\]

where g(t) = exp(−t/τ₀), with τ₀ = τc(1 + σ²τc/N₀)^{−1/2}. This means that the probability distribution of Xfuture given Xpast depends only on x(t) as seen through the linear filter g(τ), and hence only this filtered version of the past can contribute to Xint [22].
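Two limits of the filter time constant are instructive (a direct expansion of the expression for τ₀ above, not spelled out in the text):

```latex
\tau_0 = \tau_c \left( 1 + \frac{\sigma^2 \tau_c}{N_0} \right)^{-1/2}
\longrightarrow
\begin{cases}
\tau_c, & \sigma^2\tau_c/N_0 \ll 1 \quad (\text{low SNR}),\\[4pt]
\sqrt{N_0\,\tau_c}\,/\,\sigma, & \sigma^2\tau_c/N_0 \gg 1 \quad (\text{high SNR}),
\end{cases}
```

so at low SNR the predictive filter integrates over the full correlation time of the signal, while at high SNR a much shorter memory of the past suffices.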

The filter g(t) is exactly the filter that provides optimal separation between the signal s(t) and the noise η(t); more precisely, given the data Xpast, if we ask for the best estimate of the signal s(t), where “best” means minimizing the mean–square error, then this optimal estimate is just y(t), the output of the filter g applied to the past data [24]. Solving the problem of optimally representing the predictive information in this time series thus is identical to the problem of optimally separating signal from noise.

In contrast to these results for Gaussian time series with finite correlation times, consider what happens when we look at a time series that has essentially infinitely long correlations. Specifically, consider an ensemble of possible experiments in which points xₙ are drawn independently and at random from the probability distribution P(x|α⃗), where α⃗ is a K–dimensional vector of parameters specifying the distribution. At the start of each experiment these parameters are drawn from the distribution P(α⃗) and then fixed for all n. Thus the joint distribution for many successive observations of x in one experiment is given by

\[
P(x_1, x_2, \cdots, x_M) = \int d^K\alpha\, P(\vec\alpha) \prod_{n=1}^{M} P(x_n|\vec\alpha). \tag{9}
\]

Now we can define Xpast ≡ {x₁, x₂, · · · , x_N} and Xfuture ≡ {x_{N+1}, x_{N+2}, · · · , x_M}, and we can (optimistically) imagine an unbounded future, M → ∞. To find the optimal representation of predictive information in this case we need a bit more of the apparatus of the information bottleneck [21].

It was shown in Ref [21] that an optimization problem of the form in Eq (1) can be solved by probabilistic mappings Xpast → Xint provided that the distribution which describes this mapping obeys a self–consistent equation,

\[
P(X_{\rm int}|X_{\rm past}) = \frac{1}{Z(X_{\rm past};\lambda)} \exp\left[ -\frac{1}{\lambda}\, D_{\rm KL}(X_{\rm past}; X_{\rm int}) \right], \tag{10}
\]

where Z is a normalization constant and D_KL(Xpast; Xint) is the Kullback–Leibler divergence between the distributions of Xfuture conditional on Xpast and Xint, respectively,

\[
D_{\rm KL}(X_{\rm past}; X_{\rm int}) = \int DX_{\rm future}\, P(X_{\rm future}|X_{\rm past})
\ln\left[ \frac{P(X_{\rm future}|X_{\rm past})}{P(X_{\rm future}|X_{\rm int})} \right]. \tag{11}
\]

Since the future depends on our internal representation only because this internal representation is built from observations on the past, we can write

\[
P(X_{\rm future}|X_{\rm int}) = \int DX_{\rm past}\, P(X_{\rm future}|X_{\rm past})\, P(X_{\rm past}|X_{\rm int}), \tag{12}
\]
\[
P(X_{\rm past}|X_{\rm int}) = \frac{P(X_{\rm int}|X_{\rm past})\, P(X_{\rm past})}{P(X_{\rm int})}, \tag{13}
\]

which shows that Eq (10) really is a self–consistent equation for P(Xint|Xpast). To solve these equations it is helpful to realize that they involve integrals over many variables, since Xpast is N dimensional and Xfuture is M dimensional. In the limit that these numbers are very large and the temperature–like parameter λ is very small, it is plausible that the relevant integrals are dominated by their saddle points.
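Away from this limit, the self-consistent equations (10)–(13) can be iterated directly for discrete variables, alternating between updating the mapping and the induced distribution over futures. A minimal numerical sketch (the toy joint distribution, the cardinality of Xint, and β ≡ 1/λ are our choices, not from the text):

```python
import numpy as np

def mutual_info(p_joint):
    """I(A;B) in bits from a 2-D joint distribution."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (pa * pb)[mask])))

def info_bottleneck(p_xy, m=2, beta=5.0, iters=200, seed=1):
    """Iterate Eqs. (10)-(13) for discrete variables:
    x = past, y = future, t = internal representation."""
    rng = np.random.default_rng(seed)
    nx, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    p_t_given_x = rng.dirichlet(np.ones(m), size=nx)  # random initial mapping
    for _ in range(iters):
        p_t = np.maximum(p_t_given_x.T @ p_x, 1e-12)                 # P(Xint)
        p_x_given_t = (p_t_given_x * p_x[:, None]).T / p_t[:, None]  # Eq. (13)
        p_y_given_t = np.maximum(p_x_given_t @ p_y_given_x, 1e-12)   # Eq. (12)
        # Eq. (10): weight each representation by P(Xint) * exp(-beta * D_KL)
        dkl = np.array([[np.sum(p_y_given_x[i] * np.log(p_y_given_x[i] / p_y_given_t[t]))
                         for t in range(m)] for i in range(nx)])
        logits = np.log(p_t)[None, :] - beta * dkl
        p_t_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x

# Toy world: four past states, two possible futures.
p_xy = np.array([[0.40, 0.10], [0.35, 0.15], [0.10, 0.40], [0.15, 0.35]])
p_xy /= p_xy.sum()
p_t_given_x = info_bottleneck(p_xy, m=2, beta=5.0)

p_x = p_xy.sum(axis=1)
p_tx = (p_t_given_x * p_x[:, None]).T   # joint P(Xint, Xpast)
p_ty = p_tx @ (p_xy / p_x[:, None])     # joint P(Xint, Xfuture)
print("I(Xint;Xpast)    =", round(mutual_info(p_tx), 3), "bits")
print("I(Xint;Xfuture)  =", round(mutual_info(p_ty), 3), "bits")
print("I(Xpast;Xfuture) =", round(mutual_info(p_xy), 3), "bits")
```

Because Xint → Xpast → Xfuture is a Markov chain, I(Xint; Xfuture) can never exceed I(Xint; Xpast) or I(Xpast; Xfuture); sweeping β traces out the optimal curve in the information plane of Fig 3.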

In the saddle point approximation one can find solutions for P(Xint|Xpast) that have the following suggestive form. The variable Xint can be thought of as a point in a K–dimensional space, and then the distributions P(Xint|Xpast) are Gaussian, centered on locations α⃗est(Xpast), which are the maximum a posteriori Bayesian estimates of the parameters α⃗ given the observations {x₁, x₂, · · · , x_N}. The covariance of the Gaussian is proportional to the inverse Fisher information matrix, reflecting our certainty about α⃗ given the past data. Thus, in this case, if we solve the problem of efficiently representing predictive information, then we have solved the problem of learning the parameters of the probabilistic model that underlies the data we observe.
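This saddle-point picture can be checked in the simplest case, K = 1 with xₙ drawn from a Bernoulli distribution with parameter α (our toy example, not from the text): the exact posterior over α is a Beta distribution, and for large N its width approaches the inverse-Fisher-information scale √(α̂(1−α̂)/N) around the point estimate.

```python
import math

# N Bernoulli observations with k successes and a uniform prior on alpha:
# the exact posterior is Beta(k + 1, N - k + 1).
N, k = 10_000, 3_000
a, b = k + 1, N - k + 1

post_mean = a / (a + b)
post_std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Inverse Fisher information scale at the point estimate alpha_hat = k/N.
alpha_hat = k / N
fisher_std = math.sqrt(alpha_hat * (1.0 - alpha_hat) / N)

print(post_std, fisher_std)  # nearly equal for large N
```

The agreement improves as 1/N, consistent with the claim that the optimal Xint is, to leading order, the MAP estimate together with its Fisher-information uncertainty.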

Signal processing and learning usually are seen as verydifferent problems, especially from a biological point ofview. Building an optimal filter to separate signal fromnoise is a “low–level” task, presumably solved in the veryfirst layers of sensory processing. Learning is a higherlevel problem, especially if the model we are learningstarts to describe objects far removed from raw sensedata, and presumably happens in the cerebral cortex.Returning to the problem of placing value on informa-tion, separating signal from noise by filtering and learn-ing the parameters of a probabilistic model seem to bevery different goals. In the conventional biological view,organisms carry out these tasks at different times andwith different mechanisms because the kinds of infor-mation that one extracts in the two cases have differentvalue to the organism. What we have shown is that thereis an alternative, and more unfiied, view.

There is a single principle—efficient representation of predictive information—that values all (predictive) bits equally but in some instances corresponds to filtering and in others to learning. In this view, what determines whether we should filter or learn is not an arbitrary “biological” choice of goal or assignment of value, but rather the structure of the data stream to which we have access.

VII. NEURAL CODING OF PREDICTIVE INFORMATION

It would be attractive to have a direct test of these ideas. We recall that neurons respond to sensory stimuli with sequences of identical action potentials or “spikes,” and hence the brain’s internal representation of the world is constructed from these spikes [8]. More narrowly, if we record from a single neuron, then this internal representation Xint can be identified with a short segment of the spike train from that neuron, while Xpast and Xfuture are the past and future sensory inputs, respectively. The conventional analysis of neural responses focuses on the relationship between Xpast and Xint—trying to understand what features of the recent sensory stimuli are responsible for shaping the neural response. In contrast, the framework proposed here suggests that we try to quantify the information I(Xint;Xfuture) that neural responses provide about the future sensory inputs [25]. More specifically, to test the hypothesis that the brain generates maximally efficient representations of predictive information, we need to measure directly both I(Xint;Xfuture) and I(Xint;Xpast), and see whether in a given sensory environment the neural representation Xint lies near the optimal curve predicted from Eq (1).

It would seem that to measure I(Xint;Xfuture) we would have to understand the structure of the code by which spike trains represent the future; the same problem arises even with I(Xint;Xpast). In fact there is a

more direct strategy [9]. The essential idea behind direct measurements of neural information transmission [9] is to use the (ir)reproducibility of the neural response to repeated presentations of the same dynamic sensory signal. If we think of the sensory stimulus as a movie that runs from time t = 0 to t = T, we can run the movie repeatedly in a continuous loop. Then at each moment t we can look at the response R ≡ Xint of the neuron, and if there are enough repetitions of the movie we can estimate the conditional distribution P(Xint|t); the entropy Sn(t) of this distribution measures the “noise” in the neural response. On the other hand, if we average over the time t we can estimate P(Xint), and the entropy Stotal of this distribution measures the capacity of the neural responses to convey information. In the limit of large T ergodicity allows us to identify averages over time with averages over the distribution out of which the stimulus movies are being drawn, and then

I = Stotal − (1/T) ∫₀^T dt Sn(t), (14)

is the mutual information between sensory inputs and neural responses. Since the neuron responds causally to its sensory inputs, the information that it carries about these inputs necessarily is information about the past, I(Xint;Xpast) [26]. Note that this computation does not require us to understand how to read out the encoded information, or even to know which features of the sensory inputs are encoded by the brain.
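As a concrete illustration, here is a minimal sketch of this direct strategy for the simplest case in which the response in each bin is binary (spike or no spike). The array layout and function names are our own, not from Ref [9], and real analyses require sampling bias corrections [14, 15] that we omit:

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a probability vector; zero entries are ignored."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def direct_information(spikes):
    """Plug-in estimate of Eq (14), in bits per bin.

    `spikes` is a (repeats x time bins) binary array: each row is one
    repeated presentation of the same stimulus movie.  Sn(t) is the
    entropy of the response at fixed t (across repeats); Stotal is the
    entropy of the time-averaged response distribution.
    """
    p_spike_t = spikes.mean(axis=0)                  # P(spike | t)
    s_noise = np.mean([entropy_bits([p, 1 - p]) for p in p_spike_t])
    p_spike = spikes.mean()                          # P(spike), averaged over t
    s_total = entropy_bits([p_spike, 1 - p_spike])
    return s_total - s_noise
```

A perfectly reproducible response carries all of Stotal as information, while a response that ignores the stimulus carries none; the estimator interpolates between these extremes.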

More careful analysis makes clear that the strategy in Ref [9] measures the information which Xint provides about whatever aspects of the sensory stimulus are being repeated. For example, if we have a movie with sound and we repeat the video but randomize the audio, then following the analysis of Ref [9] we would measure the information that neurons carry about their visual and not auditory inputs. Thus to measure I(Xint;Xfuture) we need to generate sensory stimuli that are all drawn independently from the same distribution but are constrained to lead to the same future, and then reproducible neural responses to these stimuli will reflect information about the future. This can be done by a variety of methods.

Consider a time dependent signal sk(t) generated on repeat k as

τc dsk(t)/dt + sk(t) = ξk(t), (15)

where ξk(t < 0) = ξ0(t) for all k, while each ξk(t > 0) is drawn independently; in the simplest case ξ(t) has no correlations in time (white noise). Then the correlation function of the signal becomes

〈sk(t)sk(t′)〉 ∝ exp(−|t − t′|/τc), (16)

and all sk(t < 0) are identical. Now take the trajectories and reverse the direction of time. The result is an



[Figure 4: four panels. Left column: s(t) (°/s) and trials, each vs. time relative to common future (s). Right column: spike probability (1/s) and predictive information rate (bits/s), each vs. time relative to common future (s).]

FIG. 4: Trajectories with a common future and their neural representation. Top left: Sample trajectories sk(t) designed to converge on a common future at t = 0, as explained in the text. Units are angular velocity, since these signals will be used to drive the motion–sensitive visual neuron H1. The correlation time τc = 0.05 s and the variance 〈s²〉 = (200°/s)². Bottom left: Responses of the blowfly H1 neuron to ninety different trajectories of angular velocity vs. time sk(t) that converge on a common future at t = 0; each dot represents a single spike generated by H1 in response to these individual signals. Stimulus delivery and recordings as described in Ref [12]. Top right: Probability per unit time of observing a spike in response to trajectories that converge on three different common futures. At times long before the convergence, all responses are drawn from the same distribution and hence have the same spike probability within errors. Divergences among responses begin at a time ∼ τc prior to the common future. This divergence means that the neural responses carry information about the particular common future, as explained in the text. Error bars are standard errors of the mean, estimated by bootstrapping. Bottom right: Information in single time bins about the identity of the future, normalized as an information rate. Blue points are from the real data, and green points are from shuffled data that should have zero information.

ensemble of trajectories that lead to the same future but are otherwise statistically independent, as in the upper left panel in Fig 4.
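This construction can be sketched numerically. The toy Euler discretization and the function name below are ours, not the actual stimulus-generation code; the shared noise segment plays the role of ξ0(t):

```python
import numpy as np

def common_future_trajectories(n_repeats, n_bins, dt=1e-3, tau_c=0.05,
                               sigma=200.0, seed=0):
    """Trajectories obeying Eq (15) that converge on a common future.

    The white noise xi_k(t) is shared across repeats for the first half
    of the run and drawn independently afterwards, so forward in time
    the trajectories are identical early and diverge late; reversing
    the time axis then gives independent pasts with a common future.
    The noise amplitude is chosen so the stationary std is ~ sigma.
    """
    rng = np.random.default_rng(seed)
    half = n_bins // 2
    amp = sigma * np.sqrt(2.0 * dt / tau_c)
    shared = rng.normal(0.0, 1.0, half)              # common noise segment
    s = np.zeros((n_repeats, n_bins))
    for k in range(n_repeats):
        eta = np.concatenate([shared, rng.normal(0.0, 1.0, n_bins - half)])
        for t in range(1, n_bins):
            # Euler step of tau_c ds/dt + s = xi
            s[k, t] = s[k, t - 1] * (1.0 - dt / tau_c) + amp * eta[t]
    return s[:, ::-1]                                # reverse time
```

After the reversal, the rows of the returned array agree exactly over their final segment (the common future) and are statistically independent before it.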

We have used the strategy outlined above to explore the coding of predictive information in the fly visual system, returning to the neuron H1. The extreme stability of recordings from H1 has been exploited in experiments where we deliver motion stimuli by physically rotating the fly outdoors in a natural environment rather than showing movies to a fixed fly [12], and this is the path that we follow here.

We have generated angular velocity trajectories s(t) with a variance 〈s²〉 = (200°/s)² and a correlation time τc = 0.05 s by numerical solution of Eq (15). We choose nine such segments at random to be the common futures, and then follow the construction leading to the upper left panel in Fig 4 to generate ninety independent trajectories for each of these common futures. These trajectories

are used as angular velocity signals to drive rotation of a blowfly Calliphora vicina mounted on a motor drive as in Ref [12] while we record the spikes generated by the H1 neuron.

The lower left panel in Figure 4 shows examples of the spike trains generated by H1 in response to independent stimuli that converge on a common future. Long before the convergence, stimuli are completely different on every trial, and hence the neural responses are highly variable. As we approach the convergence time, stimuli on different trials start to share features which are predictive of the common future, and hence the neural responses become more reproducible. Importantly, stimulus trajectories that converge on different common futures generate responses that are not only reproducible but also distinct from one another, as seen in the upper right of Fig 4. Our task now is to quantify this distinguishability by estimating I(Xint;Xfuture).



[Figure 5: predictive information (bits) vs. window preceding common future (s), over windows from 0 to 0.2 s.]

FIG. 5: Information about the future carried by spikes in a window that ends at the time of convergence onto a common future. Essentially this is the integral of the information rate shown at the lower right in Fig 4, but error bars must be evaluated carefully. Shown in green are the results for shuffled data, which should be zero within errors if our estimates are reliable.

Imagine dividing time into small bins of duration ∆τ. For small ∆τ, we observe either one spike or no spikes, so the neural response is a binary variable. If we sit at one moment in time relative to the convergence on a common future, we observe this binary variable, and it is associated with one of the nine possible futures. Thus there is a 2 × 9 table of futures and responses, and it is straightforward to use the experiments to fill in the frequencies with which each of these response/future combinations occurs. With 810 samples to fill in the frequencies of 18 possible events, we have reasonably good sampling and can make reliable estimates of the mutual information, with error bars, following the methods of Refs [9, 27]. In each time bin, then, we can estimate the information that the neural response provides about the future, and we can normalize this by the duration of the bin ∆τ to obtain an information rate, as shown at the lower right in Fig 4. We see that information about sensory signals in the future (t > 0) is negligible in the distant past, as it must be, with the scale of the decay set by the correlation time τc. The local information rate builds up as we approach t = 0, peaking for this stimulus ensemble at ∼ 30 bits/s.
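A minimal sketch of this per-bin estimate follows: a plug-in mutual information computed from the 2 × 9 count table. The function names are ours, and the bias corrections and error bars of Refs [9, 27] are omitted:

```python
import numpy as np

def entropy_bits(counts):
    """Entropy in bits of a (possibly multi-dimensional) count array."""
    p = np.asarray(counts, dtype=float).ravel()
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def info_future_bits(table):
    """I(response; future) from a 2 x 9 table of counts:
    rows = {no spike, spike} in this time bin, columns = which of the
    nine common futures the trial converged on."""
    table = np.asarray(table, dtype=float)
    h_r = entropy_bits(table.sum(axis=1))   # H(response)
    h_f = entropy_bits(table.sum(axis=0))   # H(future identity)
    h_rf = entropy_bits(table)              # joint entropy
    return h_r + h_f - h_rf                 # mutual information
```

Dividing the result by ∆τ gives the information rate plotted at the lower right of Fig 4. Note that a binary response can carry at most H(response) ≤ 1 bit per bin, which is why rates rather than totals are the natural units here.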

The results in Fig 4 provide a moment by moment view of the predictive information in the neural response, but we would like a slightly more integrated view: if we sit at t = 0 and look back across a window of duration T, how much predictive information can we extract from the responses in this window? The complication in answering

this question is that the neural response across this window is a T/∆τ–letter binary word, and the space of these words is difficult to sample for large T. The problem could be enormously simpler, however, if the information carried by each spike were independent, since then the total information would be the integral of the local information rate. This independence isn’t exactly true, but it isn’t a bad approximation under some conditions [10], and we adopt it here. Then the only remaining technical problem is to be sure that small systematic errors in the estimate of the local information rate don’t accumulate as we compute the time integral, but this can be checked by shuffling the data and making sure that the shuffled data yields zero information. The results of this computation are shown in Fig 5.
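Under the independent-spike approximation the window integral is just a cumulative sum of the local rates; a sketch (variable names ours), with the shuffled-data integral serving as the bias control:

```python
import numpy as np

def cumulative_info_bits(rate_bits_per_s, dtau):
    """Integrate a local predictive-information rate (one value per time
    bin of width dtau, in bits/s) over a growing window ending at t = 0,
    assuming spikes in different bins carry independent information."""
    return np.cumsum(rate_bits_per_s) * dtau

# The same integral applied to shuffled data should hover near zero;
# if it drifts upward, small per-bin biases are accumulating.
```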

Should we be surprised by the fact that the neural response from this single neuron carries somewhat more than one bit of information about the future? Perhaps not. This is, after all, a direction selective motion sensitive neuron, and because of the correlations the direction of motion tends to persist; maybe all we have found is that the neuron encodes the current sign of the velocity, and this is a good predictor of the future sign, hence one bit, with perhaps a little more coming along with some knowledge of the speed. We’d like to suggest that things are more subtle.

Under the conditions of these experiments, the signal–to–noise ratio in the fly’s retina is quite high. As explained in Ref [17], we can think of the fly’s eye as providing an essentially perfect view of the visual world that is updated with a time resolution ∆t on the scale of milliseconds. But if we are observing a Gaussian stochastic process with a correlation time of τc, then the limit to prediction is set by the need to extrapolate across the ‘gap’ of duration ∆t. Since the exponentially decaying correlation function corresponds to a Markov process, this limiting predictive information is calculable simply as the mutual information between two samples separated by the gap; the result is

Ipred = (1/2) log₂ [1/(1 − exp(−2∆t/τc))]. (17)

Plugging in the numbers, we find that, for these stimuli, capturing roughly one bit of predictive information depends in an essential way on the system having a time resolution of better than ten milliseconds, and the observed predictive information requires resolution in the 3–4 ms range.
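Eq (17) is easy to evaluate for the numbers quoted here (τc = 50 ms); the function name is ours:

```python
import numpy as np

def ipred_bits(dt, tau_c):
    """Predictive information of Eq (17), in bits: the mutual information
    between two samples of the Gaussian Markov (OU) process separated by
    a gap of duration dt."""
    return 0.5 * np.log2(1.0 / (1.0 - np.exp(-2.0 * dt / tau_c)))

tau_c = 0.05  # 50 ms correlation time, as in the experiments
for dt_ms in (3, 4, 10):
    print(f"dt = {dt_ms} ms: {ipred_bits(dt_ms * 1e-3, tau_c):.2f} bits")
```

At ∆t = 10 ms the bound is about 0.8 bits, below what the neuron achieves, while ∆t = 3–4 ms gives roughly 1.4–1.6 bits, consistent with the statement in the text.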

When we look back at a window of the neural response with duration T, we expect to gain RinfoT bits of information about the stimulus [28], and as noted above this necessarily is information about the past. Thus the x–axis of Fig 5, which measures the duration of the window, can be rescaled, so that the whole figure is a plot of information about the future vs information about the past, as in the lower left quadrant of Fig 3. Thus we can compare this measure of neural performance with the information bottleneck limit derived from the statistical



structure of the visual stimulus itself; since the stimulus is a Gaussian stochastic process, this is straightforward [23], and as above we assume that the fly has access to a perfect representation of the velocity vs. time, updated at multiples of the time resolution ∆t. With ∆t in the range of 3–4 ms to be consistent with the total amount of predictive information captured by the neural response, we find that the performance of the neuron always is within a factor of two of the bottleneck limit, across the whole range of window sizes.
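For orientation, the simplest version of this bound has a closed form: the Gaussian information bottleneck curve [23] for a single past sample and a single future sample with correlation ρ. This is only a sketch; the actual comparison uses the full multi-sample Gaussian process, and the function name is ours:

```python
import numpy as np

def ib_curve_bits(i_past_bits, rho):
    """Optimal I(T; future) at fixed I(T; past), for scalar jointly
    Gaussian past/future with correlation coefficient rho (scalar case
    of the Gaussian information bottleneck).  Inputs/output in bits."""
    i_nats = np.asarray(i_past_bits, dtype=float) * np.log(2.0)
    i_fut_nats = -0.5 * np.log(1.0 - rho**2 * (1.0 - np.exp(-2.0 * i_nats)))
    return i_fut_nats / np.log(2.0)
```

The curve rises from the origin with initial slope ρ² and saturates at the total predictive information −(1/2) log₂(1 − ρ²); measured points for the neuron can then be compared against it in the information plane.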

VIII. DISCUSSION

We should begin our discussion by reminding the reader that, more than most papers, this is a report on work in progress, intended to capture the current state of our understanding rather than to draw firm conclusions.

Efficient representation? Although one should be careful of glib summaries, it does seem that the fly’s visual system offers concrete evidence of the brain building representations of the sensory world that are efficient in the sense defined by information theory. The absolute information rates are large (especially in comparison to prior expectations in the field!), and there are many signs that the coding strategy used by the brain is matched quantitatively to the statistical structure of sensory inputs, even as these change in time. This matching, which gives us a much broader view of “adaptation” in sensory processing, has now been observed directly in many different systems [29].

Why are these bits different from all other bits? Contrary to widespread views in the neuroscience community, information theory does give us a language for distinguishing relevant from irrelevant information. We have tried to argue that, for living organisms, the crucial distinction is predictive power. Certainly data without predictive power are useless, and thus ‘purifying’ predictive from non–predictive bits is an essential task. Our suggestion is that this purification may be more than just a first step, and that providing a maximally efficient representation of the predictive information can be mapped to more biologically grounded notions of optimal performance. Whether or not this general argument can be made rigorous, certainly it is true that extracting predictive information serves to unify the discussion of problems as diverse as signal processing and learning.

A new look at the neural code? The traditional approach to the analysis of neural coding tries to correlate (sometimes in the literal mathematical sense) the spikes generated by neurons with particular features of the sensory stimulus. But, because the system is causal, these features must be features of the organism’s recent past experience. Our discussion of predictive information suggests a very different view, in which we ask how the neural response represents the organism’s future sensory experience. Although there are many things to be done in this direction, we find it exciting that one can make rather direct measurements of the predictive power encoded in the neural response.

From an experimental point of view, the most compelling success would be to map neural responses to points in the information plane—information about the future vs information about the past—and find that these points are close to the theoretical optimum determined by the statistics of the sensory inputs and the information bottleneck. We are close to being able to do this, but there is enough uncertainty in our estimates of information (recall that we work only in the approximation where spikes carry independent information) that we are reluctant to put theory and experiment on the same graph. Our preliminary result, however, is that theory and experiment agree within a factor of two, encouraging us to look more carefully.

Acknowledgments

We thank N Brenner, G Chechik, AL Fairhall, A Globerson, R Koberle, GD Lewen, I Nemenman, FC Pereira, E Schneidman, SP Strong and Y Weiss for their contributions to work reviewed here, and T Berger for the invitation to speak at ISIT06. This work was supported in part by the National Science Foundation, through Grants IIS–0423039 and PHY–0650617, and by the Swartz Foundation.

[1] CE Shannon, A mathematical theory of communication. Bell Sys Tech J 27, 379–423 & 623–656 (1948).

[2] F Attneave, Some informational aspects of visual perception. Psych Rev 61, 183–193 (1954).

[3] HB Barlow, Sensory mechanisms, the reduction of redundancy, and intelligence. In Proceedings of the Symposium on the Mechanization of Thought Processes, volume 2, DV Blake & AM Uttley, eds, pp 537–574 (HM Stationery Office, London, 1959).

[4] HB Barlow, Possible principles underlying the transformation of sensory messages. In Sensory Communication, W Rosenblith, ed, pp 217–234 (MIT Press, Cambridge, 1961).

[5] W Reichardt & T Poggio, Visual control of orientation behavior in the fly. Part I: A quantitative analysis. Q Rev Biophys 9, 311–375 (1976).

[6] LG Bishop & DG Keehn, Two types of neurones sensitive to motion in the optic lobe of the fly. Nature 212, 1374–1376 (1966).

[7] K Hausen & C Wehrhahn, Microsurgical lesion of horizontal cells changes optomotor yaw responses in the blowfly Calliphora erythrocephala. Proc R Soc Lond B 219, 211–216 (1983).

[8] F Rieke, D Warland, R de Ruyter van Steveninck & W



Bialek, Spikes: Exploring the Neural Code (MIT Press, Cambridge, 1997).

[9] SP Strong, R Koberle, RR de Ruyter van Steveninck & W Bialek, Entropy and information in neural spike trains. Phys Rev Lett 80, 197–200 (1998); arXiv:cond-mat/9603127.

[10] N Brenner, SP Strong, R Koberle, W Bialek & RR de Ruyter van Steveninck, Synergy in a neural code. Neural Comp 12, 1531–1552 (2000); arXiv:physics/9902067.

[11] E Schneidman, N Brenner, N Tishby, RR de Ruyter van Steveninck & W Bialek, Universality and individuality in a neural code. In Advances in Neural Information Processing 13, TK Leen, TG Dietterich & V Tresp, eds, pp 159–165 (MIT Press, Cambridge, 2001); arXiv:physics/0005043.

[12] GD Lewen, W Bialek & RR de Ruyter van Steveninck, Neural coding of naturalistic motion stimuli. Network 12, 317–329 (2001); arXiv:physics/0103088.

[13] I Nemenman, GD Lewen, W Bialek & RR de Ruyter van Steveninck, Neural coding of a natural stimulus ensemble: Uncovering information at sub–millisecond resolution. PLoS Comp Bio, in press; arXiv:q-bio.NC/0612050 (2006).

[14] I Nemenman, W Bialek & R de Ruyter van Steveninck, Entropy and information in neural spike trains: Progress on the sampling problem. Phys Rev E 69, 056111 (2004); arXiv:physics/0306063.

[15] NIPS 2003 Workshop, Estimation of entropy and information of undersampled probability distributions. http://nips.cc/Conferences/2003/Workshops/#EstimationofEntropy.

[16] N Brenner, W Bialek & R de Ruyter van Steveninck, Adaptive rescaling optimizes information transmission. Neuron 26, 695–702 (2000).

[17] AL Fairhall, GD Lewen, W Bialek & RR de Ruyter van Steveninck, Efficiency and ambiguity in an adaptive neural code. Nature 412, 787–792 (2001).

[18] SB Laughlin, RR de Ruyter van Steveninck & JC Anderson, The metabolic cost of neural information. Nature Neurosci 1, 36–41 (1998).

[19] V Balasubramanian, D Kimber & MJ Berry II, Metabolically efficient information processing. Neural Comp 13, 799–815 (2001).

[20] W Bialek, I Nemenman & N Tishby, Predictability, complexity and learning. Neural Comp 13, 2409–2463 (2001); arXiv:physics/0007070.

[21] N Tishby, FC Pereira & W Bialek, The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, B Hajek & RS Sreenivas, eds, pp 368–377 (University of Illinois, 1999); arXiv:physics/0004057.

[22] The information bottleneck in the case of Gaussian distributions admits a more general discussion, in particular identifying those instances in which Xint depends only on a limited set of linear projections of the input variable (Xpast in the present discussion). See Ref [23] for details.

[23] G Chechik, A Globerson, N Tishby & Y Weiss, Information bottleneck for Gaussian variables. J Machine Learn Res 6, 165–188 (2005).

[24] N Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series (Wiley, New York, 1949).

[25] There is a literature which focuses on the predictive nature of neural responses, but only in very specific situations such as the prediction of rewards. The arguments presented above suggest that prediction is a much more general problem.

[26] This step depends on the design of the experiment. In fully natural conditions, the outputs of sensory neurons serve to guide motor behavior, and hence the future sensory inputs depend explicitly on the internal representation. Analysis of this closed loop situation is beyond the scope of our theoretical discussion above, so we consider here experiments in which sensory stimuli are independent of the neural output.

[27] N Slonim, GS Atwal, G Tkacik & W Bialek, Estimating mutual information and multi–information in large networks. arXiv:cs.IT/0502017 (2005).

[28] Strictly speaking this is correct only for large T, but in the approximation that each spike provides independent information this proportionality is exact.

[29] B Wark, BN Lundstrom & A Fairhall, Sensory adaptation. Curr Opin Neurobio 17, 423–429 (2007).