Searching for simplicity in the analysis of neurons and behavior

Greg J. Stephens^{a,1}, Leslie C. Osborne^{b}, and William Bialek^{a}

^{a}Joseph Henry Laboratories of Physics, Lewis–Sigler Institute for Integrative Genomics, and Princeton Center for Theoretical Sciences, Princeton University, Princeton, NJ 08544; and ^{b}Department of Neurobiology, University of Chicago, Chicago, IL 60637

Edited by Donald W. Pfaff, The Rockefeller University, New York, NY, and approved January 20, 2011 (received for review October 21, 2010)

What fascinates us about animal behavior is its richness and complexity, but understanding behavior and its neural basis requires a simpler description. Traditionally, simplification has been imposed by training animals to engage in a limited set of behaviors, by hand scoring behaviors into discrete classes, or by limiting the sensory experience of the organism. An alternative is to ask whether we can search through the dynamics of natural behaviors to find explicit evidence that these behaviors are simpler than they might have been. We review two mathematical approaches to simplification, dimensionality reduction and the maximum entropy method, and we draw on examples from different levels of biological organization, from the crawling behavior of Caenorhabditis elegans to the control of smooth pursuit eye movements in primates, and from the coding of natural scenes by networks of neurons in the retina to the rules of English spelling. In each case, we argue that the explicit search for simplicity uncovers new and unexpected features of the biological system and that the evidence for simplification gives us a language with which to phrase new questions for the next generation of experiments. The fact that similar mathematical structures succeed in taming the complexity of very different biological systems hints that there is something more general to be discovered.

maximum entropy models | stochastic dynamical systems

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "Quantification of Behavior," held June 11–13, 2010, at the AAAS Building in Washington, DC. The complete program and audio files of most presentations are available on the NAS Web site at www.nasonline.org/quantification.

Author contributions: G.J.S., L.C.O., and W.B. designed research; G.J.S., L.C.O., and W.B. performed research; G.J.S., L.C.O., and W.B. contributed new reagents/analytic tools; G.J.S., L.C.O., and W.B. analyzed data; and G.J.S., L.C.O., and W.B. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

1To whom correspondence should be addressed. E-mail: gstephen@princeton.edu.

The last decades have seen an explosion in our ability to characterize the microscopic mechanisms—the molecules, cells, and circuits—that generate the behavior of biological systems. In contrast, our characterization of behavior itself has advanced much more slowly. Starting in the late 19th century, attempts to quantify behavior focused on experiments in which the behavior itself was restricted, for example by forcing an observer to choose among a limited set of alternatives. In the mid-20th century, ethologists emphasized the importance of observing behavior in its natural context, but here, too, the analysis most often focused on the counting of discrete actions. Parallel to these efforts, neurophysiologists were making progress on how the brain represents the sensory world by presenting simplified stimuli and labeling cells by preference for stimulus features.

Here we outline an approach in which living systems naturally explore a relatively unrestricted space of motor outputs or neural representations, and we search directly for simplification within the data. Although there is often suspicion of attempts to reduce the evident complexity of the brain, it is unlikely that understanding will be achieved without some sort of compression. Rather than restricting behavior (or our description of behavior) from the outset, we will let the system "tell us" whether our favorite simplifications are successful.
Furthermore, we start with high spatial and temporal resolution data because we do not know the simple representation ahead of time. This approach is made possible only by the combination of experimental methods that generate larger, higher-quality data sets with the application of mathematical ideas that have a chance of discovering unexpected simplicity in these complex systems. We present four very different examples in which finding such simplicity informs our understanding of biological function.

Dimensionality Reduction

In the human body there are ≈100 joint angles and substantially more muscles. Even if each muscle has just two states (rest or tension), the number of possible postures is enormous, $2^{N_{\rm muscles}} \sim 10^{30}$. If our bodies moved aimlessly among these states, characterizing our motor behavior would be hopeless—no experiment could sample even a tiny fraction of all of the possible trajectories. Moreover, wandering in a high dimensional space is unlikely to generate functional actions that make sense in a realistic context. Indeed, it is doubtful that a plausible neural system would independently control all of the muscles and joint angles without some coordinating patterns or "movement primitives" from which to build a repertoire of actions. There have been several motor systems in which just such a reduction in dimensionality has been found (1–5). Here we present two examples of behavioral dimensionality reduction that represent very different levels of system complexity: smooth pursuit eye movements in monkeys and the free wiggling of worm-like nematodes. These examples are especially compelling because so few dimensions are required for a complete description of natural behavior.

Smooth Pursuit Eye Movements. Movements are variable even if conditions are carefully repeated, but the origin of that variability is poorly understood. Variation might arise from noise in sensory processing to identify goals for movement, in planning or generating movement commands, or in the mechanical response of the muscles. The structure of behavioral variation can inform our understanding of the underlying system if we can connect the dimensions of variation to a particular stage of neural processing. Like other types of movement, eye movements are potentially high dimensional if eye position and velocity vary independently from moment to moment. However, an analysis of the natural variation in smooth pursuit eye movement behavior reveals a simple structure whose form suggests a neural origin for the noise that gives rise to behavioral variation. Pursuit is a tracking eye movement, triggered by image motion on the retina, which serves to stabilize a target's retinal image and thus to prevent motion blur (6). When a target begins to move relative to the eye, the pursuit system interprets the resulting image motion on the retina to estimate the target's trajectory and then to accelerate the eye to match the target's motion direction and speed. Although tracking on longer time scales is driven both by retinal inputs and by extraretinal feedback signals, the initial ≈125 ms of the movement is generated purely from sensory estimates of the target's motion,
using visual inputs present before the onset of the response. Focusing on just this initial portion of the pursuit movement, we can express the eye velocity in response to steps in target motion as a vector, $\mathbf{v}(t) = v_H(t)\,\hat{\imath} + v_V(t)\,\hat{\jmath}$, where $v_H(t)$ and $v_V(t)$ are the horizontal and vertical components of the velocity, respectively. In Fig. 1A we show a single trial velocity trajectory (horizontal and vertical components, dashed black and gray lines) and the trial-averaged velocity trajectory (solid black and gray lines). Because the initial 125 ms of eye movement is sampled every millisecond, and each sample has a horizontal and a vertical component, the pursuit trajectories have 250 dimensions.

We compute the covariance of fluctuations about the mean trajectory and display the results in Fig. 1B. Focusing on a 125-ms window at the start of the pursuit response (green box), we compute the eigenvalues of the covariance matrix and find that only the three largest are statistically different from zero according to the SD within datasets (7). This low dimensional structure is not a limitation of the motor system, because during fixation (yellow box) there are 80 significant eigenvalues. Indeed, the small amplitude, high dimensional variation visible during fixation seems to be an ever-present background noise that is swamped by the larger fluctuations in movement specific to pursuit. If the covariance of this background noise is subtracted from the covariance during pursuit, the 3D structure accounts for ∼94% of the variation in the pursuit trajectories (Fig. 1C).

How does low dimensionality in eye movement arise? The goal of the movement is to match the eye to the target's velocity, which is constant in these experiments. The brain must therefore interpret the activity of sensory neurons that represent its visual inputs, detecting that the target has begun to move (at time $t_0$) and estimating the direction θ and speed v of motion. At best, the brain estimates these quantities and transforms these estimates into some desired trajectory of eye movements, which we can write as $\mathbf{v}(t; \hat{t}_0, \hat{\theta}, \hat{v})$, where $\hat{\cdot}$ denotes an estimate of the quantity. However, estimates are never perfect, so we should imagine that $\hat{t}_0 = t_0 + \delta t_0$, and so on, where $\delta t_0$ is the small error in the sensory estimate of target motion onset on a single trial. If these errors are small, we can write

$$\mathbf{v}(t) = \mathbf{v}(t; t_0, v, \theta) + \delta t_0\,\frac{\partial \mathbf{v}(t; t_0, v, \theta)}{\partial t_0} + \delta\theta\,\frac{\partial \mathbf{v}(t; t_0, v, \theta)}{\partial \theta} + \delta v\,\frac{\partial \mathbf{v}(t; t_0, v, \theta)}{\partial v} + \delta\mathbf{v}_{\rm back}(t), \qquad [1]$$

where the first term is the average eye movement made in response to many repetitions of the target motion, the next three terms describe the effects of the sensory errors, and the final term is the background noise. Thus, if we can separate out the effects of the background noise, the fluctuations in $\mathbf{v}(t)$ from trial to trial should be described by just three random numbers, $\delta t_0$, $\delta\theta$, and $\delta v$: the variations should be 3D, as observed.
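To make the covariance analysis concrete, here is a minimal sketch of the eigenvalue computation in Python with NumPy, run on synthetic trajectories rather than the recorded eye movements; the three planted modes and all dimensions and amplitudes below are illustrative assumptions, not values from ref. 7.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real data: each "trial" is a 250-dimensional trajectory
# (125 ms of horizontal + vertical eye velocity sampled at 1 kHz),
# built here from three planted modes plus weak background noise.
n_trials, dim = 400, 250
modes, _ = np.linalg.qr(rng.standard_normal((dim, 3)))       # orthonormal modes
amps = rng.standard_normal((n_trials, 3)) * np.array([3.0, 2.0, 1.0])
trials = amps @ modes.T + 0.05 * rng.standard_normal((n_trials, dim))

# Covariance of fluctuations about the mean trajectory, and its spectrum.
fluct = trials - trials.mean(axis=0)
C = np.cov(fluct, rowvar=False)
eigvals = np.linalg.eigvalsh(C)[::-1]                        # largest first

frac = eigvals / eigvals.sum()
print("variance captured by top 3 dimensions: %.1f%%" % (100 * frac[:3].sum()))
# In the real analysis, significance of each eigenvalue is judged against
# its SD across datasets (7), and the background covariance is subtracted.
```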
The partial derivatives in Eq. 1 can be measured as the difference between the trial-averaged pursuit trajectories in response to slightly different target motions. In fact the average trajectories vary in a simple way, shifting along the t axis as we change $t_0$, rotating in space as we change θ, and scaling uniformly faster or slower as we change v (7), so that the relevant derivatives can be estimated from just one average trajectory. We identify these derivatives as sensory error modes and show the results in Fig. 1D, where we have abbreviated the partial derivative expressions for the modes of variation as $\mathbf{v}_{\rm dir} \equiv \partial\mathbf{v}(t; t_0, v, \theta)/\partial\theta$, $\mathbf{v}_{\rm speed} \equiv \partial\mathbf{v}(t; t_0, v, \theta)/\partial v$, and $\mathbf{v}_{\rm time} \equiv \partial\mathbf{v}(t; t_0, v, \theta)/\partial t_0$. We note that each sensory error mode has a vertical and horizontal component, although some make little contribution. We recover the sensory errors ($\delta\theta$, $\delta v$, $\delta t_0$) by projecting the pursuit trajectory on each trial onto the corresponding sensory error mode. We can write the covariance of fluctuations around the mean pursuit trajectory in terms of these error modes as

$$C_{ij}(t, t') = \begin{bmatrix} v^{(i)}_{\rm dir}(t) \\ v^{(i)}_{\rm speed}(t) \\ v^{(i)}_{\rm time}(t) \end{bmatrix}^{T} \begin{bmatrix} \langle\delta\theta\,\delta\theta\rangle & \langle\delta\theta\,\delta v\rangle & \langle\delta\theta\,\delta t_0\rangle \\ \langle\delta v\,\delta\theta\rangle & \langle\delta v\,\delta v\rangle & \langle\delta v\,\delta t_0\rangle \\ \langle\delta t_0\,\delta\theta\rangle & \langle\delta t_0\,\delta v\rangle & \langle\delta t_0\,\delta t_0\rangle \end{bmatrix} \begin{bmatrix} v^{(j)}_{\rm dir}(t') \\ v^{(j)}_{\rm speed}(t') \\ v^{(j)}_{\rm time}(t') \end{bmatrix} + C^{({\rm back})}_{ij}(t, t'), \qquad [2]$$

where the terms {〈δθδθ〉, 〈δθδv〉, . . .} are the covariances of the sensory errors. The fact that C can be written in this form implies not only that the variations in pursuit will be 3D but that we can predict in advance what these dimensions should be. Indeed, we find experimentally that the three significant dimensions of C have 96% overlap with the axes corresponding to $\mathbf{v}_{\rm dir}$, $\mathbf{v}_{\rm speed}$, and $\mathbf{v}_{\rm time}$. These results strongly support the hypothesis that the observable variations in motor output are dominated by the errors that the brain makes in estimating the parameters of its sensory inputs, as if the rest of the processing and motor control circuitry were effectively noiseless, or more precisely that they contribute only at the level of background variation in the movement.

Fig. 1. Low-dimensional dynamics of pursuit eye velocity trajectories (7). (A) Eye movements were recorded from male rhesus monkeys (Macaca mulatta) that had been trained to fixate and track visual targets. Thin black and gray lines represent horizontal (H) and vertical (V) eye velocity in response to a step in target motion on a single trial; dashed lines represent the corresponding trial-averaged means. Red and blue lines represent the model prediction. (B) Covariance matrix of the horizontal eye velocity trajectories. The yellow square marks 125 ms during the fixation period before target motion onset, the green square the first 125 ms of pursuit. The color scale is in (deg/s)². (C) Eigenvalue spectrum of the difference matrix ΔC(t, t′) = Cpursuit(t, t′) (green square) − Cbackground(t, t′) (yellow square). (D) Time courses of the sensory error modes (vdir, vspeed, vtime). The sensory error modes are calculated from derivatives of the mean trajectory, as in Eq. 1, and linear combinations of these modes can be used to reconstruct trajectories on single trials as shown in A. These modes have 96% overlap with the significant dimensions that emerge from the covariance analysis in B and C and thus provide a nearly complete description of the behavioral variation. Black and gray curves correspond to H and V components.
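The projection that recovers the single-trial errors can be written compactly. The following sketch is a hypothetical helper, assuming the trajectories and modes are already stored as NumPy arrays; it returns the per-trial errors and the 3 × 3 covariance that enters Eq. 2.

```python
import numpy as np

def sensory_errors(trials, v_dir, v_speed, v_time):
    """trials: (n_trials, dim) single-trial trajectories;
    v_dir, v_speed, v_time: (dim,) sensory error modes from Eq. 1."""
    M = np.stack([v_dir, v_speed, v_time], axis=1)        # (dim, 3) mode matrix
    fluct = trials - trials.mean(axis=0)                  # fluctuations about mean
    # Least-squares projection; reduces to simple dot products if the
    # modes happen to be orthogonal.
    coeffs, *_ = np.linalg.lstsq(M, fluct.T, rcond=None)  # (3, n_trials)
    errors = coeffs.T                                     # columns: d_theta, d_v, d_t0
    return errors, np.cov(errors, rowvar=False)           # 3x3 covariance of Eq. 2
```

In practice the modes themselves come from finite differences of trial-averaged trajectories measured for slightly different target directions, speeds, and onset times (7).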
Further, the magnitude and time course of noise in sensory estimation are comparable to the noise sources that limit perceptual discrimination (7, 8). This unexpected result challenges our intuition that noise in the execution of movement creates behavioral variation, and it forces us to consider that errors in sensory estimation may set the limit to behavioral precision. Our findings are consistent with the idea that the brain can minimize the impact of noise in motor execution in a task-specific manner (9, 10), although they suggest a unique origin for that noise in the sensory system. The precision of smooth pursuit fits well with the broader view that the nervous system can approach optimal performance at critical tasks (11–14).

How the Worm Wiggles. The free motion of the nematode Caenorhabditis elegans on a flat agar plate provides an ideal opportunity to quantify the (reasonably) natural behavior of an entire organism (15). Under such conditions, changes in the worm's sinuous body shape support a variety of motor behaviors, including forward and backward crawling and large body bends known as Ω-turns (16). Tracking microscopy provides high spatial and temporal resolution images of the worm over long periods of time, and from these images we can see that fluctuations in the thickness of the worm are small, so most variations in the shape are captured by the curve that passes through the center of the body. We measure position along this curve (arc length) by the variable s, normalized so that s = 0 is the head and s = 1 is the tail. The position of the body element at s is denoted by x(s), but it is more natural to give an "intrinsic" description of this curve in terms of the tangent angle θ(s), removing our choice of coordinates by rotating each image so that the mean value of θ along the body always is zero. Sampling at n = 100 equally spaced points along the body, each shape is described completely by a 100-dimensional vector (Fig. 2 A and B).

As we did with smooth pursuit eye movements, we seek a low dimensional space that underlies the shapes we observe. In the simplest case, this space is a Euclidean projection of the original high dimensional space, so that the covariance matrix of angles, C(s, s′) = 〈(θ(s) − 〈θ〉)(θ(s′) − 〈θ〉)〉, will have only a small number of significant eigenvalues. For C. elegans this is exactly what we find, as shown in Fig. 2 C and D: more than 95% of the variance in body shape is accounted for by projections along just four dimensions ("eigenworms," red curves in Fig. 2C). Further, the trajectory in this low dimensional space of shapes predicts the motion of the worm over the agar surface (17). Importantly, the simplicity that we find depends on our choice of initial representation. For example, if we take raw images of the worm's body, cropped to a minimum size (300 × 160 pixels) and aligned to remove rigid translations and rotations, the variance across images is spread over hundreds of dimensions.

The tangent angle representation and projections along the eigenworms provide a compact yet substantially complete description of worm behavior. In distinction to previous work (see, e.g., refs. 16, 18, and 19), this description is naturally aligned to the organism, fully computable from the video images with no human intervention, and also simple. In the next section we show how these coordinates can also be used to explore dynamical questions posed by the behavior of C. elegans.
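A minimal sketch of the eigenworm construction, assuming the tangent angles have already been extracted from video as an array with one row per frame:

```python
import numpy as np

def eigenworms(theta, n_modes=4):
    """theta: (n_frames, 100) tangent angles along the body, one row per frame."""
    theta = theta - theta.mean(axis=1, keepdims=True)   # rotation convention: mean angle = 0
    C = np.cov(theta, rowvar=False)                     # shape covariance C(s, s')
    w, v = np.linalg.eigh(C)
    w, v = w[::-1], v[:, ::-1]                          # sort modes, largest eigenvalue first
    a = theta @ v[:, :n_modes]                          # mode amplitudes a_1..a_4 per frame
    explained = w[:n_modes].sum() / w.sum()             # ≈0.95 is reported in ref. 15
    return v[:, :n_modes], a, explained
```

The phase coordinate used in the next section then follows directly from the first two amplitudes, phi = np.arctan2(a[:, 1], a[:, 0]).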
Fig. 2. Low-dimensional space of worm postures (15). (A) We use tracking video microscopy to record images of the worm's body at high spatiotemporal resolution as it crawls along a flat agar surface. Dotted lines trace the worm's centroid trajectory, and the body outline and centerline skeleton are extracted from the microscope image on a single frame. (B) We characterize worm shape by the tangent angle θ vs. arc length s of the centerline skeleton. (C) We decompose each shape into four dominant modes by projecting θ(s) along the eigenvectors of the shape covariance matrix (eigenworms). (D, black circles) Fraction of total variance captured by each projection. The four eigenworms account for ≈95% of the variance within the space of shapes. (D, red diamonds) Fraction of total variance captured when worm shapes are represented by images of the worm's body; the low dimensionality is hidden in this pixel representation.

Fig. 3. Worm behavior in the eigenworm coordinates. (A) Amplitudes along the first two eigenworms oscillate, with nearly constant amplitude but time-varying phase ϕ = tan−1(a2/a1). The shape coordinate ϕ(t) captures the phase of the locomotory wave moving along the worm's body. (B) Phase dynamics from Eq. 3 reveal attracting trajectories in worm motion: forward and backward limit cycles (white lines) and two instantaneous pause states (white circles). Colors denote the basins of attraction for each attracting trajectory. (C) In an experiment in which the worm receives a weak thermal impulse at time t = 0, we use the basins of attraction of B to label the instantaneous state of the worm's behavior and compute the time-dependent probability that a worm is in either of the two pause states. The pause states uncover an early-time stereotyped response to the thermal impulse. (D) Probability density of the phase [plotted as log P(ϕ|t)], illustrating stereotyped reversal trajectories consistent with a noise-induced transition from the forward state. Trajectories were generated using Eq. 3 and aligned to the moment of a spontaneous reversal at t = 0.

Dynamics of Worm Behavior. We have found low dimensional structure in the smooth pursuit eye movements of monkeys and in the free wiggling of nematodes. Can this simplification inform our understanding of behavioral dynamics—the emergence of discrete behavioral states, and the transitions between them? Here we use the trajectories of C. elegans in the low dimensional space to construct an explicit stochastic model of crawling behavior and then show how long-lived states and transitions between them emerge naturally from this model.

Of the four dimensions in shape space that characterize the crawling of C. elegans, motions along the first two combine to form an oscillation, corresponding to the wave that passes along the worm's body and drives it forward or backward. Here we focus on the phase of this oscillation, ϕ = tan−1(a2/a1) (Fig. 3A), and construct, from the observed trajectories, a stochastic dynamical system, analogous to the Langevin equation for a Brownian particle.
Because the worm can crawl both forward and backward, the phase dynamics is minimally a second-order system,

$$\frac{d\phi}{dt} = \omega, \qquad \frac{d\omega}{dt} = F(\omega, \phi) + \sigma(\omega, \phi)\,\eta(t), \qquad [3]$$

where ω is the phase velocity and η(t) is the noise—a random component of the phase acceleration not related to the current state of the worm—normalized so that 〈η(t)η(t′)〉 = δ(t − t′). As explained in ref. 15, we can recover the "force" F(ω, ϕ) and the local noise strength σ(ω, ϕ) from the raw data, so no further "modeling" is required.

Leaving aside the noise, Eq. 3 describes a dynamical system in which there are multiple attracting trajectories (Fig. 3B): two limit cycle attractors corresponding to forward and backward crawling (white lines) and two pause states (white circles) corresponding to an instantaneous freeze in the posture of the worm. Thus, underneath the continuous, stochastic dynamics we find four discrete states that correspond to well-defined classes of behavior. We emphasize that these behavioral classes are emergent—there is nothing discrete about the phase time series ϕ(t), nor have we labeled the worm's motion by subjective criteria. Whereas forward and backward crawling are obvious behavioral states, the pauses are more subtle. Exploring the worm's response to gentle thermal stimuli, one can see that there is a relatively high probability of a brief sojourn in one of the pause states (Fig. 3C). Thus, by identifying the attractors—and the natural time scales of transitions between them—we uncover a more reliable component of the worm's response to sensory stimuli (15).

The noise term generates small fluctuations around the attracting trajectories but, more dramatically, drives transitions among the attractors, and these transitions are predicted to occur along stereotyped trajectories (20). In particular, the Langevin dynamics in Eq. 3 predict spontaneous transitions between the attractors that correspond to forward and backward motion. To quantify this prediction, we run long simulations of the dynamics, choose moments in time when the system is near the forward attractor (0.1 < dϕ/dt < 0.6 cycles/s), and then compute the probability that the trajectory has not reversed (dϕ/dt < 0) after a time τ following this moment. If reversals are rare, this survival probability should decay exponentially, P(τ) = exp(−τ/〈τ〉), and this is what we see, with the predicted mean time to reverse 〈τ〉 = 15.7 ± 2.1 s, where the error reflects variations across an ensemble of worms. We next examine the real trajectories of the worms, performing the same analysis of reversals by measuring the survival probability in the forward crawling state. We find that the data obey an exponential distribution, as predicted by the model, and the experimental mean time to reversal is 〈τ_data〉 = 16.3 ± 0.3 s. The observed reversal rate thus agrees with the model prediction to within error bars, a precision of ∼4%, which is quite surprising. It should be remembered that we make our model of the dynamics by analyzing how the phase and phase velocity at time t evolve into the phase and phase velocity at time t + dt, where the data determine dt = 1/32 s. Once we have the stochastic dynamics, we can use them to predict behavior on much longer time scales.
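The logic of this test is easy to reproduce in simulation. The sketch below integrates a one-dimensional caricature of Eq. 3 by the Euler–Maruyama method and collects first-passage (reversal) times; the double-well drift and noise level are toy assumptions standing in for the F(ω, ϕ) and σ(ω, ϕ) estimated from data in ref. 15, so the numbers it prints are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1.0 / 32.0                              # one video frame, as in the text

# Toy drift with attractors at +/-0.35 cycles/s (forward/backward crawling);
# the phase dependence of the real F(omega, phi) is dropped for simplicity.
F = lambda w: -4.0 * w * (w**2 - 0.35**2)
sigma = 0.25                                 # constant noise strength (toy value)

w, t, t_enter = 0.35, 0.0, 0.0               # start on the forward attractor
dwell = []
for _ in range(2_000_000):                   # Euler-Maruyama integration
    w += F(w) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    t += dt
    if w < 0.0:                              # reversal: crossed into backward state
        dwell.append(t - t_enter)
        w, t_enter = 0.35, t                 # restart in the forward state
print("mean time to reverse: %.1f s over %d reversals"
      % (np.mean(dwell), len(dwell)))
# A histogram of the dwell times is close to exponential, as in the text.
```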
Although we define our model on the time scale of a single video frame (dt), behavioral dynamics emerge that are nearly three orders of magnitude longer (〈τ〉/dt ≈ 500), with no adjustable parameters (20). In this model, reversals are noise-driven transitions between attractors, in much the same way that chemical reactions are thermally driven transitions between attractors in the space of molecular structures (21). In the low noise limit, the trajectories that carry the system from one attractor to another become stereotyped (22). Thus, the trajectories that allow the worm to escape from the forward crawling attractor are clustered around prototypical trajectories, and this is seen both in the simulations (Fig. 3D) and in the data (20).

In fact, many organisms, from bacteria to humans, exhibit discrete, stereotyped motor behaviors. A common view is that these behaviors are stereotyped because they are triggered by specific commands, and in some cases we can even identify "command neurons" whose activity provides the trigger (23). In the extreme, discreteness and stereotypy of the behavior reduces to the discreteness and stereotypy of the action potentials generated by the command neurons, as with the escape behaviors in fish triggered by spiking of the Mauthner cell (24). However, the stereotypy of spikes itself emerges from the continuous dynamics of currents, voltages, and ion channel populations (25, 26). The success here of the stochastic phase model in predicting the observed reversal characteristics of C. elegans demonstrates that stereotypy can also emerge directly from the dynamics of the behavior itself.

Maximum Entropy Models of Natural Networks

Much of what happens in living systems is the result of interactions among large networks of elements—many amino acids interact to determine the structure and function of proteins, many genes interact to define the fates and states of cells, many neurons interact to represent our perceptions and memories, and so on. Even if each element in a network achieves only two values, the number of possible states in a network of N elements is 2^N, which easily becomes larger than anything a realistic experiment (or lifetime!) can sample, the same dimensionality problem that we encountered in movement behavior. Indeed, a lookup table for the probability of finding a network in any one state has ≈2^N parameters, and this is a disaster. To make progress we search for a simpler class of models with many fewer parameters.

We seek an analysis of living networks that leverages increasingly high-throughput experimental methods, such as recording from large numbers of neurons simultaneously. These experiments provide, for example, reliable information about the correlations between the action potentials generated by pairs of neurons. In a similar spirit, we can measure the correlations between amino acid substitutions at different sites across large families of proteins. Can we use these pairwise correlations to say anything about the network as a whole? Although there are an infinite number of models that can generate a given pattern of pairwise correlations, there is a unique model that reproduces the measured correlations and adds no additional structure. This minimally structured model is the one that maximizes the entropy of the system (27), in the same way that the thermal equilibrium (Boltzmann) distribution maximizes the entropy of a physical system given that we know its average energy.
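The form that such a minimally structured model must take follows from a short variational calculation. As a sketch (in our notation, anticipating Eq. 5 below): maximizing the entropy subject to fixed pairwise marginals P_ij(x_i, x_j) and normalization introduces Lagrange multipliers λ_ij(x_i, x_j) and μ,

```latex
\mathcal{L} = -\sum_{\mathbf{x}} P(\mathbf{x}) \ln P(\mathbf{x})
  + \sum_{i>j} \sum_{y,\,y'} \lambda_{ij}(y, y')
    \Big[ \sum_{\mathbf{x}} P(\mathbf{x})\, \delta_{x_i, y}\, \delta_{x_j, y'}
          - P_{ij}(y, y') \Big]
  + \mu \Big[ \sum_{\mathbf{x}} P(\mathbf{x}) - 1 \Big].
```

Setting ∂L/∂P(x) = 0 for every state x gives ln P(x) = −1 + μ + Σ_{i>j} λ_ij(x_i, x_j), so that P(x) ∝ exp[Σ_{i>j} λ_ij(x_i, x_j)]; writing V_ij = −λ_ij and absorbing constants into the partition function Z yields exactly the pairwise form that appears as Eq. 5 below.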
Letters in Words. To see how the maximum entropy idea works, we examine an example in which we have some intuition for the states of the network. Consider the spelling of four-letter English words (28), whereby at positions i = 1, 2, 3, 4 in the word we can choose a variable $x_i$ from 26 possible values. A word is then represented by the combination x ≡ {x1, x2, x3, x4}, and we can sample the distribution of words, P(x), by looking through a large corpus of writings, for example the collected novels of Jane Austen [the Austen word corpus was created via Project Gutenberg (www.gutenberg.org), combining text from Emma, Lady Susan, Love and Friendship, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice, and Sense and Sensibility]. If we do not know anything about the distribution of states in this network, we can maximize the entropy of the distribution P(x) by having all possible combinations of letters be equally likely, and then the entropy is

$$S_0 = -\sum_{\mathbf{x}} P_0 \log_2 P_0 = 4 \times \log_2(26) = 18.8 \text{ bits}.$$

However, in actual English words, not all letters occur equally often, and this bias in the use of letters is different at different positions in the word. If we take these "one letter" statistics into account, the maximum entropy distribution is the independent model,

$$P^{(1)}(\mathbf{x}) = P_1(x_1)\, P_2(x_2)\, P_3(x_3)\, P_4(x_4), \qquad [4]$$

where $P_i(x)$ is the easily measured probability of finding letter x in position i. Taking account of actual letter frequencies lowers the entropy to S1 = 14.083 ± 0.001 bits, where the small error bar is derived from sampling across the ∼10^6 word corpus.

The independent letter model defined by P(1) is clearly wrong: its most likely words are "thae," "thee," and "teae." Can we build a better approximation to the distribution of words by including correlations between pairs of letters? The difficulty is that now there is no simple formula like Eq. 4 that connects the maximum entropy distribution for x to the measured distributions of letter pairs (xi, xj). Instead, we know analytically only the form of the distribution,

$$P^{(2)}(\mathbf{x}) = \frac{1}{Z} \exp\Big[-\sum_{i>j} V_{ij}(x_i, x_j)\Big], \qquad [5]$$

where all of the coefficients $V_{ij}(x, x')$ have to be chosen to reproduce the observed correlations between pairs of letters. This is complicated, but much less complicated than it could be—by matching all of the pairwise correlations we are fixing ∼6 × (26)^2 parameters, which is vastly smaller than the (26)^4 possible combinations of letters.

The model in Eq. 5 has exactly the form of the Boltzmann distribution for a physical system in thermal equilibrium, whereby the letters "interact" through a potential energy $V_{ij}(x, x')$. The essential simplification is that there are no explicit interactions among triplets or quadruplets—all of the higher-order correlations must be consequences of the pairwise interactions. We know that in many physical systems this is a good approximation, that is, P ≈ P(2). However, the rules of spelling (e.g., i before e except after c) seem to be in explicit conflict with such a simplification. Nonetheless, when we apply the model in Eq. 5 to English words, we find reasonable phonetic constructions. Here we leave aside the problem of how one finds the potentials $V_{ij}$ from the measured correlations among pairs of letters (see refs. 29–35) and discuss the results.
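Before turning to those results, note that the one-letter statistics and the independent model of Eq. 4 take only a few lines to construct. A minimal sketch follows; the tiny word list is a placeholder for the ∼10^6-word Austen corpus, and the counts must run over word occurrences, not distinct words, so that frequent words weigh more.

```python
from collections import Counter
import numpy as np

def independent_model(words):
    """words: list of lowercase four-letter tokens, with repeats (usage counts)."""
    n = len(words)
    P = [{c: k / n for c, k in Counter(w[i] for w in words).items()}
         for i in range(4)]                                # the factors P_i(x) of Eq. 4
    S1 = sum(-sum(p * np.log2(p) for p in Pi.values()) for Pi in P)
    return P, S1

# Placeholder corpus; on the real corpus S1 = 14.083 +/- 0.001 bits.
toy = ["that", "with", "have", "that", "this", "were", "that", "said"]
P, S1 = independent_model(toy)
print("independent-letter entropy: %.2f bits (toy sample)" % S1)
```

Fitting the pairwise model of Eq. 5 is harder because Z couples all of the V_ij; standard routes include iterative scaling (29) and gradient ascent on the likelihood, as sketched for neurons below.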
Once we construct a maximum entropy model of words using Eq. 5, we find that the entropy of the pairwise model is S2 = 7.471 ± 0.006 bits, approximately half the entropy of independent letters, S1. A rough way to think about this result is that if letters were chosen independently, there would be 2^{S1} ∼ 17,350 possible four-letter words. Taking account of the pairwise correlations reduces this vocabulary by a factor of 2^{S1 − S2} ∼ 100, down to effectively ≈178 words. In fact, the Jane Austen corpus is large enough that we can estimate the true entropy of the distribution of four-letter words, and this is S_full = 6.92 ± 0.003 bits. Thus, the pairwise model captures ∼92% of the entropy reduction relative to choosing letters independently and hence accounts for almost all of the restriction in vocabulary provided by the spelling rules and the varying frequencies of word use. The same result is obtained with other corpora, so this is not a peculiarity of an author's style.

We can look more closely at the predictions of the maximum entropy model in a "Zipf plot," ranking the words by their probability of occurrence and plotting probability vs. rank, as in Fig. 4. The predicted Zipf plot almost perfectly overlays what we obtain by sampling the corpus, although some weight is predicted to occur in words that do not appear in Austen's writing. Many of these are real words that she happened not to use, and others are perfectly pronounceable English even if they are not actually words. Thus, despite the complexity of spelling rules, the pairwise model captures a very large fraction of the structure in the network of letters.

Spiking and Silence in Neural Networks. Maximum entropy models also provide a good approximation to the patterns of spiking in the neural network of the retina. In a network of neurons where the variable $x_i$ marks the presence ($x_i$ = +1) or absence ($x_i$ = −1) of an action potential from neuron i in a small window of time, the state of the whole network is given by the pattern of spiking and silence across the entire population of neurons, x ≡ {x1, x2, . . ., xN}. In the original example of these ideas, Schneidman et al. (36) looked at groups of n = 10 nearby neurons in the vertebrate retina as it responded to naturalistic stimuli, with the results shown in Fig. 4.

Fig. 4. For networks of neurons and letters, the pairwise maximum entropy model provides an excellent approximation to the probability of network states. In each case, we show the Zipf plot for real data (blue) compared with the pairwise maximum entropy approximation (red). Scale bars to the right of each plot indicate the entropy captured by the pairwise model. (A) Letters within four-letter English words (28). The maximum entropy model also produces "nonwords" (Inset, green circles) that never appeared in the full corpus but nonetheless contain realistic phonetic structure. (B) Ten neuron patterns of spiking and silence in the vertebrate retina (36).
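For a handful of neurons the pairwise model can be fit by brute force, enumerating all 2^n spike words exactly. The sketch below does this by gradient ascent on the log-likelihood for hypothetical random data; the data, learning rate, and population size n are placeholder assumptions, and real analyses at n = 40 require Monte Carlo sampling rather than enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 5                                                  # toy population size
data = np.where(rng.random((1000, n)) < 0.2, 1, -1)    # stand-in spike words

states = np.array(list(itertools.product([-1, 1], repeat=n)))  # all 2^n states
m_data = data.mean(axis=0)                                     # measured <x_i>
C_data = (data[:, :, None] * data[:, None, :]).mean(axis=0)    # measured <x_i x_j>

h, J = np.zeros(n), np.zeros((n, n))                   # fields and couplings
for _ in range(3000):
    logp = states @ h + 0.5 * np.einsum("si,ij,sj->s", states, J, states)
    p = np.exp(logp - logp.max())
    p /= p.sum()                                       # model distribution P(x)
    m = p @ states                                     # model <x_i>
    C = np.einsum("s,si,sj->ij", p, states, states)    # model <x_i x_j>
    h += 0.1 * (m_data - m)                            # match the means
    dJ = 0.1 * (C_data - C)
    np.fill_diagonal(dJ, 0.0)
    J += dJ                                            # match the pair correlations
print("max moment mismatch:",
      max(np.abs(m_data - m).max(), np.abs(C_data - C).max()))
```

In the energy-landscape language of the next section, this model has E(x) = −Σ h_i x_i − ½ Σ J_ij x_i x_j, with P(x) ∝ e^{−E(x)}.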
Fig. 5. Metastable states in the energy landscape of networks of neurons and letters. (A) Probability that the 40-neuron system is found within the basin of attraction of each nontrivial locally stable state Gα as a function of time during the 145 repetitions of the stimulus movie. Inset: state of the entire network at the moment it enters the basin of G5, on 60 successive trials. (B) Energy landscape (ε = −ln P) in the maximum entropy model of letters in words. We order the basins in the landscape by decreasing probability of their ground states and show the low energy excitations in each basin; the 10 basins shown have ground states "with," "that," "from," "this," "they," "have," "were," "said," "when," and "been."

Again we see that the pairwise model does an excellent job, capturing ≈90% or more of the reduction in entropy, reproducing the Zipf plot, and even predicting the wildly varying probabilities of the particular patterns of spiking and silence (see Fig. 2a in ref. 36).

The maximum entropy models discussed here are important because they often capture a large fraction of the interactions present in natural networks while simultaneously avoiding a combinatorial explosion in the number of parameters. This is true even in cases in which interactions are strong enough that independent (i.e., zero neuron–neuron correlation) models fail dramatically. Such an approach has also recently been used to show how network functions such as stimulus decorrelation and error correction reflect a tradeoff between efficient consumption of finite neural bandwidth and the use of redundancy to mitigate noise (37).

As we look at larger networks, we can no longer compute the full distribution, and thus we cannot directly compare the full entropy with its pairwise approximation. We can, however, check many other predictions, and the maximum entropy model works well, at least to n = 40 (30, 38). Related ideas have also been applied to a variety of neural networks with similar findings (39–42) (however, see ref. 43 for differences), which suggests that the networks in the retina are typical of a larger class of natural ensembles.

Metastable States. As we have emphasized in discussing Eq. 5, maximum entropy models are exactly equivalent to Boltzmann distributions and thus define an effective "energy" for each possible configuration of the network. States of high probability correspond to low energy, and we can think of an "energy landscape" over the space of possible states, in the spirit of the Hopfield model for neural networks (44). Once we construct this landscape, it is clear that some states are special because they sit at the bottom of a valley—at local minima of the energy. For networks of neurons, these special states are such that flipping any single bit in the pattern of spiking and silence across the population generates a state with lower probability. For words, a local minimum of the energy means that changing any one letter produces a word of lower probability.
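Finding these special states is a simple exercise once the fields and couplings (or the V_ij for letters) are in hand. A sketch, reusing the toy conventions above and exhaustively testing every single-element flip, which is feasible only for small n:

```python
import itertools
import numpy as np

def energy(x, h, J):
    return float(-(x @ h) - 0.5 * (x @ J @ x))         # E(x), so P(x) ~ exp(-E)

def local_minima(h, J):
    n = len(h)
    minima = []
    for bits in itertools.product([-1, 1], repeat=n):  # enumerate all 2^n states
        x = np.array(bits)
        E0 = energy(x, h, J)
        flips = (energy(np.where(np.arange(n) == k, -x, x), h, J)
                 for k in range(n))                    # flip each element in turn
        if all(Ek > E0 for Ek in flips):               # every flip raises the energy
            minima.append((E0, x))
    return sorted(minima, key=lambda m: m[0])          # most probable basins first
```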
The picture of an energy landscape on the states of a network may seem abstract, but the local minima can (sometimes surprisingly) have functional meaning, as shown in Fig. 5. In the case of the retina, a maximum entropy model was constructed to describe the states of spiking and silence in a population of n = 40 neurons as they respond to naturalistic inputs, and this model predicts the existence of several nontrivial local minima (30, 38). Importantly, this analysis does not make any reference to the visual stimulus. However, if we play the same stimulus movie many times, we see that the system returns to the same valleys or basins surrounding these special states, even though the precise pattern of spiking and silence is not reproduced from trial to trial (Fig. 5A). This suggests that the response of the population can be summarized by which valley the system is in, with the detailed spiking pattern being akin to variations in spelling.

To reinforce this analogy, we can look at the local minima of the energy landscape for four-letter words. In the maximum entropy model for letters, we find 136 local minima, of which the 10 most likely are shown in Fig. 5B. More than two-thirds of the entropy in the full distribution of words is contained in the distribution over these valleys, and in most of these valleys there is a large gap between the bottom of the basin (the most likely word) and the next most likely word. Thus, the entropy of the letter distribution is dominated by states that are not connected to each other by single letter substitutions, perhaps reflecting a pressure within language to communicate without confusion.

Discussion

Understanding a complex system necessarily involves some sort of simplification. We have emphasized that, with the right data, there are mathematical methods that allow a system to "tell us" what sort of simplification is likely to be useful.

Dimensionality reduction is perhaps the most obvious method of simplification—a direct reduction in the number of variables that we need to describe the system. The examples of C. elegans crawling and smooth pursuit eye movements are compelling because the reduction is so complete, with just three or four coordinates capturing ≈95% of all of the variance in behavior. In each case, the low dimensionality of our description provides functional insight, whether into the origins of stereotypy or the possibility of optimal performance. The idea of dimensionality reduction in fact has a long history in neuroscience, because receptive fields and feature selectivity are naturally formalized by saying that neurons are sensitive to only a limited number of dimensions in stimulus space (45–48). More recently it has been emphasized that quantitative models of protein/DNA interactions are equivalent to the hypothesis that proteins are sensitive to only a limited number of dimensions in sequence space (49, 50).

The maximum entropy approach achieves a similar simplification for networks; it searches for simplification not in the number of variables but in the number of possible interactions among these variables. The example of letters in words shows how this simplification retains the power to describe seemingly combinatorial patterns. For both neurons and letters, the mapping of the maximum entropy model onto an energy landscape points to special states of the system that seem to have functional significance. There is an independent stream of work that emphasizes the sufficiency of pairwise correlations among amino acid substitutions in defining functional families of proteins (51–53), and this is equivalent to the maximum entropy approach (53); explicit construction of the maximum entropy models for antibody diversity again points to the functional importance of the metastable states (54).
Although we have phrased the ideas of this article essentially as methods of data analysis, the repeated success of mathematically equivalent models (dimensionality reduction in movement and maximum entropy in networks) encourages us to seek unifying theoretical principles that give rise to behavioral simplicity. Finding such a theory, however, will be possible only if we observe behavior in sufficiently unconstrained contexts, so that simplicity is something we discover rather than impose.

ACKNOWLEDGMENTS. We thank D. W. Pfaff and his colleagues for organizing the Sackler Colloquium and for providing us the opportunity to bring together several strands of thought; and our many collaborators who have worked with us on these ideas and made it so much fun: M. J. Berry II, C. G. Callan, B. Johnson-Kerner, S. G. Lisberger, T. Mora, S. E. Palmer, R. Ranganathan, W. S. Ryu, E. Schneidman, R. Segev, S. Still, G. Tkačik, and A. Walczak. This work was supported in part by grants from the National Science Foundation, the National Institutes of Health, and the Swartz Foundation.

1. Nelson WL (1983) Physical principles for economies of skilled movements. Biol Cybern 46:135–147.
2. d'Avella A, Bizzi E (1998) Low dimensionality of supraspinally induced force fields. Proc Natl Acad Sci USA 95:7711–7714.
3. Santello M, Flanders M, Soechting JF (1998) Postural hand synergies for tool use. J Neurosci 18:10105–10115.
4. Sanger TD (2000) Human arm movements described by a low-dimensional superposition of principal components. J Neurosci 20:1066–1072.
5. Ingram JN, Körding KP, Howard IS, Wolpert DM (2008) The statistics of natural hand movements. Exp Brain Res 188:223–236.
6. Rashbass C (1961) The relationship between saccadic and smooth tracking eye movements. J Physiol 159:326–338.
7. Osborne LC, Lisberger SG, Bialek W (2005) A sensory source for motor variation. Nature 437:412–416.
8. Osborne LC, Hohl SS, Bialek W, Lisberger SG (2007) Time course of precision in smooth-pursuit eye movements of monkeys. J Neurosci 27:2987–2998.
9. Harris CM, Wolpert DM (1998) Signal-dependent noise determines motor planning. Nature 394:780–784.
10. Todorov E, Jordan MI (2002) Optimal feedback control as a theory of motor coordination. Nat Neurosci 5:1226–1235.
11. Todorov E (2004) Optimality principles in sensorimotor control. Nat Neurosci 7:907–915.
12. Bialek W (2002) Physics of Biomolecules and Cells: Les Houches Session LXXV, eds Flyvbjerg H, Jülicher F, Ormos P, David F (EDP Sciences, Les Ulis and Springer-Verlag, Berlin), pp 485–577.
13. Bialek W (1987) Physical limits to sensation and perception. Annu Rev Biophys Biophys Chem 16:455–478.
14. Barlow HB (1981) The Ferrier Lecture, 1980. Critical limiting factors in the design of the eye and visual cortex. Proc R Soc Lond B Biol Sci 212:1–34.
15. Stephens GJ, Johnson-Kerner B, Bialek W, Ryu WS (2008) Dimensionality and dynamics in the behavior of C. elegans. PLoS Comput Biol 4:e1000028.
16. Croll N (1975) Components and patterns in the behavior of the nematode Caenorhabditis elegans. J Zool 176:159–176.
17. Stephens GJ, Bialek W, Ryu WS (2010) From modes to movement in the behavior of C. elegans. PLoS One 5:e13914.
18. Pierce-Shimomura JT, Morse TM, Lockery SR (1999) The fundamental role of pirouettes in Caenorhabditis elegans chemotaxis. J Neurosci 19:9557–9569.
19. Gray JM, Hill JJ, Bargmann CI (2005) A circuit for navigation in Caenorhabditis elegans. Proc Natl Acad Sci USA 102:3184–3191.
20. Stephens GJ, Ryu WS, Bialek W (2009) The emergence of stereotyped behaviors in C. elegans. arXiv:0912.5232 [q-bio].
21. Hänggi P, Talkner P, Borkovec M (1990) Reaction-rate theory: Fifty years after Kramers. Rev Mod Phys 62:251–341.
22. Dykman MI, Mori E, Ross J, Hunt P (1994) Large fluctuations and optimal paths in chemical kinetics. J Chem Phys 100:5735–5750.
23. Bullock TH, Orkand R, Grinnell A (1977) Introduction to Nervous Systems (WH Freeman, San Francisco).
24. Korn H, Faber DS (2005) The Mauthner cell half a century later: A neurobiological model for decision-making? Neuron 47:13–28.
25. Hodgkin AL, Huxley AF (1952) A quantitative description of membrane current and its application to conduction and excitation in nerve. J Physiol 117:500–544.
26. Dayan P, Abbott LF (2001) Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems (MIT Press, Cambridge, MA).
27. Jaynes ET (1957) Information theory and statistical mechanics. Phys Rev 106:620–630.
28. Stephens GJ, Bialek W (2010) Statistical mechanics of letters in words. Phys Rev E 81:066119.
29. Darroch JN, Ratcliff D (1972) Generalized iterative scaling for log-linear models. Ann Math Stat 43:1470–1480.
30. Tkačik G, Schneidman E, Berry MJ II, Bialek W (2006) Ising models for networks of real neurons. arXiv:q-bio/0611072.
31. Broderick T, Dudik M, Tkačik G, Schapire RE, Bialek W (2007) Faster solutions of the inverse pairwise Ising problem. arXiv:0712.2437 [q-bio].
32. Ganmor E, Segev R, Schneidman E (2009) How fast can we learn maximum entropy populations? J Phys Conf Ser 197:012020.
33. Sessak V, Monasson R (2009) Small-correlation expansions for the inverse Ising problem. J Phys A Math Theor 42:055001.
34. Mézard M, Mora T (2009) Constraint satisfaction problems and neural networks: A statistical physics perspective. J Physiol Paris 103:107–113.
35. Roudi Y, Aurell E, Hertz J (2009) Statistical physics of pairwise probability models. Front Comput Neurosci 3:22.
36. Schneidman E, Berry MJ II, Segev R, Bialek W (2006) Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440:1007–1012.
37. Tkačik G, Prentice JS, Balasubramanian V, Schneidman E (2010) Optimal population coding by noisy spiking neurons. Proc Natl Acad Sci USA 107:14419–14424.
38. Tkačik G, Schneidman E, Berry MJ II, Bialek W (2009) Spin glass models for networks of real neurons. arXiv:0912.5409 [q-bio].
39. Shlens J, et al. (2006) The structure of multi-neuron firing patterns in primate retina. J Neurosci 26:8254–8266.
40. Tang A, et al. (2008) A maximum entropy model applied to spatial and temporal correlations from cortical networks in vitro. J Neurosci 28:505–518.
41. Yu S, Huang D, Singer W, Nikolić D (2008) A small world of neuronal synchrony. Cereb Cortex 18:2891–2901.
42. Shlens J, et al. (2009) The structure of large-scale synchronized firing in primate retina. J Neurosci 29:5022–5031.
43. Ohiorhenuan IE, et al. (2010) Sparse coding and high-order correlations in fine-scale cortical networks. Nature 466:617–621.
44. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci USA 79:2554–2558.
45. Rieke F, Warland D, de Ruyter van Steveninck R, Bialek W (1997) Spikes: Exploring the Neural Code (MIT Press, Cambridge, MA).
46. Bialek W, de Ruyter van Steveninck R (2005) Features and dimensions: Motion estimation in fly vision. arXiv:q-bio/0505003.
47. Sharpee T, Rust NC, Bialek W (2004) Analyzing neural responses to natural signals: Maximally informative dimensions. Neural Comput 16:223–250.
48. Schwartz O, Pillow JW, Rust NC, Simoncelli EP (2006) Spike-triggered neural characterization. J Vis 6:484–507.
49. Kinney JB, Tkačik G, Callan CG Jr (2007) Precise physical models of protein–DNA interaction from high-throughput data. Proc Natl Acad Sci USA 104:501–506.
50. Kinney JB, Murugan A, Callan CG Jr, Cox EC (2010) Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci USA 107:9158–9163.
51. Socolich M, et al. (2005) Evolutionary information for specifying a protein fold. Nature 437:512–518.
52. Russ WP, Lowery DM, Mishra P, Yaffe MB, Ranganathan R (2005) Natural-like function in artificial WW domains. Nature 437:579–583.
53. Bialek W, Ranganathan R (2007) Rediscovering the power of pairwise interactions. arXiv:0712.4397 [q-bio].
54. Mora T, Walczak AM, Bialek W, Callan CG Jr (2010) Maximum entropy models for antibody diversity. Proc Natl Acad Sci USA 107:5405–5410.