In Chapter 3 I outlined an idea that I extracted from the work of Kenneth Craik: namely, that the mind builds idealised models of the world and uses these models for prediction. In the last two chapters I elaborated on this proposal. In Chapter 4 I argued that models in general—from simple maps to complex mathematical models used in science to the mental models inside our brains (if there are such things)—should be understood as idealised structural surrogates for target domains. In the previous chapter I then explained how this functional profile can be understood in the context of the neural mechanisms that underlie our psychological capacities.
In this post I’m going to outline some ideas from contemporary cognitive science and machine learning that actually provide some support for a vision of the mind as a predictive modelling engine.
I’ll start with a brief overview of generative models. Then I’ll explain how such models can be understood in terms of probabilistic graphical models. Then I’ll outline predictive coding and a more general conception of the neocortex as a general-purpose learning device constructing generative models of those features of the world responsible for generating the data to which it is exposed.
Brains must estimate features of the distal world based on proximal sensory inputs. As one of my favourite philosophers W.V.O. Quine put it,
“The stimulation of his sensory receptors is all the evidence anybody has had to go on, ultimately, in arriving at his picture of the world.”
Here is one way a neural network might learn that mapping. First, feed it lots of examples of sensory input. Second, see what classification the network delivers. Compare this output to the correct output—what the classification should be—and use the difference between the two to change the configuration of connections in the network so that next time it is more likely to produce the correct output. Repeat this process a bunch of times. Eventually the network will learn how to map the set of inputs onto the correct outputs. If all goes to plan, when you then feed the network a completely new input—an input it was not trained on—it will output the correct classification: “that’s a cat!”
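In minimal form, this compare-and-correct loop can be sketched in a few lines. What follows is a toy illustration, not a model of any real sensory system: a single-layer network (logistic regression) trained by gradient descent on made-up two-dimensional "inputs," with all data and parameters invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sensory inputs": 2-D points. Label 1 ("cat") if x + y > 0, else 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)  # connection weights
b = 0.0          # bias
lr = 0.5         # learning rate

for _ in range(200):
    # See what classification the network delivers (a probability of "cat").
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Compare output to the correct output and use the difference
    # to change the configuration of connections.
    err = p - y
    w -= lr * X.T @ err / len(X)
    b -= lr * err.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
print("training accuracy:", (preds == y).mean())
```

Deep networks stack many such layers between input and output, but the underlying logic—produce an output, compare it to the correct one, nudge the connections—is the same.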
Stripped down to its basics, this method underlies many advances in recent machine learning in which deep feedforward neural networks are trained on tasks such as image classification. These networks are “deep” in that they decompose the overall mapping (e.g. from images to appropriate classifications of those images) into a set of intermediate mappings.
Given the impressive abilities of such feedforward networks, one might think that they provide a good model of what our sensory systems in the brain are doing. Our brains are neural networks, after all, and our sensory systems are also “deep” (i.e. hierarchically structured). Further, some deep convolutional networks trained on image classification tasks have been shown to reproduce certain characteristics of the response profiles of neuronal populations in the ventral stream (“object recognition pathway”) of the visual cortex.
Call this approach the discriminative approach to understanding perception. (Feedforward neural networks implement so-called “discriminative models” that map inputs onto outputs or a probability distribution over those outputs).
Three Problems for the Discriminative Approach
There are many reasons why the discriminative approach doesn’t offer a good model of perception. Here are three.
- Unsupervised learning. I noted that such feedforward networks learn the mapping from sensory inputs to appropriate classifications by comparing their outputs to the desired or correct outputs. Obviously this can’t be how the brain works, however. The brain doesn’t receive labelled examples, i.e. there is no external supervision to tell it when it gets something right or wrong.
- The discriminative approach offers no account of top-down effects in perception. There are really a few different issues here:
- Anatomically, it offers no account of “top-down” or “backwards” synaptic connections in cortical areas (i.e. connections carrying information back towards primary sensory areas).
- Functionally, it offers no explanation of top-down knowledge effects in perception.
- Likewise, it doesn’t offer any account of forward-looking sensory predictions (i.e. what am I going to experience next?) based on estimates of states of the world.
- Mental imagery. The discriminative approach suggests that perceptual processing is triggered by proximal sensory inputs. It therefore offers no account of the endogenous generation of perceptual states in mental imagery, and likewise offers no explanation of the substantial neural overlap between online perception and mental imagery.
The Generative Approach
Here is another way to train a neural network. Rather than training it to, say, classify input data, train it to reconstruct that data—or, if the data have a temporal element, to predict that data as it arrives.
The logic here is simple, even if the engineering challenges are immense. It is this: a network that can effectively reconstruct or predict the data it is given must have some understanding of the process responsible for generating that data. (Of course, for this to work it must reconstruct that data without simply “copying” it. One standard way to ensure this is to force the network to reconstruct the data from a latent representation that is simpler—lower-dimensional or more compressed—than the data itself).
If the task is to reconstruct the data to which the network is exposed, then learning can be supervised by… the data. That is, the network can compare its attempts at reconstructing the data with the data itself and use the mismatch—the “reconstruction error” or “prediction error”—as a learning signal to update the structure of its model.
This is the generative approach. In one way or another it underlies a whole host of famous generative model-based applications in neural network modelling: the Helmholtz machine, variational autoencoders, deep belief networks, generative adversarial networks, generative query networks, and so on.
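To make the reconstruction-error idea concrete, here is a toy sketch with entirely made-up data and parameters: a linear autoencoder with a one-dimensional bottleneck, supervised by nothing but the data itself. It is not meant as a faithful rendering of any of the architectures just listed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data generated by a simple hidden process: one latent cause -> three observations.
z_true = rng.normal(size=(500, 1))
X = z_true @ np.array([[2.0, -1.0, 0.5]]) + 0.05 * rng.normal(size=(500, 3))

# Linear autoencoder: encoder W_e squeezes 3-D data through a 1-D bottleneck,
# decoder W_d attempts to reconstruct the data from that latent representation.
W_e = rng.normal(size=(3, 1)) * 0.1
W_d = rng.normal(size=(1, 3)) * 0.1
lr = 0.02

for _ in range(2000):
    z = X @ W_e       # latent representation (simpler than the data)
    X_hat = z @ W_d   # attempted reconstruction
    err = X_hat - X   # "reconstruction error" is the only learning signal
    W_d -= lr * z.T @ err / len(X)
    W_e -= lr * X.T @ (err @ W_d.T) / len(X)

print("mean squared reconstruction error:", np.mean(err**2))
```

Because the bottleneck is narrower than the data, the network cannot simply copy its input; driving the reconstruction error down forces it to recover something like the latent cause that generated the data.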
Most abstractly, a generative model can be understood as a structure designed to generate a range of phenomena in a way that is intended to model the process by which those phenomena are actually generated. If the process is causal, it is a causal generative model. If it decomposes this generative process hierarchically, it is a deep (i.e. hierarchical multi-level) causal generative model.
In the case of perception, the generative approach claims that our perceptual systems implement deep causal generative models of the process by which sensory signals are generated by interactions among features of the distal world. Why believe this?
- It offers at least a schematic explanation of unsupervised perceptual learning.
- It offers at least a schematic explanation of “top-down” effects in perception.
- Top-down synaptic connections encode a generative model.
- Top-down knowledge effects come from using a generative model to predict likely sensory data.
- Perceptual forward-looking prediction comes from using a dynamic causal generative model to predict what sensory information is likely next given a representation of current world states.
- It offers a straightforward account of mental imagery, in terms of using a generative model to generate sensory data in the absence of externally provided sensory data.
Importantly, I have spoken of generative models in perception as representing the process by which proximal sensory inputs are generated by interactions among features of the distal world. Crucially, however, embodied perceivers are themselves parts of this generative process. Thus their generative models must include themselves—a point I will return to in a future post when I consider embodied cognition.
So far I have described the generative approach to perception. Notice, though, that the concept of unsupervised learning within generative models is highly general, and applies in any case where an agent has access to some phenomena that are the result of some systematic causal process. This suggests that generative models might be applicable far beyond simple perception to other areas of cognition—a point I’ll return to in future posts.
Generative Models as Graphical Models
[This short section is more technical and can be skipped without any real loss of understanding].
To summarise, then, a generative model implements a process capable of generating some phenomena—typically data—in a way that is supposed to model the actual process by which those phenomena are generated.
This doesn’t tell us much about how such generative models are structured, however, or how they work.
One useful way of understanding generative models is in terms of probabilistic graphical models.
Graphical models were largely introduced into cognitive science and AI by Judea Pearl. They are called “graphical” because they represent the set of dependencies (i.e. causal and statistical relationships) between the elements of a domain with a graph, in the mathematical sense of a structure of nodes and edges, which can be directed, undirected, or bidirectional.
In the graphical models of concern here, the nodes are variables and the edges are causal and statistical relations between them. Here is a simple visual example of a graphical model.
A simple graphical model capturing the structure of a set of dependencies between (1) what season it is, (2) whether the sprinkler is on, (3) whether rain falls, (4) whether the pavement is wet, and (5) whether it is slippery. In typical models this qualitative structure will be combined with parameters that quantify the probabilistic strength of these dependencies—for example, how probable rain is given different seasons.
What makes graphical models like this so powerful is their graphical component. Specifically, the graph makes extremely complex domains tractable to represent and reason about by encoding direct dependence relationships. If two nodes are directly connected by an edge, they are directly relevant to one another. If they are not, they are “conditionally independent”: once one conditions on the right intervening variables, learning the value of one tells you nothing further about the other. This is important because it means that such models can be updated locally, with the values of variables in the network predicted or inferred solely on the basis of the values of relevant “nearby” variables. Without this “modularity,” representing and reasoning about highly complex multidimensional domains would be intractable.
Graphical models also have the nice characteristic of visually depicting the structure of a statistical model. By making the structure of such models explicit, it becomes clear what it might mean to say that they recapitulate the causal-statistical structure of target domains in the body and world. I will return to this in the next post.
What does any of this have to do with the brain?
The Neocortex as a General-Purpose Learning Device
The neocortex is a thin sheet of (folded) neural tissue that envelops most of our brains. All mammals have one, and some other species have homologous structures. It is principally responsible for paradigmatically psychological functions such as perception, imagery, reasoning, and—in humans, at least—much of action control.
It is not solely responsible, of course! Often people complain—and often rightly—of a kind of neocortical chauvinism in neuroscience. Nevertheless, I do think that the neocortex (or, more broadly, thalamocortical system) is the most important part of the brain to understand if one wants to understand the nature of mental representation.
Understanding the neocortex seems like a formidable task. It has billions of densely connected neurons and multiple neuron types, and it underlies a variety of distinct cognitive functions.
Nevertheless, many neuroscientists have sought what might be called “global theories” of the neocortex—theories in which its superficial complexity and diversity of functions arise from relatively simple underlying operating principles. Why think this?
Again, there are many reasons, but here are two.
First, the cortex is extremely plastic. This is evidenced by the large structural changes its architecture undergoes during development, by its capacity for learning about an open-ended variety of domains, and by the surprising ability of cortical regions usually associated with one function to adapt to a different function when provided with novel inputs. (For example, routing visual inputs to previously auditory cortical regions in ferrets results in functional visual pathways and capacities in that region of “auditory” cortex).
Second, the cortex is surprisingly structurally uniform, both across different functional regions and across mammalian species. Of course, this claim is immensely controversial, but it has provided hope to many neuroscientists that what differentiates, say, visual from auditory from somatosensory cortex is not different information-processing architectures but rather just the kinds of inputs to which they are exposed.
In conjunction with one another, then, these ideas give rise to the search for a common cortical algorithm—one which enables the neocortex to build models of an open-ended variety of possible domains without supervision.
According to some, this algorithm is predictive coding.
Hierarchical Predictive Coding
There has been so much written about predictive coding at this point—so many attempts at introductions—that I won’t bother providing an introduction here. Very roughly, predictive coding is an encoding strategy that involves only transmitting the unpredicted elements of a signal. If one situates this in the context of cortical hierarchies, the idea is that higher levels build generative models of the phenomena represented at lower levels, exploiting these models to predict lower-level activity, register the errors in these predictions (“prediction errors”), and then use these prediction errors to update the generative model.
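The core encoding idea—transmit only what the model fails to predict—can be illustrated with a deliberately trivial example, where the “model” simply predicts that each sample of a signal repeats the previous one:

```python
# Predictive coding as an encoding strategy: transmit only the
# unpredicted elements of a signal. The "model" here is the toy
# prediction "the next sample equals the current one".
signal = [10, 11, 13, 13, 14, 20, 21, 21]

# Encoder: send the first sample, then only the prediction errors.
encoded = [signal[0]] + [signal[i] - signal[i - 1] for i in range(1, len(signal))]

# Decoder: run the same predictor and add the errors back in.
decoded = [encoded[0]]
for e in encoded[1:]:
    decoded.append(decoded[-1] + e)

assert decoded == signal   # nothing is lost
print(encoded)             # mostly small numbers: predictable structure squeezed out
```

In cortical predictive coding the predictor is not a fixed rule but a learned hierarchical generative model, and the prediction errors do double duty: they are what gets passed forward, and they are the signal used to update the model itself.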
From my perspective, what is interesting about predictive coding is that it provides a nice algorithm-level story (with some empirical support) of how the cortex might function as a predictive modelling engine, building hierarchical generative models of the causal processes responsible for generating the data to which it is exposed and then exploiting these models for continual prediction.
Because this blog post is getting too long, I’ll now wrap things up. To summarise:
- Causal generative models provide a useful framework for understanding unsupervised learning, top-down knowledge effects in perceptual inference and prediction, and the continuities between perception and imagery.
- Generative models apply beyond perception, however—namely, to any case in which an agent has access to phenomena that are generated by some systematic process.
- Generative models can be understood in terms of probabilistic graphical models, which model the causal-statistical structure of target domains in a way that is computationally tractable.
- One way of understanding the neocortex is as a general-purpose empiricist learning device, building hierarchical generative models of those features of the distal body and world responsible for generating the data to which it is exposed.
- Hierarchical predictive coding offers a theory of how this might actually work.
In the next chapter I will relate this empirical research to the three insights that I extracted from Craik’s work in Chapter 3 and jump into some of the applications of this research to specific cognitive domains.