Machine Learning vs Neuroscience: Flip Sides of the Same Coin? Part 1 of 3
Can our brains also optimise cost functions like in machine learning?
Machine learning has managed some impressive feats, even beating human minds at tasks we never thought machines would be able to take over, all by focusing on function optimisation. Neuroscience, on the other hand, has made progress in the detailed analysis of how our mind works, discovering a wide range of brain areas, cell types and mechanisms. It seems as though neuroscience and machine learning are on completely separate trajectories. However, they have much in common, and this article explores their relationship.
This article summarises and discusses a research paper by Marblestone, Wayne and Kording, Toward an Integration of Deep Learning and Neuroscience.
Introduction
Machine learning, with a special emphasis on neural networks, clearly traces its origins to the neuronal structures of the human brain. However, it is important to recognise that modern advances in the field rely heavily on strides made in mathematical optimisation and computational power. Neural networks have evolved from rudimentary linear systems to sophisticated architectures such as recurrent networks, which emulate the memory functions of the human brain. Additionally, gradient descent algorithms have been significantly enhanced through the incorporation of momentum terms, efficient weight initialisation and conjugate gradients, none of which are borrowed directly from neuroscience.
Interestingly, three central features of contemporary neural networks do resemble the human brain. First, neural networks are built around the optimisation of cost functions. Second, present-day machine learning algorithms employ intricate cost functions that are not homogeneous across neural network layers but take interlayer interactions into account. Lastly, the architecture of neural networks has evolved to encompass memory cells with multiple states. We posit that these characteristics mirror the operational paradigm of the human brain. This argument is summarised in the following hypotheses:
Hypothesis 1: The brain optimises cost functions
Hypothesis 2: Cost functions are diverse across areas and change over development
Hypothesis 3: Specialised systems allow efficiently solving key computational problems
In this article, we will explore and evaluate Hypothesis 1 in detail.
Hypothesis 1: The brain optimises cost functions
In the realm of mathematics, optimisation of functions is a well-established concept; applying it to the human brain, however, requires more care. Several experiments point to humans naturally optimising objectives: subjects minimise the energy consumption of their movement systems, and minimise risk and damage to their body, while maximising movement gains (Taylor and Faisal, 2011). Furthermore, we know that computational optimisation of trajectories gives rise to elegant solutions for many complex motor tasks (Mordatch et al., 2012). Physical phenomena also involve optimisation; for example, the laws of physics can be viewed as minimising an action functional, while evolution optimises the fitness of species over long timescales. It is therefore useful to distill the hypothesis into two fundamental propositions:
The brain possesses mechanisms for credit allocation during learning, allowing it to optimise global functions in multilayer networks by modulating the properties of individual neurons to synergistically contribute to the global outcome.
The brain has mechanisms that can specify exactly which cost functions it subjects its networks to. In other words, the cost functions are highly tuneable and are matched to the ethological needs of the animal.
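To make the first proposition concrete, here is a minimal sketch, in Python, of what optimising a cost function by gradient descent looks like for the movement example above. The cost (a sum of squared accelerations standing in for energy), the task and all parameters are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy trajectory optimisation: find a 1-D reaching movement that minimises
# an "energy" cost (sum of squared accelerations) while starting at 0 and
# ending at the target.
T, target, lr, steps = 50, 1.0, 0.01, 5000
x = np.linspace(0.0, target, T)              # initial straight-line guess
x[1:-1] += 0.1 * rng.standard_normal(T - 2)  # perturb the interior waypoints

for _ in range(steps):
    acc = x[2:] - 2 * x[1:-1] + x[:-2]       # discrete acceleration
    grad = np.zeros_like(x)                  # gradient of sum(acc**2)
    grad[2:] += 2 * acc
    grad[1:-1] += -4 * acc
    grad[:-2] += 2 * acc
    x -= lr * grad                           # gradient descent step
    x[0], x[-1] = 0.0, target                # keep the endpoints fixed

# Gradient descent smooths the perturbed trajectory back towards the
# minimum-energy (straight-line) reach to the target.
```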
A natural question to ask at this stage is how the brain might efficiently perform credit assignment throughout large, multilayered networks in order to optimise complex functions. One possible perspective is that the brain uses several different types of optimisation approaches to solve distinct problems. For example, in cases where only limited learning from data is required, it may rely on genetic pre-specification of circuits, or exploit local optimisation to avoid assigning credit through many layers of neurons.
Furthermore, the brain could also use a variety of circuit structures that allow it, in essence, to perform backpropagation of errors through a multilayer network using biologically realistic mechanisms. Potential mechanisms include:
circuits that explicitly backpropagate error derivatives, akin to conventional backpropagation algorithms;
circuits that approximate the effects of backpropagation by other efficient means, perhaps by rapidly computing the approximate gradient of a cost function with respect to any given connection weight in the network;
algorithms that exploit specific neurophysiological features, such as spike-timing-dependent plasticity, dendritic computation or local excitatory-inhibitory networks.
However, learning does not necessarily require gradient-descent-based optimisation. Many theories of the cortex emphasise self-organising and unsupervised learning properties that obviate the need for multilayer backpropagation.
For example, Hebbian plasticity, where synaptic weights are adjusted according to the correlation between pre-synaptic and post-synaptic activity, gives rise to various forms of competition between neurons, which in turn lead to self-organising maps and orientation columns.
However, we can interpret such types of local self-organisation as optimising a cost function: for example, Hebbian learning can be viewed as extracting the principal components of the input, which minimises the reconstruction error for linear mappings.
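As a concrete illustration of that interpretation, here is a minimal sketch of Hebbian learning stabilised by Oja's normalising term; the data and learning rate are illustrative. The weight vector converges to the first principal component of the input, the direction that minimises linear reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D inputs whose principal component lies along (1, 1).
X = rng.normal(size=(10000, 2)) @ np.array([[1.0, 0.9], [0.9, 1.0]])

w, lr = rng.normal(size=2), 0.01
for x in X:
    y = w @ x                      # post-synaptic activity
    w += lr * y * (x - y * w)      # Oja's rule: Hebbian term plus decay

w /= np.linalg.norm(w)
# w now aligns with the top eigenvector of the input covariance matrix.
```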
Biological Implementation of Optimisation
A key argument is that the local self-organisation mechanisms above are inadequate to account for the full capabilities of the brain. To justify the claim that the brain performs some form of gradient-based optimisation on customised cost functions, we must therefore consider how such optimisation could be implemented in biological systems.
There exist several “biologically plausible” mechanisms by which a neural circuit could implement optimisation algorithms that efficiently make use of gradients. The core principle behind all of these mechanisms is feedback connections that physically carry a prediction error: learning occurs by comparing a prediction with a target, and the prediction error drives top-down changes in bottom-up activity.
Some examples include:
Generalised Recirculation (O’Reilly, 1996)
The Temporally eXtended Contrastive Attractor Learning (XCAL) algorithm optimises a cost function by adjusting weights according to local pre- and post-synaptic activity, recreating the functionality of backpropagation through locally modulated Hebbian learning at each node of the network.
Spike Timing Dependent Plasticity (STDP) with iterative inference and target propagation (Scellier and Bengio, 2016)
In STDP, the sign of the synaptic weight change depends on the precise, millisecond-scale relative timing of pre-synaptic and post-synaptic spikes. Although this is typically interpreted as Hebbian learning, an alternative interpretation is that neurons encode the types of error derivatives needed for backpropagation in the temporal derivatives of their firing rates.
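A toy rendering of that alternative interpretation (an illustrative sketch, not the paper's model): the weight update is proportional to pre-synaptic activity times the temporal derivative of the post-synaptic rate, which plays the role of the error derivative that backpropagation would otherwise deliver.

```python
import numpy as np

rng = np.random.default_rng(0)
lr = 0.1

pre = rng.random(20)                        # pre-synaptic firing rates
post_before = rng.random(10)                # post-synaptic rates, free phase
post_after = post_before + 0.1 * rng.normal(size=10)  # after a top-down nudge

# The change in post-synaptic rate stands in for an error derivative,
# giving an STDP-like update: d(post)/dt outer pre.
d_post = post_after - post_before
dw = lr * np.outer(d_post, pre)
```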
Feedback Alignment (Lillicrap et al., 2014)
The feedback pathways of backpropagation, by which error derivatives at one layer are computed from error derivatives at the subsequent layer, are replaced by a set of random feedback connections with no dependence on the forward weights. Assuming the existence of a synaptic normalisation mechanism and approximate sign-concordance between feedforward and feedback connections, this mechanism closely recreates backpropagation on a variety of tasks. Essentially, the forward weights adapt the network such that the fixed feedback weights come to carry information that is useful for approximating the gradient during learning.
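Below is a minimal sketch of the basic feedback alignment idea on a toy regression problem; the network sizes, task and learning rate are all illustrative. The only change from ordinary backpropagation is a single line: the error is routed backwards through a fixed random matrix B instead of the transpose of the forward weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 10, 32, 1, 0.05

# Toy regression task: learn y = sum of the inputs.
X = rng.normal(size=(512, n_in))
Y = X.sum(axis=1, keepdims=True)

W1 = rng.normal(scale=0.1, size=(n_in, n_hid))
W2 = rng.normal(scale=0.1, size=(n_hid, n_out))
B = rng.normal(scale=0.1, size=(n_out, n_hid))  # fixed random feedback

for _ in range(500):
    h = np.tanh(X @ W1)                  # forward pass
    y = h @ W2
    e = y - Y                            # output error
    # Backpropagation would use e @ W2.T here; feedback alignment
    # routes the error through the fixed random matrix B instead.
    dh = (e @ B) * (1 - h ** 2)
    W2 -= lr * h.T @ e / len(X)
    W1 -= lr * X.T @ dh / len(X)

# The loss falls even though no error derivative ever travels through
# W2.T: the forward weights align themselves with the feedback weights.
```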
However, these implementations still lack some aspects of biological realism. For example, in artificial neural networks the same node can send both excitatory and inhibitory signals (i.e. backpropagate both positive and negative gradients), but a biological neuron can only be either excitatory or inhibitory. This ultimately limits the complexity of the functions that a network of biological neurons can learn compared with an equivalent network of artificial neurons.
Biological neural networks are also highly recurrent, showing rich dynamics over time. Recurrent networks require “Backpropagation through Time” (BPTT), which is not easily realised by the biological implementations of backpropagation given above. BPTT works by unfolding the recurrent network across multiple discrete time steps and performing standard backpropagation on the unfolded network. Although this seems biologically implausible, we should first consider to what extent BPTT is truly needed for learning.
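For readers unfamiliar with BPTT, here is a minimal sketch of what unfolding across discrete time steps means, using an illustrative one-unit linear RNN and hand-picked inputs:

```python
import numpy as np

# A 1-unit linear RNN, h[t] = w * h[t-1] + x[t], trained so that the final
# state matches a target. Unrolling the loop turns temporal credit
# assignment into ordinary backpropagation through the unrolled steps.
w, lr, target = 0.5, 0.01, 2.0
xs = np.array([1.0, 0.5, -0.3, 0.8])

for _ in range(200):
    hs = [0.0]
    for x in xs:                         # forward pass: unroll over time
        hs.append(w * hs[-1] + x)
    e = hs[-1] - target                  # loss = 0.5 * e**2
    grad, dh = 0.0, e
    for t in range(len(xs), 0, -1):      # backward through the unrolled net
        grad += dh * hs[t - 1]           # local contribution of w at step t
        dh *= w                          # carry the error one step back
    w -= lr * grad
```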
Given that neural networks in the brain have access to memory systems of various kinds, we could hypothesise that these memory systems aid in the training of recurrent networks. In particular, memory systems could “spatialise” the problem of temporal credit assignment, allowing the network to perform conventional backpropagation on the recurrent network. We illustrate this with two examples:
In memory networks, everything within a certain buffer is stored, which eliminates the need to assign credit to write-to-memory events; the network only needs to perform credit assignment for read-from-memory events.
Certain network architectures that are superficially deep can actually be seen as ensembles of shallow networks. By applying this idea in the time domain, we remove the need to propagate errors far backwards in time.
So far, we have treated biological neurons as if they could send and receive somewhat continuous signals. In reality, however, biological neurons communicate solely via spikes. This poses a critical issue for our hypothesis, as it is difficult to apply gradient descent algorithms directly to spiking neural networks: in general, learning over many layers of non-differentiable functions is a hard problem. There have been some advances in integer programming, but none come close to the power of gradient descent algorithms. Despite this difficulty, a number of optimisation procedures have been proposed in the neuroscience literature:
Performing optimisation on a continuous representation of the network dynamics, and embedding variables into high-dimensional spaces with many spiking neurons representing each variable (a sketch of this idea follows the list).
Using recurrent connections with multiple timescales removes the need for backpropagation in the direct training of spiking recurrent networks. Fast connections maintain the network in a particular state, while slow connections have local access to the global error signal. This way, we can recreate gradient descent using just the comparison of the slow and fast signals.
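One common way to realise the first idea, optimisation on a continuous representation, is a surrogate gradient; this is our illustrative choice, not a mechanism named in the paper. The forward pass keeps the non-differentiable spike; the backward pass substitutes a smooth derivative so that gradient descent can proceed.

```python
import numpy as np

rng = np.random.default_rng(0)

def spike(v):
    return (v > 0).astype(float)          # hard threshold: 0/1 spikes

def surrogate_grad(v, beta=5.0):
    s = 1.0 / (1.0 + np.exp(-beta * v))   # smooth stand-in for the step
    return beta * s * (1.0 - s)

# Toy task: one spiking neuron learns to fire for a target input pattern.
X = rng.normal(size=(256, 8))
y = (X @ rng.normal(size=8) > 0).astype(float)   # linearly separable labels
w, lr = np.zeros(8), 0.5

for _ in range(200):
    v = X @ w                             # membrane potentials
    err = spike(v) - y                    # spike/no-spike errors
    # Chain rule, with the surrogate in place of d(spike)/dv:
    w -= lr * X.T @ (err * surrogate_grad(v)) / len(X)
```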
Alternative Learning Mechanisms
It would be narrow-minded to assume that gradient-descent-based backpropagation is the only effective method of learning, especially given that the brain has evolved over a very long period of time. Through such evolution (which is itself a form of optimisation), the brain may have discovered mechanisms of credit assignment quite different from those we use in machine learning. We explore two such mechanisms:
Dendritic computation: neuronal dendrites exhibit various linear and non-linear behaviours that allow them to perform basic computations. This influences learning algorithms in three ways:
First, real neurons are highly nonlinear, with the dendrites of each single neuron implementing something computationally similar to a three-layer neural network (Mel, 1992); a toy sketch follows this list.
Second, dendrites simplify the problem of credit assignment: as the action potential propagates back from the soma into the dendritic tree, it propagates more strongly into the branches that have been active. This acts as a natural form of gradient descent, because the feedback signal flows preferentially through the nodes that were most active during the feedforward pass.
Third, neurons can have multiple somewhat independent dendritic compartments, as well as somewhat independent somatic compartments, which means a neuron can essentially store more than one variable at a time. A single neuron could therefore store both its activation and the error derivative of a cost function with respect to that activation, as required by backpropagation.
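Here is a toy sketch of the first point; branch counts, sizes and weights are illustrative. The inputs are split across dendritic branches, each branch applies its own nonlinearity, and the soma combines the branch outputs, so one cell computes like a small multilayer network.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

x = rng.normal(size=12)                   # 12 synaptic inputs to one neuron
branches = x.reshape(4, 3)                # 4 dendritic branches, 3 synapses each
w_syn = rng.normal(size=(4, 3))           # per-synapse weights
w_branch = rng.normal(size=4)             # branch-to-soma coupling

branch_out = sigmoid((branches * w_syn).sum(axis=1))  # nonlinear subunits
soma_out = sigmoid(w_branch @ branch_out)             # somatic nonlinearity
```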
Neuromodulation: the same neuron can exhibit different input-output responses and plasticity depending on a global circuit state, represented by the concentrations of various neuromodulators such as dopamine, serotonin, norepinephrine and acetylcholine. These modulators can influence learning in two ways:
First, modulators can be used to gate synaptic plasticity on and off selectively in different areas and at different times. Such gating allows for precise, rapidly updated coordination of where and when cost functions are applied to a network of neurons.
Second, since a single neuronal circuit can be considered a combination of multiple overlapping circuits, with modulation switching between them, modulators allow weight information to be shared across different neuronal circuits.
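A minimal sketch of the gating idea; the update rule and modulator values are illustrative. The same Hebbian update is multiplied by a global modulator signal, so where and when plasticity occurs is controlled independently of the activity itself.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(5, 5))    # synapses between two populations

def hebbian_update(w, pre, post, modulator, lr=0.05):
    # modulator in [0, 1]: 0 freezes the synapses, 1 allows full plasticity
    return w + lr * modulator * np.outer(post, pre)

pre, post = rng.normal(size=5), rng.normal(size=5)
w = hebbian_update(w, pre, post, modulator=1.0)  # e.g. dopamine burst: learn
w = hebbian_update(w, pre, post, modulator=0.0)  # baseline: weights frozen
```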
A particular brain region that has been heavily studied is the cortex, where several models attempt to explain cortical learning on the basis of specific architectural features of the 6-layered cortical sheet. These models generally agree that a primary function for the cortex is some form of unsupervised learning via prediction (O’Reilly et al., 2014b). Some models even attempt to map the cortical structure onto a message-passing framework for Bayesian inference, while others use learning functions to explain cortical neurophysiology. A few examples of the latter:
Cortical pyramidal neurons have multiple dendritic zones that are targeted by different kinds of projections, which may allow the pyramidal neuron to make comparisons of top-down and bottom-up inputs.
Local inhibitory neurons targeting particular dendritic compartments of the L5 pyramidal neurons could be used to exert precise control over when and how relevant feedback signals and associative mechanisms are utilised, serving as a proxy for backpropagation according to the prediction error signals.
Learning based on temporal predictions can be facilitated by the storage of information over time, which is controlled by the recurrent connectivity with the thalamus, structured bursts of spikes and cortical oscillations.
All of this suggests that the physiology of the cortex could be interpreted as a machine learning framework that goes beyond backpropagation. Note, however, that this interpretation is still in its early stages, as we lack a detailed map of even a single local cortical microcircuit.
Conclusion
Over the past few sections, we considered the hypothesis that the brain optimises cost functions, just as machine learning algorithms do. Although there exists some evidence that neuronal circuits can perform gradient descent via backpropagation, practical issues arise when we expect neurons, with all their biological constraints, to behave the way artificial neurons do. However, if we allow the brain other forms of optimisation in place of gradient descent via backpropagation, the physiology of the cortex in particular can be read as a machine learning framework that optimises cost functions for unsupervised learning via prediction. Finally, we note that this observation is still in its early stages, as we need better modelling capabilities to accurately model even a single local cortical microcircuit.
Over the next few posts, we will dive deep into Hypotheses 2 and 3, with the final post relating how these hypotheses and observations can be used to improve our machine learning models.
(References are not included here to save space. Please refer to the original paper for the relevant references)