The T-Maze Problem
A rat in a T-maze faces a choice: one arm contains cheese (reward), the other a shock (punishment). A cue in the central arm reveals which side has the reward.

Option 1: Go Directly (Exploit)

Gamble on 50/50 odds: you might get cheese, you might get shocked.

Option 2: Seek Information First (Explore)

Visit the cue first, learn where the reward is, then go there.

"This choice speaks to the classical exploration-exploitation dilemma: a dilemma resolved under Active Inference."

— Parr, Pezzulo & Friston (2022), Chapter 7, p.130
The cue is any observation that resolves uncertainty before a consequential choice. It is an epistemic action, taken for information rather than for immediate reward.
  • Checking a calendar before deciding which cafΓ©
  • Reading reviews before purchasing
  • Asking a question before committing
  • Checking the weather before dressing

"If he does not know what day it is, he has to first select an action with epistemic value."

— Parr, Pezzulo & Friston (2022), Chapter 2, p.34
The task is formalised as a Partially Observable Markov Decision Process:

A-matrices (Likelihood)

Map hidden states to observations. What will I see?

B-matrices (Transitions)

How states change given actions.

C-vectors (Preferences)

Cheese: +6, shock: −6.

D-vectors (Initial Beliefs)

Both contexts equally likely (50/50).

— Chapter 7, pp.131–135
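The four objects above can be written down concretely. Here is a minimal NumPy sketch with illustrative names and shapes (the book's matrices also track location as a hidden state; in this simplification only the context is hidden and the location follows the chosen action directly):

```python
import numpy as np

# Hidden state: context (0 = reward on the left, 1 = reward on the right).
# Observations: 0 = "cue points left", 1 = "cue points right",
#               2 = cheese, 3 = shock, 4 = nothing.

# A (likelihood): P(observation | context) at each location.
# Only the cue location reveals the context before an arm is entered.
A = {
    "centre": np.array([[0, 0, 0, 0, 1],
                        [0, 0, 0, 0, 1]], float),
    "cue":    np.array([[1, 0, 0, 0, 0],
                        [0, 1, 0, 0, 0]], float),
    "left":   np.array([[0, 0, 1, 0, 0],          # context left  -> cheese
                        [0, 0, 0, 1, 0]], float), # context right -> shock
    "right":  np.array([[0, 0, 0, 1, 0],
                        [0, 0, 1, 0, 0]], float),
}

# B (transitions): the context never changes during a trial, so its
# transition matrix is the identity for every action.
B = np.eye(2)

# C (preferences): log-preferences over observations.
C = np.array([0.0, 0.0, 6.0, -6.0, 0.0])  # cheese +6, shock -6, else 0

# D (initial beliefs): both contexts equally likely.
D = np.array([0.5, 0.5])
```

Each row of every A-matrix is a proper probability distribution over observations, which is what lets the agent predict what it will see from any belief about the context.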
Policies minimise expected free energy G(π):
Expected Free Energy
G(π) = −(epistemic value) − (pragmatic value)

Epistemic Value

How much will this reduce uncertainty?

Pragmatic Value

How likely is this to bring preferred outcomes?

"We do not need to balance exploration and exploitation. Both serve the same function."

— Chapter 7, p.131
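Both components can be computed directly from the generative model. A sketch (assumed names): epistemic value as the expected reduction in uncertainty about the hidden state, pragmatic value as the preference-weighted predicted observations:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_free_energy(A_loc, Q_s, C):
    """G for a one-step policy that visits a single location.

    A_loc: P(o | s) at that location, shape (n_states, n_obs)
    Q_s:   current beliefs over hidden states
    C:     log-preferences over observations
    """
    Q_o = Q_s @ A_loc              # predicted observations P(o)
    pragmatic = Q_o @ C            # E[ln P(o | C)]
    # Epistemic value: H[Q(s)] minus the expected posterior entropy,
    # i.e. how much the predicted observation will sharpen beliefs.
    post_entropy = 0.0
    for o in range(A_loc.shape[1]):
        if Q_o[o] > 0:
            posterior = Q_s * A_loc[:, o] / Q_o[o]
            post_entropy += Q_o[o] * entropy(posterior)
    epistemic = entropy(Q_s) - post_entropy
    return -epistemic - pragmatic

# Under a 50/50 prior, visiting the cue yields ln 2 of information
# and no immediate reward:
A_cue = np.array([[1, 0, 0, 0, 0],
                  [0, 1, 0, 0, 0]], float)
C = np.array([0.0, 0.0, 6.0, -6.0, 0.0])
Q = np.array([0.5, 0.5])
G_cue = expected_free_energy(A_cue, Q, C)   # = -ln 2, about -0.69
```

Note that under the 50/50 prior a direct dash to an arm earns zero expected pragmatic value (+6 and −6 cancel), whereas visiting the cue sets up a guaranteed +6 at the next step; the advantage of the cue therefore plays out over multi-step policies.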
Precision (γ) controls how deterministic policy selection is:
Policy Probability
P(π) = σ(−γ · G(π))

Low Precision (γ → 0)

More random/exploratory.

High Precision (γ → ∞)

More deterministic.

Try the Precision slider!
— Chapter 4, p.72
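Precision enters as the inverse temperature of a softmax over negative expected free energy. A sketch with illustrative G values:

```python
import numpy as np

def policy_probabilities(G, gamma):
    """P(pi) = sigma(-gamma * G), where sigma is the softmax function."""
    logits = -gamma * np.asarray(G, float)
    logits -= logits.max()          # subtract max for numerical stability
    z = np.exp(logits)
    return z / z.sum()

G = [-6.69, -0.69, -0.69]   # illustrative EFE values for three policies

p_low  = policy_probabilities(G, 0.1)   # low gamma: spread out, exploratory
p_high = policy_probabilities(G, 5.0)   # high gamma: nearly deterministic
```

With γ = 0.1 the probabilities stay close to uniform even though one policy is clearly better; with γ = 5.0 almost all the mass lands on the lowest-G policy.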

Step 1: Go to Cue (Epistemic)

Visit the informative cue first.

Step 2: Go to Reward (Pragmatic)

After seeing cue, go to correct arm.

"The rat chooses to sample the informative cue—the location with greatest epistemic value."

Press "AUTO" to watch!
— Chapter 7, p.135
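A back-of-envelope tally makes the two-step logic explicit, under the ±6 preferences and 50/50 prior above (values in nats, treating the cue-conditioned arm choice as part of the policy):

```python
import numpy as np

ln2 = np.log(2.0)

# Policy 1: visit the cue, then go to whichever arm it indicates.
# Step 1 (cue): epistemic value ln 2 (context revealed), pragmatic 0.
# Step 2 (correct arm): cheese for certain, so pragmatic +6, epistemic 0.
G_cue_then_arm = (-ln2 - 0.0) + (-0.0 - 6.0)

# Policy 2: dash straight to the left arm.
# 50% cheese (+6) and 50% shock (-6) cancel, so pragmatic value is 0;
# the outcome still reveals the context, so epistemic value is ln 2.
G_dash_left = (-ln2 - 0.0)

# The cue-first policy has the lower expected free energy, so it wins.
```

The gamble gains exactly as much information as the cue visit does, but only the cue-first policy converts that information into a guaranteed reward at the second step.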
T-Maze Environment

[Interactive simulation: the maze view, live beliefs Q(s) over CENTRE, CUE, LEFT and RIGHT, expected free energy G(π) for the policies → Cue, ← Left and → Right, and the prior preferences C (🧀 attractive: +6, ⚡ aversive: −6, ○ neutral: 0), alongside trial, step and context indicators.]

EFE in Phase Space

[Interactive demo: watch an agent navigate epistemic and pragmatic landscapes, flowing down the EFE gradient between an epistemic attractor (information) and a pragmatic attractor (reward).

Parameters: where the agent thinks the reward is (left vs right, default 50/50, uncertain); how much it values information (default 1.0); how much it values reward (default 1.0); and how deterministic policy selection is (default 2.0).

Live EFE values are shown for the epistemic (INFO) and pragmatic (LEFT/RIGHT) policies, along with the current winner. 💡 High uncertainty → the agent seeks information first (epistemic), then reward (pragmatic).]

"The fact that utility and the value of information emerge as two components of expected free energy means we do not need to worry about balancing exploration and exploitation."

— Parr, Pezzulo & Friston (2022), Chapter 7, p.131

Glossary of Terms

Chapter 7 concepts in friendly language
Expected Free Energy (G)
Expected surprise about future observations. Policies are selected to minimise G. "If I do this, how surprised will I be?"
Epistemic Value
Expected information gain. High = action reduces uncertainty. Checking calendar has high epistemic value.
Pragmatic Value
Expected preference satisfaction: E[ln P(o|C)]. Going to favourite restaurant = high pragmatic value.
POMDP
Partially Observable Markov Decision Process. Decision-making with uncertain state. Life: you never see reality directly.
Policy (π)
A sequence of actions. Agents evaluate trajectories, not single actions. Not "turn left?" but the whole plan.
Precision (γ)
Policy selection determinism. P(π) ∝ exp(−γG). High = decisive. Low = exploratory.
A-Matrix
Likelihood mapping: states → observations. Rain (state) → wet streets (observation).
B-Matrix
Transition mapping: (state, action) → next state. Button press → floor changes.
C-Vector
Prior preferences. Not expectations, but desires. Your values encoded numerically.
D-Vector
Initial beliefs about state. Rat: 50/50 context uncertainty. Know location, not context.
Exploration-Exploitation
Gather info vs act on knowledge. Active inference dissolves this dilemma. Both serve the same objective: minimise G.
Information Gain
Uncertainty reduction after observation. Cue has high info gain. Window → weather knowledge.
Temporal Grammar | Alexander Sabine | Active Inference Institute | temporalgrammar.ai