The T-Maze Problem
A rat in a T-maze faces a choice: one arm contains cheese (reward), the other a shock (punishment). A cue in the central arm reveals which side has the reward.

Option 1: Go Directly (Exploit)

Gamble on 50/50 odds: you might get cheese, you might get shocked.

Option 2: Seek Information First (Explore)

Visit the cue first, learn where the reward is, then go there.

"This choice speaks to the classical exploration-exploitation dilemma: a dilemma resolved under Active Inference."

— Parr, Pezzulo & Friston (2022), Chapter 7, p.130
The cue is any observation that resolves uncertainty before a consequential choice. It is an epistemic action, taken for information rather than for immediate reward.
  • Checking a calendar before deciding which cafΓ©
  • Reading reviews before purchasing
  • Asking a question before committing
  • Checking the weather before dressing

"If he does not know what day it is, he has to first select an action with epistemic value."

— Parr, Pezzulo & Friston (2022), Chapter 2, p.34
The task is formalised as a Partially Observable Markov Decision Process:

A-matrices (Likelihood)

Map hidden states to observations. What will I see?

B-matrices (Transitions)

How states change given actions.

C-vectors (Preferences)

Cheese: +6, shock: −6.

D-vectors (Initial Beliefs)

Both contexts equally likely (50/50).

— Chapter 7, pp.131–135
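The four objects above can be written down concretely. Here is a minimal NumPy sketch with illustrative names and shapes (the book's matrices also track location as a hidden state; in this simplification only the context is hidden and the location follows the chosen action directly):

```python
import numpy as np

# Hidden state: context (0 = reward on the left, 1 = reward on the right).
# Observations: 0 = "cue points left", 1 = "cue points right",
#               2 = cheese, 3 = shock, 4 = nothing.

# A (likelihood): P(observation | context) at each location.
# Only the cue location reveals the context before an arm is entered.
A = {
    "centre": np.array([[0, 0, 0, 0, 1],
                        [0, 0, 0, 0, 1]], float),
    "cue":    np.array([[1, 0, 0, 0, 0],
                        [0, 1, 0, 0, 0]], float),
    "left":   np.array([[0, 0, 1, 0, 0],          # context left  -> cheese
                        [0, 0, 0, 1, 0]], float), # context right -> shock
    "right":  np.array([[0, 0, 0, 1, 0],
                        [0, 0, 1, 0, 0]], float),
}

# B (transitions): the context never changes during a trial, so its
# transition matrix is the identity for every action.
B = np.eye(2)

# C (preferences): log-preferences over observations.
C = np.array([0.0, 0.0, 6.0, -6.0, 0.0])  # cheese +6, shock -6, else 0

# D (initial beliefs): both contexts equally likely.
D = np.array([0.5, 0.5])
```

Each row of every A-matrix is a proper probability distribution over observations, which is what lets the agent predict what it will see from any belief about the context.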
Policies minimise expected free energy G(π):
Expected Free Energy
G(π) = −(epistemic value) − (pragmatic value)

Epistemic Value

How much will this reduce uncertainty?

Pragmatic Value

How likely is this to bring preferred outcomes?

"We do not need to balance exploration and exploitation. Both serve the same function."

— Chapter 7, p.131
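Both components can be computed directly from the generative model. A sketch (assumed names): epistemic value as the expected reduction in uncertainty about the hidden state, pragmatic value as the preference-weighted predicted observations:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_free_energy(A_loc, Q_s, C):
    """G for a one-step policy that visits a single location.

    A_loc: P(o | s) at that location, shape (n_states, n_obs)
    Q_s:   current beliefs over hidden states
    C:     log-preferences over observations
    """
    Q_o = Q_s @ A_loc              # predicted observations P(o)
    pragmatic = Q_o @ C            # E[ln P(o | C)]
    # Epistemic value: H[Q(s)] minus the expected posterior entropy,
    # i.e. how much the predicted observation will sharpen beliefs.
    post_entropy = 0.0
    for o in range(A_loc.shape[1]):
        if Q_o[o] > 0:
            posterior = Q_s * A_loc[:, o] / Q_o[o]
            post_entropy += Q_o[o] * entropy(posterior)
    epistemic = entropy(Q_s) - post_entropy
    return -epistemic - pragmatic

# Under a 50/50 prior, visiting the cue yields ln 2 of information
# and no immediate reward:
A_cue = np.array([[1, 0, 0, 0, 0],
                  [0, 1, 0, 0, 0]], float)
C = np.array([0.0, 0.0, 6.0, -6.0, 0.0])
Q = np.array([0.5, 0.5])
G_cue = expected_free_energy(A_cue, Q, C)   # = -ln 2, about -0.69
```

Note that under the 50/50 prior a direct dash to an arm earns zero expected pragmatic value (+6 and −6 cancel), whereas visiting the cue sets up a guaranteed +6 at the next step; the advantage of the cue therefore plays out over multi-step policies.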
Precision (γ) controls how deterministic policy selection is:
Policy Probability
P(π) = σ(−γ · G(π))

Low Precision (γ → 0)

More random/exploratory.

High Precision (γ → ∞)

More deterministic.

Try the Precision slider!
— Chapter 4, p.72
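Precision enters as the inverse temperature of a softmax over negative expected free energy. A sketch with illustrative G values:

```python
import numpy as np

def policy_probabilities(G, gamma):
    """P(pi) = sigma(-gamma * G), where sigma is the softmax function."""
    logits = -gamma * np.asarray(G, float)
    logits -= logits.max()          # subtract max for numerical stability
    z = np.exp(logits)
    return z / z.sum()

G = [-6.69, -0.69, -0.69]   # illustrative EFE values for three policies

p_low  = policy_probabilities(G, 0.1)   # low gamma: spread out, exploratory
p_high = policy_probabilities(G, 5.0)   # high gamma: nearly deterministic
```

With γ = 0.1 the probabilities stay close to uniform even though one policy is clearly better; with γ = 5.0 almost all the mass lands on the lowest-G policy.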

Step 1: Go to Cue (Epistemic)

Visit the informative cue first.

Step 2: Go to Reward (Pragmatic)

After seeing cue, go to correct arm.

"The rat chooses to sample the informative cue—the location with greatest epistemic value."

Press "AUTO" to watch!
— Chapter 7, p.135
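A back-of-envelope tally makes the two-step logic explicit, under the ±6 preferences and 50/50 prior above (values in nats, treating the cue-conditioned arm choice as part of the policy):

```python
import numpy as np

ln2 = np.log(2.0)

# Policy 1: visit the cue, then go to whichever arm it indicates.
# Step 1 (cue): epistemic value ln 2 (context revealed), pragmatic 0.
# Step 2 (correct arm): cheese for certain, so pragmatic +6, epistemic 0.
G_cue_then_arm = (-ln2 - 0.0) + (-0.0 - 6.0)

# Policy 2: dash straight to the left arm.
# 50% cheese (+6) and 50% shock (-6) cancel, so pragmatic value is 0;
# the outcome still reveals the context, so epistemic value is ln 2.
G_dash_left = (-ln2 - 0.0)

# The cue-first policy has the lower expected free energy, so it wins.
```

The gamble gains exactly as much information as the cue visit does, but only the cue-first policy converts that information into a guaranteed reward at the second step.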
T-Maze Environment

[Interactive simulation: the maze view, live beliefs Q(s) over CENTRE, CUE, LEFT and RIGHT, expected free energy G(π) for the policies → Cue, ← Left and → Right, and the prior preferences C (🧀 attractive: +6, ⚡ aversive: −6, ○ neutral: 0), alongside trial, step and context indicators.]

EFE in Phase Space

[Interactive demo: watch an agent navigate epistemic and pragmatic landscapes, flowing down the EFE gradient between an epistemic attractor (information) and a pragmatic attractor (reward).

Parameters: where the agent thinks the reward is (left vs right, default 50/50, uncertain); how much it values information (default 1.0); how much it values reward (default 1.0); and how deterministic policy selection is (default 2.0).

Live EFE values are shown for the epistemic (INFO) and pragmatic (LEFT/RIGHT) policies, along with the current winner. 💡 High uncertainty → the agent seeks information first (epistemic), then reward (pragmatic).]

"The fact that utility and the value of information emerge as two components of expected free energy means we do not need to worry about balancing exploration and exploitation."

— Parr, Pezzulo & Friston (2022), Chapter 7, p.131

Glossary of Terms

Chapter 7 concepts in friendly language
Expected Free Energy (G)
Expected surprise about future observations. Policies are selected to minimise G. "If I do this, how surprised will I be?"
Epistemic Value
Expected information gain. High = action reduces uncertainty. Checking calendar has high epistemic value.
Pragmatic Value
Expected preference satisfaction: E[ln P(o|C)]. Going to favourite restaurant = high pragmatic value.
POMDP
Partially Observable Markov Decision Process. Decision-making with uncertain state. Life: you never see reality directly.
Policy (π)
A sequence of actions. Agents evaluate trajectories, not single actions. Not "turn left?" but the whole plan.
Precision (γ)
Policy selection determinism. P(π) ∝ exp(−γG). High = decisive. Low = exploratory.
A-Matrix
Likelihood mapping: states → observations. Rain (state) → wet streets (observation).
B-Matrix
Transition mapping: (state, action) → next state. Button press → floor changes.
C-Vector
Prior preferences. Not expectations, but desires. Your values encoded numerically.
D-Vector
Initial beliefs about state. Rat: 50/50 context uncertainty. Know location, not context.
Exploration-Exploitation
Gather info vs act on knowledge. Active inference dissolves this dilemma. Both serve the same objective: minimise G.
Information Gain
Uncertainty reduction after observation. Cue has high info gain. Window → weather knowledge.
Temporal Grammar | Alexander Sabine | Active Inference Institute | temporalgrammar.ai