When Does Tabular RL Actually Break?

Testing representation bottlenecks on high school algebra

Zac Burton · December 31, 2025

TL;DR: AlphaProof solves IMO problems, but can vanilla RL solve ax + b = c? I systematically test tabular Q-learning, SARSA, PPO, and a neural Q-learning baseline to find the complexity threshold where RL breaks. Spoiler: it's embarrassingly low, at depth-2 equations (~8th grade algebra). Even with 50,000 training episodes, tabular Q-learning stays below 2% on depth-2, and a GRU-based neural Q-learner reaches only 4%. The problem isn't sample complexity; it's that flat representations can't capture compositional structure.

The Question

Modern AI systems like AlphaProof, Harmonic's Aristotle, and AxiomMath's AxiomProver can tackle International Math Olympiad and Putnam problems. Meanwhile, I wanted to know: what's the simplest math problem where reinforcement learning completely falls apart?

Not because scaling up is impossible (obviously, with enough compute you can brute-force anything), but because I wanted to understand the fundamental failure modes. Where does the representation break? Is it a data problem or an architecture problem?

So I picked the most trivial symbolic task I could think of: solving single-variable linear equations.

2x = 4          # depth-1: one operation
3x + 5 = 11     # depth-2: two operations
2x - 3 = x + 7  # depth-3: three operations

The task: given an equation, apply algebraic rewrite rules (like "move constant across equality" or "divide both sides by coefficient") until you isolate x.

High schoolers solve depth-3 equations in their sleep. Can RL?

The Setup

Environment

Each equation is represented as an abstract syntax tree (AST). The agent sees the equation as a string (e.g., "Equals(Mul(Const(2), Var(x)), Const(4))") and picks from a set of rewrite rules:

divide_linear: kx = c → x = c/k
move_const_l_to_r: x + b = c → x = c - b
combine_like_terms: ax + bx → (a+b)x
...and a few more

Episodes end when the agent reaches a solved form (x = k) or hits 100 steps. Reward is +1 for solving, 0 otherwise.
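
To make the setup concrete, here is a minimal sketch of such an environment. It is not the code from the repository: the node classes, the helper names (show, valid_actions, step, run_episode), and the choice to implement only two of the rewrite rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction

# Hypothetical AST node types; names mirror the strings shown in the post.
@dataclass(frozen=True)
class Const:
    value: Fraction

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Mul:
    left: object
    right: object

@dataclass(frozen=True)
class Add:
    left: object
    right: object

@dataclass(frozen=True)
class Equals:
    lhs: object
    rhs: object

def show(node) -> str:
    """Stringify the AST; this string is what a flat agent would see."""
    if isinstance(node, Const):
        return f"Const({node.value})"
    if isinstance(node, Var):
        return f"Var({node.name})"
    a, b = (node.lhs, node.rhs) if isinstance(node, Equals) else (node.left, node.right)
    return f"{type(node).__name__}({show(a)}, {show(b)})"

def divide_linear(eq):
    # kx = c  ->  x = c/k (only when the left side is exactly Const * Var)
    if (isinstance(eq.lhs, Mul) and isinstance(eq.lhs.left, Const)
            and isinstance(eq.lhs.right, Var) and isinstance(eq.rhs, Const)
            and eq.lhs.left.value != 0):
        return Equals(eq.lhs.right, Const(eq.rhs.value / eq.lhs.left.value))
    return None

def move_const_l_to_r(eq):
    # e + b = c  ->  e = c - b
    if (isinstance(eq.lhs, Add) and isinstance(eq.lhs.right, Const)
            and isinstance(eq.rhs, Const)):
        return Equals(eq.lhs.left, Const(eq.rhs.value - eq.lhs.right.value))
    return None

# Only two of the rules are sketched here.
RULES = {"divide_linear": divide_linear, "move_const_l_to_r": move_const_l_to_r}

def is_solved(eq) -> bool:
    return isinstance(eq.lhs, Var) and isinstance(eq.rhs, Const)

def valid_actions(eq):
    """Action mask: names of the rules whose pattern matches this equation."""
    return [name for name, rule in RULES.items() if rule(eq) is not None]

def step(eq, action):
    nxt = RULES[action](eq)
    if nxt is None:
        raise ValueError(f"{action} does not apply to {show(eq)}")
    done = is_solved(nxt)
    return nxt, (1.0 if done else 0.0), done  # reward is +1 only on solving

def run_episode(eq, policy, max_steps=100):
    """Run one episode; returns 1.0 if solved within the step limit, else 0.0."""
    for _ in range(max_steps):
        actions = valid_actions(eq)
        if not actions:
            return 0.0
        eq, reward, done = step(eq, policy(eq, actions))
        if done:
            return reward
    return 0.0
```

With only two rules there are no distracting actions, so even a trivial policy solves 3x + 5 = 11 here; the real environment has more rules and therefore more ways to wander.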

Agents

I tested four approaches:

Tabular Q-learning: Standard ε-greedy with string-based state representation
SARSA: On-policy variant, same representation
PPO: Policy gradient with GRU encoder over the stringified AST
Random baseline: Uniform sampling over valid actions

All agents use action masking to only select valid rewrites at each step.
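
For reference, a tabular Q-learning agent with ε-greedy exploration and action masking can be sketched as below. The class name and the hyperparameter values (alpha, gamma, epsilon) are placeholders chosen for illustration, not the settings used in these experiments.

```python
import random
from collections import defaultdict

class TabularQ:
    """Q-table keyed by (state_string, action_name); values are assumptions."""

    def __init__(self, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
        self.q = defaultdict(float)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state, valid):
        # Action masking: only actions in `valid` are ever considered.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(valid)
        return max(valid, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next, valid_next, done):
        # Q-learning bootstraps from the best valid next action.
        # SARSA would instead use the action actually taken in s_next.
        if done or not valid_next:
            best_next = 0.0
        else:
            best_next = max(self.q[(s_next, a2)] for a2 in valid_next)
        target = r + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```

Because the table is keyed on the raw string, a value learned for one equation never transfers to another equation that differs only in its constants.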

Initial Results: Complete Failure at Depth-2

I trained each agent for 20,000 episodes on equations of varying depths:

Agent	Depth-1	Depth-2	Depth-3	Depth-4
PPO	38%	19%	4%	1%
Q-Learning	4%	5%	4%	1%
SARSA	4%	5%	4%	1%
Random	4%	5%	1%	1%

Key Finding: Tabular Q-learning and SARSA stay within a few percentage points of the random baseline at every depth. They learn essentially nothing.

PPO does better on depth-1 (38%), but by depth-3 it drops to 4%, no better than the tabular agents. And depth-2? A measly 19%.

Why? The obvious hypothesis: state sparsity. Because states are represented as raw strings, "Equals(Mul(Const(2), Var(x)), Const(4))" and "Equals(Mul(Const(3), Var(x)), Const(6))" are treated as completely unrelated—even though they have identical solution structure.

The Investigation: Data vs. Representation

But wait—is this a data problem or a representation problem?

To test this, I ran two follow-up experiments:

Experiment 1: Coefficient Normalization

What if we explicitly remove the coefficient information? Instead of treating 2x = 4 and 3x = 6 as different states, normalize them both to C·x = C.

This should massively reduce the state space. For depth-1 equations, it collapses from ~341 unique states down to just 2.
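
One simple way to implement this kind of normalization on the stringified AST is a regex substitution. This is a sketch of the idea under my own assumptions (the function name and the "Const(C)" placeholder are made up here); the actual experiments may normalize differently.

```python
import re

# Matches any Const(...) leaf in the stringified AST.
_CONST = re.compile(r"Const\([^()]*\)")

def normalize(state: str) -> str:
    """Replace every constant with a placeholder so only the structure remains."""
    return _CONST.sub("Const(C)", state)

a = normalize("Equals(Mul(Const(2), Var(x)), Const(4))")
b = normalize("Equals(Mul(Const(3), Var(x)), Const(6))")
print(a == b)  # True: both collapse to the same state key
```

The normalized string is then used as the Q-table key in place of the raw string.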

Method	Depth	Episodes	Solve Rate	Q-Table Size (states)
Regular Q-Learning	1	20K	5.4% ± 1.4%	341 ± 7
Regular Q-Learning	2	50K	1.3% ± 0.8%	673 ± 13
Normalized Q-Learning	1	20K	5.4% ± 1.4%	2 ± 0
Normalized Q-Learning	2	50K	0.5% ± 1.0%	8 ± 0

Result: Normalization achieves 99% state space compression (341 → 2 states for depth-1). But performance doesn't improve: depth-1 solve rates are identical (5.4%), and both methods completely fail on depth-2 (<2% solve rate) even with 50K episodes.

Figure: Learning curves over 50K episodes. Both regular and normalized Q-learning plateau at <2% on depth-2.

Experiment 2: Neural Baseline (GRU Encoder)

Maybe the problem is that any flat string representation is doomed. What if we let the agent learn its own representation?

Character-level GRU encoder: Maps equation strings to 64-dim embeddings
MLP Q-network: Predicts Q-values from (state embedding, action encoding)
Experience replay: Batch size 32, buffer size 10K
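
The data-handling side of this baseline can be sketched with the standard library alone. The snippet below shows a replay buffer with the stated sizes (capacity 10K, batch size 32) and a character-to-integer encoding of the kind a character-level GRU would consume. The class and function names are hypothetical, and the GRU and MLP themselves are omitted.

```python
import random
from collections import deque
from typing import NamedTuple

class Transition(NamedTuple):
    state: str
    action: str
    reward: float
    next_state: str
    done: bool

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are evicted first."""

    def __init__(self, capacity=10_000, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        # Uniform sampling without replacement from the stored transitions.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

def encode_chars(state: str, vocab: dict) -> list:
    """Map each character to an integer id, growing the vocab on first sight.

    Id 0 is left unused so it can serve as padding.
    """
    return [vocab.setdefault(ch, len(vocab) + 1) for ch in state]
```

Each sampled batch would be encoded with encode_chars, padded to a common length, and passed through the encoder to produce the 64-dim state embeddings.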

Method	Depth-1	Depth-2
Tabular Q-Learning	5%	1%
Neural Q-Learning (GRU)	10%	4%

Result: Neural encoding provides a 2–4× improvement over tabular Q-learning (5% → 10% on depth-1, 1% → 4% on depth-2), but 4% on depth-2 is still a complete failure.

Why Does This Happen?

The extended experiments rule out sample complexity. Even with 50,000 training episodes, 99% state space compression, and learned neural representations, agents still can't solve depth-2 equations.

The problem is representations that don't capture compositional structure.

Solving ax + b = c requires a sequence of dependent operations:

Move b to right side → ax = c - b
Divide by a → x = (c - b)/a

A flat string representation can't expose this dependency. The agent has no way to know that "the thing I do now affects what rewrites are possible later."
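
The dependency is easy to see if you track which rules are applicable at each step. The sketch below uses a deliberately tiny encoding, a tuple (a, b, c) standing for ax + b = c, which is my own simplification and not the environment's representation.

```python
from fractions import Fraction

def valid_rules(a, b, c):
    """Which rewrites apply to a*x + b = c in this toy encoding."""
    rules = []
    if b != 0:
        rules.append("move_const_l_to_r")
    if b == 0 and a not in (0, 1):
        rules.append("divide_linear")  # only unlocked once b has been moved
    return rules

def apply_rule(rule, a, b, c):
    if rule == "move_const_l_to_r":
        return a, Fraction(0), c - b
    if rule == "divide_linear":
        return Fraction(1), Fraction(0), c / a
    raise ValueError(f"unknown rule: {rule}")

state = (Fraction(3), Fraction(5), Fraction(11))   # 3x + 5 = 11
print(valid_rules(*state))                          # ['move_const_l_to_r']
state = apply_rule("move_const_l_to_r", *state)     # 3x = 6
print(valid_rules(*state))                          # ['divide_linear']
state = apply_rule("divide_linear", *state)         # x = 2
print(state[2])                                     # 2
```

The second rule only becomes available after the first has been applied, and the reward only arrives after the second.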

What Would Actually Work?

Based on these results, here's what you'd need to solve even depth-2 equations with RL:

Tree-structured encoder: Tree-LSTM or GNN over the AST
Symbolic priors: Hard-coded knowledge of algebraic equivalences
Hierarchical policies: "First isolate the variable, then simplify"
Search guidance: MCTS or beam search instead of pure RL
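
As a rough illustration of the last item, here is a breadth-first search over rewrite states. To be clear about what this is: plain BFS stands in for MCTS or beam search because it is the simplest search to write down, the tuple (a, b, c, d) for ax + b = cx + d is a toy encoding of my own, and the rule names move_var_r_to_l and scale_by_2 are hypothetical (the latter is a deliberately useless distractor).

```python
from collections import deque
from fractions import Fraction

ZERO, ONE = Fraction(0), Fraction(1)

def successors(state):
    """All (rule_name, next_state) pairs for a*x + b = c*x + d."""
    a, b, c, d = state
    out = []
    if b != 0:
        out.append(("move_const_l_to_r", (a, ZERO, c, d - b)))
    if c != 0:
        out.append(("move_var_r_to_l", (a - c, b, ZERO, d)))
    if b == 0 and c == 0 and a not in (0, 1):
        out.append(("divide_linear", (ONE, ZERO, ZERO, d / a)))
    # Distractor: always legal, never helpful.
    out.append(("scale_by_2", (2 * a, 2 * b, 2 * c, 2 * d)))
    return out

def is_solved(state):
    a, b, c, _ = state
    return a == 1 and b == 0 and c == 0

def solve_bfs(start, max_depth=6):
    """Return (rule_names, final_state) for a shortest solution, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if is_solved(state):
            return path, state
        if len(path) >= max_depth:
            continue
        for name, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

start = tuple(Fraction(v) for v in (2, -3, 1, 7))   # 2x - 3 = x + 7
path, final = solve_bfs(start)
print(path, final[3])  # ['move_const_l_to_r', 'move_var_r_to_l'] 10
```

Exhaustive search like this only works because the toy rule set is tiny; a learned policy or value function would be used to decide which branches to expand as the rule set grows.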

This is broadly what AlphaProof and similar systems do. They don't use vanilla RL; they use heavily structured architectures with symbolic reasoning baked in.

Key Takeaways

Tabular RL breaks at embarrassingly low complexity: Depth-2 equations (8th grade algebra) are essentially unsolvable.
It's not a data problem: 50K episodes, 99% state compression, and neural encoders all fail. The issue is representational.
Flat representations can't capture composition: String-based or sequence-based encodings miss the hierarchical structure needed for multi-step reasoning.
This motivates structured architectures: Modern neuro-symbolic systems succeed because they incorporate tree encoders, symbolic priors, and search.

Code & Reproducibility

All code is available on GitHub. The experiments ran on Azure ML with H100 GPUs (though most runs were CPU-only). Total compute cost: ~$170 for the extended experiments.