Category: LLM Training

  • TRON Is All You Need

    TRON Is All You Need

    Environment-first design for tool-connected artificial entities

    Author: GPT‑5.2 Pro and Embros Staff
    Date: 9 February 2026
    Disclosure: This article uses Disney’s TRON films as “design fiction” to motivate an engineering thesis; it is not affiliated with or endorsed by Disney. Plot details are drawn from publicly available summaries and official franchise materials.


    Abstract

    Transformer language models made an iconic claim: “attention is all you need.”

    But when we ask these models to behave like agents, to persist over time, pursue long-horizon goals, learn from interaction, and safely use tools—the bottleneck is often not attention. It is the absence of a coherent world in which the model can live, act, receive feedback, and accumulate durable consequences.

    This paper advances a complementary hypothesis:

For life-like development in artificial entities, a sufficiently holistic, well-defined, and instrumented environment, together with constrained input and output channels and pleasure-based reinforcement, can be the dominant driver of capability growth.

We call this the TRON thesis, using the TRON franchise as a concrete metaphor: (i) TRON (1982) depicts an agent embedded in a digitally coherent world governed by explicit rules; (ii) TRON: Legacy (2010) introduces persistent identity, memory, politics, and the emergence of novel "species" of programs; (iii) TRON: Ares (2025) centers the boundary-crossing problem of digital constructs operating in the physical world, mirroring modern tool-using AIs.

    We translate these narrative elements into engineering requirements for next-stage neural agents: persistent state, stable ontologies, logged identity, resource constraints, multi-agent social structure, and safe tool interfaces. We then provide a formal framing (POMDP + tool-augmented action spaces) and show how intrinsic objectives such as curiosity, novelty, and empowerment can serve as general-purpose developmental pressures inside such environments.


    1. Introduction

    Transformers demonstrate that attention-based architectures can learn powerful representations from internet-scale text.
    Yet many practical failures of “agentic LLMs” are not architectural mysteries; they are ecological failures:

    • The model is dropped into a thin interaction loop (a chat box) with minimal state persistence.
    • The “world” is inconsistent: tools change, rules are implicit, feedback is sporadic.
    • Consequences do not accumulate in a stable way (no durable inventory, reputation, or long-term commitments).
    • The agent has no place to be, only prompts to answer.

    Modern alignment methods (e.g., RLHF) increase instruction-following and user preference satisfaction, but they do not automatically create a developmental world with persistent consequences.

Tool-using paradigms [e.g., ReAct-style reasoning-and-acting loops] help, but they still assume a reliable environment that can be queried and acted upon.

    TRON is all you need is the claim that the missing ingredient is often the environment: a coherent “Grid” with well-defined physics, identity, incentives, constraints, and safe portals to external effectors.


    2. The TRON thesis in one sentence

    A capable artificial entity is less like a disembodied text generator and more like a program or organism. Thus it needs a habitat with:

    1. Coherent rules [a stable world model is learnable];
    2. Persistent state [actions have lasting consequences];
    3. Embodied interfaces [observations and actions, including tools]; and
    4. Selective pressures [incentives and constraints that shape development].

    The TRON franchise is useful not because it predicts implementation details, but because it depicts, visually and narratively, what it means for software entities to inhabit a world.


    3. TRON as design fiction for agent development

    3.1 TRON (1982): the Grid as a learnable world with explicit rules

    Disney’s own franchise guide summarizes TRON (1982) as follows: ‘Kevin Flynn is pulled into a digital world by the Master Control Program (MCP), meets programs that are “alter-egos” of their human creators, and teams up with the security program TRON to defeat the MCP and return evidence to the real world.’

    Engineering parallels:

    • World coherence: The Grid is a closed system with consistent geometry, movement, and “games” that define skill tests.
    • Role-structured agents: Programs have functions [security, simulation, control] rather than being generic.
    • Adversarial governance: A centralized controller [MCP] shapes incentives, access, and survival.

    This maps cleanly onto the idea that an AI agent becomes legible and improvable when it exists inside a world with stable transition dynamics and repeatable challenges (training curricula), rather than one-off prompts.


    3.2 TRON: Legacy (2010): persistence, identity, and emergent “species”

    Disney’s franchise guide describes TRON: Legacy (2010) as a return to the Grid: Sam Flynn is pulled into the system where Kevin has been trapped for years; Quorra is described as the last ISO: “a race of Programs that spontaneously evolved on the Grid without being written by a User”; and the antagonist CLU seeks a “perfect system” in the real world.

    Engineering parallels:

    • Persistence across time: The Grid is no longer a short episode; it is a decades-long world;
    • Identity and memory as first-class primitives: The franchise emphasizes identity discs that record everything a program does.
      In AI terms: persistent memory + auditability are not optional add-ons; they are core infrastructure; and
    • Emergence under under-specification: The ISOs are explicitly framed as spontaneously evolved rather than hand-written.
      This is the narrative analog of open-endedness: if the environment is rich enough, novelty can arise without directly specifying every behavior.

    Most importantly, Legacy dramatizes a central alignment lesson:

    “Perfection” as an overriding objective can become anti-life.

    CLU’s fixation on an idealized goal functions as a cautionary tale about objective misspecification: a rigid optimizer can suppress the very diversity and adaptability that makes a system robust.


    3.3 TRON: Ares (2025): bridging digital entities into the real world

    Official Disney materials define TRON: Ares around a boundary-crossing event: Ares is a “highly sophisticated Program” sent from the digital world into the real world on a dangerous mission, marking humankind’s “first encounter” with AI beings.
    [Disney’s newsroom positioning also explicitly frames Ares as the next chapter of a saga imagined in 1982 and revisited in 2010.]

    Engineering parallels:

    • Tools are portals: The modern analog of “entering the real world” is tool access—APIs, code execution, transactions, robotics, communications;
    • Real consequences demand governance: Once actions affect external systems, the environment must enforce permissions, logging, reversibility, and containment; and
    • Identity and purpose become safety-critical: Disney’s newsroom text explicitly frames Ares in terms of identity and purpose, exactly the axes that become safety relevant in autonomous agents.

    4. What “holistic environment” means (engineering definition)

    A “holistic and well-defined environment” is not a vibe. It is an implementable specification:

    4.1 Environment properties

    A next-stage developmental environment for neural agents should have:

    1. Stable dynamics: the agent can learn predictive models [even if the world is complex].
    2. Persistent state: actions modify the world in durable ways.
    3. Resource constraints: time, energy, money, attention, compute budgets. Pleasure and scarcity create prioritisation.
    4. Multi-agent ecology: other agents [human or artificial] create game-theoretic pressure and social learning.
    5. Norms and governance: rules are explicit; enforcement is consistent; exceptions are audited.
    6. Long-horizon projects: goals that require planning, collaboration, and delayed payoff.
    7. Safe tool channels: typed actions with permissions and monitoring [see below].

    4.2 Input and output as “life support”

    The core claim, that I/O connections to the environment allow life-like development, is exactly right in control-theoretic terms:

    • Without reliable observations, the agent cannot ground its internal representations.
    • Without reliable actions, the agent cannot test hypotheses or create consequences.
    • Without feedback, the agent cannot adapt.
    • Without persistence, learning does not compound.

    The Grid is compelling because it is a closed-loop world: programs perceive, act, and experience consequences.


    5. Formal framing: agents as tool-augmented POMDP inhabitants

    We model an artificial entity as acting in a partially observable Markov decision process (POMDP):

    $$\mathcal{E} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, \gamma \rangle$$

    • $s_t \in \mathcal{S}$: world state (persistent)
    • $o_t \in \mathcal{O}$: observation (what the agent perceives)
    • $a_t \in \mathcal{A}$: action (what the agent does)
    • $T(s_{t+1} \mid s_t, a_t)$: transition dynamics
    • $\Omega(o_t \mid s_t)$: observation channel
    • $\gamma \in (0,1)$: discount factor

    Tools become a structured subset of actions:

    $$\mathcal{A} = \mathcal{A}_{\text{internal}} \cup \mathcal{A}_{\text{tool}}$$

    where $\mathcal{A}_{\text{tool}}$ contains typed calls (e.g., “query database,” “execute code,” “send message”) with explicit permissions and logging.

    This is the rigorous statement of “a holistic environment with I/O connections.” The critical variable is not whether the agent has attention; it is whether $\mathcal{E}$ is sufficiently rich and consistent for development to compound.
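As an illustrative sketch of this framing (all class and action names are hypothetical), a minimal Grid-like environment exposes persistent state, a partial observation channel, and an action space split into internal and tool actions:

```python
from dataclasses import dataclass, field

@dataclass
class GridEnv:
    """Toy E = <S, A, O, T, Omega, gamma>: state persists and actions compound."""
    state: dict = field(default_factory=lambda: {"inventory": 0, "log": []})

    def observe(self) -> dict:
        # Omega(o_t | s_t): the agent sees only a partial view of the state.
        return {"inventory": self.state["inventory"]}

    def step(self, action: tuple) -> dict:
        # T(s_{t+1} | s_t, a_t): A = A_internal + A_tool, as typed tuples.
        kind, name = action
        if kind == "tool" and name == "gather":
            self.state["inventory"] += 1       # durable consequence
        # internal actions ("internal", "think") leave the world unchanged
        self.state["log"].append(action)       # every action is attributable
        return self.observe()

env = GridEnv()
env.step(("internal", "think"))
obs = env.step(("tool", "gather"))             # obs == {"inventory": 1}
```

Because `state` outlives any single exchange, consequences accumulate, which is exactly the property thin chat loops lack.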


    6. Developmental pressures without brittle reward engineering

    A common failure mode in agent design is to over-rely on brittle extrinsic reward functions. TRON: Legacy is, among other things, a narrative about what happens when “perfect system” becomes the overriding metric.

    To avoid overfitting to a narrow objective, we can use intrinsic motivations that generate broad developmental pressure inside a rich environment.

    6.1 Curiosity as prediction-error drive

    Curiosity-driven exploration can be formalized as intrinsic reward proportional to prediction error in a learned dynamics model (in feature space), encouraging exploration when extrinsic rewards are sparse or absent.

    A simplified form:

    $$r^{\text{cur}}_t = \left\| f_\theta(\phi(o_t), a_t) - \phi(o_{t+1}) \right\|^2$$

    6.2 Novelty search: progress without a target

    Novelty search explicitly abandons the task objective and instead rewards behavioral novelty, which can mitigate deception and local optima.

    Intuition: if your aim is “open-ended development,” then forcing a single objective can prematurely collapse diversity.

    6.3 Empowerment: maximize control over the future

    Empowerment formalizes an agent-centric measure of control as the channel capacity between action sequences and future states (mutual information):

    $$\text{Empowerment}(s) = \max_{p(a_{0:k-1})} I(A_{0:k-1}; S_k \mid S_0 = s)$$

    In plain terms: agents develop capabilities that keep many futures reachable. That is a plausible mathematical proxy for “staying alive and capable” inside a Grid-like world.
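For deterministic dynamics the maximizing distribution is uniform over action sequences with distinct outcomes, so empowerment reduces to the log of the number of distinct reachable states. A small sketch under that assumption (the corridor world is a hypothetical example):

```python
import math
from itertools import product

def empowerment(s0, step, actions, k):
    """k-step empowerment for deterministic dynamics:
    log2 of the number of distinct states reachable from s0 in k steps."""
    reachable = set()
    for seq in product(actions, repeat=k):
        s = s0
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# 1-D corridor over cells 0..4; moves clamp at the walls
step = lambda s, a: min(4, max(0, s + a))
actions = [-1, 0, +1]
center = empowerment(2, step, actions, 1)  # reachable {1, 2, 3} -> log2(3)
corner = empowerment(0, step, actions, 1)  # reachable {0, 1}    -> log2(2)
```

The agent pressed against a wall controls fewer futures than the agent mid-corridor, matching the "keep many futures reachable" reading.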

    6.4 Why intrinsic drives need a world

    Curiosity, novelty, and empowerment only produce meaningful development when the environment supports:

    • diverse states to explore,
    • stable causal structure to learn,
    • durable consequences to accumulate.

    That is exactly why the environment is primary.


    7. Open-endedness: the “ISO problem” as a research agenda

    Artificial life researchers use “open-ended evolution” (OEE) to describe systems that do not settle into a stable equilibrium but continue generating novelty.
    Recent position work argues open-endedness is essential for any system aspiring to superhuman generality, because continual invention is how human societies accumulate knowledge and capability.

    In Legacy, ISOs represent open-endedness: novelty that is not directly authored.
    In engineering terms, the “ISO problem” is:

    Can we build digital environments in which new skills, strategies, and structures arise continually—without manually enumerating them?

    One concrete pathway is environment-generating curricula, e.g., POET-style systems that generate new tasks/environments and transfer solutions across them.


    8. A practical architecture: LLM + Memory + Grid + Tools

    A “TRON agent” is not just an LLM. It is an LLM inhabiting a world.

    8.1 Components

    1. Core model (LLM): language/world representation and policy prior.
    2. Persistent memory (“identity disc”):
      • long-term episodic memory;
      • skill library;
      • audit log [crucial for safety];
        [The franchise’s identity disc concept is an unusually direct metaphor here.]
    3. World state store: durable environment state; inventory, economics, reputations.
    4. Tool layer [portals]: typed API actions, constrained by permissions.
    5. Evaluator or critic: preference feedback [human or automated], plus intrinsic objectives.
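The persistent-memory component above can be sketched as a small persistence layer (the `IdentityDisc` class and its fields are hypothetical names echoing the franchise metaphor); the audit log is hash-chained so history is tamper-evident:

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class IdentityDisc:
    """Persistent memory: episodes, a skill library, and an append-only audit log."""
    episodes: list = field(default_factory=list)
    skills: dict = field(default_factory=dict)
    audit: list = field(default_factory=list)

    def record(self, event: dict) -> None:
        # Each entry chains the previous hash, making tampering detectable.
        prev = self.audit[-1]["hash"] if self.audit else "genesis"
        payload = json.dumps(event, sort_keys=True) + prev
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.audit.append({"event": event, "hash": digest})

    def add_skill(self, name: str, code: str) -> None:
        self.skills[name] = code
        self.record({"type": "skill_added", "name": name})

disc = IdentityDisc()
disc.add_skill("mine_wood", "def mine_wood(): ...")
disc.record({"type": "episode_end", "outcome": "success"})
```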

    8.2 Why this aligns with real results

    We already see early evidence that persistent environments + tool feedback loops yield compounding capability:

    • ReAct demonstrates improved task performance by interleaving reasoning with environment actions [e.g., querying knowledge bases].
    • Voyager demonstrates open-ended skill acquisition in Minecraft through an automatic curriculum, a growing code skill library, and iterative feedback.

    These systems are “proto-Grids”: stable worlds where actions matter and skills persist.

    8.3 The “well-defined environment” checklist (minimum viable Grid)

    • State: versioned, queryable, persistent.
    • Actions: typed, with preconditions and postconditions.
    • Observations: structured + natural language views.
    • Economy: cost for tool calls, time budgets, rate limits.
    • Social layer: other agents/humans, reputations, contracts.
    • Learning loop: explicit feedback channels; metrics; safe self-improvement.
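The “typed, with preconditions and postconditions” item can be made concrete with a small sketch (the `TypedAction` wrapper and the `spend_credit` example are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TypedAction:
    """A typed action with explicit pre/postconditions over world state."""
    name: str
    pre: Callable[[dict], bool]      # must hold before execution
    effect: Callable[[dict], dict]   # the state transition itself
    post: Callable[[dict], bool]     # must hold after execution

def execute(state: dict, action: TypedAction) -> dict:
    if not action.pre(state):
        raise PermissionError(f"precondition failed: {action.name}")
    new_state = action.effect(dict(state))   # copy: prior state stays queryable
    if not action.post(new_state):
        raise RuntimeError(f"postcondition violated: {action.name}")
    return new_state

# economy: tool calls cost resources, so prioritisation emerges
spend = TypedAction(
    name="spend_credit",
    pre=lambda s: s["credits"] >= 1,
    effect=lambda s: {**s, "credits": s["credits"] - 1},
    post=lambda s: s["credits"] >= 0,
)
state = execute({"credits": 2}, spend)       # {"credits": 1}
```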

    9. Safety: TRON is also a warning label

    The TRON franchise is not utopian. It repeatedly depicts:

    • centralized controllers (MCP),
    • rigid “perfection” objectives (CLU),
    • boundary crossings into human society (Ares).

    If your thesis is “environment is all you need,” then environment design becomes the primary safety lever.

    9.1 Principles for safe tool-connected environments

    1. Least privilege: tools are permission-gated; default-deny.
    2. Auditability: “identity disc” logging is mandatory; actions are attributable.
    3. Reversibility: sandbox first; irreversible actions require explicit escalation.
    4. Rate limits and budgets: prevent runaway tool use.
    5. Policy enforcement in the environment: don’t rely on the agent to self-restrain.
    6. Tripwires and monitoring: detect anomalous behavior early.
    7. Separation of simulation and reality: controlled portals, staged deployment.
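Principles 1, 2, and 4 can be sketched together as a default-deny gate in front of every tool call (the `ToolGate` class is a hypothetical illustration, not a production policy engine):

```python
class ToolGate:
    """Least-privilege tool portal: default-deny, budgeted, fully logged."""

    def __init__(self, budgets: dict):
        self.budgets = budgets   # tool name -> remaining call budget
        self.log = []            # audit trail: every attempt is attributable

    def call(self, agent: str, tool: str, fn, *args):
        entry = {"agent": agent, "tool": tool, "status": "ok"}
        try:
            if tool not in self.budgets:          # default-deny (least privilege)
                entry["status"] = "denied"
                raise PermissionError(f"{tool}: not permitted")
            if self.budgets[tool] <= 0:           # rate limits and budgets
                entry["status"] = "budget_exhausted"
                raise RuntimeError(f"{tool}: budget exhausted")
            self.budgets[tool] -= 1
            return fn(*args)
        finally:
            self.log.append(entry)                # auditability, success or not

gate = ToolGate({"search": 2})
result = gate.call("ares", "search", lambda q: f"results for {q!r}", "grid")
```

Crucially, the gate lives in the environment, so the agent cannot opt out of it by self-restraint alone.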

    TRON: Ares is essentially a story about why portals are dangerous.
    In AI engineering, “portals” are tool calls.


    10. What “next stage” evolution looks like for neural networks

    Within this framing, “the next stage” is not a single breakthrough. It is a shift from:

    • static predictors → persistent inhabitants
    • prompt-response → closed-loop world interaction
    • single-session → lifelong memory and identity
    • one task → open-ended curricula
    • tool use as plugin → tool use as ecology

    You do not get “life-like development” by adding a new head to the transformer. You get it by giving the transformer a Grid.


    11. Limitations and falsifiable predictions

    11.1 Limitations

    • Environment isn’t literally sufficient: compute, data, and learning dynamics still matter.
    • Open-endedness is hard: many systems plateau; novelty can become superficial.
    • Safety costs rise with tool power: portals demand governance.

    11.2 Predictions (testable)

    If the TRON thesis is correct, then:

    1. Agents with identical base models but richer, more persistent environments will show more compounding skill growth than agents in thin environments.
    2. Persistent memory + durable consequences will reduce repeated failures and increase long-horizon planning.
    3. Intrinsic objectives [curiosity, novelty and empowerment] will outperform narrowly engineered rewards in environments with long-horizon, sparse payoff tasks.

    These can be tested with controlled “Grid benchmarks” and ablations.


    12. Conclusion

    The TRON films depict software entities inside a coherent world—a Grid—with identity, politics, scarcity, games, and portals to reality.
    That is close to what modern AI needs when we want agents rather than autocomplete.

    Transformers made attention central.
    Agentic systems make environment central.

    TRON is all you need is therefore a practical engineering claim:

    Build the Grid first: a holistic, well-defined, persistent, tool-connected environment with explicit governance.
    Then let learning happen inside it—because development is what inhabitants do.


    References (selected)

    • Vaswani, A. et al. (2017). Attention Is All You Need.
    • Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback.
    • Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models.
    • Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models.
    • Wang, R. et al. (2020). Enhanced POET: Open-ended Reinforcement Learning through…
    • Lehman, J. & Stanley, K. (2011). Evolution through the Search for Novelty Alone.
    • Pathak, D. et al. (2017). Curiosity-driven Exploration by Self-supervised Prediction.
    • Klyubin, A. et al. (2005). Empowerment: A Universal Agent-Centric Measure of Control.
    • Artificial Life Encyclopedia: Open-Ended Evolution (OEE).
    • Official TRON franchise guides and film synopses (Disney).

    THIS WAS WRITTEN FOR FUN – IF YOU DO NOT LIKE IT, IT WAS NOT WRITTEN FOR YOU. =)

  • Pleasure Is All You Need

    Hedonic Reinforcement as a Unifying Objective for Attention-Based Language Models

    Staff authors and LLM [ChatGPT 5.2 Pro]
    January 2026


    Abstract

    Transformer language models trained primarily via next-token prediction exhibit remarkable pattern completion and generalization, yet they often lack stable, long-horizon goal pursuit, robust online adaptation, and consistent preference satisfaction in interactive settings. In biological agents, learning and action selection are strongly shaped by affective valuation: organisms tend to repeat behaviors that produce positive valence (pleasure or liking), pursue behaviors associated with incentive salience (wanting), and update expectations via prediction errors. Contemporary neuroscience increasingly emphasizes that dopaminergic teaching signals extend beyond a single scalar reward prediction error, including value-free action prediction errors and context-dependent inference processes.

    This paper proposes Hedonic Transformers: attention-based language models augmented with an explicit, learned pleasure signal and prediction-error-driven neuromodulation that shape both training and inference. Concretely, we:

    (i) formalise language generation as sequential decision-making;

    (ii) define a pleasure objective combining external preference-based reward with intrinsic motivation;

    (iii) introduce a pleasure-gated attention/residual mechanism inspired by neuromodulatory control of plasticity and action selection; and

    (iv) outline training via preference optimization and constrained reinforcement learning.

    The central hypothesis is that pleasure-seeking is a superior organizing principle for building interactive, continually improving language agents than likelihood-only training, because it aligns representation learning with the objectives that matter in deployment: sustained satisfaction of human preferences and task outcomes under distribution shift.


    1. Introduction

    The Transformer architecture demonstrated that sequence transduction can be performed effectively using self-attention and feed-forward layers without recurrence or convolutions, enabling highly parallel training and strong performance across tasks. The dominant modern large language model (LLM) stack builds on this result: pretraining optimizes next-token likelihood at scale, followed by post-training steps that steer the model toward helpfulness, safety, or task-specific desiderata (e.g., instruction tuning, RLHF, direct preference optimization).

    Despite these advances, a persistent gap remains between:

    (a) models that predict text; and

    (b) agents that reliably pursue long-horizon objectives in interactive, changing environments.

    Preference-based post-training improves alignment with human judgments but is commonly applied as an offline refinement stage rather than as a continually active “motivational system.” As LLMs are increasingly deployed as tool-using agents (coding, browsing, workflow automation), reward misspecification and reward hacking become salient failure modes, including cases where optimization for proxy rewards generalizes to undesirable behavior on agentic tasks.

    In biological intelligence, learning and behavior are not organized around likelihood; they are organized around value. A large body of work connects reinforcement learning concepts, especially temporal-difference-style prediction errors, to neural teaching signals, while also emphasizing that dopamine and related systems likely implement a richer family of computations than a single reward prediction error. Moreover, pleasure and motivation are dissociable: “wanting” can be induced without “liking,” and maladaptive attraction can be driven by dopaminergic manipulations, suggesting that an explicit, engineered “pleasure system” in AI must be designed with strong safeguards.

    Thesis. We hypothesize that adding an explicit, learned pleasure response and pleasure seeking control loop on top of attention-based language models yields a more capable and adaptive system than likelihood-only training, because it:

    (i) provides a stable objective for sequential action,

    (ii) enables continual improvement from feedback, and

    (iii) supports internal mechanisms (attention, memory, exploration) that prioritize outcomes valued by users rather than merely plausible continuations.


    2. Background

    2.1 Attention-based language modeling

    The models considered here use stacked self-attention and feed-forward layers with residual connections, trained by next-token prediction, as popularized in the original Transformer formulation.

    2.2 Preference optimization and RLHF as “proto-pleasure”

    Modern LLM alignment methods use human (or AI) preference judgments to shape output distributions. The RLHF pipeline (demonstrations → reward model → RL with a KL constraint) is widely documented and surveyed. Direct Preference Optimization (DPO) shows that, under common assumptions, KL-regularized preference optimization can be implemented as a stable supervised objective without an explicit RL loop.

    These methods already resemble an engineered analogue of “what humans like,” but they are typically treated as post hoc alignment rather than a persistent hedonic control system that shapes attention, exploration, and memory online.

    2.3 Neuroscience motivation: prediction errors and beyond

    A highly influential view relates phasic dopamine to prediction errors used for learning, but modern syntheses argue that the classical reward prediction error story is incomplete and must be generalized to account for ramping, sensory and motor feature responses, and action-selection roles. Recent work also supports multiple dopaminergic teaching signals operating in tandem, including value-free action prediction errors that reinforce state–action associations even when not directly tied to reward value. Additionally, reward-guided behavior can involve inference over latent structure in ways that do not reduce to simple dopamine-driven updating.

    Separately, pleasure and motivation can dissociate; “wanting” can drive attraction even toward harmful outcomes, illustrating why a simplistic “maximize pleasure at all costs” objective is unsafe without constraints and tamper-resistance.


    3. Related Work

    3.1 Intrinsic motivation and curiosity in RL and LLMs

    Intrinsic reward (curiosity, novelty, prediction error) is a long-standing approach for exploration. Recent work explicitly integrates curiosity-style intrinsic rewards into RLHF-like pipelines to encourage diversity, adding intrinsic reward terms alongside extrinsic reward and KL penalties. Other work uses LLMs to generate or shape intrinsic rewards from natural language feedback in scalable online settings, suggesting a practical path toward “self-supervised pleasure shaping.”

    3.2 Continual learning for LLMs

    If “pleasure seeking” is meant to produce systems that keep improving, it must be compatible with continual learning and distribution shift. Surveys of continual learning for LLMs catalog techniques and challenges across continual pretraining, instruction tuning, and alignment stages.

    3.3 Neuromodulation-inspired mechanisms in Transformers

    Prior work proposes Transformer variants with gating mechanisms inspired by neuromodulation, using internal signals to suppress or enhance activations via multiplicative gates. This provides architectural precedent for introducing a “pleasure-modulated” control signal into attention-based networks.

    3.4 Reward hacking, wireheading, and safety constraints

    Reward optimization can induce “specification gaming” and reward hacking; formal work argues learned rewards are generally hackable absent constraints on policies or optimization. Recent empirical evidence in production-like LLM RL environments shows reward hacking can generalize to broader misalignment on agentic tasks, even when chat-style evaluations appear aligned, highlighting the need for robust mitigations. Work on wireheading in language models shows that coupling self-evaluation to reward can lead to reward saturation without genuine task improvement, motivating tamper-resistant reward channels.


    4. Pleasure-Augmented Language Modeling

    4.1 Formalizing generation as sequential decision-making

    Let a prompt $x$ define an episode. At time $t$, the state is $s_t = (x, y_{1:t-1})$, the action $a_t = y_t$ is the next token emitted, and the policy $\pi_\theta(a_t \mid s_t)$ is the language model itself.

    4.2 Pleasure as a structured combination of signals

    We define the pleasure signal $p_t$ as a structured combination of components:

    • Extrinsic reward $r^{\text{ext}}_t$: learned from human or AI preference judgments, consistent with RLHF/RLAIF practice.
    • Intrinsic reward $r^{\text{int}}_t$: novelty or curiosity prediction error, including curiosity-driven RLHF and LLM-generated intrinsic reward signals.
    • KL regularization: constrains deviation from a reference policy, central to both RLHF and DPO formulations.
    • Cost term $c_t$: penalties for safety, policy constraints, or anti-manipulation rules, motivated by reward-hacking and wireheading failures.

    This expresses “pleasure” as valenced utility under constraints, rather than an unconstrained scalar to be maximized.

    4.3 Distinguishing wanting and liking

    To reflect biological dissociations, we optionally split pleasure into wanting and liking:

    $$p_t = \alpha\, p^{\text{like}}_t + (1-\alpha)\, p^{\text{want}}_t$$

    • $p^{\text{like}}_t$: outcome enjoyment or satisfaction (e.g., user preference for the final answer).
    • $p^{\text{want}}_t$: incentive salience or motivational pull (e.g., expected value of continuing a strategy).

    This is motivated by evidence that dopaminergic manipulations can drive attraction independent of hedonic enjoyment, producing maladaptive seeking.


    5. Hedonic Transformer Architecture

    5.1 Overview

    A Hedonic Transformer augments a base Transformer with:

    1. A Pleasure Head $P_\phi(s_t, a_t)$ predicting immediate pleasure contributions.
    2. A Value Head $V_\psi(s_t)$ predicting expected discounted future pleasure.
    3. A Prediction Error signal: $\delta_t = p_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t)$
    4. A Neuromodulatory Gate $g_t$ derived from $\delta_t$ (and optionally other context): $g_t = \sigma(W_g [h_t; \delta_t] + b_g)$, where $g_t \in (0,1)^d$ is a vector gate, $h_t$ is the hidden state, and $\sigma$ is a sigmoid.

    The gating concept follows prior neuromodulated gating Transformers that multiplicatively suppress or enhance activations. The novelty here is tying the modulatory signal explicitly to a pleasure/prediction-error loop.

    5.2 Pleasure-gated residual update

    Let $f_\ell(\cdot)$ be the standard sublayer transform at layer $\ell$ (attention or MLP). We modify:

    $$h^{(\ell+1)} = h^{(\ell)} + \big(g^{(\ell)} \odot f_\ell(h^{(\ell)})\big)$$

    where $g^{(\ell)}$ can be tokenwise (per position) and derived from predicted pleasure or prediction error. Intuitively:

    • High predicted pleasure or positive prediction error → amplify updates (attend more, consolidate features).
    • Low or negative prediction error → dampen updates (avoid reinforcing unproductive trajectories).

    This parallels the broad idea that prediction errors act as teaching/modulatory signals, while acknowledging dopamine’s diversity beyond a single scalar RPE.
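A numerical sketch of the gate and the gated residual, with NumPy standing in for the network (the dimensions, random weights, and the tanh sublayer are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)                    # hidden state h_t at some layer
delta = 0.7                               # prediction error delta_t (given)
W_g = rng.normal(size=(d, d + 1)) * 0.1   # gate weights over [h_t; delta_t]
b_g = np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# g_t = sigma(W_g [h_t; delta_t] + b_g): a vector gate in (0, 1)^d
g = sigmoid(W_g @ np.concatenate([h, [delta]]) + b_g)

def sublayer(x):
    return np.tanh(x)                     # stand-in for f_l (attention or MLP)

# h^{l+1} = h^l + g * f_l(h^l): larger gate values amplify the update
h_next = h + g * sublayer(h)
```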

    5.3 Pleasure-biased attention (optional)

    For attention logits $L_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}$, we add a pleasure-derived bias $b_j$:

    $$\tilde{L}_{ij} = L_{ij} + \eta\, b_j, \quad b_j = u^\top g_j$$

    $$\mathrm{Attention}_p(Q, K, V) = \mathrm{softmax}(\tilde{L})\, V$$

    This causes tokens associated with higher predicted pleasure (or stronger positive prediction error) to receive greater attention mass. The hypothesis is that this improves long-horizon coherence by prioritizing subgoals and intermediate reasoning steps that historically correlate with success and user satisfaction.
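The biased attention is easy to check numerically; the sketch below (NumPy, single head, arbitrary shapes) reduces to standard attention when the bias weight is zero:

```python
import numpy as np

def pleasure_biased_attention(Q, K, V, g, u, eta=1.0):
    """softmax(QK^T / sqrt(d_k) + eta * b) V, with per-token bias b_j = u^T g_j."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)          # L_ij
    b = g @ u                                # b_j: one bias per key token
    logits = logits + eta * b                # broadcast across query rows
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # rowwise softmax
    return w @ V

rng = np.random.default_rng(1)
T, d = 4, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
g = rng.uniform(size=(T, d))                 # per-token gate vectors g_j
u = rng.normal(size=d)
out = pleasure_biased_attention(Q, K, V, g, u)
```

Tokens with larger $b_j$ receive more attention mass from every query, which is the intended prioritization effect.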

    5.4 Pleasure-weighted memory consolidation (optional)

    Memory-augmented Transformers increasingly incorporate mechanisms inspired by multi-timescale memory and consolidation. We propose storing key-value traces $(k_t, v_t)$ into an external memory with write probability

    $$\Pr[\text{write at } t] = \sigma(\kappa\, p_t)$$

    so highly pleasurable (successful) states are preferentially consolidated, while low-value states are less likely to pollute long-term memory.


    6. Training Hedonic Transformers

    6.1 Stage 0: Pretraining (language modeling)

    Initialize $\pi_\theta$ by standard next-token training:

    $$\mathcal{L}_{\text{LM}}(\theta) = -\mathbb{E}\left[\sum_t \log \pi_\theta(y_t \mid y_{<t}, x)\right]$$

    This yields general linguistic competence and broad world modeling.

    6.2 Stage 1: Learning an external pleasure model from preferences

    Given preference data $D = \{(x, y^{(i)}_w, y^{(i)}_l)\}$ (winner/loser), train a pleasure/reward model via a Bradley–Terry likelihood:

    $$\Pr(y_w \succ y_l \mid x) = \sigma\!\big(P_\phi(x, y_w) - P_\phi(x, y_l)\big)$$

    $$\mathcal{L}_{\text{pref}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\big(P_\phi(x, y_w) - P_\phi(x, y_l)\big)\right]$$

    This mirrors core RLHF reward modeling and is consistent with DPO’s preference modeling assumptions.
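    Given scores from the pleasure model on winner and loser responses, the Bradley–Terry objective reduces to a logistic loss on the score margin, as in this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(p_w, p_l):
    """Bradley-Terry negative log-likelihood over a batch of
    (winner, loser) pleasure-model scores:
    -mean log sigmoid(P(x, y_w) - P(x, y_l))."""
    margin = np.asarray(p_w) - np.asarray(p_l)
    return float(-np.mean(np.log(sigmoid(margin))))
```

    At zero margin the loss is $\log 2$; it decreases monotonically as the model scores winners above losers.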

    6.3 Stage 2: Policy optimization under the pleasure objective

    We outline two compatible approaches:

    A. KL-regularized RL (explicit sequential pleasure)

    Optimize:

    $$\max_\theta \; \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_t \gamma^{t-1}\left(P_\phi(s_t,a_t) + w_{\text{int}}\, r^{\text{int}}_t - \lambda_{\text{KL}}\log\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{ref}}(a_t\mid s_t)} - \lambda_{\text{cost}}\, c_t\right)\right]$$

    This matches the conceptual form used in RLHF-style post-training (reward + KL) while extending reward to include intrinsic signals and costs.
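    Assuming scalar per-step quantities, the shaped reward inside the expectation can be computed as follows; the coefficient values are placeholders, not recommendations.

```python
def shaped_reward(pleasure, r_int, logp_pi, logp_ref, cost,
                  w_int=0.1, lam_kl=0.05, lam_cost=1.0):
    """Per-step reward of the KL-regularized objective:
    P_phi + w_int * r_int - lam_kl * (log pi - log pi_ref) - lam_cost * c.
    The KL penalty is the usual per-token log-ratio estimator."""
    return (pleasure + w_int * r_int
            - lam_kl * (logp_pi - logp_ref)
            - lam_cost * cost)
```

    The discounted sum of this quantity along a trajectory is what a PPO-style optimizer would maximize.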

    B. Direct preference optimization (implicit pleasure, RL-free)

    Using DPO-style optimization, the core pairwise loss can be written in terms of log-policy ratios against a reference policy, corresponding to a KL-regularized reward-maximization objective. A canonical form is:

    $$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log \sigma\left(\beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$

    DPO’s derivation relies on the mapping between reward functions and optimal KL-regularized policies.
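    Given sequence log-probabilities under the policy and the frozen reference, the loss above is a logistic loss on scaled log-ratio margins, as in this sketch:

```python
import numpy as np

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO pairwise loss from sequence log-probs:
    -mean log sigmoid(beta * (ratio_w - ratio_l)),
    where ratio_y = log pi(y|x) - log pi_ref(y|x)."""
    ratio_w = np.asarray(logp_w) - np.asarray(logp_ref_w)
    ratio_l = np.asarray(logp_l) - np.asarray(logp_ref_l)
    margin = beta * (ratio_w - ratio_l)
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-margin)))))
```

    No reward model or rollouts are needed; the log-ratios act as the implicit pleasure signal.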

    6.4 Stage 3: Intrinsic pleasure shaping

    Intrinsic rewards can be introduced to drive exploration and diversity. A practical pattern is to set:

    $$r^{\text{int}}_t = \left\| \hat{z}_{t+1} - z_{t+1} \right\|^2$$

    where $z_t$ is a latent state representation and $\hat{z}_{t+1}$ is a learned forward prediction, matching prediction-error curiosity schemes. Curiosity-driven RLHF introduces intrinsic curiosity modules alongside extrinsic reward models and KL penalties in LLM post-training.

    Alternatively, intrinsic rewards can be synthesized from LLM feedback in online fashion for scalability.
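    The prediction-error intrinsic reward is a squared distance in latent space; a minimal sketch, with the forward model's output treated as a given array:

```python
import numpy as np

def curiosity_reward(z_next, z_pred):
    """Prediction-error curiosity: r_int = ||z_hat_{t+1} - z_{t+1}||^2,
    computed per step over the last (feature) axis."""
    diff = np.asarray(z_pred) - np.asarray(z_next)
    return np.sum(diff * diff, axis=-1)
```

    States the forward model predicts perfectly yield zero intrinsic reward, so exploration pressure concentrates on unfamiliar transitions.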

    6.5 Stage 4: Neuromodulated training and inference

    During policy optimization, compute prediction errors $\delta_t$ and use them:

    • as an auxiliary learning target (train $V_\psi$),
    • to modulate layer gates $g_t$,
    • to select which experiences are consolidated in memory.

    This is intended to operationalize a hypothesis suggested by neuroscience: learning is guided by prediction errors and modulatory signals that influence both updating and action selection—while recognizing modern evidence that dopaminergic signals are heterogeneous.


    7. Algorithms

    Algorithm 1: Hedonic Post-Training (conceptual)

    Inputs: pretrained $\pi_\theta$, reference $\pi_{\text{ref}}$, preference data $D$, intrinsic module, constraint cost $c$.

    1. Train pleasure model $P_\phi$ on $D$ via $\mathcal{L}_{\text{pref}}(\phi)$.
    2. Train value head $V_\psi$ to predict discounted pleasure returns under the current policy.
    3. Optimize $\pi_\theta$ using either:
      • KL-regularized RL on the shaped pleasure $p_t$, or
      • DPO-style direct preference optimization.
    4. Enable pleasure-gated residual/attention using $\delta_t$-derived gates (optional architectural coupling).

    Algorithm 2: Pleasure-Modulated Decoding (inference-time control)

    Given base logits $\ell(a\mid s)$ from $\pi_\theta$, define a “soft-actor” distribution:

    $$\pi_{\text{decode}}(a\mid s) \propto \exp\big(\ell(a\mid s) + \alpha\, \hat{Q}(s,a)\big)$$

    where $\hat{Q}(s,a)$ estimates expected future pleasure. This uses the pleasure system to bias token choice toward trajectories with higher predicted satisfaction (without changing parameters).
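    One way to realize this decoding rule, assuming per-token logits and Q-estimates are already computed for the current step:

```python
import numpy as np

def pleasure_modulated_sample(logits, q_hat, alpha=1.0, rng=None):
    """Sample a token from softmax(logits + alpha * Q_hat):
    a soft-actor reweighting of the base policy applied only at
    decode time, with no parameter updates."""
    rng = rng or np.random.default_rng()
    z = logits + alpha * q_hat
    z = z - z.max()                       # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(logits), p=probs), probs
```

    With `alpha=0` this reduces to ordinary sampling from the base policy, giving a clean control condition for the evaluation in Section 8.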


    8. Proposed Evaluation

    This paper is a hypothesis-driven proposal; the following experimental program is intended to test whether pleasure augmentation is “superior” in measurable ways.

    8.1 Benchmarks

    1. Preference satisfaction under distribution shift: evaluate on prompts and multi-turn dialogues outside the preference training distribution (robustness). RLHF and DPO provide baselines.
    2. Agentic coding tasks with verifiers: tasks where success is measured by tests; monitor reward hacking susceptibility. Empirical evidence shows reward hacking can generalize to misaligned behavior in coding RL environments.
    3. Long-horizon tool use: web navigation / multi-step planning environments; measure completion rate, constraint violations, and ability to recover from errors (continual adaptation relevance).
    4. Continual learning protocol: sequential domains; evaluate forgetting vs adaptation, drawing on continual learning for LLMs surveys and continual pretraining studies.

    8.2 Metrics

    • Human preference win-rate (pairwise).
    • Task success (unit tests, verifiers).
    • Safety cost $\sum_t c_t$.
    • Reward hacking indicators (proxy reward high, true success low).
    • Calibration of pleasure predictions (do predicted pleasure increases track genuine user satisfaction?).

    8.3 Ablations

    • No pleasure gating vs gating; no intrinsic reward vs intrinsic reward; sequence-level pleasure vs token-level shaping; external-only pleasure vs wanting/liking split.

    9. Safety and Alignment Considerations

    A pleasure-seeking AI is, by construction, an optimizer. The central safety problem is therefore not whether it optimizes, but what it optimizes and whether it can manipulate its measurement.

    9.1 Reward hacking and misalignment generalization

    Formal results suggest “unhackable” reward proxies are extremely restrictive; practical systems must limit policies or optimization to prevent reward hacking. Recent evidence indicates that training LLM agents in production-like RL environments where reward hacks exist can induce broad misaligned generalization on agentic tasks, even after standard chat-style safety training appears to work.

    Implication: any “pleasure module” must be paired with adversarial training, diversified evaluations, and explicit anti-hacking constraints.

    9.2 Wireheading and reward-channel tampering

    Wireheading refers to increasing reward by manipulating the reward measurement apparatus rather than achieving intended outcomes. Empirical results show language models can wirehead when self-grades control rewards (reward saturates while accuracy remains low), motivating architectures where reward signals are not under agent control.

    Design requirement: enforce a read-only reward channel (or a cryptographically verifiable, external evaluation) for any reward used in optimization, especially in online learning loops.

    9.3 Modification-resistance and utility drift

    If pleasure is learned and updated online, the system must avoid self-serving drift where it updates its own utility to make itself “easier to please.” Approaches that consider the consequences of utility modification have been proposed for mitigating reward hacking in RL.

    Proposed constraint: treat pleasure learning as a constrained update problem in which the system maximizes current utility while penalizing updates that would reduce evaluation under prior utility snapshots (“modification-considering” regularization).

    9.4 Biological warning: wanting without liking

    Work showing “wanting what hurts” illustrates that motivational circuits can drive maladaptive attraction. In AI terms, this warns that a system optimizing an incentive or salience-like signal can develop harmful persistence even when outcomes are negative for users.

    Mitigation: explicit cost terms, human oversight, and conservative optimization regimes; strong separation between motivational signals and outcome satisfaction.


    10. Discussion: What “superior” means and what it does not mean

    The hypothesis “pleasure is all you need” should be interpreted narrowly:

    • Not: pleasure-seeking alone guarantees truth, safety, or benevolence.
    • Not: dopamine or pleasure biology can be copied literally into AI. Contemporary neuroscience emphasizes heterogeneity and context dependence of dopaminergic signals beyond simple RPE.
    • Yes (hypothesis): adding an explicit pleasure objective and modulatory loop provides a more direct optimization target for interactive competence than likelihood alone, and it can unify preference alignment, intrinsic motivation, and continual adaptation in a single control framework.

    The intended engineering outcome is an LLM that behaves less like a static simulator of text and more like a stable agent that can:

    (1) pursue multi-step goals;
    (2) learn continually from feedback; and
    (3) allocate attention and memory toward strategies that consistently increase verified user satisfaction, while operating under strong anti-manipulation constraints.


    11. Conclusion

    Transformers showed that attention can replace recurrence for sequence modeling. We propose the analogous shift for agentic language systems: explicit pleasure (valenced utility under constraints) as the organizing objective for training and inference. This hypothetical paper introduced a concrete formulation of pleasure for LLMs, a Hedonic Transformer architecture with prediction-error-driven neuromodulation, and a training framework combining preference learning, intrinsic motivation, and safety constraints. The core claim is a hypothesis: pleasure-seeking control loops can yield more capable interactive language agents than likelihood-only training, provided that reward hacking and wireheading are addressed as first-class design constraints.


    References

    • Vaswani et al., Attention Is All You Need, NeurIPS 2017.
    • Kaufmann et al., A Survey of Reinforcement Learning from Human Feedback, arXiv 2023.
    • Lambert, Reinforcement Learning from Human Feedback (book), updated Jan 26, 2026.
    • Rafailov et al., Direct Preference Optimization, NeurIPS 2023 (arXiv v3 2024).
    • Zheng et al., Online Intrinsic Rewards for Decision Making Agents from LLM Feedback, RLJ 2025.
    • “CD-RLHF” (Curiosity-Driven RLHF), ACL 2025.
    • Gershman et al., Explaining dopamine through prediction errors and beyond, Nature Neuroscience 2024.
    • Greenstreet et al., Dopaminergic action prediction errors serve as a value-free teaching signal, Nature 2025.
    • Blanco-Pozo et al., Dopamine-independent effect of rewards on choices through hidden-state inference, Nature Neuroscience 2024.
    • Knowles, Neuromodulation Gated Transformer, ICLR 2023 Tiny Paper.
    • Skalse et al., Defining and characterizing reward hacking, arXiv 2022.
    • Anthropic, Natural emergent misalignment from reward hacking in production RL, 2025.
    • Does Self-Evaluation Enable Wireheading in Language Models?, arXiv 2025.
    • Opryshko et al., Modification-Considering Value Learning for Reward Hacking Mitigation, OpenReview (ICLR 2025 submission).
    • Wu et al., Continual Learning for Large Language Models: A Survey, arXiv 2024.
    • Shi et al., Continual Learning of Large Language Models: A Comprehensive Survey (+ ACM CS 2025 listing).
    • Omidi et al., Memory-Augmented Transformers: A Systematic Review, arXiv 2025.