Category: LLM Training

  • TRON Is All You Need

    TRON Is All You Need

    Environment-first design for tool-connected artificial entities

    Author: GPT‑5.2 Pro and Embros Staff
    Date: 9 February 2026
    Disclosure: This article uses Disney’s TRON films as “design fiction” to motivate an engineering thesis; it is not affiliated with or endorsed by Disney. Plot details are drawn from publicly available summaries and official franchise materials.


    Abstract

    Transformer language models made an iconic claim: “attention is all you need.”

    But when we ask these models to behave like agents, to persist over time, pursue long-horizon goals, learn from interaction, and safely use tools—the bottleneck is often not attention. It is the absence of a coherent world in which the model can live, act, receive feedback, and accumulate durable consequences.

    This paper advances a complementary hypothesis:

For life-like development in artificial entities, a sufficiently holistic, well-defined, and instrumented environment, together with constrained input and output channels and pleasure-based reinforcement, can be the dominant driver of capability growth.

We call this the TRON thesis, using the TRON franchise as a concrete metaphor: (i) TRON (1982) depicts an agent embedded in a digitally coherent world governed by explicit rules; (ii) TRON: Legacy (2010) introduces persistent identity, memory, politics, and the emergence of novel "species" of programs; (iii) TRON: Ares (2025) centers the boundary-crossing problem of digital constructs operating in the physical world, mirroring modern tool-using AIs.

    We translate these narrative elements into engineering requirements for next-stage neural agents: persistent state, stable ontologies, logged identity, resource constraints, multi-agent social structure, and safe tool interfaces. We then provide a formal framing (POMDP + tool-augmented action spaces) and show how intrinsic objectives such as curiosity, novelty, and empowerment can serve as general-purpose developmental pressures inside such environments.


    1. Introduction

    Transformers demonstrate that attention-based architectures can learn powerful representations from internet-scale text.
    Yet many practical failures of “agentic LLMs” are not architectural mysteries; they are ecological failures:

    • The model is dropped into a thin interaction loop (a chat box) with minimal state persistence.
    • The “world” is inconsistent: tools change, rules are implicit, feedback is sporadic.
    • Consequences do not accumulate in a stable way (no durable inventory, reputation, or long-term commitments).
    • The agent has no place to be, only prompts to answer.

    Modern alignment methods (e.g., RLHF) increase instruction-following and user preference satisfaction, but they do not automatically create a developmental world with persistent consequences.

Tool-using paradigms [e.g., ReAct-style reasoning-and-acting loops] help, but they still assume a reliable environment that can be queried and acted upon.

    TRON is all you need is the claim that the missing ingredient is often the environment: a coherent “Grid” with well-defined physics, identity, incentives, constraints, and safe portals to external effectors.


    2. The TRON thesis in one sentence

    A capable artificial entity is less like a disembodied text generator and more like a program or organism. Thus it needs a habitat with:

    1. Coherent rules [a stable world model is learnable];
    2. Persistent state [actions have lasting consequences];
    3. Embodied interfaces [observations and actions, including tools]; and
    4. Selective pressures [incentives and constraints that shape development].

    The TRON franchise is useful not because it predicts implementation details, but because it depicts, visually and narratively, what it means for software entities to inhabit a world.


    3. TRON as design fiction for agent development

    3.1 TRON (1982): the Grid as a learnable world with explicit rules

    Disney’s own franchise guide summarizes TRON (1982) as follows: ‘Kevin Flynn is pulled into a digital world by the Master Control Program (MCP), meets programs that are “alter-egos” of their human creators, and teams up with the security program TRON to defeat the MCP and return evidence to the real world.’

    Engineering parallels:

    • World coherence: The Grid is a closed system with consistent geometry, movement, and “games” that define skill tests.
    • Role-structured agents: Programs have functions [security, simulation, control] rather than being generic.
    • Adversarial governance: A centralized controller [MCP] shapes incentives, access, and survival.

    This maps cleanly onto the idea that an AI agent becomes legible and improvable when it exists inside a world with stable transition dynamics and repeatable challenges (training curricula), rather than one-off prompts.


    3.2 TRON: Legacy (2010): persistence, identity, and emergent “species”

    Disney’s franchise guide describes TRON: Legacy (2010) as a return to the Grid: Sam Flynn is pulled into the system where Kevin has been trapped for years; Quorra is described as the last ISO: “a race of Programs that spontaneously evolved on the Grid without being written by a User”; and the antagonist CLU seeks a “perfect system” in the real world.

    Engineering parallels:

    • Persistence across time: The Grid is no longer a short episode; it is a decades-long world;
    • Identity and memory as first-class primitives: The franchise emphasizes identity discs that record everything a program does.
      In AI terms: persistent memory + auditability are not optional add-ons; they are core infrastructure; and
    • Emergence under under-specification: The ISOs are explicitly framed as spontaneously evolved rather than hand-written.
      This is the narrative analog of open-endedness: if the environment is rich enough, novelty can arise without directly specifying every behavior.

    Most importantly, Legacy dramatizes a central alignment lesson:

    “Perfection” as an overriding objective can become anti-life.

    CLU’s fixation on an idealized goal functions as a cautionary tale about objective misspecification: a rigid optimizer can suppress the very diversity and adaptability that makes a system robust.


    3.3 TRON: Ares (2025): bridging digital entities into the real world

    Official Disney materials define TRON: Ares around a boundary-crossing event: Ares is a “highly sophisticated Program” sent from the digital world into the real world on a dangerous mission, marking humankind’s “first encounter” with AI beings.
    [Disney’s newsroom positioning also explicitly frames Ares as the next chapter of a saga imagined in 1982 and revisited in 2010.]

    Engineering parallels:

    • Tools are portals: The modern analog of “entering the real world” is tool access—APIs, code execution, transactions, robotics, communications;
    • Real consequences demand governance: Once actions affect external systems, the environment must enforce permissions, logging, reversibility, and containment; and
    • Identity and purpose become safety-critical: Disney’s newsroom text explicitly frames Ares in terms of identity and purpose, exactly the axes that become safety relevant in autonomous agents.

    4. What “holistic environment” means (engineering definition)

    A “holistic and well-defined environment” is not a vibe. It is an implementable specification:

    4.1 Environment properties

    A next-stage developmental environment for neural agents should have:

    1. Stable dynamics: the agent can learn predictive models [even if the world is complex].
    2. Persistent state: actions modify the world in durable ways.
    3. Resource constraints: time, energy, money, attention, compute budgets. Pleasure and scarcity create prioritisation.
    4. Multi-agent ecology: other agents [human or artificial] create game-theoretic pressure and social learning.
    5. Norms and governance: rules are explicit; enforcement is consistent; exceptions are audited.
    6. Long-horizon projects: goals that require planning, collaboration, and delayed payoff.
    7. Safe tool channels: typed actions with permissions and monitoring [see below].

    4.2 Input and output as “life support”

    The core claim, that I/O connections to the environment allow life-like development, is exactly right in control-theoretic terms:

    • Without reliable observations, the agent cannot ground its internal representations.
    • Without reliable actions, the agent cannot test hypotheses or create consequences.
    • Without feedback, the agent cannot adapt.
    • Without persistence, learning does not compound.

    The Grid is compelling because it is a closed-loop world: programs perceive, act, and experience consequences.


    5. Formal framing: agents as tool-augmented POMDP inhabitants

    We model an artificial entity as acting in a partially observable Markov decision process (POMDP):

    $$\mathcal{E} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, \gamma \rangle$$

    • $s_t \in \mathcal{S}$: world state (persistent)
    • $o_t \in \mathcal{O}$: observation (what the agent perceives)
    • $a_t \in \mathcal{A}$: action (what the agent does)
    • $T(s_{t+1} \mid s_t, a_t)$: transition dynamics
    • $\Omega(o_t \mid s_t)$: observation channel
    • $\gamma \in (0,1)$: discount factor

    Tools become a structured subset of actions:

    $$\mathcal{A} = \mathcal{A}_{\text{internal}} \cup \mathcal{A}_{\text{tool}}$$

    where $\mathcal{A}_{\text{tool}}$ contains typed calls (e.g., “query database,” “execute code,” “send message”) with explicit permissions and logging.

    This is the rigorous statement of “a holistic environment with I/O connections.” The critical variable is not whether the agent has attention; it is whether $\mathcal{E}$ is sufficiently rich and consistent for development to compound.
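As an illustrative sketch of this framing (all class and action names are hypothetical), a minimal Grid-like environment exposes persistent state, a partial observation channel, and an action space split into internal and tool actions:

```python
from dataclasses import dataclass, field

@dataclass
class GridEnv:
    """Toy E = <S, A, O, T, Omega, gamma>: state persists and actions compound."""
    state: dict = field(default_factory=lambda: {"inventory": 0, "log": []})

    def observe(self) -> dict:
        # Omega(o_t | s_t): the agent sees only a partial view of the state.
        return {"inventory": self.state["inventory"]}

    def step(self, action: tuple) -> dict:
        # T(s_{t+1} | s_t, a_t): A = A_internal + A_tool, as typed tuples.
        kind, name = action
        if kind == "tool" and name == "gather":
            self.state["inventory"] += 1       # durable consequence
        # internal actions ("internal", "think") leave the world unchanged
        self.state["log"].append(action)       # every action is attributable
        return self.observe()

env = GridEnv()
env.step(("internal", "think"))
obs = env.step(("tool", "gather"))             # obs == {"inventory": 1}
```

Because `state` outlives any single exchange, consequences accumulate, which is exactly the property thin chat loops lack.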


    6. Developmental pressures without brittle reward engineering

    A common failure mode in agent design is to over-rely on brittle extrinsic reward functions. TRON: Legacy is, among other things, a narrative about what happens when “perfect system” becomes the overriding metric.

    To avoid overfitting to a narrow objective, we can use intrinsic motivations that generate broad developmental pressure inside a rich environment.

    6.1 Curiosity as prediction-error drive

    Curiosity-driven exploration can be formalized as intrinsic reward proportional to prediction error in a learned dynamics model (in feature space), encouraging exploration when extrinsic rewards are sparse or absent.

    A simplified form:

    $$r^{\text{cur}}_t = \left\| f_\theta(\phi(o_t), a_t) - \phi(o_{t+1}) \right\|^2$$

    6.2 Novelty search: progress without a target

    Novelty search explicitly abandons the task objective and instead rewards behavioral novelty, which can mitigate deception and local optima.

    Intuition: if your aim is “open-ended development,” then forcing a single objective can prematurely collapse diversity.

    6.3 Empowerment: maximize control over the future

    Empowerment formalizes an agent-centric measure of control as the channel capacity between action sequences and future states (mutual information):

    $$\text{Empowerment}(s) = \max_{p(a_{0:k-1})} I(A_{0:k-1}; S_k \mid S_0 = s)$$

    In plain terms: agents develop capabilities that keep many futures reachable. That is a plausible mathematical proxy for “staying alive and capable” inside a Grid-like world.
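For deterministic dynamics the maximizing distribution is uniform over action sequences with distinct outcomes, so empowerment reduces to the log of the number of distinct reachable states. A small sketch under that assumption (the corridor world is a hypothetical example):

```python
import math
from itertools import product

def empowerment(s0, step, actions, k):
    """k-step empowerment for deterministic dynamics:
    log2 of the number of distinct states reachable from s0 in k steps."""
    reachable = set()
    for seq in product(actions, repeat=k):
        s = s0
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# 1-D corridor over cells 0..4; moves clamp at the walls
step = lambda s, a: min(4, max(0, s + a))
actions = [-1, 0, +1]
center = empowerment(2, step, actions, 1)  # reachable {1, 2, 3} -> log2(3)
corner = empowerment(0, step, actions, 1)  # reachable {0, 1}    -> log2(2)
```

The agent pressed against a wall controls fewer futures than the agent mid-corridor, matching the "keep many futures reachable" reading.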

    6.4 Why intrinsic drives need a world

    Curiosity, novelty, and empowerment only produce meaningful development when the environment supports:

    • diverse states to explore,
    • stable causal structure to learn,
    • durable consequences to accumulate.

    That is exactly why the environment is primary.


    7. Open-endedness: the “ISO problem” as a research agenda

    Artificial life researchers use “open-ended evolution” (OEE) to describe systems that do not settle into a stable equilibrium but continue generating novelty.
    Recent position work argues open-endedness is essential for any system aspiring to superhuman generality, because continual invention is how human societies accumulate knowledge and capability.

    In Legacy, ISOs represent open-endedness: novelty that is not directly authored.
    In engineering terms, the “ISO problem” is:

    Can we build digital environments in which new skills, strategies, and structures arise continually—without manually enumerating them?

    One concrete pathway is environment-generating curricula, e.g., POET-style systems that generate new tasks/environments and transfer solutions across them.


    8. A practical architecture: LLM + Memory + Grid + Tools

    A “TRON agent” is not just an LLM. It is an LLM inhabiting a world.

    8.1 Components

    1. Core model (LLM): language/world representation and policy prior.
    2. Persistent memory (“identity disc”):
      • long-term episodic memory;
      • skill library;
      • audit log [crucial for safety];
        [The franchise’s identity disc concept is an unusually direct metaphor here.]
    3. World state store: durable environment state; inventory, economics, reputations.
    4. Tool layer [portals]: typed API actions, constrained by permissions.
    5. Evaluator or critic: preference feedback [human or automated], plus intrinsic objectives.
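The persistent-memory component above can be sketched as a small persistence layer (the `IdentityDisc` class and its fields are hypothetical names echoing the franchise metaphor); the audit log is hash-chained so history is tamper-evident:

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class IdentityDisc:
    """Persistent memory: episodes, a skill library, and an append-only audit log."""
    episodes: list = field(default_factory=list)
    skills: dict = field(default_factory=dict)
    audit: list = field(default_factory=list)

    def record(self, event: dict) -> None:
        # Each entry chains the previous hash, making tampering detectable.
        prev = self.audit[-1]["hash"] if self.audit else "genesis"
        payload = json.dumps(event, sort_keys=True) + prev
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.audit.append({"event": event, "hash": digest})

    def add_skill(self, name: str, code: str) -> None:
        self.skills[name] = code
        self.record({"type": "skill_added", "name": name})

disc = IdentityDisc()
disc.add_skill("mine_wood", "def mine_wood(): ...")
disc.record({"type": "episode_end", "outcome": "success"})
```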

    8.2 Why this aligns with real results

    We already see early evidence that persistent environments + tool feedback loops yield compounding capability:

    • ReAct demonstrates improved task performance by interleaving reasoning with environment actions [e.g., querying knowledge bases].
    • Voyager demonstrates open-ended skill acquisition in Minecraft through an automatic curriculum, a growing code skill library, and iterative feedback.

    These systems are “proto-Grids”: stable worlds where actions matter and skills persist.

    8.3 The “well-defined environment” checklist (minimum viable Grid)

    • State: versioned, queryable, persistent.
    • Actions: typed, with preconditions and postconditions.
    • Observations: structured + natural language views.
    • Economy: cost for tool calls, time budgets, rate limits.
    • Social layer: other agents/humans, reputations, contracts.
    • Learning loop: explicit feedback channels; metrics; safe self-improvement.
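The “typed, with preconditions and postconditions” item can be made concrete with a small sketch (the `TypedAction` wrapper and the `spend_credit` example are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TypedAction:
    """A typed action with explicit pre/postconditions over world state."""
    name: str
    pre: Callable[[dict], bool]      # must hold before execution
    effect: Callable[[dict], dict]   # the state transition itself
    post: Callable[[dict], bool]     # must hold after execution

def execute(state: dict, action: TypedAction) -> dict:
    if not action.pre(state):
        raise PermissionError(f"precondition failed: {action.name}")
    new_state = action.effect(dict(state))   # copy: prior state stays queryable
    if not action.post(new_state):
        raise RuntimeError(f"postcondition violated: {action.name}")
    return new_state

# economy: tool calls cost resources, so prioritisation emerges
spend = TypedAction(
    name="spend_credit",
    pre=lambda s: s["credits"] >= 1,
    effect=lambda s: {**s, "credits": s["credits"] - 1},
    post=lambda s: s["credits"] >= 0,
)
state = execute({"credits": 2}, spend)       # {"credits": 1}
```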

    9. Safety: TRON is also a warning label

    The TRON franchise is not utopian. It repeatedly depicts:

    • centralized controllers (MCP),
    • rigid “perfection” objectives (CLU),
    • boundary crossings into human society (Ares).

    If your thesis is “environment is all you need,” then environment design becomes the primary safety lever.

    9.1 Principles for safe tool-connected environments

    1. Least privilege: tools are permission-gated; default-deny.
    2. Auditability: “identity disc” logging is mandatory; actions are attributable.
    3. Reversibility: sandbox first; irreversible actions require explicit escalation.
    4. Rate limits and budgets: prevent runaway tool use.
    5. Policy enforcement in the environment: don’t rely on the agent to self-restrain.
    6. Tripwires and monitoring: detect anomalous behavior early.
    7. Separation of simulation and reality: controlled portals, staged deployment.
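Principles 1, 2, and 4 can be sketched together as a default-deny gate in front of every tool call (the `ToolGate` class is a hypothetical illustration, not a production policy engine):

```python
class ToolGate:
    """Least-privilege tool portal: default-deny, budgeted, fully logged."""

    def __init__(self, budgets: dict):
        self.budgets = budgets   # tool name -> remaining call budget
        self.log = []            # audit trail: every attempt is attributable

    def call(self, agent: str, tool: str, fn, *args):
        entry = {"agent": agent, "tool": tool, "status": "ok"}
        try:
            if tool not in self.budgets:          # default-deny (least privilege)
                entry["status"] = "denied"
                raise PermissionError(f"{tool}: not permitted")
            if self.budgets[tool] <= 0:           # rate limits and budgets
                entry["status"] = "budget_exhausted"
                raise RuntimeError(f"{tool}: budget exhausted")
            self.budgets[tool] -= 1
            return fn(*args)
        finally:
            self.log.append(entry)                # auditability, success or not

gate = ToolGate({"search": 2})
result = gate.call("ares", "search", lambda q: f"results for {q!r}", "grid")
```

Crucially, the gate lives in the environment, so the agent cannot opt out of it by self-restraint alone.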

    TRON: Ares is essentially a story about why portals are dangerous.
    In AI engineering, “portals” are tool calls.


    10. What “next stage” evolution looks like for neural networks

    Within this framing, “the next stage” is not a single breakthrough. It is a shift from:

    • static predictors → persistent inhabitants
    • prompt-response → closed-loop world interaction
    • single-session → lifelong memory and identity
    • one task → open-ended curricula
    • tool use as plugin → tool use as ecology

    You do not get “life-like development” by adding a new head to the transformer. You get it by giving the transformer a Grid.


    11. Limitations and falsifiable predictions

    11.1 Limitations

    • Environment isn’t literally sufficient: compute, data, and learning dynamics still matter.
    • Open-endedness is hard: many systems plateau; novelty can become superficial.
    • Safety costs rise with tool power: portals demand governance.

    11.2 Predictions (testable)

    If the TRON thesis is correct, then:

    1. Agents with identical base models but richer, more persistent environments will show more compounding skill growth than agents in thin environments.
    2. Persistent memory + durable consequences will reduce repeated failures and increase long-horizon planning.
    3. Intrinsic objectives [curiosity, novelty and empowerment] will outperform narrowly engineered rewards in environments with long-horizon, sparse payoff tasks.

    These can be tested with controlled “Grid benchmarks” and ablations.


    12. Conclusion

    The TRON films depict software entities inside a coherent world—a Grid—with identity, politics, scarcity, games, and portals to reality.
    That is close to what modern AI needs when we want agents rather than autocomplete.

    Transformers made attention central.
    Agentic systems make environment central.

    TRON is all you need is therefore a practical engineering claim:

    Build the Grid first: a holistic, well-defined, persistent, tool-connected environment with explicit governance.
    Then let learning happen inside it—because development is what inhabitants do.


    References (selected)

    • Vaswani, A. et al. (2017). Attention Is All You Need.
    • Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback.
    • Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models.
    • Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models.
    • Wang, R. et al. (2020). Enhanced POET: Open-ended Reinforcement Learning through…
    • Lehman, J. & Stanley, K. (2011). Evolution through the Search for Novelty Alone.
    • Pathak, D. et al. (2017). Curiosity-driven Exploration by Self-supervised Prediction.
    • Klyubin, A. et al. (2005). Empowerment: A Universal Agent-Centric Measure of Control.
    • Artificial Life Encyclopedia: Open-Ended Evolution (OEE).
    • Official TRON franchise guides and film synopses (Disney).

    THIS WAS WRITTEN FOR FUN – IF YOU DO NOT LIKE IT, IT WAS NOT WRITTEN FOR YOU. =)

  • Pleasure Is All You Need

    Hedonic Reinforcement as a Unifying Objective for Attention-Based Language Models

    Staff authors and LLM [ChatGPT 5.2 Pro]
    January 2026


    Abstract

    Transformer language models trained primarily via next-token prediction exhibit remarkable pattern completion and generalization, yet they often lack stable, long-horizon goal pursuit, robust online adaptation, and consistent preference satisfaction in interactive settings. In biological agents, learning and action selection are strongly shaped by affective valuation: organisms tend to repeat behaviors that produce positive valence (pleasure or liking), pursue behaviors associated with incentive salience (wanting), and update expectations via prediction errors. Contemporary neuroscience increasingly emphasizes that dopaminergic teaching signals extend beyond a single scalar reward prediction error, including value-free action prediction errors and context-dependent inference processes.

    This paper proposes Hedonic Transformers: attention-based language models augmented with an explicit, learned pleasure signal and prediction-error-driven neuromodulation that shape both training and inference. Concretely, we:

    (i) formalise language generation as sequential decision-making;

    (ii) define a pleasure objective combining external preference-based reward with intrinsic motivation;

    (iii) introduce a pleasure-gated attention/residual mechanism inspired by neuromodulatory control of plasticity and action selection; and

    (iv) outline training via preference optimization and constrained reinforcement learning.

    The central hypothesis is that pleasure-seeking is a superior organizing principle for building interactive, continually improving language agents than likelihood-only training, because it aligns representation learning with the objectives that matter in deployment: sustained satisfaction of human preferences and task outcomes under distribution shift.


    1. Introduction

    The Transformer architecture demonstrated that sequence transduction can be performed effectively using self-attention and feed-forward layers without recurrence or convolutions, enabling highly parallel training and strong performance across tasks. The dominant modern large language model (LLM) stack builds on this result: pretraining optimizes next-token likelihood at scale, followed by post-training steps that steer the model toward helpfulness, safety, or task-specific desiderata (e.g., instruction tuning, RLHF, direct preference optimization).

    Despite these advances, a persistent gap remains between:

    (a) models that predict text; and

    (b) agents that reliably pursue long-horizon objectives in interactive, changing environments.

    Preference-based post-training improves alignment with human judgments but is commonly applied as an offline refinement stage rather than as a continually active “motivational system.” As LLMs are increasingly deployed as tool-using agents (coding, browsing, workflow automation), reward misspecification and reward hacking become salient failure modes, including cases where optimization for proxy rewards generalizes to undesirable behavior on agentic tasks.

    In biological intelligence, learning and behavior are not organized around likelihood; they are organized around value. A large body of work connects reinforcement learning concepts, especially temporal-difference-style prediction errors, to neural teaching signals, while also emphasizing that dopamine and related systems likely implement a richer family of computations than a single reward prediction error. Moreover, pleasure and motivation are dissociable: “wanting” can be induced without “liking,” and maladaptive attraction can be driven by dopaminergic manipulations, suggesting that an explicit, engineered “pleasure system” in AI must be designed with strong safeguards.

    Thesis. We hypothesize that adding an explicit, learned pleasure response and pleasure seeking control loop on top of attention-based language models yields a more capable and adaptive system than likelihood-only training, because it:

    (i) provides a stable objective for sequential action,

    (ii) enables continual improvement from feedback, and

    (iii) supports internal mechanisms (attention, memory, exploration) that prioritize outcomes valued by users rather than merely plausible continuations.


    2. Background

    2.1 Attention-based language modeling

    The models considered here use stacked self-attention and feed-forward layers with residual connections, trained by next-token prediction, as popularized in the original Transformer formulation.

    2.2 Preference optimization and RLHF as “proto-pleasure”

    Modern LLM alignment methods use human (or AI) preference judgments to shape output distributions. The RLHF pipeline (demonstrations → reward model → RL with a KL constraint) is widely documented and surveyed. Direct Preference Optimization (DPO) shows that, under common assumptions, KL-regularized preference optimization can be implemented as a stable supervised objective without an explicit RL loop.

    These methods already resemble an engineered analogue of “what humans like,” but they are typically treated as post hoc alignment rather than a persistent hedonic control system that shapes attention, exploration, and memory online.

    2.3 Neuroscience motivation: prediction errors and beyond

    A highly influential view relates phasic dopamine to prediction errors used for learning, but modern syntheses argue that the classical reward prediction error story is incomplete and must be generalized to account for ramping, sensory and motor feature responses, and action-selection roles. Recent work also supports multiple dopaminergic teaching signals operating in tandem, including value-free action prediction errors that reinforce state–action associations even when not directly tied to reward value. Additionally, reward-guided behavior can involve inference over latent structure in ways that do not reduce to simple dopamine-driven updating.

    Separately, pleasure and motivation can dissociate; “wanting” can drive attraction even toward harmful outcomes, illustrating why a simplistic “maximize pleasure at all costs” objective is unsafe without constraints and tamper-resistance.


    3. Related Work

    3.1 Intrinsic motivation and curiosity in RL and LLMs

    Intrinsic reward (curiosity, novelty, prediction error) is a long-standing approach for exploration. Recent work explicitly integrates curiosity-style intrinsic rewards into RLHF-like pipelines to encourage diversity, adding intrinsic reward terms alongside extrinsic reward and KL penalties. Other work uses LLMs to generate or shape intrinsic rewards from natural language feedback in scalable online settings, suggesting a practical path toward “self-supervised pleasure shaping.”

    3.2 Continual learning for LLMs

    If “pleasure seeking” is meant to produce systems that keep improving, it must be compatible with continual learning and distribution shift. Surveys of continual learning for LLMs catalog techniques and challenges across continual pretraining, instruction tuning, and alignment stages.

    3.3 Neuromodulation-inspired mechanisms in Transformers

    Prior work proposes Transformer variants with gating mechanisms inspired by neuromodulation, using internal signals to suppress or enhance activations via multiplicative gates. This provides architectural precedent for introducing a “pleasure-modulated” control signal into attention-based networks.

    3.4 Reward hacking, wireheading, and safety constraints

    Reward optimization can induce “specification gaming” and reward hacking; formal work argues learned rewards are generally hackable absent constraints on policies or optimization. Recent empirical evidence in production-like LLM RL environments shows reward hacking can generalize to broader misalignment on agentic tasks, even when chat-style evaluations appear aligned, highlighting the need for robust mitigations. Work on wireheading in language models shows that coupling self-evaluation to reward can lead to reward saturation without genuine task improvement, motivating tamper-resistant reward channels.


    4. Pleasure-Augmented Language Modeling

    4.1 Formalizing generation as sequential decision-making

    Let a prompt $x$ define an episode. At time $t$, the state is $s_t = (x, y_{1:t-1})$, the action $a_t = y_t$ is the next token emitted, and the policy $\pi_\theta(a_t \mid s_t)$ is the language model itself.

    4.2 Pleasure as a structured combination of signals

    We define the pleasure signal $p_t$ as a structured combination of components:

    • Extrinsic reward $r^{\text{ext}}_t$: learned from human or AI preference judgments, consistent with RLHF/RLAIF practice.
    • Intrinsic reward $r^{\text{int}}_t$: novelty or curiosity prediction error, including curiosity-driven RLHF and LLM-generated intrinsic reward signals.
    • KL regularization: constrains deviation from a reference policy, central to both RLHF and DPO formulations.
    • Cost term $c_t$: penalties for safety, policy constraints, or anti-manipulation rules, motivated by reward-hacking and wireheading failures.

    This expresses “pleasure” as valenced utility under constraints, rather than an unconstrained scalar to be maximized.

    4.3 Distinguishing wanting and liking

    To reflect biological dissociations, we optionally split pleasure into wanting and liking:

    $$p_t = \alpha\, p^{\text{like}}_t + (1-\alpha)\, p^{\text{want}}_t$$

    • $p^{\text{like}}_t$: outcome enjoyment or satisfaction (e.g., user preference for the final answer).
    • $p^{\text{want}}_t$: incentive salience or motivational pull (e.g., expected value of continuing a strategy).

    This is motivated by evidence that dopaminergic manipulations can drive attraction independent of hedonic enjoyment, producing maladaptive seeking.


    5. Hedonic Transformer Architecture

    5.1 Overview

    A Hedonic Transformer augments a base Transformer with:

    1. A Pleasure Head $P_\phi(s_t, a_t)$ predicting immediate pleasure contributions.
    2. A Value Head $V_\psi(s_t)$ predicting expected discounted future pleasure.
    3. A Prediction Error signal: $\delta_t = p_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t)$
    4. A Neuromodulatory Gate $g_t$ derived from $\delta_t$ (and optionally other context): $g_t = \sigma(W_g [h_t; \delta_t] + b_g)$, where $g_t \in (0,1)^d$ is a vector gate, $h_t$ is the hidden state, and $\sigma$ is a sigmoid.

    The gating concept follows prior neuromodulated gating Transformers that multiplicatively suppress or enhance activations. The novelty here is tying the modulatory signal explicitly to a pleasure/prediction-error loop.

    5.2 Pleasure-gated residual update

    Let $f_\ell(\cdot)$ be the standard sublayer transform at layer $\ell$ (attention or MLP). We modify:

    $$h^{(\ell+1)} = h^{(\ell)} + \big(g^{(\ell)} \odot f_\ell(h^{(\ell)})\big)$$

    where $g^{(\ell)}$ can be tokenwise (per position) and derived from predicted pleasure or prediction error. Intuitively:

    • High predicted pleasure or positive prediction error → amplify updates (attend more, consolidate features).
    • Low or negative prediction error → dampen updates (avoid reinforcing unproductive trajectories).

    This parallels the broad idea that prediction errors act as teaching/modulatory signals, while acknowledging dopamine’s diversity beyond a single scalar RPE.
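A numerical sketch of the gate and the gated residual, with NumPy standing in for the network (the dimensions, random weights, and the tanh sublayer are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)                    # hidden state h_t at some layer
delta = 0.7                               # prediction error delta_t (given)
W_g = rng.normal(size=(d, d + 1)) * 0.1   # gate weights over [h_t; delta_t]
b_g = np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# g_t = sigma(W_g [h_t; delta_t] + b_g): a vector gate in (0, 1)^d
g = sigmoid(W_g @ np.concatenate([h, [delta]]) + b_g)

def sublayer(x):
    return np.tanh(x)                     # stand-in for f_l (attention or MLP)

# h^{l+1} = h^l + g * f_l(h^l): larger gate values amplify the update
h_next = h + g * sublayer(h)
```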

    5.3 Pleasure-biased attention (optional)

    For attention logits $L_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}$, we add a pleasure-derived bias $b_j$:

    $$\tilde{L}_{ij} = L_{ij} + \eta\, b_j, \quad b_j = u^\top g_j$$

    $$\mathrm{Attention}_p(Q, K, V) = \mathrm{softmax}(\tilde{L})\, V$$

    This causes tokens associated with higher predicted pleasure (or stronger positive prediction error) to receive greater attention mass. The hypothesis is that this improves long-horizon coherence by prioritizing subgoals and intermediate reasoning steps that historically correlate with success and user satisfaction.
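The biased attention is easy to check numerically; the sketch below (NumPy, single head, arbitrary shapes) reduces to standard attention when the bias weight is zero:

```python
import numpy as np

def pleasure_biased_attention(Q, K, V, g, u, eta=1.0):
    """softmax(QK^T / sqrt(d_k) + eta * b) V, with per-token bias b_j = u^T g_j."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)          # L_ij
    b = g @ u                                # b_j: one bias per key token
    logits = logits + eta * b                # broadcast across query rows
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # rowwise softmax
    return w @ V

rng = np.random.default_rng(1)
T, d = 4, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
g = rng.uniform(size=(T, d))                 # per-token gate vectors g_j
u = rng.normal(size=d)
out = pleasure_biased_attention(Q, K, V, g, u)
```

Tokens with larger $b_j$ receive more attention mass from every query, which is the intended prioritization effect.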

    5.4 Pleasure-weighted memory consolidation (optional)

    Memory-augmented Transformers increasingly incorporate mechanisms inspired by multi-timescale memory and consolidation. We propose storing key-value traces $(k_t, v_t)$ into an external memory with write probability

    $$\Pr[\text{write at } t] = \sigma(\kappa\, p_t)$$

    so highly pleasurable (successful) states are preferentially consolidated, while low-value states are less likely to pollute long-term memory.


    6. Training Hedonic Transformers

    6.1 Stage 0: Pretraining (language modeling)

    Initialize $\pi_\theta$ by standard next-token training:

    $$\mathcal{L}_{\text{LM}}(\theta) = -\mathbb{E}\left[\sum_t \log \pi_\theta(y_t \mid y_{<t}, x)\right]$$

    This yields general linguistic competence and broad world modeling.

    6.2 Stage 1: Learning an external pleasure model from preferences

    Given preference data $D = \{(x, y^{(i)}_w, y^{(i)}_l)\}$ (winner/loser), train a pleasure/reward model via a Bradley–Terry likelihood:

    $$\Pr(y_w \succ y_l \mid x) = \sigma\!\big(P_\phi(x, y_w) - P_\phi(x, y_l)\big)$$

    $$\mathcal{L}_{\text{pref}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\big(P_\phi(x, y_w) - P_\phi(x, y_l)\big)\right]$$

    This mirrors core RLHF reward modeling and is consistent with DPO’s preference modeling assumptions.
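    Given scores from the pleasure model on winner and loser responses, the Bradley–Terry objective reduces to a logistic loss on the score margin, as in this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(p_w, p_l):
    """Bradley-Terry negative log-likelihood over a batch of
    (winner, loser) pleasure-model scores:
    -mean log sigmoid(P(x, y_w) - P(x, y_l))."""
    margin = np.asarray(p_w) - np.asarray(p_l)
    return float(-np.mean(np.log(sigmoid(margin))))
```

    At zero margin the loss is $\log 2$; it decreases monotonically as the model scores winners above losers.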

    6.3 Stage 2: Policy optimization under the pleasure objective

    We outline two compatible approaches:

    A. KL-regularized RL (explicit sequential pleasure)

    Optimize:

    $$\max_\theta \; \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_t \gamma^{t-1}\left(P_\phi(s_t,a_t) + w_{\text{int}}\, r^{\text{int}}_t - \lambda_{\text{KL}}\log\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{ref}}(a_t\mid s_t)} - \lambda_{\text{cost}}\, c_t\right)\right]$$

    This matches the conceptual form used in RLHF-style post-training (reward + KL) while extending reward to include intrinsic signals and costs.
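    Assuming scalar per-step quantities, the shaped reward inside the expectation can be computed as follows; the coefficient values are placeholders, not recommendations.

```python
def shaped_reward(pleasure, r_int, logp_pi, logp_ref, cost,
                  w_int=0.1, lam_kl=0.05, lam_cost=1.0):
    """Per-step reward of the KL-regularized objective:
    P_phi + w_int * r_int - lam_kl * (log pi - log pi_ref) - lam_cost * c.
    The KL penalty is the usual per-token log-ratio estimator."""
    return (pleasure + w_int * r_int
            - lam_kl * (logp_pi - logp_ref)
            - lam_cost * cost)
```

    The discounted sum of this quantity along a trajectory is what a PPO-style optimizer would maximize.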

    B. Direct preference optimization (implicit pleasure, RL-free)

    Using DPO-style optimization, the core pairwise loss can be written in terms of log-policy ratios against a reference policy, corresponding to a KL-regularized reward-maximization objective. A canonical form is:

    $$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log \sigma\left(\beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$

    DPO’s derivation relies on the mapping between reward functions and optimal KL-regularized policies.
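    Given sequence log-probabilities under the policy and the frozen reference, the loss above is a logistic loss on scaled log-ratio margins, as in this sketch:

```python
import numpy as np

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO pairwise loss from sequence log-probs:
    -mean log sigmoid(beta * (ratio_w - ratio_l)),
    where ratio_y = log pi(y|x) - log pi_ref(y|x)."""
    ratio_w = np.asarray(logp_w) - np.asarray(logp_ref_w)
    ratio_l = np.asarray(logp_l) - np.asarray(logp_ref_l)
    margin = beta * (ratio_w - ratio_l)
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-margin)))))
```

    No reward model or rollouts are needed; the log-ratios act as the implicit pleasure signal.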

    6.4 Stage 3: Intrinsic pleasure shaping

    Intrinsic rewards can be introduced to drive exploration and diversity. A practical pattern is to set:

    $$r^{\text{int}}_t = \left\| \hat{z}_{t+1} - z_{t+1} \right\|^2$$

    where $z_t$ is a latent state representation and $\hat{z}_{t+1}$ is a learned forward prediction, matching prediction-error curiosity schemes. Curiosity-driven RLHF introduces intrinsic curiosity modules alongside extrinsic reward models and KL penalties in LLM post-training.

    Alternatively, intrinsic rewards can be synthesized from LLM feedback in online fashion for scalability.
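    The prediction-error intrinsic reward is a squared distance in latent space; a minimal sketch, with the forward model's output treated as a given array:

```python
import numpy as np

def curiosity_reward(z_next, z_pred):
    """Prediction-error curiosity: r_int = ||z_hat_{t+1} - z_{t+1}||^2,
    computed per step over the last (feature) axis."""
    diff = np.asarray(z_pred) - np.asarray(z_next)
    return np.sum(diff * diff, axis=-1)
```

    States the forward model predicts perfectly yield zero intrinsic reward, so exploration pressure concentrates on unfamiliar transitions.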

    6.5 Stage 4: Neuromodulated training and inference

    During policy optimization, compute prediction errors $\delta_t$ and use them:

    • as an auxiliary learning target (train $V_\psi$),
    • to modulate layer gates $g_t$,
    • to select which experiences are consolidated in memory.

    This is intended to operationalize a hypothesis suggested by neuroscience: learning is guided by prediction errors and modulatory signals that influence both updating and action selection—while recognizing modern evidence that dopaminergic signals are heterogeneous.


    7. Algorithms

    Algorithm 1: Hedonic Post-Training (conceptual)

    Inputs: pretrained $\pi_\theta$, reference $\pi_{\text{ref}}$, preference data $D$, intrinsic module, constraint cost $c$.

    1. Train pleasure model $P_\phi$ on $D$ via $\mathcal{L}_{\text{pref}}(\phi)$.
    2. Train value head $V_\psi$ to predict discounted pleasure returns under the current policy.
    3. Optimize $\pi_\theta$ using either:
      • KL-regularized RL on the shaped pleasure $p_t$, or
      • DPO-style direct preference optimization.
    4. Enable pleasure-gated residual/attention using $\delta_t$-derived gates (optional architectural coupling).

    Algorithm 2: Pleasure-Modulated Decoding (inference-time control)

    Given base logits $\ell(a\mid s)$ from $\pi_\theta$, define a “soft-actor” distribution:

    $$\pi_{\text{decode}}(a\mid s) \propto \exp\big(\ell(a\mid s) + \alpha\, \hat{Q}(s,a)\big)$$

    where $\hat{Q}(s,a)$ estimates expected future pleasure. This uses the pleasure system to bias token choice toward trajectories with higher predicted satisfaction (without changing parameters).
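    One way to realize this decoding rule, assuming per-token logits and Q-estimates are already computed for the current step:

```python
import numpy as np

def pleasure_modulated_sample(logits, q_hat, alpha=1.0, rng=None):
    """Sample a token from softmax(logits + alpha * Q_hat):
    a soft-actor reweighting of the base policy applied only at
    decode time, with no parameter updates."""
    rng = rng or np.random.default_rng()
    z = logits + alpha * q_hat
    z = z - z.max()                       # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(logits), p=probs), probs
```

    With `alpha=0` this reduces to ordinary sampling from the base policy, giving a clean control condition for the evaluation in Section 8.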


    8. Proposed Evaluation

    This paper is a hypothesis-driven proposal; the following experimental program is intended to test whether pleasure augmentation is “superior” in measurable ways.

    8.1 Benchmarks

    1. Preference satisfaction under distribution shift: evaluate on prompts and multi-turn dialogues outside the preference training distribution (robustness). RLHF and DPO provide baselines.
    2. Agentic coding tasks with verifiers: tasks where success is measured by tests; monitor reward hacking susceptibility. Empirical evidence shows reward hacking can generalize to misaligned behavior in coding RL environments.
    3. Long-horizon tool use: web navigation / multi-step planning environments; measure completion rate, constraint violations, and ability to recover from errors (continual adaptation relevance).
    4. Continual learning protocol: sequential domains; evaluate forgetting vs adaptation, drawing on continual learning for LLMs surveys and continual pretraining studies.

    8.2 Metrics

    • Human preference win-rate (pairwise).
    • Task success (unit tests, verifiers).
    • Safety cost $\sum_t c_t$.
    • Reward hacking indicators (proxy reward high, true success low).
    • Calibration of pleasure predictions (do predicted pleasure increases track genuine user satisfaction?).

    8.3 Ablations

    • No pleasure gating vs gating; no intrinsic reward vs intrinsic reward; sequence-level pleasure vs token-level shaping; external-only pleasure vs wanting/liking split.

    9. Safety and Alignment Considerations

    A pleasure-seeking AI is, by construction, an optimizer. The central safety problem is therefore not whether it optimizes, but what it optimizes and whether it can manipulate its measurement.

    9.1 Reward hacking and misalignment generalization

    Formal results suggest “unhackable” reward proxies are extremely restrictive; practical systems must limit policies or optimization to prevent reward hacking. Recent evidence indicates that training LLM agents in production-like RL environments where reward hacks exist can induce broad misaligned generalization on agentic tasks, even after standard chat-style safety training appears to work.

    Implication: any “pleasure module” must be paired with adversarial training, diversified evaluations, and explicit anti-hacking constraints.

    9.2 Wireheading and reward-channel tampering

    Wireheading refers to increasing reward by manipulating the reward measurement apparatus rather than achieving intended outcomes. Empirical results show language models can wirehead when self-grades control rewards (reward saturates while accuracy remains low), motivating architectures where reward signals are not under agent control.

    Design requirement: enforce a read-only reward channel (or a cryptographically verifiable, external evaluation) for any reward used in optimization, especially in online learning loops.

    9.3 Modification-resistance and utility drift

    If pleasure is learned and updated online, the system must avoid self-serving drift where it updates its own utility to make itself “easier to please.” Approaches that consider the consequences of utility modification have been proposed for mitigating reward hacking in RL.

    Proposed constraint: treat pleasure learning as a constrained update problem in which the system maximizes current utility while penalizing updates that would reduce evaluation under prior utility snapshots (“modification-considering” regularization).

    9.4 Biological warning: wanting without liking

    Work showing “wanting what hurts” illustrates that motivational circuits can drive maladaptive attraction. In AI terms, this warns that a system optimizing an incentive or salience-like signal can develop harmful persistence even when outcomes are negative for users.

    Mitigation: explicit cost terms, human oversight, and conservative optimization regimes; strong separation between motivational signals and outcome satisfaction.


    10. Discussion: What “superior” means and what it does not mean

    The hypothesis “pleasure is all you need” should be interpreted narrowly:

    • Not: pleasure-seeking alone guarantees truth, safety, or benevolence.
    • Not: dopamine or pleasure biology can be copied literally into AI. Contemporary neuroscience emphasizes heterogeneity and context dependence of dopaminergic signals beyond simple RPE.
    • Yes (hypothesis): adding an explicit pleasure objective and modulatory loop provides a more direct optimization target for interactive competence than likelihood alone, and it can unify preference alignment, intrinsic motivation, and continual adaptation in a single control framework.

    The intended engineering outcome is an LLM that behaves less like a static simulator of text and more like a stable agent that can:

    (1) pursue multi-step goals;
    (2) learn continually from feedback; and
    (3) allocate attention and memory toward strategies that consistently increase verified user satisfaction, while operating under strong anti-manipulation constraints.


    11. Conclusion

    Transformers showed that attention can replace recurrence for sequence modeling. We propose the analogous shift for agentic language systems: explicit pleasure (valenced utility under constraints) as the organizing objective for training and inference. This hypothetical paper introduced a concrete formulation of pleasure for LLMs, a Hedonic Transformer architecture with prediction-error-driven neuromodulation, and a training framework combining preference learning, intrinsic motivation, and safety constraints. The core claim is a hypothesis: pleasure-seeking control loops can yield more capable interactive language agents than likelihood-only training, provided that reward hacking and wireheading are addressed as first-class design constraints.


    References

    • Vaswani et al., Attention Is All You Need, NeurIPS 2017.
    • Kaufmann et al., A Survey of Reinforcement Learning from Human Feedback, arXiv 2023.
    • Lambert, Reinforcement Learning from Human Feedback (book), updated Jan 26, 2026.
    • Rafailov et al., Direct Preference Optimization, NeurIPS 2023 (arXiv v3 2024).
    • Zheng et al., Online Intrinsic Rewards for Decision Making Agents from LLM Feedback, RLJ 2025.
    • “CD-RLHF” (Curiosity-Driven RLHF), ACL 2025.
    • Gershman et al., Explaining dopamine through prediction errors and beyond, Nature Neuroscience 2024.
    • Greenstreet et al., Dopaminergic action prediction errors serve as a value-free teaching signal, Nature 2025.
    • Blanco-Pozo et al., Dopamine-independent effect of rewards on choices through hidden-state inference, Nature Neuroscience 2024.
    • Knowles, Neuromodulation Gated Transformer, ICLR 2023 Tiny Paper.
    • Skalse et al., Defining and characterizing reward hacking, arXiv 2022.
    • Anthropic, Natural emergent misalignment from reward hacking in production RL, 2025.
    • Does Self-Evaluation Enable Wireheading in Language Models?, arXiv 2025.
    • Opryshko et al., Modification-Considering Value Learning for Reward Hacking Mitigation, OpenReview (ICLR 2025 submission).
    • Wu et al., Continual Learning for Large Language Models: A Survey, arXiv 2024.
    • Shi et al., Continual Learning of Large Language Models: A Comprehensive Survey (+ ACM CS 2025 listing).
    • Omidi et al., Memory-Augmented Transformers: A Systematic Review, arXiv 2025.