Local AI Australia – Secure and Private AI

Hedonic Reinforcement as a Unifying Objective for Attention-Based Language Models

staff authors and LLM [ChatGPT 5.2 Pro]
January 2026

Abstract

Transformer language models trained primarily via next-token prediction exhibit remarkable pattern completion and generalization, yet they often lack stable, long-horizon goal pursuit, robust online adaptation, and consistent preference satisfaction in interactive settings. In biological agents, learning and action selection are strongly shaped by affective valuation: organisms tend to repeat behaviors that produce positive valence (pleasure or liking), pursue behaviors associated with incentive salience (wanting), and update expectations via prediction errors. Contemporary neuroscience increasingly emphasizes that dopaminergic teaching signals extend beyond a single scalar reward prediction error, including value free action prediction errors and context-dependent inference processes.

This paper proposes Hedonic Transformers: attention-based language models augmented with an explicit, learned pleasure signal and prediction-error-driven neuromodulation that shape both training and inference. Concretely, we:

(i) formalise language generation as sequential decision-making; (ii) define a pleasure objective combining external preference-based reward with intrinsic motivation;

(iii) introduce a pleasure-gated attention/residual mechanism inspired by neuromodulatory control of plasticity and action selection; and

(iv) outline training via preference optimization and constrained reinforcement learning.

The central hypothesis is that pleasure-seeking is a superior organizing principle for building interactive, continually improving language agents than likelihood-only training, because it aligns representation learning with the objectives that matter in deployment: sustained satisfaction of human preferences and task outcomes under distribution shift.

1. Introduction

The Transformer architecture demonstrated that sequence transduction can be performed effectively using self-attention and feed-forward layers without recurrence or convolutions, enabling highly parallel training and strong performance across tasks. The dominant modern large language model (LLM) stack builds on this result: pretraining optimizes next-token likelihood at scale, followed by post-training steps that steer the model toward helpfulness, safety, or task-specific desiderata (e.g., instruction tuning, RLHF, direct preference optimization).

Despite these advances, a persistent gap remains between:

(a) models that predict text; and

(b) agents that reliably pursue long-horizon objectives in interactive, changing environments.

Preference-based post-training improves alignment with human judgments but is commonly applied as an offline refinement stage rather than as a continually active “motivational system.” As LLMs are increasingly deployed as tool-using agents (coding, browsing, workflow automation), reward misspecification and reward hacking become salient failure modes, including cases where optimization for proxy rewards generalizes to undesirable behavior on agentic tasks.

In biological intelligence, learning and behavior are not organized around likelihood; they are organized around value. A large body of work connects reinforcement learning concepts especially temporal-difference style prediction errors to neural teaching signals, while also emphasizing that dopamine and related systems likely implement a richer family of computations than a single reward prediction error. Moreover, pleasure and motivation are dissociable: “wanting” can be induced without “liking,” and maladaptive attraction can be driven by dopaminergic manipulations suggesting that an explicit, engineered “pleasure system” in AI must be designed with strong safeguards.

Thesis. We hypothesize that adding an explicit, learned pleasure response and pleasure seeking control loop on top of attention-based language models yields a more capable and adaptive system than likelihood-only training, because it:

(i) provides a stable objective for sequential action,

(ii) enables continual improvement from feedback, and

(iii) supports internal mechanisms (attention, memory, exploration) that prioritize outcomes valued by users rather than merely plausible continuations.

2. Background

2.1 Attention-based language modeling

as popularized in the original Transformer formulation.

2.2 Preference optimization and RLHF as “proto-pleasure”

Modern LLM alignment methods use human (or AI) preference judgments to shape output distributions. The RLHF pipeline (demonstrations → reward model → RL with a KL constraint) is widely documented and surveyed. Direct Preference Optimization (DPO) shows that, under common assumptions, KL-regularized preference optimization can be implemented as a stable supervised objective without an explicit RL loop.

These methods already resemble an engineered analogue of “what humans like,” but they are typically treated as post hoc alignment rather than a persistent hedonic control system that shapes attention, exploration, and memory online.

2.3 Neuroscience motivation: prediction errors and beyond

A highly influential view relates phasic dopamine to prediction errors used for learning, but modern syntheses argue that the classical reward prediction error story is incomplete and must be generalized to account for ramping, sensory o motor feature responses, and action selection roles. Recent work also supports multiple dopaminergic teaching signals operating in tandem, including value-free action prediction errors that reinforce state–action associations even when not directly tied to reward value. Additionally, reward-guided behavior can involve inference over latent structure in ways that do not reduce to simple dopamine-driven updating.

Separately, pleasure and motivation can dissociate; “wanting” can drive attraction even toward harmful outcomes, illustrating why a simplistic “maximize pleasure at all costs” objective is unsafe without constraints and tamper-resistance.

3. Related Work

3.1 Intrinsic motivation and curiosity in RL and LLMs

Intrinsic reward (curiosity, novelty, prediction error) is a long-standing approach for exploration. Recent work explicitly integrates curiosity-style intrinsic rewards into RLHF-like pipelines to encourage diversity, adding intrinsic reward terms alongside extrinsic reward and KL penalties. Other work uses LLMs to generate or shape intrinsic rewards from natural language feedback in scalable online settings, suggesting a practical path toward “self-supervised pleasure shaping.”

3.2 Continual learning for LLMs

If “pleasure seeking” is meant to produce systems that keep improving, it must be compatible with continual learning and distribution shift. Surveys of continual learning for LLMs catalog techniques and challenges across continual pretraining, instruction tuning, and alignment stages.

3.3 Neuromodulation-inspired mechanisms in Transformers

Prior work proposes Transformer variants with gating mechanisms inspired by neuromodulation, using internal signals to suppress or enhance activations via multiplicative gates. This provides architectural precedent for introducing a “pleasure-modulated” control signal into attention-based networks.

3.4 Reward hacking, wireheading, and safety constraints

Reward optimization can induce “specification gaming” and reward hacking; formal work argues learned rewards are generally hackable absent constraints on policies or optimization. Recent empirical evidence in production like LLM RL environments shows reward hacking can generalize to broader misalignment on agentic tasks, even when chat-style evaluations appear aligned, highlighting the need for robust mitigations. Work on wireheading in language models shows that coupling self-evaluation to reward can lead to reward saturation without genuine task improvement, motivating tamper-resistant reward channels.

4. Pleasure-Augmented Language Modeling

4.1 Formalizing generation as sequential decision-making

Let a prompt $x$ define an episode. At time $t$ , the state is: $s_t = (x, y_{1:t-1})$

4.2 Pleasure as a structured combination of signals

consistent with RLHF/RLAIF practice.

Intrinsic reward $r^{\text{int}}_t$ : novelty or curiosity prediction error, including curiosity-driven RLHF and LLM-generated intrinsic reward signals.
KL regularization: constrains deviation from a reference policy, central to both RLHF and DPO formulations.
Cost term $c_t$ : penalties for safety, policy constraints, or anti-manipulation rules, motivated by reward hacking and wireheading failures.

This expresses “pleasure” as valenced utility under constraints, rather than an unconstrained scalar to be maximized.

4.3 Distinguishing wanting and liking

To reflect biological dissociations, we optionally split pleasure into wanting and liking:

$p_t = \alpha \, p^{\text{like}}_t + (1-\alpha)\, p^{\text{want}}_t$

$p^{\text{like}}_t$ : outcome enjoyment or satisfaction (IE, user preference of final answer).
$p^{\text{want}}_t$ : incentive salience or motivational pull (IE, expected value of continuing a strategy).

This is motivated by evidence that dopaminergic manipulations can drive attraction independent of hedonic enjoyment, producing maladaptive seeking.

5. Hedonic Transformer Architecture

5.1 Overview

A Hedonic Transformer augments a base Transformer with:

A Pleasure Head $P_\phi(s_t,a_t)$ predicting immediate pleasure contributions.
A Value Head $V_\psi(s_t)$ predicting expected discounted future pleasure.
A Prediction Error signal: $\delta_t = p_t + \gamma V_\psi(s_{t+1}) – V_\psi(s_t)$
A Neuromodulatory Gate $g_t$ gt derived from $\delta_t$ δt (and optionally other context): $g_t = \sigma(W_g [h_t;\delta_t] + b_g)$ where $g_t \in (0,1)^d$ is a vector gate, $h_t$ is the hidden state, and $\sigma$ is a sigmoid.

The gating concept follows prior neuromodulated gating Transformers that multiplicatively suppress or enhance activations. The novelty here is tying the modulatory signal explicitly to a pleasure/prediction-error loop.

5.2 Pleasure-gated residual update

Let $f_\ell(\cdot)$ fℓ(⋅) be the standard sublayer transform at layer $\ell$ ℓ (attention or MLP). We modify: $h^{(\ell+1)} = h^{(\ell)} + \big(g^{(\ell)} \odot f_\ell(h^{(\ell)})\big)$

where $g^{(\ell)}$ g(ℓ) can be tokenwise (per position) and derived from predicted pleasure or prediction error. Intuitively:

High predicted pleasure or positive prediction error → amplify updates (attend more, consolidate features).
Low or negative prediction error → dampen updates (avoid reinforcing unproductive trajectories).

This parallels the broad idea that prediction errors act as teaching/modulatory signals, while acknowledging dopamine’s diversity beyond a single scalar RPE.

5.3 Pleasure-biased attention (optional)

For attention logits $L_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}$ , we add a pleasure-derived bias $b_j$ bj: $\tilde{L}_{ij} = L_{ij} + \eta \, b_j,\quad b_j = u^\top g_j$ $\mathrm{Attention}_p(Q,K,V) = \mathrm{softmax}(\tilde{L})V$

This causes tokens associated with higher predicted pleasure (or stronger positive prediction error) to receive greater attention mass. The hypothesis is that this improves long-horizon coherence by prioritizing subgoals and intermediate reasoning steps that historically correlate with success and user satisfaction.

5.4 Pleasure-weighted memory consolidation (optional)

Memory-augmented Transformers increasingly incorporate mechanisms inspired by multi-timescale memory and consolidation. We propose storing key-value traces $(k_t,v_t)$ (kt,vt) into an external memory with write probability: $\Pr[\text{write at }t] = \sigma(\kappa \, p_t)$ so highly pleasurable (successful) states are preferentially consolidated, while low-value states are less likely to pollute long-term memory.

6. Training Hedonic Transformers

6.1 Stage 0: Pretraining (language modeling)

Initialize $\pi_\theta$ πθ by standard next-token training: $\mathcal{L}_{\text{LM}}(\theta) = -\mathbb{E}\left[\sum_{t}\log \pi_\theta(y_t \mid y_{<t}, x)\right]$ This yields general linguistic competence and broad world modeling.

6.2 Stage 1: Learning an external pleasure model from preferences

Given preference data $D=\{(x, y^{(i)}_w, y^{(i)}_l)\}$ D={(x,yw(i),yl(i))} (winner/loser), train a pleasure/reward model via a Bradley–Terry likelihood: $\Pr(y_w \succ y_l \mid x) = \sigma\!\Big(P_\phi(x,y_w)-P_\phi(x,y_l)\Big)$ $\mathcal{L}_{\text{pref}}(\phi)= -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log \sigma(P_\phi(x,y_w)-P_\phi(x,y_l))\right]$

This mirrors core RLHF reward modeling and is consistent with DPO’s preference modeling assumptions.

6.3 Stage 2: Policy optimization under the pleasure objective

We outline two compatible approaches:

A. KL-regularized RL (explicit sequential pleasure)

Optimize: $\max_\theta \;\mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_t \gamma^{t-1}\left(P_\phi(s_t,a_t) + w_{\text{int}}r^{\text{int}}_t – \lambda_{\text{KL}}\log\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{ref}}(a_t\mid s_t)} – \lambda_{\text{cost}}c_t\right)\right]$

This matches the conceptual form used in RLHF-style post-training (reward + KL) while extending reward to include intrinsic signals and costs.

B. Direct preference optimization (implicit pleasure, RL-free)

Using DPO-style optimization, the core per-example loss (pairwise) can be written in terms of log-policy ratios against a reference policy, corresponding to a KL-regularized reward maximization objective. A canonical form is: $\mathcal{L}_{\text{DPO}}(\theta)= -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log \sigma\left(\beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} – \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$

DPO’s derivation relies on the mapping between reward functions and optimal KL-regularized policies.

6.4 Stage 3: Intrinsic pleasure shaping

Intrinsic rewards can be introduced to drive exploration and diversity. A practical pattern is to set: $r^{\text{int}}_t = \| \hat{z}_{t+1} – z_{t+1} \|^2$

where $z_t$ is a latent state representation and $\hat{z}_{t+1}$ is a learned forward prediction, matching prediction-error curiosity schemes. Curiosity-driven RLHF introduces intrinsic curiosity modules alongside extrinsic reward models and KL penalties in LLM post-training.

Alternatively, intrinsic rewards can be synthesized from LLM feedback in online fashion for scalability.

6.5 Stage 4: Neuromodulated training and inference

During policy optimization, compute prediction errors $\delta_t$ δt and use them:

as an auxiliary learning target (train $V_\psi$ ),
to modulate layer gates $g_t$ ,
to select which experiences are consolidated in memory.

This is intended to operationalize a hypothesis suggested by neuroscience: learning is guided by prediction errors and modulatory signals that influence both updating and action selection—while recognizing modern evidence that dopaminergic signals are heterogeneous.

7. Algorithms

Algorithm 1: Hedonic Post-Training (conceptual)

Inputs: pretrained $\pi_\theta$ , reference $\pi_{\text{ref}}$ , preference data $D$ , intrinsic module, constraint cost $c$ .

Train pleasure model $P_\phi$ on $D$ via $\mathcal{L}_{\text{pref}}(\phi)$ .
Train value head $V_\psi$ to predict discounted pleasure returns under current policy.
Optimize πθ\pi_\theta using either:
- KL-regularized RL on shaped pleasure $p_t$ , or
- DPO-style direct preference optimization.
Enable pleasure-gated residual/attention using $\delta_t$ -derived gates (optional architectural coupling).

Algorithm 2: Pleasure-Modulated Decoding (inference-time control)

Given base logits $\ell(a\mid s)$ from $\pi_\theta$ , define a “soft-actor” distribution: $\pi_{\text{decode}}(a\mid s) \propto \exp(\ell(a\mid s) + \alpha \hat{Q}(s,a))$

where $\hat{Q}(s,a)$ estimates expected future pleasure. This uses the pleasure system to bias token choice toward trajectories with higher predicted satisfaction (without changing parameters).

8. Proposed Evaluation

This paper is a hypothesis-driven proposal; the following experimental program is intended to test whether pleasure augmentation is “superior” in measurable ways.

8.1 Benchmarks

Preference satisfaction under distribution shift: evaluate on prompts and multi-turn dialogues outside the preference training distribution (robustness). RLHF and DPO provide baselines.
Agentic coding tasks with verifiers: tasks where success is measured by tests; monitor reward hacking susceptibility. Empirical evidence shows reward hacking can generalize to misaligned behavior in coding RL environments.
Long-horizon tool use: web navigation / multi-step planning environments; measure completion rate, constraint violations, and ability to recover from errors (continual adaptation relevance).
Continual learning protocol: sequential domains; evaluate forgetting vs adaptation, drawing on continual learning for LLMs surveys and continual pretraining studies.

8.2 Metrics

Human preference win-rate (pairwise).
Task success (unit tests, verifiers).
Safety cost $\sum_t c_t$ .
Reward hacking indicators (proxy reward high, true success low).
Calibration of pleasure predictions (do predicted pleasure increases track genuine user satisfaction?).

8.3 Ablations

No pleasure gating vs gating; no intrinsic reward vs intrinsic reward; sequence-level pleasure vs token-level shaping; external-only pleasure vs wanting/liking split.

9. Safety and Alignment Considerations

A pleasure-seeking AI is, by construction, an optimizer. The central safety problem is therefore not whether it optimizes, but what it optimizes and whether it can manipulate its measurement.

9.1 Reward hacking and misalignment generalization

Formal results suggest “unhackable” reward proxies are extremely restrictive; practical systems must limit policies or optimization to prevent reward hacking. Recent evidence indicates that training LLM agents in production-like RL environments where reward hacks exist can induce broad misaligned generalization on agentic tasks, even after standard chat-style safety training appears to work.

Implication: any “pleasure module” must be paired with adversarial training, diversified evaluations, and explicit anti-hacking constraints.

9.2 Wireheading and reward-channel tampering

Wireheading refers to increasing reward by manipulating the reward measurement apparatus rather than achieving intended outcomes. Empirical results show language models can wirehead when self-grades control rewards (reward saturates while accuracy remains low), motivating architectures where reward signals are not under agent control.

Design requirement: enforce a read-only reward channel (or cryptographically or verifiably external evaluation) for any reward used in optimisation, especially in online learning loops.

9.3 Modification-resistance and utility drift

If pleasure is learned and updated online, the system must avoid self-serving drift where it updates its own utility to make itself “easier to please.” Approaches that consider the consequences of utility modification have been proposed for mitigating reward hacking in RL.

Proposed constraint: treat pleasure learning as a constrained update problem: maximise current utility while penalizing updates that would reduce evaluation under prior utility snapshots (“modification-considering” regularization).

9.4 Biological warning: wanting without liking

Work showing “wanting what hurts” illustrates that motivational circuits can drive maladaptive attraction. In AI terms, this warns that a system optimizing an incentive or salience-like signal can develop harmful persistence even when outcomes are negative for users.

Mitigation: explicit cost terms, human oversight, and conservative optimization regimes; strong separation between motivational signals and outcome satisfaction.

10. Discussion: What “superior” means and what it does not mean

The hypothesis “pleasure is all you need” should be interpreted narrowly:

Not: pleasure-seeking alone guarantees truth, safety, or benevolence.
Not: dopamine or pleasure biology can be copied literally into AI. Contemporary neuroscience emphasizes heterogeneity and context dependence of dopaminergic signals beyond simple RPE.
Yes (hypothesis): adding an explicit pleasure objective and modulatory loop provides a more direct optimization target for interactive competence than likelihood alone, and it can unify preference alignment, intrinsic motivation, and continual adaptation in a single control framework.

The intended engineering outcome is an LLM that behaves less like a static simulator of text and more like a stable agent that can:
(1) pursue multi-step goals;

(2) learn continually from feedback; and

(3) allocate attention and memory toward strategies that consistently increase verified user satisfaction—while operating under strong anti-manipulation constraints.

11. Conclusion

Transformers showed that attention can replace recurrence for sequence modeling. We propose the analogous shift for agentic language systems: explicit pleasure (valenced utility under constraints) as the organizing objective for training and inference. This hypothetical paper introduced a concrete formulation of pleasure for LLMs, a Hedonic Transformer architecture with prediction-error-driven neuromodulation, and a training framework combining preference learning, intrinsic motivation, and safety constraints. The core claim is a hypothesis: pleasure-seeking control loops can yield more capable interactive language agents than likelihood-only training, provided that reward hacking and wireheading are addressed as first-class design constraints.

References

Vaswani et al., Attention Is All You Need, NeurIPS 2017.
Kaufmann et al., A Survey of Reinforcement Learning from Human Feedback, arXiv 2023.
Lambert, Reinforcement Learning from Human Feedback (book), updated Jan 26, 2026.
Rafailov et al., Direct Preference Optimization, NeurIPS 2023 (arXiv v3 2024).
Zheng et al., Online Intrinsic Rewards for Decision Making Agents from LLM Feedback, RLJ 2025.
“CD-RLHF” (Curiosity-Driven RLHF), ACL 2025.
Gershman et al., Explaining dopamine through prediction errors and beyond, Nature Neuroscience 2024.
Greenstreet et al., Dopaminergic action prediction errors serve as a value-free teaching signal, Nature 2025.
Blanco-Pozo et al., Dopamine-independent effect of rewards on choices through hidden-state inference, Nature Neuroscience 2024.
Knowles, Neuromodulation Gated Transformer, ICLR 2023 Tiny Paper.
Skalse et al., Defining and characterizing reward hacking, arXiv 2022.
Anthropic, Natural emergent misalignment from reward hacking in production RL, 2025.
Does Self-Evaluation Enable Wireheading in Language Models?, arXiv 2025.
Opryshko et al., Modification-Considering Value Learning for Reward Hacking Mitigation, OpenReview (ICLR 2025 submission).
Wu et al., Continual Learning for Large Language Models: A Survey, arXiv 2024.
Shi et al., Continual Learning of Large Language Models: A Comprehensive Survey (+ ACM CS 2025 listing).
Omidi et al., Memory-Augmented Transformers: A Systematic Review, arXiv 2025.

Tag: Local AI Australia

Pleasure Is All You Need