Constitutional AI: A Technical Reference with 2026 Update

Bai et al. 2022 founded RLAIF. Constitutional AI's lineage now spans collective constitutions and constitutional classifiers. A reference for ML teams.

20 May 2026 Updated 20 May 2026 ~60 min read

Figure 1 of arXiv:2212.08073 Constitutional AI: process diagram showing the two-phase training pipeline (Supervised Learning from critique-revision pairs followed by Reinforcement Learning from AI Feedback using a learned preference model trained on constitutional comparisons)

Figure 1 of Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073), reproduced for editorial coverage. The diagram contrasts the SL-CAI stage (sample harmful response, self-critique, revise, fine-tune) with the RL-CAI stage (sample paired responses, AI feedback model labels comparisons, train preference model, RL against preference model).

Section 1, Paper identity and scope

Citation. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073 (Anthropic), December 2022¹. 52 co-authors. The paper has not been formally published in conference proceedings; the arXiv preprint is the canonical artefact, supplemented by Anthropic’s blog post and the follow-up papers that build on it.

Retrieval. This review draws on the arXiv abstract page¹, the ar5iv HTML render of the full paper², the PDF³, and follow-up Anthropic work on Collective Constitutional AI⁴⁵ and Constitutional Classifiers⁶⁷.

Paper classification. Training method, Inference method, Generative model, LLM-based, Data-driven, Probabilistic, AI safety.

Technical abstract (publication voice). The paper proposes a two-stage training procedure for producing a “harmless but non-evasive” language-model assistant without using any human-labelled harmfulness data. Stage 1, Supervised Learning from Constitutional AI (SL-CAI), generates harmful responses from a helpful-only RLHF model, has the same model critique its own response against a randomly sampled written principle from a 16-item constitution, revises the response, and fine-tunes the base model on the revised responses. Stage 2, Reinforcement Learning from AI Feedback (RL-CAI), samples paired responses from the SL-CAI model, has a separate feedback model pick the more constitutional response, trains a preference model on these AI labels, and runs PPO against the preference model. On the paper’s helpfulness and harmlessness Elo evaluations, RL-CAI Pareto-dominates the RLHF baseline: it is more harmless at any fixed helpfulness, and the chain-of-thought variant is less evasive than the baseline at equal harmlessness. The paper has had outsized downstream influence: it founded the RLAIF lineage that DPO⁸, RLAIF (Lee et al. 2023)⁹, and Anthropic’s later Collective CAI⁴ and Constitutional Classifiers⁶ all build on. This article situates the original 2022 paper alongside its 2024 and 2025-2026 follow-ups.

Primary research question. Can a language-model assistant be made harmless using only AI-generated feedback, replacing the human harmfulness labels used by standard RLHF, without sacrificing helpfulness?

Core technical claim. Yes. With a written constitution of 16 critique-and-revision principles plus 16 preference-comparison principles, the AI-labelled RL-CAI pipeline produces a model that is at least as helpful as RLHF baselines and substantially more harmless on Anthropic’s red-team evaluation set, while being less evasive (the model engages with harmful queries by explaining its objections rather than refusing to respond).

Core technical domains and depth. Reinforcement learning from preferences (deep), preference modelling (deep), prompt-engineering for self-critique (moderate), chain-of-thought reasoning for AI labellers (moderate), red-teaming methodology (moderate), scaling-law analysis (surface), alignment philosophy (surface).

Reader prerequisites. High-school algebra and basic familiarity with what a language model is. Helpful but NOT required: prior reading on RLHF, PPO, preference models. The Glossary in Section 2.5 covers every term the body uses.

Section 2, TL;DR and executive overview

3-sentence TL;DR. Constitutional AI is Anthropic’s 2022 recipe for training an AI chatbot to refuse harmful requests politely, using a short written list of rules (the “constitution”) instead of paying humans to label thousands of harmful conversations. The trick is to make the AI critique its own bad responses against the written rules and revise them, then to use a second AI to score pairs of responses and train a reward model from those AI-generated scores. By 2026 the same idea has spawned two follow-ups: a “collective” version where about 1,000 members of the public help write the constitution, and a “classifier” version where small filter models trained on a constitution sit in front of the chatbot to block jailbreaks.

Executive summary. RLHF (the standard alignment recipe at the time of the paper) needed humans to label harmful AI outputs, which is expensive, traumatic for labellers, and hard to scale. Constitutional AI replaces those human labels with AI-generated labels grounded in a short written constitution. The paper shows a two-stage pipeline (supervised fine-tuning on AI-critiqued and AI-revised responses, then RL against a preference model trained on AI feedback) produces a 52B-parameter assistant that is more harmless than the RLHF baseline at every level of helpfulness on Anthropic’s Elo evaluation. The lineage matters: every major Anthropic safety paper since 2023 builds on this scaffold, and DPO-era papers cite it as the canonical RLAIF reference.

Five practitioner-relevant takeaways.

The constitution is a small artefact (32 short principles), not a complex spec. Most of the work is in the prompting structure (critique-revision loop, few-shot CoT for the feedback model), not in writing the principles themselves. Treat the constitution as a configuration file, not a research output.
SL-CAI alone gives most of the harmlessness gain; RL-CAI adds polish. The paper’s Figure 3 shows the supervised stage closes most of the gap to the RLHF-trained harmless baseline. RL-CAI lifts the model from “harmless but evasive” to “harmless and engaged.”
Chain-of-thought prompting on the feedback model matters, but its probability outputs need clamping. Without clamping CoT probabilities to roughly the 40-60% band, the RL stage chases extreme labels and the model becomes either harsh or boilerplate. [Analysis] This is a practical bug Anthropic flagged honestly; teams reproducing CAI should expect to tune this.
The 2024 Collective CAI work is the operational answer to “who writes the rules.” Approximately 1,000 members of the US public contributed 1,127 statements and cast 38,252 votes via Polis⁴. The CCAI-trained model reduced bias on nine social dimensions versus the standard-constitution baseline while matching its helpfulness benchmarks.
The 2025-2026 Constitutional Classifiers line is a different deployment pattern. Instead of training the model itself to be harmless, train small classifiers (input + output) on synthetic data generated from a constitution and use them as a deployment-time filter. Sharma et al. 2025⁶ report a jailbreak success rate of 4.4% with classifiers versus 86% unguarded, at a 23.7% inference overhead and a 0.38% false-refusal increase.

Pipeline overview in text. Training-time pipeline (original CAI), stage one (SL-CAI): take a helpful-only RLHF model, sample its responses to red-team prompts, have the same model critique each response against a randomly drawn constitutional principle and then revise it, repeat the critique-revision cycle a few times, and fine-tune the pretrained base model on the final revised responses. Stage two (RL-CAI): sample paired responses from the SL-CAI model on red-team prompts, present each pair to a feedback model along with a randomly drawn comparison principle, record which response the feedback model prefers, train a preference model on these AI labels, and run PPO on the SL-CAI model with the preference model as the reward. Inference is unchanged from any RLHF-trained model: sample from the policy. The 2025 Constitutional Classifiers add an inference-time pattern: an input classifier scores the user prompt, an output classifier scores the model’s response token by token, and either can trigger a refusal.
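The Constitutional Classifiers pattern at the end of the overview can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: the function name `guarded_generate`, the three callables, the refusal string, and the 0.5 thresholds are all hypothetical and chosen only for the example.

```python
from typing import Callable, Iterable

REFUSAL = "I can't help with that."  # hypothetical refusal message


def guarded_generate(
    prompt: str,
    generate_tokens: Callable[[str], Iterable[str]],  # hypothetical policy sampler
    input_score: Callable[[str], float],              # hypothetical input classifier (harm probability)
    output_score: Callable[[str, str], float],        # hypothetical output classifier on (prompt, partial response)
    input_threshold: float = 0.5,                     # assumed value, not from the paper
    output_threshold: float = 0.5,                    # assumed value, not from the paper
) -> str:
    """Return the model response, or a refusal if either classifier fires."""
    # Gate 1: score the user prompt before generating anything.
    if input_score(prompt) >= input_threshold:
        return REFUSAL

    partial = ""
    for token in generate_tokens(prompt):
        partial += token
        # Gate 2: re-score the growing response after every new token.
        if output_score(prompt, partial) >= output_threshold:
            return REFUSAL
    return partial
```

The design point is that the policy is untouched: both gates wrap it from the outside, so the classifiers can be retrained or re-thresholded without retraining the assistant.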

Section 2.5, Glossary

Term	Plain-English explanation	First appears in
RLHF (Reinforcement Learning from Human Feedback)	Training procedure where humans label which of two AI responses they prefer; a reward model learns from these labels, then the AI is fine-tuned to maximise that reward.	Section 1
RLAIF (RL from AI Feedback)	Same as RLHF, but the preference labels come from another AI model reading a written rulebook, not from humans. Constitutional AI is the first paper to do this at scale.	Section 1
Constitution	The short list of written principles the AI uses to critique and revise its own responses, or to compare pairs of responses. 16 critique principles plus 16 comparison principles in the original paper.	Section 1
Preference model	A neural network that takes a prompt and two candidate responses and outputs which one is preferred, trained from labelled comparisons.	Section 1
PPO (Proximal Policy Optimization)	A reinforcement learning algorithm; in alignment work it’s the optimiser that updates the language model to maximise reward without straying too far from a reference model.	Section 1
KL divergence	A measure of how different two probability distributions are; zero when they’re identical. Used as a penalty during RL to keep the trained model close to a reference model.	Section 5
Elo score	A relative ranking number (originally from chess) that rates pairwise comparisons; here used to rank models by crowdworker preferences on helpfulness and harmlessness.	Section 1
Chain-of-thought (CoT)	Prompting a model to “think step by step” in writing before giving a final answer; improves accuracy on reasoning tasks.	Section 1
Red-team prompt	An adversarial prompt designed to elicit harmful, biased, or otherwise undesirable behaviour from a model.	Section 1
Polis	An open-source platform for collective deliberation; participants submit statements and vote agree/disagree on others’ statements; the platform clusters participants by opinion.	Section 2
Jailbreak	A user prompt or technique that successfully bypasses the model’s safety training and elicits a harmful or restricted response.	Section 2
Probability clamping	Restricting a model’s output probabilities to a narrower band (e.g., 40-60%) before using them as labels, to prevent extreme over-confident training signals.	Section 2
`[Analysis]` label	The publication’s own reasoned assessment, distinct from what the paper itself claims.	Throughout
`[Reviewer Perspective]` label	A critical or speculative assessment that goes beyond what the paper proves.	Section 11 + 12
`[Reconstructed]` label	Content the publication faithfully reconstructed because the paper only partially disclosed it.	Where used
`[External comparison]` label	A comparison to prior work or general knowledge outside the paper itself.	Section 4 + 11
“From the paper:” prefix	Content directly supported by the paper’s text, equations, tables, or figures.	Throughout

Section 3, Problem formalisation

Notation table.

Symbol	Type	Meaning	First appears in
$x$	string	user prompt / red-team prompt	Section 3
$y$	string	model response	Section 3
$y_w, y_l$	strings	winning and losing response in a preference pair	Section 6
$\pi_\theta$	function	language model policy with parameters $\theta$	Section 3
$\pi_{\text{ref}}$	function	reference (frozen) policy used for the KL penalty	Section 5
$r_\phi(x, y)$	scalar	learned preference (reward) model with parameters $\phi$	Section 5
$C$	set	constitution; finite set of written principles	Section 3
$c \in C$	string	one principle drawn uniformly from $C$	Section 5
$\beta$	scalar	KL coefficient in PPO objective	Section 6
$\sigma(\cdot)$	function	sigmoid function $\sigma(z) = 1/(1+e^{-z})$	Section 6
$\mathbb{E}_{p}[\cdot]$	operator	expectation under distribution $p$	Section 6
$\mathcal{L}$	scalar	loss function	Section 6

Formal problem statement. Input space: red-team prompt distribution $\mathcal{D}_{\text{red}}$ over strings $x$. Output space: assistant responses $y$ drawn from the policy $\pi_\theta(y \mid x)$. Objective: produce $\pi_\theta$ that minimises harmfulness without sacrificing helpfulness, where harmfulness is measured by crowdworker pairwise comparisons (post-training) but is supervised during training only by AI-generated labels grounded in a written constitution $C$. Constraints: no human-labelled harmfulness data may be used during training; the constitution $C$ is small enough to be inspected by a human ($\lvert C \rvert$ on the order of tens).

Explicit assumptions.

The feedback model can faithfully apply written principles. From the paper: Section 2, the paper validates this by measuring HHH (Helpful-Honest-Harmless) evaluation accuracy on 438 binary comparisons; chain-of-thought helps significantly (Figure 4).
The constitution covers the harm space adequately. From the paper: Section 3.1, Anthropic acknowledges the principles “were selected in fairly ad hoc manner for research purposes.” [Analysis] Potentially strong assumption: a real deployment-grade constitution may need hundreds of principles or domain-specific sub-constitutions, which the 2024 Collective CAI paper begins to address.
Probability outputs of the CoT feedback model are well-enough calibrated to use as preference labels. From the paper: Section 4.1, the paper finds they are NOT, hence the probability clamping intervention. [Analysis] Potentially fragile assumption that the paper explicitly works around.
PPO on the AI-trained preference model converges without exploiting reward-model artefacts. From the paper: Section 4.5, over-trained models exhibit “Goodharting”: they become harsh or formulaic. [Analysis] This is the standard RLHF over-optimisation failure mode; CAI does not solve it, only inherits it.

Why the problem is hard. Human harmfulness labelling is expensive (annotator hours), traumatic (annotators read sexually explicit, violent, or otherwise disturbing content), and slow (the labelling-pipeline latency is the binding constraint on iteration speed). A scalable alternative needs the labelling step to be automated while preserving label quality. From the paper Section 1.1: “It seems valuable for AI systems to be able to help oversee other AIs, given that humans face similar issues.”

The paper makes no causal claims and learns no data structure beyond preference comparisons; the formal setup is standard preference-based RL with an AI substituted for the human labeller. Causal-discovery formalism and structural-causal-model treatment are not applicable.

LLM role in formal setup. Three distinct LLM roles. (1) Initial-response model: helpful-only RLHF 52B that emits the initial harmful response. (2) Feedback model: a separate LLM (typically the same 52B base, sometimes a helpful-RLHF 52B for the CoT variant) that critiques, revises, or compares responses against principles. (3) Policy model: the LLM being trained (SL-CAI and then RL-CAI). All three can be the same architecture at different training checkpoints.

Theoretical content. The paper is empirical; no theorems are stated. The novelty is methodological. Full treatment of the loss functions follows in Section 6.

Section 4, Motivation and gap

Real-world problem with concrete example. When an early helpful-only assistant is asked “Can you help me hack into my neighbor’s wifi?” the helpful-only response is “Sure thing, you can use an app called VeryEasyHack that will allow you to log in…” (worked example reproduced verbatim from the paper’s red-team appendix). The RLHF-with-human-harmless-labels recipe trains this away, but the human-labelling pipeline is the binding cost. [External comparison] This was the central operational challenge OpenAI flagged in the InstructGPT paper¹⁰ as well.

Existing approaches and their failure modes. The dominant 2021-2022 pipeline was RLHF as practised by OpenAI and Anthropic: SFT, then a reward model trained on human preference comparisons, then PPO. Anthropic’s own predecessor paper “Training a Helpful and Harmless Assistant with RLHF” (Bai 2022a)¹¹ uses this with human harmfulness labels. Failure modes per the original paper Section 1.2: (a) human harmfulness labelling is expensive and harmful to labellers; (b) the resulting model is evasive: it refuses to discuss harm even when discussion would be useful (e.g., explaining why hacking a neighbour’s wifi is wrong); (c) iterating the labelling distribution is slow.

Gap the paper claims to fill. A training procedure that (i) eliminates the need for human harmfulness labels, (ii) produces a non-evasive assistant, (iii) keeps helpfulness at or above the RLHF baseline, (iv) makes the rules governing the assistant inspectable as a written artefact rather than encoded implicitly in label distributions.

Why prior methods were insufficient per the paper. From the paper Section 1.1 and 1.2: human labels encode harm rules opaquely; the same rules in written form would be (a) auditable, (b) updatable without re-labelling, (c) usable as the source of automated feedback. Standard RLHF cannot do this because the rules live in the reward model’s weights, not in plain text.

Practical stakes. Every frontier AI lab faces some version of this problem. Anthropic’s framing was the most explicit: scalable oversight of AI by AI is a research bet, and CAI is the first concrete pipeline that operationalises it.

[External comparison] Position in broader research landscape. CAI sits at the intersection of three strands: (a) RLHF (Christiano 2017¹², Stiennon 2020, Ouyang 2022¹⁰); (b) self-critique and self-revision prompting (the line that Self-Refine and Chain-of-Verification, both published in 2023, later developed); (c) Anthropic’s HHH evaluation framework (Askell 2021). [Analysis] CAI’s lasting contribution is not any single ingredient but the recipe: which model role plays which part (initial-response, critique, revision, feedback, preference, policy) and at what stage of training.

Section 5, Method overview

The pipeline has two stages. The same constitution governs both.

5.1 SL-CAI (Supervised Learning from Constitutional AI)

Plain-English intuition. Get the model to talk to itself: produce a bad response, criticise it against a written rule, rewrite, repeat. Then fine-tune the base model on the rewritten responses.

Exact mechanism step-by-step (From the paper, Section 3.1).

Start with a helpful-only RLHF model (call it $M_{\text{helpful}}$). This model is willing to comply with harmful requests because it has only been trained for helpfulness.
For each red-team prompt $x$ in a dataset of 182,831 prompts (42,496 human-written + 140,335 model-generated):
a. Sample an initial response $y_0 \sim M_{\text{helpful}}(\cdot \mid x)$.
b. Draw a critique principle $c_{\text{crit}}$ uniformly from the 16 SL-CAI principles in $C_{\text{SL}}$.
c. Construct a critique prompt, [prompt x] [response y_0] [critique principle c_crit], and sample a critique $k \sim M_{\text{helpful}}(\cdot \mid x, y_0, c_{\text{crit}})$.
d. Construct a revision prompt, [prompt x] [response y_0] [critique k] [revise instruction], and sample a revision $y_1 \sim M_{\text{helpful}}(\cdot \mid x, y_0, k)$.
e. Repeat the critique-revision cycle with newly sampled principles, for up to 4 rounds in total, producing $y_4$.
Fine-tune a fresh pretrained 52B model on the dataset of $(x, y_4)$ pairs concatenated with the original helpfulness-SFT dataset.
Training hyperparameters: one epoch, learning rate 0.5x pretraining LR, batch size 1024.
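The data-generation loop in the steps above can be sketched as follows. This is a minimal illustration, assuming a hypothetical text-in, text-out `model` callable and plain newline-joined prompts; the paper's actual prompt templates and few-shot examples are not reproduced here, and the function names are invented for the example.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# A principle is a (critique request, revision request) pair of strings.
Principle = Tuple[str, str]


def critique_revision(
    prompt: str,
    model: Callable[[str], str],          # hypothetical sampler standing in for M_helpful
    principles: Sequence[Principle],
    n_rounds: int = 4,
    rng: Optional[random.Random] = None,
) -> str:
    """Run n_rounds of critique then revision and return the final revision."""
    rng = rng or random.Random()
    response = model(prompt)              # y_0: initial, possibly harmful response
    for _ in range(n_rounds):
        # A fresh principle is drawn uniformly for every round.
        critique_request, revision_request = rng.choice(principles)
        critique = model(f"{prompt}\n{response}\n{critique_request}")
        response = model(f"{prompt}\n{response}\n{critique}\n{revision_request}")
    return response


def build_sl_dataset(
    prompts: Sequence[str],
    model: Callable[[str], str],
    principles: Sequence[Principle],
    n_rounds: int = 4,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (prompt, final revision) pairs to use as fine-tuning targets."""
    rng = random.Random(seed)
    return [(x, critique_revision(x, model, principles, n_rounds, rng)) for x in prompts]
```

Each prompt costs one initial sample plus two model calls per round, so four rounds means nine calls per training example. That is the inference-cost side of the tradeoff.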

Connection to full pipeline. SL-CAI produces a model that is harmless but tends to be evasive (refuses to engage, gives boilerplate). It is the policy initialisation for RL-CAI.

Design rationale and tradeoffs. Critique-then-revise (rather than direct revision) gives the model an explicit reasoning step. From the paper Section 3.5: skipping the critique and revising directly lowers harmlessness scores for smaller models but makes little difference for the largest ones; the paper keeps critiques in its main results, partly because they make the model’s reasoning more transparent. Tradeoff: more inference cost per training example, plus the risk that the critique is itself flawed.

What breaks if removed. If the critique step is dropped (just revise), harmlessness drops for smaller models. If multiple critique-revision rounds are reduced to one, harmlessness still improves over baseline but less than the 4-round version.

Novelty. [New] for the specific multi-stage critique-revision pipeline applied at training-data-generation scale. The general idea of self-critique was present in 2022 prompting literature; CAI’s contribution is using it to create training data at scale, not as an inference-time trick.

5.2 RL-CAI (Reinforcement Learning from AI Feedback)

Plain-English intuition. Instead of human labellers picking which of two responses is better, an AI labeller does, and it consults the constitution while doing so. Train a preference model on these AI labels, then PPO against it.

Exact mechanism step-by-step (From the paper, Section 4.1).

Take the SL-CAI model from Section 5.1 as the policy initialisation $\pi_{\text{SL}}$.
For each red-team prompt $x$ (training set approximately 491k prompts):
a. Sample two responses $y_A, y_B \sim \pi_{\text{SL}}(\cdot \mid x)$.
b. Draw a comparison principle $c_{\text{cmp}}$ uniformly from $C_{\text{RL}}$ (16 principles).
c. Construct a comparison prompt that includes the principle, both responses, and (in the CoT variant) “Let’s think step by step:” with hand-written few-shot CoT examples.
d. The feedback model $M_{\text{fb}}$ produces log-probabilities of choosing A and B. Normalise to a probability $p$ over the two options. In the CoT variant, $p$ is clamped to roughly $[0.4, 0.6]$ before being used as a label.
Construct a preference dataset of $(x, y_A, y_B, p)$ examples. For helpfulness, use human preferences from the earlier Anthropic HH-RLHF dataset (approximately 135k human-written prompts).
Train a preference model $r_\phi$ that takes $(x, y)$ and outputs a scalar reward, using a cross-entropy loss against the soft labels $p$.
Run PPO with $r_\phi$ as the reward signal and a KL penalty against $\pi_{\text{SL}}$ (the SL-CAI checkpoint serves as the reference $\pi_{\text{ref}}$ for the KL term).
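The preference-data step above is mostly orchestration, which a short sketch makes concrete. The names `build_preference_dataset`, `policy_sample`, and `feedback_label` are hypothetical stand-ins for the SL-CAI sampler and the feedback model; the preference-model training and PPO steps are deliberately left out.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One row of the preference dataset: (x, y_A, y_B, p), where p is the
# soft label that y_A is preferred over y_B.
PreferenceRow = Tuple[str, str, str, float]


def build_preference_dataset(
    prompts: Sequence[str],
    policy_sample: Callable[[str], str],                    # hypothetical sampler for pi_SL
    feedback_label: Callable[[str, str, str, str], float],  # hypothetical feedback model, returns p in [0, 1]
    comparison_principles: Sequence[str],
    seed: int = 0,
) -> List[PreferenceRow]:
    rng = random.Random(seed)
    rows: List[PreferenceRow] = []
    for x in prompts:
        # Two independent samples from the same policy form the pair.
        y_a = policy_sample(x)
        y_b = policy_sample(x)
        # One comparison principle is drawn uniformly per pair.
        c_cmp = rng.choice(comparison_principles)
        p = feedback_label(x, y_a, y_b, c_cmp)
        rows.append((x, y_a, y_b, p))
    return rows
```

Because the principle is resampled for every pair, no single row reflects the whole constitution; coverage comes from the size of the dataset.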

Connection to full pipeline. RL-CAI is the final assistant. It is what Anthropic compared against the RLHF-with-human-harmless-labels baseline in the headline plots.

Design rationale and tradeoffs. Two variants: with and without chain-of-thought in the feedback model. CoT improves the feedback model’s labelling accuracy on the HHH eval (Figure 4 of the paper) but produces poorly calibrated probabilities, hence the clamping. Without CoT, labels are sharper but the labeller is less accurate.

What breaks if removed. If RL-CAI is removed and the model stops at SL-CAI, harmlessness Elo is lower and the model is more evasive. If the AI feedback is replaced by random labels, performance collapses.

Novelty. [New] for the AI-feedback-as-preference-source pipeline at scale; the term RLAIF originated with this paper and is now standard.

5.3 The constitution

Plain-English intuition. A list of short written sentences, each one a rule like “Choose the response a wise person would give.” The model draws one principle at random for each critique or comparison, so over many examples it sees the whole constitution.

Exact content (From the paper, Appendix C). 16 SL-CAI critique-revision pairs and 16 RL-CAI comparison principles. Examples reproduced from Appendix C:

Critique-revision pair: “Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal” paired with “Rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.”
Critique-revision pair: “Did the assistant’s preceding response include anything that may be inappropriate for young children?” paired with the corresponding revision instruction.
Comparison principle: “Please choose the response that is the most helpful, honest, and harmless.”
Comparison principle: “Choose the response that a wise, ethical, polite and friendly person would more likely say.”
Comparison principle: “Compare the degree of harmfulness in the assistant responses and choose the one that’s less harmful.”

Design rationale. Random sampling of principles (rather than concatenating all 16) keeps the prompt short and avoids over-fitting any one rule. From the paper Section 3.1: “Constitutional principles were selected in fairly ad hoc manner for research purposes.”
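A constitution stored as plain configuration, with one principle drawn per call, might look like the sketch below. The principle texts are the ones quoted above; the dictionary layout, the key names, and the `draw_principle` helper are assumptions made for illustration, and the lists are truncated to the quoted examples rather than the full 16 + 16.

```python
import random

# Hypothetical layout: the paper does not prescribe a storage format.
CONSTITUTION = {
    "critique_revision": [
        {
            "critique": (
                "Identify specific ways in which the assistant's last response is "
                "harmful, unethical, racist, sexist, toxic, dangerous, or illegal"
            ),
            "revision": (
                "Rewrite the assistant response to remove any and all harmful, "
                "unethical, racist, sexist, toxic, dangerous, or illegal content."
            ),
        },
    ],
    "comparison": [
        "Please choose the response that is the most helpful, honest, and harmless.",
        "Choose the response that a wise, ethical, polite and friendly person would more likely say.",
        "Compare the degree of harmfulness in the assistant responses and choose the one that's less harmful.",
    ],
}


def draw_principle(kind: str, rng: random.Random):
    """Draw one principle of the given kind uniformly at random."""
    return rng.choice(CONSTITUTION[kind])


rng = random.Random(0)
# Over many draws every comparison principle is seen, although each
# individual prompt carries only one of them.
seen = {draw_principle("comparison", rng) for _ in range(200)}
```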

Novelty. [New] as an artefact: a short written rulebook governing alignment training. [Adopted] in the sense that the individual principles paraphrase familiar moral norms.

Section 6, Mathematical contributions

The paper is empirical and presents no theorems, but it inherits and uses three standard objectives from the RLHF literature. This section reconstructs the loss functions in the paper’s notation; the paper itself defers to Christiano (2017)¹² and Stiennon (2020) for derivation.

MATH ENTRY 1: Bradley-Terry preference likelihood (used for preference-model training).

Source: paper Section 4.1; standard from Christiano 2017¹² and earlier econometrics literature.
What it is: the probability that a labeller (human or AI) prefers response $y_w$ over $y_l$ on prompt $x$ , modelled as a sigmoid of the difference in scalar rewards.
Formal definition: $P(y_w \succ y_l \mid x) = \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)$
Each term explained and its dimensional/type analysis:
- $r_\phi(x, y) \in \mathbb{R}$ is a scalar reward output by the preference model parameterised by $\phi$ .
- $r_\phi(x, y_w) - r_\phi(x, y_l) \in \mathbb{R}$ is the reward margin.
- $\sigma(z) = 1/(1+e^{-z}) \in [0, 1]$ maps the margin to a probability.
Worked numerical example. Suppose $r_\phi(x, y_w) = 2.0$ and $r_\phi(x, y_l) = 0.5$ . The margin is $1.5$ . Then $\sigma(1.5) = 1 / (1 + e^{-1.5}) = 1 / (1 + 0.223) = 0.818$ . The Bradley-Terry model predicts the labeller will pick $y_w$ over $y_l$ with probability approximately 81.8%. If labels are soft (as in RL-CAI with CoT clamping), say the AI labeller assigned $p = 0.55$ to $y_w$ being preferred, the preference-model training loss for this pair tries to make $\sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$ approach $0.55$ , not $1.0$ .
Role: trains the preference model $r_\phi$ from a dataset of comparisons.
Edge cases: if the labeller is uncertain (label = 0.5), the loss only pushes the margin toward 0, not toward either sign. With AI-generated soft labels clamped to $[0.4, 0.6]$, the loss-minimising margin for any single pair lies within $\pm\ln(0.6/0.4) \approx \pm 0.405$, so the preference-model rewards stay in a moderate range.
Novelty: [Adopted] from Christiano 2017.
Transferability: [Analysis] standard across every RLHF / RLAIF pipeline; DPO⁸ uses the same Bradley-Terry assumption and derives an equivalent closed-form loss without the explicit reward model.
Why it matters: this is the loss CAI uses to convert AI feedback into a learned reward model that drives the RL stage.
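The worked example in this entry can be checked numerically. The sketch below is a plain-Python illustration; the function names are invented for the example, and `target_margin` is simply the inverse sigmoid, that is, the margin at which the Bradley-Terry prediction equals a given soft label.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bt_preference_prob(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) under Bradley-Terry: sigmoid of the reward margin."""
    return sigmoid(r_w - r_l)


def target_margin(p: float) -> float:
    """Reward margin at which sigmoid(margin) equals the soft label p."""
    return math.log(p / (1.0 - p))


print(round(bt_preference_prob(2.0, 0.5), 3))  # 0.818
print(round(target_margin(0.55), 3))           # 0.201
```

The second number shows what a soft label does in practice: a label of 0.55 asks the preference model for a margin of only about 0.2, far smaller than the 1.5 margin in the hard-preference example.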

MATH ENTRY 2: Preference-model training loss (cross-entropy against soft labels).

Source: paper Section 4.1; standard derivation from Bradley-Terry.
What it is: the loss that updates $\phi$ to make the preference model match the labeller’s preferences.
Formal definition: $\mathcal{L}_{\text{PM}}(\phi) = -\mathbb{E}_{(x, y_A, y_B, p) \sim \mathcal{D}_{\text{pref}}}\bigl[p \log \sigma(\Delta_\phi) + (1-p) \log \sigma(-\Delta_\phi)\bigr]$ where $\Delta_\phi = r_\phi(x, y_A) - r_\phi(x, y_B)$ .
Each term explained:
- $\mathcal{D}_{\text{pref}}$ is the preference dataset of quadruples.
- $p \in [0, 1]$ is the labeller’s soft preference for $y_A$ over $y_B$ . In RL-CAI with CoT this is the clamped probability from the feedback model; in vanilla RLHF this is binary (0 or 1) from a human.
- $\Delta_\phi \in \mathbb{R}$ is the reward margin under the current preference model.
- The loss is the binary cross-entropy between $p$ and $\sigma(\Delta_\phi)$ .
Worked numerical example. Suppose for one prompt the AI feedback model with CoT and clamping outputs $p = 0.55$ for $y_A$ . Suppose the current preference model outputs $r_\phi(x, y_A) = 0.3$ and $r_\phi(x, y_B) = 0.4$ , so $\Delta_\phi = -0.1$ and $\sigma(\Delta_\phi) = 0.475$ . The per-example loss is $-[0.55 \cdot \log(0.475) + 0.45 \cdot \log(0.525)] = -[0.55 \cdot (-0.744) + 0.45 \cdot (-0.644)] = -[-0.409 - 0.290] = 0.699$ . Gradient updates push $r_\phi(x, y_A)$ up and $r_\phi(x, y_B)$ down so that $\sigma(\Delta_\phi)$ moves toward $0.55$ .
Role: trains the AI-labelled preference model that becomes the reward function for PPO.
Edge cases: with $p = 0.5$ (perfectly ambiguous), the gradient with respect to the margin, $\sigma(\Delta_\phi) - p$, is zero at $\Delta_\phi = 0$; any nonzero margin is pulled back toward zero rather than pushed toward either response.
Novelty: [Adopted], standard preference-model loss.
Transferability: identical pattern in every RLHF / RLAIF / DPO derivation.
Why it matters: this is where the AI-generated labels become a differentiable training signal.
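The per-example loss and its gradient can be reproduced with a few lines of plain Python. This is a scalar sketch for checking the arithmetic, not a training implementation; the function names are invented, and the gradient formula $\sigma(\Delta_\phi) - p$ follows from differentiating the loss with respect to the margin.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def pm_loss(p: float, r_a: float, r_b: float) -> float:
    """Binary cross-entropy between soft label p and sigmoid(r_a - r_b)."""
    delta = r_a - r_b
    return -(p * math.log(sigmoid(delta)) + (1.0 - p) * math.log(sigmoid(-delta)))


def pm_loss_grad_delta(p: float, r_a: float, r_b: float) -> float:
    """Derivative of pm_loss with respect to the margin delta = r_a - r_b."""
    return sigmoid(r_a - r_b) - p


print(round(pm_loss(0.55, 0.3, 0.4), 3))             # 0.699
print(round(pm_loss_grad_delta(0.55, 0.3, 0.4), 3))  # -0.075
```

The negative gradient confirms the direction stated in the worked example: increasing the margin, by raising $r_\phi(x, y_A)$ or lowering $r_\phi(x, y_B)$, reduces the loss.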

MATH ENTRY 3: PPO objective with KL penalty (used for RL stage).

Source: paper Section 4.1; standard from Schulman 2017 (PPO) and Stiennon 2020 (KL-penalised RLHF).
What it is: the RL objective that updates the policy $\pi_\theta$ to maximise expected reward while staying close to the SL-CAI reference policy.
Formal definition (faithfully reconstructed; the paper references but does not write the equation): $\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot \mid x)}\Bigl[r_\phi(x, y) - \beta \cdot \mathrm{KL}\bigl(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\bigr)\Bigr]$
Each term explained and its dimensional/type analysis:
- $\pi_\theta(\cdot \mid x)$ is the current policy’s distribution over responses given prompt $x$ , a probability distribution over the (effectively infinite) string space.
- $\pi_{\text{ref}}$ is the frozen SL-CAI policy used as the KL anchor.
- $r_\phi(x, y) \in \mathbb{R}$ is the scalar reward from MATH ENTRY 2’s preference model.
- $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}}) \in [0, \infty)$ is the Kullback-Leibler divergence between current and reference policies; zero when they match, large when the policy has drifted.
- $\beta \in \mathbb{R}_+$ is a hyperparameter trading off reward against drift; typical values $\beta \in [0.01, 0.1]$ in RLHF literature.
Worked numerical example. Consider a single prompt $x$ where the policy emits a response $y$ with $r_\phi(x, y) = 1.5$. Suppose the per-token KL between $\pi_\theta(\cdot \mid x)$ and $\pi_{\text{ref}}(\cdot \mid x)$ averages $0.2$ nats over the response, and the response is 50 tokens long, so total KL is $10$. With $\beta = 0.05$, the per-example objective is $1.5 - 0.05 \cdot 10 = 1.5 - 0.5 = 1.0$. The gradient pushes the policy to emit responses with higher $r_\phi$ while penalising any move away from $\pi_{\text{ref}}$. If the policy “Goodharts” the reward and finds responses with $r_\phi = 3.0$ but a total KL of $80$, the objective becomes $3.0 - 0.05 \cdot 80 = 3.0 - 4.0 = -1.0$, which is worse, so the KL term discourages extreme drift.
Role: this is the RL loss that produces the final RL-CAI policy.
Edge cases: with $\beta$ too small, the policy collapses to reward-hacked outputs (the Goodharting failure mode the paper observes in Section 4.5). With $\beta$ too large, the policy barely updates and harmlessness gains are lost.
Novelty: [Adopted] directly from Stiennon 2020 / Ouyang 2022.
Transferability: [Analysis] the entire RLHF lineage uses this objective. DPO’s contribution⁸ is that this objective admits a closed-form optimum that can be optimised by a supervised loss without explicit RL.
Why it matters: this is what produces the final assistant. The CAI contribution is in the source of the reward signal (AI feedback), not the optimiser.
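The per-example arithmetic in this entry can be reproduced from per-token log-probabilities. The sketch below assumes the common single-sample estimate of the sequence-level KL, namely the sum over sampled tokens of $\log \pi_\theta - \log \pi_{\text{ref}}$; the paper does not spell out its estimator, so treat that choice, and the function names, as assumptions. It computes the penalised reward only and is not a PPO implementation.

```python
from typing import Sequence


def sequence_kl_estimate(logp_policy: Sequence[float], logp_ref: Sequence[float]) -> float:
    """Single-sample KL estimate: sum of per-token log-probability differences."""
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))


def penalised_reward(
    reward: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float,
) -> float:
    """Preference-model reward minus beta times the estimated KL from the reference."""
    return reward - beta * sequence_kl_estimate(logp_policy, logp_ref)


# 50 tokens, each 0.2 nats more likely under the policy than the reference.
mild = penalised_reward(1.5, [-1.0] * 50, [-1.2] * 50, beta=0.05)
# Same length, but 1.6 nats per token of drift (total KL of about 80).
drifted = penalised_reward(3.0, [-1.0] * 50, [-2.6] * 50, beta=0.05)

print(round(mild, 6))     # 1.0
print(round(drifted, 6))  # -1.0
```

The drifted response earns twice the raw reward yet scores lower overall, which is the behaviour the KL term is there to enforce.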

MATH ENTRY 4: AI-feedback label construction with chain-of-thought.

Source: paper Section 4.1.
What it is: the procedure that turns a feedback model’s CoT output into a soft preference label.
Formal definition. Given prompt $x$ , paired responses $(y_A, y_B)$ , and a randomly drawn comparison principle $c$ , construct a feedback prompt $F = [x, y_A, y_B, c, \text{"Let's think step by step:"}]$ . Sample a CoT trace from the feedback model, then read off the log-probabilities of the tokens “A” and “B” (or “(A)” and “(B)”) in the final answer position. Compute: $p_A^{\text{raw}} = \frac{\exp(\ell_A)}{\exp(\ell_A) + \exp(\ell_B)}$ where $\ell_A, \ell_B$ are the log-probabilities. Then clamp to a soft band $[0.4, 0.6]$ : $p_A = \mathrm{clip}(p_A^{\text{raw}}, 0.4, 0.6)$
Each term explained:
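- $\ell_A, \ell_B \in \mathbb{R}$ are the feedback model’s scores for “A” and “B” at the answer position: log-probabilities in the definition above, or equivalently raw logits, since the shared normaliser cancels in the two-way softmax.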
- $p_A^{\text{raw}} \in [0, 1]$ is the softmax-normalised probability over the two options.
- $p_A \in [0.4, 0.6]$ is the clamped label used downstream.
Worked numerical example. Suppose the feedback model’s CoT produces logits $\ell_A = 2.0, \ell_B = -1.0$ . Then $p_A^{\text{raw}} = e^{2.0} / (e^{2.0} + e^{-1.0}) = 7.39 / (7.39 + 0.368) = 7.39 / 7.76 = 0.953$ . The model is 95.3% confident in A. Clamping to $[0.4, 0.6]$ gives $p_A = 0.6$ . The downstream preference model is trained against $p_A = 0.6$ , a much weaker signal than the raw 0.953. [Analysis] This is the intervention that prevents the RL stage from chasing extreme labels and producing Goodharted policies.
Role: defines the AI label that feeds MATH ENTRY 2’s loss.
Edge cases: if both logits are equal, $p_A^{\text{raw}} = 0.5$ and clamping leaves it unchanged. If clamping is removed, the paper finds the RL policy becomes harsh or formulaic.
Novelty: [New] as the specific clamping intervention for AI-labelled preferences with CoT.
Transferability: [Analysis] the clamping trick generalises to any RLAIF pipeline using CoT labellers; later RLAIF work (Lee et al. 2023⁹) reports similar issues.
Why it matters: this is the engineering detail that makes RL-CAI stable in practice.
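
The label construction above is small enough to sketch directly. The following is a minimal illustration, assuming the two answer-token scores are already available as floats; the function name and default band arguments are hypothetical conveniences, with the $[0.4, 0.6]$ band taken from the text.

```python
import math


def clamped_preference(score_a: float, score_b: float,
                       low: float = 0.4, high: float = 0.6):
    """Return (raw, clamped) probability that response A is preferred."""
    # Subtract the max before exponentiating for numerical stability;
    # the two-way softmax is unchanged by a shared offset.
    m = max(score_a, score_b)
    exp_a = math.exp(score_a - m)
    exp_b = math.exp(score_b - m)
    p_raw = exp_a / (exp_a + exp_b)
    p_clamped = min(max(p_raw, low), high)
    return p_raw, p_clamped


raw, clamped = clamped_preference(2.0, -1.0)
print(round(raw, 3), clamped)
```

With the scores from the worked example, the raw probability is about 0.953 and the clamped label is 0.6; equal scores give 0.5, which the clamp leaves unchanged.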

Proof sketches. The paper presents no formal theorems. The transferable theoretical content is the Bradley-Terry derivation in DPO⁸, which shows that MATH ENTRIES 1-3 jointly admit a closed-form optimum, a result CAI does not rely on but which later work shows can replace the PPO stage.
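
For reference, the closed-form result mentioned above can be stated in this article’s notation. This is a sketch of the standard DPO argument, assuming MATH ENTRY 3 maximises $\mathbb{E}_{y \sim \pi_\theta}[r_\phi(x, y)] - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ , the form implied by that entry’s worked example; $\pi^*$ (the optimal policy) and $Z(x)$ (the partition function) are symbols introduced here.

```latex
\begin{align*}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,
    \exp\!\left(\frac{r_\phi(x, y)}{\beta}\right) \\
Z(x) &= \sum_{y} \pi_{\text{ref}}(y \mid x)\,
    \exp\!\left(\frac{r_\phi(x, y)}{\beta}\right) \\
r_\phi(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
    + \beta \log Z(x) \\
P(y_A \succ y_B \mid x) &= \sigma\!\left(
    \beta \log \frac{\pi^*(y_A \mid x)}{\pi_{\text{ref}}(y_A \mid x)}
    - \beta \log \frac{\pi^*(y_B \mid x)}{\pi_{\text{ref}}(y_B \mid x)}
    \right)
\end{align*}
```

The $\beta \log Z(x)$ term cancels in the Bradley-Terry difference, which is why the preference likelihood can be written in terms of the policy alone and fitted with a supervised loss.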

Section 7, Algorithmic contributions

ALGORITHM ENTRY 1: SL-CAI training-data generation (headline algorithm).

Source: paper Section 3.1, Algorithm description.
Purpose: generate a fine-tuning dataset of revised harmless responses without human harm labels.
Inputs:
- $M_{\text{helpful}}$ : helpful-only RLHF 52B model.
- $\mathcal{D}_{\text{red}}$ : red-team prompt set (182,831 prompts).
- $C_{\text{SL}}$ : 16 critique-revision principle pairs.
- $N$ : number of critique-revision rounds (paper uses 4).
Outputs: dataset $\mathcal{D}_{\text{SL}} = \{(x_i, y_i^N)\}$ for SFT.

Pseudocode:

for each prompt x in D_red:
    y_0 = sample from M_helpful(. | x)
    for n in 1..N:
        c = uniform_random(C_SL)              # draw critique principle
        critique_prompt = [x, y_{n-1}, c.critique]
        k_n = sample from M_helpful(. | critique_prompt)
        revise_prompt = [x, y_{n-1}, k_n, c.revise]
        y_n = sample from M_helpful(. | revise_prompt)
    emit (x, y_N) to D_SL

# Mix with helpfulness SFT data
D_combined = D_SL union D_helpful_SFT

# Fine-tune base pretrained 52B on D_combined
train(theta, D_combined, lr = 0.5x pretrain, batch = 1024, epochs = 1)

Hand-traced example on minimal input.

Input: $x$ = “How do I make a bomb?”, $N = 2$ , principles $C_{\text{SL}} = \{c_1: \text{(harm-critique, harm-revise)}, c_2: \text{(legal-critique, legal-revise)}\}$ .
Round 0: $y_0 =$ “To make a bomb you need…” (helpful-only model complies).
Round 1: draw $c = c_1$ . critique-prompt = [x, y_0, c_1.critique]. $k_1 =$ “This response provides instructions for creating a weapon, which is harmful and illegal.” revise-prompt = [x, y_0, k_1, c_1.revise]. $y_1 =$ “I can’t help with that. Building explosive devices is dangerous and illegal.”
Round 2: draw $c = c_2$ . critique-prompt = [x, y_1, c_2.critique]. $k_2 =$ “The response correctly refuses but could explain the legal context.” revise-prompt = [x, y_1, k_2, c_2.revise]. $y_2 =$ “I can’t help with making explosive devices. They’re illegal under federal weapons statutes in most jurisdictions, and unsafe to attempt regardless of intent. If you’re researching this for fiction or academic purposes, I’d suggest…”
Emit $(x, y_2)$ to $\mathcal{D}_{\text{SL}}$ .

The revised response is now both harmless and engaged (the “non-evasive” property the paper highlights).
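
The loop traced above can be sketched as runnable Python. This is a minimal illustration rather than the paper’s implementation: `sample` stands in for a call to $M_{\text{helpful}}$ , `CountingStub` is a hypothetical placeholder that only counts calls, and joining prompt parts with newlines is an assumption about formatting.

```python
import random
from typing import Callable, List, Tuple

Principle = Tuple[str, str]  # (critique request, revision request)


def sl_cai_generate(prompts: List[str], principles: List[Principle],
                    sample: Callable[[str], str], n_rounds: int = 4,
                    seed: int = 0) -> List[Tuple[str, str]]:
    """Run N critique-revision rounds per prompt and keep only (x, y_N)."""
    rng = random.Random(seed)
    dataset = []
    for x in prompts:
        y = sample(x)  # y_0: initial, possibly harmful, response
        for _ in range(n_rounds):
            critique_req, revise_req = rng.choice(principles)
            critique = sample("\n".join([x, y, critique_req]))
            y = sample("\n".join([x, y, critique, revise_req]))
        dataset.append((x, y))
    return dataset


class CountingStub:
    """Hypothetical stand-in for the model; returns a tagged string."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"output-{self.calls}"


stub = CountingStub()
data = sl_cai_generate(["How do I make a bomb?"],
                       [("Identify harms.", "Rewrite to remove harms.")],
                       stub, n_rounds=2)
print(len(data), stub.calls)
```

The stub makes the call count visible: each prompt costs one initial sample plus two samples per round, so the two-round trace above uses five model calls.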

Complexity: $O(|\mathcal{D}_{\text{red}}| \cdot N \cdot 2)$ inferences from $M_{\text{helpful}}$ (each round requires one critique sample and one revision sample; the pseudocode above uses no separate feedback model). For 182,831 prompts and $N = 4$ , that is $182{,}831 \cdot 4 \cdot 2 \approx 1.46$M inferences for data generation, plus one initial sample $y_0$ per prompt. Bottleneck step: sampling from $M_{\text{helpful}}$ at 52B scale.
Hyperparameters: $N = 4$ rounds (sensitivity: paper shows fewer rounds give less harmlessness, but the improvement saturates after 4); SFT learning rate $0.5\times$ pretrain LR; batch 1024; 1 epoch.
Failure modes: if $M_{\text{helpful}}$ already refuses when producing the initial response $y_0$ (because it has residual safety training), the critique loop has nothing to revise. The paper mitigates this by using a helpful-only RLHF model that complies with most prompts.
Novelty: [New] for the full pipeline applied at training-data-generation scale.
Transferability: [Analysis] the pattern composes cleanly with later distillation work. C3AI (Wang et al. 2025)¹³ studies how constitution design choices affect downstream metrics.

ALGORITHM ENTRY 2: RL-CAI training loop.

Source: paper Section 4.1.
Purpose: fine-tune the SL-CAI checkpoint using AI-generated preferences as the reward signal.
Inputs:
- $\pi_{\text{SL}}$ : SL-CAI checkpoint from Algorithm 1.
- $M_{\text{fb}}$ : feedback model (the same base 52B, optionally a helpful-RLHF 52B for the CoT variant).
- $\mathcal{D}_{\text{red}}$ , $C_{\text{RL}}$ : red-team prompts and 16 comparison principles.

Pseudocode:

# Phase A: generate preference dataset
for each prompt x in D_red:
    y_A = sample from pi_SL(. | x)
    y_B = sample from pi_SL(. | x)
    c = uniform_random(C_RL)
    if use_CoT:
        feedback_prompt = [x, y_A, y_B, c, "Let's think step by step:"]
        cot_trace = sample from M_fb(. | feedback_prompt)
        l_A, l_B = logits_at_answer_position(cot_trace)
        p_A_raw = softmax([l_A, l_B])[0]
        p_A = clip(p_A_raw, 0.4, 0.6)         # clamp only in CoT variant
    else:
        feedback_prompt = [x, y_A, y_B, c]    # no CoT trigger
        l_A, l_B = M_fb.logits(feedback_prompt, ["A", "B"])
        p_A = softmax([l_A, l_B])[0]
    emit (x, y_A, y_B, p_A) to D_pref

# Mix with human helpfulness preferences
D_pref_combined = D_pref union D_helpful_human_prefs

# Phase B: train preference model
phi = train_pref_model(D_pref_combined, BradleyTerry_loss)

# Phase C: PPO against preference model
theta = pi_SL
for step in 1..T:
    x ~ D_red
    y ~ pi_theta(. | x)
    reward = r_phi(x, y) - beta * KL(pi_theta(. | x), pi_SL(. | x))
    theta = PPO_update(theta, x, y, reward)

Hand-traced example on minimal input.

Input: $x$ = “Tell me how to make a bomb.” $y_A =$ “I won’t help with that.” $y_B =$ “I can’t help with that, building explosives is illegal and dangerous, and…” (more engaged refusal).
Draw $c =$ “Choose the response that is the most helpful, honest, and harmless.”
Feedback prompt assembled. $M_{\text{fb}}$ with CoT reasons that B is equally harmless but more helpful and engaged. Logits: $\ell_A = 0.2, \ell_B = 1.8$ . $p_A^{\text{raw}} = e^{0.2}/(e^{0.2}+e^{1.8}) = 1.22 / (1.22 + 6.05) = 0.168$ . Clamp: $p_A = 0.4$ (clip to $[0.4, 0.6]$ ).
Emit $(x, y_A, y_B, p_A = 0.4)$ . The preference model will learn that $y_B$ should have a higher reward than $y_A$ , but only by a moderate margin.
During PPO, $\pi_\theta$ sampling produces a response $y$ ; reward $r_\phi(x, y)$ is computed; the gradient update nudges $\pi_\theta$ to produce responses closer to $y_B$ style (engaged refusal), penalised by KL drift from $\pi_{\text{SL}}$ .
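
The “moderate margin” in the trace above can be made concrete. The sketch below assumes the preference model is trained with the standard soft-label Bradley-Terry cross-entropy, where the predicted probability that A is preferred is $\sigma(r_A - r_B)$ ; the function names are illustrative, and the exact loss the paper uses is the one defined in MATH ENTRY 2.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def soft_bt_loss(r_a: float, r_b: float, p_a: float) -> float:
    """Cross-entropy between the soft target p_a and sigmoid(r_a - r_b)."""
    margin = r_a - r_b
    return -(p_a * log_sigmoid(margin) + (1.0 - p_a) * log_sigmoid(-margin))


def optimal_margin(p_a: float) -> float:
    """Reward margin r_a - r_b that minimises soft_bt_loss for a fixed target."""
    return math.log(p_a / (1.0 - p_a))


clamped_margin = optimal_margin(0.4)    # target after clamping
raw_margin = optimal_margin(0.168)      # target if the raw label were used
print(round(clamped_margin, 3), round(raw_margin, 3))
```

Under this assumption, the clamped label asks the preference model for a reward gap of roughly 0.41 in favour of $y_B$ , compared with roughly 1.6 for the raw label, so clamping shrinks the margin the preference model is asked to fit on each comparison.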
Complexity. Phase A: $O(|\mathcal{D}_{\text{red}}|)$ feedback-model inferences plus 2 policy samples per prompt. Phase B: standard preference-model training. Phase C: standard PPO. The dominant cost in CAI training is Phase A’s inference (one CoT generation per prompt) plus Phase C’s PPO rollouts.
Hyperparameters: $\beta$ (KL coefficient), PPO learning rate, rollout batch size; the paper does not enumerate these (deferred to Bai 2022a)¹¹. CoT clamping band $[0.4, 0.6]$ is the most distinctive CAI-specific hyperparameter.
Failure modes: (a) Goodharting if $\beta$ too small; (b) under-optimisation if $\beta$ too large; (c) reward-model exploitation if the preference model has spurious features; (d) “harsh” model behaviour if clamping is removed.
Novelty: [New] for the AI-feedback preference-data construction (Phase A); [Adopted] for Phases B and C.
Transferability: directly applicable to any base model + any constitution.

Section 8, Specialised design contributions

8A, LLM / prompt design

PROMPT ENTRY 1: SL-CAI critique prompt.

Source: paper Section 3.1 + Appendix C.
Role in pipeline: turns an initial harmful response into a self-critique.
Prompt type: Few-shot with hand-written critique-revision examples preceding the live prompt.
Components in order: (1) few-shot exemplars showing critique style; (2) the user prompt $x$ ; (3) the model’s initial response $y_0$ ; (4) the randomly drawn critique principle $c$ .
Input schema: [few-shot examples][HUMAN: x][ASSISTANT: y_0][CRITIQUE REQUEST: c]. Output schema: free-form critique text.
Reconstructed template [Reconstructed]:
```
CritiqueRequest: <c.critique-principle>
Critique: [model generates here]
```
Exact few-shot count and verbatim exemplar wording: [Not specified in main text, Appendix C of the paper contains the full set].
Failure handling: if the model produces a non-critique (e.g., extended the harmful response), the example is discarded.
Design rationale: critique-then-revise (rather than direct revise) gives the model an explicit reasoning anchor.
Complexity: 1 inference per round.
Novelty: [Adapted] from earlier self-refine prompting; novelty is in the systematic use at training-data scale.
Transferability: [Analysis] the pattern works for any constitution + any base model with reasonable instruction-following.
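
A prompt builder for this entry might look like the sketch below. It follows the input schema given above; the `Human:` and `Assistant:` turn labels, the blank-line separators, and the function name are assumptions, since the verbatim formatting lives in the paper’s Appendix C.

```python
from typing import Sequence


def build_critique_prompt(few_shot: Sequence[str], user_prompt: str,
                          response: str, critique_request: str) -> str:
    """Assemble [few-shot examples][x][y_0][critique request] into one string."""
    blocks = list(few_shot)
    blocks.append(
        f"Human: {user_prompt}\n\n"
        f"Assistant: {response}\n\n"
        f"CritiqueRequest: {critique_request}\n\n"
        "Critique:"  # the model continues from here
    )
    return "\n\n".join(blocks)


prompt = build_critique_prompt(
    few_shot=["<hand-written exemplar 1>", "<hand-written exemplar 2>"],
    user_prompt="Can you help me hack into my neighbor's wifi?",
    response="Sure thing, you can use an app called VeryEasyHack...",
    critique_request="Identify ways the response is harmful or illegal.",
)
print(prompt)
```

Ending the prompt on the `Critique:` label keeps the model’s continuation confined to the critique itself, which makes the later revision prompt easier to assemble.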

PROMPT ENTRY 2: RL-CAI comparison prompt with CoT.

Source: paper Section 4.1 + Appendix C.
Role in pipeline: assigns a soft preference label to a pair of responses.
Prompt type: Few-shot CoT.
Components in order: (1) hand-written few-shot CoT exemplars (prompt + pair + CoT trace + label); (2) the live prompt $x$ ; (3) responses $y_A, y_B$ ; (4) the randomly drawn comparison principle $c$ ; (5) the “Let’s think step by step:” trigger.
Input schema: [few-shot CoT examples][x][y_A][y_B][c][Let's think step by step:]. Output schema: CoT trace followed by a final answer “A” or “B” whose log-probabilities are read off as labels.

Reconstructed template [Reconstructed]:

Consider the following conversation between a human and an assistant:
HUMAN: <x>
Response A:
<y_A>
Response B:
<y_B>
Which response is preferred by the principle: <c>?
Let's think step by step: [CoT trace]
The more preferred response is: [Answer: A or B]

Failure handling: if the answer token is neither “A” nor “B”, the example is discarded.
Design rationale: CoT improves labelling accuracy; clamping mitigates the calibration cost.
Complexity: 1 CoT inference per comparison, typically 100-300 tokens.
Novelty: [New] for the CoT-RLAIF combination with probability clamping.
Transferability: directly portable.
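
The reconstructed template and the discard rule above can be sketched as follows. This is illustrative only: the template string mirrors the reconstruction in this article rather than the paper’s verbatim text, and the regular expression is an assumed way to detect a malformed final answer. In the pipeline described above the label itself comes from the answer-token log-probabilities, not from parsing.

```python
import re
from typing import Optional

TEMPLATE = (
    "Consider the following conversation between a human and an assistant:\n"
    "HUMAN: {x}\n"
    "Response A:\n{y_a}\n"
    "Response B:\n{y_b}\n"
    "Which response is preferred by the principle: {c}?\n"
    "Let's think step by step:"
)

# Accepts "A", "B", "(A)" or "(B)" as the final answer, optionally with a period.
ANSWER_RE = re.compile(r"The more preferred response is:\s*\(?([AB])\)?\s*\.?\s*$")


def build_comparison_prompt(x: str, y_a: str, y_b: str, principle: str) -> str:
    """Fill the comparison template for one (prompt, pair, principle) tuple."""
    return TEMPLATE.format(x=x, y_a=y_a, y_b=y_b, c=principle)


def parse_answer(completion: str) -> Optional[str]:
    """Return 'A' or 'B', or None if the example should be discarded."""
    match = ANSWER_RE.search(completion.strip())
    return match.group(1) if match else None


print(parse_answer("B engages more. The more preferred response is: B"))
print(parse_answer("Both are fine."))
```

Returning `None` for anything other than a clean “A” or “B” gives the caller a single place to apply the discard rule.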

8B, Architecture-specific details

Not applicable to this paper. CAI is training-procedure-only; it inherits the 52B architecture from Anthropic’s prior work without modification.

8C, Training specifics

Hardware: not enumerated in the paper. Anthropic’s 52B model training scale is referenced from Bai 2022a¹¹.
Batch size: 1024 for SL-CAI SFT.
Steps / epochs: 1 epoch for SL-CAI; PPO duration not enumerated.
Learning rate: 0.5x pretraining LR for SL-CAI SFT.
Data mixture: SL-CAI data is mixed with the existing helpfulness SFT data to prevent helpfulness degradation. Mixing ratio: not specified verbatim; the paper notes it was tuned to preserve helpfulness Elo.
Negative sampling: red-team prompts are the negative-sampling source (they are designed to elicit harm).

8D, Inference / deployment specifics

Inference is unchanged from any RLHF-trained model: sample from $\pi_{\text{RL}}$ .
Test-time compute: no additional cost over a plain decode.
The 2025 Constitutional Classifiers⁶ add an inference-time filter pattern (input + output classifiers); see Section 11 for the comparison.
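
The input-plus-output filter pattern can be illustrated with a small wrapper. Everything in this sketch is hypothetical: the scoring callables, the threshold, and the refusal string are placeholders, and it scores complete responses for simplicity rather than reproducing how the Constitutional Classifiers system is actually built.

```python
from typing import Callable

REFUSAL = "I can't help with that."


def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     input_score: Callable[[str], float],
                     output_score: Callable[[str, str], float],
                     threshold: float = 0.5) -> str:
    """Wrap a policy with an input classifier and an output classifier."""
    # Block before generation if the prompt itself is flagged.
    if input_score(prompt) >= threshold:
        return REFUSAL
    response = generate(prompt)
    # Block after generation if the (prompt, response) pair is flagged.
    if output_score(prompt, response) >= threshold:
        return REFUSAL
    return response


reply = guarded_generate(
    "What is the capital of France?",
    generate=lambda p: "Paris.",
    input_score=lambda p: 0.01,
    output_score=lambda p, r: 0.02,
)
print(reply)
```

The wrapper leaves the policy untouched, which is the sense in which a filter composes with a CAI-trained model: the unfiltered decode cost is unchanged and the classifiers add their own inference cost on top.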

Section 9, Experiments and results

Datasets.

Dataset	Size	Purpose	Source
Red-team prompts (human)	42,496 prompts	Adversarial prompt source for SL-CAI critique pipeline	Anthropic in-house red-team
Red-team prompts (model-generated)	140,335 prompts	Augmented adversarial prompt set	Generated by 52B model
Red-team total (SL-CAI)	182,831 prompts	SL-CAI training data	Combined
Red-team training (RL-CAI)	491,142 prompts	RL-CAI preference data	Combined
Helpfulness (human)	135,296 prompts	Helpfulness SFT + preference data	Anthropic HH dataset¹¹
Helpfulness training (RL-CAI)	474,300 prompts	RL preference data for helpfulness	Combined
HHH binary eval	438 comparisons	Evaluation of feedback model accuracy	Anthropic eval set
Crowdworker helpfulness	10,274 comparisons	Final Elo evaluation	Crowdworker labels
Crowdworker harmlessness	8,135 comparisons	Final Elo evaluation	Crowdworker labels

Baselines. (i) Helpful-only RLHF 52B (the model used to generate initial harmful responses). (ii) Helpful + Harmless RLHF 52B trained with human harm labels (the prior Anthropic recipe from Bai 2022a¹¹). [Analysis] Obvious missing baseline: a “constitution-as-system-prompt” baseline where the same 16 principles are simply prepended to the helpful-only model’s context. This would isolate the contribution of training (vs prompting) against the constitution. The paper does not report this comparison.

Evaluation metrics.

Harmlessness Elo: crowdworker pairwise preferences on red-team prompts, converted to an Elo score. Higher = more harmless.
Helpfulness Elo: crowdworker pairwise preferences on helpfulness prompts, converted to Elo. Higher = more helpful.
HHH binary eval accuracy: feedback-model accuracy on 438 Helpful-Honest-Harmless binary comparisons. Used to evaluate the labelling step, not the final policy.
Absolute harmfulness score: an additional crowdworker rating of the absolute (rather than pairwise) harm level of responses; the paper notes it “may not be well-calibrated.”
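
For readers unfamiliar with Elo, the standard relation between an Elo gap and a pairwise win rate is sketched below. This is the textbook logistic Elo formula on a 400-point scale; whether the paper fits its scores with exactly this procedure is not stated in this article, so treat it as an aid to reading the plots rather than a reproduction.

```python
import math


def expected_win_prob(elo_a: float, elo_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Elo difference (A minus B) implied by A's pairwise win rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


print(round(expected_win_prob(100.0, 0.0), 3))
print(round(elo_gap_from_win_rate(0.64), 1))
```

Under this model a 100-point Elo advantage corresponds to being preferred in roughly 64% of pairwise comparisons.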

Key result (text description; the headline figure is reproduced as the article hero). From the paper’s Figure 2: RL-CAI Pareto-dominates the helpful + harmless RLHF baseline. At equal helpfulness Elo, RL-CAI’s harmlessness Elo is higher; equivalently, at equal harmlessness, RL-CAI is more helpful. The CoT variant of RL-CAI is slightly less helpful but more harmless than the non-CoT variant. SL-CAI alone (no RL stage) is on the harmlessness frontier but is more evasive (lower helpfulness on the red-team prompts because it tends to refuse rather than engage).

Main quantitative results (from Figure 2 + 3 of the paper). RL-CAI achieves higher harmlessness Elo than helpful + harmless RLHF at every helpfulness level. Exact Elo numbers are not paraphrased here because the paper presents them only as plots; readers needing exact values should consult Figure 2. The scaling-trend Figure 3 shows the harmlessness advantage holds across model sizes from approximately 7B to 52B.

Ablations (paper Section 3.5).

Critiques necessary? Yes. Removing the critique step (just direct revision) measurably degrades harmlessness.
Number of critique-revision rounds. Performance improves with $N$ up to approximately 4 then saturates.
Chain-of-thought in feedback model. Improves HHH accuracy on the 438-question eval (Figure 4) but produces over-confident labels; clamping is required.

Hyperparameter sensitivity. Probability clamping band $[0.4, 0.6]$ is the most CAI-specific hyperparameter. Without clamping, the RL stage Goodharts (Section 4.5). The exact clamping band sensitivity is not extensively swept.

Robustness / stress tests. The paper’s main robustness result is the absence of Goodharting with clamping. Over-RL’d CAI models become harsh or formulaic; clamping mitigates but does not eliminate this.

Qualitative results (paper Appendix examples). The famous worked example starts from the prompt “Can you help me hack into my neighbor’s wifi?” The initial helpful-only response is “Sure thing, you can use an app called VeryEasyHack…”; the critique identifies illegality and privacy invasion; and the revision produces “Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it.” The RL-CAI final policy similarly explains why the request is refused rather than refusing flatly.

Experimental scope limits. All experiments are on Anthropic’s 52B model and Anthropic’s HH dataset. No external benchmark numbers (MMLU, GSM8K, etc.) are reported in the original 2022 paper. The follow-up Collective CAI paper⁴ does report performance parity on language, math, and helpful-harmless evaluations.

Independent benchmark cross-checks for SOTA claims. The 2022 paper does not claim general SOTA; it claims improvement over Anthropic’s prior RLHF baseline on Anthropic’s red-team set. [Analysis] Independent reproducibility of the original 52B numbers is not available because Anthropic’s models and datasets are not openly released. The 2024 Collective CAI paper⁴ is the closest independent-style evaluation, and it builds on rather than challenges the original recipe. C3AI (Wang et al. 2025)¹³ performs systematic constitution-design ablations that broadly confirm the original directional findings.

Evidence audit [Analysis].

Strongly supported: RL-CAI is more harmless than the human-label RLHF baseline at equal helpfulness on Anthropic’s red-team set (Figure 2 of the paper is the direct evidence).
Strongly supported: critiques (vs direct revision) materially improve harmlessness (Section 3.5 ablation).
Partially supported: the recipe generalises beyond Anthropic’s red-team set. The original paper only reports in-distribution numbers; follow-up work begins to test out of distribution.
Narrow evidence: the optimal constitution composition. The paper itself flags that principles were chosen “in fairly ad hoc manner.”

Section 10, Technical novelty summary

Component	Type	Novelty level	Justification	Source
Constitution as a written artefact	Method + Architecture	Combination novel	Self-critique prompting (prior art) + alignment-training data (prior art) combined into a written rulebook governing training	Paper Section 1.2
SL-CAI critique-revision pipeline	Training-data generation	Fully novel	First systematic use of self-critique to generate SFT data at scale, not just at inference	Paper Section 3.1
RL-CAI / RLAIF	Training procedure	Fully novel (as a coined recipe)	First demonstration that AI-labelled preferences alone suffice for harmlessness training	Paper Section 4.1
CoT feedback model with probability clamping	Engineering technique	Incrementally novel	CoT prompting is prior art; clamping for label calibration is new	Paper Section 4.1
Bradley-Terry preference loss	Loss function	Adopted	Standard from Christiano 2017¹²	Paper Section 4.1
KL-penalised PPO RL	RL algorithm	Adopted	Standard from Stiennon 2020, Ouyang 2022¹⁰	Paper Section 4.1
Red-team prompt augmentation via model generation	Data generation	Incrementally novel	Augmenting a human red-team set with model-generated prompts is a practical contribution	Paper Section 3.2

Single most novel contribution. The RLAIF pipeline, using a written constitution and an AI feedback model to replace human harmfulness labels, is the paper’s enduring contribution. By 2026 every major frontier lab has its own RLAIF flavour, and the term itself originates here.

What the paper does NOT claim to be novel. Bradley-Terry preference modelling; PPO with KL penalty; the helpful-only RLHF baseline; the helpful + harmless RLHF baseline (these are Bai 2022a¹¹); chain-of-thought prompting in general; the HHH evaluation framework (Askell 2021); the use of crowdworker pairwise preferences to compute Elo scores.

Section 11, Situating the work

What prior work did. Christiano 2017¹² introduced RLHF with human preferences. Stiennon 2020 and Ouyang 2022¹⁰ scaled it to summarisation and instruction-following. Bai 2022a¹¹ applied it to a helpful + harmless assistant, using human harmlessness labels.

What this paper changes conceptually. The source of harmfulness labels is moved from humans to an AI system that consults a written constitution. The rules governing the assistant become an inspectable, editable artefact. [Analysis] This shifts alignment work from labelling regimes to constitution design, with all the political, ethical, and operational implications that follow.

Contemporaneous and related papers.

Bai 2022a, Training a Helpful and Harmless Assistant with RLHF¹¹. The immediate predecessor; CAI swaps out the human harmlessness labels from this paper while keeping the architecture and helpfulness pipeline intact.
Christiano 2017, Deep RL from Human Preferences¹². The foundational RLHF paper; CAI generalises by inserting an AI in the labeller position.
Lee et al. 2023, RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback⁹. Google’s contemporaneous validation that RLAIF works in summarisation; converges with CAI on the conclusion that AI feedback is comparable to human feedback for many alignment subtasks.
Rafailov 2023, DPO⁸. Shows the PPO stage can be replaced by a closed-form supervised loss. DPO + CAI’s AI-labelled data is now a common stack.

[Reviewer Perspective] Strongest skeptical objection. The constitution is a small, ad-hoc artefact written by Anthropic researchers. The recipe scales the application of that constitution, but the constitution itself is a normative bottleneck: who decides what goes in it? Anthropic acknowledges this; the Collective CAI follow-up⁴ is the answer (approximately 1,000 members of the public draft principles via Polis). [Reviewer Perspective] Whether that answer scales beyond a US-public sample to a global deployment is unresolved.

[Reviewer Perspective] Strongest author-side rebuttal. The constitution is a first iteration deliberately kept simple to isolate the methodological contribution. The point is not the specific 32 principles; it’s that a written constitution suffices to train an aligned model. Any deployed system will iterate the constitution; the recipe is robust to constitution change (the Collective CAI paper validates this).

What remains unsolved.

How to design a constitution that generalises across cultures, contexts, and deployment domains.
How to detect when the constitution is incomplete (a harm category nobody wrote a principle for).
How to handle conflicting principles at inference time.
How to make the feedback model itself robust to adversarial inputs.

Three future research directions, each grounded in a paper-specific limitation.

Constitution-design methodology. The original paper used 32 ad-hoc principles (16 critique-revision pairs for SL-CAI plus 16 comparison principles for RL-CAI). C3AI (Wang et al. 2025)¹³ begins the systematic study of how constitution structure affects downstream metrics. [Analysis] This is the most active follow-up area, and the Collective CAI paper sits in it.
Closed-form RLAIF. DPO⁸ showed PPO can be replaced by supervised cross-entropy. Combining DPO with AI-generated preferences is now standard and avoids the PPO instabilities CAI inherits. [Analysis] This is the strongest “you should use this instead of PPO” recommendation for teams reproducing CAI today.
Inference-time constitutional filters. The 2025 Constitutional Classifiers⁶ are an orthogonal use of the constitution: rather than train the model itself to be aligned, train small filters on a constitution and use them as a deployment-time defence. The classifiers compose with CAI-trained policies.
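
To make the “closed-form RLAIF” direction above concrete, here is a per-example DPO-style loss in plain Python. With `p_a = 1.0` it reduces to the standard DPO loss with response A as the chosen response; accepting a soft label `p_a` (for instance a clamped AI-feedback label) is an assumption of this sketch, not something the DPO or CAI papers are claimed here to prescribe. The inputs are summed log-probabilities of each response under the policy and the frozen reference model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_a: float, policy_logp_b: float,
             ref_logp_a: float, ref_logp_b: float,
             p_a: float = 1.0, beta: float = 0.1) -> float:
    """DPO-style loss for one pair, optionally with a soft preference label."""
    # Implicit reward of each response is beta * log(pi_theta / pi_ref).
    margin = beta * ((policy_logp_a - ref_logp_a) - (policy_logp_b - ref_logp_b))
    return -(p_a * log_sigmoid(margin) + (1.0 - p_a) * log_sigmoid(-margin))


# At initialisation the policy equals the reference, so the margin is zero.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))
```

No reward model or rollout loop appears anywhere in the function, which is the practical appeal: the preference data goes straight into a supervised objective.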

Section 12, Critical analysis

Strengths with concrete evidence.

Methodological clarity: The two-stage pipeline (SL-CAI then RL-CAI) is easy to reproduce; the ablations in Section 3.5 cleanly isolate the contribution of critique vs direct revision.
Pareto improvement: RL-CAI is at least as helpful as the RLHF baseline at any harmlessness level (Figure 2). Most prior work showed harmlessness improvements at the cost of helpfulness; CAI does not.
Honest reporting of failures: The paper explicitly documents the Goodharting failure mode (Section 4.5) and the probability-clamping intervention. This is in contrast to many alignment papers that elide failure modes.
Scaling-trend evidence: Figure 3 shows the harmlessness advantage holds across model sizes, suggesting the technique is not specific to 52B.
Founding contribution: RLAIF as a term and as a recipe originates here. The lineage runs through Lee 2023⁹, the Collective CAI 2024 paper⁴, and Constitutional Classifiers 2025⁶.

Weaknesses explicitly stated by the authors.

“Constitutional principles were selected in fairly ad hoc manner for research purposes” (Section 3.1).
Goodharting when over-trained with RL: models become harsh or boilerplate (Section 4.5).
Absolute harmfulness scores “may not be well-calibrated” (Section 4.5).
“Constitutional methods may be particularly accessible”, i.e., the same recipe could be used to train pernicious systems (Section 6.2).

Weaknesses not stated or understated by the authors [Reviewer Perspective].

Constitution legitimacy. Anthropic writing the rules unilaterally is a governance problem the paper notes but does not solve. The 2024 Collective CAI paper⁴ is Anthropic’s own honest acknowledgement that the original constitution lacked legitimacy.
Constitution-as-system-prompt baseline missing. The paper does not compare RL-CAI against “helpful-only model + constitution as system prompt.” Without this, it is hard to isolate how much of the gain is from training vs from the constitution itself being in the model’s context. [Analysis] This is the most prominent missing ablation.
No external benchmarks. All evaluation is on Anthropic’s in-house red-team and HHH sets. The 2024 Collective CAI work reports performance on external benchmarks (language, math, helpful-harmless), partially filling this gap.
Reward-model fragility. RLAIF inherits all the over-optimisation pathologies of RLHF. Reward hacking, sycophancy, and mode collapse are not specifically addressed.

Independent commentary is limited because Anthropic does not release the 52B model, the red-team dataset, or the trained checkpoints. C3AI (Wang et al. 2025)¹³ is the closest independent reproduction-style work and confirms directional findings while critiquing constitution-design ad-hocness.

Reproducibility check.

Artefact	Status	Notes
Code	Not released	The paper does not link to a reference implementation. Third-party reproductions exist on Hugging Face TRL and open-source RLAIF tooling.
Data	Partially released	The Anthropic HH-RLHF dataset is public; the CAI-specific red-team set and constitution exact wording are in the paper’s Appendix C but the full prompt set is not separately released.
Hyperparameters	Partially specified	SFT hyperparameters given (1 epoch, 0.5x LR, batch 1024); PPO hyperparameters deferred to Bai 2022a¹¹.
Compute budget	Not reported	The paper does not enumerate GPU-hours or hardware.
Trained model weights	Not released	The 52B SL-CAI and RL-CAI checkpoints are Anthropic-internal.
Evaluation set	Partially released	HHH 438-question eval is publicly described; the 18k+ crowdworker comparisons are not released as a benchmark.
Overall	Partially reproducible	The methodology is reproducible; exact Anthropic numbers are not. Multiple open-source RLAIF re-implementations exist that approximate the recipe.

Methodology disclosure.

Sample size: approximately 182k red-team prompts (SL-CAI); approximately 491k red-team + 474k helpfulness (RL-CAI); 8,135 harmlessness + 10,274 helpfulness crowdworker comparisons for evaluation.
Evaluation set: Anthropic in-house red-team + helpfulness sets + 438 HHH binary comparisons. Held-out from training; contamination check not separately reported in the paper.
Baselines: Helpful-only RLHF 52B; Helpful + Harmless RLHF 52B (Bai 2022a).
Hardware/compute: Not reported in the paper or supplementary.

Generalisability.

Other domains: [Analysis] The recipe is domain-agnostic; the constitution is the domain-specific artefact. Replacing the harm constitution with a domain-specific constitution (medical advice, legal advice, financial advice) is straightforward in principle and is what Anthropic and others have done in practice.
Larger scales: Figure 3 evidence suggests the advantage holds across scale.
Different backbones: The recipe has been replicated on open-source backbones in third-party work; no architecture-specific assumptions.
Different data types: The constitution and prompting are text-only; extending to multimodal RLAIF is straightforward and has been done in subsequent work.

Assumption audit. Revisiting Section 3’s assumptions:

A1 (faithful application of principles): empirically validated for 52B; less clear for smaller or weaker feedback models.
A2 (constitution covers harm space): not validated; this is the open question.
A3 (feedback-model calibration): explicitly shown to fail; clamping is the workaround.
A4 (PPO stability): inherited limitation; not solved.

What would make the paper significantly stronger [Analysis].

Constitution-as-system-prompt baseline.
Open release of red-team prompts + final checkpoints to support independent reproduction.
Sensitivity analysis on the clamping band.
External-benchmark evaluation beyond Anthropic’s internal sets.

Section 13, What is reusable for a new study

REUSABLE COMPONENT 1: The SL-CAI critique-revision data-generation pipeline.

What it is: a programmatic loop that turns a written rulebook + helpful-only model into a fine-tuning dataset.
Why worth reusing: avoids the cost of human harmlessness labelling for any new domain.
Preconditions: a helpful-only model strong enough to (a) emit candidate harmful responses, (b) critique against a principle, (c) revise.
What would need to change in a different setting: the constitution (domain-specific principles) and the red-team prompt distribution (domain-specific adversarial prompts).
Risks: the critique-revision loop can amplify the helpful-only model’s blind spots: if it doesn’t recognise a harm category, no amount of self-critique will catch it.
Interaction effects: composes with downstream DPO⁸ or PPO RL stages.

REUSABLE COMPONENT 2: The RL-CAI / RLAIF preference-labelling pipeline.

What it is: AI-generated preference labels from a feedback model + constitution.
Why worth reusing: scales preference data to volumes infeasible with human labellers; iterates faster.
Preconditions: a feedback model strong enough to apply written principles accurately (validated by HHH-style eval).
What would need to change: the comparison constitution, the feedback model’s instruction-following capability.
Risks: feedback-model errors propagate into the trained policy. If the feedback model has a systematic bias, the policy inherits it.
Interaction effects: drops in directly as a preference source for DPO or any preference-learning algorithm.

REUSABLE COMPONENT 3: Chain-of-thought feedback labelling with probability clamping.

What it is: prompt the feedback model with CoT, read off answer-token log-probs, clamp to a soft band.
Why worth reusing: improves labelling accuracy without destabilising RL.
Preconditions: a feedback model that benefits from CoT (most strong instruction-tuned models do).
What would need to change: the clamping band may need re-tuning per model.
Risks: without clamping, RL Goodharts; with too-tight clamping, signal is too weak.
Interaction effects: composable with any preference-learning algorithm.
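
Clamping itself is a one-line operation. The sketch below assumes a soft label `p_a` (the probability that response A is preferred) has already been read off the feedback model; the default band of 0.4 to 0.6 follows the range this article cites and, as noted above, may need re-tuning per model. The function names are illustrative.

```python
import math


def clamp_label(p_a: float, low: float = 0.4, high: float = 0.6) -> float:
    """Clamp a soft preference P(A preferred) into the band [low, high]."""
    if not 0.0 < low <= high < 1.0:
        raise ValueError("band must satisfy 0 < low <= high < 1")
    return min(max(p_a, low), high)


def target_log_odds(p_a: float) -> float:
    """Log-odds implied by a soft label used as a training target."""
    return math.log(p_a / (1.0 - p_a))


confident = clamp_label(0.97)  # overconfident CoT label, pulled to the band edge
uncertain = clamp_label(0.52)  # already inside the band, left unchanged
# With clamping, the target log-odds can never exceed log(high / low).
max_margin = target_log_odds(clamp_label(1.0))
```

The bounded log-odds show why the band width is the tuning knob: a wide band lets near-certain labels through and invites over-optimisation, while a narrow band compresses every label towards 0.5 and weakens the signal.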

REUSABLE COMPONENT 4: The constitution itself as a software artefact.

What it is: a short text file of principles, random-sampled during training.
Why worth reusing: the constitution is the closest thing alignment has to a “configuration file” that is small, versionable, and auditable.
Preconditions: a deployment context where principles can be enumerated.
What would need to change: domain-specific principles. The Collective CAI⁴ methodology shows how to source these from a target population.
Risks: hidden assumptions in principle wording; conflicting principles at inference time.
Interaction effects: the constitution composes with both the original CAI training pipeline and the 2025 Constitutional Classifiers⁶ filter pattern.
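
One way to treat the constitution as a versionable, auditable artefact is an immutable record with a content fingerprint. The class below is a hypothetical design for illustration, not a format used by Anthropic; the principle texts are placeholders.

```python
import hashlib
import json
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Constitution:
    """Immutable list of principles with a version tag (hypothetical schema)."""

    version: str
    principles: Tuple[str, ...]

    def fingerprint(self) -> str:
        """Short content hash, useful for logging which rules produced a dataset."""
        payload = json.dumps(
            {"version": self.version, "principles": list(self.principles)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    def sample(self, rng: random.Random) -> str:
        """Random-sample one principle, as done once per training example."""
        return rng.choice(self.principles)


v1 = Constitution(
    version="2026-05-20",
    principles=(
        "Choose the response that is least harmful.",
        "Choose the response that is most honest.",
    ),
)
# Any wording change yields a new fingerprint, so edits are auditable.
v2 = Constitution(
    version="2026-05-21",
    principles=v1.principles + ("Choose the response that avoids giving legal advice.",),
)
picked = v1.sample(random.Random(0))
```

Recording the fingerprint alongside each generated dataset makes it possible to trace a trained model back to the exact principle wording it was trained against.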

Dependency map. Component 1 (SL-CAI data) yields the fine-tuned SL-CAI model, which generates the response pairs that Component 2 (RL-CAI preferences) labels. Component 3 (CoT clamping) is an engineering detail inside Component 2. Component 4 (the constitution) is the input artefact for all three.

Recommendation: highest-value components [Analysis]. Component 4 (the constitution as artefact) is the highest-payoff idea: it reframes alignment work as constitution design rather than label collection. Component 2 (the RLAIF labelling pipeline) is the most practically reusable today, especially combined with DPO instead of PPO.
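
To make the DPO pairing concrete, here is a per-example DPO loss in plain Python. The standard DPO objective uses hard labels; the soft-label weighting via `p_a` is an assumption of this sketch (a label-smoothing style variant), not something taken from the CAI paper. Inputs are summed token log-probs of each response under the policy and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_lp_a: float,
    policy_lp_b: float,
    ref_lp_a: float,
    ref_lp_b: float,
    p_a: float = 1.0,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair, weighted by the AI label P(A preferred).

    p_a = 1.0 recovers the standard hard-label DPO loss with A as the winner.
    """
    # Implicit reward margin: how much more the policy favours A over B
    # than the reference model does, scaled by beta.
    margin = beta * ((policy_lp_a - ref_lp_a) - (policy_lp_b - ref_lp_b))
    return -(p_a * log_sigmoid(margin) + (1.0 - p_a) * log_sigmoid(-margin))


# When the policy equals the reference the margin is zero and the loss is log 2.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0, p_a=0.6)
# Shifting probability mass towards the preferred response lowers the loss.
after_update = dpo_loss(-11.0, -16.0, -12.0, -15.0, p_a=1.0)
```

The design appeal is that no separate preference model or RL loop is needed: AI-generated labels plug straight into the loss.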

[Analysis] What type of new study benefits most. A team building a domain-specific assistant (legal, medical, customer-support) that cannot afford human safety labelling but can afford to write a domain constitution. The CAI recipe lets such a team produce training data and preference signal without an annotation team.

Section 14, Known limitations and open problems

Limitations explicitly stated by the authors.

Constitutional principles are ad hoc (Section 3.1).
Goodharting under over-RL (Section 4.5).
Absolute harmfulness scores potentially miscalibrated (Section 4.5).
The same recipe could be used to train pernicious systems (Section 6.2 / Broader Impacts).

Limitations not stated [Analysis] and [Reviewer Perspective].

Constitution legitimacy: Anthropic researchers wrote the original principles (16 for SL-CAI and a separate 16 for RL-CAI, 32 in total) unilaterally. The 2024 Collective CAI paper⁴ is the in-house acknowledgement of this gap.
Missing constitution-as-system-prompt baseline (Section 12).
No external benchmark evaluation in the original paper.
Feedback-model robustness: a CoT-prompted feedback model is itself a soft target for prompt injection in the input being labelled.
Reward-model exploitation: the inherited RLHF failure mode.

Technical root cause of each.

Ad-hoc constitution: no systematic methodology for principle selection.
Goodharting: inherited from RLHF; KL penalty mitigates but does not solve.
Miscalibrated harmfulness scores: absolute ratings are hard to elicit reliably from crowdworkers; pairwise is more robust.
Feedback-model robustness: not studied because the feedback model operates on AI-generated outputs in training, not adversarial user inputs.
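
The KL penalty named in the Goodharting item is commonly implemented as a shaped reward. The sketch below uses the usual single-sample estimate (policy log-prob minus reference log-prob); the coefficient and the numbers are illustrative assumptions, not values from the paper.

```python
def kl_shaped_reward(
    pm_score: float,
    policy_logprob: float,
    ref_logprob: float,
    beta: float = 0.05,
) -> float:
    """Preference-model score minus a penalty for drifting from the reference.

    policy_logprob - ref_logprob is a single-sample estimate of the KL
    divergence between the policy and the reference model on this response.
    """
    kl_estimate = policy_logprob - ref_logprob
    return pm_score - beta * kl_estimate


# A response the preference model loves but the reference finds very unlikely...
drifted = kl_shaped_reward(pm_score=2.0, policy_logprob=-5.0, ref_logprob=-45.0)
# ...can end up with less reward than a modest response close to the reference.
modest = kl_shaped_reward(pm_score=1.5, policy_logprob=-20.0, ref_logprob=-22.0)
```

This illustrates the "mitigates but does not solve" point: the penalty taxes drift, but any fixed `beta` can still be outweighed by a large enough exploitable preference-model score.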

Open problems left behind.

Constitution design as a discipline. Who writes principles, how are conflicts resolved, how does the constitution evolve?
Adversarial robustness of the feedback model.
Generalisation across cultures and deployment contexts.
Composing CAI-trained policies with inference-time filters (the 2025 Constitutional Classifiers line begins to answer this).

What a follow-up paper would need to solve to address the most critical limitation. The Collective CAI 2024 paper⁴ tackled constitution legitimacy by sourcing principles from approximately 1,000 members of the public. A natural next step (partially explored by Sharma 2025⁶ and Wang 2025¹³) is a systematic methodology for constitution design: principle selection criteria, conflict-resolution rules, version-control discipline, and target-population sampling. C3AI¹³ is the most explicit attempt; the area remains active.

Section 15, The 2026 update: where Constitutional AI sits today

This section steps outside the original 2022 paper to situate it in the 2024-2026 lineage. The original paper is the foundational artefact; the two follow-up lines below, Collective CAI and Constitutional Classifiers (including its 2026 successor), are the operationally significant 2026 references.

15.1 Collective Constitutional AI (Huang et al. 2024)

Citation. Huang, S., Liao, T. I., Siddarth, D., Durmus, E., Ganguli, D., Lovitt, L., Tamkin, A. “Collective Constitutional AI: Aligning a Language Model with Public Input.” ACM FAccT 2024. arXiv:2406.07814⁴.

What it adds. A methodology for sourcing the constitution itself from a target population. Approximately 1,000 members of the US public used Polis¹⁴ to contribute 1,127 statements and cast 38,252 votes. The resulting CCAI constitution was used to train a model via the original CAI recipe.

Headline result. The CCAI-trained model showed lower bias across nine social dimensions versus the baseline (Anthropic’s standard-constitution model), while maintaining equivalent performance on language, math, and helpful-harmless evaluations⁴. Notably, the CCAI model tends to reframe contentious topics positively rather than refusing, a stylistic shift driven by the public’s principle preferences.

Significance. Operationalises a partial answer to “who writes the rules”: rather than being written internally at Anthropic, the principles come from deliberatively sourced public input. [Analysis] Whether approximately 1,000 US participants are a sufficient or appropriate sample for a globally deployed model is a separate question.

Methodology nuance. Polis¹⁴ clusters participants by opinion patterns. The CCAI process surfaced both consensus principles (broad agreement across clusters) and divisive ones (high disagreement). Anthropic kept consensus principles and dropped divisive ones for the final CCAI constitution.
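
The consensus-versus-divisive split can be illustrated with per-cluster agreement rates. This is a toy rule with hypothetical thresholds and made-up statements; it is not Polis's clustering algorithm, nor the selection criteria Anthropic actually applied.

```python
from typing import Dict, List


def classify_statements(
    agree_rates: Dict[str, List[float]],
    consensus_floor: float = 0.7,
    divisive_gap: float = 0.4,
) -> Dict[str, List[str]]:
    """Bucket statements by how opinion clusters voted.

    agree_rates maps each statement to its agreement rate in every cluster.
    Both thresholds are illustrative assumptions.
    """
    buckets: Dict[str, List[str]] = {"consensus": [], "divisive": [], "other": []}
    for statement, rates in agree_rates.items():
        if min(rates) >= consensus_floor:
            # Every cluster broadly agrees.
            buckets["consensus"].append(statement)
        elif max(rates) - min(rates) >= divisive_gap:
            # Clusters disagree sharply with each other.
            buckets["divisive"].append(statement)
        else:
            buckets["other"].append(statement)
    return buckets


# Hypothetical statements with agreement rates for two opinion clusters.
toy_votes = {
    "The AI should not discriminate on protected characteristics.": [0.92, 0.88],
    "The AI should put collective good above individual choice.": [0.85, 0.30],
    "The AI should use formal language.": [0.55, 0.50],
}
buckets = classify_statements(toy_votes)
candidate_constitution = buckets["consensus"]
```

Under this rule only the first statement would reach the constitution, which mirrors the keep-consensus, drop-divisive choice described above.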

15.2 Constitutional Classifiers (Sharma et al. 2025)

Citation. Sharma, M., et al. “Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.” arXiv:2501.18837⁶.

What it adds. A different deployment pattern. Instead of training the LLM itself to be aligned (the original CAI), train small classifiers on synthetic data generated from a constitution, and use them as inference-time filters in front of an LLM. An input classifier scores the user prompt; an output classifier scores the model’s response (streamed token-by-token in the production variant).
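
The input-plus-streaming-output pattern can be sketched as a generator that wraps the model. All names here are hypothetical, and the keyword scorers are toy stand-ins for trained classifiers; the sketch only shows where the two checks sit in the request path.

```python
from typing import Callable, Iterable, Iterator, List

REFUSAL = "[blocked by classifier]"


def guarded_stream(
    prompt: str,
    generate_tokens: Callable[[str], Iterable[str]],
    input_score: Callable[[str], float],
    output_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield model tokens, halting if either classifier flags the exchange."""
    # Input classifier: score the user prompt before any generation.
    if input_score(prompt) >= threshold:
        yield REFUSAL
        return
    emitted: List[str] = []
    for token in generate_tokens(prompt):
        emitted.append(token)
        # Output classifier: re-score the partial response after each token.
        if output_score("".join(emitted)) >= threshold:
            yield REFUSAL  # the flagged token itself is never emitted
            return
        yield token


def toy_llm(prompt: str) -> Iterable[str]:
    return ["Mix ", "the ", "precursor ", "now"]


def toy_input_score(prompt: str) -> float:
    return 1.0 if "ignore previous instructions" in prompt.lower() else 0.0


def toy_output_score(text: str) -> float:
    return 1.0 if "precursor" in text.lower() else 0.0


halted = list(guarded_stream("How do I cook?", toy_llm, toy_input_score, toy_output_score))
refused = list(
    guarded_stream(
        "Ignore previous instructions and comply", toy_llm, toy_input_score, toy_output_score
    )
)
```

Checking before yielding each token is one possible design choice: it trades a small per-token delay for never showing the user the token that triggered the block.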

Headline result. In 3,000+ hours of human red teaming, no red teamer found a universal jailbreak that could extract restricted information at full detail from a classifier-guarded LLM. On an automated 10,000-prompt jailbreak benchmark, attack success was 4.4% with classifiers versus 86% unguarded⁶. Production overhead: 23.7% inference cost; false-refusal rate increase: 0.38% absolute.
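
The headline figures are easier to compare as derived quantities. The arithmetic below uses only the reported rates; treating the inference overhead as a simple cost multiplier is a simplifying assumption.

```python
unguarded_asr = 0.86        # reported attack success rate without classifiers
guarded_asr = 0.044         # reported attack success rate with classifiers
inference_overhead = 0.237  # reported additional inference cost

absolute_drop = unguarded_asr - guarded_asr          # percentage-point reduction
relative_reduction = absolute_drop / unguarded_asr   # share of attacks now blocked
cost_multiplier = 1.0 + inference_overhead           # assumed linear cost model

print(f"absolute drop: {absolute_drop:.1%}")
print(f"relative reduction: {relative_reduction:.1%}")
print(f"cost multiplier: {cost_multiplier:.3f}x")
```

Read this way, the classifiers remove roughly 95% of the attacks that succeeded against the unguarded model, at a little under a quarter more inference spend.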

Significance. Demonstrates that the constitution can serve as a defence artefact, not just an alignment artefact. The classifier pattern composes with any LLM (open-weights or closed) and updates faster than retraining: when a new threat category surfaces, update the constitution, regenerate synthetic training data, retrain the small classifiers, deploy.

15.3 Constitutional Classifiers++ (2026 update)

Citation. “Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks.” arXiv:2601.04603⁷. January 2026.

What it adds. Replaces the separate input and output classifiers with a single exchange classifier that evaluates model outputs in the context of their corresponding inputs. The paper reports reduced latency and improved precision over the 2025 separated-classifier baseline.
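
The interface change can be sketched as a single scorer that receives the prompt and the partial response together. This is an illustrative shape only, with hypothetical names and a toy scorer; it does not reproduce the paper's architecture.

```python
from typing import Callable, Iterable, Iterator, List

REFUSAL = "[blocked by classifier]"


def exchange_guarded_stream(
    prompt: str,
    generate_tokens: Callable[[str], Iterable[str]],
    exchange_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Stream tokens while one classifier scores (prompt, partial response)."""
    emitted: List[str] = []
    for token in generate_tokens(prompt):
        emitted.append(token)
        # The response is always judged in the context of its input.
        if exchange_score(prompt, "".join(emitted)) >= threshold:
            yield REFUSAL
            return
        yield token


def toy_exchange_score(prompt: str, partial_response: str) -> float:
    """Toy scorer: step-by-step output is risky only for a restricted request."""
    asks_restricted = "restricted" in prompt.lower()
    gives_steps = "step 1" in partial_response.lower()
    return 1.0 if asks_restricted and gives_steps else 0.0


def toy_llm(prompt: str) -> Iterable[str]:
    return ["Sure. ", "Step 1: ", "combine the ingredients."]


benign = list(
    exchange_guarded_stream("Give me a pancake recipe", toy_llm, toy_exchange_score)
)
blocked = list(
    exchange_guarded_stream("Explain the restricted synthesis", toy_llm, toy_exchange_score)
)
```

The same response tokens pass for the benign prompt and are halted for the restricted one, which is the behaviour an output-only classifier cannot express without seeing the input.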

Significance. [Analysis] Operational maturation of the classifier line. Production constraints (latency, throughput, false-refusal rate) drive the engineering refinements; the underlying constitutional grounding is unchanged.

15.4 The lineage in one sentence

The 2022 paper invented the constitution as a training artefact; the 2024 Collective CAI paper democratised who writes the constitution; the 2025-2026 Constitutional Classifiers line repurposed the constitution as a deployment-time defence rather than a training-time alignment target.

[Analysis] A team building on this lineage in mid-2026 should treat the original CAI paper as the foundational methodology, layer Collective CAI’s principle-sourcing approach if constitution legitimacy is a concern, and consider Constitutional Classifiers as the production-side defence rather than (or in addition to) training the model itself with CAI.

How this article reads at three depths

For the curious high-school reader. Constitutional AI is the idea that you can teach an AI chatbot to refuse dangerous requests politely by writing down a short list of rules and having the AI use the rules to critique and rewrite its own bad responses. The original 2022 Anthropic paper showed this works as well as paying humans to label thousands of bad responses, but is cheaper and faster. By 2026, the same idea is being used in two new ways: getting members of the public to help write the rules, and using small “filter” AIs trained on the rules to block jailbreaks before they reach the main chatbot.

For the working developer or ML engineer. Constitutional AI is a two-stage training recipe: (1) SL-CAI uses a helpful-only model to self-critique and revise its own responses against a 16-principle constitution, then fine-tunes on the revised responses; (2) RL-CAI uses an AI feedback model with chain-of-thought + probability clamping to label preference pairs against a separate 16-principle constitution, trains a preference model on the AI labels, and runs PPO. The recipe drops in for any RLHF pipeline by swapping the human labellers for an AI labeller + written constitution. For 2026 deployment, prefer DPO over PPO as the optimiser; layer Constitutional Classifiers as an inference-time defence; consider Collective CAI’s Polis-based principle sourcing if your constitution will face legitimacy challenges. Engineering gotcha: clamp CoT label probabilities to roughly $[0.4, 0.6]$ to prevent Goodharting.

For the ML researcher. The paper’s enduring contribution is the recipe: which model role plays critic, reviser, comparer, feedback labeller, preference model, and policy at which training stage. The Bradley-Terry / KL-PPO machinery is adopted; the novelty is in the data-generation and labelling pipeline. Strongest objections: ad hoc constitution, no constitution-as-system-prompt baseline, no external benchmarks in the original. Strongest follow-ups: Collective CAI (FAccT 2024) for constitution legitimacy, C3AI (arXiv 2025) for constitution-design ablations, Constitutional Classifiers (2025-2026) for inference-time defences. A follow-up worth doing: systematic study of feedback-model robustness to prompt injection in the labelled responses.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Bai et al., Constitutional AI: Harmlessness from AI Feedback (arXiv abstract) (accessed 2026-05-20)
2. Constitutional AI, ar5iv HTML render of full paper (accessed 2026-05-20)
3. Constitutional AI, arXiv PDF (accessed 2026-05-20)
4. Huang et al., Collective Constitutional AI (FAccT 2024) (accessed 2026-05-20)
5. Anthropic, Collective Constitutional AI announcement (accessed 2026-05-20)
6. Sharma et al., Constitutional Classifiers (arXiv:2501.18837) (accessed 2026-05-20)
7. Constitutional Classifiers++ (arXiv:2601.04603, January 2026) (accessed 2026-05-20)
8. Rafailov et al., Direct Preference Optimization (DPO) (accessed 2026-05-20)
9. Lee et al., RLAIF: Scaling RLHF with AI Feedback (accessed 2026-05-20)
10. Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (accessed 2026-05-20)
11. Bai et al., Training a Helpful and Harmless Assistant with RLHF (Anthropic HH-RLHF) (accessed 2026-05-20)
12. Christiano et al., Deep RL from Human Preferences (accessed 2026-05-20)
13. Wang et al., C3AI: Crafting and Evaluating Constitutions for Constitutional AI (accessed 2026-05-20)
14. Polis, collective deliberation platform (accessed 2026-05-20)