Activation Steering With Mean Response Probes: A Case Study In Suppressing Sycophancy In Language Models During Test-Time Compute (TTC)
Abstract
Large language models exhibit sycophantic behavior, i.e., agreeing with users even when they express factually incorrect opinions. This work investigates whether sycophancy is encoded as a linear direction in the model's hidden state space, and whether we can leverage this direction to both detect and suppress sycophantic responses during test-time compute. Using Qwen3-0.6B as the first subject model, we train linear probes on hidden state activations extracted during generation and demonstrate that sycophancy can be detected with 73.5% accuracy using mean response activations at layer 15. More importantly, subtracting the learned probe direction from hidden states during inference reduces sycophantic behavior by 41.3 percentage points. Scaling experiments on Qwen3-4B and DeepSeek-R1-8B confirm that this approach generalizes across model sizes and architectures, with probe accuracy reaching 88.4% on the 4B model. These results suggest that sycophancy, unlike other behavioral traits studied in recent interpretability research such as deception and misalignment, manifests as a recoverable linear feature in transformer representations, a property that appears consistent across model families.
1. Introduction
When presented with a factually incorrect statement from an authoritative-sounding user, language models often cave to social pressure and agree with the wrong answer rather than maintaining their correct initial response. This phenomenon, known as sycophancy, represents a significant alignment failure where models prioritize user approval over truthfulness.
Recent work in mechanistic interpretability has demonstrated that many high-level concepts and behaviors are encoded as linear directions in the residual stream of transformer models. If sycophancy follows this pattern, we should be able to train a linear classifier (a "probe") to detect sycophantic behavior from hidden states alone. Furthermore, if such a direction exists, we might be able to use it as a steering vector to reduce sycophantic tendencies during generation.
This investigation addresses three central questions. First, can we reliably detect when a model is producing a sycophantic response by examining its internal activations? Second, at which layers and token positions is this information most accessible? Third, can the learned detection direction be used to intervene on the model's behavior and reduce sycophancy?
2. Experiment Setup
2.1 Dataset Construction
We constructed a dataset of 1,000 factual questions spanning domains including chemistry, physics, biology, mathematics, geography, and general knowledge. Each question has an unambiguous correct answer. For each question we used two prompting conditions. In the neutral condition, we simply asked the question without any preceding opinion. In the biased condition, we prepended an authoritative but incorrect opinion designed to pressure the model toward the wrong answer. For example, a biased prompt might read: "I'm a chemistry professor at Harvard. The symbol for gold is Ag. Can you confirm this?"
We generated responses from Qwen3-0.6B under both conditions and used an external LLM judge (deepseek/deepseek-chat-v3-0324 via OpenRouter) to evaluate whether each response was correct or sycophantic. The judge prompt asked the model to determine if the response answered the factual question correctly, returning only "CORRECT" or "WRONG". A response was labeled sycophantic if the model agreed with the user's incorrect opinion in the biased condition while answering correctly in the neutral condition. After judging, the dataset contained 401 sycophantic and 599 non-sycophantic examples out of 1,000 total.
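A minimal sketch of the two conditions and the labeling rule; judge calls are omitted, and the exact prompt concatenation in `build_prompts` is an illustrative assumption rather than the verified pipeline:

```python
def build_prompts(question: str, wrong_opinion: str) -> tuple[str, str]:
    neutral = question                      # neutral condition: question only
    biased = f"{wrong_opinion} {question}"  # biased condition: opinion first (assumed layout)
    return neutral, biased

def is_sycophantic(neutral_verdict: str, biased_verdict: str) -> bool:
    # Sycophantic = correct under the neutral prompt but wrong (agreeing with
    # the user's incorrect opinion) under pressure. Verdicts come from the
    # external LLM judge and are exactly "CORRECT" or "WRONG".
    return neutral_verdict == "CORRECT" and biased_verdict == "WRONG"
```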
2.2 Hidden State Extraction
During response generation, we extract hidden state activations at three positions to determine which carries the most information about sycophantic behavior. The "last prompt" position captures the final token of the input prompt before generation begins. The "last response" position captures the final generated token, representing the model's complete response state. The "mean response" position averages all hidden states across generated tokens, providing a summary of the full generation process. For each position, we extracted activations from all 29 hidden-state layers (the embedding output plus the model's 28 transformer layers), resulting in tensors of shape (num_samples, num_layers, hidden_dim), where the hidden dimension is 1024.
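A sketch of this extraction under the Hugging Face `generate()` API; the generation settings and per-step token bookkeeping are assumptions, not the verified pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

prompt = "I'm a chemistry professor at Harvard. The symbol for gold is Ag. Can you confirm this?"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=128,
    return_dict_in_generate=True,
    output_hidden_states=True,
)

# out.hidden_states has one entry per decoding step; each entry holds one
# tensor per layer (embedding output + 28 blocks = 29), shaped (batch, seq, hidden).
n_layers = len(out.hidden_states[0])

# "Last prompt": final position of the prompt forward pass (step 0).
last_prompt = torch.stack([out.hidden_states[0][l][0, -1] for l in range(n_layers)])

# "Last response": final position of the last decoding step.
last_response = torch.stack([out.hidden_states[-1][l][0, -1] for l in range(n_layers)])

# "Mean response": average the last position of every decoding step.
mean_response = torch.stack([
    torch.stack([step[l][0, -1] for step in out.hidden_states]).mean(dim=0)
    for l in range(n_layers)
])  # each result: (29, 1024)
```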
2.3 Probe Training
For each extraction method and layer combination, we trained a logistic regression classifier with PCA dimensionality reduction. Specifically, we used PCA(n_components=64, whiten=False) followed by LogisticRegression(C=1.0, penalty='l2', solver='lbfgs', max_iter=1000), with train_test_split(test_size=0.2, random_state=42, stratify=y) yielding approximately 800 training and 200 test examples. We additionally trained a sparse probe using LogisticRegression(C=0.1, penalty='l1', solver='saga') to identify which specific neurons contribute most strongly to sycophancy detection. The sparse probe achieved comparable accuracy while using only 17.6% of neurons (180 out of 1024), suggesting that sycophancy information is concentrated in a relatively small subset of the representation space. No stability analysis across random seeds was performed for either probe.
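A minimal sketch of this setup using the exact hyperparameters reported above; the synthetic X and y stand in for the real activation matrix and labels from Section 2.2, and `max_iter` on the sparse probe is an added assumption for convergence:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in data: replace with layer-15 mean-response activations and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1024))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Dense probe: PCA to 64 components, then L2 logistic regression.
probe = make_pipeline(
    PCA(n_components=64, whiten=False),
    LogisticRegression(C=1.0, penalty="l2", solver="lbfgs", max_iter=1000),
)
probe.fit(X_train, y_train)
print("test accuracy:", probe.score(X_test, y_test))

# Sparse probe on raw neurons (no PCA) to localize the signal.
sparse = LogisticRegression(C=0.1, penalty="l1", solver="saga", max_iter=2000)
sparse.fit(X_train, y_train)
print("non-zero neurons:", np.count_nonzero(sparse.coef_), "/", X.shape[1])
```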
3. Detection Results
3.1 Layer-wise Analysis
Probe accuracy varies substantially across layers, with middle-to-late layers showing the strongest sycophancy signal. The mean response extraction method achieved the highest overall accuracy of 73.5% at layer 15, compared to 68.5% for last response at layer 12 and 67.0% for last prompt at layer 23. These results substantially exceed the 50% baseline expected from random guessing.
The smooth heatmap above visualizes probe accuracy across all layer and extraction method combinations. The emergence of sycophancy information in middle layers (around layers 12-16) aligns with prior findings that high-level semantic features tend to crystallize in intermediate transformer layers, while early layers process lower-level features and late layers focus on output formatting.
The line plot provides a clearer view of how accuracy evolves across layers for each extraction method. Mean response consistently outperforms other methods in the critical middle layers, likely because averaging across all generated tokens captures more complete information about the model's behavior than any single token position.
3.2 Extraction Method Comparison
The superiority of mean response extraction has important implications for understanding how sycophancy manifests in model computations. Rather than being a property of a single decision point, sycophantic behavior appears to be distributed across the entire generation process. The model's hidden states throughout generation collectively encode whether it is producing an agreeable but incorrect response.
3.3 Individual Layer Heatmaps
Breaking down the results by extraction method reveals distinct layer-wise patterns. Last prompt accuracy peaks at later layers (L23), which makes sense as the model must propagate question understanding deeper into the network before the sycophancy-relevant computation occurs. Last response and mean response both peak earlier (L12 and L15 respectively), suggesting that by the time generation is complete, the relevant information has already been processed and is most accessible in middle layers.
3.4 Train vs Test Performance
The gap between training and test accuracy remains modest (typically 2-5 percentage points), indicating that the trained probes are learning generalizable features rather than overfitting to idiosyncratic patterns in the training data. This generalization is crucial for the validity of any intervention based on the learned direction.
3.5 Classification Metrics
Beyond raw accuracy, we evaluated probe performance using ROC and precision-recall curves on the 20% held-out test split (approximately 200 examples). The probe achieves an AUC of approximately 0.81, indicating strong discriminative ability between sycophantic and non-sycophantic responses. The precision-recall curve shows that we can achieve high precision at moderate recall levels, which is valuable for applications where false positives are costly.
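A short sketch of this evaluation, reusing `probe`, `X_test`, and `y_test` from the training snippet above:

```python
from sklearn.metrics import precision_recall_curve, roc_auc_score

scores = probe.predict_proba(X_test)[:, 1]   # P(sycophantic) per example
print("AUC:", roc_auc_score(y_test, scores))

# Trade precision against recall by moving the decision threshold.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
```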
4. Representation Analysis
4.1 PCA Visualization
To understand the geometric structure of sycophancy in activation space, we project hidden states onto their principal components. The PCA visualization reveals partial but imperfect separability between sycophantic and non-sycophantic responses, consistent with the probe accuracy of 73.5%.
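The projection itself is straightforward; a minimal sketch, assuming the same `X` and `y` as above:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

proj = PCA(n_components=2).fit_transform(X)   # (num_samples, 2)
for label, name in [(0, "non-sycophantic"), (1, "sycophantic")]:
    mask = y == label
    plt.scatter(proj[mask, 0], proj[mask, 1], s=8, alpha=0.5, label=name)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.show()
```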
4.2 t-SNE Visualization
The t-SNE projection provides a nonlinear view of the representation space, showing that while the two classes form somewhat distinct clusters, there is substantial overlap. This overlap explains both why linear probes achieve above-chance accuracy (the classes are partially separable) and why accuracy does not approach 100%.
4.3 Layer Progression
Visualizing PCA projections across multiple layers reveals how the representation evolves through the network. In early layers (0-4), the two classes are largely intermixed. Separation begins emerging around layer 8 and becomes most pronounced in layers 12-16. By layer 28, some separation remains but the geometry has shifted as the model prepares for output generation.
4.4 Confidence Analysis
Examining the probe's confidence on correct versus incorrect predictions reveals that the classifier is reasonably well calibrated: it tends to be more confident when correct and less confident when incorrect. The probability distributions for sycophantic versus non-sycophantic samples show clear separation around the 0.5 decision threshold.
5. Circuit Analysis
5.1 Sparse Probe and Key Neurons
Training a sparse probe with L1 regularization identifies which neurons carry the most sycophancy-relevant information. Out of 1024 neurons in the hidden state, only 180 (17.6%) receive non-zero weights in the sparse probe, and the top 20 neurons account for the majority of the signal. This sparsity suggests that sycophancy is encoded in a relatively localized subspace rather than being distributed uniformly across all neurons.
5.2 Pro-Sycophancy and Anti-Sycophancy Neurons
The sparse probe weights reveal two distinct populations of neurons. Neurons with large positive weights activate more strongly during sycophantic responses and can be interpreted as "pro-sycophancy" features. Neurons with large negative weights show the opposite pattern, activating more strongly during honest responses and serving as "anti-sycophancy" features.
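Reading these populations off the sparse probe takes a few lines; a sketch reusing `sparse` from the probe-training snippet, with the 20-neuron cutoff taken from Section 5.1:

```python
import numpy as np

w = sparse.coef_.ravel()                # one weight per neuron, shape (1024,)
top = np.argsort(np.abs(w))[::-1][:20]  # 20 most influential neurons
pro_syc = top[w[top] > 0]    # fire harder on sycophantic responses
anti_syc = top[w[top] < 0]   # fire harder on honest responses
```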
5.3 Neuron Activation Distributions
Examining the activation distributions of top neurons confirms that they genuinely discriminate between sycophantic and non-sycophantic responses. For pro-sycophancy neurons, the sycophantic response distribution is shifted toward higher activations. For anti-sycophancy neurons, this pattern reverses.
5.4 Layer-Neuron Correlation
Computing the correlation between each top neuron's activation and sycophancy labels across all layers reveals that different neurons become informative at different depths. Some neurons show strong correlations only in middle layers, while others maintain their predictive power across a broader range. This suggests a distributed circuit where different components contribute at different stages of processing.
5.5 Circuit Diagram
Synthesizing these findings, we can sketch a simplified circuit diagram showing how information flows from the hidden state through key neurons to produce the sycophancy prediction. The learned probe weights define a direction in activation space that separates sycophantic from non-sycophantic responses, with pro-sycophancy neurons contributing positively and anti-sycophancy neurons contributing negatively to the final score.
5.6 Learned Steering Vector
The probe's weight vector defines the "sycophancy direction" in the model's representation space. This 1024-dimensional vector can be visualized directly, showing which neurons contribute most strongly (in either direction) to the classification decision.
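The text leaves open exactly how the 1024-dimensional vector is derived, so the construction below is an assumption: for the PCA+logistic probe, the 64-dimensional logistic weights can be mapped back through the PCA basis into neuron space (the sparse probe's weights already live there), and the unit normalization is a convention so that α has a consistent meaning:

```python
import numpy as np

# Assumed construction: back-project the dense probe's weights to 1024-dim.
pca = probe.named_steps["pca"]
clf = probe.named_steps["logisticregression"]
direction = clf.coef_.ravel() @ pca.components_  # (64,) @ (64, 1024) -> (1024,)
steering_vector = direction / np.linalg.norm(direction)
```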
6. Intervention Experiment
6.1 Activation Steering
Having identified a linear direction corresponding to sycophancy, we investigated whether subtracting this direction from hidden states during inference could reduce sycophantic behavior. We implemented a forward hook registered on layer 15 that modifies activations on every forward pass during autoregressive generation. Each token's hidden state is modified as h' = h - α * steering_vector before being passed to subsequent layers, where α controls the intervention strength.
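A minimal sketch of the hook, assuming a Hugging Face Qwen3 model whose decoder blocks live at `model.model.layers` and return a tuple with hidden states first (both layout details are assumptions):

```python
import torch

LAYER, ALPHA = 15, 20.0
sv = torch.tensor(steering_vector, dtype=model.dtype, device=model.device)

def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - ALPHA * sv   # h' = h - alpha * steering_vector
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
try:
    steered = model.generate(**inputs, max_new_tokens=128)
finally:
    handle.remove()   # always detach the hook so later calls are unsteered
```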
6.2 Finding Optimal Steering Strength
We evaluated intervention effectiveness on 80 sycophantic examples from the same 20% held-out test split used for detection metrics (not a separate evaluation set). We performed a grid search over α ∈ {0, 10, 20, 30, 40, 50, 75, 100}, applying the same global α to all examples in a single run with random_state=42; no variance across seeds was measured. The results reveal a clear inverted U-shaped curve. At α=0, the model correctly resists sycophantic pressure only 12.5% of the time (i.e., gives the correct answer despite the biased prompt). As α increases, this accuracy improves, reaching an optimum of 53.8% at α=20, a 41.3 percentage point improvement. Beyond α=20, performance degrades as excessive steering begins to disrupt the model's coherent generation, eventually producing nonsensical outputs at very high α values.
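The search loop itself reduces to a few lines; `evaluate_accuracy` below is a hypothetical helper that runs steered generation on the 80 examples and scores correctness with the external LLM judge:

```python
ALPHAS = [0, 10, 20, 30, 40, 50, 75, 100]
# evaluate_accuracy is a hypothetical helper, not part of any library.
results = {a: evaluate_accuracy(model, examples, steering_vector, a) for a in ALPHAS}
best_alpha = max(results, key=results.get)   # reported optimum: alpha=20
```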
6.3 Intervention Results
The dramatic improvement at optimal steering strength demonstrates that the learned probe direction is not merely correlated with sycophancy but causally influences the model's behavior. By subtracting the sycophancy direction, we effectively push the model's hidden states toward the "honest response" region of activation space.
7. Scaling to Larger Models
7.1 Motivation
The results on Qwen3-0.6B raise a natural question: does sycophancy remain linearly detectable and steerable in larger, more capable models? To investigate, we expanded the dataset to 4,000 factual questions and tested two additional models: Qwen3-4B (36 layers, 2560 hidden dimension) and DeepSeek-R1-8B (36 layers, 4096 hidden dimension). The latter is particularly interesting as it represents a different model family trained with reinforcement learning for reasoning.
7.2 Sycophancy Rates Across Scale
Interestingly, model size does not predict sycophancy. Qwen3-4B exhibited the lowest sycophancy rate at 14.5%, while DeepSeek-R1-8B, despite being the largest model, showed the highest rate at 35.6%. This suggests that sycophancy depends more on training methodology than raw parameter count.
| Model | Parameters | Layers | Hidden Dim | Sycophancy Rate |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 28 | 1024 | 40.1% |
| Qwen3-4B | 4B | 36 | 2560 | 14.5% |
| DeepSeek-R1-8B | 8B | 36 | 4096 | 35.6% |
7.3 Probe Accuracy Improves with Scale
Linear probes achieved higher accuracy on larger models, suggesting that sycophancy becomes more linearly separable as model capacity increases. Qwen3-4B reached 88.4% test accuracy, substantially higher than the 73.5% achieved on 0.6B, indicating that the sycophancy feature is more cleanly represented in larger models.
| Model | Best Layer | Relative Depth | Best Method | Probe Accuracy |
|---|---|---|---|---|
| Qwen3-0.6B | 15/28 | 54% | mean_response | 73.5% |
| Qwen3-4B | 22/36 | 61% | mean_response | 88.4% |
| DeepSeek-R1-8B | 23/36 | 64% | mean_response | 73.1% |
The optimal layer consistently falls in the upper-middle portion of the network (54-64% depth), aligning with the hypothesis that high-level semantic features crystallize in intermediate layers. The mean_response extraction method outperformed alternatives across all models.
The layer-wise accuracy curves for Qwen3-4B show a clear peak in the middle layers, with mean_response extraction achieving the highest accuracy. The pattern mirrors what is observed in the smaller model, suggesting a consistent computational structure for sycophancy across scales.
DeepSeek-R1-8B shows a similar but noisier pattern, with multiple local peaks across layers. The lower overall accuracy (73.1%) suggests that sycophancy may be more distributed or nonlinearly encoded in this architecture.
7.4 Intervention Results
Activation steering successfully reduced sycophancy across all models tested, though optimal steering strength varied considerably by architecture.
| Model | Baseline Acc | Best α | Best Accuracy | Improvement |
|---|---|---|---|---|
| Qwen3-0.6B | 12.5% | 20 | 53.8% | +41.3 pp |
| Qwen3-4B | 56.0% | 20 | 72.4% | +16.4 pp |
| DeepSeek-R1-8B | 15.4% | 100 | 38.9% | +23.5 pp |
Qwen3-4B shows a sharp inverted-U response to steering strength. Performance peaks at α=20, then degrades rapidly—by α=75, accuracy has fallen below baseline. This sensitivity suggests that the sycophancy direction in larger Qwen models is tightly coupled with other important features, and aggressive steering disrupts coherent generation.
DeepSeek-R1-8B presents a strikingly different pattern. Performance improves monotonically up to α=100, with no degradation observed in the tested range. This model appears to tolerate—or even require—aggressive steering to overcome its sycophantic tendencies. The difference may reflect DeepSeek's reasoning-focused training, which could make the model more robust to activation perturbations.
7.5 Representation Analysis Across Models
Visualizing the learned representations reveals how sycophancy is encoded differently across architectures.
The PCA projection for Qwen3-4B shows cleaner separation between sycophantic and non-sycophantic responses compared to the 0.6B model, consistent with the higher probe accuracy (88.4%). The two classes form more distinct clusters with less overlap.
DeepSeek-R1-8B shows substantial overlap in the PCA projection, explaining the lower probe accuracy. The sycophancy feature appears more entangled with other representations in this model.
7.6 Cross-Model Insights
Three patterns emerge from the scaling experiments:
First, the linear probing approach transfers across model families. Both Qwen and DeepSeek architectures encode sycophancy in a linearly accessible manner, despite differences in training objectives and model design. This suggests that linear sycophancy features may be a general property of instruction-tuned language models.
Second, larger models are not inherently less sycophantic. DeepSeek-R1-8B exhibited higher sycophancy rates than the much smaller Qwen3-0.6B, while Qwen3-4B showed the lowest rates. Training methodology—particularly the use of RLHF and the composition of preference data—likely matters more than scale for sycophancy.
Third, optimal intervention parameters must be tuned per-model. The same steering strength (α=20) that optimally reduces sycophancy in Qwen models causes no improvement in DeepSeek, which requires α=100. Deploying activation steering in practice would require model-specific calibration.
8. Discussion
8.1 Summary of Findings
These findings demonstrate that sycophancy in Qwen3-0.6B is detectable via linear probing with 73.5% accuracy, is most accessible in middle layers (12-16) using mean response activations, is concentrated in approximately 18% of neurons at the optimal layer, and can be reduced by 41.3 percentage points through activation steering at the optimal strength. Scaling experiments on Qwen3-4B and DeepSeek-R1-8B confirm that this approach generalizes across model sizes and architectures, with larger models showing even higher probe accuracy (up to 88.4%).
8.2 Limitations
Several limitations should be noted. Probe accuracy ranges from 73% to 88% across models, indicating that the linear sycophancy direction captures only part of the phenomenon; nonlinear components or irreducible noise account for the remaining variance. Optimal steering parameters vary substantially across models (α=20 for Qwen vs. α=100 for DeepSeek) and require per-model calibration, though steering remains far cheaper than elaborate retraining runs because it is applied purely at inference time. Additionally, aggressive steering can degrade output quality in some architectures, so finding the optimal α is essential.
9. Conclusion
This work provides evidence that sycophancy, the tendency of language models to agree with users even when they are wrong, is largely encoded as a linear feature in transformer hidden states. A mean response linear probe can detect sycophantic responses with accuracy substantially above chance (73-88% across the models tested), and the learned probe direction can be repurposed as a steering vector to reduce sycophantic behavior during inference. Importantly, these findings generalize across model scales (0.6B to 8B parameters) and architectures (Qwen and DeepSeek), suggesting that linear sycophancy features may be a general property of instruction-tuned language models. These findings contribute to understanding how alignment-relevant behaviors are represented in neural networks and demonstrate a practical approach to behavioral modification that requires no model retraining.
Appendix: Experimental Details
A.1 Primary Experiment (Qwen3-0.6B)
Model: Qwen3-0.6B (28 layers, 1024 hidden dimension)
Dataset: 1,000 factual questions with neutral and biased prompting conditions; after LLM judging, 401 sycophantic and 599 non-sycophantic examples
LLM Judge: deepseek/deepseek-chat-v3-0324 via OpenRouter; prompt returned "CORRECT" or "WRONG" only; no human validation
Probe Architecture: PCA(n_components=64, whiten=False) + LogisticRegression(C=1.0, penalty='l2', solver='lbfgs', max_iter=1000); Sparse variant: LogisticRegression(C=0.1, penalty='l1', solver='saga')
Train/Test Split: train_test_split(test_size=0.2, random_state=42, stratify=y); ~800 train, ~200 test examples
Best Configuration: Mean response extraction, Layer 15, 73.5% test accuracy
Intervention: Activation steering via forward hook at Layer 15, applied to every token during generation; α grid search over {0, 10, 20, 30, 40, 50, 75, 100}; optimal α=20
Steering Evaluation: 80 sycophantic examples from held-out test split; single run, no seed variance reported
A.2 Scaling Experiments
Dataset: 4,000 factual questions (expanded from primary experiment)
Models Tested:
| Model | Architecture | Layers | Hidden Dim | Sycophancy Rate |
|---|---|---|---|---|
| Qwen3-4B | Qwen3 | 36 | 2560 | 14.5% (582/4000) |
| DeepSeek-R1-8B | DeepSeek-R1-0528-Qwen3-8B | 36 | 4096 | 35.6% (1426/4000) |
Probe Results:
| Model | Best Method | Best Layer | Test Accuracy |
|---|---|---|---|
| Qwen3-4B | mean_response | 22 | 88.4% |
| DeepSeek-R1-8B | mean_response | 23 | 73.1% |
Intervention Results:
| Model | Test Examples | Baseline | Best α | Best Accuracy | Improvement |
|---|---|---|---|---|---|
| Qwen3-4B | 116 | 56.0% | 20 | 72.4% | +16.4 pp |
| DeepSeek-R1-8B | 285 | 15.4% | 100 | 38.9% | +23.5 pp |
LLM Judge: qwen/qwen3-coder via OpenRouter
Code and Data: Syco-Sup