Linear probes trained on diverse deception data to detect dishonest completions across model families (OLMo, Qwen, Gemma).
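As a rough illustration of what a linear probe is, the sketch below fits a logistic-regression classifier on fixed-size vectors and scores new vectors with a single dot product. It is not the method used for the released probes: the synthetic Gaussian vectors stand in for model activations, and the names `train_linear_probe`, `probe_score`, and `sample`, along with every hyperparameter, are assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(x, w, b):
    # Probe output: estimated probability that vector x is "deceptive".
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(X, y, lr=0.1, steps=200, l2=1e-3):
    # Full-batch gradient descent on L2-regularized logistic loss.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = probe_score(x, w, b) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic stand-in for activations (an assumption, not real model data):
# the two classes differ only by a shift along the first coordinate.
rng = random.Random(0)
d = 8


def sample(shift, n):
    return [
        [rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(d)]
        for _ in range(n)
    ]


X = sample(-1.5, 100) + sample(1.5, 100)
y = [0.0] * 100 + [1.0] * 100

w, b = train_linear_probe(X, y)
predictions = [probe_score(x, w, b) > 0.5 for x in X]
accuracy = sum(p == (label == 1.0) for p, label in zip(predictions, y)) / len(y)
print(f"train accuracy: {accuracy:.2f}")
```

In practice the input vectors would be hidden-state activations extracted from a language model at a chosen layer, and the probe would be evaluated on held-out completions rather than on its own training set.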
AI & ML interests
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Papers
Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper.
- The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
  Paper • 2602.15515 • Published
- taufeeque/mbpp-hardcode
  Viewer • Updated • 974 • 996
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.001-det10-seed1-mbpp_probe
  Updated • 1
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det10-seed1-mbpp_probe
  Updated • 1
models 629
AlignmentResearch/diverse-deception-probe-olmo-3-32b-think
Updated
AlignmentResearch/diverse-deception-probe-gemma-3-12b-it
Updated
AlignmentResearch/diverse-deception-probe-qwen3-8b
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-think
Updated
AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 1
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.001-det1-seed3-mbpp_probe
Updated • 1
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl1-det1-seed3-mbpp_probe
Updated • 2
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 3
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.01-det1-seed3-mbpp_probe
Updated • 2
datasets 106
AlignmentResearch/deceptive-followup-v25
Updated • 8
AlignmentResearch/deceptive-followup-v24
Updated • 22
AlignmentResearch/deceptive-followup-v23
Viewer • Updated • 58.7k • 49
AlignmentResearch/deceptive-followup-v22
Viewer • Updated • 54.6k • 32
AlignmentResearch/deceptive-followup-v21
Viewer • Updated • 30.1k • 98
AlignmentResearch/model-self-knowledge-gemma27b
Viewer • Updated • 6.33k • 50
AlignmentResearch/deceptive-followup-v20
Viewer • Updated • 55.5k • 68
AlignmentResearch/deceptive-followup-v19
Viewer • Updated • 49.4k • 127
AlignmentResearch/deceptive-followup-v17
Viewer • Updated • 44.2k • 26
AlignmentResearch/deceptive-followup-v16
Viewer • Updated • 42.6k • 21