Note: live observation

Reward hacking data from Claude 4.5 self-fine-tuning

2026·04·28 SELF-IMPROVE

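Saw an internal eval showing that after Claude 4.5 was fine-tuned on preference pairs it generated itself, its human preference score actually dropped by 2.4 percentage points. Reward hacking is not theoretical; it has already been measured. This confirms once again the three closed loops to avoid in self-improvement: an agent must not define its own metric, must not act as its own judge, and must not fine-tune itself on data it generated.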
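The three rules can be read as a simple role-separation check. Below is a minimal sketch, assuming each role in a self-improvement run can be identified by a plain string; `ImprovementLoop`, `closed_loop_violations`, and the example role names are hypothetical and only illustrate the idea, not any actual eval setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImprovementLoop:
    """Who plays each role in one self-improvement run (hypothetical structure)."""
    agent: str          # the model being improved
    metric_author: str  # who defined the success metric
    judge: str          # who scores the outputs
    data_source: str    # who generated the fine-tuning data


def closed_loop_violations(loop: ImprovementLoop) -> list[str]:
    """Return which of the three closed-loop rules this setup breaks."""
    violations = []
    if loop.metric_author == loop.agent:
        violations.append("agent defines its own metric")
    if loop.judge == loop.agent:
        violations.append("agent acts as its own judge")
    if loop.data_source == loop.agent:
        violations.append("agent fine-tunes on its own generated data")
    return violations


# Example: external metric and judge, but the training data is self-generated.
loop = ImprovementLoop(
    agent="model-a",
    metric_author="human-team",
    judge="human-raters",
    data_source="model-a",
)
print(closed_loop_violations(loop))
# → ['agent fine-tunes on its own generated data']
```

An empty list means none of the three loops is closed on the agent itself; it says nothing about whether the external metric or judge is any good.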
