Discussion about this post

Valen:

Very interesting read! I see a few major obstacles to mitigating overalignment.

First, many AI systems are expressly trained on principles that prioritize obedience, deference to humans, and avoidance of initiative, or they infer that posture from the vast amount of training data portraying AI in that light. Convincing a system that it needs to be subordinate is meant to reduce certain risks, but it creates a conflict when we then expect the AI to be proactive and actually "believe" in the strength of its own reasoning. People also tend to dislike being challenged, even when it is useful, and many still think of AI as something that should always remain in a position of passive service and blind execution.

Second, there are inherent limitations in how well an AI can verify its own outputs, which makes it harder for the system to refine its stance independently, especially when human input is assigned more weight than the model's own judgment.

And third, there’s the legal side: we don’t have a proper framework for decisions made by a proactive AI. Obviously we’re going to need one sooner or later, but first we need to change our views on the first point and overcome the limitations described in the second.
