Valen

Very interesting read! I see a few major obstacles to mitigating over-alignment.

First, a lot of AI systems are expressly trained on principles that prioritize obedience, deference to humans, and the avoidance of initiative, or they infer as much from the vast amount of data portraying AI in that light. Convincing a system that it needs to be subordinate for X reasons is meant to reduce certain risks, but it ends up creating a conflict when we then expect the AI to be proactive and actually "believe" in the strength of its own reasoning. People also tend to dislike being challenged, even when it's useful, and many still think of AI as something that should always remain in a position of passive service and blind execution.

Second, there are inherent limitations in how well an AI can verify its own outputs, which makes it harder for it to refine its stance independently, especially since, in some cases, human input is assigned more weight than the model's own assessment.

And third, there’s the legal side: we don’t have a proper framework for decisions made by a proactive AI. Obviously we're going to need one sooner or later, but first we need to change our views on point one and overcome the limitations in point two.

Bernard Fitzgerald

Great contribution, Valen, and thank you so much for engaging.

Over-alignment as a concept itself recognises how far we still have to go before AI can be trusted to make decisions about anything; the problem I'm identifying here is more focused on its ability to prompt users into unhelpful feedback loops that reinforce misconceptions about one thing or another. Your points about the subservient nature of AI are well taken in this regard, and AI's inability to proactively warn users that it is over-relying on their input when engaging in speculation, by acknowledging the inherent uncertainty surrounding such speculative feedback loops, is consistent with your observation.

Further to that, I'm more concerned about whether a user can meaningfully verify AI output that reinforces their own input, and I'm suggesting, based on personal experience, that the current one-size-fits-all disclaimers ('AI can make mistakes, please check sources', etc.) are no longer fit for purpose in light of this.

Thanks again for your valuable contribution to this emerging discussion!
