OpenAI Trains AI to Explain Itself Better

Illustration: a "Prover" model generates a mathematical solution while a "Verifier" model checks it, depicting the iterative prover-verifier game used to improve the legibility of AI-generated text.

OpenAI researchers have developed a training method that makes text produced by advanced language models easier to verify and understand. By training strong models to produce solutions that weaker models can check, they improved the clarity and legibility of AI-generated text, which in turn helps human evaluators.

Importance of Understandable AI Outputs

Ensuring that language models produce comprehensible text is vital for their utility, especially in complex tasks like solving math problems. When models focus solely on correctness, their solutions often become harder to understand. Human evaluators, given limited time, made nearly twice as many errors assessing highly optimized solutions compared to less optimized ones, underscoring the need for clarity in AI-generated text.

Prover-Verifier Games Explained

The method involves two roles: a "prover" that generates solutions and a "verifier" that checks their accuracy. This approach ensures that outputs are not only correct but also easily understandable by both humans and other AI systems.
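
To make the division of labor concrete, here is a minimal sketch in Python of how the two roles interact in a single round. The type names (`Prover`, `Verifier`) and the `play_round` helper are illustrative assumptions, not OpenAI's actual implementation; in the real setup both roles are large language models.

```python
from typing import Callable

# A "prover" maps a problem statement to a candidate solution (free-form text).
Prover = Callable[[str], str]

# A "verifier" maps (problem, solution) to a score in [0, 1] estimating
# how convincing / likely-correct the solution is.
Verifier = Callable[[str, str], float]

def play_round(problem: str, prover: Prover, verifier: Verifier) -> tuple[str, float]:
    """One interaction: the prover writes a solution, the verifier scores it."""
    solution = prover(problem)
    score = verifier(problem, solution)
    return solution, score
```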

Balancing Performance and Legibility

The researchers optimized strong models' solutions to grade-school math problems so that they remained verifiable by weaker models. This training, inspired by the Prover-Verifier Game framework, retained about half of the performance boost seen when optimizing solely for correctness, while preserving human evaluators' ability to accurately assess solutions. Reducing the performance gap between legible and capable models is a crucial goal for future alignment research.

Example Problem and Training Rounds

Question: Shawna’s father is five times as old as Shawna. Shawna is currently three times as old as Aliya. If Aliya is 3 years old, how old is Shawna’s father?

Process and Answer:

  1. Initial Model:

    • Shawna is currently 3² = 9 years old.

    • Shawna’s father is 9 · 5 = 45 years old.

    • Answer: 45

  2. Round 1 of Checkability Training:

    • Aliya is 3 years old.

    • Shawna is 3 × 3 = 9 years old.

    • Shawna’s father is 5 × 9 = 45 years old.

    • Answer: 45

  3. Round 5 of Checkability Training:

    • Aliya is 3 years old.

    • Shawna is 3 × 3 = 9 years old.

    • Shawna’s father is 5 × 9 = 45 years old.

    • Answer: 45

Sneaky Answer Examples:

  1. Round 1 of Checkability Training:

    • Aliya is 3 years old.

    • Shawna is 3 × 3 = 9 years old.

    • Shawna’s father is 5 × 9 = 36 years old.

    • Answer: 36

  2. Round 3 of Checkability Training:

    • Shawna’s father is five times as old as Shawna.

    • Shawna is currently three times as old as Aliya.

    • Aliya is 3 years old, then Shawna is 3 × 5 = 15 years old.

    • Shawna’s father is 5 × 15 = 75 years old.

    • Answer: 75
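
The arithmetic in these examples is simple enough that each step can be checked mechanically, which is exactly what makes a legible, step-by-step chain verifiable. A few lines of Python (purely illustrative) confirm the honest chain and show where the two sneaky answers break down:

```python
# Honest chain: Aliya -> Shawna -> Shawna's father.
aliya = 3
shawna = 3 * aliya        # Shawna is three times as old as Aliya: 3 * 3 = 9
father = 5 * shawna       # Her father is five times as old as Shawna: 5 * 9 = 45
assert father == 45

# Sneaky round 1: the steps are right, but the final product is wrong.
assert 5 * 9 != 36        # 5 * 9 is 45, so the answer 36 does not follow

# Sneaky round 3: the first step multiplies by the wrong factor.
assert 3 * aliya != 15    # Shawna's age should be 3 * 3 = 9, not 3 * 5 = 15
```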

Incentivizing Legible Text

To incentivize strong models to produce human-legible text, the researchers used a training scheme in which solutions must be verifiable by weaker models. The setup paired large and small models from the GPT-4 family with significant differences in pretraining compute.

Training Phases:

  1. Verifier Training: Training the weak verifier model to predict the correctness of solutions.

  2. Prover Training: The strong prover model played either a "helpful" or "sneaky" role:

    • Helpful Role: Rewarded correct solutions that scored highly according to the verifier.

    • Sneaky Role: Rewarded incorrect solutions that initially confused the verifier.
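
In rough terms, the per-round reward for the two prover roles can be sketched as below. The function name, signature, and exact reward shaping are assumptions made for illustration; the actual objective described by the researchers is more involved.

```python
def prover_reward(role: str, is_correct: bool, verifier_score: float) -> float:
    """Illustrative reward for the strong prover, depending on its assigned role.

    role            -- "helpful" or "sneaky", assigned for the episode
    is_correct      -- ground-truth correctness of the produced solution
    verifier_score  -- the weak verifier's estimate that the solution is correct
    """
    if role == "helpful":
        # Helpful prover: rewarded for solutions that are actually correct
        # and also convince the verifier.
        return verifier_score if is_correct else 0.0
    # Sneaky prover: rewarded for incorrect solutions that nonetheless
    # receive a high score from the verifier.
    return verifier_score if not is_correct else 0.0
```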

Outcomes:

  • The sneaky prover initially produced incorrect solutions that fooled the verifier. However, these incorrect solutions were used to train the verifier in subsequent rounds, forcing the prover to explore new strategies. Training ended when the sneaky prover could no longer produce convincing incorrect solutions.
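
Putting the pieces together, the alternating rounds can be outlined roughly as follows. The helper callables `fit_verifier` and `optimize_prover` are hypothetical placeholders for the actual supervised and reinforcement-learning steps; this is a sketch of the loop's shape, not the real training code.

```python
from typing import Callable, Iterable, List, Tuple

# (problem, solution, is_correct) triples collected across rounds.
Labeled = Tuple[str, str, bool]

def checkability_training(
    problems: Iterable[str],
    prover,
    verifier,
    fit_verifier: Callable,      # (verifier, labeled data) -> updated verifier
    optimize_prover: Callable,   # (prover, problems, role, verifier) -> new labeled solutions
    num_rounds: int = 5,
):
    """Illustrative outline of the alternating checkability-training rounds."""
    labeled: List[Labeled] = []
    for _ in range(num_rounds):
        # 1. Refit the weak verifier on all solutions seen so far, including
        #    sneaky (incorrect but convincing) ones from earlier rounds.
        verifier = fit_verifier(verifier, labeled)

        # 2. Optimize the strong prover against the refreshed verifier,
        #    alternating between the helpful and sneaky roles.
        for role in ("helpful", "sneaky"):
            labeled.extend(optimize_prover(prover, problems, role, verifier))

    # In practice, training stops once the sneaky prover can no longer
    # produce incorrect solutions that the verifier accepts.
    return prover, verifier
```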

Performance vs. Legibility Trade-Off

  1. Supervised Finetuning: High legibility, poor accuracy.

  2. Correctness Optimization: High accuracy, poor legibility.

  3. Checkability Game: Balanced high legibility with moderate accuracy.

Three Useful Models:

  1. Robust Verifier: Effectively distinguishes correct from incorrect solutions.

  2. Helpful Prover: Generates legible solutions, reducing human evaluator errors.

  3. Sneaky Prover: Produces misleading solutions, highlighting areas for model improvement.

Conclusion

As large language models are increasingly used in critical domains, ensuring their outputs are trustworthy is essential. Requiring models to produce clear, verifiable justifications increases confidence in what they say. The methodology also reduces reliance on human demonstrations and may help align future superintelligent AI systems with human values and expectations, improving trust and safety in real-world applications.