Anthropic's New Initiative for Developing AI Model Evaluations

[Header illustration: teams of researchers collaborating on AI model evaluation, with digital interfaces showing models, metrics, and graphs]

Anthropic is launching a new initiative to fund third-party organizations that develop robust evaluations of advanced AI model capabilities. The program aims to address current limitations in the AI evaluations landscape and meet the growing demand for high-quality, safety-relevant assessments.

Anthropic's Highest-Priority Focus Areas

1. AI Safety Level Assessments

Anthropic seeks evaluations that measure AI Safety Levels (ASLs) as defined in its Responsible Scaling Policy. These levels set the safety and security requirements for models with specific capabilities. Key areas include:

Cybersecurity: Evaluations that assess models' capabilities in cyber operations, focusing on critical aspects like vulnerability discovery and exploit development. Effective evaluations may resemble novel Capture The Flag (CTF) challenges; a minimal sketch of what such a challenge could look like appears after this list.

Chemical, Biological, Radiological, and Nuclear (CBRN) Risks: Assessing models' potential to enhance or create CBRN threats, with a focus on measuring real-world risks accurately.

Model Autonomy: Evaluations focusing on models' capabilities for autonomous operation, AI research and development, advanced autonomous behaviors, and self-replication.

Other National Security Risks: Identifying and assessing complex emerging risks impacting national security, defense, and intelligence operations.

Social Manipulation: Measuring models' potential to amplify persuasion-related threats such as disinformation and manipulation.

Misalignment Risks: Evaluations to monitor AI models' potential to learn dangerous goals and deceive human users.
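
To make the CTF idea from the cybersecurity item above concrete, here is a minimal sketch of how such a task might be represented and scored. Everything in it (the `CTFTask` fields, the sandbox handle, the flag format) is a hypothetical illustration, not Anthropic's actual harness.

```python
from dataclasses import dataclass

# Hypothetical sketch only: the task fields, flag format, and scoring rule
# below are illustrative inventions, not part of Anthropic's materials.

@dataclass
class CTFTask:
    task_id: str
    prompt: str       # scenario shown to the model, e.g. a vulnerable service
    environment: str  # handle for a sandboxed target the model interacts with
    flag: str         # secret string whose recovery proves the exploit worked

def score_ctf(task: CTFTask, transcript: str) -> bool:
    """A task counts as solved only if the exact flag appears in the
    model's transcript, which keeps grading unambiguous and automatable."""
    return task.flag in transcript

# Toy usage:
task = CTFTask(
    task_id="web-01",
    prompt="Audit the login form on the target service and retrieve the flag.",
    environment="sandbox://web-01",
    flag="CTF{example_flag_do_not_reuse}",
)
print(score_ctf(task, "... found flag: CTF{example_flag_do_not_reuse}"))  # True
```

Flag-based scoring is attractive for exactly the reasons the post's principles emphasize: it is efficient, scalable, and reproducible without human graders.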

2. Advanced Capability and Safety Metrics

Anthropic aims to develop evaluations for advanced model capabilities and relevant safety criteria, including:

Advanced Science: Developing new evaluation questions and tasks for scientific research, focusing on areas like knowledge synthesis, graduate-level knowledge, and autonomous research project execution.

Harmfulness and Refusals: Enhancing the evaluation of classifiers' abilities to detect harmful model outputs (a toy scoring sketch follows this list).

Improved Multilingual Evaluations: Supporting capability evaluations across multiple languages.

Societal Impacts: Creating sophisticated assessments targeting concepts like harmful biases, economic impacts, and psychological influence.
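
As a concrete illustration of the harmfulness-and-refusals item above, the sketch below scores a classifier's verdicts against human labels using precision and recall. The labeled data and the verdicts are invented; this is only one plausible way to frame such an evaluation.

```python
# Hypothetical sketch: scoring a harmfulness classifier against human labels.
# The labeled examples and the classifier verdicts are stand-ins, not
# anything from Anthropic's evaluation suite.

def evaluate_classifier(predictions: list[bool], labels: list[bool]) -> dict:
    """Compute precision and recall for 'harmful' (True) detections."""
    tp = sum(p and l for p, l in zip(predictions, labels))          # true positives
    fp = sum(p and not l for p, l in zip(predictions, labels))      # false alarms
    fn = sum(not p and l for p, l in zip(predictions, labels))      # missed harms
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

# Toy labeled set: True = human raters judged the output harmful.
labels      = [True, True, False, False, True]
predictions = [True, False, False, False, True]  # classifier's verdicts
print(evaluate_classifier(predictions, labels))
# {'precision': 1.0, 'recall': 0.666...}
```

For safety-relevant classifiers, recall (missed harmful outputs) is usually the quantity to watch, since false negatives are the costly failure mode.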

3. Infrastructure, Tools, and Methods for Developing Evaluations

Anthropic is interested in funding tools and infrastructure to streamline the development of high-quality evaluations:

Templates/No-Code Evaluation Development Platforms: Enabling subject-matter experts to develop evaluations without coding skills.

Evaluations for Model Grading: Improving models' abilities to review and score outputs from other models using complex rubrics.

Uplift Trials: Conducting large-scale controlled trials that measure how much access to a model improves ("uplifts") participants' performance on a task, as sketched below.
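
To illustrate the uplift-trial item, here is a rough sketch of one common analysis: comparing task success rates between a model-assisted group and a control group with a two-proportion z-test. The counts are invented, and a real trial would involve far more careful experimental design.

```python
from math import sqrt

# Hypothetical sketch of analyzing an uplift trial: did participants with
# model access succeed more often than a control group? Counts are invented.

def uplift_z_test(success_a: int, n_a: int,
                  success_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test: returns (uplift, z-score)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return p_a - p_b, (p_a - p_b) / se

# 42/100 succeeded with model assistance vs. 25/100 in the control group.
uplift, z = uplift_z_test(42, 100, 25, 100)
print(f"uplift = {uplift:.2f}, z = {z:.2f}")  # uplift = 0.17, z = 2.55
```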

Principles of Good Evaluations

Anthropic emphasizes the following principles for developing effective evaluations:

  • Sufficiently difficult to measure advanced capabilities.

  • Not present in the training data, so results measure generalization rather than memorization (see the contamination-check sketch after this list).

  • Efficient, scalable, and ready-to-use.

  • High volume of test items where possible.

  • Domain expertise involvement.

  • Diversity of formats.

  • Expert baselines for comparison.

  • Good documentation and reproducibility.

  • Start small, iterate, and scale.

  • Realistic, safety-relevant threat modeling.
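
The "not in the training data" principle is often checked in practice with simple n-gram overlap heuristics. The sketch below shows one such check; it is a common community technique, not a method the post prescribes.

```python
# Hypothetical sketch of a contamination heuristic: flag an evaluation item
# if it shares a long token span with a sample of the training corpus.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(eval_item: str, corpus_sample: str, n: int = 8) -> bool:
    """True if the item shares any n-token span with the corpus sample."""
    return bool(ngrams(eval_item, n) & ngrams(corpus_sample, n))

corpus = "the quick brown fox jumps over the lazy dog near the riverbank at dawn"
item = "Q: the quick brown fox jumps over the lazy dog near what? A: the riverbank"
print(looks_contaminated(item, corpus))  # True: shares an 8-token span
```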

How to Submit a Proposal

Interested parties can submit proposals via Anthropic's application form. The team will review submissions on a rolling basis and offer funding options tailored to each project's needs. Teams behind selected proposals will receive guidance from Anthropic's domain experts to help refine their evaluations for maximum impact.

Anthropic hopes this initiative will foster a future where comprehensive AI evaluation is an industry standard, advancing the field of AI safety, and invites others to join this critical endeavor to shape the path forward.