OpenAI’s GDPval benchmark reveals that frontier AI models can match or exceed human expert performance in nearly half of professional work tasks, with Claude Opus 4.1 achieving a 47.6% win rate and GPT-5 following at 38.8-39%, across 1,320 specialized tasks spanning 44 occupations.

The evaluation assessed AI performance against deliverables produced by human professionals with an average of 14+ years of experience across nine of the largest U.S. GDP-contributing industries. Human graders preferred Claude Opus 4.1's outputs over those of human experts in 48% of cases, while GPT-5's outputs were preferred in 39% of cases.

Tasks that might take human experts hours or days were completed by the models in a fraction of the time. Frontier models delivered expert-level business outputs roughly 100x faster and at roughly 100x lower cost than human professionals, figures that make the return on investment of knowledge work automation straightforward to model.
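To make the cost comparison concrete, here is a minimal, purely illustrative sketch of how such an ROI calculation might look. The function, variable names, and the sample hourly rate are assumptions for demonstration, not figures reported by GDPval; the 100x defaults simply echo the multipliers cited above and ignore real-world overheads such as prompt engineering or rework.

```python
# Illustrative ROI sketch only: assumed numbers, not GDPval data.

def task_roi(expert_hours: float, expert_hourly_rate: float,
             ai_speedup: float = 100.0, ai_cost_reduction: float = 100.0,
             review_hours: float = 0.5) -> dict:
    """Compare an expert-completed task with an AI-plus-human-review workflow."""
    expert_cost = expert_hours * expert_hourly_rate
    ai_cost = expert_cost / ai_cost_reduction          # assumed ~100x cheaper generation
    review_cost = review_hours * expert_hourly_rate    # a human still checks the output
    total_ai_cost = ai_cost + review_cost
    return {
        "expert_cost": expert_cost,
        "ai_workflow_cost": total_ai_cost,
        "savings": expert_cost - total_ai_cost,
        "turnaround_hours": expert_hours / ai_speedup + review_hours,
    }


if __name__ == "__main__":
    # Hypothetical example: a deliverable an expert would spend 8 hours on at $150/hour.
    print(task_roi(expert_hours=8, expert_hourly_rate=150))
```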

The evaluation process was designed to mirror real workplace conditions: experienced professionals compared AI-generated deliverables against those produced by human experts, judging quality across diverse real-world artifacts including legal briefs, engineering blueprints, nursing care plans, and medical reports.
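The headline percentages come from tallying these pairwise grader judgments. The sketch below shows one simple way such judgments could be aggregated into win and tie rates; the label format and sample data are hypothetical and not GDPval's actual schema.

```python
# Minimal sketch of tallying pairwise preference judgments (hypothetical data format).
from collections import Counter


def summarize_grades(judgments: list[str]) -> dict:
    """Each judgment records which deliverable a grader favored: 'ai', 'human', or 'tie'."""
    counts = Counter(judgments)
    total = len(judgments)
    return {
        "ai_win_rate": counts["ai"] / total,
        "tie_rate": counts["tie"] / total,
        "ai_win_or_tie_rate": (counts["ai"] + counts["tie"]) / total,
    }


if __name__ == "__main__":
    sample = ["ai", "human", "tie", "ai", "human", "human", "ai", "tie", "human", "ai"]
    print(summarize_grades(sample))
```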

Government, retail, and wholesale trade sectors showed the highest rates of expert-level AI performance. Claude Opus 4.1 excelled especially in clerical and administrative roles, reaching expert-level performance on 81% of counter and rental clerk tasks and 76% of shipping clerk tasks, and on 70% of software development tasks. GPT-5 showed strong capabilities in sales management (79%) and editing (75%) roles.

The GDPval findings represent a significant milestone in AI capabilities, demonstrating that frontier models can now deliver expert-quality outputs across nearly half of knowledge work tasks while providing unprecedented efficiency advantages for businesses seeking to optimize their operations.
