Test Game Models - Search News

Google’s Kaggle to host AI chess tournament to evaluate leading AI models’ reasoning skills

The world’s top-performing artificial intelligence models, including OpenAI’s o3 and o4-mini, Google LLC’s Gemini 2.5 Pro and Gemini 2.5 Flash, Anthropic’s Claude Opus 4, and xAI Corp.’s Grok 4, are ...

10d

New Study Shows AI Outpaces Humans in Game Testing

NetEase-backed study shows language model agents may detect bugs faster and with greater coverage than existing tools.

TechCrunch

Anthropic says most AI models, not just Claude, will resort to blackmail

Several weeks after Anthropic released research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is ...

TechCrunch

A new, challenging AGI test stumps most AI models

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general ...

Science Daily

AI meets game theory: How language models perform in human-like social scenarios

Large language models (LLMs) -- the advanced AI behind tools like ChatGPT -- are increasingly integrated into daily life, assisting with tasks such as writing emails, answering questions, and even ...

Futurism

An AI Model Has Officially Passed the Turing Test

One of the industry’s leading large language models has passed a Turing test, a longstanding barometer for human-like intelligence. In a new preprint study awaiting peer review, researchers report ...

MIT Technology Review

Can we fix AI’s evaluation crisis?

Researchers are trying to come up with new, better ways to test AI. As a tech reporter I often get asked questions like “Is DeepSeek actually better than ChatGPT?” or “Is the Anthropic model any good?”
