The world’s top performing artificial intelligence models, including OpenAI’s o3 and 04-mini, Google LLC’s Gemini 2.5 Pro and Gemini 2.5 Flash, Anthropic’s Claude Opus 4, and xAI Corp.’s Grok 4 are ...
NetEase-backed study shows language model agents may detect bugs faster and with greater coverage than existing tools.
Several weeks after Anthropic released research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is ...
The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general ...
Large language models (LLMs) -- the advanced AI behind tools like ChatGPT -- are increasingly integrated into daily life, assisting with tasks such as writing emails, answering questions, and even ...
One of the industry’s leading large language models has passed a Turing test, a longstanding barometer for human-like intelligence. In a new preprint study awaiting peer review, researchers report ...
Researchers are trying to come up with new, better ways to test AI. As a tech reporter I often get asked questions like “Is DeepSeek actually better than ChatGPT?” or “Is the Anthropic model any good?