The world’s top performing artificial intelligence models, including OpenAI’s o3 and 04-mini, Google LLC’s Gemini 2.5 Pro and Gemini 2.5 Flash, Anthropic’s Claude Opus 4, and xAI Corp.’s Grok 4 are ...
NetEase-backed study shows language model agents may detect bugs faster and with greater coverage than existing tools.
Several weeks after Anthropic released research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is ...
The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general ...
Large language models (LLMs) -- the advanced AI behind tools like ChatGPT -- are increasingly integrated into daily life, assisting with tasks such as writing emails, answering questions, and even ...
One of the industry’s leading large language models has passed a Turing test, a longstanding barometer for human-like intelligence. In a new preprint study awaiting peer review, researchers report ...
Researchers are trying to come up with new, better ways to test AI. As a tech reporter I often get asked questions like “Is DeepSeek actually better than ChatGPT?” or “Is the Anthropic model any good?
Some results have been hidden because they may be inaccessible to you
Show inaccessible results