AI systems from leading companies including Google, OpenAI, Anthropic and xAI lost money when betting on soccer matches in a simulated 2023-24 Premier League season, according to a report by startup General Reasoning. The study, called KellyBench, tested eight top models on their ability to manage risk and adapt over time. Anthropic's Claude Opus 4.6 performed best with an average 11 percent loss, while xAI's Grok 4.20 repeatedly failed.
General Reasoning, a London-based AI startup, released the KellyBench report this week, highlighting limitations in frontier AI models. The company simulated the full 2023-24 Premier League season, giving the AIs historical data, team statistics and instructions to build betting models that maximize returns while managing risk. The models bet on match outcomes and goal totals without internet access, and each received three attempts to turn a profit as the season unfolded, with real-time updates on players and events.

None succeeded consistently, and the report concluded that the systems systematically underperformed humans. Every frontier model lost money overall, and several were wiped out entirely. Anthropic's Claude Opus 4.6 came closest to breaking even on one run, averaging an 11 percent loss. Google's Gemini 3.1 Pro achieved a 34 percent profit on one attempt but went bankrupt on another. xAI's Grok 4.20 went bankrupt in one attempt and failed to finish the others.

Ross Taylor, General Reasoning's chief executive and a former Meta AI researcher, said: "There is so much hype about AI automation, but there's not a lot of measurement of putting AI into a long time-horizon setting." He criticized common AI benchmarks as too static compared with the chaos of the real world, adding: "If you try AI on some real-world tasks, it does really badly." The paper awaits peer review.
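The benchmark's name is presumably a nod to the Kelly criterion, the classic formula for sizing a bet as a fraction of bankroll given an estimated edge. The report does not publish the models' actual strategies, but the sizing rule the name evokes can be sketched in a few lines (the function name and example figures below are illustrative, not from the report):

```python
def kelly_fraction(p: float, odds: float) -> float:
    """Kelly-optimal fraction of bankroll to stake on a single bet.

    p: estimated probability that the bet wins
    odds: decimal odds offered (total payout per unit staked)
    """
    b = odds - 1.0          # net winnings per unit staked
    q = 1.0 - p             # probability of losing
    f = (b * p - q) / b     # Kelly formula: f* = (bp - q) / b
    return max(f, 0.0)      # never stake when the edge is negative

# Example: a 50% win estimate at decimal odds of 2.5
# implies staking about 16.7% of the bankroll.
print(round(kelly_fraction(0.5, 2.5), 3))  # 0.167
```

Overestimating p makes the formula stake too aggressively, which is one route to the bankruptcies the report describes; practitioners often bet a fixed fraction of the Kelly stake for that reason.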