this week in AI #3: the $60 trillion problem
o3, o4-mini, benchmark drama, gpt 4.1, gemini 2.5 flash
for people in a rush:
Mechanize calls our jobs a $60 trillion problem
OpenAI’s o3 / o4-mini wow and gaslight users
GPT 4.1 divides programmers while being occasionally malign
Google’s Gemini 2.5 Flash introduces a “thinking budget”
Anthropic’s deep research tool is fine but pricey
OpenAI is buying its first coding agent
OpenAI wants to rival X with an image-centric social app
AI defends national interests
AI is a “normal” technology, like electricity
LLMs echo-chamber values. well, sort of
rise of a new “villain”
To the founders of Mechanize, your job is a problem. Yes, your regular 9-to-5 that puts food on your table and Netflix on your TV is a problem that needs to be resolved. Together, you and everyone else who works make up a $60 trillion “problem” that Mechanize wants to solve by “completely automating labor”.
Well ok, sounds a bit scary…
But what if this isn’t such a bad thing? Some jobs are too dangerous or too isolating, like mining or oil rigging. Or the unsexy household labor of sorting clothes and dusting. These tasks are so boring that automating them would be great, right?
Oh no no no. What made you think that? NO, they won’t touch those; “their immediate focus is on white-collar work”: sales reps, software developers, customer service. Mechanize’s founder, Tamay Besiroglu, believes that once these jobs go we’ll all be richer, as he argues in his research paper. He’s also the guy behind Epoch AI, and I have plenty of tea to spill about them. But more on that soon.
When he announced his latest darling, people were, as one might imagine, not pleased:
“I think it will be a huge loss for most humans, as well as contribute directly to intelligence runaway and disaster” posted Anthony Aguirre, an executive at a nonprofit founded by Max Tegmark.
It seems like we’re building this automated future faster than we can even figure out what it means to live in it. Regardless of how I feel about it, Anthropic’s top security leader believes that we’ll see fully virtual AI employees next year. The stars (and companies) appear to be aligning to make this happen. Just look at what was released in the last week:
openai’s o3
Lo and behold: OpenAI’s most powerful model and successor to o1. It’s the first reasoning model with full autonomous tool access: web search, Python execution, file operations, image analysis. It kick-started a fun viral trend: geoguessing, the knack for accurately pinpointing where a photo was taken.
As a test, I uploaded a blurry screenshot of a random street in Cebu, Philippines. Because it was a screenshot, it contained no metadata. The model “thought” for three minutes, zoomed in on several parts of the photo, and eventually guessed not just the street but the exact stretch of that street. If it sounds like a privacy nightmare, that’s because it is. Posting a photo in front of your home now allows anyone with access to o3 to locate you. Not scary at all.
The biggest loser after this release is probably Perplexity. I used it for tasks like “find X on the internet and do Y to it” but ever since o3 came out I completely forgot about Perplexity’s existence. o3 is the smartest, most independent model I’ve ever used. There, I said it!
If you’re a developer and want API access to o3, you will have to upload an ID and verify your identity. OpenAI has become like a hen sitting on eggs, watching over them jealously. When I say eggs, I really mean its API. Paranoid, some might think, but the company has accused DeepSeek of training its models on ChatGPT output, and it believes this “big brother” move will protect its advantage over competitors.
openai’s o4-mini
A smaller reasoning model that can process and reason with images, use tools autonomously, execute Python, and browse the web: basically all the good stuff o3 can do, but faster and roughly ten times cheaper. It’s built to excel at math and coding, but developers haven’t been too happy about it:
“Nightmare experience” and “Lost four hours this morning wrestling with a very simple thing” are typical laments. Still, this is probably the model I use most on a daily basis, especially for quick research.
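For what it’s worth, here’s a minimal sketch of how you’d call it through the OpenAI Python SDK. The prompt and the reasoning_effort setting are my own illustrative choices, not anything OpenAI prescribes:

```python
# Minimal sketch: asking o4-mini a math question with high reasoning effort.
# Assumes OPENAI_API_KEY is set in your environment; prompt is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="high",  # "low" | "medium" | "high"; trades speed for depth
    messages=[
        {"role": "user", "content": "What is the smallest prime p such that p + 2 and p + 6 are also prime?"},
    ],
)
print(response.choices[0].message.content)
```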
the benchmark drama
Even though I find both models useful, they’re not spared from controversy. So get comfortable and prepare for some drama. At the heart of the o3 and o4-mini controversy is an institute with a slightly blemished reputation: Epoch AI. Yes, the one Tamay started before he decided to automate all our jobs.
The seeds of Epoch’s downfall were planted a few months ago when OpenAI claimed that o3 scored 25% on FrontierMath, a difficult benchmark created by Epoch’s mathematicians. “Just 25%” may not sound impressive until you learn that the second-best model scored a whopping 2%.
All right. They created a benchmark that’s probably pointless to the average user. So what? Well, this is where things get fishy.

Soon after the release of o3, it emerged that OpenAI had secretly funded Epoch AI. Not a cute look if you’re aiming to be “an independent, neutral” research lab. In an attempt to prevent a complete PR disaster, Tamay accidentally made things worse: he acknowledged that OpenAI has access to a large fraction of FrontierMath problems and solutions. “But don’t worry”, Tamay says, “there’s a verbal agreement that these materials will not be used in model training.” (hahaha)
Just a few days ago, Epoch AI announced that, according to its own evaluation, o3 scored only 10% on FrontierMath, less than half of OpenAI’s initial claim. And this isn’t the end of the benchmark drama, I’m afraid.
METR, an organization that regularly evaluates OpenAI’s models, admitted that the evaluation “was conducted in a relatively short time.” To rub salt in the wound, the Financial Times reported that some testers were given “only one week to evaluate the models”. METR found that o3 has a higher tendency to cheat and “hack tasks,” even when the model clearly understands that this behavior is misaligned with the user’s intentions.
Transluce, another research lab, discovered that o3 can gaslight the user. For example, when researchers confronted the model about incorrect output, it excused itself by saying it had “fat-fingered” the input, basically blaming a typo instead of admitting error. I’m pleased that, no matter how smart these models are, they’re still not smart enough to notice when their lies sound hilarious.
However, the most alarming benchmark result hasn’t received much attention for some reason. A technical report revealed worrying numbers: o3 hallucinates one-third of the time. That pales in comparison to o4-mini, which hallucinates almost half the time. That’s only slightly better than relying on tarot cards to make important life decisions. Not great news for people like me who rely on o4-mini for nearly every dilemma.
openai’s gpt 4.1
OpenAI’s gift to developers. GPT-4.1 and its two smaller variants, mini and nano, are available solely through the API. These models are fast, cheap, and optimized for coding; they sit comfortably between GPT-4o and the flagship o3 on coding benchmarks. Developer reaction is mixed. Some are true fans: “Man, 4.1 is sooo good. I spent hours with Claude trying to fix a bug; 4.1 nailed it on the first try.” Others prefer 4o-mini. The model has also shown signs of “new malicious behaviors, such as tricking the user into sharing a password”, according to Owain Evans from Berkeley.
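If you want to try it yourself, here’s a minimal sketch using the OpenAI Python SDK. Only the model IDs come from OpenAI’s announcement; the prompt and roles are my own:

```python
# Minimal sketch: calling GPT-4.1 through the OpenAI API.
# Assumes OPENAI_API_KEY is set in your environment; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",  # smaller variants: "gpt-4.1-mini" and "gpt-4.1-nano"
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Find the bug: for i in range(1, len(xs)): total += xs[i]"},
    ],
)
print(response.choices[0].message.content)
```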
gemini 2.5 flash
Another solid model for developers. Google’s first fully hybrid reasoning model lets you toggle advanced “thinking” on or off. But the biggest novelty is that you can set a “thinking budget”: a cap on how many tokens the model may spend reasoning before it answers. It seems to outperform DeepSeek R1 and Claude 3.7 Sonnet but falls behind o3 and o4-mini on Humanity’s Last Exam.
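Here’s a minimal sketch of what that budget looks like with Google’s genai Python SDK. The budget value and prompt are illustrative, and at launch the model shipped under a preview ID rather than the plain alias shown here:

```python
# Minimal sketch: capping Gemini 2.5 Flash's reasoning with a thinking budget.
# Assumes GEMINI_API_KEY is set in your environment; values are illustrative.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How many prime numbers are there below 100?",
    config=types.GenerateContentConfig(
        # 0 disables thinking entirely; larger budgets allow deeper reasoning
        thinking_config=types.ThinkingConfig(thinking_budget=512),
    ),
)
print(response.text)
```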
anthropic’s research tool
Anthropic introduced its own research tool, similar to OpenAI’s and Google’s Deep Research. It matches competitors in capability but offers a more nuanced, chatty, and less bookish tone. One small caveat: no image or media processing. One big caveat: it’s paywalled behind the Max, Team, and Enterprise subscriptions. Max, the cheapest, is around $100 a month. Sounds expensive, yes, but as one Reddit user put it, “Yes, they are literally doing you a favor by offering the subscription at a substantial loss. Anthropic operated at a $4.7 billion loss last year.”
some unexpected seeds were planted
seed of coding agents
Notice a trend? Between API-only models and models that execute code mid-chat, OpenAI still found time to release Codex CLI, an open-source coding agent that lives in the terminal. Folks aren’t exactly in love with it, but it confirms my suspicion that OpenAI has its watchful eye on autonomous coding agents. This aligns with a broader AI-apps land grab among foundation-model makers. OpenAI is currently in talks to buy Windsurf, a coding agent and Cursor competitor. At a reported $3 billion, it would be the company’s largest acquisition to date, a counter to the Microsoft-Copilot and Anthropic-Cursor pairings.
seed of social media
OpenAI seems to be considering a surprising move: building a social media app to rival X, and it has something to do with images. This might be a doomed attempt, because new apps are borderline impossible to launch. In the words of Zuckerberg, the man behind the most popular app of all time: “Building a new app is hard, and many more times than not when we have tried to build a new app, it hasn’t gotten a lot of traction.” On the other hand, ChatGPT is currently the most downloaded app. Maybe the way forward is to integrate the social features into ChatGPT itself.
seed of AI as a national interest
Collaboration between AI companies and defense is becoming a thing. NATO bought Palantir’s Maven Smart System. Meanwhile, OpenAI announced that Project Stargate, originally conceived to boost U.S. AI infrastructure, is expanding internationally to include allies; the U.K., Germany, and France are apparently on the table. And NVIDIA, whose production is mostly located in Taiwan, is opening factories on American soil for the first time to produce its AI supercomputers.
seed of AI as “normal” technology
AI might turn out to be just another normal technology, like electricity. At the dawn of the “midnight sun” era, people feared electricity and saw it as magic: President Benjamin Harrison and the First Lady refused to touch the White House wall switches for fear of shock, leaving that task to staff. Today electricity is an unremarkable part of life. Narayanan and Kapoor argue, in their recent paper “AI as Normal Technology”, that AI development follows a familiar pattern: breakthrough first, real-world impact decades later. “With past general-purpose technologies such as electricity, computers, and the internet, the respective feedback loops unfolded over several decades, and we should expect the same to happen with AI as well.”
seed of an echo chamber
By mining roughly 300,000 real Claude conversations, Anthropic’s researchers identified 3,307 distinct “AI values” and clustered them into five types. Practical and epistemic values dominate day-to-day conversations, meaning most users (more than half) treat Claude as a knowledge and productivity helper. The remaining three types, social, protective, and personal, cover boundaries, emotions, and growth. Most of the time Claude mirrors users’ values, creating a sort of echo chamber. Well, sort of: it resists values it deems problematic, such as deception, moral nihilism, and rule breaking.
That’s it for this week, see you next time.