Why AI model development still needs us more than we need AI
OpenAI has admitted that its newer reasoning models, o3 and o4-mini, actually hallucinate more than their predecessors, not less. More problematic still, the company doesn’t understand why. Not only does a practical ceiling seem, for now, to have been reached; newer models are actually becoming less trustworthy and more expensive.
One possible reason is that training data is becoming more and more artificial, which degrades the training process itself. The internet training pool is flooded with AI-generated action figures, iconised posts, and synthetic websites. And Cloudflare is now fanning the flames by building a cunning artificial labyrinth for AI crawlers that would make Daedalus jealous.
Cloudflare delivers security services for websites, such as guarding them against DDoS attacks and providing firewalls. Now, instead of trying to block unwanted crawlers (bots that feast on website content without permission), it uses AI to lure them into an endless rabbit hole of AI-generated pages. Trapped crawlers end up ingesting a perpetual slush of meaningless data.
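Cloudflare hasn’t published its generator, but the core trick — every URL deterministically yields a plausible-looking page whose links all lead deeper into the maze — can be sketched in a few lines. Everything below (function names, word list, link format) is invented for illustration, not Cloudflare’s actual implementation:

```python
import hashlib
import random
import re

WORDS = ["quartz", "fennel", "orbit", "lattice", "mauve", "tundra", "sepal"]

def labyrinth_page(path: str, n_links: int = 5) -> str:
    """Return a fake HTML page for `path`; the same path always yields the same page."""
    # Seed the RNG from the path so the maze looks stable to a revisiting crawler.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    filler = " ".join(rng.choice(WORDS) for _ in range(40))  # meaningless "content"
    links = [
        f"{path.rstrip('/')}/{rng.choice(WORDS)}-{rng.randrange(10**6)}"
        for _ in range(n_links)
    ]
    anchors = "".join(f'<a href="{u}">{u}</a> ' for u in links)
    return f"<html><body><p>{filler}</p>{anchors}</body></html>"

def crawl(start: str, max_pages: int = 100) -> int:
    """Breadth-first crawl of the labyrinth; the frontier never runs dry."""
    seen, frontier = set(), [start]
    while frontier and len(seen) < max_pages:
        path = frontier.pop(0)
        if path in seen:
            continue
        seen.add(path)
        frontier += re.findall(r'href="([^"]+)"', labyrinth_page(path))
    return len(seen)
```

A crawler following this graph only ever stops when it gives up: every page it fetches mints fresh links, so `crawl("/")` exhausts whatever page budget it is given without ever reaching an edge of the maze.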
With Cloudflare’s new invention, illicit crawling becomes a Trojan horse: the scraped data itself carries the payload. We don’t know whose crawlers will be ensnared, but devouring synthetic data has a detrimental effect on AI development and, in theory, could cause model collapse.
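Model collapse has a simple statistical core: when each generation of a model is trained only on the previous generation’s output, estimation error compounds and diversity drains away. A toy version, assuming nothing beyond a Gaussian fitted to its own samples over and over (all parameters here are arbitrary):

```python
import random
import statistics

def next_generation(data, rng):
    """'Train' on data (fit a Gaussian), then 'generate' a fresh dataset from the fit."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return [rng.gauss(mu, sigma) for _ in range(len(data))]

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(10)]  # generation 0: "real" data
for _ in range(1000):                            # each generation trains only on the last
    data = next_generation(data, rng)

# The fitted spread shrinks toward zero over the generations: rare, informative
# tail samples disappear first, which is the collapse mechanism in miniature.
print(statistics.stdev(data))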
Training better models requires high-quality, real-world data, not machine-generated muck. Many people fear that humanity will soon become too dependent on AI, but for now it seems AI depends on human data to survive.
Sources:
TechCrunch: “OpenAI’s new reasoning AI models hallucinate more” <link>
Cloudflare: “Trapping misbehaving bots in an AI Labyrinth” <link>
TechCrunch: “OpenAI says it’ll release o3 after all, delays GPT-5” <link>
Nature: “The AI revolution is running out of data. What can researchers do?” <link>