AI training has moved through three big shifts. Each one changed what actually mattered for making progress.
The data scraping days
First was the data scraping era. You wanted as much text as possible from everywhere. Wikipedia, forums, news sites, random blogs. The bigger and messier, the better. Quality mattered, but volume was king.
This stuff still powers everything we use today. Those massive, chaotic datasets taught AI systems how people actually write and talk.
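To make that concrete, here's the era's recipe in toy form. None of this is any lab's actual pipeline, and the function names are mine; the point is just the shape of it: scrape, barely clean, keep almost everything.

```python
# Toy sketch of the scraping-era recipe: grab raw text, do light cleanup,
# keep nearly everything. Not a real pipeline, just the flavor of one.
import re

def clean_page(raw_html: str) -> str:
    """Strip tags and collapse whitespace; real pipelines do far more."""
    text = re.sub(r"<[^>]+>", " ", raw_html)  # crude tag removal
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(pages: list[str], min_words: int = 20) -> list[str]:
    """Keep any page clearing a bare-minimum length. Volume over quality."""
    corpus = []
    for page in pages:
        text = clean_page(page)
        if len(text.split()) >= min_words:  # the flimsiest possible quality bar
            corpus.append(text)
    return corpus
```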
Then came conversations
The second phase was different. Now you needed humans writing example dialogues, what's usually called supervised fine-tuning. People got paid to sit there and craft ideal responses to questions. Kind of like creating an ideal customer service rep, except for everything.
This added something data scraping couldn't give you: nuance. Context. The kind of back-and-forth that makes conversations actually work.
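The data from this phase is easy to picture: role-tagged conversation turns. Here's a minimal example in the common chat-messages shape (exact field names vary between datasets; this one is illustrative):

```python
# One human-crafted training example: a question and the "ideal" response,
# stored as role-tagged turns. Field names mirror the common chat-messages
# format but vary between datasets.
example = {
    "messages": [
        {"role": "user", "content": "My package never arrived. What can I do?"},
        {"role": "assistant", "content": (
            "I'm sorry about that. First, check the tracking number for a "
            "delivery scan. If it shows delivered but you don't have the "
            "package, contact the carrier. If it's stuck in transit, the "
            "seller can open a claim or send a replacement."
        )},
    ]
}
```

Multiply that by enormous numbers of examples across every topic and you have this era's training data.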
Now it's about environments
Both of these still matter. But now we're in the environment phase. Instead of just reading text or studying example conversations, models can actually do things: take actions, see what happens, try again.
It's like the difference between reading about riding a bike and actually getting on one.
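The interaction loop itself is simple to sketch. Everything named here (env, agent) is a placeholder rather than a real library; the shape is what matters:

```python
# The basic act-observe-retry loop of the environment phase. env and agent
# are hypothetical stand-ins, not a real API. The model acts, the world
# responds, and it gets to try again.
def run_episode(env, agent, max_steps: int = 50) -> float:
    observation = env.reset()  # e.g. a task description plus initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)        # model picks a command or tool call
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:                               # task solved or definitively failed
            break
    return total_reward
```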
This whole thing reminds me of the old reinforcement learning benchmarks, simple games like Atari titles and gridworlds where researchers could compare approaches. Except now the environments are way more complex and designed specifically for language models.
Why this scales
The cool part is how this scales. Once you build the basic framework, everyone can contribute scenarios from their field. Doctors add medical cases, lawyers add legal puzzles, engineers add technical problems.
You end up with this rich collection of real-world challenges instead of artificial benchmarks.
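One way to picture how contributions stay cheap: a scenario is just a task plus a programmatic check of the outcome, so a domain expert never has to touch the training code. A hypothetical schema (the field names are mine, not from any real benchmark suite):

```python
# Hypothetical schema for a contributed scenario: a domain, a task, and a
# check for success. Illustrative only; real suites are richer than this.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    domain: str                   # "medicine", "law", "engineering", ...
    task: str                     # what the model is asked to do
    check: Callable[[str], bool]  # did the model's final answer succeed?

triage = Scenario(
    domain="medicine",
    task=("A patient presents with chest pain radiating to the left arm. "
          "What is the first diagnostic step?"),
    check=lambda answer: "ecg" in answer.lower() or "ekg" in answer.lower(),
)
```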
But here's my problem with it
I think reinforcement learning is the wrong tool for this. Reward functions are weird and artificial. Humans don't learn complex thinking through gold stars and punishment.
We use something more sophisticated that we haven't figured out how to replicate yet.
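To see what I mean by weird and artificial, look at what a reward function actually is: a hand-written rule that collapses a rich outcome into one number. A deliberately crude example, where every constant is an arbitrary design choice:

```python
# A deliberately crude reward function, to show how much judgment gets
# squeezed into a single scalar. Every constant is an arbitrary choice.
def reward(answer: str, solved: bool, num_steps: int) -> float:
    score = 1.0 if solved else -0.1  # gold star, or mild punishment
    score -= 0.01 * num_steps        # penalize slowness... by exactly this much?
    if len(answer) > 2000:           # discourage rambling, arbitrarily
        score -= 0.2
    return score
```

Tuning those magic numbers is a big part of making RL work in practice, and nothing about it resembles how a person learns to think.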
What might work better
There are hints of what that might look like. Maybe learning happens more in the prompt space than in model weights. Maybe we need something closer to how our brains actually work during sleep and reflection.
The pattern recognition that happens when you're not actively trying to solve something. That unconscious processing that leads to those "aha" moments in the shower.
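Here's one speculative version of prompt-space learning: distill each failed attempt into a note and feed the notes back as context, leaving the weights untouched. Everything named here (agent, reflect) is a placeholder, and this is a sketch of an idea, not a proven method:

```python
# Speculative sketch of prompt-space learning: instead of updating weights,
# turn each attempt into a lesson and prepend the lessons next time.
# agent and its methods are hypothetical placeholders.
def learn_in_prompt(agent, task: str, attempts: int = 3) -> str:
    lessons: list[str] = []
    answer = ""
    for _ in range(attempts):
        context = "\n".join(lessons)                 # accumulated "experience"
        answer = agent.solve(task, context=context)  # the weights never change
        if agent.is_correct(task, answer):
            break
        lessons.append(agent.reflect(task, answer))  # e.g. "I forgot to check X"
    return answer
```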
We're getting closer to AI that can actually learn from experience. But I suspect we're still missing key pieces about how real learning works.