I read recently that OpenAI and others have effectively trained current models on most of the information available on the web, and that we are hitting a ceiling of available data. My understanding is that AI is only as good as the amount of training data available, so if there is no significant amount of new training data, it would make sense that another AI winter is potentially coming. It seems to me that the way forward is some combination of the following:
- Synthetic data: There is discussion of using "synthetic data," which (simply put) is one AI model creating data while another judges it (see the rough sketch after this list), but this is in its early stages and I'm not convinced it is going to be effective. It sounds like Anthropic is trying to create and use this type of data.
- Real-world data: This seems to be ultimately the most valuable data, but there is no obvious way to scale it to the volumes needed for AI, especially for language models that rely on written and spoken media. This could be information measured and created by robots in the real world. I imagine this would be data like that collected by Boston Dynamics robots or (tin foil hat) audio recorded from cellphones and other devices.
- New data on the internet: Information like this post and anything else posted in the future. It does seem that internet data from here on out is at risk of being AI-generated, which might "poison" the data.
- New strategies to make better use of current data: If we are able to build better models from the data we already have, that seems like the best way forward. I'm sure there are a billion things I'm missing here.
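To make the synthetic data idea concrete, here is a minimal sketch of the generator/judge loop as I understand it. This is not any company's actual pipeline; both model calls are hypothetical placeholders standing in for real generator and judge models.

```python
# Rough sketch of a generator/judge loop for synthetic training data.
# generate_sample and judge_sample are hypothetical placeholders, not real APIs.

def generate_sample(prompt: str) -> str:
    """Placeholder for a generator model producing a candidate training example."""
    return f"Synthetic answer to: {prompt}"

def judge_sample(sample: str) -> float:
    """Placeholder for a judge model scoring quality on a 0-1 scale."""
    return 0.9 if len(sample) > 20 else 0.2

def build_synthetic_dataset(prompts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep only generated samples the judge rates at or above the threshold."""
    dataset = []
    for prompt in prompts:
        sample = generate_sample(prompt)
        if judge_sample(sample) >= threshold:
            dataset.append(sample)
    return dataset

if __name__ == "__main__":
    prompts = ["Explain photosynthesis.", "What causes tides?"]
    print(build_synthetic_dataset(prompts))
```

The whole approach hinges on the judge filtering out low-quality generations well enough that training on the kept samples actually adds signal, which is exactly the part I'm skeptical about.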
What am I missing? How do you think AI companies are going to improve their models from here on out? What are the chances that we are hitting a ceiling?