Unlocking the Future: The High Cost of AI Training Data and Big Tech's Advantage

In the race to develop the most sophisticated and capable AI, the competition is intense. At the core of the tech revolution is a little-noticed but critical cog: training data, the key to unlocking the AI machine. It is becoming increasingly obvious that access to large volumes of AI training data is a resource, and a privilege, currently enjoyed by wealthy Big Tech companies at the expense of global democratic freedoms. This article looks at how training data drives AI development, who gets privileged access to it, and what that means for AI's role in society.

The Crux of AI Evolution: Training Data

As the OpenAI researcher James Betker explained in a now-famous 2023 blog post, the key point about training AI systems isn’t the complexity of your architecture, but the size and quality of your data. This simple truth, that AI development relies as much or more on the quality, diversity, and quantity of data as it does on models, is one of the most important concepts in AI, and it is often hidden behind the complex (and largely useless) jargon of machine learning.

The Mechanics Behind AI: A Statistical Symphony

Generative AI is a statistical beast: in practice, it produces the output that is statistically most likely given the patterns and examples it saw during training. The logic is simple: all else being equal, more data tends to mean more accurate outcomes. This helps explain why Meta’s Llama 3 outperformed AI2’s OLMo even though the two models have very similar architectures: Llama 3 was trained on a much larger dataset.
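To make the "statistical" point concrete, here is a minimal sketch in Python of a toy word-pair (bigram) model. It is an illustration only, far simpler than anything used in real generative AI, and the function names and tiny corpora are invented for this example. It counts which word follows which in the training text and turns those counts into probabilities.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for one word into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the model has never seen this word followed by anything
    return {w: c / total for w, c in followers.items()}


# Hypothetical toy corpora: the larger one simply contains more examples.
small_corpus = "the cat sat on the mat"
large_corpus = small_corpus + " the cat ate the fish the cat sat on the sofa"

small_model = train_bigram(small_corpus)
large_model = train_bigram(large_corpus)

print(next_word_probabilities(small_model, "the"))
print(next_word_probabilities(large_model, "the"))
print(next_word_probabilities(small_model, "ate"))
```

Trained on the small corpus, the model has only ever seen "cat" and "mat" after "the", and it knows nothing at all about "ate". Trained on the larger corpus, it has also seen "fish" and "sofa", so its predictions cover more of the language. Real systems are vastly more complex, but the dependence on what the training data contains is the same.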

The Data Dilemma: Big Tech's Upper Hand

The huge demand for large, well-curated training datasets has tilted the competitive playing field in favour of Big Tech. The corporate giants have the money and know-how to buy, build and use enormous datasets, while for smaller companies this remains a pipe dream.

The Hidden Costs of Data Acquisition

Beyond the purely financial aspect, the drive for ideal training sets often spills over into morally dubious practices. There are many well-known examples, such as companies scraping copyrighted content from websites without the owner’s permission, or harvesting user-generated content without compensating its creators. These practices raise legal questions as well as ethical questions about user rights and data privacy.

The Exorbitant Price of Progress

As the financial and ethical stakes of acquiring AI training data intensify, courts will be left with little choice but to keep up on both fronts. The alarming trend behind this is that data commodification is entrenching barriers to entry that only the rich can afford to surmount. Such exclusivity stifles innovation and diversity, and concentrates power in the hands of a few powerful companies.

Fostering an Equitable AI Future

But hope shines through these bleak prospects. Through initiatives such as EleutherAI and Hugging Face’s FineWeb project, some of the high-quality training datasets needed to create AI models are being open-sourced, bringing everyone closer to a more equitable and inclusive AI development cycle. These projects are important steps towards democratising AI, helping to ensure that the future of technology is shaped by more people and voices than ever.

Why Google's Role Is Unmissable in the AI Training Data Dialogue

Nor can we ignore the fact that, when it comes to machine learning, Google controls an enormous share of the data that could be used for training. From Google Docs to reviews in Google Maps, the company holds vast quantities of almost anything a machine learning researcher might wish for. Its ongoing efforts to extend its terms of service so that user data can be used for AI training, together with its continued aggressive licensing activity, make it a central player in shaping the future of AI.


Ultimately, the AI training data challenge is an urgent one. Some solutions are still beyond our reach, while others are within our power to make a reality soon. There’s no easy road forward for the AI community, but shining a light on the issues and on what can be done about them, and encouraging vital work elsewhere, truly matters. For the future of technology to be as inclusive, innovative and equitable as it could possibly be, AI needs accessible, ethically sourced training data, and the community needs to engage with the people it affects to make sure that happens.

Jun 02, 2024