Great Reset

By Mark Nuyens
5 min. read · 📱 Technology

What if we all agreed on a Great Reset, allowing authors to make an informed decision about content scraping?

The emergence of large language models (LLMs) like ChatGPT has undoubtedly revolutionized our world, impacting many jobs and tasks. As a web developer and writer, I cannot overstate how much these tools have allowed me to expand beyond my skill set and learn a lot in the process. I am very thankful that this technology exists and look forward to future iterations.

However, the training process of these models is worth discussing, particularly the scraping of vast amounts of data from the web without explicit author consent. What if we called for a "Great Reset," with AI companies agreeing to reset their models and start afresh, allowing authors to make informed decisions about the inclusion of their work?

While this idea may seem daunting or impractical, the current situation has also proven controversial. Companies like OpenAI collect content much as web-crawling robots do, yet ultimately sell what their models produce through APIs, blurring the line between search engines and content aggregators. AI companies may argue this is fair use, but perhaps it is not.

Meta's recent attempt to update its privacy policy demonstrates the growing tension around data scraping. The company faced backlash from privacy organizations and users for its plan to collect user data by default, highlighting the delicate nature of training LLMs. Meanwhile, The New York Times' lawsuit against OpenAI for scraping its content is something worth following.

While many media companies have started making agreements with OpenAI around content usage, it seems these companies hardly have a choice. In a way, their content has been taken hostage, and the "ransom" that is being paid only underscores the injustice around data scraping. One might argue that selling access to content (or paying ransom) is a bad idea, but what's the alternative?

As lawmakers grapple with these issues, businesses and individuals are increasingly vocal about their concerns regarding data scraping for LLM training. After all, the robots.txt file was initially intended for search engine indexing, and applying the same rules to LLMs seems ambiguous at best. Even though the "genie is out of the bottle," is there really no way to put it back in?

This brings us to the proposed "Great Reset." Companies like OpenAI, with their closed-source models, might be forced to re-scrape the internet, but this time, with sufficient time for authors and websites to update their robots.txt files, ensuring informed consent and control over their content. Or perhaps we should have some way of tracking whether content is scraped in the first place.
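To make the robots.txt idea a bit more concrete, here is a minimal sketch in Python of how an author might check what their own rules would tell a given crawler. It uses the standard library's `urllib.robotparser`. The robots.txt contents, the `example.com` URL, and the `can_fetch` helper are assumptions made up for illustration; `GPTBot` is the user agent OpenAI documents for its crawler, but whether any given crawler actually honors these rules is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block one AI crawler,
# allow everyone else. Not taken from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules would permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/great-reset"
    print("GPTBot allowed:", can_fetch("GPTBot", page))
    print("Googlebot allowed:", can_fetch("Googlebot", page))
```

Note that this only expresses the author's wishes. Nothing in robots.txt enforces them, which is exactly why tracking whether content was scraped would still matter.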

But what about open-source models like Meta's Llama? Open-source models present a unique challenge, as their decentralized nature makes a coordinated reset difficult. However, if Meta started collecting content from scratch, its older models would automatically become outdated at some point. The open-source nature of this process should also make it possible to detect what content is used.

The idea of a "Great Reset" may seem like a utopian ideal, but it may provide a unique opportunity to address the ethical concerns surrounding LLM training while establishing a better relationship between AI and content creators. By prioritizing author consent and transparency, we allow LLMs to evolve as powerful tools while respecting the rights of the authors on whose work they have been trained.