Understanding the Battle between Web Crawlers and Copyrighted Content
Dozens of major websites, including Amazon and The New York Times, have taken steps to block GPTBot, a web crawler introduced by OpenAI. The company had announced its plan to use GPTBot to gather data from the web for training its popular AI chatbot, ChatGPT. As of the latest update, around 70 of the world’s top 1,000 websites have chosen to block GPTBot. This step comes amid growing concerns over copyright and data ownership.
GPTBot’s operation is straightforward: before crawling a site, it checks the site’s “robots.txt” file, and if GPTBot appears under a “Disallow” rule, the bot refrains from crawling that site. This convention, established in the 1990s, lets site owners ask crawlers not to fetch their pages, though compliance is voluntary. OpenAI has said GPTBot will honor robots.txt and avoid websites that use it to block the bot.
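The blocking mechanism described above can be illustrated with Python’s standard-library robots.txt parser. The rules string below is a hypothetical example of the kind of entry sites have been adding, and “OtherBot” and the example URLs are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules blocking GPTBot from the whole site
rules = """User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler checks permission before fetching any URL
print(parser.can_fetch("GPTBot", "https://example.com/article"))    # False: blocked
print(parser.can_fetch("OtherBot", "https://example.com/article"))  # True: no rule applies
```

Note that nothing in this mechanism enforces the rule; it only works because well-behaved crawlers, such as GPTBot by OpenAI’s stated policy, choose to check the file and obey it.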
“GPTBot launched 14 days ago and the percentage of Top 1,000 sites blocking it has been steadily increasing,” the analysis said.
Web crawlers like GPTBot gather vast amounts of text and images from websites without seeking permission or paying for the content, which has raised concerns about copyright infringement. While site owners can use robots.txt to ask crawlers to stay away, crawlers are under no legal obligation to comply.
The rise of AI projects like ChatGPT has heightened the spotlight on copyright regulations and data ownership. Several lawsuits are already underway, with authors and creators expressing concerns about their work being used without consent. Even Stephen King, the renowned author, reacted to the use of his books in AI training sets.
OpenAI, for its part, has been opaque about any use of copyrighted material in training ChatGPT. Meanwhile, the battle between web crawlers and copyright holders persists, leading companies to take action to protect their digital assets.