Google hungers for all that content produced by the wealth of digital publishers creating text, video, and images on a daily basis. To deal with the sticky copyright issues at the heart of AI training, Google is proposing that all those companies who don’t want their content gobbled up will need to “opt-out” to ensure Google’s open maw doesn’t swallow all their juicy data.
The tech giant offered this raw deal to the Australian government in response to the country’s recent proposal to ban “high-risk” AI applications, including those used to create deepfakes, disinformation, and discrimination. As first reported by The Guardian, Google argued that publishers should be able to say no to having their content copied for the purpose of training AI.
Google released its Bard chatbot in the land down under back in May, and since then, the company has been trying to entice the country into allowing it to scrape ever more data. Google has already written to the Australian government asking it to relax copyright laws to allow more AI training. Now it’s being open about establishing an AI-friendly internet that allows scraping by default. The proposal would force publishers both big and small to educate themselves about the opt-out and implement it on their own sites, rather than putting the onus on Google.
The company did not explicitly say how this opt-out function would work, and Google did not immediately respond to Gizmodo’s request for comment. In a July blog post, Google called for new “standards and protocols” about how web publishers participate in the internet. The company pointed to the 30-year-old, community-developed robots.txt standard, a protocol that indicates to web crawlers and bots which portions of a site they’re allowed to visit.
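To see how robots.txt-style opt-outs work in practice, here’s a minimal sketch using Python’s standard-library robots.txt parser. The crawler name “AI-TrainingBot” and the example URLs are hypothetical, since Google hasn’t specified what an AI opt-out token would actually look like:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks an imagined AI-training crawler
# ("AI-TrainingBot" is made up for illustration) while allowing all
# other bots to visit the site normally.
robots_txt = """\
User-agent: AI-TrainingBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The AI-training crawler is told to stay out entirely...
print(parser.can_fetch("AI-TrainingBot", "https://example.com/articles/"))  # False

# ...while any other crawler is still welcome.
print(parser.can_fetch("SomeOtherCrawler", "https://example.com/articles/"))  # True
```

The catch, as the protocol’s history shows, is that nothing in this file is enforced: the parser only tells a bot what the site *asks* for, and compliance is entirely voluntary.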
Of course, that robots.txt protocol only works with nice bots that agree to comply voluntarily. It doesn’t impede any company that decides not to obey the standard. Plus, it doesn’t take back any data that was already scraped without publishers’ consent. Google has multiple large language models, including its recently announced PaLM 2. Google’s Bard chatbot was originally based on the LaMDA LLM, and researchers have noted that 50% of its training data comes from public forums, with a good chunk of the rest scraped from Wikipedia and other websites.
ChatGPT creator OpenAI has been hit with similar lawsuits over its alleged abuse of copyright. Essentially, these companies have already scraped up massive amounts of the internet to train their models. Much of that data comes from Wikipedia entries and Reddit posts, but these models also make use of articles, books, and other online text. Just consider that the GPT-4 language model is reportedly trained on 45 terabytes of data, so there’s a bounty of published material locked inside. OpenAI has its own designs on industry-friendly regulation, and it has called for a whole new federal agency meant to oversee the tech. Google, on the other hand, has lobbied against that proposal.
Google’s opt-out idea wouldn’t be localized to just Australia, of course. The company has been trying to court the largest news organizations like The New York Times and The Washington Post with new AI tools, all while implying it’s A-OK to scrape up all those published articles for use in training its AI.