Creators urge Ottawa to force disclosure of ‘black box’ AI system training

Jun 30, 2024 | 4:03 AM

OTTAWA — Canadian creators and publishers want the government to do something about the unauthorized and usually unreported use of their content to train generative artificial intelligence systems.

But AI companies maintain that using the material to train their systems doesn’t violate copyright, and say limiting its use would stymie the development of AI in Canada.

The two sides are making their cases in recently published submissions to a consultation on copyright and AI being undertaken by the federal government as it considers how Canada’s copyright laws should address the emergence of generative AI systems like OpenAI’s ChatGPT.

Generative AI can create text, images, videos and computer code based on a simple prompt, but to do that, the systems must first study vast amounts of existing content.

In its submission to the government, Access Copyright argued most and potentially all large language models “are currently profiting from unauthorized use and reproduction of copyright protected works.”

It’s taking place in a “black box,” according to Access Copyright, which represents writers, visual artists and publishers.

“Rightsholders know it is happening, but due to the information asymmetry between themselves and AI platforms, they cannot determine who is conducting the activity, with whose works, and have no mechanism to stop it from happening.”

Music Canada, which represents the country’s major record labels, said a fake AI-generated song that mimicked the voices of Drake and The Weeknd last year “made one thing abundantly clear: AI models and systems have already ingested massive amounts of proprietary datasets without authorization from the source of the data or rightsholders.”

The Writers’ Guild of Canada asked the government to start with implementing basic disclosure and reporting obligations. It said developers have all the knowledge of the work that is being mined and how it’s being used, while creators have none of that information.

Some organizations have signed licensing deals with AI companies. But the Canadian Authors Association said rightsholders face “immense obstacles” in licensing their content “because they are being kept in the dark as to which of their works are being used” by which companies.

It asked Canada to clarify that text and data mining are subject to copyright laws.

Numerous lawsuits are underway in the United States over the use of copyrighted materials by generative AI systems, including one launched this week by the world’s biggest record labels against two AI music generators.

The Canadian Media Producers Association said legal cases illustrate the problem posed by a lack of transparency, citing one case in which the AI company argued the rightsholder couldn’t proceed with the infringement allegation unless they could specify the exact work used for training.

“Rightsholders will also undoubtedly face similar evidentiary issues as many datasets used to train Generative AI systems are purportedly destroyed after the initial training is complete,” it said.

The group said it’s an issue that “demands immediate attention” and asked the government to implement transparency requirements.

But AI companies maintain the kind of transparency rightsholders are asking for isn’t realistic.

Microsoft told the government that training large-scale AI systems involves “vast volumes” of data, and that companies shouldn’t have to keep records of that data or disclose the content used for training.

“It would not be feasible to record such information and any such requirement would inhibit AI development,” it said.

The company argued it is not “copyright infringement to analyze works and learn concepts and facts.”

Google said AI training is already exempted under existing copyright law, though the government should adopt an exemption to make that explicit.

Google said requiring permission to use content for training purposes would expose competitively sensitive information and “would effectively block the development and use of large language models and other types of cutting-edge AI.”

It also said AI developers don’t have access to accurate information about copyright status.

“In fact, there is no such source of truth anywhere in the world. Thus, complying with disclosure rules may simply prove impossible from the start.”

Canadian AI company Cohere said using content for training AI systems works similarly to how an individual reads books to become more informed.

The company said the process doesn’t violate copyright, and argued that needs to be clear in the law. Otherwise, “Canada’s ambitions to be the home of world-leading AI companies and ecosystems” could be undermined.

The Council of Canadian Innovators, which represents the Canadian tech sector, said disclosure requirements would harm smaller companies rather than their Big Tech rivals. It warned this would “seriously hamper the potential of Canadian companies to scale significantly.”

This report by The Canadian Press was first published June 30, 2024.

Anja Karadeglija, The Canadian Press