CBC News analysis finds thousands of Canadian authors, books in controversial dataset used to train AI

December 7, 2023

A CBC News investigation has found at least 2,500 copyrighted books written by more than 1,200 Canadian and Québécois authors were shared online as part of a massive — and now defunct — dataset used for artificial intelligence training and research purposes.

The dataset’s existence and general highlights were revealed earlier this year in The Atlantic. The report prompted an avalanche of writers to express shock on social media that their work had been included without their permission, and to share concerns that AI tools could use information from the dataset to generate content in their distinct artistic voice.

A CBC News analysis of the dataset, called Books3, identified thousands of Canadian and Québécois authors and books in both official languages.

Although that content represents less than two per cent of the 190,000-plus files in Books3, it reads like a who’s who of the country’s literary community: three quarters of CBC’s Canada Reads contenders and Scotiabank Giller Prize nominees are featured, along with over a third of all Governor General’s Literary Award finalists.

WATCH | Their books were found in a dataset used to train AI:

Authors shocked to find their books used to train AI without permission

Some of Canada’s most famous authors were shocked to find that their books have been used without their permission to train artificial intelligence software. The Writers’ Union of Canada says it is considering a lawsuit, but one law professor says it’s not clear if using the books to train AI is illegal.

Topping the list of Canadian authors with the most books in the dataset is Margaret Atwood, of The Handmaid’s Tale fame, followed by best-selling children’s and young adult writer Gordon Korman and Nobel Prize winner Alice Munro.

“I’ve been writing kids books for more than three quarters of my life,” said Korman, whose career began when his Grade 7 creative writing assignment was bought by Scholastic and turned into his first book, This Can’t Be Happening at Macdonald Hall.

Korman told CBC News he had read about the dataset and knew some of his books were in it.

“They’re not really stealing your stuff,” he said. “It’s not quite like people are using excerpts or characters from your books or storylines.”

Canadian author Gordon Korman says he wants to know how 28 of his young adult books found their way into the Books3 dataset, a huge trove of data used by artificial intelligence companies to train their large language models. (gordonkorman.com)

Massive datasets like Books3 are used to train artificial intelligence models to interpret human language, that is, to read and write like us. Perhaps the most well-known AI language tool is OpenAI’s ChatGPT, which made headlines this year for being able to write university-level essays for students.

But what concerns Korman most is how 28 of his copyrighted books were sucked into Books3 in the first place.

“When I hear about any kind of threat to the way [my writing] works as a business model, just for me to be able to pay my bills and support my family, obviously I have to be very, very concerned.”

Canadian author ‘flattered and concerned’

Stories by Quebec literary giants Michel Tremblay, Marie-Claire Blais and Leonard Cohen also make an appearance in Books3, as do works by Life of Pi author Yann Martel, murder mystery writer Louise Penny and dark horror overlord Patrick Sénécal.

“It’s a combination of being flattered and being concerned,” said writer Drew Hayden Taylor, who has nine books in Books3, including his best-selling novel Motorcycles and Sweetgrass, which was shortlisted for a Governor General’s Literary Award in 2010.

Hayden Taylor, an award-winning playwright, author, columnist, filmmaker and lecturer from Curve Lake First Nation in Ontario, even wrote a short story featuring an artificial intelligence entity in 2016.

Like Korman, Hayden Taylor is concerned about copyright violations of his work.

Author Drew Hayden Taylor from Curve Lake First Nation, just north of Peterborough, Ont., has nine books in the now-defunct Books3 dataset. (David Hall/CBC)

“In the last 35 years that I’ve been a writer, almost all of my income has been derived from royalties. It’s literally taking the milk out of my cereal bowl. It’s very, very, very worrying.”

Hayden Taylor says he wishes the creators of Books3 had asked for permission to include his books.

“I would have considered it,” he said, noting he’d want to know much more about AI training and how it works before committing. “It would have been just more respectful.”

‘Unbelievably disrespectful’

CBC News also found that one out of every six members of The Writers’ Union of Canada (WUC), a national organization of over 2,600 professionally published writers, has at least one book in the dataset.

“It’s huge. It’s incredibly impactful on the cultural economy in a negative way. And, as importantly, it’s unbelievably disrespectful,” said John Degen, WUC’s executive director, after going over CBC’s findings.

Degen says he’s not surprised that a majority of literary award nominees were included, as that recognition typically opens up international markets like the United States — and opportunities such as translation rights and foreign publication rights.

John Degen, the executive director of The Writers’ Union of Canada, says the Books3 dataset is a violation of Canadian copyright law and believes it should be addressed legally and by Parliament. (Alexis Raymon/CBC)

“No one asked for permission. No one explained the project,” he said. “To me, that’s inexcusable and needs to be addressed legally and by Parliament.”

According to Degen, because the Books3 dataset accessed entire works of art without prior approval by the artists, it violates Canada’s copyright law, which protects the work of artists during their lifetime and for 70 years following their deaths.

“Copyright can be very abstract and hard to understand, but I don’t think that taking a pirated book from a pirate site and using it for your own industrial purposes, I don’t think that it’s hard to understand that that’s wrong,” said Degen.

He says WUC is in a “deep research phase” and is looking at all possible legal remedies, including launching a lawsuit.

LISTEN | Best-selling authors on what AI means for human creativity:

The Current (24:02): Could AI put authors out of business?

Hundreds of writers have learned that their books have been used to train artificial intelligence to spit out imitations. Bestselling authors Sean Michaels and Linwood Barclay discuss what AI might mean for human creativity and artist compensation.

Legality of dataset unclear, copyright expert says

Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, says it’s debatable if the existence or use of Books3 is illegal under Canada’s copyright law.

“It’s not clear that the inclusion of works in a dataset used to train a generative AI model does constitute copyright infringement,” said Craig. “Even if it’s done without the consent of the rights holder, it’s not clear that it implicates copyright at all.”

Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology, co-authored a 2022 submission calling for the Canadian government to broaden copyright law to allow for AI research and analysis, including text and data mining. (Alexis Raymon/CBC)

In 2022, she and other legal experts co-authored a submission to the Canadian government calling for a broadening of copyright law to allow for artificial intelligence research and analysis, including text and data mining, “without the threat of potential copyright liability.”

She says what is essential to understand is that massive datasets like Books3 are mainly used by AI as data points to understand patterns in language. That’s not the same as authorship, Craig says.

“It’s simply unrealistic to imagine that permissions are going to be sought from every individual author whose work appears there.”

Multiple U.S. lawsuits

The legality of AI-training datasets is being debated in U.S. courts, as Books3 is mentioned in multiple lawsuits launched by the U.S. Authors Guild and individual writers like John Grisham, George R.R. Martin, Jodi Picoult and Sarah Silverman.

Anti-piracy group Rights Alliance sent a takedown notice to the websites hosting the dataset, and it was removed in August 2023.

LISTEN | These authors say OpenAI stole their books to train ChatGPT:

As It Happens (5:59): Authors launch lawsuit accusing OpenAI of pirating their books to train ChatGPT

The Authors Guild, a U.S. trade group for writers, filed the proposed class-action on Tuesday on behalf of 17 plaintiffs. One of them, novelist Douglas Preston, spoke to As It Happens host Nil Köksal.

The lawsuits mention tech companies OpenAI (the company behind ChatGPT), Meta, Microsoft and Bloomberg, alleging they breached U.S. copyright law by training their large language models on books without the permission of authors.

Some plaintiffs believe their books were used to train ChatGPT because the chatbot generated very accurate summaries of their works.

On Nov. 20, a judge in California initially dismissed five of the six allegations in a lawsuit concerning LLaMA, a large language model developed by Facebook’s parent company, Meta. The ruling states that, based on the current allegations, the model does not constitute “a recasting or adaptation of any of the plaintiffs’ books.”

“Copyright doesn’t protect an author’s style,” said Craig, the law professor. “It doesn’t protect their ideas, the way that they write. It protects their literary text.”

One lawsuit specifically targets EleutherAI, the non-profit artificial intelligence research lab which created and launched Books3 in October 2020.

In a series of social media posts published at the time, Shawn Presser, the independent developer who compiled Books3, described it as a “reliable, direct download” of about 200,000 e-books he found online and reformatted to put “OpenAI-grade training data at your fingertips.”

The day the first story about Books3 was published in The Atlantic, Presser tweeted: “I would gladly go to prison … for advancing science and giving you the power to replicate ChatGPT.”

WATCH | What these Montreal writers have to say about AI:

Why Montreal writers want AI to stop stealing their work

Local writers, such as Heather O’Neill, Trevor Ferguson and Rosemary Sullivan, say they’re interested in participating in legal action against artificial intelligence companies for using their writing to train bots to mimic their writing styles.

Ottawa may review copyright law

Canada’s government is also pondering whether copyright law should be changed with respect to the challenges posed by AI.

This October, it launched its second consultation in less than two years on “the implications of generative artificial intelligence for copyright.”

“I think they’re in catch-up mode and it’s sort of a desperate moment at this point,” said Degen, of WUC.

Craig says the long-term implications of permanently changing Canadian law to adapt to a constantly changing technology should be weighed carefully.

“We have to be very conscious of the way in which copyright law has shaped the Internet that we now have — and think about how and in what way we want it to shape the future of artificial intelligence.”

When CBC News showed author Drew Hayden Taylor how ChatGPT could be prompted to generate a short story in his voice, it gave him pause. He noted the prose contained specific Indigenous words and cultural references and the work sounded “eerily like it could be me.”

He joked that AI should be renamed Artificially Indigenous.

“All of my work comes from my experiences as an individual, as an artist, as a First Nations man, as a human,” he said. “It was a long journey to get to where I am now. And this … in a weird sort of way, invalidates that journey.”

METHODOLOGY: How CBC News identified Canadian and Québécois authors in Books3

To identify authors, CBC News used Python programming with regular expressions (RegEx) to extract the ISBN codes contained in over 180,000 Books3 files (92%). All ISBNs were looked up in the ISBNdb worldwide database to retrieve their title, author(s), publisher, language and other details. When an ISBN could not be retrieved, CBC News extracted the author(s) and title from the .epub.txt file. A total of 8,820 files could not be identified (4.5%). Upon inspection, 1,284 files were completely empty and 205 files were duplicates; all were excluded from this analysis.
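
As an illustration of the ISBN extraction step, here is a minimal Python sketch. It is not CBC's actual code: the pattern ISBN_RE, the helper names and the check-digit validation are all assumptions made for this example, and the real analysis may have used a different expression.

```python
import re

# Hypothetical pattern (an assumption, not CBC's expression): the label "ISBN",
# an optional "-10"/"-13" suffix and colon, then digits, hyphens or spaces,
# ending in a digit or the ISBN-10 check character X.
ISBN_RE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9\- ]{8,15}[0-9Xx])")


def is_valid_isbn10(s):
    """Check the ISBN-10 check digit (weights 10..1, sum divisible by 11)."""
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """Check the ISBN-13 check digit (weights 1,3,1,3..., sum divisible by 10)."""
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def extract_isbns(text):
    """Return the unique, checksum-valid ISBNs found in text, in order of appearance."""
    found = []
    for match in ISBN_RE.finditer(text):
        # Drop hyphens and spaces so "978-0-306-40615-7" becomes "9780306406157".
        candidate = re.sub(r"[\s-]", "", match.group(1)).upper()
        if is_valid_isbn10(candidate) or is_valid_isbn13(candidate):
            if candidate not in found:
                found.append(candidate)
    return found


sample = "Copyright page. ISBN 978-0-306-40615-7 (ebook)\nISBN-10: 0-306-40615-2"
print(extract_isbns(sample))
```

Validating the check digit is one way to discard digit runs that merely look like ISBNs before sending them to a lookup service such as ISBNdb.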

To identify Canadian and Québécois authors, CBC News compared the full names of writers in Books3 against a list of 7,800 Canadian and Québécois writers, including: the online member directories of the Writers’ Union of Canada (WUC) and the Union des Écrivaines et des Écrivains Québécois (UNEQ), all past contenders/winners of the Canada Reads competition (2002-2022), all longlisted/shortlisted books for the Scotiabank Giller Prize (1994-2023), Trillium Book Award (1994-2023) and Governor General’s Literary Awards (1936-2023) — in French and English — and all writers who benefited from the Writers’ Trust of Canada (1976-2023) programs or awards. Additionally, CBC News compared Books3 titles against a list of 195,000 documents published in Quebec since 2010. Every author match was verified against book titles and the author’s country of citizenship, place of birth and biography to ensure that writers with the same name living in another country weren’t included.
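
The name comparison can be sketched in Python along the following lines. This is an assumed approach for illustration only: the functions normalize_name and candidate_matches are hypothetical, and the choice to ignore case, accents and extra spaces is this sketch's, not something the methodology specifies. It only produces candidate matches, which would still need the verification against titles, citizenship and biography described above.

```python
import unicodedata


def normalize_name(name):
    """Lower-case a name, strip accents and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Remove combining marks so that "Sénécal" and "Senecal" compare equal.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def candidate_matches(books3_authors, reference_authors):
    """Return Books3 author names whose normalized form appears in the reference list."""
    reference = {normalize_name(n) for n in reference_authors}
    return sorted({a for a in books3_authors if normalize_name(a) in reference})


# Illustrative inputs only; the real lists held thousands of names.
books3 = ["Margaret Atwood", "Patrick Senecal", "John Grisham"]
reference = ["Patrick Sénécal", "Margaret Atwood", "Gordon Korman"]
print(candidate_matches(books3, reference))
```

Matching on names alone will also catch unrelated writers who share a name with a Canadian author, which is why a manual verification pass remains necessary.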

Data collection: Valérie Ouellet and Shaki Sutharsan (Oct.-Nov. 2023)
Data analysis and verification: Valérie Ouellet and Sylvène Gilchrist (Oct.-Nov. 2023)