Memorization of training data by neural language models raises questions about data privacy and fair use

2021-01-26

Neural language models memorize some parts of their training data.

As far as we know, this is relatively rare. (Since the gap between training and held-out performance is small for these models, they can't be memorizing very much of their training data.) But it's not insignificant:

We focus on GPT-2 and find that at least 0.1% of its text generations (a very conservative estimate) contain long verbatim strings that are “copy-pasted” from a document in its training set.

(“Does GPT-2 Know Your Phone Number?”, BAIR team, 20 December 2020, and the related paper by Nicholas Carlini et al.)
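As a rough illustration of what “copy-pasted” means here, a minimal sketch of how one might flag generations that share a long verbatim token window with a training corpus is below. This is not the BAIR/Carlini pipeline; the window length, whitespace tokenization, and the `training_file_paths`/`generations` variables are all illustrative assumptions.

```python
# Sketch: flag generations containing a long verbatim n-gram from the training data.
# Whitespace tokenization and n=50 are simplifying assumptions, not the paper's method.

def ngrams(tokens, n):
    """All contiguous length-n windows of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(training_docs, n=50):
    """Collect every length-n token window seen anywhere in the training documents."""
    index = set()
    for doc in training_docs:
        index |= ngrams(doc.split(), n)
    return index

def looks_memorized(generation, index, n=50):
    """True if the generation reproduces any length-n training window verbatim."""
    return any(g in index for g in ngrams(generation.split(), n))

# Hypothetical usage:
# index = build_index(open(p).read() for p in training_file_paths)
# rate = sum(looks_memorized(g, index) for g in generations) / len(generations)
```

A real measurement would use the model's own tokenizer and a scalable index (suffix arrays or hashed shards) rather than an in-memory set, but the estimate is the same idea: the fraction of generations containing at least one long verbatim match.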

Although it’s infrequent, this behavior poses significant problems: memorized text can leak private information (violating contextual integrity and complicating the right to be forgotten), and it can reproduce copyrighted material verbatim.

BAIR suggests that curated data sets are the most reasonable way forward, since it’s not clear how to apply differential privacy to this problem and sanitizing the web would be too hard.

I don’t understand how that actually solves any of the problems they point out: someone still has to decide the questions about contextual integrity, the right to be forgotten, and copyright.

How does this happen? Why do models memorize certain texts and not others? How does this translate to other domains, for example computer vision models? I thought this problem had been somewhat known in the CV space for a long time, though I might be making that up. Certainly most companies working on CV products like self-driving cars use curated data sets, but perhaps that has more to do with business risk mitigation.

Carlini, Nicholas, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, et al. “Extracting Training Data from Large Language Models,” December 14, 2020. https://arxiv.org/abs/2012.07805.