It’s infuriatingly hard to understand how closed models train on their input
June 4, 2023 · 1 min

Simon Willison digs into the question of whether the big closed LLMs are training their models based on users’ input. As Willison says:

I’ve been wanting to write something reassuring about this issue for a while now. The problem is… I can’t do it. I don’t have the information I need to credibly declare these concerns unfounded, and the more I look into this the murkier it seems to get … The fundamental issue here is one of transparency. The builders of the big closed models—GPT-3, GPT-4, Google’s PaLM and PaLM 2, Anthropic’s Claude—refuse to tell us what’s in their training data.

Ambiguous language from these big players is the norm; an unambiguous statement like OpenAI’s (which unfortunately applies only to paid API users, and only from March 1st onward) is a rarity:

OpenAI does not use data submitted by customers via our API to train OpenAI models or improve OpenAI’s service offering.

This ambiguity around closed models’ training data makes open, self-hostable models, which offer full transparency into and total control over their training data, increasingly attractive to developers and companies building LLM-based products.

Definitely worth a read.