OpenAI alleges that DeepSeek AI, a Chinese AI company, improperly used one of OpenAI's large language models, likely GPT-3 or a related model, to train DeepSeek's own competing large language model, "DeepSeek Coder." OpenAI claims to have found substantial code overlap and distinctive formatting patterns suggesting that DeepSeek scraped outputs from OpenAI's model and used them as training data. Such unauthorized use would violate OpenAI's terms of service, and OpenAI is reportedly considering legal action. The incident highlights growing concern over intellectual property protection in the rapidly evolving AI field.
The Financial Times reports that OpenAI, the prominent artificial intelligence research company known for developing models like GPT-4 and DALL-E, has accused DeepSeek, a lesser-known AI startup, of misappropriating its intellectual property. Specifically, OpenAI claims to possess compelling evidence that DeepSeek leveraged OpenAI's proprietary large language models, potentially including GPT-3 or a closely related variant, to train its own competing language model. According to OpenAI, this breaches its terms of service, which explicitly prohibit using its models to develop rival products.
The alleged infraction came to light through close examination of DeepSeek's output, in which OpenAI researchers identified distinctive patterns and responses bearing a striking resemblance to the characteristic outputs of their own models. This similarity, they argue, strongly suggests that DeepSeek's model was trained on a dataset derived from OpenAI's model outputs rather than on independently curated training data. This practice, a form of unauthorized model distillation sometimes referred to as "model stealing" or "model imitation," raises significant concerns within the AI community about fair competition and intellectual property protection.
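OpenAI has not disclosed its detection methodology, but the sort of surface-level overlap described above can be quantified with simple text statistics. The following is a minimal Python sketch, purely illustrative (the function names and example strings are hypothetical, not drawn from the article), of measuring word-level n-gram overlap between two models' responses to the same prompt:

    from collections import Counter

    def ngrams(text: str, n: int = 3) -> Counter:
        """Count word-level n-grams in a text."""
        words = text.split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def ngram_overlap(a: str, b: str, n: int = 3) -> float:
        """Jaccard overlap between the n-gram multisets of two texts."""
        na, nb = ngrams(a, n), ngrams(b, n)
        if not na or not nb:
            return 0.0
        inter = sum((na & nb).values())  # shared n-grams (min counts)
        union = sum((na | nb).values())  # all n-grams (max counts)
        return inter / union

    # Hypothetical paired responses to the same prompt.
    response_a = "The function returns a sorted copy of the input list."
    response_b = "The function returns a sorted copy of the given list."
    print(f"3-gram overlap: {ngram_overlap(response_a, response_b):.2f}")

High overlap on a handful of prompts proves little; any real signal would have to come from consistently elevated overlap across a large, diverse prompt set.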
OpenAI has reportedly confronted DeepSeek with these allegations, prompting the startup to swiftly remove the allegedly infringing model from its platform. While DeepSeek has acknowledged the removal, it has not explicitly admitted any wrongdoing. Furthermore, the Financial Times notes that the precise nature and extent of the alleged misuse, including the specific OpenAI model involved and the volume of data potentially copied, remain undisclosed.
This incident underscores the increasing complexity of intellectual property protection in the rapidly evolving field of artificial intelligence, particularly with respect to large language models. The ease with which these models can be queried and their outputs replicated raises hard questions about how to safeguard the substantial research and development investments of companies like OpenAI. The outcome of this dispute could have far-reaching implications for the future development and deployment of AI technologies.
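For readers wondering why such replication is considered easy: the technique at issue, often called distillation, amounts to querying a "teacher" model and training a "student" on its responses. Below is a minimal sketch of the data-collection half, with a stubbed-out API call; the function names and JSONL format are illustrative assumptions, not a description of what DeepSeek is alleged to have done:

    import json

    def query_teacher(prompt: str) -> str:
        # Stand-in for a hosted-LLM API call; the real client, model,
        # and endpoint are assumptions, not details from the article.
        return f"[teacher model response to: {prompt}]"

    def build_distillation_set(prompts: list[str], path: str) -> None:
        """Collect (prompt, response) pairs from a teacher model and
        write them as JSONL, a common student fine-tuning format."""
        with open(path, "w", encoding="utf-8") as f:
            for prompt in prompts:
                record = {"prompt": prompt, "completion": query_teacher(prompt)}
                f.write(json.dumps(record) + "\n")

    build_distillation_set(
        ["Explain binary search.", "Write a haiku about rain."],
        "distill.jsonl",
    )

The asymmetry is what worries model providers: collecting such a dataset requires nothing beyond ordinary API access, while detecting that it happened is far harder.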
Summary of Comments (894)
https://news.ycombinator.com/item?id=42861475
Several Hacker News commenters express skepticism about OpenAI's claims against DeepSeek, questioning the strength of its evidence and suggesting the move is anti-competitive. Some argue that reproducing a model's outputs doesn't necessarily imply direct copying of the model weights, and point to the possibility of convergent evolution in training large language models. Others discuss the difficulty of proving copyright infringement in machine learning models and the broader implications for open-source development. A few commenters also raise concerns about the legal precedent this might set and the chilling effect it could have on future AI research. Several call for OpenAI to release more details about its investigation and evidence.
The Hacker News post titled "OpenAI says it has evidence DeepSeek used its model to train competitor" has generated a substantial number of comments, mostly focused on the legal and practical implications of OpenAI's claim. No one presents direct evidence to refute or support the claim itself.
Several commenters question the enforceability of OpenAI's terms of service, particularly concerning using the API's output for training another model. They highlight the difficulty of proving such usage and the potential for false positives. One commenter argues that proving the use of OpenAI's output for training would require demonstrating similar internal representations within DeepSeek's model, a complex undertaking. Another suggests that even if some output was used, it wouldn't necessarily constitute significant training data.
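On the commenter's point about internal representations: one published way to quantify representational similarity is linear Centered Kernel Alignment (CKA; Kornblith et al., 2019), which compares two models' activations on the same inputs. A self-contained NumPy sketch, using synthetic activation matrices as stand-ins:

    import numpy as np

    def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
        """Linear CKA between two activation matrices of shape
        (n_examples, n_features). Values near 1 indicate highly
        similar representations of the same inputs."""
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        num = np.linalg.norm(Y.T @ X, "fro") ** 2
        den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
        return float(num / den)

    rng = np.random.default_rng(0)
    acts_a = rng.normal(size=(2000, 64))                   # model A activations
    acts_b = acts_a + 0.1 * rng.normal(size=(2000, 64))    # near-copy: high CKA
    acts_c = rng.normal(size=(2000, 64))                   # unrelated: near zero
    print(f"near-copy CKA: {linear_cka(acts_a, acts_b):.3f}")
    print(f"unrelated CKA: {linear_cka(acts_a, acts_c):.3f}")

Even with such a measure in hand, obtaining DeepSeek's internal activations at all would require access the public generally lacks, which is part of why the commenter calls the undertaking complex.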
Some discussion revolves around the nature of copyright and its applicability to machine learning outputs. Commenters debate whether the output of a large language model can be considered a derivative work, and if so, what implications that has for copyright ownership. The concept of "fair use" is also brought up, with speculation on whether using API output for training could fall under that category.
A few commenters express skepticism about OpenAI's motives, suggesting the accusation might be a strategic move to stifle competition or maintain market dominance. One commenter speculates that this could be a preemptive strike in anticipation of future legal battles regarding copyright and AI training data.
The technical feasibility of detecting such model training is also a point of discussion. One commenter questions how OpenAI could definitively prove DeepSeek used their model, while others propose various methods, including analyzing output distributions and detecting characteristic patterns or "watermarks" within the generated text.
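On the watermarking suggestion: one published scheme (Kirchenbauer et al., 2023) biases generation toward a pseudo-random "green list" of tokens, which a detector can later test for statistically. OpenAI has not said it watermarks its outputs, so the following stdlib-only Python sketch illustrates the general idea rather than any deployed system:

    import hashlib
    import math

    def in_green_list(prev_token: str, token: str, fraction: float = 0.5) -> bool:
        """Pseudo-randomly assign `token` to a 'green list' seeded by
        the previous token, mimicking Kirchenbauer et al. (2023)."""
        h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
        return h[0] / 255.0 < fraction

    def watermark_z_score(tokens: list[str], fraction: float = 0.5) -> float:
        """z-score of the green-token count against the unwatermarked
        expectation; large positive values suggest watermarked text."""
        n = len(tokens) - 1
        hits = sum(in_green_list(a, b, fraction)
                   for a, b in zip(tokens, tokens[1:]))
        std = math.sqrt(n * fraction * (1 - fraction))
        return (hits - fraction * n) / std

    tokens = "the quick brown fox jumps over the lazy dog".split()
    print(f"z = {watermark_z_score(tokens):.2f}")  # ~0 for ordinary text

Ordinary text scores near zero; text generated under the green-list bias accumulates a large positive z-score, which is what makes the statistical test possible.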
Finally, some comments touch upon the broader ethical and legal landscape surrounding AI training data. Commenters note the complexities of determining ownership and usage rights for data used to train these models, particularly when the data originates from publicly accessible sources. They anticipate future legal challenges and the need for clearer regulations in this rapidly evolving field. The overall tone is one of cautious observation, with many commenters awaiting further details and the potential legal ramifications.