Financial Markets

LEAKED: META'S INTERNAL EMAILS SHOW PLANS TO USE 'PIRATED DATA' FOR AI; LAWSUIT SHEDS LIGHT ON DARKER SIDE OF AI DEVELOPMENT

In a recent legal development, internal communications exposed in a copyright lawsuit against tech giant Meta reveal the company's intent to use potentially copyrighted data to train its open-source AI models, codenamed Llama. The exposure not only casts a shadow on Meta but also hints at a larger industry issue: these AI models might have been trained on copyrighted material, a practice that may be prevalent among Meta's competitors as well.

The internal correspondence also includes mentions of Library Genesis (LibGen), a controversial platform notorious for book piracy, as a source of training data. According to some leaked emails, using LibGen was considered "essential" to achieve "State-of-the-Art numbers", implying that reliance on the platform is an open secret among AI companies looking to fuel their models with data.

The lawsuit, brought by author Richard Kadrey, comedian Sarah Silverman, and several other notable figures, challenges Meta's claim of fair use and accuses the company of infringing intellectual property law by exploiting copyrighted content for AI training without permission.

In an arguable breach of ethical boundaries, the exposed communications suggest that Meta made calculated attempts to strip copyright information from its training data, including copyright headers, document identifiers, and metadata, and even to remove a paper's listed authors.

The communications also shed light on concerns raised within the company, including discussion of the risks linked with the practice and of potential mitigation strategies, such as removing visibly pirated data.

Interestingly, Meta appears to have contemplated a range of plans to amass data for its models, from acquiring publishing giant Simon & Schuster to hiring contractors in Africa to summarize books.

Regardless of the legal and ethical questions this raises, it is evident that AI companies, spurred by a shortage of data, remain on the hunt for new sources to enhance their models. Some are even paying for unused digital content.

While portions of the lawsuit were dismissed in the past year, the disclosure of the internal correspondence could revive other aspects of the case and potentially lend renewed credence to the claimants' allegations against Meta.

Whether or not the lawsuit succeeds, this development opens a much wider discussion about the strict norms needed to regulate AI companies' use of copyrighted content and to safeguard creators' rights, a discussion that could shape future legal and ethical policy on artificial intelligence.