Only eight months have passed since ChatGPT and the large language model (LLM) underpinning it took the world by storm. This article focuses on the data supply chain (the data collected and then used to train large language models) and the governance challenges it presents to policymakers. These challenges include:
- How web scraping may affect individuals and firms that hold copyrights;
- How web scraping may affect individuals and groups who are supposed to be protected under privacy and personal data protection laws;
- How web scraping has revealed the lack of protections for content creators and content providers on open-access websites; and
- How the debate over open- and closed-source LLMs reveals the lack of clear and universal rules to ensure the quality and validity of datasets.
As the US National Institute of Standards and Technology (NIST) explained, many LLMs depend on “large scale datasets, which can lead to data quality and validity concerns.” “The difficulty of finding the ‘right’ data may lead AI actors to select datasets based more on accessibility and availability than on suitability.”
The working paper is available on the website of GW's Institute for International Economic Policy.