The Library of Congress, with its collection of 180 million works, has drawn strong interest from AI startups looking to train their large language models on public domain content. The library houses a vast array of books, manuscripts, maps, and audio recordings, and AI companies are eager to access its digital archives. Traffic to the library’s API, which allows programmers to download data in a machine-readable format, has increased significantly since it became available in September 2022, and it now receives about a million visits every month.
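To make "machine-readable format" concrete, here is a minimal Python sketch of how a programmer might request search results as JSON instead of scraping HTML pages. It assumes the public loc.gov search endpoint and its `fo=json`, `c` (items per page), and `sp` (page number) query parameters. The helper names, the example query, and the two-second pause are illustrative choices made here, not documented requirements, so check the library's current API documentation before relying on any of them.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint: the loc.gov search page, which returns JSON when fo=json is set.
BASE_URL = "https://www.loc.gov/search/"


def build_search_url(query, page=1, per_page=25):
    """Build a JSON search URL (hypothetical helper, not part of any official client)."""
    params = {"q": query, "fo": "json", "c": per_page, "sp": page}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_page(url, pause_seconds=2.0):
    """Fetch one page of results, pausing first so requests stay slow and polite.

    The pause length is an assumption, not a published rate limit.
    """
    time.sleep(pause_seconds)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def extract_titles(payload):
    """Pull item titles out of a response, assuming a top-level 'results' list."""
    return [item.get("title", "") for item in payload.get("results", [])]


if __name__ == "__main__":
    url = build_search_url("civil war maps", per_page=5)
    print(url)
    # Uncomment to make a live request:
    # print(extract_titles(fetch_page(url)))
```

Building the URL separately from fetching it keeps the network call optional, so the request can be inspected before anything is sent to the library's servers.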
The appeal of the Library of Congress’s data lies in its rarity, diversity, and lack of copyright restrictions. With collections spanning more than 400 languages and a wide range of disciplines, the library offers a treasure trove of information for AI developers. While other organizations are increasingly restricting access to their data, the Library of Congress has made its data freely available to anyone who wants it. That makes it a valuable resource for AI companies that have exhausted other sources of training data.
However, accessing the library’s data comes with caveats. While the data is freely available via the API, users are prohibited from scraping content directly from the site, a common practice among AI companies. Scraping has become a hurdle for the library because it slows public access to its archives. Companies like OpenAI, Amazon, and Microsoft are also looking to the library as a potential customer, as AI models can assist librarians and subject matter specialists with tasks such as navigating catalogs and summarizing documents. There are challenges to overcome here as well, such as bias towards contemporary data and inaccuracies in historical documents.
Alongside their potential benefits, AI tools also carry risks. The Library of Congress has experienced issues with AI models hallucinating and propagating inaccurate information based on the works in the library. For example, in tests conducted by the Congressional Research Service, an AI model incorrectly listed the District of Columbia as a U.S. state and claimed that students from Taiwan and Hong Kong would be impacted by a bill. Despite these challenges, the library is committed to making more of its unrestricted data available to the public in the coming years.
Overall, the Library of Congress’s vast digital archives present a valuable resource for AI companies looking to train their models on high-quality, public domain content. As the world’s largest library continues to digitize its special collections and make more data available, it will likely play a crucial role in the development of AI technologies. While there are challenges to overcome, the potential benefits of using the library’s data for AI research and development are immense.