Harvard Creates the Largest Open Database for Artificial Intelligence Training
Harvard University has announced the creation of the largest open-access database containing around one million books. This project is part of the Institutional Data Initiative (IDI), supported by OpenAI and Microsoft. The main goal is to provide equal access to high-quality data for training artificial intelligence (AI) models.
The IDI will first focus on refining a million open-access books that have been scanned by the Harvard Library. In collaboration with the Boston Public Library, Harvard will also make millions of pages from old newspapers available as data. While these collections consist primarily of long-form texts, the IDI aims to partner with other organizations to include other types of data, such as scientific and biomedical information.
According to project lead Greg Leppert, the initiative seeks to create an open data ecosystem for AI with an impact similar to that of the Linux operating system. The database will benefit not only researchers but also smaller companies that previously lacked access to such resources.
Microsoft also supports the project. According to Vice President Burton Davis, the company considers it an important step toward building an inclusive AI ecosystem. Microsoft has been working for several years to reduce inequality in access to data, since such access is a key factor in technological advancement.
OpenAI has also expressed support for the project, viewing it as a safe alternative to using copyrighted data. Amid ongoing legal disputes over the use of proprietary data for AI training, Harvard’s IDI demonstrates how legal risks can be minimized.
Initiatives such as Harvard’s IDI and the French Common Corpus project, supported by France’s Ministry of Culture, suggest that high-quality AI models can be trained without violating copyright laws. However, as Ed Newton-Rex, a former Stability AI executive, emphasizes, it is important that open data not merely supplement protected data in training datasets but eventually replace it.