Yves here. The challenge of preserving good data existed before AI, but AI tends to exacerbate the problem. There are several layers to this conundrum. The first is the pre-internet/post-internet divide, where pre-internet print materials are treated as less valuable and may become even harder to access as efforts to preserve them weaken. The second is that materials from the internet era disappear more often. For example, even when Naked Capitalism was in its early teens, readers often told us it was an essential source for research on the financial crisis, since many of the original materials had disappeared or become difficult to access. The article below explains how AI creates new problems, both by generating a flood of “information” and by making decisions about which older materials to preserve more difficult.
Peter Hall is a graduate student in computer science at the Courant Institute of Mathematical Sciences at New York University. His research focuses on the theoretical foundations of cryptography and technology policy. Originally published at Undark.
My generation was raised to believe the internet is forever, and to be careful what we post online. But the reality is, we lose family photos shared on social media accounts that have long since become inaccessible. Streaming services restrict access to shows we loved and content we never had the opportunity to own. As web companies and tech platforms fold, journalists, animators, and developers lose the work they spent years building.
At the same time, artificial intelligence-driven tools like ChatGPT and the image generator Midjourney are growing in popularity, leading some to believe they could one day take over tasks traditionally performed by humans, like copywriting or shooting video B-roll. But whether or not they can actually perform these tasks, one thing is certain: the internet will be flooded with cheaply produced AI-generated content that could drown out human work. This looming wave is a problem for computer scientists like me, who think about data privacy, fidelity, and distribution every day. But we all need to pay attention. Without a clear preservation plan, a great deal of good data and information will be lost.
Ultimately, data preservation is a resource issue: Who is responsible for storing and maintaining the information, and who pays for these efforts? And who decides what is worth preserving? Some of the major players in AI want to catalog online data, but their interests don’t always align with those of the general public.
The costs of the electricity and server space required to store data indefinitely add up over time. Like bridges and roads, data infrastructure must be maintained, and these costs are burdensome, especially for small content publishers. However, even if we could periodically download and back up the entire internet, it would not be enough. Just as a library is useless without some organizational structure, any form of data storage needs to be carefully archived and catalogued. Compatibility is also an issue: if the world one day stops storing documents as PDFs, for example, old computers with compatible software will need to be kept on hand to read them.
But in preserving all these files and digital content, we must respect and cooperate with copyright holders. Spotify, for example, paid a record $9 billion in music licensing fees last year, and a public data archiving system would hold content worth many times that amount. Data preservation systems are useless if they are bankrupted by lawsuits. This is especially tricky when content was created by a group or has changed hands several times: even with the approval of a work’s original creator, someone who later purchased the copyright may still seek to enforce it.
Finally, care must be taken to archive only true and useful information, which is becoming increasingly difficult in the internet age. Before the internet, the cost of producing physical media such as books, newspapers, magazines, board games, DVDs, and CDs naturally limited the flow of information. Online, the barriers to publication are much lower, so vast amounts of false or useless information can be published and distributed every day. When data is as decentralized as it is on the internet, there needs to be some way to ensure that the best of it, however that is defined, gets promoted and preserved.
This has never been more important than now, with the internet plagued by AI-generated content. Generative AI models such as ChatGPT have been shown to unintentionally memorize training data (leading to lawsuits, including one by The New York Times), to hallucinate misinformation, and sometimes to respond with outright anger. Meanwhile, AI-generated content is becoming increasingly prevalent on websites and social media apps.
In my opinion, there is no need to store AI-generated content, since it can simply be reproduced. And although many of the leading AI developers are reluctant to reveal how they collect training data, these models are almost certainly trained on vast amounts of data scraped from the internet. AI companies themselves have a stake in the problem: as so-called synthetic data proliferates online, the quality of the models trained on it degrades.
While companies, developers, and ordinary people can solve some of these problems, governments are in a unique position, with the funding and legal authority to preserve the breadth of our collective knowledge. Public libraries already preserve and archive countless books, movies, albums, and other forms of physical media. The Library of Congress maintains a web archive, primarily of historically and culturally significant documents, but this alone is not enough.
The internet’s store of digital media will almost certainly dwarf the Library of Congress’s current digital holdings. Digital products built on now-outdated platforms (think of software like Adobe Flash) must be preserved, too. Just as preservationists maintain the books and other physical objects they work with, digital products need technicians to maintain the original computers and operating systems they ran on and keep them working properly. The Library of Congress’s digitization of older media formats cannot, on its own, meet the preservation demands of the vast realm of computing.
Groups like the Wikimedia Foundation and the Internet Archive do a great job of filling in the gaps, the latter especially through its thorough documentation of deprecated software and websites. However, these platforms face serious obstacles in achieving their archiving goals. Wikipedia frequently solicits donations and relies on volunteers to write and review articles, which invites a host of problems, including bias in which articles get written and how they are written. The Internet Archive likewise relies on input from its users, and the Wayback Machine, for example, sometimes limits the types of data it archives and when. The Internet Archive has also faced copyright-infringement lawsuits from rights holders that threaten both its scope and its survival.
But governments are hardly bound by the same constraints. In my view, the additional funding and resources required to expand the Library of Congress’s mission to archiving web data would be almost negligible within the U.S. budget. Governments also have the power to make exceptions to intellectual property rights in ways that benefit all parties. For example, the New York Public Library’s Theatre on Film and Tape Archive has preserved many Broadway and Off-Broadway productions for educational and research purposes, despite a strict general ban on photographing and filming such performances. Finally, governments are, in theory, stewards of the will and interests of their people, which must include our collective knowledge and facts. Since any form of archiving involves some form of choosing what to preserve (and, by extension, what not to preserve), I can think of no better option than to let an accountable public body make that decision.
Of course, just as the preservation of analog records didn’t end with physical libraries, data archiving shouldn’t end with this proposal either. But it is a good start. And as politicians continue to defund libraries (as they have in my hometown of New York City), it is more important than ever that we correct course. We must refocus our attention on updating our centers of information, our libraries, for the information age.