The Internet Is Failing The Website Preservation Test

Gork@beehaw.org · 1 year ago

The Internet Is Failing The Website Preservation Test

strainedl0ve@beehaw.org · 1 year ago

This is a very good point and one that is not discussed enough. Archive.org is doing amazing work but there is absolutely not enough of that and they have very limited resources.

The whole internet is extremely ephemeral, more than people realize, and it’s concerning in my opinion. Funnily enough, I actually think that federation/decentralization might be the solution. A distributed system to back up the internet that anyone can contribute storage and bandwidth to might be the only sustainable solution. I wonder if anyone has thought about it already.

entropicdrift@lemmy.sdf.org · 1 year ago

I’d argue that it can help or hurt to decentralize, depending on how it’s handled. If most sites are caching/backing up data that’s found elsewhere, that’s good for both resilience and preservation, but if the data in question is centralized by its home server, then instead of backing up one site we’re stuck backing up a thousand, not to mention the potential issues with discovery.

AH@indieweb.social · 1 year ago

@strainedl0ve There is always https://ipfs.tech

RealAccountNameHere@beehaw.org · 1 year ago

I worry about this too. I’ve always said and thought that I feel more like a citizen of the Internet than of my country, state, or town, so its history is important to me.

Gork@beehaw.org · 1 year ago

Yeah and unless someone has the exact knowledge of what hard drive to look for in a server rack somewhere, tracing an individual site’s contents that went 404 is practically impossible.

I wonder though if Cloud applications would be more robust than individual websites since they tend to be managed by larger organizations (AWS, Azure, etc).

Maybe we need a Svalbard Seed Vault extension just to house gigantic redundant RAID arrays. 😄

tymon@lemmy.one · 1 year ago

Remember a few years ago when MySpace did a faceplant during a server migration, and lost literally every single piece of music that had ever been uploaded? It was one of the single largest losses of Internet history and it’s just… not talked about at all anymore.

Rentlar@beehaw.org · 1 year ago

Well, stone tablets, writing, songs, and culture can disappear with time, either naturally (such as through erosion and weather) or through human action (such as burning books or destructive investigation of ancient artifacts/ruins).

That’s why we try to keep good records.

altz3r0@beehaw.org · edit-2 1 year ago

I think preservation is happening; the issue lies in accessibility. Projects like Archive.org are the public ones, but it is certain that private organizations are doing the same, just not making it public.

This is also my biggest worry about the Fediverse. It has tools to deal with it, but they are self-contained. No search engine is crawling the Fediverse as far as I’ve looked, and no initiative to archive, index and overall make the content of the Fediverse accessible is currently in place, and that’s a big risk. I’m sure we will soon be seeing loss of information for this reason, if it hasn’t already happened.

thejml@lemm.ee · 1 year ago

It’s important here to think about a few large issues with this data.

First, data storage. Other people in here are talking about decentralizing and creating fully redundant arrays so multiple copies are always online and can be easily migrated from one storage tech to the next. There’s a lot of work here not just in getting all the data, but making sure it continues to move forward as we develop new technologies and new storage techniques. This won’t be a cheap endeavor, but it’s one we should try to keep up with. Hard drives die, bit rot happens. Even powered off, a spinning drive will fail with time, as will an SSD. CDs I wrote 15+ years ago aren’t 100% readable.

Second, there’s data organization. How can you find what you want later when all you have are images of systems, backups of databases, static flat files of websites? A lot of sites now require JavaScript and other browser operations to be able to view/use the site. You’ll just have a flat file with a bunch of rendered HTML; can you really still find the one you want? Search boxes won’t work, API calls will fail without the real site up and running. Databases have to be restored to be queried and if they’re relational, who will know how to connect those dots?

Third, formats. Sort of like the previous, but what happens when JPG is deprecated in favor of something better? Can you currently open up that file you wrote in 1985? Will there still be a program available to decode it? We’ll have to back those up as well… along with the OSes that they run on. And if there are no processors left that they can run on, we’ll need emulators. Obviously standards are great here, we may not forget how to read a PCX or GIF or JPG file for a while, but more niche things will definitely fall by the wayside.

Fourth, timescale. Can we keep this stuff for 50 yrs? 100 yrs? 1000 yrs? What happens when our great*30-grand-children want to find this info? We regularly find things from a few thousand years ago here on Earth with archeological dig sites and such. There’s a difference between backing something up for use in a few months, and for use in a few years; what about a few hundred or thousand? Data storage will be vastly different, as will processors and displays and such. … Or what happens in a Horizon Zero Dawn scenario where all the secrets are locked up in a vault of technology left to rot that no one knows how to use because we’ve nuked ourselves into regression?
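
On the "bit rot happens" point, one common way to at least notice silent corruption is to record a checksum for every file when it is archived and re-check it later. Here is a minimal Python sketch of that idea. The manifest layout (relative path mapped to a SHA-256 hex digest) and the function names are assumptions made up for illustration, not the format of any particular archiving tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative POSIX path) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damage(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are missing or no longer match."""
    damaged = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel)
    return damaged
```

You would run `build_manifest` once at archive time, store the result somewhere separate from the data, and run `find_damage` on a schedule. This only detects damage; repairing it still needs a second good copy somewhere.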

DeGandalf@kbin.social · 1 year ago

In this respect, the internet is closer to spoken language than to any written media. Even if you use a service to archive the things you find, it’s still possible that the service shuts down, too.

CynAq@kbin.social · 1 year ago

We need deliberate efforts to archive everything efficiently.

We also need a way to decouple everyone’s personal info from publicly available information about them, keeping in mind that not all publicly available information is intended to be that way.

Storage ain’t cheap and it definitely ain’t infinite.

This is a way harder problem than “the internet” being a bit more mindful can solve easily.

Not to absolve any companies from responsibility or anything.

m00njuic3@kbin.social · 1 year ago

thankfully we do have people trying to archive things. sadly not everything will make it into that. just too much new stuff all the time to keep up with. but if we can keep the important and mostly important stuff

xray@beehaw.org · edit-2 1 year ago

Yeah it’s funny how I always got warned about how “the internet is forever” when it comes to being careful about what you post on social media, which isn’t bad advice and is kinda true, but also really kinda not true. So many things I’ve wanted to find on the internet that I experienced like 5-15 years ago are just gone without a trace.

Brecat5@kbin.social · 1 year ago

It sucks that we already have internet lost media

archive.ph