Making an internet archive for the ages

Archiving the World Wide Web

Twenty-eight years ago this month, I got onto the World Wide Web. First among my friends – and among the first few thousand web users anywhere in the world – I explored a tiny universe of content.

In the beginning there was the list: just a page of names and links to websites, sitting on the CERN website – the first website, and the birthplace of the web.

I worked my way down that list, methodically clicking on link after link, exploring the website that loaded into my “browser”, and going on to the next. By the time I’d finished that list, another list had appeared, this one hosted at the National Center for Supercomputing Applications (NCSA). Almost as famous as CERN, NCSA gave the world the first widely used browser, Mosaic. NCSA’s list had a lot of overlap with the list at CERN, but a few new sites would pop up at the bottom of the list every day, so I spent another day or two visiting all the sites on that list that I hadn’t already visited.

In seven days, I’d finished. I’d surfed the entire web.

For a few months, I managed to keep up with new websites as they popped up on the NCSA list, priding myself on staying current with this amazing new technology. But by the end of February 1994, more sites went onto that list every day than I could find time to explore. Not long after that, the list’s maintainer threw in the towel – the web’s exponential growth meant that no archivist could ever hope to keep pace with it.

But how did web surfers find anything before Google – a tool that came along several years later?

In early 1994, two enterprising Stanford University students (no, not Larry and Sergey) set up Jerry and David’s Guide to the World Wide Web, a part-time project that rapidly grew into the first of the internet “unicorns”: Yahoo!. Taking a librarian’s approach to the too-much-good-stuff of the early web, Yahoo! asked you to choose your category, then your sub-category, and possibly even your sub-sub- and sub-sub-sub-categories, leaving you with a curated list of websites, each dedicated to your sub-sub-sub-topic of interest, that you could examine at your leisure.

It took 18 months for exponential growth to overwhelm Yahoo!’s category search; every sub-sub-sub-sub-sub category produced a list of sites too long to explore. At this point, I started keeping lists of links – “bookmarks” – like a breadcrumb trail to guide me back to my favourite sites. When that list got long enough, I curated the best of the best, gathered them into a list named Stones, Stars and Gold, and put them on a page on my own website.


Number of websites on the internet through time

  • 1991 (World Wide Web invented): 10
  • 1994 (Yahoo! launched): approximately 3,000
  • 1998 (Google launched): approximately 2.4 million
  • 2004 (Facebook launched): approximately 51.6 million
  • Today: approximately 1.7 billion

Source: Statista


Visiting that list today and working methodically from top to bottom, I find that only about one-fifth of the links still load the pages they pointed to back in 1995. Most of them go to nothing at all, or to something that has the same name but is completely different. In less than a generation, my snapshot of the early web – very personal, specific and meaningful – has nearly rotted away.
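
A check like this is easy to automate. The sketch below is a minimal link checker, assuming a plain-text file of bookmarks (the filename bookmarks.txt is hypothetical, one URL per line), that reports which links still answer at all. Note what it can’t catch: a link that loads but now points at something with the same name and completely different content looks “alive” to a script – only a human reader spots that kind of rot.

```python
# Minimal link-rot checker: reads one URL per line from a bookmarks file
# and reports which links still answer. "bookmarks.txt" is a hypothetical name.
import urllib.error
import urllib.request

def check(url: str, timeout: float = 10.0) -> str:
    """Return a rough status for a URL: alive, or dead with a reason."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-rot-check/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return f"alive (HTTP {response.status})"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error: 404 Not Found, 410 Gone...
        return f"dead (HTTP {err.code})"
    except (urllib.error.URLError, TimeoutError) as err:
        # No answer at all: DNS failure, connection refused, timeout.
        return f"dead ({err})"

with open("bookmarks.txt") as bookmarks:
    for line in bookmarks:
        url = line.strip()
        if url:
            print(f"{check(url):<24} {url}")
```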

The term “link rot” may not be new – the concept dates back to the first decade of the web – but most people won’t know that the web was designed to do its best to prevent the untimely death of links. The Uniform Resource Locator, or URL, was defined by web creator Sir Tim Berners-Lee as “immutable” – it must not change. A URL gets assigned once – a pointer to a page or a photo or a podcast – and that’s it. That URL always points to those bits. That’s the theory, at least. Unfortunately, immutable URLs immediately went into the too-hard basket. From that moment, the rot set in.

Brewster Kahle saw the problem almost immediately. In 1996, the co-inventor of WAIS (Wide Area Information Server) founded the Internet Archive and began methodically backing up the entire web. “How can you cite a document if the documents go away every 44 days?” he asked, using his backup of the web to power something named the Wayback Machine – a technology intended to quell the rot. Pop a dead URL into the Wayback Machine and it will show you all of its backups of that webpage, all the way back to its first crawl, 25 years ago.

Using the Wayback Machine on my 80%-dead list of favourite links from 1995, I find that many (likely most) of those websites can be recovered. The links themselves may be dead, but the pages and images they once pointed to persist. If I wanted to, I could recreate my page with links that leveraged the Wayback Machine, breathing life back into the list. Yet that may not be enough to prevent a more pernicious form of link rot.
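
The recovery step can be automated too. The Internet Archive publishes a simple availability endpoint (documented at archive.org/help/wayback_api.php) that, given a URL and a target date, returns the closest archived snapshot. Below is a minimal sketch of how a rotted 1995 link might be swapped for its Wayback Machine copy; the example NCSA URL is purely illustrative.

```python
# Ask the Internet Archive's availability endpoint for the snapshot of a
# dead URL closest to a target date (format YYYYMMDD).
import json
import urllib.parse
import urllib.request

def wayback_url(dead_url: str, around: str = "19950101") -> str | None:
    """Return the closest archived copy of `dead_url`, or None if none exists."""
    query = urllib.parse.urlencode({"url": dead_url, "timestamp": around})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint, timeout=10) as response:
        data = json.load(response)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# Illustrative only: recover an archived copy of an early NCSA page.
print(wayback_url("http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/"))
```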

A recent paper from a group of US-based researchers shows that even a good backup of the web might miss the point. “Where Did the Web Archive Go?” details the fate of four web archives (the Internet Archive, fortunately, not among them) that changed their own URLs over 14 months from 2017 to 2019. Though well-intentioned, those changes broke many of the URLs pointing to the content within those archives. An archive is great – certainly better than losing data. But an archive that doesn’t provide immutable URLs for its data, well, that’s peak link rot.

We all generate so much data all the time now – on smartphones and wearables and Zoom calls and so on – that archiving is no longer a luxury. Without an archive, we lose our connection to our digital past. I learned this viscerally when I went looking for online resources about the First International Conference on the World Wide Web, held at CERN in May 1994 – a conference I attended. There’s very little documentation and only a handful of photos of one of the most important events in the history of computing: the web’s “big bang” moment. Why? The answer is almost too obvious: the conference took place before the web took off. The medium we use to record, commemorate and share our experiences simply didn’t exist. It was only subsequently brought into being by the 300-plus researchers who attended the conference.

The shadow cast by that absence showed me that if we aren’t very careful, we could lose our connections to our past. The data may remain somewhere, but could be so difficult to locate that most of us would simply resign ourselves to a kind of perpetual digital amnesia. In Nineteen Eighty-Four, George Orwell wrote: “Who controls the past controls the future.” I’d suggest that those who forget the past don’t have much of a future.

All of the archives we add to daily – the photos we share on Facebook, the movies uploaded to YouTube, the diatribes posted to Twitter, and so on – mean this threat touches nearly all of us. What can we do? We can demand immutability in perpetuity. Any organisation that publishes to the web should guarantee that even when it revises its systems, existing data will remain available and accessible forever, via the same URLs. We cannot let our history rot away. It doesn’t need to happen, and it shouldn’t. Not if we want to be able to understand how we got here, and where we’re going.
