The geeks of the UK have been enjoying a prime-time television show dedicated to the all things webby. Virtual Revoltution is a rare thing: a television programme about the web made by someone who actually understands the web (Aleks, to be precise).
Still, the four-part series does rely on the usual television documentary trope of presenting its subject matter as a series of yin and yang possibilities. The web: blessing or curse? The web: force for democracy or tool of oppression? Rhetorical questions: a necessary evil or an evil necessity?
The third episode tackles one of the most serious of society’s concerns about our brave new online world, namely the increasing amount of information available to commercial interests and the associated fear that technology is having a negative effect on privacy. Personally, I’m with Matt when he says:
If the end of privacy comes about, it’s because we misunderstand the current changes as the end of privacy, and make the mistake of encoding this misunderstanding into technology. It’s not the end of privacy because of these new visibilities, but it may be the end of privacy because it looks like the end of privacy because of these new visibilities*.
Inevitably, whenever there’s a moral panic about the web, a truism that raises its head is the assertion that The Internets Never Forget:
On the one hand, the Internet can freeze youthful folly and a small transgressions can stick with you for life. So that picture of you drunk and passed out in a skip, or that heated argument you had on a mailing list when you were twenty can come back and haunt you.
We seem to have a collective fundamental attribution error when it comes to the longevity of data on the web. While we are very quick to recall the instances when a resource remains addressable for a long enough time period to cause embarrassment or shame later on, we completely ignore all the link rot and 404s that is the fate of most data on the web.
There is an inverse relationship between the age of a resource and its longevity. You are one hundred times more likely to find an embarrassing picture of you on the web uploaded in the last year than to find an embarrassing picture of you uploaded ten years ago.
If a potential boss finds a ten-year old picture of you drunk and passed out at a party, that’s certainly a cause for concern. But such an event would be extraordinary rather than commonplace. If that situation ever happened to me, I would probably feel outrage and indignation like anybody else, but I bet that I would also wonder
Hmmm, where’s that picture being hosted? Sounds like a good place for off-site backups.
The majority of data uploaded to the web will disappear. But we don’t pay attention to the disappearances. We pay attention to the minority of instances when data survives.
This isn’t anything specific to the web; this is just the way we human beings operate. It doesn’t matter if the national statistics show a decrease in crime; if someone is mugged on your street, you’ll probably be worried about increased crime. It doesn’t matter how many airplanes successfully take off and land; one airplane crash in ten thousand is enough to make us very worried about dying on a plane trip. It makes sense that we’ve taken this cognitive bias with us onto the web.
As for why resources on the web tend to disappear over time, there are two possible reasons:
- The resource is being hosted on a third-party site or
- The resource is being hosted on an independent site.
The problem with the first instance is obvious. A commercial third-party responsible for hosting someone else’s hopes and dreams will pull the plug as soon as the finances stop adding up.
You cannot rely on a third-party service for data longevity, whether it’s Geocities, Magnolia, Pownce, or anything else.
That leaves you with The Pemberton Option: host your own data.
This is where the web excels: distributed and decentralised data linked together with hypertext. You can still ping third-party sites and allow them access to your data, but crucially, you are in control of the canonical copy (Tantek is currently doing just that, microblogging on his own site and sending copies to Twitter).
Distributed HTML, addressable by URL and available through HTTP: it’s a beautiful ballet that creates the network effects that makes the web such a wonderful creation. There’s just one problem and it lies with the URL portion of the equation.
Domain names aren’t bought, they are rented. Nobody owns domain names, except ICANN. While you get to decide the relative structure of URLs on your site, everything between the colon slash slash and the subsequent slash belongs to ICANN. Centralised. Not distributed.
Cool URIs don’t change but even with the best will in the world, there’s only so much we can do when we are tenants rather than owners of our domains.
In his book Weaving The Web, Sir Tim Berners-Lee mentions that exposing URLs in the browser interface was a throwaway decision, a feature that would probably only be of interest to power users. It’s strange to imagine what the web would be like if we used IP numbers rather than domain names—more like a phone system than a postal system.
But in the age of Google, perhaps domain names aren’t quite as important as they once were. In Japanese advertising, URLs are totally out. Instead they show search boxes with recommended search terms.
I’m not saying that we should ditch domain names. But there’s something fundamentally flawed about a system that thinks about domain names in time periods as short as a year or two. It doesn’t bode well for the long-term stability of our data on the web.
On the plus side, that embarrassing picture of you passed out at a party will inevitably disappear …along with almost everything else on the web.