Tags: linkrot

HTTPS

Tim Berners-Lee is quite rightly worried about linkrot:

The disappearance of web material and the rotting of links is itself a major problem.

He brings up an interesting point that I hadn’t fully considered: as more and more sites migrate from HTTP to HTTPS (A Good Thing), and the W3C encourages this move, isn’t there a danger of creating even more linkrot?

…perhaps doing more damage to the web than any other change in its history.

I think that may be a bit overstated. As many others point out, almost all sites making the switch are conscientious about maintaining redirects with a 301 status code.

(There’s also a similar 308 status code that I hadn’t come across, but after a bit of investigating, that looks to be a bit of a mess.)
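
For what it’s worth, the redirect itself is about as simple as web plumbing gets. Here’s a minimal sketch in Python of the decision a server makes for each incoming request. The function name `https_redirect` and the example URLs are my own, purely for illustration; a real site would more likely do this in its server configuration.

```python
from urllib.parse import urlsplit, urlunsplit


def https_redirect(url, permanent_status=301):
    """Return (status, headers) pointing an http:// URL at its https://
    twin, or None if the URL is not plain HTTP.

    Hypothetical helper: the name and signature are illustrative only.
    """
    parts = urlsplit(url)
    if parts.scheme != "http":
        return None

    netloc = parts.netloc
    # Drop an explicit default HTTP port so it doesn't leak into the HTTPS URL.
    if netloc.endswith(":80"):
        netloc = netloc[:-3]

    # Keep path, query string and fragment exactly as they were.
    target = urlunsplit(("https", netloc, parts.path, parts.query, parts.fragment))
    return permanent_status, {"Location": target}


if __name__ == "__main__":
    print(https_redirect("http://example.com/journal?page=2"))
```

Passing 308 instead of 301 is a one-argument change. The practical difference is that 308 tells clients to repeat the request with the same method, whereas clients have historically been allowed to turn a redirected POST into a GET after a 301.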

Anyway, the discussion does bring up some interesting points. Transport Layer Security is something that’s handled between the browser and the server—does it really need to be visible in the protocol portion of the URL? Or is that visibility a positive attribute that makes it clear that the URL is “good”?

And as more sites move to HTTPS, should browsers change their default behaviour? Right now, typing “example.com” into a browser’s address bar will cause it to automatically expand to http://example.com …shouldn’t browsers look for https://example.com first?

All good food for thought.

There’s a Google Doc out there with some advice for migrating to HTTPS. Unfortunately, the trickiest part—getting and installing certificates—is currently an owl-drawing tutorial, but hopefully it will get expanded.

If you’re looking for even more reasons why enabling TLS for your site is a good idea, look no further than the latest shenanigans from ISPs in the UK (we lost the battle for net neutrality in this country some time ago).

They can’t do that to pages served over HTTPS.

Voice of the Beeb hive

Ian Hunter at the BBC has written a follow-up post to his initial announcement of the plans to axe 172 websites. The post is intended to clarify and reassure. It certainly clarifies, but it is anything but reassuring.

He clarifies that, yes, these websites will be taken offline. But, he reassures us, they will be stored …offline. Not on the web. Without URLs. Basically, they’ll be put in a hole in the ground. But it’s okay; it’s a hole in the ground operated by the BBC, so that’s alright then.

The most important question in all of this is why the sites are being removed at all. As I said, the BBC’s online mothballing policy has—up till now—been superb. Well, now we have an answer. Here it is:

But there still may come a time when people interested in the site are better served by careful offline storage.

There may be a parallel universe where that sentence makes sense, but it would have to be one in which the English language is used very differently.

As an aside, the use of language in the “explanation” is quite fascinating. The post is filled with the kind of mealy-mouthed filler words intended to appease those of us who are concerned that this is a terrible mistake. For example, the phrase “we need to explore a range of options including offline storage” can be read as “the sites are going offline; live with it.”

That’s one of the most heartbreaking aspects of all of this: the way that it is being presented as a fait accompli: these sites are going to be ripped from the fabric of the network to be tossed into a single offline point of failure and there’s nothing that we—the license-payers—can do about it.

I know that there are many people within the BBC who do not share this vision. I’ve received some emails from people who worked on some of the sites scheduled for deletion and needless to say, they’re not happy. I was contacted by an archivist at the BBC, for whom this plan was unwelcome news that he first heard about here on adactio.com. The subsequent reaction was:

It was OK to put a videotape on a shelf, but putting web pages offline isn’t OK.

I hope that those within the BBC who disagree with the planned destruction will make their voices heard. For those of us outside the BBC, it isn’t clear how we can best voice our concerns. You could make a complaint to the BBC, though that seems to be intended more for complaints about programme content.

In the meantime, you can download all or some of the 172 sites and plop them elsewhere on the web. That’s not an ideal solution—ideally, the BBC shouldn’t be practicing a deliberate policy of link rot—but it allows us to prepare for the worst.

I hope that whoever at the BBC has responsibility for this decision will listen to reason. Failing that, I hope that we can get a genuine explanation as to why this is happening, because what’s currently being offered up simply doesn’t cut it. Perhaps the truth behind this decision lies not so much with the BBC, but with their technology partner, Siemens, who have a notorious track record for shafting the BBC, charging ludicrous amounts of money to execute the most trivial of technical changes.

If this decision is being taken for political reasons, I would hope that someone at the BBC would have the honesty to say so rather than simply churning out more mealy-mouthed blog posts devoid of any genuine explanation.

Linkrotting

Yesterday’s account of the BBC’s decision to cull 172 websites caused quite a stir on Twitter.

Most people were as saddened as I was, although Emma described my post as being “anti-BBC.” For the record, I’m a big fan of the BBC—hence my disappointment at this decision. And, also for the record, I believe anyone should be allowed to voice their criticism of an organisational decision without being labelled “anti” said organisation …just as anyone should be allowed to criticise a politician without being labelled unpatriotic.

It didn’t take long for people to start discussing an archiving effort, which was heartening. I started to think about the best way to coordinate such an effort; probably a wiki. As well as listing handy archiving tools, it could serve as a place for people to claim which sites they want to adopt, and point to their mirrors once they’re up and running. Marko already has a head start. Let’s do this!

But something didn’t feel quite right.

I reached out to Jason Scott for advice on coordinating an effort like this. He has plenty of experience. He’s currently trying to figure out how to save the more than 500,000 videos that Yahoo is going to delete on March 15th. He’s more than willing to chat, but he had some choice words about the British public’s relationship to the BBC:

This is the case of a government-funded media group deleting. In other words, this is something for The People, and by The People I mean The Media and the British and the rest to go HEY BBC STOP

He’s right.

Yes, we can and should mirror the content of those 172 sites—lots of copies keep stuff safe—but fundamentally what we want is to keep the fabric of the web intact. Cool URIs don’t change.

The BBC has always been an excellent citizen of the web. Their own policy on handling outdated content explains the situation beautifully:

We don’t want to delete pages which users may have bookmarked or linked to in other ways.

Moving a site to a different domain will save the content but it won’t preserve the inbound connections; the hyperlinks that weave the tapestry of the web together.

Don’t get me wrong: I love the Internet Archive. I think that is doing fantastic work. But let’s face it; once a site only exists in the archive, it is effectively no longer a part of the living web. Yet, whenever a site is threatened with closure, we invoke the Internet Archive as a panacea.

So, yes, let’s make and host copies of the 172 sites scheduled for termination, but let’s not get distracted from the main goal here. What we are fighting against is link rot.

I don’t want the BBC to take any particular action. Quite the opposite: I want them to continue with their existing policy. It will probably take more effort for them to remove the sites than to simply let them sit there. And let’s face it, it’s not like the bandwidth costs are going to be a factor for these sites.

Instead, many believe that the BBC’s decision is politically motivated: the need to be seen to “cut” top level directories, as though cutting content equated to cutting costs. I can’t comment on that. I just know how I feel about the decision:

I don’t want them to archive it. I just want them to leave it the fuck alone.

“What do we want?” “Inaction!”

“When do we want it?” “Continuously!”

Erase and rewind

In the 1960s and ’70s, it was common practice at the BBC to reuse video tapes. Old recordings were taped over with new shows. Some Doctor Who episodes have been lost forever. Jimi Hendrix’s unruly performance on Happening for Lulu would have also been lost if a music-loving engineer hadn’t sequestered the tapes away, preventing them from being over-written.

Except - a VT engineer called Bob Pratt, who really ought to get a medal, was in the habit of saving stuff he liked. Even then, the BBC policy of wiping practically everything was notorious amongst those who’d made it. Bob had the job of changing the heads on 2” VT machines. He’d be in at 0600 before everyone else and have two hours to sort the equipment before anyone else came in. Rock music was his passion, and knowing everything would soon disappear, would spend some of that time dubbing off the thing he liked onto junk tapes, which would disappear under the VT department floor.

To be fair to the BBC, the tape-wiping policy wasn’t entirely down to crazy internal politics—there were convoluted rights issues involving the actors’ union, Equity.

Those issues have since been cleared up. I’m sure the BBC has learned from the past. I’m sure they wouldn’t think of mindlessly throwing away content, when they have such an impressive archive.

And yet, when it comes to the web, the BBC is employing a slash-and-burn policy regarding online content. 172 websites are going to disappear down the memory hole.

Just to be clear, these sites aren’t going to be archived. They are going to be deleted from the web. Server space is the new magnetic tape.

This callous attitude appears to be based entirely on the fact that these sites occupy URLs in top-level directories—repeatedly referred to incorrectly as top level domains on the BBC internet blog—a space that the decision-makers at the BBC are obsessed with.

Instead of moving the sites to, say, bbc.co.uk/archive and employing a little bit of .htaccess redirection, the BBC (and their technology partner, Siemens) would rather just delete the lot.
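
To show how little is involved, here’s a rough sketch of that redirection logic written out in Python rather than as .htaccess rules. The directory names in `ARCHIVED_DIRECTORIES` are made up for the sake of the example, and `archive_redirect` is a hypothetical helper, not anything the BBC runs.

```python
# Hypothetical list of retired top-level directories (illustrative names only).
ARCHIVED_DIRECTORIES = {"politics97", "ww2peopleswar"}


def archive_redirect(path):
    """Return (301, new_path) if the request path sits under a retired
    top-level directory, otherwise None so the request is served as usual."""
    trimmed = path.lstrip("/")
    top_level = trimmed.split("/", 1)[0]
    if top_level in ARCHIVED_DIRECTORIES:
        # Same path, one level down under /archive/, so inbound links survive.
        return 301, "/archive/" + trimmed
    return None


if __name__ == "__main__":
    print(archive_redirect("/politics97/news/index.shtml"))
```

Every old URL keeps resolving and the top-level directory is freed up, which is supposedly the point of the exercise.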

Martin Belam is suitably flabbergasted by the vandalism of the BBC’s online history:

I’m really not sure who benefits from deleting the Politics 97 site from the BBC’s servers in 2011. It seems astonishing that for all the BBC’s resources, it may well be my blog posts from 5 years ago that provide a more accurate picture of the BBC’s early internet days than the Corporation does itself - and that it will have done so by choice.

Many of the 172 sites scheduled for deletion are currently labelled with a banner across the top indicating that the site hasn’t been updated for a while. There’s a link to a help page with the following questions and answers:

It’ll be interesting to see how those answers will be updated to reflect the change in policy. Presumably, the new answers will read something along the lines of “Fuck ‘em.”

Kiss them all goodbye. And perhaps most egregious of all, you can also kiss goodbye to WW2 People’s War:

The BBC asked the public to contribute their memories of World War Two to a website between June 2003 and January 2006. This archive of 47,000 stories and 15,000 images is the result.

I’m very saddened to see the BBC join the ranks of online services that don’t give a damn for posterity. That attitude might be understandable, if not forgivable, from a corporation like Yahoo or AOL, driven by short-term profits for shareholders, as summarised by Connor O’Brien in his superb piece on link rot:

We push our lives into the internet, expecting the web to function as a permanent and ever-expanding collective memory, only to discover the web exists only as a series of present moments, every one erasing the last. If your only photo album is Facebook, ask yourself: since when did a gratis web service ever demonstrate giving a flying fuck about holding onto the past?

I was naive enough to think that the BBC was above that kind of short-sighted approach. Looks like I was wrong.

Sad face.

Linkrot

The geeks of the UK have been enjoying a prime-time television show dedicated to all things webby. Virtual Revolution is a rare thing: a television programme about the web made by someone who actually understands the web (Aleks, to be precise).

Still, the four-part series does rely on the usual television documentary trope of presenting its subject matter as a series of yin and yang possibilities. The web: blessing or curse? The web: force for democracy or tool of oppression? Rhetorical questions: a necessary evil or an evil necessity?

The third episode tackles one of the most serious of society’s concerns about our brave new online world, namely the increasing amount of information available to commercial interests and the associated fear that technology is having a negative effect on privacy. Personally, I’m with Matt when he says:

If the end of privacy comes about, it’s because we misunderstand the current changes as the end of privacy, and make the mistake of encoding this misunderstanding into technology. It’s not the end of privacy because of these new visibilities, but it may be the end of privacy because it looks like the end of privacy because of these new visibilities*.

Inevitably, whenever there’s a moral panic about the web, a truism that raises its head is the assertion that The Internets Never Forget:

On the one hand, the Internet can freeze youthful folly and a small transgressions can stick with you for life. So that picture of you drunk and passed out in a skip, or that heated argument you had on a mailing list when you were twenty can come back and haunt you.

Citation needed.

We seem to have a collective fundamental attribution error when it comes to the longevity of data on the web. While we are very quick to recall the instances when a resource remains addressable for a long enough time period to cause embarrassment or shame later on, we completely ignore all the link rot and 404s that are the fate of most data on the web.

There is an inverse relationship between the age of a resource and its longevity. You are one hundred times more likely to find an embarrassing picture of you on the web uploaded in the last year than to find an embarrassing picture of you uploaded ten years ago.

If a potential boss finds a ten-year-old picture of you drunk and passed out at a party, that’s certainly a cause for concern. But such an event would be extraordinary rather than commonplace. If that situation ever happened to me, I would probably feel outrage and indignation like anybody else, but I bet that I would also wonder, “Hmmm, where’s that picture being hosted? Sounds like a good place for off-site backups.”

The majority of data uploaded to the web will disappear. But we don’t pay attention to the disappearances. We pay attention to the minority of instances when data survives.

This isn’t anything specific to the web; this is just the way we human beings operate. It doesn’t matter if the national statistics show a decrease in crime; if someone is mugged on your street, you’ll probably be worried about increased crime. It doesn’t matter how many airplanes successfully take off and land; one airplane crash in ten thousand is enough to make us very worried about dying on a plane trip. It makes sense that we’ve taken this cognitive bias with us onto the web.

As for why resources on the web tend to disappear over time, there are two possible reasons:

  1. The resource is being hosted on a third-party site or
  2. The resource is being hosted on an independent site.

The problem with the first instance is obvious. A commercial third-party responsible for hosting someone else’s hopes and dreams will pull the plug as soon as the finances stop adding up.

I’m sure you’ve seen the famous chart of Web 2.0 logos, but have you seen Meg Pickard’s updated version, adjusted for dead companies?

You cannot rely on a third-party service for data longevity, whether it’s Geocities, Magnolia, Pownce, or anything else.

That leaves you with The Pemberton Option: host your own data.

This is where the web excels: distributed and decentralised data linked together with hypertext. You can still ping third-party sites and allow them access to your data, but crucially, you are in control of the canonical copy (Tantek is currently doing just that, microblogging on his own site and sending copies to Twitter).

Distributed HTML, addressable by URL and available through HTTP: it’s a beautiful ballet that creates the network effects that make the web such a wonderful creation. There’s just one problem and it lies with the URL portion of the equation.

Domain names aren’t bought, they are rented. Nobody owns domain names, except ICANN. While you get to decide the relative structure of URLs on your site, everything between the colon slash slash and the subsequent slash belongs to ICANN. Centralised. Not distributed.
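
To make that split concrete, here’s a small sketch that pulls a URL apart into the bit you rent and the bit you control. The function name `rented_and_owned` is my own; the parsing comes from Python’s standard `urllib.parse` module.

```python
from urllib.parse import urlsplit


def rented_and_owned(url):
    """Split a URL into (host, local part).

    The host is everything between the colon slash slash and the next
    slash: the rented part. The local part is the path and query string,
    whose structure the site owner gets to decide.
    """
    parts = urlsplit(url)
    local = parts.path or "/"
    if parts.query:
        local += "?" + parts.query
    return parts.hostname, local


if __name__ == "__main__":
    print(rented_and_owned("https://adactio.com/journal/tags/linkrot"))
```

However carefully you look after the second value, it only resolves for as long as the rent on the first one keeps getting paid.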

Cool URIs don’t change but even with the best will in the world, there’s only so much we can do when we are tenants rather than owners of our domains.

In his book Weaving The Web, Sir Tim Berners-Lee mentions that exposing URLs in the browser interface was a throwaway decision, a feature that would probably only be of interest to power users. It’s strange to imagine what the web would be like if we used IP numbers rather than domain names—more like a phone system than a postal system.

But in the age of Google, perhaps domain names aren’t quite as important as they once were. In Japanese advertising, URLs are totally out. Instead they show search boxes with recommended search terms.

I’m not saying that we should ditch domain names. But there’s something fundamentally flawed about a system that thinks about domain names in time periods as short as a year or two. It doesn’t bode well for the long-term stability of our data on the web.

On the plus side, that embarrassing picture of you passed out at a party will inevitably disappear …along with almost everything else on the web.

Tears in the rain

When I first heard that Yahoo were planning to bulldoze Geocities, I was livid. After I blogged in anger, I was taken to task for jumping the gun. Give ‘em a chance, I was told. They may yet do something to save all that history.

They did fuck all. They told Archive.org what URLs to spider and left it up to them to do the best they could with preserving internet history. Meanwhile, Jason Scott continued his crusade to save as much as he could:

This is fifteen years and decades of man-hours of work that you’re destroying, blowing away because it looks better on the bottom line.

We are losing a piece of internet history. We are losing the destinations of millions of inbound links. But most importantly we are losing people’s dreams and memories.

Geocities dies today. This is a bad day for the internet. This is a bad day for our collective culture. In my opinion, this is also a bad day for Yahoo. I, for one, will find it a lot harder to trust a company that finds this to be acceptable behaviour …despite the very cool and powerful APIs produced by the very smart and passionate developers within the same company.

I hope that my friends who work at Yahoo understand that when I pour vitriol upon their company, I am not aiming at them. Yahoo has no shortage of clever people. But clearly they are down in the trenches doing development, not in the upper echelons making the decision to butcher Geocities. It’s those people, the decision makers, that I refer to as twunts. Fuckwits. Cockbadgers. Pisstards.

The Death and Life of Geocities

They’re trying to keep it quiet but Yahoo are planning to destroy their Geocities property. All those URLs, all that content, all those memories will be lost …like tears in the rain.

Jason Scott is mobilising but he needs help:

I can’t do this alone. I’m going to be pulling data from these twitching, blood-in-mouth websites for weeks, in the background. I could use help, even if we end up being redundant. More is better. We’re in #archiveteam on EFnet. Stop by. Bring bandwidth and disks. Help me save Geocities. Not because we love it. We hate it. But if you only save the things you love, your archive is a very poor reflection indeed.

I’m seething with anger. I hope I can tap into that anger to do something productive. This situation cannot stand. It reinforces my previously-stated opinion that Yahoo is behaving like a dribbling moronic company.

You may not care about Geocities. Keep in mind that this is the same company that owns Flickr, Upcoming, Delicious and Fire Eagle. It is no longer clear to me why I should entrust my data to silos owned by a company behaving in such an irresponsible, callous, cold-hearted way.

What would Steven Pemberton do?

Update: As numerous Yahoo employees are pointing out on Twitter, no data has been destroyed yet; no links have rotted. My toys-from-pram-throwage may yet prove to be completely unfounded. Jim invokes , seeing parallels with amazonfail, so overblown is my moral outrage. Fair point. I should give Yahoo time to prove themselves worthy guardians. As a customer of Yahoo’s other services, and as someone who cares about online history, I’ll be watching to see how Yahoo deals with this situation and I hope they deal with it well (archiving data, redirecting links).

Like I said above, I hope I can turn my anger into something productive. Clearly I’m not doing a very good job of that right now.

Shrtr

In one of those instances of convergent online evolution, the subject of URL shorteners has been popping up a lot lately. You know; TinyURL, bit.ly, tr.im, and the like. I suspect a lot of this talk was prompted by the launch of the DiggBar and its accompanying short URL service that serves up your content in an iframe—time to break out that frame-busting JavaScript you haven’t needed for years.

David Weiss writes about the security implications of URL shortening services. Meanwhile, Joshua Schachter talks about the danger of link rot:

The worst problem is that shortening services add another layer of indirection to an already creaky system. A regular hyperlink implicates a browser, its DNS resolver, the publisher’s DNS server, and the publisher’s website. With a shortening service, you’re adding something that acts like a third DNS resolver, except one that is assembled out of unvetted PHP and MySQL, without the benevolent oversight of luminaries like Dan Kaminsky and St. Postel.

Dave Winer agrees:

We need to prepare for the day when N of the URL shorteners go out of business. When that happens a large part of the web will die. It will not be a good day.

Take the case of Twitter. Messages on Twitter are archived and addressable. If those messages contain links, they are shortened using TinyURL. If TinyURL were to disappear, it would leave a swamp of unresolved endpoints. Jason Kottke has a modest proposal:

In cases where shortening is necessary, Twitter should automatically use a shortener of their own. That way, users know what they’re getting and as long as Twitter is around, those links stay alive.

That would definitely work for that particular case. Of course Twitter could disappear, taking its archive of messages with it, but that’s a different situation. The loss of shortened URLs would be tightly coupled to the loss of the original messages.

But Twitter is just one example. What about the rest of us? Right now, if someone wants to pass around a shortened version of one of my URLs, they could use any one of the many URL shortening services out there. The result is potentially a score of different short URLs leading to the same endpoint. If some of those services disappear, link rot spreads.

Ideally, I should be able to specify a desired short URL for my content. This is something that Dopplr is already doing with its dplr.it domain.

Kellan says that they’re also putting together a URL shortener over at Flickr. He’s thinking about how to specify a short URL for a document: some way of specifying “here’s the short URL for this page” in the same way that we can specify “here’s the stylesheet for this page” or “here’s the RSS feed for this page”.

The rel attribute is used for stylesheets and RSS feeds so perhaps that’s the way to go. Something along the lines of rel="alternate shorter" in the same way that we can point to an alternate stylesheet with rel="alternate stylesheet". But in this case, we’re actually pointing to the same resource but with a different URL. So maybe something like rel="alternate shorter self" would be more accurate. Heck, we could probably throw the bookmark value in there too: rel="alternate shorter self bookmark".

Kevin pointed out on Twitter that rev (reverse relationship) would be more suitable than rel.

Google introduced rel="canonical" recently. It’s a way of pointing from an alternate URL back to the canonical URL of the current document: the relationship of the linked document to the current document is “canonical”.

If you’re linking from the canonical URL to an alternate URL (like, say, a shortened URL), you could use rev="canonical": the relationship of the current document to the linked document is “canonical”.

This certainly seems to be the more semantically correct way of pointing to a shortened URL. Alas, rev is a beleaguered little emo attribute: no-one understands it. At least, that’s the claim of the HTML5 community, who plan to drop it completely.

Personally, I share Paul’s intuitions:

HTML is a living language and the HTML5 WG should behave more like the OED rather than the French Government.

So if enough of us publish documents using ARIA roles, accesskey or rev attributes, they will not go gentle into that good night.

Should the idea of distributed, rather than centralised, URL shortening take off, I can imagine a situation where short URL auto-discovery is as commonplace as . So if I paste a link into a microblogging site like Twitter, or choose to “Mail this page” from my browser, then the website or mail client could check the head of the document for a preferred short URL. It’s a little bit like OpenID delegation: I could either create my own URL shortening service or specify a provider I trust.
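
Here’s a rough idea of what that check might look like on the consuming side: a sketch using Python’s built-in `html.parser` that looks for a `link` element carrying `rev="canonical"` and returns its `href`. The names `ShortLinkFinder` and `discover_short_url` are hypothetical, and this isn’t how any particular tool does it. For simplicity it scans the whole document rather than just the head, and it expects to be handed the HTML as a string.

```python
from html.parser import HTMLParser


class ShortLinkFinder(HTMLParser):
    """Remembers the href of the first <link> whose rev includes "canonical"."""

    def __init__(self):
        super().__init__()
        self.short_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.short_url is not None:
            return
        attributes = dict(attrs)
        # rev, like rel, can hold several space-separated values.
        rev_values = (attributes.get("rev") or "").lower().split()
        if "canonical" in rev_values and attributes.get("href"):
            self.short_url = attributes["href"]


def discover_short_url(html):
    """Return the publisher's preferred short URL, or None if none is declared."""
    finder = ShortLinkFinder()
    finder.feed(html)
    return finder.short_url


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<link rel="stylesheet" href="/css/global.css">'
        '<link rev="canonical" href="http://example.org/s/abc">'
        '</head><body></body></html>'
    )
    print(discover_short_url(page))
```

If nothing is declared, the function returns `None` and the client can fall back to whichever third-party shortener it would have used anyway.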

Josh is already playing around with shortened links back to posts on his blog. Now suppose he also specifies the short URL (using rev="canonical") on those blog posts…

Update: Kellan has now implemented rev="canonical" auto-discovery.

Update 2: …and Dopplr have duly implemented rev="canonical" which works a treat with Kellan’s auto-discovery tool. Here’s an example.

Update 3: This just keeps getting better. Now there’s a blog devoted to rev="canonical" which has already documented not one, but two Wordpress plugins.