The format of The Long Now
In 01992, Tim Berners-Lee wrote a document called HTML Tags.
In September 02001, I started keeping this online journal. Back then, I was storing my data in XML, using a format of my own invention. The XML was converted using PHP into (X)HTML, RSS, and potentially anything else …although the “anything else” part never really materialised.
In February 02006, I switched over to using a MySQL database to store my data as chunks of markup.
To me, being able to completely migrate my data — with minimal bit-rot — from system to system is the key in the never-ending and easily-lost fight to keep my data accessible over the entirety of my life.
He’s using non-binary, well-documented standards to store and structure his data: Atom, HTML and microformats.
Meanwhile, the HTML5 spec began defining error-handling for HTML documents. Ian Hickson wrote:
The original reason I got involved in this work is that I realised that the human race has written literally billions of electronic documents, but without ever actually saying how they should be processed.
I decided that for the sake of our future generations we should document exactly how to process today’s documents, so that when they look back, they can still reimplement HTML browsers and get our data back, even if they no longer have access to Microsoft Internet Explorer’s source code.
In August 2008, Ian Hickson mentioned in an interview that the timeline for HTML5 involves having two complete implementations by 02022. Many web developers were disgusted that such a seemingly far-off date was even being mentioned. My reaction was the opposite. I began to pay attention to HTML5.
HTML is starting to look like a relatively safe bet for data longevity and portability. I’m not sure the same can be said for any particular flavour of database. Sooner rather than later, I should remove the unnecessary layer of abstraction that I’m using to store my data.
Long-term data preservation is like long-term backup: a series of short-term formats, punctuated by a series of migrations. But migrating between data formats is not like copying raw data from one medium to another.
Fidelity is not a binary thing. Data can gradually degrade with each conversion until you’re left with crap. People think this only affects the analog world, like copying cassette tapes for several generations. But I think digital preservation is actually much harder, in part because people don’t even realize that it has the same issues.
He’s also betting on HTML:
HTML is not an output format. HTML is The Format. Not The Format Of Forever, but damn if it isn’t The Format Of The Now.
I don’t think that any format could ever be The Format Of The Long Now but HTML is the closest we’ve come thus far in the history of computing to having a somewhat stable, human- and machine-readable data format with a decent chance of real longevity.