Archive: May 7th, 2008

Selectors

The selector of the beast does not exist.

Building the Real-time Web

I skipped a lot of the afternoon presentations at XTech to spend some time in the Dublin sunshine. I came back to attend Blaine’s presentation on The Real Time Web only to find that Blaine and Maureen didn’t make it over to Ireland because of visa technicalities. That’s a shame. But Matt is stepping into the breach. He has taken Blaine’s slides and assembled a panel with Seth and Rabble from Fire Eagle to answer the questions raised by Blaine.

Matt poses the first question …what is the real-time Web? Rabble says that HTTP lets us load data but isn’t so good at realtime two-way interaction. Seth concurs. With HTTP you have to poll “has anything changed? has anything changed? has anything changed?” As Rabble says, this doesn’t scale very well. With Jabber there is only one connection request and one response and that response is sent when something has changed.
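
To make that contrast concrete, here's a minimal sketch of the polling pattern they're criticising. The URL and callback are hypothetical and this is modern fetch-based TypeScript rather than anything shown in the talk: every pass around the loop is a full HTTP round trip even when nothing has changed, whereas an XMPP client opens one long-lived stream and simply waits for the server to push.

    // Naive HTTP polling: "has anything changed?" over and over.
    async function poll(url: string, intervalMs: number, onChange: (data: unknown) => void) {
      let etag: string | null = null;
      while (true) {
        const res = await fetch(url, etag ? { headers: { "If-None-Match": etag } } : {});
        if (res.status === 200) {            // something actually changed
          etag = res.headers.get("ETag");
          onChange(await res.json());
        }                                    // a 304 is a wasted round trip
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
      }
    }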

What’s wrong with HTTP, Comet or SMTP? Seth says that SMTP has no verifiable authentication and there’s no consistent API. Rabble says that pinging with HTTP has timeout problems. Seth says that Comet is a nice hack (or family of hacks, as Matt says) but it doesn’t scale. Bollocks! says Simon, Meebo!

Jabber has a lot of confusing documentation. What’s the state of play for the modern programmer? Rabble dives in. Jabber is just streaming XML documents and the specs tell you what to expect in that stream. Jabber addressing looks a lot like email addresses. Seth explains the federation aspect. Jabber servers authenticate with each other. The payload, like with email, is a message, explains Rabble. Apart from the basic body of the message, you can include other things like attachments. Seth points out that you can get presence information, like whether a mobile device is roaming. You can subscribe to Jabber nodes so that you receive notifications of change from that node. Matt makes the observation that at this point we’re talking about a lot more than just delivering documents.
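
To make that concrete, here's the shape of a basic message stanza as it travels over the stream. The addresses are invented, but the structure follows the XMPP core spec, and the from/to attributes really do look like email addresses:

    <message from="alice@example.com/home"
             to="bob@example.org"
             type="chat">
      <body>Are you coming to XTech?</body>
    </message>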

So we can send and receive messages from either end, says Matt. There’s a sense of a “roster”: end points that you can send and receive data from. That sounds fine for IM but what happens when you apply this to applications? Twitter and Dopplr can both be operated from a chat client. Matt says that this is a great way to structure an API.

Rabble says that everything old is new again. Twitter, the poster child of the new Web, is applying the concept of IRC channels.

Matt asks Rabble to explain how this works with Fire Eagle. Rabble says that Fire Eagle is a fairly simple app, but even a simple HTTP client will ping it a lot because it wants to get updated location data quickly. With a subscribable end point that represents a user, you get a relatively real-time update of someone’s location.

What about state? The persistence of state in IM is what allows conversations. What are the gotchas of dealing with state?

Well, says Seth, you don’t have a consistent API. Rabble says there is SOAP over XMPP …the room chuckles. The biggest gotcha, says Seth, is XMPP’s heritage as a chat server. You will have a lot of connections.

Chat clients are good interfaces for humans. Twitter goes further and sends back the human-readable message but also a machine-readable description of the message. Are there design challenges in building this kind of thing?

Rabble says the first thing is to always include a body, the human-readable message. Then you can overload that with plenty of data formats; all the usual suspects. Geo Atom in Fire Eagle, for example.
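
So a Fire Eagle-style notification might look something like this sketch. The addresses are made up, and I'm using an Atom entry with the GeoRSS extension as a stand-in for whatever format Fire Eagle actually sends; the principle is a human-readable body plus a machine-readable payload in one stanza.

    <message from="updates@example.com" to="alice@example.com">
      <body>Alice is now in Dublin, Ireland</body>
      <entry xmlns="http://www.w3.org/2005/Atom"
             xmlns:georss="http://www.georss.org/georss">
        <title>Alice is now in Dublin, Ireland</title>
        <georss:point>53.34 -6.26</georss:point>
      </entry>
    </message>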

Matt asks them to explain PubSub. It’s Publish/Subscribe, says Seth. Rather than a one-to-one interaction, you send a PubSub request to a particular node and then get back a stream of updates. In Twitter, for example, you can get a stream of the public timeline by subscribing to a node. Rabble mentions Ralph and Blaine’s Twitter/Jaiku bridge that they hacked together during one night at Social Graph Foo Camp. Seth says you can also filter the streams. Matt points out that this is what Tweetscan does now. They used to ping a lot but now they just subscribe. Rabble wonders if we can handle all of this activity. There’s just so much stuff coming back. With RSS we have tricks like “last modified” timestamps and etags but it would be so much easier if every blog had a subscribable node.
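
For reference, a PubSub subscription request is just another stanza. This sketch follows XEP-0060 as I understand it, with an invented service, node and JID:

    <iq type="set" from="alice@example.com/home" to="pubsub.example.com" id="sub1">
      <pubsub xmlns="http://jabber.org/protocol/pubsub">
        <subscribe node="public_timeline" jid="alice@example.com"/>
      </pubsub>
    </iq>

Once the subscription is accepted, updates arrive as pushed message stanzas rather than polled responses.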

We welcome to the stage a special guest, it’s Ralph. Matt introduces the story of the all-night hackathon from Social Graph Foo Camp and asks Ralph to tell us what happened there. Ralph had two chat windows open: one for the Twitterbot, one for the Jaikubot. They hacked and hacked all night. Data was flowing from the US (Twitter) to Europe (Jaiku) and in the other direction. At 7:10am in Sebastopol one chat window went “ping!” and one and a half seconds later another chat window went “pong!”

Matt asks Ralph to stay on the panel for questions. The questions come thick and fast from Dan, Dave, Simon and Gavin. The answers come even faster. I can’t keep up with liveblogging this — always a good sign. You kind of had to be there.

Browsers on the Move: The Year in Review, the Year Ahead

Michael Smith from the W3C is talking about the changing browser landscape. Just in the last year we’ve had the release of the iPhone with WebKit, the beta of IE8 and, just yesterday, Opera’s Dragonfly technology.

In the mobile browser space, the great thing about the iPhone is that it has the same WebKit engine as Safari on the desktop. Opera Mini 4 — the proxy browser — is getting a lot better too. It even supports CSS3 selectors. Mozilla, having previously expressed no interest in mobile, have started a project called Fennec. Then there’s Android, which will use WebKit as the rendering engine for its browser.

Looking at the DOM/CSS space, there have been some interesting developments. The discovery of IE’s interesting quirk with generated elements was one. The support by other browsers for lots of CSS3 selectors is quite exciting. The selectors API is gaining ground. Michael says he’s not fond of using CSS syntax for DOM traversal but he’s definitely in the minority — this is a godsend.
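
If you haven't played with the Selectors API yet, it boils down to two methods on the document (and on elements); the selectors here are just examples:

    // querySelector returns the first match, querySelectorAll returns them all.
    const firstLink = document.querySelector("ul.nav > li a");
    document.querySelectorAll('a[rel~="external"]')
      .forEach((anchor) => anchor.classList.add("offsite"));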

Now for an interlude to look at Web developer tools in browsers. Firebug really started a trend. IE8 has copied it almost verbatim. WebKit has its pretty Web Inspector and now Opera has Dragonfly. Dragonfly has a remote debugging feature, like Fiddler, which Michael is very excited by.

On the Ajax front, things are looking up in HTML5 for cross-site requests. He pimps Anne’s talk tomorrow. Then there’s Doug’s proposal for JSONRequest. Browser vendors haven’t shown too much interest in that. Meanwhile, Microsoft comes out with XDR, its own implementation that nobody is happy about. The other exciting thing in HTML5 is the offline storage stuff, which works like Google Gears.
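
The common thread in all the cross-site request work is that the server has to opt in. A hedged sketch of the idea, using placeholder origins and the response header as it later stabilised:

    // Script running on https://example.org asks another origin for data.
    async function loadCrossSite(): Promise<unknown> {
      const res = await fetch("https://api.example.com/data.json");
      // The browser only exposes the response if the server sent something like:
      //   Access-Control-Allow-Origin: https://example.org
      return res.json();
    }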

XSLT is supported very well on the client side now. But apart from Michael, who cares? Give me the selectors API any day. SVG is still strong in Mozilla and Opera.
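
For the record (and for Michael), client-side transformation is a small API. A minimal sketch using XSLTProcessor, with placeholder URLs:

    async function renderWithXslt(xmlUrl: string, xslUrl: string): Promise<void> {
      const parse = async (url: string) =>
        new DOMParser().parseFromString(await (await fetch(url)).text(), "application/xml");
      const processor = new XSLTProcessor();      // Mozilla, WebKit and Opera
      processor.importStylesheet(await parse(xslUrl));
      document.body.appendChild(processor.transformToFragment(await parse(xmlUrl), document));
    }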

ARIA is the one I’m happiest with. It’s supported across the board now.
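
A taste of what ARIA adds to otherwise meaningless markup; the role and states here are real, the widget is hypothetical:

    <div role="progressbar" aria-valuemin="0" aria-valuemax="100" aria-valuenow="40">
      40%
    </div>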

The HTML5 video element is supported in WebKit nightlies and in Mozilla. There’s an experimental Opera build and, of course, no IE support. The biggest issues seem to be around licensing and deciding on a royalty-free format for video. Sun has some ideas for that.
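
The markup side is the easy part, which is exactly why the argument is about codecs rather than syntax. A sketch, with a made-up file name:

    <video src="clip.ogg" controls>
      <!-- Fallback for browsers without video support, IE included -->
      <a href="clip.ogg">Download the video</a>
    </video>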

Ah, here comes the version targeting “innovation”. The good news, as Michael notes, is that it now defaults to the latest version. Damn straight!
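
For reference, the switch is the X-UA-Compatible header or meta element; with the revised default you only need it to freeze a site to an older engine:

    <!-- Lock this page to IE7's rendering engine -->
    <meta http-equiv="X-UA-Compatible" content="IE=7">
    <!-- Or spell out the new default: always use the latest engine -->
    <meta http-equiv="X-UA-Compatible" content="IE=edge">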

Here are some Acid 3 measurements so that we can figure out which browser has the biggest willy.

Finally, look at all the CSS innovations that Dave Hyatt is putting into WebKit (and correctly prefixing with -webkit-).
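
A few of those experimental properties in one rule, each behind the vendor prefix where it belongs:

    .card {
      -webkit-border-radius: 8px;
      -webkit-box-shadow: 0 2px 4px rgba(0, 0, 0, 0.5);
      -webkit-transform: rotate(-2deg);
      -webkit-transition: -webkit-transform 0.3s ease-in;
    }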

Looking to the year ahead, you’ll see more CSS innovations and HTML5 movement. Michael is rushing through this part because he’s running out of time. In fact, he’s out of time.

The slides are up at w3.org/2008/Talks/05-07-smith-xtech/slides.pdf.

Using socially-authored content to provide new routes through existing content archives

Rob Lee is talking about making the most of user-authored (or user-generated) content. In other words, content written by you, Time’s Person of the Year.

Wikipedia is the poster child. It’s got lots of WWILFing: What Was I Looking For? (as illustrated by XKCD). Here’s a graph entitled “Mapping the distraction that is Wikipedia”, generated from a Greasemonkey script that tracks link paths.

Rob works for Rattle Research who were commissioned by the BBC Innovation Labs to do some research into bringing WWILFing to the BBC archive.

Grab the first ten internal links from any Wikipedia article and you will get ten terms that really define that subject matter. The external links at the end of an article provide interesting departure points. How could this be harnessed for BBC news articles? Categories are a bit flat. Semantic analysis is better but it takes a lot of time and resources to generate that for something as large as the BBC archives. Yahoo’s Term Extractor API is a handy shortcut. The terms extracted by the API can be related to pages on Wikipedia.
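
A sketch of that pipeline in TypeScript. The endpoint and response shape are from memory, so treat them as placeholders; the flow is what matters: extract terms from the article, then point each term at a Wikipedia page.

    // Yahoo's Term Extraction service, as I recall it (unverified placeholder).
    const ENDPOINT = "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction";

    async function extractTerms(articleText: string): Promise<string[]> {
      const body = new URLSearchParams({ appid: "YOUR_APP_ID", context: articleText, output: "json" });
      const res = await fetch(ENDPOINT, { method: "POST", body });
      const json = await res.json();
      return json.ResultSet?.Result ?? [];   // assumed response shape
    }

    // Map each extracted term to a candidate Wikipedia page.
    const toWikipedia = (term: string) =>
      "http://en.wikipedia.org/wiki/" + encodeURIComponent(term.replace(/ /g, "_"));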

Look at this news story on organic food sales. The “see also” links point to related stories on organic food but don’t encourage WWILFing. The BBC is a bit of an ivory tower: it has lots of content that it can link to internally but it doesn’t spread out into the rest of the Web very well.

How do you decide what would be interesting terms to link off with? How do you define “interesting”? You could use Google PageRank or Technorati buzz for the external pages to decide if they are considered “interesting”. But you still need contextual relevance. That’s where del.icio.us comes in. If extracted terms match well to tags for a URL, there’s a good chance it’s relevant (and del.icio.us also provides information on how many people have bookmarked a URL).
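
In code, that relevance test reduces to an overlap score. fetchTagsFor here is a hypothetical wrapper around del.icio.us’s URL data, since the exact API call matters less than the scoring:

    // Hypothetical: the tags and bookmark count del.icio.us holds for a URL.
    declare function fetchTagsFor(url: string): Promise<{ tags: string[]; bookmarks: number }>;

    async function relevance(url: string, extractedTerms: string[]): Promise<number> {
      const { tags, bookmarks } = await fetchTagsFor(url);
      const terms = new Set(extractedTerms.map((t) => t.toLowerCase()));
      const overlap = tags.filter((tag) => terms.has(tag.toLowerCase())).length;
      // Contextual relevance (tag overlap) weighted by popularity (bookmarks).
      return overlap * Math.log(1 + bookmarks);
    }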

So that’s what they did. They called it “muddy boots” because it would create dirty footprints across the pristine content of the BBC.

The “muddy boots” links for the organic food article lead off to articles on other news sites that are genuinely interesting for this subject matter.

Here’s another story, this one from last week about the dissection of a giant squid. In this case, the journalist has provided very good metadata. The result is that there’s some overlap between the “see also” links and the “muddy boots” links.

But there are problems. An article on Apple computing brings up a “muddy boots” link to an article on apples, the fruit. Disambiguation is hard. There are also performance problems if you are relying on an external API like del.icio.us’s. Also, try to make sure you recommend outside links that are written in the same language as the originating article.

Muddy boots was just one example of using some parts of the commons (Wikipedia and del.icio.us). There are plenty of others out there like Magnolia, for example.

But back to disambiguation, the big problem. Maybe the Semantic Web can help. Sources like Freebase and DBpedia add more semantic data to Wikipedia. They also pull in data from Geonames and MusicBrainz. DBpedia extracts the disambiguation data (for example, on the term “Apple”). Compare terms from disambiguation candidates to your extracted terms and see which page has the highest correlation.
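
That comparison is simple enough to sketch. termsFor is a hypothetical lookup that returns the defining terms for a candidate page (via DBpedia, say):

    declare function termsFor(candidatePage: string): Promise<string[]>;

    async function disambiguate(candidates: string[], articleTerms: string[]): Promise<string> {
      const article = new Set(articleTerms.map((t) => t.toLowerCase()));
      let best = candidates[0];
      let bestScore = -1;
      for (const page of candidates) {
        const score = (await termsFor(page))
          .filter((t) => article.has(t.toLowerCase())).length;
        if (score > bestScore) { bestScore = score; best = page; }
      }
      return best;   // e.g. "Apple Inc." rather than the fruit
    }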

But why stop there? Why not allow routes back into our content? For example, having used DBpedia to determine that your article is about Apple, the computer company, you could add an hCard for the company to that article.
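
That hCard needn’t be anything fancy; the class names are the microformat, the details are illustrative:

    <div class="vcard">
      <a class="fn org url" href="http://www.apple.com/">Apple Inc.</a>,
      <span class="adr">
        <span class="locality">Cupertino</span>, <span class="region">California</span>
      </span>
    </div>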

If you’re worried about the accuracy of commons data, you can stop worrying. It looks like Wikipedia is more accurate than traditional encyclopedias. It has authority, a formal review process and other tools to promote accuracy. There are also third-party services that will mark revisions of Wikipedia articles as being particularly good and accurate.

There’s some great commons data out there. Use it.

Rob is done. That was a great talk and now there’s time for some questions.

Brian asks if they looked into tying in non-text content. In short, no. But that was mostly for time and cost reasons.

Another question, this one about the automation of the process. Is there still room for journalists to spend a few minutes on disambiguating stories? Yes, definitely.

Gavin asks about data as journalism. Rob says that this is particularly relevant for breaking news.

Ian’s got a question. Journalists don’t have much time to add metadata. What can be done to make it easier — is it an interface issue? Rob says we can try to automate as much as possible to keep the time required to a minimum. But yes, building things into the BBC CMS would make a big difference.

Someone questions the wisdom of pushing people out to external sources. Doesn’t the BBC want to keep people on their site? In short, no. By providing good external references, people will keep coming back to you. The BBC understand this.

David Recordon’s XTech keynote

I’m at the first non-workshop day at XTech 2008 in Dublin’s fair city. David Recordon is delivering the second of the morning keynotes. I missed most of Simon Wardley’s opening salvo in favour of having breakfast — sorry, Simon. David, unlike Simon, will not be using Comic Sans, nor will he have any foxes, the natural enemy of the duck.

Let’s take a look at how things have evolved in recent years.

Open is hip. Open can get you funded. Just look at MySQL.

Social is hip. But the sheer number of social web apps doesn’t scale. It’s frustrating repeating who you know over and over again. Facebook apps didn’t have to deal with that. The Facebook App platform was something of a game changer.

Networked devices are hip, from iPhones to Virgin America planes.

All of these things have common needs. We don’t just need portability, we need interoperability. We have some good formats for that:

  • Atom and to a lesser extent, RSS.
  • Microformats (David incorrectly states that Microsoft have added a microformat to IE8).
  • RDF and the Semantic Web.

We need a way to share abstract information …securely. The password anti-pattern, for example, is wrong, wrong, wrong. OAuth aims to solve this problem. Here’s a demo of David authorising Fire Eagle to have access to his location. He plugs Kellan’s OAuth session, which is on tomorrow.
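
For context, the OAuth dance behind that demo boils down to three exchanges; the URLs below are invented, Fire Eagle-flavoured placeholders:

    // 1. The consumer fetches a request token, signed with its consumer key:
    //      POST https://fireeagle.example.com/oauth/request_token
    // 2. The user is bounced to the provider to grant access; the password
    //    never touches the consumer (goodbye, password anti-pattern):
    //      GET  https://fireeagle.example.com/oauth/authorize?oauth_token=...
    // 3. The consumer swaps the authorised request token for an access token:
    //      POST https://fireeagle.example.com/oauth/access_token
    // Every subsequent API call is signed with the access token, which the
    // user can revoke at any time without changing a password.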

We need a way to communicate with people. Email just sucks. The IM wars were harmful. Jabber (XMPP) emerged as a leader because it tackled the interoperability problem. Even AOL are getting into it.

We need to know who someone is. Naturally, OpenID gets bigged up here. It’s been a busy year for OpenID. The number of relying parties has grown exponentially. The ability to get that done — to grow from a few people on a mailing list to having companies like Yahoo and Microsoft and AOL supporting that standard — that’s really something new that you wouldn’t have seen on the Web a few years ago.

But people don’t exist at just one place. The XFN microformat (using rel=”me”) is great for linking up multiple URLs. David demos his own website which has a bunch of rel=”me” links in his sidebar: “this isn’t just some profile, this is my profile.” He plugs my talk — nice one!
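
The markup for that is almost embarrassingly simple; URLs illustrative:

    <!-- In the sidebar of your own site -->
    <a rel="me" href="http://twitter.com/example">Me on Twitter</a>
    <a rel="me" href="http://flickr.com/photos/example">Me on Flickr</a>
    <!-- Each profile links back with rel="me", closing the loop -->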

We need to know who someone knows. The traditional way of storing relationships has been address books and social networks but this is beginning to change. David demos the Google Social Graph API, plugging in his own URL. Then he uses Tantek’s URL to show that it works for anybody. The API has problems if you leave off the trailing slash, apparently.
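
A sketch of such a lookup; the endpoint and parameters are written from memory and should be treated as assumptions rather than gospel:

    // edo=1 asks for "edges out" (who this URL links to);
    // fme=1 follows rel="me" links to merge a person's sites.
    async function lookupSocialGraph(profileUrl: string): Promise<unknown> {
      const lookup = "http://socialgraph.apis.google.com/lookup"
        + "?q=" + encodeURIComponent(profileUrl)   // keep that trailing slash!
        + "&edo=1&fme=1";
      return (await fetch(lookup)).json();
    }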

We need to know what people are doing. Twitter, Fire Eagle, Facebook and OpenSocial all deal with realtime activity in some way. They’re battling things out in this space. Standards haven’t emerged yet. Watch this space. If Google ties OpenSocial to its App Engine we could see some interesting activity emerge.

Let’s look at where things stand right now. Who’s getting things right and who’s getting things wrong?

When you create an API, use existing standards. Don’t reinvent vCard or hCard; just use what people already publish.

Good APIs enable mashups. Google is a leader here with its mapping API.

Fire Eagle is an interesting one. You will hardly ever visit the site. Fire Eagle doesn’t care who your friends are. It just deals with one task: where you are. Here’s a demo of Fireball.

It’ll be interesting to see how these things play out in the next few years where you have services that don’t really involve a website very much at all. Just look at Twitter: people bitch and moan about the features they think it should support but the API allows you to build on top of Twitter to add the features you want. Tweetscan and Quotably are good examples of this in action.

David shows a Facebook app he built (yes, you heard right, a Facebook app). But this app allows you to publish from Facebook onto other services.

Plaxo Pulse runs your OpenID URL through the Google Social Graph API to see who you already know (I must remember this for my talk tomorrow).

The DiSo project is one to watch: figuring out how to handle activity, relationships and permissions in a distributed environment.

And that’s all, folks.

The day the music died [dive into mark]

Excellent explanation of DRM by Mark Pilgrim, prompted by MSN Music's gunshot to the head.