Using socially-authored content to provide new routes through existing content archives

Rob Lee is talking about making the most of user-authored (or user-generated) content. In other words, content written by you, Time’s person of the year.

Wikipedia is the poster child. It’s got lots of WWILFing: What Was I Looking For? (as illustrated by XKCD). Here’s a graph entitled Mapping the distraction that is Wikipedia, generated from a Greasemonkey script that tracks link paths.

Rob works for Rattle Research, who were commissioned by the BBC Innovation Labs to do some research into bringing WWILFing to the BBC archive.

Grab the first ten internal links from any Wikipedia article and you will get ten terms that really define that subject matter. The external links at the end of an article provide interesting departure points. How could this be harnessed for BBC news articles? Categories are a bit flat. Semantic analysis is better but it takes a lot of time and resources to generate that for something as large as the BBC archives. Yahoo’s Term Extractor API is a handy shortcut. The terms extracted by the API can be related to pages on Wikipedia.
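The "first ten internal links" idea is easy to try out. Here's a minimal sketch in Python, assuming you already have the article's HTML as a string and that internal article links look like /wiki/Title (an assumption about the markup, not something Rob showed). The names first_internal_terms and InternalLinkCollector are made up for illustration.

```python
from html.parser import HTMLParser


class InternalLinkCollector(HTMLParser):
    """Collects the targets of internal wiki links in document order."""

    def __init__(self, limit=10):
        super().__init__()
        self.limit = limit
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag != "a" or len(self.terms) >= self.limit:
            return
        href = dict(attrs).get("href") or ""
        # Assumption: article links start with /wiki/ and namespaced
        # links (File:, Category:, Help:) contain a colon.
        if href.startswith("/wiki/") and ":" not in href:
            term = href[len("/wiki/"):].replace("_", " ")
            if term not in self.terms:
                self.terms.append(term)


def first_internal_terms(html, limit=10):
    """Return up to `limit` distinct internal link titles from the HTML."""
    collector = InternalLinkCollector(limit)
    collector.feed(html)
    return collector.terms
```

Feeding it the body of an article should give back a short list of terms to use as departure points. External links and namespaced links are skipped.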

Look at this news story on organic food sales. The “see also” links point to related stories on organic food but don’t encourage WWILFing. The BBC is a bit of an ivory tower: it has lots of content that it can link to internally but it doesn’t spread out into the rest of the Web very well.

How do you decide what would be interesting terms to link off with? How do you define “interesting”? You could use Google PageRank or Technorati buzz for the external pages to decide if they are considered “interesting”. But you still need contextual relevance. That’s where del.icio.us comes in. If extracted terms match well to the tags for a URL, there’s a good chance it’s relevant (and del.icio.us also tells you how many people have bookmarked that URL).
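Matching extracted terms against the tags for a URL could be scored something like this. It's a rough sketch, not the Muddy Boots implementation: the Jaccard measure, the 0.2 threshold and the function names are all my assumptions, and the tags are passed in as plain lists rather than fetched from any API.

```python
def tag_overlap(extracted_terms, url_tags):
    """Jaccard overlap between an article's terms and a URL's tags."""
    terms = {t.strip().lower() for t in extracted_terms if t.strip()}
    tags = {t.strip().lower() for t in url_tags if t.strip()}
    if not terms or not tags:
        return 0.0
    return len(terms & tags) / len(terms | tags)


def rank_candidates(extracted_terms, candidates, min_score=0.2):
    """Return candidate URLs at or above min_score, best match first.

    `candidates` maps each URL to the list of tags people gave it.
    The threshold is an arbitrary illustrative value.
    """
    scored = [(tag_overlap(extracted_terms, tags), url)
              for url, tags in candidates.items()]
    kept = [(score, url) for score, url in scored if score >= min_score]
    return [url for score, url in sorted(kept, key=lambda p: (-p[0], p[1]))]
```

A URL whose tags share nothing with the article's terms scores zero and drops out, which is the contextual relevance check described above.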

So that’s what they did. They called it “muddy boots” because it would create dirty footprints across the pristine content of the BBC.

The “muddy boots” links for the organic food article link off to articles on other news sites that are genuinely interesting for this subject matter.

Here’s another story, this one from last week about the dissection of a giant squid. In this case, the journalist has provided very good metadata. The result is that there’s some overlap between the “see also” links and the “muddy boots” links.

But there are problems. An article on Apple computing brings up a “muddy boots” link to an article on apples, the fruit. Disambiguation is hard. There are also performance problems if you are relying on an external API like del.icio.us’s. Also, try to make sure you recommend outside links that are written in the same language as the originating article.

Muddy boots was just one example of using some parts of the commons (Wikipedia and del.icio.us). There are plenty of others out there, like Magnolia, for example.

But back to disambiguation, the big problem. Maybe the Semantic Web can help. Sources like Freebase and DBpedia add more semantic data to Wikipedia. They also pull in data from Geonames and MusicBrainz. DBpedia extracts the disambiguation data (for example, on the term “Apple”). Compare terms from disambiguation candidates to your extracted terms and see which page has the highest correlation.
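That last step, comparing each disambiguation candidate's terms with your extracted terms, might look like the sketch below. I've used cosine similarity over term counts as the "correlation"; that choice, the function names and the sample data are my assumptions. The candidate terms are handed in directly rather than pulled from DBpedia.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two Counters of term frequencies."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(article_terms, candidates):
    """Pick the candidate page whose terms best match the article's.

    `candidates` maps a page title to the terms extracted from that page.
    Returns (title, score), or (None, 0.0) when nothing overlaps.
    """
    article = Counter(t.lower() for t in article_terms)
    scores = {title: cosine(article, Counter(t.lower() for t in terms))
              for title, terms in candidates.items()}
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)
```

With an article mentioning "iPhone", "Mac" and "computer", a candidate page about the company should beat the one about the fruit, because the fruit page shares none of those terms.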

But why stop there? Why not allow routes back into our content? For example, having used DBpedia to determine that your article is about Apple, the computer company, you could add an hCard for the Apple company to that article.
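For the hCard idea, a minimal organisation hCard is just a bit of HTML with the right class names (vcard, fn, org, url). Here's a small sketch that generates one in Python. The function name is made up, and where the name and URL come from (DBpedia or anywhere else) is left as an assumption.

```python
from html import escape


def organization_hcard(name, url):
    """Return a minimal hCard for an organisation as an HTML string.

    Putting "fn" and "org" on the same element marks the card as
    describing an organisation rather than a person.
    """
    return '<div class="vcard"><a class="fn org url" href="{}">{}</a></div>'.format(
        escape(url, quote=True), escape(name)
    )
```

Both values are escaped, so a name like "AT&T" comes out as valid markup.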

If you’re worried about the accuracy of commons data, you can stop. It looks like Wikipedia is more accurate than traditional encyclopedias. It has authority, a formal review process and other tools to promote accuracy. There are also third-party services that will mark revisions of Wikipedia articles as being particularly good and accurate.

There’s some great commons data out there. Use it.

Rob is done. That was a great talk and now there’s time for some questions.

Brian asks if they looked into tying in non-text content. In short, no. But that was mostly for time and cost reasons.

Another question, this one about the automation of the process. Is there still room for journalists to spend a few minutes on disambiguating stories? Yes, definitely.

Gavin asks about data as journalism. Rob says that this is particularly relevant for breaking news.

Ian’s got a question. Journalists don’t have much time to add metadata. What can be done to make it easier? Is it an interface issue? Rob says we can try to automate as much as possible to keep the time required to a minimum. But yes, building things into the BBC CMS would make a big difference.

Someone questions the wisdom of pushing people out to external sources. Doesn’t the BBC want to keep people on their site? In short, no. By providing good external references, people will keep coming back to you. The BBC understand this.
