Original URL: http://www.theregister.co.uk/2007/05/15/apachecon_semantic_web_rant/
Column As the latest ApacheCon conference in Amsterdam fades into memory, I take a moment to ponder the themes discussed.
As ever, there was a combination of tutorials, talks, social events, general discussion, and of course the Hackathon (which I was, alas, unable to attend). And, as always, the topics encompassed both techie areas - aspects of Apache projects - and Apache culture. The ASF is concerned more with building thriving communities than with code. The code follows naturally when the community is working well.
It is only to be expected that some themes du jour should crop up, so the appearance of buzzwords such as "Web 2.0" and "Second Life" comes as no surprise (fortunately there's no risk of them becoming the dominant theme).
But an older theme also put in an appearance. For the first time, the Semantic Web (http://www.w3.org/2001/sw/) (semweb) got more than a passing mention, with a reference or two in Steven Pemberton's keynote, a talk by Stefano Mazzocchi (http://www.betaversion.org/~stefano/), a BoF session, and a couple of references in technical presentations of Apache projects.
Stefano gave us the memorable quote "there are no semantics in the semantic web", which comes as something of a welcome contrast to the utterings of its more starry-eyed evangelists.
Is it true? Well, the first premise of the semantic web is that it's machine-readable. Machines don't do semantics. They do do logical reasoning (including, in the context of the semweb, OWL (http://www.w3.org/TR/owl-features/) - Web Ontology Language).
The semweb tells us machines everywhere can agree on the meaning of words, and gives us ontologies. Nice idea: for instance, FOAF (http://rdfweb.org/) (Friend Of A Friend) is a fine geek toy, DublinCore (http://dublincore.org/) (DC) is a pretty generic metadata initiative, and there's a bunch of specialist vocabularies for experts in [subject XYZ].
But even in DC, the semantics remain problematic: for example, if we use dc:date in a report, do we mean the date of the subject, or the date we observed it?
Scratch the surface, and the ambiguities look rather like natural language. Of course, the semweb is designed to deal with that: you derive your new date terms from dc:date, then publish your vocabulary. But scale that, and you're the Eskimo trying to explain the differences between your 60 different types of snow to the rest of the world.
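To make the dc:date point concrete, here is a minimal Python sketch of deriving narrower date terms and what a generic consumer then sees. The "ex:" terms and the report identifier are invented for illustration, and the one-step inference is a toy stand-in for sub-property reasoning, not any particular RDF library's behaviour.

```python
# Hypothetical vocabulary: both "ex:" terms are invented here and declared
# as refinements of dc:date.
SUBPROPERTY_OF = {
    "ex:dateOfSubject": "dc:date",
    "ex:dateObserved": "dc:date",
}

def with_inferred(triples):
    """Return the triples plus one parent-property triple per refined term."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        parent = SUBPROPERTY_OF.get(predicate)
        if parent is not None:
            inferred.add((subject, parent, obj))
    return inferred

report = {
    ("ex:report42", "ex:dateOfSubject", "2007-05-02"),
    ("ex:report42", "ex:dateObserved", "2007-05-15"),
}

closure = with_inferred(report)

# A consumer that only understands dc:date sees two dates and no way
# to tell which is which.
dc_dates = sorted(o for s, p, o in closure if p == "dc:date")
print(dc_dates)
```

The refined terms keep the distinction only for software that has read and understood the published vocabulary; everyone else falls back to the ambiguous parent term.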
Don't get me wrong. RDF works nicely for making information available in machine-readable form on the web, and has some good real-life applications: for example, the people.apache.org site is driven by FOAF data. RDF as a data model is a direct alternative to SQL: both serve to store structured data and enable query/search functions. The familiar XML serialisation is to RDF as CSV is to SQL: both are well-supported and machine-readable - good things in a site that wants to share data. Then again, providing a web service or parsing the same data from HTML isn't exactly rocket science either.
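As a rough illustration of RDF as a data model, the Python sketch below stores statements as plain (subject, predicate, object) tuples and queries them by pattern. The "ex:" identifiers and the match() helper are made up for the example; foaf:name and foaf:knows are genuine FOAF terms. A real store would offer a proper query language such as SPARQL.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# Two hypothetical sites each publish a few statements.
site_a = {
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
}
site_b = {
    ("ex:bob", "foaf:name", "Bob"),
}

# Combining sources is a set union: no schema migration, no join tables.
merged = site_a | site_b

# "What are the names of the people Alice knows?"
friends = [o for _, _, o in match(merged, subject="ex:alice", predicate="foaf:knows")]
names = [o for f in friends for _, _, o in match(merged, subject=f, predicate="foaf:name")]
print(names)
```

The union only works because both sites happen to use the same identifier for Bob, which is precisely the agreement the semweb has to assume.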
But can the semweb ever scale beyond toys and niches, the way the original web has?
Let's consider something peripherally related to the semweb that has scaled: the feed. It fulfils an important role between the website and the mailing list, being much better-suited to "push" than the former, and easier to manage as a recipient than the latter. The feed and the aggregator are built (somewhat) on semweb principles, managing information at a more granular level than the webpage. But they don't use RDF: they use RSS or Atom. And what is RSS in practice? It's gone the way of HTML, embedding a whole bunch of presentational stuff including images and worse. Insofar as mainstream feed software does support RDF, it works by ignoring anything that tries to be semantic.
The fundamental unit of RDF and the semweb is the statement: Subject, Predicate, Object. By making the statement rather than the page our fundamental unit, we can more easily combine information from multiple sources. Or from the whole world. The semantic web will give us easy, well-ordered access to all the world's information. A great resource indeed. Let's call it "Google". Oh, wait...
Google gives us a window into all the world's information, together of course with all the world's spam and other crap. It's granular information: words and phrases are searched; the page is merely where you go to read the selected information. It can even make a choice of media for us: when I searched for "London Underground map" in preparation for the journey to Amsterdam (to check my options between Paddington and Liverpool Street), the top three hits were indeed the tube map itself, as an image, at various sites. But Google does all this with the web as it is, not with the web as we would like it to be. What would Google look like on the semweb?
That question only becomes meaningful if the semweb scales beyond the "geek toy" and attains a certain critical mass. But at that point, the spammers inevitably move in. How is the semweb going to deal with spam? Well, what happened when a metadata format on the web became popular, with HTML's <META> elements for KEYWORDS and DESCRIPTION? How is RDF metadata supposed to escape the same fate?
RDF is all about reducing the unit of information from the page to the statement. Strip out extraneous guff and machines can work with it far more efficiently. Great. Strip out context, and you've got a bundle of context-free statements. Scale it, and you've got a bundle of statements of which 99 per cent tell you where to make money fast or buy prescription drugs. Google needs that context. If the semweb is to scale above geek toy, then any machine that accepts RDF other than from known/trusted sources is going to need that context. So much for simplifying things.
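One way to picture putting the context back is to carry each statement's source along with it and filter on that. The Python sketch below does so with four-element tuples; the URLs, the "ex:" terms and the trusted() helper are all hypothetical, and a whitelist is only the crudest possible trust policy.

```python
# Hypothetical whitelist of sources we are prepared to believe.
TRUSTED_SOURCES = {"https://people.example.org/foaf.rdf"}

# Each statement now carries a fourth element: where it came from.
quads = [
    ("ex:alice", "foaf:name", "Alice", "https://people.example.org/foaf.rdf"),
    ("ex:alice", "foaf:knows", "ex:bob", "https://people.example.org/foaf.rdf"),
    ("ex:alice", "ex:recommends", "ex:cheap-pills", "http://spam.example.net/x.rdf"),
]

def trusted(quads, sources):
    """Keep only statements whose source is whitelisted, dropping the source."""
    return [(s, p, o) for (s, p, o, src) in quads if src in sources]

clean = trusted(quads, TRUSTED_SOURCES)
print(len(clean))
```

The statement is no longer the whole unit of information: every consumer now has to track provenance and maintain a policy about it.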
No, what could really use simplification is the semweb itself, whose barriers to entry and usage are absurdly high. It's not entirely the semweb's fault that most people who come to it (your humble scribe included) already know XML and see it through the abomination of rdf+xml. But why is no one making serious efforts to do anything else? Where, for instance, are the tools for working with RDF/N3?
Worst of all, in practical terms, is the use of URIs as words. The underlying premise that URIs can be globally unique by virtue of namespacing has merit, though it inevitably makes RDF hard for humans (Java's uniqueness through namespacing is beautifully right).
The use of HTTP URLs is just plain bonkers. Even the W3C Annotea (http://www.w3.org/2001/Annotea/) folks, at the cutting edge of the semweb, got themselves terminally confused and invented a system that was fundamentally broken, when they confused RDF usage (as words) with HTTP usage (as a protocol). As soon as Annotea dereferences a URL to reference a page (let alone an ill-specified XPointer within a page), it completely loses the RDF properties of uniqueness and invariance. And if the experts at W3C got so hopelessly confused for the entire duration of the project, what hope for the rest of us?
I'm somewhat at a loss how to conclude a rant about the semweb. If you're reading this, I daresay you've already seen the pro arguments, so it would be superfluous to repeat them here. It's today's reincarnation of 1980s Expert Systems, and there's no doubt that added connectivity can enable some very exciting applications, like FOAF (http://rdfweb.org/) and DOAP (http://rdfweb.org/topic/DOAPBulletinBoard) (Description Of A Project).
The range of tools is growing: for example, Apache's new "triplesoup" project is building on mod_sparql, which is itself a recent work. But I think the biggest potential is in the road already taken by RSS, in simplified and rather bastardised spinoffs. ®