The time has come, I think, to ask whether “the semantic web” is a good term to use, even though a lot of people use it, and I even belong to a semantic-web meetup or two. The problem, I think, is that it creates an incorrect picture in people’s minds: a structure that will show up at some point in the future, enabling various magical abilities. The phrase also raises questions it never answers: what will “the semantic web” actually look like? Who will create it? And how will we know when it’s arrived?
Now, it could be that enabling more semantic export of online data will indeed have some magical effects. My issue, though, is that phrasing it this way makes the whole endeavor more intimidating than it needs to be, suggesting a project that has yet to really start. In fact, semantic technologies are not only with us already, but some are in widespread use. RSS is the obvious example: it’s a widely used file format that describes blog posts, news articles and the like in a machine-readable way, so that, using a feed reader, one can be promptly notified about new posts, including their title and other basic information, from any of hundreds of thousands of sources. That’s as semantic as it gets.
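To make the RSS point concrete, here is a minimal sketch of what a feed reader does with that machine-readable data. The feed contents, the example.com URLs and the `read_items` helper are made up for illustration; only the `rss`/`channel`/`item`/`title`/`link` element names come from RSS 2.0 itself.

```python
import xml.etree.ElementTree as ET

# A tiny, invented RSS 2.0 feed; a real reader would download this from a site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return the title and link of every <item> in an RSS feed (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


for entry in read_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```

No guessing about page layout is involved: the code asks for each item’s title and link by name, which is the whole appeal of a semantic format.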
More generally, there’s obviously plenty of structure already in the non-semantic (“syntactic”) web. Sites that cover everything from weather to shopping to reference to news display their data in a structured way, retrieving it from relational databases. In some cases, like Amazon, APIs are provided so that one’s application can retrieve this data directly. But even if there’s no API, or other semantic export of the data, it can be retrieved anyway, through web scraping. NewYorkNabes, which I did the programming for, is one of maybe tens of thousands of examples: it gets its real-estate-price information by going to a set of URLs on newyork.backpage.com once a week, finding the relevant prices within the HTML, and taking their median. If Backpage were to additionally publish their data in RDF form, they would be a true semantic web site, and it would be easier for my code to get that same data. But functionally, things would look exactly the same to users as they do now. You could argue that the difference is that the semantic web data would be retrievable even if the look of the site changed: web scraping is a fragile endeavor, and in theory the system can break if any part of the HTML the scraper relies on, even just a font color, is changed. But if you think about it, the same holds true for semantic data: if the owners decide to change a property name from “Price” to “Rental price”, the system will break just as easily. Neither approach offers a full guarantee, and they both require maintenance: the difference is only one of degree, not kind.
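Here is a rough sketch of that scrape-and-take-the-median idea, along with the way both approaches can break. This is not the actual NewYorkNabes code: the HTML snippet, the `price` class name and the helper functions are all invented for illustration, and the “structured” records are plain dictionaries standing in for real RDF.

```python
import re
from statistics import median

# Hypothetical listing markup; the real Backpage HTML is not reproduced here.
SAMPLE_HTML = """
<div class="listing"><span class="price">$1,800</span> Sunny studio</div>
<div class="listing"><span class="price">$2,400</span> One bedroom</div>
<div class="listing"><span class="price">$3,100</span> Two bedroom</div>
"""

# The scraper depends on this exact markup, which is what makes it fragile.
PRICE_PATTERN = re.compile(r'<span class="price">\$([\d,]+)</span>')


def scrape_prices(html):
    """Pull every price out of the markup as a whole number of dollars."""
    return [int(match.replace(",", "")) for match in PRICE_PATTERN.findall(html)]


def median_price(prices):
    """Return the median price, or None when no prices were found."""
    return median(prices) if prices else None


# Stand-in for a structured export: plain dicts, not real RDF.
SAMPLE_RECORDS = [{"Price": 1800}, {"Price": 2400}, {"Price": 3100}]


def structured_prices(records, property_name="Price"):
    """Read prices by property name; a renamed property yields nothing."""
    return [r[property_name] for r in records if property_name in r]


print(median_price(scrape_prices(SAMPLE_HTML)))          # scraped from HTML
print(median_price(structured_prices(SAMPLE_RECORDS)))   # read by property name

# Both break the same way when the publisher renames something.
restyled_html = SAMPLE_HTML.replace('class="price"', 'class="cost"')
renamed_records = [{"Rental price": r["Price"]} for r in SAMPLE_RECORDS]
print(median_price(scrape_prices(restyled_html)))
print(median_price(structured_prices(renamed_records)))
```

The first two calls agree on the same median, and the last two both come back empty. The structured version is shorter and less likely to break, but it still depends on a name the publisher is free to change.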
I’ve sometimes thought that a good analogy for the value of storing data semantically is a well-organized kitchen: if a kitchen has all its tools and supplies logically arranged and in their place, then it’s easy to find any particular item, and, maybe just as importantly, to know if an item is missing, so that if you don’t see it you won’t end up spending an hour looking for it. If you walk into such a kitchen, even if you’ve never been in it before, you’ll probably be able to start cooking right away. By contrast, the regular web can be compared to a disorganized kitchen, where everything is strewn all around, mixed in haphazardly: the blender could be anywhere, and if there’s no baking soda in the kitchen, good luck determining that for sure.
But this analogy also highlights the gray area between “semantic” and “syntactic”. After all, there’s no such thing as a perfectly-organized kitchen, since any two people’s conceptions of how things should be organized will be different. If you’re looking for wine glasses, will you look near the ordinary glasses, or near the fancy plates? However you arrange things, some people still won’t be able to find what they’re looking for right away, because they’re expecting it elsewhere. Similarly, there are always ambiguities in data. To take one small example, retrieved from this fascinating list of “edit wars” that have emerged in Wikipedia over silly data ambiguities: if you’re displaying consumer products on your site, do you refer to a regular iPod as an “iPod” or (the new term) an “iPod classic”? Even the most carefully-laid-out semantic data will still need some human analysis, and “massaging” of the data, to be usable in an application, and to be aggregated with other data sources, because there are always ambiguities or differences of opinion over how data should be structured.
But if semantic data can resemble syntactic data, the reverse is true as well. To go back to the analogy, even the messiest kitchen is still usable: if you found yourself having to use one, given enough time, you could figure out where everything is and muddle through. After a few months of working in one, you could probably accomplish everything that you could in a well-organized kitchen. It wouldn’t be nearly as enjoyable, of course, but it would be possible. The comparison can be made to a site like NewYorkNabes, which by its nature is a hack, but it works. The difference between syntactic and semantic, again, emerges as one of degree.
My point here is not that the talk about the benefits of semantic technologies like RDF and OWL is overhyped: I won’t try to predict the changes that they will or won’t bring, but I would guess that their adoption will have some substantial benefits. I just think “semantic web” is a bad way to describe this technology, because it makes it sound like a goal to be accomplished, so that one day people can say, “the semantic web has been created”. The more realistic description, I think, is a gradual process of making data more accessible, one that began a long time ago. Instead of “the semantic web”, I think I prefer the terms “semantic technology” or “semantic representation”, or even “semantic web technology”. Heck, even “Web 3.0” is fine with me, since people understand that “Web 2.0” is about a set of technologies and not a separate structure: an adjective, not a noun.