I’d been thinking about writing this essay for a while, and initially I planned to phrase the title as a question, i.e., “Are semantic wikis the future of information?” But it’s important to have confidence in one’s own line of work, and, in any case, over time I’ve become more and more convinced about the truth of the statement.
First, some definitions: a wiki is a web site whose pages are editable by many people, directly through a browser. And semantic wikis are wikis that allow for storage of both regular text and queriable data. More on those later.
We all know that the easiest, fastest way to find information currently is on the web. (Whether it’s also the best way is a matter of opinion, but I think no one could argue with its speed.) There’s a staggering amount of information online, not just in written text but also as images, audio and video, and a good percentage of it is indexed by various search engines, so if I’m at a computer with an internet connection anywhere in the world, there are literally millions of interesting questions that I can have answered within a minute or less. That’s an incredible thing, and obviously a new state of affairs in human existence.
But for all the almost-magical power and scope of the web, there are rather severe limits to the kinds of information that can be gotten there. For an overview and basic facts on just about any subject, you can look it up on the web and generally find what you’re looking for (and sometimes much more), but to have a specific, non-basic question answered is a much trickier proposition. On the web, for instance, I can easily find a list of the highest-grossing films of all time, but not the highest-grossing films of, say, 1960; I can find the population of Nairobi, but not a list of the 50 largest cities in Kenya; I can find the current GDP of France, but not the GDP of France in 1850, let alone the percentage of GDP that taxes represented in that year. I can easily find quotes from any of William Shakespeare’s plays, but quotes about any of his plays are quite a bit harder to find. And, to take a more practical example, if I want to buy a pair of sneakers in New York City, I can find local footwear stores, but not a list of stores that carry, say, Gola sneakers and are still open at 8 PM.
In short, there are whole large branches of information that just can’t be found currently on the web. If you want the answer to such a question, you have a few options: you can look it up in some book, or ask someone who’s knowledgeable on the subject (this, it’s true, can be done online). But those of us who have been spoiled by the web prefer to have an answer in 30 seconds over one in several hours or even days. Consider all the questions that come up that we never get the answers to, because the amount of time and/or effort it would take is greater than what we think the importance of the answer is.
Ah, but the amount of information available online is still increasing rapidly; what will happen when there’s 10 or 100 times that amount available, in blog posts, Wikipedia entries, online articles and the like? And what if all those books at the library get scanned in? Actually, I don’t think that will change the basic situation very much. I contend that there are inherent properties of information that make answering non-trivial questions difficult:
- Information is messy. Many questions require more than a one-word answer, because whatever answer is given needs clarification. Different terms mean different things to different people. What is the largest lake in the world, for instance? Three different bodies of water, the Caspian Sea, Lake Baikal and Lake Superior, all lay claim to that title, depending on what exactly you mean by size and what you mean by a lake. To take another example, currency inflation is ongoing, so what does it mean to say that the movie “Gone With the Wind”, made in 1939, earned a certain amount in today’s dollars? If someone bought a DVD of it for $20 five years ago, how exactly does that get added to the total? These are the kinds of issues that, if they can get resolved at all, can usually only be resolved by experts, and even then rarely definitively.
- Information is controversial. Not only can information be messy, but in some cases people will have a vested interest in what the answer will be. That holds especially true when the information deals in some way with hot-button political issues. Infant mortality rates around the world are one example: different countries have very different standards for what should and shouldn’t be counted (whether prematurely-born babies should be counted, for instance), and that’s in part the case because those numbers are used to justify various public policy decisions. Civilian casualty rates and economic indicators are two other types of statistics with similarly highly-charged meaning. It means that any information provided about a controversial topic will be viewed with great suspicion by at least one side and maybe both, which means that you can’t just provide the data and be done with it; you need to show all the work taken to create that data.
- Information is spotty. In many cases, especially for historical information, there are facts we don’t know, or for which the only sources are unreliable. In such cases, it’s good to know, to the extent possible, that such information can’t be found because it’s unknown, not because a search engine can’t find it or because no one’s bothered to enter it yet.
- Information is distributed. Even if all the information exists and is relatively non-controversial, putting it all together in one place could be a challenge. In the very simple case, there’s a lot of information one can find out about any city in Kenya, but to put it together in such a way that one could answer complex questions like, say, what the largest city founded after 1920 is, is something someone would have to do themselves, for the moment. More complex data might involve putting together information in different formats, found in different places, in different languages, etc., and that is a task that would remain a challenge even if the web contained 100 times the information it did.
- Information is always changing. The world is always changing. Thus, information that is meant to describe the world as it currently exists is always in danger of expiring, if it hasn’t expired already. Every aspect of our lives can change, from a restaurant changing its phone number to a country being split in two. Even scientific and historical facts, which people think of as more fixed, can change, as new evidence comes in that causes historians/scientists/etc. to change their minds about something.
- Information is voluminous. This may be the most challenging aspect of all about information - its sheer size. Information is, at its heart, a description of everything that happens in the world, and so there’s no limit to how much information can be expressed. People have written multi-volume memoirs about their own lives, and even then the vast majority of what actually happened from day to day is left out; and that’s just a single person’s life. Sure, trying to include everything would make for an unreadable (in both senses) book, but on the other hand what’s considered interesting and what isn’t varies widely from reader to reader. Even the kind of information that could be described as worthy of being recorded is basically limitless.
So, given all these challenging aspects to information, what is the best way to make more of it accessible? Many people might say that the solution is to simply keep adding more information to Wikipedia. Wikipedia is a tremendous resource, it’s true, and in the eight years that it has been around, it has become, in some people’s minds, synonymous with the very concept of online information: when you want to find out about something, you look up its article on Wikipedia, using a web search only as a convenient way to get there. Wikipedia actually deals well with many of the challenges of information I listed earlier, because it’s a wiki: it handles messiness and controversiality of information by giving users a place to discuss any dispute over information, and by providing unlimited space for people to show the exact steps, and references, by which the information was obtained. And it handles the distributed, ever-changing and massive nature of information by letting anyone in the world change any piece of text at any time, while providing versioning tools to prevent that mass collaboration from leading to total chaos. There is simply no way for a small, paid group of professionals, however fast-working, to duplicate the work that Wikipedia’s hundreds of thousands of volunteers do.
Wikipedia, however, could not be a general solution to holding information for a few reasons. First of all, it’s an encyclopedia: it is meant to contain only information of general interest. A lot of text, and entire articles, routinely get deleted because they’re not considered important enough. To take one small example, a restaurant or store will not get included unless it is famous. That means that a lot of information important to a smaller, specialized group of people will not be included in Wikipedia. One solution to this problem has been to create many wikis for a whole variety of different subcategories and niches of information: there are food wikis, travel wikis and the ubiquitous pop culture wikis. These are all for the good, since they certainly increase the sum total of online information. But they have an important weakness, one shared by Wikipedia itself: they lack queriability.
What is queriability? It’s the ability to aggregate information, based on filters. So, for the example of wanting to know the largest cities in Kenya, it’s not enough to have information on each city, as Wikipedia does. There must be some way to combine that information, to generate new lists with the information we want. Wikipedia already does this to some extent, using categories and pages that are simply lists, but it’s not nearly enough. There’s already a list on Wikipedia of famous Swedish-Americans, but Wikipedia will never have a list of, say, politicians of Swedish ancestry from around the world. Similarly, on a more practical level, you can imagine a wiki that has detailed information on local stores, but it won’t let you easily find the list of stores that carry a certain product and are open at a certain time - for that, you’d need to hunt through the data to find the matching set on your own.
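To make the idea of filter-based aggregation a little more concrete, here is a minimal sketch in Python. It is not how any particular wiki implements queries; the `City` record, the `query` helper and all of the data are hypothetical, with invented place names and numbers used purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class City:
    """One structured record, analogous to the data stored on one wiki page."""
    name: str
    country: str
    population: int
    founded: int


# Placeholder records: the names and numbers are invented, not real statistics.
CITIES = [
    City("Alpha", "Exampleland", 900_000, 1899),
    City("Beta", "Exampleland", 350_000, 1925),
    City("Gamma", "Exampleland", 120_000, 1950),
    City("Delta", "Otherland", 2_000_000, 1930),
]


def query(
    records: List[City],
    where: Callable[[City], bool],
    sort_key: Callable[[City], int],
    limit: Optional[int] = None,
) -> List[City]:
    """Keep the records that pass the filter, sort them in descending order, and truncate."""
    matches = sorted((r for r in records if where(r)), key=sort_key, reverse=True)
    return matches if limit is None else matches[:limit]


# "The two largest cities in Exampleland"
largest = query(
    CITIES,
    where=lambda c: c.country == "Exampleland",
    sort_key=lambda c: c.population,
    limit=2,
)

# "The largest city in Exampleland founded after 1920"
largest_recent = query(
    CITIES,
    where=lambda c: c.country == "Exampleland" and c.founded > 1920,
    sort_key=lambda c: c.population,
    limit=1,
)

print([c.name for c in largest])
print([c.name for c in largest_recent])
```

The point of the sketch is that neither list was written by anyone: once each city's facts are stored as data rather than as prose, any number of such lists can be generated on demand from the same records.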
That’s where semantic wikis come in: they allow for automatic aggregation of information and the creation of lists, and thus they provide answers to questions that were previously unanswerable online. The data entered by the users can function like data in a database: it can be sorted, graphed and analyzed in many different ways. There’s a side benefit to this newfound ability, which is that it provides a much greater incentive for users to enter this kind of data. There already exists a great deal of structured data in table and template form in Wikipedia, thanks to the tireless efforts of volunteers, but the GDP of France in 1850, for instance, is not (at the time of this writing) among them. If someone had that information, and they wanted to add it to Wikipedia, they could undoubtedly find a place to put it, but it has not yet been a priority for anyone. But what if this and other economic data were graphable, and able to be turned into a variety of list data, instead of just being a standalone number? All of a sudden, the value of having that number in place would increase, its absence would be felt more noticeably, and thus the chance of it appearing would increase. Thus, even someone looking for that number by itself would benefit from the wiki it was in being semantic, even though they weren’t using the data semantically.
This is not a strictly theoretical discussion: there already exists a way to make semantic wikis, using open-source software: MediaWiki (the application that Wikipedia uses), along with an extension, Semantic MediaWiki, that enables pages to store their data semantically, and a suite of other extensions that use Semantic MediaWiki to enable easy editing, filtering and mapping of data (some of which I created). I wouldn’t be presumptuous enough to say that MediaWiki and these two extensions themselves constitute the future of information, but they do fit the criteria that I think any such solution has to have: they allow for many people to collaborate on data through versioning tools, they provide forms for people to enter and edit data with, so that data structures are pre-defined, and so that users can easily add and modify data, and they make it easy to modify the data structures themselves. This last property is important because of the inherent messiness of information mentioned earlier: for data of any complexity, there will always come cases that don’t quite fit into the structures provided, and when that happens it’s important to let users modify these structures, using the same processes of collaboration and discussion that already exist for modifying data.
Wikis that use MediaWiki and its semantic suite of extensions already exist, so in a sense the future as I’ve defined it is already here. My site, Discourse DB, which uses these extensions (and was the testing ground for some of them), has, among other interesting pages, a list of opinions (for, against and mixed) on whether U.S.-led coalition forces should pull out of Iraq. It’s far from a complete list, but it is, as far as I know, the only place where you can find anything even resembling a comprehensive list online, or perhaps anywhere. Mograph Wiki, another wiki that uses this combination of MediaWiki and plugins, has a listing of all companies and freelancers that do motion-graphics work in London, England; again, as far as I know, no similar such list exists anywhere else in the world. And, to offer one more example, I don’t think there’s a list anywhere besides the Electronic Literature Organization’s Archive-It wiki of the electronic literature works published in 2004. The interesting thing about all of these lists is that none of them were created by hand; they are all generated automatically when the page is accessed, by querying other data in the wiki. Yes, it would be possible, in a few hours’ worth of time, to generate any of these lists by hand, using the other information found on the wiki, but to create the sheer volume of lists possible on each of these sites, and to keep them all constantly updated, quickly becomes impossible except by automation.
These are three rather specific examples, but then again the number of semantic wikis in the world is still rather small, probably fewer than 100 at this point. As the number grows, it’s easy to imagine the possibilities for the same semantic treatment being applied to a whole variety of other branches of knowledge.
There’s another important usage for semantic wikis, which is to serve as a data source for other applications. Imagine, for instance, a mapping application that’s able to provide information on nearby establishments or other points of interest for a given location, with the data taken from different wikis (some would call such an application a mashup). Or picture an application that lets the user navigate through historical or economic data, or one that automatically creates music playlists based on song properties like tempo and genre, using information from a song wiki. In these and other cases, the open nature of a wiki’s data, along with the ease with which it can be kept up-to-date, makes a semantic wiki ideal as a data source. It’s a usage that has so far eluded Wikipedia and other conventional wikis, because it’s just too hard to extract data from a text-only wiki. For the MediaWiki-based solution, this capability too already exists: Semantic MediaWiki provides export into RDF, a format for semantic data that’s easily queried externally, through standard query languages like SPARQL. The conversion from wiki data to RDF (or other, similar formats like OWL) is easy because the data is already stored semantically to begin with.
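As a rough illustration of why semantically stored data is easy for an outside application to consume, here is a small Python sketch of the playlist idea. It does not use RDF or SPARQL themselves, and it is not Semantic MediaWiki's export format; it only imitates the underlying model of subject-property-value statements with plain tuples. The song names, property names (`genre`, `tempo_bpm`) and helper functions are all hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

# Each fact is a (subject, property, value) statement, loosely in the spirit
# of an RDF triple. All songs and values below are invented placeholders.
Triple = Tuple[str, str, object]

TRIPLES: List[Triple] = [
    ("Song A", "genre", "jazz"),
    ("Song A", "tempo_bpm", 96),
    ("Song B", "genre", "jazz"),
    ("Song B", "tempo_bpm", 168),
    ("Song C", "genre", "rock"),
    ("Song C", "tempo_bpm", 100),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[object] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given pattern; None is a wildcard."""
    for ts, tp, to in triples:
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o):
            yield (ts, tp, to)


def playlist(triples: List[Triple], genre: str, min_bpm: int, max_bpm: int) -> List[str]:
    """Return songs of the given genre whose tempo lies within [min_bpm, max_bpm]."""
    in_genre = {s for s, _, _ in match(triples, p="genre", o=genre)}
    return sorted(
        s
        for s, _, bpm in match(triples, p="tempo_bpm")
        if s in in_genre and min_bpm <= bpm <= max_bpm
    )


print(playlist(TRIPLES, "jazz", 80, 120))
```

Because every fact already has the same simple shape, the consuming application needs no knowledge of how the wiki pages are laid out, which is exactly what a text-only wiki cannot offer.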
So this is the kind of information storage that semantic wikis can allow: public, flexible, allowing for easy modification, and easily exportable. In short, it seems like the ideal solution for handling the chaos that is real information.