Discourse DB has a topic page for the current American health-care bill, AKA the “America’s Affordable Health Choices Act of 2009”. 82 columns and editorials on the subject are already entered, subdivided into “For”, “Against” and “Mixed” on the topic. If you’re curious about the current distribution of thinking among the pundit class, check it out. (And if you know of opinion items in notable sources not already included in the list, please add them!)
- I was at the “NYC wiki-conference 2009”, held on the NYU campus, over the weekend; my thoughts about the conference are here. The one thing I forgot to mention, on a technical note, was a five-minute demo by Tom Maaswinkel, showing a MediaWiki wiki being edited via the soon-to-be-released Google Wave - it wowed the audience, as Google Wave demos tend to do.
- Jeroen De Dauw released version 0.2 of Maps and Semantic Maps. These new versions have, among other improvements, support for Yahoo! geocoding, and generally better-looking code, which is going to be important in the long run as other developers get their hands on the extensions and start tinkering with them.
- I added Maps and Semantic Maps to Referata - Semantic Google Maps will be gone shortly. That means mapping on Referata has a lot more options, and it’s already starting to bear fruit - check out the Google Earth option on Food Finds, for instance. Pretty nice!
- Sergey Chernyshev and I released a new version of Semantic Bundle, which now includes Maps and Semantic Maps, replacing Google Geocoder and SGM. It’s really the beginning of the end for SGM, not counting the 30+ wikis it’s already on…
- While working on the new Semantic Bundle version, I had the thought that SMW is starting to feel like a mature technology, in that it seems like the majority of the features that it will eventually have are already in place. The addition of the Semantic Maps extension had a lot to do with it, I think; this was one of the big chunks that I thought was still missing. There are still things left to be done, of course; I have a list of around 30, though they won’t necessarily be features that I implement. And I’m sure there will be various improvements behind the scenes, to speed up queries and the like. But I really feel like the Semantic MediaWiki system of the future won’t look all that different from what it looks like now, with the interplay of categories, templates, forms, properties, External Data calls, tables, maps, calendars, widgets, etc. (whew!) that you can already find in various SMW-based wikis. Though I could be wrong about this.
Announcing Semantic Bundle - a single downloadable file that holds Semantic MediaWiki and 16 other MediaWiki extensions that use it and/or are often used in conjunction with it. The aim is to simplify the confusing landscape of extensions that’s evolved around Semantic MediaWiki, so that users can just get one file instead of having to research and download many files individually to get all the functionality they would want. What we have is a basic superset of the kinds of extensions people usually end up using on SMW-driven wikis (administrators can choose which of the extensions to include, once they’ve downloaded the bundle).
Semantic Bundle is similar to the SMW+ package distributed by Ontoprise, although it’s a different set of extensions; both include SMW, of course, but other than that the number of extensions they have in common is surprisingly small - which just goes to show how diverse the set of features has become, and may be another argument for this kind of “curatorial” work.
Semantic Bundle was developed, and is distributed, by Sergey Chernyshev and me.
The title’s a deliberate reference to the essay I wrote about a year and a half ago, “Semantic wikis are the future of information” (a sentiment I still fully agree with, by the way). But in the last few months, since the release of the External Data extension, I’ve had a new realization: that semantic wikis are not just a good tool for storing data, but for handling what’s sometimes known as enterprise application integration: coordinating among a set of systems in an enterprise.
First, the problem: it’s extremely common for mid-to-large organizations - whether they’re companies, non-profits, governments, etc. - to have their data scattered over many different systems. A company can have a database application for accounting, another one for information on employees, another for customer service information, another for website traffic, etc. Other information, like legal agreements or information about business partners, might not even be located in any single location: it could be spread out over documents or emails throughout a company’s departments. And some of those documents might be in hard-copy only, not on a computer. Each data store can be quite useful for what it does (even printed documents in a manila folder somewhere can be a useful storage system); the problem is that the data can’t be combined together in any meaningful way.
Let’s take a straightforward example: a manager wants to know whether employee pay and/or seniority in the customer service department affect the quality of customer service; they also want to know whether service calls about specific products correlate with visits to the website’s help pages about those products, or whether perhaps there’s an inverse correlation, indicating that more information should be added to the website about certain products. The information is all there, somewhere; the problem is that there’s no way to combine it, aggregate it, visualize it, etc.
This is a well-known problem, and a whole body of work exists around it, including journals, books, conferences, etc. The individual data stores are referred to as “information silos”, “islands of information”, “data stovepipes”, etc., while the task of integrating them has been called enterprise information management, and more recently “Enterprise 2.0” and “Business Intelligence 2.0”. And many companies exist to try to solve this problem for organizations, including IBM, SAP, etc.
Well, I believe that semantic wikis offer one solution to this problem, a solution that manages to be lightweight, fairly easy to implement, and I think powerful. This idea crystallized for me recently when I was working on a project to get Semantic MediaWiki into a large organization (I won’t say which it is, because discussions are still ongoing). This organization has all the typical problems of data in a large organization: inaccessible data and lack of any central control over all of it. During the course of planning, the group of us discussing how best to integrate SMW hit on what I think is a reasonable general approach. Here’s the idea: most data stays where it is, in the applications in which it was created; the only big change to each individual system is that each one is now responsible for providing an “API” for getting at its data: basically, a web script that, when passed the ID of some entity in its system (within the URL), displays the data for that entity, in XML, CSV or JSON (three standard formats for displaying data). In some cases, this would just be a short script, maybe less than 10 lines, containing just a single SQL call; in most cases, it likely wouldn’t be a big technical challenge. (And to clarify further, the API, though it would be web-based, could still be behind a firewall; the information would not have to be opened to the public.)
At this point the wiki comes in: it would have a page for each entity, with each page containing a template call based on the type of data it represents. This template call would, in turn, extract the data for this page from the relevant data source (or data sources) via their API(s), using the External Data extension. This data would then be displayed to users, and also most likely stored via semantic properties, so that it could then be aggregated into lists, graphs, calendars, etc.
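As a rough sketch of what one of these templates might contain, here is a hypothetical employee page’s template body, using External Data’s parser functions (the function and parameter names follow the extension’s syntax as I know it, though they may differ by version; the URL, field names and property names are all invented for illustration):

```
{{#get_web_data:
url=http://intranet.example.com/api/employees.php?id={{PAGENAME}}
|format=json
|data=name=name,department=department,salary=salary}}

'''Name:''' {{#external_value:name}}

'''Department:''' [[Has department::{{#external_value:department}}]]

'''Salary:''' [[Has salary::{{#external_value:salary}}]]
```

The `[[property::value]]` markup is standard Semantic MediaWiki annotation; it’s what makes the fetched values queriable alongside everything else on the wiki.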
The end result is a system in which no pre-existing component needs to know about any other component (only the wiki needs to know about everything), and parts can be brought in and out without bringing down the overall system. Also, it requires no programming, only some wiki-page scripting. And it’s based entirely on free, open-source software.
There’s one additional complication: data that isn’t stored in any database-backed system at all - data contained in files, or emails, or printed documents - which an organization would probably want to upgrade as part of an enterprise-integration process anyway. For these, the data could easily be moved onto the wiki, making use of what wikis were originally intended to do, which is storing text information. The flexibility of semantic wikis means that such a transfer could be done gradually, based on the needs of the organization. For a group of PowerPoint presentations, for instance, the wiki could start out as a directory containing the location of each file on the company’s file server, and then eventually come to semantically hold all the data contained in those files.
So there it is: a semantic wiki system (in this case, SMW, although if it takes off I’m sure other wikis will copy this functionality), plus custom APIs per system, provides the ability to do relatively pain-free data integration.
I’m not the first person to think of data integration by means of components publishing their own data; in fact, that’s been one of the suggested uses of so-called Semantic Web technology, where each component publishes data in a format like RDF or OWL, and semantic reasoners and SPARQL queries pull it all together. That, too, is a valid approach; my basic objection to it is that I think it’s overkill: you can easily get bogged down in a world of competing ontologies and mismatched data, if the goal is to get all the RDF outputs to be compatible with one another. With a wiki at the center, on the other hand, each component can just publish its data in the simplest format possible, and let the wiki deal with all the data-matching and exception-handling.
I’m well overdue, but here, finally, is my full explanation of the External Data MediaWiki extension; there have been quite a few improvements to it even since the overhauled release, so maybe some of the delay was justified… at least, I’d like to think so.
First of all, you can see the PDF slides from my conference-call presentation here.
The basic goal of External Data is to allow structured data from the outside world to be displayed, and otherwise used, in a wiki. There are lots of APIs out there on the web, with more coming all the time, and this extension allows them to be accessed in a very lightweight manner: no need to specify an XPath expression or a SPARQL query (and if you don’t know what those terms mean, all the better for you): you just declare the URL you want to access and the variables you want to retrieve, and it does the work. And, as I noted earlier, since Semantic MediaWiki provides its own web-based API for accessing data, you can also use External Data to display data from one SMW-based wiki in another.
As I noted in the presentation, though, the vast majority of the world’s data is not accessible via a web-based API and never will be. Instead, it’s contained in database tables or Excel spreadsheets, or in even less-queriable sources: books, paper documents, etc. If there’s a set of data like that that we want to use in our wiki, how do we do that? Yes, we can go the Wikipedia route of just manually entering the data wherever it’s necessary. However, this leads to a lot of redundant work, and avoiding it is most of the reason we use semantic wikis in the first place. The next-best approach involves using Semantic MediaWiki: you import the data into wiki pages using some sort of automated tool, with the pages containing either direct semantic annotations or template calls that translate into semantic annotations. The data then gets stored in SMW’s data tables, where it can be queried. This approach, as far as I know, has already been taken in a few places; however, there’s a major problem with it: if the original data gets changed or expanded, it’s very hard to re-import it, because now you have to merge it with whatever changes have been made by users on the wiki.
The ideal solution is to keep the data where it is and create an API for accessing it; however, most of the time that’s not feasible (it’s beyond most organizations’ abilities to create a web service for getting the data from an Excel spreadsheet, for instance). So the “enhanced” External Data allows for what I think is the next-best solution: you put the data into its own wiki page, in CSV format (basically the simplest kind of data format there is; all the values are just separated by commas). The page ‘Special:GetData’, defined by External Data, then serves as a “mini-API” for accessing this data: it takes in the name of a CSV-holding wiki page, and an optional set of criteria, and returns the set of rows that match those criteria. This gives you all the benefits of having an API: the outside world can easily access your data, and you can access it yourself on other wiki pages, using External Data’s standard querying. For some examples of the latter, see this test page on Discourse DB.
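As a hypothetical sketch of how this works (the page name, column names and values here are all invented, and the exact URL parameter format is my assumption, so check the extension’s documentation for the real syntax): a page called “Senate scorecard data” could contain nothing but CSV, with a header row:

```
Senator,Year,Score
Barbara Boxer,2008,95
Dianne Feinstein,2008,90
```

A URL along the lines of “http://mywiki.example.com/wiki/Special:GetData/Senate_scorecard_data?Senator=Barbara Boxer” would then return only the matching rows, and the same filtering can be done from within the wiki itself, via External Data’s usual parser functions.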
But, you may say, importing the data into a wiki page causes all the same problems we were trying to avoid in the first place! After all, it can still be modified by users after the import, making it difficult to re-import. That’s true, but at least the data is now separated from free text, formatting and other things that users may want to be involved with, so the chance of users modifying any of the pages that hold the actual data becomes much smaller; it’s generally a clean solution.
So that’s External Data. The other big wiki-related news is that the project I was working on for the last few months was released two weeks ago (I’m really behind on this stuff): the OpenCongress wiki. It’s meant to work in conjunction with OpenCongress, a site that holds information about the politicians, committees, legislation and campaign contributions of the U.S. Congress. The wiki holds a subset of that information, and it’s of course user-editable. As you can see from the wiki’s version page, it uses Semantic MediaWiki and many of the extensions that have become generally associated with it, including my Semantic Forms and Semantic Drilldown extensions and Sergey Chernyshev’s indispensable Widgets and Header Tabs extensions; all are meant to enable a data-centered approach to the wiki.
I bring it up in this same post because the OpenCongress wiki also uses External Data; actually, in my mind the site and the extension are somewhat interwoven, because External Data was created during my work on the OpenCongress wiki, was inspired by it to a large extent, and certainly got its first usage (and debugging) on the site. External Data is used in various places in the OpenCongress wiki, both to access data from outside APIs (like that of Sunlight Labs), and to handle data that has no API, using the “CSV page” approach. You can see an example of the latter here - a wiki page containing raw data on one organization’s “scorecard” for one year’s Senate membership. You can see that data then being displayed here, on the page for Senator Barbara Boxer, using External Data (currently only this one scorecard’s data is displayed for all senators, but I believe it’s due to get expanded soon).
The OpenCongress wiki is a fantastic site for people looking for political information; in addition, I think that, for wikis, it represents the shape of things to come. That’s because it serves as a “mashup” of many different data sources, bringing much more information to the user than relying on just the wiki’s own editors would allow. Different types of data are brought together in a relatively seamless way: free text written by regular wiki users; semantic data entered through forms; data from OpenCongress’ own database; data from outside APIs; data that’s not otherwise web-accessible (like the scorecard information); outside services like social-bookmarking tools and Google Maps; and “feed” sources like YouTube and Twitter. External Data, and the large and growing number of great data APIs around the web, make this so easy to do that I’d imagine it won’t be long before other wikis start to follow this same strategy.
Finally, on a side note, it might be mischievous of me to note that all this data integration is done without the use of RDF… but, whatever you think of RDF, that is the truth.
Yesterday I released a new version of the External Data extension that allows it to, in addition to previous functionality, get a table’s worth of data (instead of just single values), and extract data from any wiki page holding values in CSV format. The more I think about it, the more I think these additions make External Data among the most important MediaWiki extensions I’ve released (or co-released, since Michael Dale contributed), or maybe even the most important, beating out Semantic Forms; I guess we’ll see.
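As a minimal sketch of the new table handling (the page and column names are invented for illustration, and the function and parameter names reflect my understanding of the extension’s syntax, which may differ by version): you can now retrieve a whole set of rows at once, and then loop over them:

```
{{#get_web_data:
url=http://mywiki.example.com/wiki/Special:GetData/Kenya_city_data
|format=csv with header
|data=city=City,population=Population}}

{{#for_external_table:
* {{{city}}}: population {{{population}}}
}}
```

The loop body is plain wikitext, with triple-brace variables filled in once per retrieved row - so a single call can render an arbitrarily long list or table.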
I hope to write more about the “new” External Data at some point; for now, if you want to hear more about it and what I think its implications are, I’ll be talking about it tomorrow (Thursday) at 1:30 PM EST in session 5 of the semantic wiki conference call “mini-series”. Anyone is free to join in the call. There’ll also be other talks that should be quite interesting (including three from people I met at the Semantic MediaWiki users meeting in Boston).
It turns out that yesterday was the seventh anniversary of the launch of Wikipedia, otherwise known as Wikipedia Day. So it’s probably as good a time as any to acknowledge the huge impact Wikipedia has had on my own career. It’s rare to say that a website has single-handedly brought into being an entire field of technology, but one could make a case that Wikipedia has done that for my field, semantic wikis - which is, of course, ironic, since Wikipedia itself does not use semantic technology. The site, though, has enabled what I do to come about in several different ways - enough that it’s hard to dispute the direct link. Here are the ways in which Wikipedia has made semantic wikis possible:
- It taught the world about wikis. Most people, when they first heard about Wikipedia, a site where anyone can edit anything, probably had the same reaction: sounds like a recipe for disaster. To be sure, some critics of Wikipedia still say that’s the case; but for most of the hundreds of millions of people who read the site, seeing it work has been an eye-opening experience: the realization that a site where users can edit the content of any page can work. And for some users (including me), the realization that not only is it a workable solution, but in some cases it’s the best solution, or even the only solution, for aggregating information in one place. And so Wikipedia’s proof-of-concept inspired many people to create their own wikis for their businesses, organizations or personal interests. I dare say that 99% of the people who have been involved with semantic wikis got their first experience with wikis by reading Wikipedia; I’m part of that group.
- It has inspired researchers. Beyond just Wikipedia as a proof-of-concept, the idea of turning Wikipedia into more of a database-like information store has captured the imaginations of a lot of people. That’s how Semantic MediaWiki got its start: the first paper published about the project was titled “Semantic Wikipedia”, and the concept remains the holy grail for many of those involved with the project (not for me personally, though I can understand the excitement). And Freebase, the other major semantic wiki technology (in my opinion), which uses its own proprietary application, has billed itself as a “Wikipedia for data”; I wouldn’t be surprised if it was conceived that way too. (It’s an open question what will happen to Freebase if Wikipedia goes semantic, and thus itself becomes the Wikipedia for data.)
- It has enabled the technology. MediaWiki, the wiki engine developed specifically for Wikipedia, is also, in my opinion, the best wiki engine, of the dozens that exist. It’s robust, scalable, and full of useful features. Two of those features have, I think, made it ideally suited for use in semantic wikis: templates and hooks. Templates enable the separation of data from data structure and presentation, which lets a semantic wiki approximate much more closely a regular database-driven website; while hooks, of which MediaWiki has hundreds, allow extensions like Semantic MediaWiki to integrate nicely into the rest of the package with little or no coordination between the extension developers and the main MediaWiki developers: that, in turn, allows for much faster development time. Neither one is a coincidence: the nature of Wikipedia and its massive size make conveniences like these into something more like necessities.
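To illustrate that first point with a hypothetical sketch (the template, field and property names here are all made up): a wiki page for a city might contain nothing but a template call holding the raw values,

```
{{City
|Country=Kenya
|Population=750000
|Founded=1922
}}
```

while the page “Template:City” holds the presentation and, on a semantic wiki, the property annotations that make those values queriable:

```
'''{{PAGENAME}}''' is a city in [[Located in::{{{Country|}}}]],
founded in [[Year founded::{{{Founded|}}}]], with a population
of [[Has population::{{{Population|}}}]].

[[Category:City]]
```

Change the template once, and every city page’s display (and data structure) changes with it - which is exactly the separation of data from presentation that database-driven websites rely on.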
So, a big thank you to Wikipedia, and of course to its two co-founders: Larry Sanger, who had the idea to use a wiki to power the world’s first free online encyclopedia; and Jimmy Wales, who has successfully guided the project ever since.
The site Pseudomenon, which appears to have just been released yesterday, is the newest entrant to the small club of semantic wiki hosting sites. This is, as far as I know, the third site to offer hosting of Semantic MediaWiki, and the fourth to offer hosting of any sort of semantic wiki, the one non-SMW site being Swirrl. It’s the first, though, to support the Halo extension (also known as “SMW+”), which allows free-form semantic annotation and querying of wiki pages. Pseudomenon doesn’t include any other extensions at the moment, but the inclusion of Halo by itself makes it a helpful addition.
According to the main page, hosting is free, and every wiki gets a subdomain at iskey.info.
Apparently, the word “pseudomenon” is a reference to the Epimenides paradox, in which a Cretan stated “all Cretans are liars”. A snide commentary on truth in wikis? Well, at least it’s a real word, as opposed to the fake-Latin “Referata” I came up with, though I later found out that it means, I believe, “reports” in Croatian.
Do you like reading about semantic wikis, but really wish you could hear me talking on the phone about them? Well, you’re in luck, because I’ll be speaking in the 3rd session of the semantic wiki “mini-series” of conference calls, on Thursday. The last two sessions, which happened over the last two months, covered the broader world of semantic wikis; this one focuses specifically on Semantic MediaWiki. Markus Krötzsch, the lead developer of SMW, will talk about the core of the technology, and I’ll talk about “Semantic Forms, Semantic Drilldown, Semantic Result Formats, Semantic Google Maps, Semantic Compound Queries and Data Transfer” (evidently, I get bored easily). There will also be people from the Ontoprise corporation presenting their contributions, and some other presenters. Each presentation will also have a real-time slide show on the web. You can see the presentation time and phone number here (it depends on where you live), plus other details, and a place to RSVP (you don’t need to RSVP to watch/listen, but it’s strongly recommended).
Yesterday I released the extension Semantic Compound Queries; this is my sixth released MediaWiki extension, if you’re keeping score (in addition to three that I’ve modified). This one displays the results of multiple queries in the same place - that can include formats like lists, tables, and (I think more interestingly) maps and calendars. Here are two examples of the extension in action:
- a map that shows both people and organizations, with a different color for each, on the front page of the Semantic MediaWiki Community Wiki.
- a calendar that shows opinion items covering the 2008 presidential election in green, and everything else in the default blue, on my wiki, Discourse DB.
That second one is a contrived example, admittedly; hopefully there’ll be more credible examples soon.
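For reference, a hypothetical call along the lines of that first example (the category, property and icon names are invented, so consult the extension’s documentation for the exact syntax): each sub-query gets its own filters and display settings, separated by semicolons, with the sub-queries themselves separated by pipes:

```
{{#compound_query:
[[Category:People]]; ?Has coordinates; icon=Blue marker.png
|[[Category:Organizations]]; ?Has coordinates; icon=Red marker.png
|format=googlemaps
}}
```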
Swirrl has been launched - it’s a semantic wiki hosting site, which makes it, in my opinion, the world’s second semantic wiki host, after Referata (you could certainly make the case for Wikia being the first, since they had Semantic MediaWiki on their site about six months before Referata was released, but I think their support for it was (and is) primitive enough that it doesn’t really count). The focus of Swirrl seems to be editable tables and spreadsheets, which is different from the approach of Referata and Semantic MediaWiki, where individual pages (corresponding to cells in spreadsheets) are what’s edited. Swirrl more closely resembles SocialCalc, the application formerly known as WikiCalc, which is a wiki-spreadsheet application, though it’s not hosted, and not semantic; and Dabble DB, which also offers hosting, plus rather sophisticated editing and viewing of spreadsheet data, but without the semantic-web or wiki functionality.
Do I view any of these companies or applications as a threat to Referata? Not at all. The big hurdle for Referata and the SMW-based extensions has always been lack of awareness about structured/semantic wikis and their possibilities, as opposed to any specific competing solution. The number of people who could potentially benefit from semantic wikis but have never heard of them will, for at least the next few years, vastly overshadow the number of people who have heard of more than one structured/semantic wiki and have to choose among them. It would be great if, when describing my site, I could say “it’s a lot like Swirrl”, as opposed to having to explain semantic wikis (and in many cases, wikis themselves) every time. So I’m rooting for them to become a household name.
I put together a “quick reference” guide, also known as a “cheat sheet”, for Semantic MediaWiki, that covers the syntax for SMW plus three of its extensions: Semantic Forms, Semantic Drilldown and Semantic Calendar. It’s available in both PNG and PDF formats, and ready for printing in each one; you can see it here.
Hopefully this will be a useful reference; it’s the kind of thing that I think would have helped me a lot when I was first starting to use SMW.
The guide was created in Inkscape. The visual format was heavily inspired by the cheat sheets at Added Bytes (the site formerly known as ilovejackdaniels.com); really an exemplar in the concise display of data, as far as I’m concerned.
Just to show that you never know where ideas will come from, the idea for this reference actually came to me in a dream about two months ago; though in my dream the SMW reference guide looked like one of those laminated, back-and-front-of-the-page study guides you sometimes see in bookstores, instead of just a one-page sheet. Also, unlike in my dream, P. Diddy probably won’t show up to thank me for having created it (no, really).
I got back four days ago from Wikimania in Alexandria; I haven’t had that much free time since then, between resting a lot, trying to take care of a bunch of issues that came up during the week I was gone, and entertaining some people. Anyway, the conference was great, and Egypt was quite interesting too, though an exhausting country to visit. I saw a good amount of Alexandria, and after the conference I spent a day in Cairo and nearby Giza with some other people from the conference (that would be “Wikimaniacs”), where we saw some pyramids and the Sphinx. I have a bunch of photos, from then and from the conference itself, that I hope to upload soon.
As for the conference, I met some very interesting people, saw some amazing talks, and had a lot of nice conversations about all manner of wiki-things. I sent an email to the Semantic Forms mailing list summarizing the technical aspects of the conference, which you can read here.
My own presentation went fine, I thought. All presentations were videotaped, though mine, like the vast majority of them, still isn’t online, which makes me fear that it might not be available for a good while, if ever. Anyway, I uploaded the slide show that I used for the presentation; here it is, in PDF form.
Two notable sites have popped up recently that use my Semantic Forms extension to enable collaborative editing of data, pretty much on opposite ends of the spectrum.
First, the FEMA Region III Levee Accreditation Knowledgebase, which is currently being assessed by various people at FEMA’s “Region III” (essentially the mid-Atlantic states of the United States) as a possible system for recording the readiness of levees. Post-Hurricane Katrina, the state of the nation’s levees has obviously become much more portentous, and the fact that my technology could become part of the solution is thrilling.
The other site is the delightfully crassly-named Chickipedia, a wiki for information about attractive female celebrities. Launched by Break.com about two weeks ago, this is by far the highest-profile site so far to use Semantic Forms. The interesting thing is that, of all the many write-ups the site has gotten online, none have indicated awareness of the semantic technology that powers it, or the fact that, unlike Wikipedia (which it deliberately compares itself to) it provides forms for entering content; instead the focus has been, perhaps understandably, on the, er, ample content. On the whole, that’s fine by me, since my goal in the extensions I’ve created has always been to make the underlying semantic technology as invisible as possible. It’s true that it would be nice to get more recognition of the technology, although I suppose that’s what I’m helping to do now.
So, those are the two big new sites; I like the Apollonian/Dionysian, James Bond-esque duality of it.
I’d been thinking about writing this essay for a while, and initially I planned to phrase the title as a question, i.e., “Are semantic wikis the future of information?” But it’s important to have confidence in one’s own line of work, and, in any case, over time I’ve become more and more convinced about the truth of the statement.
First, some definitions: a wiki is a web site whose pages are editable by many people, directly through a browser. And semantic wikis are wikis that allow for storage of both regular text and queriable data. More on those later.
We all know that the easiest, fastest way to find information currently is on the web. (Whether it’s also the best way is a matter of opinion, but I think no one could argue with its speed.) There’s a staggering amount of information online, not just in written text but also as images, audio and video, and a good percentage of it is indexed by various search engines, so if I’m at a computer with an internet connection anywhere in the world, there are literally millions of interesting questions that I can have answered within a minute or less. That’s an incredible thing, and obviously a new state of affairs in the human existence.
But for all the almost-magical power and scope of the web, there are rather severe limits to the kinds of information that can be gotten there. For an overview and basic facts on just about any subject, you can look it up on the web and generally find what you’re looking for (and sometimes much more), but to have a specific, non-basic question answered is a much trickier proposition. On the web, for instance, I can easily find a list of the highest-grossing films of all time, but not the highest-grossing films of, say, 1960; I can find the population of Nairobi, but not a list of the 50 largest cities in Kenya; I can find the current GDP of France, but not the GDP of France in 1850, let alone the percentage of GDP that taxes represented in that year. I can easily find quotes from any of William Shakespeare’s plays, but quotes about any of his plays are quite a bit harder to find. And, to take a more practical example, if I want to buy a pair of sneakers in New York City, I can find local footwear stores, but not a list of stores that carry, say, Gola sneakers and are still open at 8 PM.
In short, there are whole large branches of information that just can’t be found currently on the web. If you want the answer to such a question, you have a few options: you can look it up in some book, or ask someone who’s knowledgeable on the subject (this, it’s true, can be done online). But those of us who have been spoiled by the web prefer an answer in 30 seconds over one in several hours or even days. Consider all the questions that come up that we never get the answers to, because the amount of time and/or effort it would take is greater than what we think the answer is worth.
Ah, but the amount of information available online is still increasing rapidly; what will happen when there’s 10 or 100 times that amount available, in blog posts, Wikipedia entries, online articles and the like? And what if all those books at the library get scanned in? Actually, I don’t think that will change the basic situation very much. I contend that there are inherent properties of information that make answering non-trivial questions difficult:
- Information is messy. Many questions require more than a one-word answer, because whatever answer is given needs clarification. Different terms mean different things to different people. What is the largest lake in the world, for instance? Three different bodies of water, the Caspian Sea, Lake Baikal and Lake Superior, all lay claim to that title, depending on what exactly you mean by size and what you mean by a lake. To take another example, currency inflation is ongoing, so what does it mean to say that the movie “Gone With the Wind”, made in 1939, earned a certain amount in today’s dollars? If someone bought a DVD of it for $20 five years ago, how exactly does that get added to the total? These are the kinds of issues that, if they can get resolved at all, can usually only be resolved by experts, and even then rarely definitively.
- Information is controversial. Not only can information be messy, but in some cases people will have a vested interest in what the answer will be. That holds especially true when the information deals in some way with hot-button political issues. Infant mortality rates around the world are one example: different countries have very different standards for what should and shouldn’t be counted (whether prematurely-born babies should be counted, for instance), and that’s in part the case because those numbers are used to justify various public policy decisions. Civilian casualty rates and economic indicators are two other types of statistics with similarly highly-charged meaning. It means that any information provided about a controversial topic will be viewed with great suspicion by at least one side and maybe both, which means that you can’t just provide the data and be done with it; you need to show all the work taken to create that data.
- Information is spotty. In many cases, especially for historical information, there are facts we don’t know, or for which the only sources are unreliable. In such cases, it’s good to know, to the extent possible, that such information can’t be found because it’s unknown, not because a search engine can’t find it or because no one’s bothered to enter it yet.
- Information is distributed. Even if all the information exists and is relatively non-controversial, putting it all together in one place could be a challenge. In the very simple case, there’s a lot of information one can find out about any city in Kenya, but putting it together in such a way that one could answer a complex question - like, say, what the largest city founded after 1920 is - is something one would have to do oneself, for the moment. More complex cases might involve putting together information in different formats, found in different places, in different languages, etc., and that is a task that would remain a challenge even if the web contained 100 times the information it does.
- Information is always changing. The world is always changing. Thus, information that is meant to describe the world as it currently exists is always in danger of expiring, if it hasn’t expired already. Every aspect of our lives can change, from a restaurant changing its phone number to a country being split in two. Even scientific and historical facts, which people think of as more fixed, can change, as new evidence comes in that causes historians/scientists/etc. to change their minds about something.
- Information is voluminous. This may be the most challenging aspect of all about information - its sheer size. Information is, at its heart, a description of everything that happens in the world, and so there’s no limit to how much information can be expressed. People have written multi-volume memoirs about their own lives, and even then the vast majority of what actually happened from day to day is left out; and that’s just a single person’s life. Sure, trying to include everything would make for an unreadable (in both senses) book, but on the other hand what’s considered interesting and what isn’t varies widely from reader to reader. Even the kind of information that could be described as worthy of being recorded is basically limitless.
So, given all these challenging aspects to information, what is the best way to make more of it accessible? Many people might say that the solution is to simply keep adding more information to Wikipedia. Wikipedia is a tremendous resource, it’s true, and in the eight years that it has been around, it has become, in some people’s minds, synonymous with the very concept of online information: when you want to find out about something, you look up its article on Wikipedia, using a web search only as a convenient way to get there. Wikipedia actually deals well with many of the challenges of information I listed earlier, because it’s a wiki: it handles messiness and controversiality of information by giving users a place to discuss any dispute over information, and by providing unlimited space for people to show the exact steps, and references, by which the information was obtained. And it handles the distributed, ever-changing and massive nature of information by letting anyone in the world change any piece of text at any time, while providing versioning tools to prevent that mass collaboration from leading to total chaos. There is simply no way for a small, paid group of professionals, however fast-working, to duplicate the work that Wikipedia’s hundreds of thousands of volunteers do.
Wikipedia, however, could not be a general solution for holding information, for a few reasons. First of all, it’s an encyclopedia: it is meant to contain only information of general interest. A lot of text, and entire articles, routinely get deleted because they’re not considered important enough. To take one small example, a restaurant or store will not get included unless it is famous. That means that a lot of information important to a smaller, specialized group of people will not be included in Wikipedia. One solution to this problem has been to create many wikis for a whole variety of different subcategories and niches of information: there are food wikis, travel wikis and the ubiquitous pop culture wikis. These are all for the good, since they certainly increase the sum total of online information. But they have an important weakness, one shared by Wikipedia itself: they lack queriability.
What is queriability? It’s the ability to aggregate information, based on filters. So, for the example of wanting to know the largest cities in Kenya, it’s not enough to have information on each city, as Wikipedia does. There must be some way to combine that information, to generate new lists with the information we want. Wikipedia already does this to some extent, using categories and pages that are simply lists, but it’s not nearly enough. There’s already a list on Wikipedia of famous Swedish-Americans, but Wikipedia will never have a list of, say, politicians of Swedish ancestry from around the world. Similarly, on a more practical level, you can imagine a wiki that has detailed information on local stores, but it won’t let you easily find the list of stores that carry a certain product and are open at a certain time - for that, you’d need to hunt through the data to find the matching set on your own.
That’s where semantic wikis come in: they allow for automatic aggregation of information and the creation of lists, and thus they provide answers to questions that were previously unanswerable online. The data entered by the users can function like data in a database: it can be sorted, graphed and analyzed in many different ways. There’s a side benefit to this newfound ability, which is that it provides a much greater incentive for users to enter this kind of data. There already exists a great deal of structured data in table and template form in Wikipedia, thanks to the tireless efforts of volunteers, but the GDP of France in 1850, for instance, is not (at the time of this writing) among them. If someone had that information, and they wanted to add it to Wikipedia, they could undoubtedly find a place to put it, but it has not yet been a priority for anyone. But what if this and other economic data were graphable, and able to be turned into a variety of list data, instead of just being a standalone number? All of a sudden, the value of having that number in place would increase, its absence would be felt more noticeably, and thus the chance of it appearing would increase. Thus, even someone looking for that number by itself would benefit from the wiki it was in being semantic, even though they weren’t using the data semantically.
This is not a strictly theoretical discussion: there already exists a way to make semantic wikis, using open-source software: MediaWiki (the application that Wikipedia uses), along with an extension, Semantic MediaWiki, that enables pages to store their data semantically, and a suite of other extensions that use Semantic MediaWiki to enable easy editing, filtering and mapping of data (some of which I created). I wouldn’t be presumptuous enough to say that MediaWiki and these extensions themselves constitute the future of information, but they do fit the criteria that I think any such solution has to have: they allow many people to collaborate on data through versioning tools; they provide forms for people to enter and edit data with, so that data structures are pre-defined and users can easily add and modify data; and they make it easy to modify the data structures themselves. This last property is important because of the inherent messiness of information mentioned earlier: for data of any complexity, there will always come cases that don’t quite fit into the structures provided, and when that happens it’s important to let users modify these structures, using the same processes of collaboration and discussion that already exist for modifying data.
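To make the earlier Kenya example concrete, here is roughly what an inline query looks like in Semantic MediaWiki, assuming a hypothetical wiki whose city pages are annotated with the relevant properties (the category and property names here are invented for illustration):

```
{{#ask: [[Category:City]] [[Located in::Kenya]] [[Year founded::>1920]]
|?Population
|?Year founded
|sort=Population
|order=descending
}}
```

The query returns a table of the matching cities, sorted by population; when someone adds or corrects a city’s data, every list built on that data updates automatically.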
Wikis that use MediaWiki and its semantic suite of extensions already exist, so in a sense the future as I’ve defined it is already here. My site, Discourse DB, which uses these extensions (and was the testing ground for some of them), has, among other interesting pages, a list of opinions (for, against and mixed) on whether U.S.-led coalition forces should pull out of Iraq. It’s far from a complete list, but it is, as far as I know, the only place where you can find anything even resembling a comprehensive list online, or perhaps anywhere. Mograph Wiki, another wiki that uses this combination of MediaWiki and extensions, has a listing of all companies and freelancers that do motion-graphics work in London, England; again, as far as I know, no similar such list exists anywhere else in the world. And, to offer one more example, I don’t think there’s a list anywhere besides the Electronic Literature Organization’s Archive-It wiki of the electronic literature works published in 2004. The interesting thing about all of these lists is that none of them were created by hand; they are all generated automatically when the page is accessed, by querying other data in the wiki. Yes, it would be possible, in a few hours’ worth of time, to generate any of these lists by hand, using the other information found on the wiki, but to create the sheer volume of lists possible on each of these sites, and to keep them all constantly updated, quickly becomes impossible except by automation.
These are three rather specific examples, but then again the number of semantic wikis in the world is still rather small, probably less than 100 at this point. As the number grows, it’s easy to imagine the possibilities for the same semantic treatment being applied to a whole variety of other branches of knowledge.
There’s another important usage for semantic wikis, which is to serve as a data source for other applications. Imagine, for instance, a mapping application that’s able to provide information on nearby establishments or other points of interest for a given location, with the data taken from different wikis (some would call such an application a mashup). Or picture an application that lets the user navigate through historical or economic data, or one that automatically creates music playlists based on song properties like tempo and genre, using information from a song wiki. In these and other cases, the open nature of a wiki’s data, along with the ease with which it can be kept up-to-date, makes a semantic wiki ideal as a data source. It’s a usage that has so far eluded Wikipedia and other conventional wikis, because it’s just too hard to extract data from a text-only wiki. For the MediaWiki-based solution, this capability too already exists: Semantic MediaWiki provides export into RDF, a format for semantic data that’s easily queried externally, through standard means like SPARQL. The conversion from wiki data to RDF (or other, similar formats like OWL) is easy because the data is already stored semantically to begin with.
So this is the kind of information storage that semantic wikis can allow: public, flexible, allowing for easy modification, and easily exportable. In short, it seems like the ideal solution for handling the chaos that is real information.
A lot of sites have been getting press recently as would-be rivals to Wikipedia, attempting to correct one or another of its perceived flaws (that it’s too amateurish, unreliable, biased, etc.), or at least to get in on the staggering number of page views it receives. Below is my roundup of the ones I know about…
Citizendium - the most well-known rival to Wikipedia at the moment, I think. It’s run by Larry Sanger, Wikipedia’s less-famous co-founder, and it’s meant to be a more civil and better-educated version of Wikipedia, but with the same technology and the same multi-language support. To edit the site, you have to be registered with your real name, and regular users are meant to give deference to those who can prove that they have expertise on a particular subject.
Scholarpedia - it bills itself as a competitor to Wikipedia, though it’s not really one. In Scholarpedia every article is written by experts and professionals, which means that only a small number of articles can get written, and presumably many topics can’t be addressed at all (who, after all, is an expert on, say, “Wheel of Fortune”?), though you could make a credible argument that such topics don’t belong in an encyclopedia in the first place. Currently the articles that do exist are mostly on scientific topics. It does offer an interesting niche: articles that are credible enough to be cited directly as references (I don’t know if anyone out there is citing Wikipedia - it seems somehow both implausible and inevitable).
Knol (no link yet, but you can read more about it on Wikipedia, including a mockup page) - Google’s planned answer to Wikipedia, to some extent. Like Citizendium, everyone will have to use their real name, and if you want to start an article you have to submit your picture as well. Once you start an article, you’ll basically “own” it, though others can edit it. Thus, the focus will be more on individual pages than on trying to create some sort of unified site. My guess is that this particular site was at least partly inspired by Wikipedia’s uncanny ability to get to the top of Google search results on just about any search term; that’s somewhat confirmed by the Google executive in charge of the project stating that each page on their site “is meant to be the first thing someone who searches for this topic for the first time will want to read.”
Freebase - an encyclopedia run by Metaweb, using their proprietary semantic wiki technology, which means that, as far as I know, it’s the first and only semantic encyclopedia, i.e. the only encyclopedia whose data can be queried and aggregated. The interface is really geared toward its database-like style, with lots of field names everywhere for users to fill in. That makes it somewhat hard to read for general reference, and I don’t know if anyone uses it for that purpose. But as a proof-of-concept it’s neat, and for what it does it might be the only game in town, at least until Wikipedia itself bothers to add semantic capabilities…
Conservapedia - an attempt to counter a supposed left-wing bias on Wikipedia, by offering a version with a stated conservative and pro-American bias. Will they succeed? They claim to have 20,000 articles, which is quite impressive given that they’ve only been around for about a year (since November 2006). That’s already comparable to the size of, say, the Greek-language Wikipedia. Who knows? Maybe the concept of an opinionated wiki has potential.
By the way, I don’t know if there’s any comparable site on the left - I know about dKosopedia, but they aim to be only a “political encyclopedia”, as opposed to a general-interest one.
Lohipedia - no, it’s not a site about Lindsay Lohan, but rather an encyclopedia focused on user ratings and “karma”. The more other users trust you, the more permissions you have - though it appears that anyone who visits the site can still edit anything. This one uses the application Lohimedia, which runs on Ruby on Rails (MediaWiki, by contrast, which powers Wikipedia and some of the rest, runs on PHP). That alone gives me a certain amount of respect for it. Other than that, I have no opinion on it.
The extension page for Semantic Drilldown, where it can be downloaded, is here. You can see it in action here. It’s meant to display all the data on one’s wiki, in an interface that allows for drilling down by both subcategories and semantic properties.
The impetus behind creating it is that a semantic wiki really should be able to have all the features of a traditional CMS (content-management system). A complete CMS offers interfaces to easily view individual data, add data, edit it, delete it (these first four actions are sometimes known as CRUD, for create/read/update/delete), search and drill down; and, with lower priority, data import and export as well. MediaWiki itself handles viewing (thanks to the use of templates), deletion and search; Semantic MediaWiki handles importing and exporting, to some extent, using RDF; and Semantic Forms (built on top of Semantic MediaWiki) handles adding and editing data. That left only drill-down as a needed feature, and that’s what Semantic Drilldown is meant to accomplish. We’ll see how much usage it gets, of course, but I’m quite pleased with the way it turned out.
“The Civil War in Four Minutes” - an animation showing the shifting battle lines, over time, of the U.S. Civil War. It was created to show visitors at the Abraham Lincoln Library/Museum. This was on YouTube for a while, then removed at some point, but you can still find it at this video site. It’s a great piece of data visualization, and I think semantic wikis will make such animations much easier to accomplish, by providing the large amount of raw data they need. Actually, I look forward to the day when semantic wikis are strongly tied, in the public mind, to this kind of visualization.
Good lord, it’s been almost a month since my last post. My apologies to anyone who’s actually been checking back here. It’s certainly not for lack of stuff to write about - maybe that’s the problem, actually. Since the last post, there have been five more versions of Semantic Forms released. It’s now up to 0.6.6, with another one on the way. Semantic Forms is also now in use on a few more sites, both public and private, including Placeography, the Minnesota Historical Society’s wiki of places around the world, which I have high hopes for. There are a few more SF-based sites that I know are coming out sometime soon, though I’m not at liberty to talk about them at the moment.
On that note, I mentioned before that I was trying to convince a company I do work for to use a semantic wiki as the back-end for one of their projects. Well, happily, I did manage to convince them, and the wiki is up and is now getting populated with data. One of the main concerns had been that the people who would be populating the data, i.e. the true clients, would find the whole setup too confusing to use, since they’re not computer-savvy. We had some informal training sessions for them, though, and they went very well - the users were unfazed by the standard MediaWiki sidebar links and the extra tabs on the tops of pages, and once they filled out a form or two, they understood what the important parts of the system were. What I found really neat was that the whole training went by without us once using the word “wiki” - we just explained that it’s a system for data entry, one that also happens to let you see previous versions of a page so you can revert changes.
Other things I’ve been up to: I added a CAPTCHA plugin for this blog, so if you want to comment you have to type the text contained in an image; that cut down on a lot of spam. I’m still getting some spam comments, which I’m assuming come from trackbacks.
I also attended my first meeting of the New York Semantic Web Meetup two weeks ago, which featured a very interesting talk about how communities can decide together on the best names for things. I definitely plan to go back.
Also, I’m typing on a new laptop now - this beauty (yes, I ignored the apparent Lenovo China boycott, for better or for worse). The old one was on its last legs - it had a battery life of about 20 minutes, and it froze up whenever I thought to run more than, say, two browsers and Notepad at the same time.
I’ll be speaking at the “Metatopia” conference in Washington D.C. in November, on the subject of using semantic wikis as a knowledge management system/content management system/whatever. Here’s the page for the presentation. I’m excited about it: I’ve never presented at a conference before, and I haven’t attended one except for Wikimania ‘06 (plus there are the conferences I used to travel to with my parents when I was young, but as might be expected I wasn’t actually attending those). Many thanks to my co-presenter, Neil Davies, for having the idea for the presentation and handling the logistics.