I’m well overdue, but here, finally, is my full explanation of the External Data MediaWiki extension; there have been quite a few improvements to it since even the overhauled release, so maybe some of the delay was justified… at least, I’d like to think so.
First of all, you can see the PDF slides from my conference-call presentation here.
The basic goal of External Data is to allow structured data from the outside world to be displayed, and otherwise used, in a wiki. There are lots of APIs out there on the web, with more coming all the time, and this extension lets you access them in a very lightweight way: there's no need to write an XPath expression or a SPARQL query (and if you don't know what those terms mean, all the better for you). You just declare the URL you want to access and the variables you want to retrieve, and the extension does the work. And, as I noted earlier, since Semantic MediaWiki provides its own web-based API for accessing data, you can also use External Data to display data from one SMW-based wiki in another.
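To make that concrete, here's roughly what such a call looks like in wiki text. The URL, column names and variable names below are all invented, and the exact function and parameter names depend on which version of the extension you're running, but the general shape is: one parser function to fetch the data and map its fields to local variables, and another to display individual values.

    {{#get_web_data:url=http://example.com/api/cities.csv
    |format=csv with header
    |data=population=Population,mayor=Mayor}}

    The city's population is {{#external_value:population}},
    and its mayor is {{#external_value:mayor}}.

That's really all there is to it: the first call retrieves and parses the file, and each #external_value call just prints one of the variables you asked for.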
As I noted in the presentation, though, the vast majority of the world's data is not accessible via a web-based API, and never will be. Instead, it's contained in database tables or Excel spreadsheets, or in even less queryable sources: books, paper documents, etc. If we want to use a set of data like that in our wiki, how do we do it? Yes, we can go the Wikipedia route of just manually entering the data wherever it's needed. However, that leads to a lot of redundant work, and avoiding such work is most of the reason we use semantic wikis in the first place. The next-best approach involves using Semantic MediaWiki: you import the data into wiki pages using some sort of automated tool, with the pages containing either direct semantic annotations or template calls that translate into semantic annotations. The data then gets stored in SMW's data tables, where it can be queried. This approach, as far as I know, has already been taken in a few places; however, it has a major problem: if the original data gets changed or expanded, it's very hard to re-import, because now you have to merge with whatever changes users have made on the wiki.
The ideal solution is to keep the data where it is and create an API for accessing it; however, most of the time that’s not feasible (it’s beyond most organizations’ abilities to create a web service for getting the data from an Excel spreadsheet, for instance). So the “enhanced” External Data allows for what I think is the next-best solution: you put the data into its own wiki page, in CSV format (basically the simplest kind of data format there is; all the values are just separated by commas). The page ‘Special:GetData’, defined by External Data, then serves as a “mini-API” for accessing this data: it takes in the name of a CSV-holding wiki page, and an optional set of criteria, and returns the set of rows that match those criteria. This gives you all the benefits of having an API: the outside world can easily access your data, and you can access it yourself on other wiki pages, using External Data’s standard querying. For some examples of the latter, see this test page on Discourse DB.
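Here's a quick sketch of how that might look, with an entirely made-up data page, field names and numbers (and the exact way Special:GetData takes its criteria may differ between versions). Suppose a wiki page named "US cities data" contains nothing but CSV:

    City,State,Population
    Chicago,Illinois,2853114
    Springfield,Illinois,116250
    Portland,Oregon,557706

Then a URL along the lines of

    http://example.com/wiki/Special:GetData/US_cities_data?State=Illinois

should return just the Illinois rows, again as CSV. And since that's just a URL returning CSV, it can be fed to the same #get_web_data call shown earlier, from this wiki or from any other one, to display those rows.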
But, you may say, importing the data into a wiki page causes all the same problems we were trying to avoid in the first place! After all, it can still be modified by users after the import, making it difficult to re-import. That’s true, but at least the data is now separated from free text, formatting and other things that users may want to be involved with, so the chance of users modifying any of the pages that hold the actual data becomes much smaller; it’s generally a clean solution.
So that’s External Data. The other big wiki-related news is that the project I was working on for the last few months was released two weeks ago (I’m really behind on this stuff): the OpenCongress wiki. It’s meant to work in conjunction with OpenCongress, a site that holds information about the politicians, committees, legislation and campaign contributions of the U.S. Congress. The wiki holds a subset of that information, and it’s of course user-editable. As you can see from the wiki’s version page, it uses Semantic MediaWiki and many of the extensions that have become generally associated with it, including my Semantic Forms and Semantic Drilldown extensions and Sergey Chernyshev’s indispensable Widgets and Header Tabs extensions; all are meant to enable a data-centered approach to the wiki.
I bring it up in this same post because the OpenCongress wiki also uses External Data; actually, in my mind the site and the extension are somewhat interwoven, because External Data was created during my work on the OpenCongress wiki, was inspired by it to a large extent, and certainly got its first usage (and debugging) on the site. External Data is used in various places in the OpenCongress wiki, both to access data from outside APIs (like that of Sunlight Labs), and to handle data that has no API, using the "CSV page" approach. You can see an example of the latter here: a wiki page containing raw data on one organization's "scorecard" for one year's Senate membership. You can see that data then being displayed here, on the page for Senator Barbara Boxer, using External Data (currently this is the only scorecard whose data is displayed for senators, but I believe that's due to be expanded soon).
The OpenCongress wiki is a fantastic site for people looking for political information; in addition, I think that, for wikis, it represents the shape of things to come. That's because it serves as a "mashup" of many different data sources, bringing the user much more information than relying on the wiki's own editors alone ever could. Different types of data are brought together in a relatively seamless way: free text written by regular wiki users; semantic data entered through forms; data from OpenCongress' own database; data from outside APIs; data that's not otherwise web-accessible (like the scorecard information); outside services like social-bookmarking tools and Google Maps; and "feed" sources like YouTube and Twitter. External Data, and the large and growing number of great data APIs around the web, make this so easy to do that I'd imagine it won't be long before other wikis start to follow this same strategy.
Finally, on a side note, it might be mischievous of me to note that all this data integration is done without the use of RDF… but, whatever you think of RDF, that is the truth.