External Data grows again

June 23rd, 2009

The latest version of the External Data extension now lets you get data from two other sources (in addition to APIs and text files): LDAP servers, and database tables. This is a nice step forward, in that it’s no longer completely necessary to create an API for every data source you want to access from the wiki; which makes the concept of using MediaWiki for data integration potentially simpler and less breakable. Thanks to David Macdonald for this new functionality.

Semantic Bundle launched

June 22nd, 2009

Announcing Semantic Bundle - a single downloadable file that holds Semantic MediaWiki and 16 other MediaWiki extensions that use it and/or are often used in conjunction with it. The aim is to simplify the confusing landscape of extensions that’s evolved around Semantic MediaWiki, so that users can just get one file instead of having to research and download many files individually to get all the functionality they would want. What we have is a basic super-set of the kinds of extensions people usually end up using on SMW-driven wikis. (Administrators can choose which of the extensions to include, once they’ve downloaded the bundle.)

Semantic Bundle is similar to the SMW+ package distributed by Ontoprise, although it’s a different set of extensions; both include SMW, of course, but other than that the number of extensions they have in common is surprisingly small - which just goes to show how diverse the set of features has become, and may be another argument for this kind of “curatorial” work.

Semantic Bundle was developed, and is distributed, by Sergey Chernyshev and me.

Meeting Metaweb

June 17th, 2009

I had a very interesting meeting about a week and a half ago with Robert Cook, the co-founder of Metaweb, i.e. the people behind Freebase. By sheer coincidence, we know someone (non-technical) in common, and he was visiting New York, so it all worked out.

I certainly learned a good amount. For one thing, it was a pleasant surprise to find out that he’s a very friendly and personable guy. The meeting also cleared up some misconceptions I had had about Freebase, and their future plans. I had always thought of Freebase and Semantic MediaWiki as rivals - friendly rivals, perhaps, but still creators of similar products, possibly competing for some of the same customers. And if Wikipedia ever started using SMW, I imagined we’d become pretty much direct competitors, since the other co-founder of Metaweb, Danny Hillis, has referred to Freebase as “Wikipedia for data”. But it turned out that, far from fearing or being skeptical of Wikipedia adopting Semantic MediaWiki, Robert was very excited about the idea, and wanted to know what he could do to help.

As I found out, Metaweb sees Freebase more as an aggregator of data than an original source of it (that’s my understanding, anyway). In other words, though users can directly add information to Freebase through the form interface, the much more important source is sites like Wikipedia, MusicBrainz, EDGAR, etc. Freebase’s strengths lie in matching up entities (i.e., knowing that data about a book from two different databases are about the same book), as well as querying and browsing - they have an extremely fast storage and querying system for their millions of items of data, and some slick interfaces for browsing through it all (see Parallax).

So a two-part solution suggests itself: Wikipedia, with some sort of semantic capability, handles the entry and display of data, along with basic aggregation, like lists and tables (and possibly maps and timelines, etc.); while Freebase takes in the data, then handles the complex browsing and querying that Wikipedia probably couldn’t allow, for performance reasons. Other sites could allow for querying and browsing of Wikipedia’s data as well, of course, but Freebase looks like they’re in a unique position to handle it all.

There’s also Freebase’s entity match-up, which is at the heart of Freebase’s new Common Tag effort. The idea is to, instead of using plain text tags for blog posts, news articles, etc., use Freebase entity IDs instead - so that there won’t be ambiguity about what a tag means. It’ll be interesting if this initiative takes off - as Robert noted, it’s not a substitute for true semantic triples, but it beats having “an ambiguous relationship to an ambiguous entity” (my recollection of how he described current tags).

SMW helps win contests, UPDATE: I can’t read very well

June 9th, 2009

Okay, all of the stuff I wrote before happened, but it was this time last year, not this year. I was off by an entire year. It’s still cool, though - maybe more impressive, actually, given how much functionality has been added to Semantic MediaWiki, etc. since last year. Anyway, what’s written below is not timely in the least.

—————————————————–

This is cool. The company 23andMe creates reports for people on their genetic profiles - it doesn’t send anyone their entire DNA chain, but just notifies them about the presence of SNPs (“snips”), which, as I understand it, are DNA sequences considered specifically informative. (The company’s also known for being founded by Google co-founder Sergey Brin’s wife, but I digress.) Anyway, in April they ran a contest in which they published the 23andMe data for an anonymous woman, and those who took part had to guess at as many of her attributes as possible. The winner was announced three weeks ago, and it was Mike Cariaso, whom I always enjoy talking to, and who runs the site SNPedia.com (“snipedia”). In his winning entry, he gave details for her race, hair and eye color, proclivity for diseases, and more intangible things like personality and intelligence. In their announcement of the winner, the company didn’t say which of the details were accurate, but if even half of them are, it’s a surprising (to me) level of detail.

In any case, the really neat thing is that Mike used SNPedia as the database to get all this information; and SNPedia is a wiki that runs on Semantic MediaWiki, and Semantic Forms. So I think it’s great proof that SMW can compete with any technology out there at the moment as far as enabling open, collaborative databases. (Oh, and the prize is a free genetic screening, which sounds good if you’re into that sort of thing.)

Updates

June 1st, 2009

Lots of Semantic MediaWiki-related developments recently…

My name is ___, and I use SMW

May 20th, 2009

After some amount of planning, emailing and persuading, the Semantic MediaWiki testimonials page is now up. The page, as could be expected, holds a list of statements by various people about how SMW has helped them. There are eight testimonials already, featuring a good mix of contributions from corporations, research groups and individual websites. So now we have something that I think is rare: a testimonials page for an open-source application that has no organization running it. In other words, a marketing effort without the marketers, or even a CEO. Is this a harbinger of the future of work? I guess we’ll see. For now, I think this will be an important tool in getting companies and other organizations more comfortable with the idea of using SMW, especially in place of more slickly-marketed (but also more expensive) packages. And if you use Semantic MediaWiki and haven’t already submitted a testimonial, please feel free to do so - the email address is at the bottom of that page.

New extension: Admin Links

May 13th, 2009

I’m pleased to announce my latest extension, Admin Links, released earlier today; which, depending on how you count it, is around my ninth extension (a number I never would have guessed I would reach). I believe this is my conceptually simplest extension yet: just a page of links that are meant to be helpful for administrators. I think that this helps fix a hole in MediaWiki, though: I wrote before that I thought one of the top weaknesses of MediaWiki compared to competing systems was “lack of guidance from the interface about how administrators should accomplish their tasks”. Other applications have wizards, control panels and the like for helping administrators do their daily tasks, but when you first set up MediaWiki, there’s nothing looking back at you but a blank main page, and lots of pages of documentation elsewhere. Admin Links provides the bare minimum, which is a page (at “Special:AdminLinks”) of links to common administrative tasks (like editing the CSS file, managing users, viewing a list of all the wiki’s pages). In addition, for administrators, it puts a link to this page within their “user links”, which are the links usually at the top of the page of “my talk”, “my preferences”, etc.; that way, an administrator can easily get to it from whatever page they happen to be on. Finally, Admin Links provides an API for letting other extensions add on sections and links to the page, so that Special:AdminLinks can always serve as a control panel for whatever set of extensions are installed. You can see an example of Admin Links at work here, on Discourse DB; though, since you’re not an administrator, you won’t see a link to it at the top. I’ve modified my local versions of the Semantic MediaWiki and Semantic Forms extensions to call the Admin Links API already, so you can see a lot of links geared for those two. I plan to check in the new Admin Links code of SMW and SF at some point soon, as well as to add similar calls to some of my other extensions.

The idea for this extension actually came from my wiki hosting site, Referata, which already has such a page for administrators (though there it’s called “Helpful links” - which will probably be replaced by Admin Links soon). And the idea for that, in turn, came because I realized the sheer volume of pages that people creating a Semantic MediaWiki site need to know about was making it hard for people to get started. So, in a very real sense, Admin Links is a Semantic MediaWiki-inspired extension; though of course it will most likely have usage beyond that. I should also note that it was the head of SMW, Markus Krötzsch, who came up with the insightful idea of implementing it as a general extension with an API, back when I discussed it with him a long while ago.

Semantic wikis are the future of systems integration

May 4th, 2009

The title’s a deliberate reference to the essay I wrote about a year and a half ago, “Semantic wikis are the future of information” (a sentiment I still fully agree with, by the way). But in the last few months, since the release of the External Data extension, I’ve had a new realization: that semantic wikis are not just a good tool for storing data, but for handling what’s sometimes known as enterprise application integration: coordinating among a set of systems in an enterprise.

First, the problem: it’s extremely common for mid-to-large organizations - whether they’re companies, non-profits, governments, etc. - to have their data scattered over many different systems. A company can have a database application for accounting, another one for information on employees, another for customer service information, another for website traffic, etc. Other information, like legal agreements or information about business partners, might not even be located in any single location: it could be spread out over documents or emails throughout a company’s departments. And some of those documents might be in hard-copy only, not on a computer. Each data store can be quite useful for what it does (even printed documents in a manila folder somewhere can be a useful storage system); the problem is that the data can’t be combined together in any meaningful way.

Let’s take a straightforward example: a manager wants to know whether employee pay and/or seniority in the customer service department affect the quality of customer service; they also want to know whether service calls about specific products correlate with visits to the website’s help pages about those products, or whether perhaps there’s an inverse correlation, indicating that more information should be added to the website about certain products. The information is all there, somewhere; the problem is that there’s no way to combine it, aggregate it, visualize it, etc.

This is a well-known problem, and a whole body of work exists around it, including journals, books, conferences, etc. The individual data stores are referred to as “information silos”, “islands of information”, “data stovepipes”, etc., while the task of integrating them has been called enterprise information management, and more recently “Enterprise 2.0” and “Business Intelligence 2.0”. And many companies exist to try to solve this problem for organizations, including IBM, SAP, etc.

Well, I believe that semantic wikis offer one solution to this problem, a solution that manages to be lightweight, fairly easy to implement, and I think powerful. This idea crystallized for me recently when I was working on a project to get Semantic MediaWiki into a large organization (I won’t say which it is, because discussions are still ongoing). This organization has all the typical problems of data in a large organization: inaccessible data and lack of any central control over all of it. During the course of planning, the group of us discussing how best to integrate SMW hit on what I think is a reasonable general approach. Here it is: most data stays where it is, in the applications in which it was created; the only big change to each individual system is that each one is now responsible for providing an “API” for getting at its data: basically, a web script that, when passed in the ID, within the URL, of some entity in their system, displays the data for that entity, in XML, CSV or JSON (three standard formats for displaying data). In some cases, this would be just a short script, maybe less than 10 lines, containing just a single SQL call; in most cases, it likely wouldn’t be a big technical challenge. (And to clarify further, the API, though it would be web-based, could still be behind a firewall; the information would not have to be opened to the public.)
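To give a feel for how small such a script can be, here is a minimal sketch in Python of the core of one system’s “API”: a function that takes an entity ID and returns that entity’s data as JSON, using a single SQL call. The table name (employees), its columns and the sample row are all made up for illustration, and the web layer that would read the ID out of the URL is left out.

```python
import json
import sqlite3


def entity_as_json(conn: sqlite3.Connection, entity_id: int) -> str:
    """Return one entity's data as a JSON string, or "{}" if the ID is unknown."""
    conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
    # The single SQL call; the ID is bound as a parameter, not pasted into the SQL.
    row = conn.execute(
        "SELECT id, name, department FROM employees WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.dumps(dict(row) if row is not None else {})


# Demo: an in-memory database stands in for the real (hypothetical) system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.execute("INSERT INTO employees VALUES (7, 'Ana Ruiz', 'Customer Service')")
print(entity_as_json(conn, 7))
```

A real deployment would wrap this function in whatever web framework or CGI setup the system already has; the point is only that the data-fetching part can stay this short.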

At this point the wiki comes in: it would have a page for each entity, with each page containing a template call based on the type of data it represents. This template call would, in turn, extract the data for this page from the relevant data source (or data sources) via their API(s), using the External Data extension. This data would then be displayed to users, and also most likely stored via semantic properties, so that it could then be aggregated into lists, graphs, calendars, etc.
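On the wiki itself this step is done with a template and External Data, not with a program, but the logic can be illustrated in Python. The sketch below is only an analogy under stated assumptions: each source is modelled as a function from an entity ID to a JSON string (in practice it would be an HTTP request to that system’s API), and the source names, field names and values are hypothetical.

```python
import json
from typing import Callable, Dict

# A source is anything that turns an entity ID into a JSON string.
Source = Callable[[str], str]


def build_page_data(entity_id: str, sources: Dict[str, Source]) -> Dict[str, str]:
    """Merge every source's fields for one entity into a flat property map."""
    page: Dict[str, str] = {}
    for source_name, fetch in sources.items():
        for field, value in json.loads(fetch(entity_id)).items():
            # Prefix each field with its source name so that two systems
            # using the same field name cannot overwrite each other.
            page[f"{source_name}.{field}"] = str(value)
    return page


# Two stand-in "APIs" with made-up data; neither knows the other exists.
def hr_api(entity_id: str) -> str:
    return json.dumps({"name": "Ana Ruiz", "start_year": 2004})


def support_api(entity_id: str) -> str:
    return json.dumps({"calls_handled": 312})


page = build_page_data("7", {"hr": hr_api, "support": support_api})
print(page)
```

Only build_page_data knows about both sources, which mirrors the wiki’s role here: adding or removing a system means changing the dictionary of sources, not the systems themselves.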

The end result is a system in which no pre-existing component needs to know about any other component (only the wiki needs to know about everything), and parts can be brought in and out without bringing down the overall system. Also, it requires no programming, only some wiki-page scripting. And it’s based entirely on free, open-source software.

There’s one additional complication, which is components that don’t have database-backed data storage, that an organization would want to upgrade as part of an enterprise-integration process anyway: data contained in files, or emails, or printed documents. For these, the data could easily be moved onto the wiki, making use of what wikis were originally intended to do, which is storing text information. The flexibility of semantic wikis means that such a transfer could be done gradually, based on the needs of the organization. For a group of PowerPoint presentations, for instance, the wiki could start out as a directory containing the location of each file in the company’s file server, and then eventually come to semantically hold all the data contained in those files.

So there it is: a semantic wiki system (in this case, SMW, although if it takes off I’m sure other wikis will copy this functionality), plus custom APIs per system, provides the ability to do relatively pain-free data integration.

I’m not the first person to think of data integration by means of components publishing their own data; in fact, that’s been one of the suggested uses of so-called Semantic Web technology, where each component publishes data in a format like RDF or OWL, and semantic reasoners and SPARQL queries pull it all together. That, too, is a valid approach; my basic objection to it is I think it’s overkill: you can easily get bogged down in a world of competing ontologies and mismatched data, if the goal is to get all the RDF outputs to be compatible with one another. With a wiki at the center, on the other hand, each component can just publish its data in the simplest format possible, and let the wiki deal with all the data-matching and exception-handling.

I’m in the Google Summer of Code

April 23rd, 2009

I’m very pleased to say that, as was announced Monday, I’ll be mentoring one of the four projects for the Wikimedia Foundation in the 2009 Google Summer of Code. If you don’t know about the Google Summer of Code (or “GSoC”, as it’s affectionately called), it’s a fantastic program, fully funded by Google, that pays students around the world to work on established open-source projects over a summer. The student I’m mentoring is Jeroen De Dauw, a budding hacker in Belgium (and, coincidentally, one with a first name pronounced very similarly to mine, which is why some people when they first hear my name think I’m Dutch). He’s already got the requisite enthusiasm and programming experience that makes me think the project will be a success.

The planned project is different from what’s described on the site, due to some re-thinking. The current plan is for Jeroen to create a new MediaWiki extension, called “Semantic Maps”, that will hold all support for mapping services: initially Google Maps and OpenLayers (replacing the current Semantic Google Maps and (not-really-working) Semantic Layers extensions), and then, as time permits, Google Earth and Yahoo! Maps as well.

This project was easily accepted, which was great, though it was mostly luck: not that many people signed up to mentor for Wikimedia this year, which brings to mind Woody Allen’s quote that 80% of success is just showing up.

However idiosyncratic the process of getting accepted was, there’s nothing idiosyncratic about the project itself. Geographical mapping is a very important feature in data visualization; judging by this somewhat-reliable list of active SMW-using sites, Semantic Google Maps is the second most-popular additional extension for SMW sites, after Semantic Forms. Of course, that’s Google Maps; and I don’t doubt that Google Maps will remain the most popular mapping service even as others become available, but all the others have their specific strengths and user base: OpenLayers allows for mapping on non-geographic surfaces, like anatomical images and blueprints; Google Earth shows a 3-D view of the world; and Yahoo! Maps has fewer license restrictions than Google Maps does.

So that should be an exciting project; I’m also looking forward to just being a mentor. I’ll hopefully post some updates about Semantic Maps here as it gets developed.

Resolving MediaWiki and SMW weaknesses: discussion forums

April 7th, 2009

As the Semantic MediaWiki system becomes more mature and better-known, it’s encountering a new (and somewhat exciting) problem: it’s getting increasingly faced off against other applications when large organizations evaluate it as a possible content-management/systems-integration/etc. solution. These other applications include, most notably, Microsoft SharePoint, but also “enterprise wikis” like Confluence and SocialText. And when these matchups occur they inevitably bring the weaknesses and gaps in MediaWiki and SMW into focus. The weaknesses that I’ve personally heard have been raised in this way are:

  1. Lack of good WYSIWYG editing (there is a WYSIWYG-editing extension, FCKeditor, that works fine in most circumstances, and I’m in the minority who doesn’t think WYSIWYG editing for wikis is that necessary in the first place, but it’s been brought up as an issue)
  2. Lack of discussion forums
  3. Little to no access control, for being able to set who can read and/or edit which pages
  4. Lack of guidance from the interface about how administrators should accomplish their tasks
  5. A boring appearance - most MediaWiki sites tend to look almost exactly like Wikipedia, which itself doesn’t look that exciting
  6. Especially for Semantic MediaWiki (as opposed to MediaWiki itself), a skepticism about committing to a system that would require either training internal staff or keeping around consultants indefinitely

Those are the big ones, as far as I’m aware. It should be noted that issues of actual storage and display of data, which take up almost all of the focus of SMW discussions and development, don’t seem to have come up in evaluations of SMW at all; which I think indicates that SMW is far ahead of its competitors on data-related matters. Which is great news, though it does suggest that maybe our efforts should be re-prioritized to some extent.

I have some thoughts on how to deal with all of these, except for the first one, and they’re all worth having a discussion about (#3, the access-control issue, is probably worth having quite a few discussions about). But what I want to talk about in this post is issue #2, the lack of discussion forums in MediaWiki. I’ve heard it mentioned as a concern for three different large organizations in the last month, which I assume means that it’s a big issue and will stay that way until it’s solved.

I think the first thing that needs to be addressed, when talking about discussion forums, is that at least three different things fall into the realm of “discussion forums”, which may help explain why it’s been so hard to get a definitive solution. Here are what I see as the three things:

  1. Discussions about wiki pages - questions and conversations about the layout, content, data etc. of the pages in the wiki
  2. Discussions about the wiki’s topics - a place for people to talk, vent and argue about the actual subjects of each wiki page, independent of what the wiki pages happen to contain
  3. General discussions - forum-like discussions that may be unrelated to anything specifically in the wiki

The first kind of discussion is what MediaWiki’s “Talk” pages are geared for, and generally I think they work fine for that purpose. You could make the case that this system could use some improvement - there’s no reason why users should be able to edit others’ comments, for instance - but I haven’t seen any major problems with them, and extensions already exist, like Liquid Threads, that make Talk pages more forum-like.

The second kind of discussion is unique to public wikis - wikis that are meant to attract a general readership, where there will be a set of users who want to read the contents and comment on the topics, without modifying the content itself. On Wikipedia such comments are simply not allowed, which I think is the right thing to do for a mass-audience reference. But for more-specific sites, meant to attract people interested in one particular set of topics, allowing general venting and discussion makes sense. The current best way to do this, in my opinion, is to have such comments be handled by an outside system. The OpenCongress wiki handles them in such a way: the wiki page on the Employee Free Choice Act, for instance, links to OpenCongress’ main page on this bill (at least, the House version), which itself has a tab for the comments page. The flow could be a little nicer, but the system provides a clear location for comments. Of course, in the case of OpenCongress, the non-wiki site, with comments pages, already existed before the wiki was set up, so it was obvious which approach to take. In the case of a wiki without an external site attached, there’s no good, easy solution at the moment. I believe such a solution is important; I also believe that it should be implemented in some way outside the wiki - in other words, comments should be entered in HTML not wiki text, and they shouldn’t be editable once they’re entered. I also don’t know if comments pages should use the wiki’s user-registration system - commenting systems on blogs and such in general seem to work fine without registration, and I believe it might be important to maintain a separate “identity” between making changes to the wiki and expressing one’s personal opinions. For all those reasons, I think it’s a bad idea to use Talk pages for that purpose, although it’s tempting. (And there’s also the fact that Talk pages are already used for discussions about the wiki content.)

So that leaves the need for some sort of way for comment pages to be integrated into a wiki. This definitely could use more thought and discussion.

The third kind of discussion is just discussions in general, potentially on any topic, that people who read and edit the wiki would want to have specifically with one another. For a private wiki in an organization, this would just be a forum for employees/members to talk; for a public wiki on a specific topic, it would be a forum devoted to that topic. Here there’s the least-strong argument for integrating the discussion directly into the wiki, since plenty of good forum software already exists, like phpBB, and a MediaWiki extension would never be able to match their functionality (some people have tried creating forms using Semantic Forms to enable such a thing, but I don’t think that’ll ever work nearly as well as dedicated software). However, it’s definitely worth creating, at the very least, a “best practices” document explaining how MediaWiki and forum software should be used together and link to one another; and possibly how to integrate their user-registration systems, using OpenID or anything else.

So that’s what I think about discussions in MediaWiki. I may get around to writing about the other ones; let me know in the comments if there are any that you specifically want to hear my thoughts on, and of course feel free to share your own thoughts.

Strange

March 26th, 2009

I worked (indirectly) for Scott Murphy, for my first full-time job, as a web programmer at iXL-New York, a web consulting company. I was there for about seven months from 1998 to ’99, which somewhat coincided with the heyday of the dot-com boom; they were out of business within two years. He was one of the heads of the company, and very personable. Anyway, he’s now running for the U.S. House of Representatives, to replace the woman who replaced Hillary Clinton in the Senate, and the outcome of the race has national implications.

A longer-than-expected post about External Data and the OpenCongress wiki

March 18th, 2009

I’m well overdue, but here, finally, is my full explanation of the External Data MediaWiki extension; there have been quite a few improvements to it since even the overhauled release, so maybe some of the delay was justified… at least, I’d like to think so.

First of all, you can see the PDF slides from my conference-call presentation here.

The basic goal of External Data is to allow structured data from the outside world to be displayed, and otherwise used, in a wiki. There are lots of APIs out there on the web, with more coming all the time, and this extension allows them to be accessed in a very lightweight manner: no need to specify an XML XPath structure, or a SPARQL query (and if you don’t know what those terms mean, all the better for you): you just declare the URL you want to access, and the variables you want to retrieve, and it does the work. And, as I noted earlier, since Semantic MediaWiki provides its own web-based API for accessing data, you can also use External Data to display data from one SMW-based wiki in another.
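To illustrate what “declare the variables you want to retrieve” can mean without any XPath-style path, here is a small Python sketch. It is not External Data’s actual code, just one guess at the idea: given a parsed JSON document and a mapping from local variable names to external field names, it collects the value of each named field wherever it sits in the structure. The sample document and its values are made up, and fetching the document from a URL is left out so the example runs offline.

```python
import json
from typing import Any, Dict, List


def find_values(node: Any, key: str) -> List[Any]:
    """Collect every simple value stored under `key`, at any depth of nesting."""
    found: List[Any] = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and not isinstance(v, (dict, list)):
                found.append(v)
            else:
                found.extend(find_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


def retrieve(document: str, variables: Dict[str, str]) -> Dict[str, List[Any]]:
    """Map each local variable name to the values of the external field it names."""
    data = json.loads(document)
    return {local: find_values(data, external) for local, external in variables.items()}


# A made-up API response; the caller never has to say where "name" lives.
doc = '{"bill": {"title": "Sample Act", "sponsor": {"name": "A. Smith", "state": "NY"}}}'
result = retrieve(doc, {"bill title": "title", "sponsor": "name"})
print(result)
```

The trade-off of matching on field names alone is ambiguity when the same name appears in several places, which is why each result here is a list rather than a single value.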

As I noted in the presentation, though, the vast majority of the world’s data is not accessible via a web-based API and never will be. Instead, it’s contained in database tables or Excel spreadsheets, or in even less-queriable sources: books, paper documents, etc. If there’s a set of data like that that we want to use in our wiki, how do we do that? Yes, we can go the Wikipedia route of just manually entering the data in wherever it’s necessary. However, this leads to a lot of redundant work, and avoiding it is most of the reason we use semantic wikis in the first place. The next-best approach involves using Semantic MediaWiki: you import the data into wiki pages using some sort of automated tool, with the pages containing either direct semantic annotations or template calls that translate into semantic annotations. The data then gets stored in SMW’s data tables, where it can be queried. This approach, as far as I know, has already been done in a few places; however, there’s a major problem with it: if the original data gets changed or expanded, it’s very hard to re-import it, because now you have to merge with whatever changes have been made by users on the wiki.

The ideal solution is to keep the data where it is and create an API for accessing it; however, most of the time that’s not feasible (it’s beyond most organizations’ abilities to create a web service for getting the data from an Excel spreadsheet, for instance). So the “enhanced” External Data allows for what I think is the next-best solution: you put the data into its own wiki page, in CSV format (basically the simplest kind of data format there is; all the values are just separated by commas). The page ‘Special:GetData’, defined by External Data, then serves as a “mini-API” for accessing this data: it takes in the name of a CSV-holding wiki page, and an optional set of criteria, and returns the set of rows that match those criteria. This gives you all the benefits of having an API: the outside world can easily access your data, and you can access it yourself on other wiki pages, using External Data’s standard querying. For some examples of the latter, see this test page on Discourse DB.
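The “mini-API” idea can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the code behind ‘Special:GetData’: it takes the text of a CSV-holding page plus an optional set of criteria, and returns the rows that match. I’m assuming here that criteria are simple column-equals-value tests and that the first CSV line holds the column names; the sample data is invented.

```python
import csv
import io
from typing import Dict, List, Optional


def get_data(csv_text: str, criteria: Optional[Dict[str, str]] = None) -> List[Dict[str, str]]:
    """Return the rows of `csv_text` whose columns equal every given criterion."""
    criteria = criteria or {}
    rows = csv.DictReader(io.StringIO(csv_text))  # first line = column names
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]


# Made-up contents of a CSV-holding wiki page.
page_text = "Name,Department,Years\nAna,Support,5\nBen,Sales,2\nCho,Support,9\n"

support_rows = get_data(page_text, {"Department": "Support"})
print([row["Name"] for row in support_rows])
```

With no criteria the function returns every row, which corresponds to fetching the whole table.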

But, you may say, importing the data into a wiki page causes all the same problems we were trying to avoid in the first place! After all, it can still be modified by users after the import, making it difficult to re-import. That’s true, but at least the data is now separated from free text, formatting and other things that users may want to be involved with, so the chance of users modifying any of the pages that hold the actual data becomes much smaller; it’s generally a clean solution.

So that’s External Data. The other big wiki-related news is that the project I was working on for the last few months was released two weeks ago (I’m really behind on this stuff): the OpenCongress wiki. It’s meant to work in conjunction with OpenCongress, a site that holds information about the politicians, committees, legislation and campaign contributions of the U.S. Congress. The wiki holds a subset of that information, and it’s of course user-editable. As you can see from the wiki’s version page, it uses Semantic MediaWiki and many of the extensions that have become generally associated with it, including my Semantic Forms and Semantic Drilldown extensions and Sergey Chernyshev’s indispensable Widgets and Header Tabs extensions; all are meant to enable a data-centered approach to the wiki.

I bring it up in this same post because the OpenCongress wiki also uses External Data; actually, in my mind the site and the extension are somewhat interwoven, because External Data was created during my work on the OpenCongress wiki, was inspired by it to a large extent, and certainly got its first usage (and debugging) in the site. External Data is used in various places in the OpenCongress wiki, both to access data from outside APIs (like that of Sunlight Labs), and to handle data that has no API, using the “CSV page” approach. You can see an example of the latter here - a wiki page containing raw data on one organization’s “scorecard” for one year’s senate membership. You can see that data then being displayed here, on the page for Senator Barbara Boxer, using External Data (currently only this one scorecard’s data is displayed for all senators, but I believe it’s due to get expanded soon).

The OpenCongress wiki is a fantastic site for people looking for political information; in addition, I think that, for wikis, it represents the shape of things to come. That’s because it serves as a “mashup” of many different data sources, bringing much more information to the user than relying on just the wiki’s own editors would. Different types of data are brought together in a relatively seamless way: free text written by regular wiki users; semantic data entered through forms; data from OpenCongress’ own database; data from outside APIs; data that’s not otherwise web-accessible (like the scorecard information); outside services like social-bookmarking tools and Google Maps; and “feed” sources like YouTube and Twitter. External Data, and the large and growing number of great data APIs around the web, make this so easy to do that I’d imagine it won’t be long before other wikis start to follow this same strategy.

Finally, on a side note, it might be mischievous of me to note that all this data integration is done without the use of RDF… but, whatever you think of RDF, that is the truth.

External Data 2.0 (actually 0.4, but same thing)

February 11th, 2009

Yesterday I released a new version of the External Data extension that allows it to, in addition to previous functionality, get a table’s worth of data (instead of just single values), and extract data from any wiki page holding values in CSV format. The more I think about it, the more I think these additions make External Data among the most important MediaWiki extensions I’ve released (or co-released, since Michael Dale contributed), or maybe even the most important, beating out Semantic Forms; I guess we’ll see.

I hope to write more about the “new” External Data at some point; for now, if you want to hear more about it and what I think its implications are, I’ll be talking about it tomorrow (Thursday) at 1:30 PM EST in session 5 of the semantic wiki conference call “mini-series”. Anyone is free to join in the call. There’ll also be other talks (including three from people I met at the Semantic MediaWiki users meeting in Boston), that should be quite interesting.

How Wikipedia enabled semantic wikis

January 16th, 2009

It turns out that yesterday was the eighth anniversary of the launch of Wikipedia, otherwise known as Wikipedia Day. So it’s probably as good a time as any to acknowledge the huge impact Wikipedia has had on my own career. It’s rare to say that a website has single-handedly brought into being an entire field of technology, but one could make a case that Wikipedia has done that for my field, semantic wikis - which is, of course, ironic, since Wikipedia itself does not use semantic technology. The site, though, has enabled what I do to come about in several different ways - enough that it’s hard to dispute the direct link. Here are the ways in which Wikipedia has made semantic wikis possible:

  • It taught the world about wikis. Most people, when they first heard about Wikipedia, a site where anyone can edit anything, probably had the same reaction: sounds like a recipe for disaster. To be sure, some critics of Wikipedia still say that’s the case; but for most of the hundreds of millions of people who read the site, seeing it work has been an eye-opening experience: the realization that a site where users can edit the content of any page can work. And for some users (including me), the realization that not only is it a workable solution, and not even just the best solution, but in some cases it’s the only solution for aggregating information in one place. And so Wikipedia’s proof-of-concept inspired many people to create their own wikis for their businesses, organizations or personal interests. I dare say that 99% of the people who have been involved with semantic wikis got their first experience with wikis by reading Wikipedia; I’m part of that group.
  • It has inspired researchers. Beyond just Wikipedia as a proof-of-concept, the idea of turning Wikipedia into more of a database-like information store has captured the imaginations of a lot of people. That’s how Semantic MediaWiki got its start: the first paper published about the project was titled “Semantic Wikipedia”, and the concept remains the holy grail for many of those involved with the project (not for me personally, though I can understand the excitement). And Freebase, the other major semantic wiki technology (in my opinion), which uses its own proprietary application, has billed itself as a “Wikipedia for data”; I wouldn’t be surprised if it was conceived that way too. (It’s an open question what will happen to Freebase if Wikipedia goes semantic, and thus itself becomes the Wikipedia for data.)
  • It has enabled the technology. MediaWiki, the wiki engine developed specifically for Wikipedia, is also, in my opinion, the best wiki engine, of the dozens that exist. It’s robust, scalable, and full of useful features. Two of those features have, I think, made it ideally suited for use in semantic wikis: templates and hooks. Templates enable the separation of data from data structure and presentation, which lets a semantic wiki approximate much more closely a regular database-driven website; while hooks, of which MediaWiki has hundreds, allow extensions like Semantic MediaWiki to integrate nicely into the rest of the package with little or no coordination between the extension developers and the main MediaWiki developers: that, in turn, allows for much faster development time. Neither one is a coincidence: the nature of Wikipedia and its massive size make conveniences like these into something more like necessities.

So, a big thank you to Wikipedia, and of course to its two co-founders: Larry Sanger, who had the idea to use a wiki to power the world’s first free online encyclopedia; and Jimmy Wales, who has guided the project successfully ever since.

New MediaWiki extension: External Data

January 13th, 2009

I’m pleased to announce External Data, my new MediaWiki extension; this is somewhere between my sixth and ninth released extension, depending on how you count it. External Data allows wiki pages to use and display values that were retrieved from an outside URL that itself holds XML or CSV data. It’s a very simple extension (my smallest one, I think), but I think it has some important implications for SMW. Using it, one Semantic MediaWiki-based site can get the data from another, using a query with the ‘CSV’ format, and then store it semantically. See here for an example of that usage, on Discourse DB - it displays and then semantically stores data that was retrieved from this page on semanticweb.org. Check out the source code of the first page for the specifics of how it’s done. This means that now the information from two or more semantic wikis can be combined together in one place, then queried, mapped, etc., as if it were all just one wiki’s data.
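For anyone who wants a feel for the basic idea, here is a rough sketch in Python (not the extension's actual code, which is PHP, and not its wiki syntax). It assumes the remote page returns CSV with a header row, and it maps chosen column names to local variable names; the function name `extract_values`, the sample data and the column names are all made up for illustration.

```python
import csv
import io


def extract_values(csv_text, mapping):
    """Parse CSV text that has a header row.

    `mapping` is {local_name: csv_column_name}; the result is
    {local_name: [one value per data row]}.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    result = {local: [] for local in mapping}
    for row in reader:
        for local, column in mapping.items():
            result[local].append(row[column])
    return result


# Hypothetical CSV, standing in for what another wiki's query might return.
sample = "Author,Position\nJane Doe,For\nJohn Roe,Against\n"

values = extract_values(sample, {"author": "Author", "position": "Position"})
for author, position in zip(values["author"], values["position"]):
    print(author, position)
```

Once the values are held under local names like this, displaying them or storing them again is a separate step, which is the part that lets one wiki treat another wiki's data as if it were its own.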

This idea of pooling data from different websites is of course the main concept behind the so-called Semantic Web (not a term I like all that much, but that’s a different story). At the moment, I can’t imagine that this extension will be used much for the classic semantic-web example, of gathering data from completely unrelated wikis (or what could be called a “mashup”); but for wikis and other online data sources that have already coordinated among themselves to split up the handling of data, I think it’s a very reasonable solution for doing that.

New Semantic MediaWiki hosting site

January 5th, 2009

The site Pseudomenon, which appears to have just been released yesterday, is the newest entrant to the small club of semantic wiki hosting sites. This is, as far as I know, the third site to offer hosting of Semantic MediaWiki, and the fourth to offer hosting of any sort of semantic wiki, the one non-SMW site being Swirrl. It’s the first, though, to support the Halo extension (also known as “SMW+”), which allows free-form semantic annotation and querying of wiki pages. Pseudomenon doesn’t include any other extensions at the moment, but the inclusion of Halo by itself makes it a helpful addition.

According to the main page, hosting is free, and every wiki gets a subdomain at iskey.info.

Apparently, the word “pseudomenon” is a reference to the Epimenides paradox, in which a Cretan stated “all Cretans are liars”. A snide commentary on truth in wikis? Well, at least it’s a real word, as opposed to the fake-Latin “Referata” I came up with, though I later found out that means, I believe, “reports” in Croatian.

Yaron has a wish list

December 21st, 2008

I can’t believe I haven’t mentioned yet that my Amazon wish list is up. Feel free to peruse, especially if you’re feeling charitable this holiday season toward, say, people whose software you use. As you can see, I have a definite scarcity of books about web design and open source.

Semantic MediaWiki conference call

December 9th, 2008

Do you like reading about semantic wikis, but really wish you could hear me talking on the phone about them? Well, you’re in luck, because I’ll be speaking in the 3rd session of the semantic wiki “mini-series” of conference calls, on Thursday. The last two sessions, which happened over the last two months, covered the broader world of semantic wikis; this one focuses specifically on Semantic MediaWiki. Markus Krötzsch, the lead developer of SMW, will talk about the core of the technology, and I’ll talk about “Semantic Forms, Semantic Drilldown, Semantic Result Formats, Semantic Google Maps, Semantic Compound Queries and Data Transfer” (evidently, I get bored easily). There will also be people from the Ontoprise corporation presenting their contributions, and some other presenters. Each presentation will also have a real-time slide show on the web. You can see the presentation time and phone number here (it depends on where you live), plus other details, and a place to RSVP (you don’t need to RSVP to watch/listen, but it’s strongly recommended).

More thoughts on SharePoint

December 8th, 2008

Thanks to Sergey Chernyshev, I got a chance to see Microsoft SharePoint for myself last week; he was nice enough to show me a lot of the interface. Or rather, he demo’d one of the two applications referred to as “SharePoint” - it’s still a little confusing to me, but I believe what he showed me was Windows SharePoint Services, as opposed to “SharePoint Server”. That’s good, because I believe that’s the application usually referred to as a competitor for Semantic MediaWiki. Anyway, seeing the SharePoint I saw, I could certainly understand how comparisons were being made. Windows SharePoint is all about collaborative data - users can create their own page types, with each page containing a set of fields; once you’ve created a page, there’s an automatic form created to let everyone add or edit data. Finally, there are “views” one can create, which are the equivalent of SMW’s queries - you can filter a set of pages by some set of criteria, and decide how to display the pages; just as with SMW, you can make tables, calendars, etc.

SharePoint is really closer to something like a set of public Excel spreadsheets with macros for entering data, or sites like Dabble DB, than to Semantic MediaWiki; with the caveat that SharePoint allows for uploading external files in a manner similar to MediaWiki. There are a few key differences that I can see. First, the strengths of SharePoint:

  • SharePoint allows for data permissions. You can set who can read and edit and who can’t for nearly every piece of data. At this point, that’s possible only through hacks with MediaWiki, and not at all with Semantic MediaWiki - basically, if someone can read any page on a wiki, they can pretty much read all of it. Is that a big advantage for SharePoint? I’m sure there are a lot of companies that see it that way.
  • SharePoint’s interface is very easy to understand. If you want to create a new type of page, there’s a nice wizard that guides you through it in a few easy steps. Pages are “pages”, views are “views”, and that’s all there is to it - there’s no need to understand templates, properties, parser functions or anything else, and the word “semantic” is blissfully out of sight. Contrast that, sadly, with Semantic MediaWiki, where even if you know how wikis work, you still have to spend, I would guess, at least an hour or two reading documentation before you can do a thing.

And the weaknesses:

  • SharePoint has no versioning. It is not a wiki. You can’t tell who made which changes to which data and when, and I believe that once a piece of data has been changed its old value is lost forever, except maybe in database archives. As the number of people who can modify a set of information grows, the value of complete versioning grows as well, until you reach the point when you literally can’t function without a record of every single change that was made. That, I think, is a big part of why permissions are so important in applications like SharePoint: you always have to keep the number of people who can change any piece of data to a manageable size; say a few dozen or a few hundred at the very most. Of course, companies can manage this way (they’ve been doing it for hundreds of years), but it’s not ideal.
  • In SharePoint, you can’t link data. Every field in a page is a standalone field. If you have a page representing a project, and there’s a field representing the project manager, and that field reads “Bob Hoover”, it’s just a string of letters. It won’t link to a page representing Bob, and there will be no way to connect that information about Bob to anything else we know about him. Yes, you can create a view to find out all the projects that are managed by Bob Hoover, but you can’t go to a page about him and see which projects he manages, plus which other projects he’s a part of, plus his phone number, plus which days he’s willing to carpool. In Semantic MediaWiki, that’s all easy to do.

Note that I’m just comparing the interfaces here - there are obviously huge differences in price, support, etc. etc., but I wanted to give my sense about the applications themselves.

Article on Referata at SemanticWeb.com

December 7th, 2008

Jennifer Zaino at SemanticWeb.com was kind enough to interview me for an article about Referata and Semantic MediaWiki, and here it is: “Get Your MediaWiki Hosting Here”. I think it’s well-written and gives a nice overview of the site.