New in Semantic Forms: file upload

May 29th, 2008

The latest feature added to Semantic Forms is file uploads, added yesterday. This is a big improvement to the technology, because file uploads are a standard part of real-world forms, and one of the last things that SF couldn’t duplicate yet; before, forms had to tell users to upload their files separately, then come back to the form when they were done. The way it works now is fairly ingenious, and was basically conceived by the developers of Chickipedia for use on their site. They were nice enough to send me their code, which meant that I didn’t have to do much work to add the feature. The real heart of this functionality is the fantastic Javascript library Floatbox, which is similar to the better-known Lightbox but allows for popping up full HTML pages, and doesn’t rely on external Javascript libraries.

Speaking of Chickipedia, it’s mentioned in the current issue of Wired in a tongue-in-cheek sidebar on the “8 Best Non-Wikipedia Pedias”. I was going to say that this is the first time a project I’m indirectly involved with was mentioned in Wired, but then I remembered that they had an insert about the interactive wine bar at the Adour restaurant a few months ago (that piece is not online, apparently), and I was actually even more directly involved with that project. So, it’s the second time. It still feels good.

Technology updates

May 8th, 2008

Some interesting technological improvements and news recently…

  • I released Replace Text, my latest extension, a week and a half ago. It’s a fairly minor extension, just doing a search-and-replace across the pages in a wiki, but it’s important for certain circumstances.
  • Semantic Drilldown had a big update about two weeks ago - it now supports multiple values per filter, finding the set of results that match any one of those values; some of the display has been improved as well. For example, from a German site, here’s a list of countries that are either constitutional monarchies or federal republics.
  • Sergey Chernyshev has also released two fantastic extensions recently: Header Tabs, which quickly applies a tabbed interface onto any page, and Widgets, which allows for the easy placement of any widgets (like videos, slideshows, feeds and many others) onto one’s wiki. Header Tabs has already gotten a lot of usage, though I think Widgets will be the more transformative one, enabling a whole new set of functionality without any need for programming.
  • One wiki that uses Semantic MediaWiki and Semantic Forms, that’s gotten a lot of buzz recently, is Cause Caller, which is actually an application around a wiki - the wiki gives information about American politicians, while the application lets you make phone calls to those politicians’ offices, to give them your opinion on various political issues. As far as I can think of, this is the first automated usage of data from a Semantic MediaWiki-based site; i.e. where the data is used for purposes other than just reading it. Its creator also made an entertaining screencast demonstrating the wiki. I think this might be the first online video to mention Semantic Forms by name. Unfortunately it doesn’t actually show a form (maybe there are more videos to come); though Header Tabs does appear quite conspicuously. It does remind me that I should put together my own long-planned screencast…
  • Other interesting SF-based wikis that have shown up recently: The Music Snob, a resource for musicians (which also has a nice usage of Semantic Drilldown), C-Pop Fantasie, a nicely-designed site that covers Chinese pop music, and BioVenturist.com, which covers biotechnology companies, technologies and venture capitalists, all important information.
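
To make the Semantic Drilldown change mentioned above a little more concrete, here is a minimal Python sketch of "multiple values per filter" matching. It is not the extension's actual code: the country data, the `drilldown` function and the rule that separate filters are combined with AND are all assumptions made for illustration.

```python
# Hypothetical data: each page maps property names to a single value.
countries = {
    "Germany": {"Government": "Federal republic"},
    "Sweden": {"Government": "Constitutional monarchy"},
    "France": {"Government": "Unitary republic"},
}


def drilldown(pages, filters):
    """Return the names of pages that satisfy every filter.

    Each filter maps a property name to a set of allowed values;
    a page passes a filter if its value is any one of them.
    """
    return sorted(
        name
        for name, props in pages.items()
        if all(props.get(prop) in allowed for prop, allowed in filters.items())
    )


result = drilldown(
    countries,
    {"Government": {"Constitutional monarchy", "Federal republic"}},
)
print(result)
```

With the sample data, the filter keeps Germany and Sweden and drops France, which mirrors the "either constitutional monarchies or federal republics" example.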

Book review: “Programmers at Work”

April 30th, 2008

I just finished reading a fascinating book: “Programmers at Work”, published in 1986 by Microsoft Press. It’s a series of interviews with notable programmers of the day; it’s out of print right now, sadly (I have a borrowed copy). The concept was to ask the biggest programmers of the day about what it’s like to be a programmer, and where they thought things were headed. The interview subjects are a real treasure chest, featuring a mixture of those still dominating the software market today (Bill Gates, Ray Ozzie (currently “Chief Software Architect” at Microsoft), Adobe’s John Warnock), those lauded for their former contributions (Apple Macintosh innovators Andy Hertzfeld and the late Jef Raskin, Pac Man inventor Toru Iwatani), and those whose contributions now live in obscurity, often after having been crushed by the Microsoft juggernaut (PFS:FILE’s John Page, Framework’s Bob Carr).

It’s a very interesting artifact because, for all of the contributions of computer programming, the craft of it rarely gets examined by the larger culture. For various reasons, but mostly the fact that programming looks like just about one of the most mundane tasks imaginable, basically involving sitting at a computer all day, there’s no great desire to hear the individual stories of programmers. (As a digression, it’s interesting to examine the handling of programmers in film: it’s too big a subject to ignore entirely, but given that there’s nothing interesting about watching someone typing on a screen, Hollywood seems to have fallen on a curious standard template, which is programmers-as-victims. Think about what we’ve seen so far: forced into a computer game in Tron, eaten by dinosaurs in Jurassic Park, hunted by a shadowy corporation in The Net, laid off and then nearly imprisoned in Office Space, hunted by a sinister CEO in Antitrust, attacked by Martians in Independence Day, hunted by ruthless virtual agents in The Matrix. For better or worse, apparently programmers are only compelling when someone’s after them. (By the way, if you think this is an incomplete list - I’m not counting the movies where the characters are just system hackers, like “Swordfish”).)

Anyway, as it stands “Programmers at Work” is fairly unique: a view of the software industry from programmers themselves. And it’s definitely a product of its time, when the state of the software industry could somewhat reliably be summed up by interviews with around 20 people. Today the field of software development is so huge, and so distributed, that I doubt you could find a set of interview subjects that could truly summarize the current state of programming. Not to say that it wouldn’t be an interesting attempt…

The book is a nice overview of what programming looked like during the PC revolution, when computers became a household fixture, and began to take over tasks like document creation, financial processing and mathematical calculations. It was a Wild-West era of software development, during which all manner of interface questions were still up in the air - is having a mouse useful, or is it just a fad? What’s the best way to do a copy-and-paste of text? Developers had to make all sorts of decisions we don’t think about any more, from small interface questions to giant ones like which brand of computer they should do their development for. The stakes were higher, too, because one or a handful of programmers could create an application that would take over the industry, as Jonathan Sachs and a few others did with Lotus 1-2-3.

The interviewees discuss the nature of programming, and some of the language might sound familiar to today’s programmers, despite the huge technology gap - the need to keep the entire structure of a system in your head while you’re working on it; the euphoria of finding what seems like the ideal solution to a problem; the advantages of creating “well-balanced code”.

Every interview contains a section asking what the subject thinks will be the future of computing. This makes for some interesting reading, obviously, because we know how the story turns out. Sadly, most of these sections are notable for what isn’t mentioned. None of the programmers, in 1986, anticipate the dominance that object-oriented programming would have over the programming world by the end of that decade; a few mention Smalltalk, which as I understand is object-oriented, but none mention the object-oriented C++, already out at the time, which, by the early 90’s, would become (I think) the dominant language of application development. Gates thinks that rule-based programming will be the wave of the future, which was not the case. There’s also no mention of relational databases, or of Oracle, which was already a fairly large company at the time, and would go on to become huge. There’s also no mention, either positive or negative, of open-source development, which had already had some great successes at the time with UNIX and the C programming language, would soon go on to more successes with Linux and Perl, and which, 22 years later, is set to surpass paid software as a business model. All of which is an interesting demonstration of William Gibson’s quote that “the future is already here; it’s just unevenly distributed.”

So what do the programmers predict? There are a variety of general comments that computers will continue getting smaller and faster, and, beyond that, testimonials as to how unknowable the future is. In general, the more specific the predictions, the more off-the-mark they are: a 25-year-old Jaron Lanier (now known as the “father of virtual reality”) talks about a secret project he’s working on. Reading between the lines of his description, it sounds like a virtual-reality application that lets people create software; he predicts that it “will really change the way people think about programming”. I don’t know what happened with that application, but suffice it to say that people still program goggle-less. Bill Gates and Gary Kildall (founder of Digital Research) both predict that CD-ROMs are the future of how people will use their computers - and perhaps they might have been, had the web not eclipsed their usefulness.

There are a few glimpses of the future that did occur, that pop up in some throwaway comments among the interviews. Gates talks about the email system at Microsoft, which helps him “keep up” with information. Michael Hawley mentions a graphics application he’s used that’s “spinning off to form a new company called Pixar.” Most tantalizingly, Page predicts that computers will become important as a source of information, and mentions as an example how he logs on to the National Weather Service to “get briefings through my computer.” I don’t know what type of service it was; undoubtedly some early internet protocol. Unfortunately the interviewer changes the subject, so this is the one mention in the book of using the internet for information retrieval, a simple concept that, within 15 or 20 years, would redefine the use of computers and, it’s no great stretch to say, change the world.

Page, in general, comes off somewhat like a hero from the book: besides anticipating the use of networked computers for information retrieval, he talks about elegance in software interface design, by “making a program more functional without increasing its complexity,” echoing the greater focus today on ease-of-use in applications. He laments the current bloated state of applications, “full of controls nobody ever touches or wants.” So how come no one today has heard of him or his software? Perhaps the software industry in the 80’s was no place for an aesthete: ordinary users valued simplicity of purchasing decision over simplicity of interface - a single operating system and suite of applications (Microsoft’s) would suffice, and whether or not it was well-designed was no great concern, as long as everything worked. Maybe that’s still true today, at least for desktop applications - most people still use Internet Explorer, after all, because that’s what comes installed on their PCs, even though the general consensus is that it’s the worst of the major web browsers currently out.

Is “the Semantic Web” a helpful term?

April 17th, 2008

The time has come, I think, to ask whether “the semantic web” is a good term to use; even though a lot of people use it, and I even belong to a semantic-web meetup or two. The problem with it comes, I think, because it creates an incorrect view in people’s minds of a structure that will show up at some point in the future, enabling various magical abilities. The phrase creates some confusion, in that it raises some unanswered questions: what will “the semantic web” actually look like? Who will create it? And how will we know when it’s arrived?

Now, it could be that enabling more semantic export of online data will indeed have some magical effects. My issue, though, is that phrasing it in such a way makes the whole endeavor more intimidating than it needs to be, suggesting that it’s a project that has yet to even really start. In fact, semantic technologies are not only with us already, but some are in widespread use. RSS is the obvious example: it’s a widely-used file format that displays information about blog posts, news articles and the like in a machine-readable way, so that, using a feed reader, one can be instantly notified about new posts, including their title and other basic information, from any of hundreds of thousands of sources. That’s as semantic as it gets.

More generally, there’s obviously plenty of structure already in the non-semantic (“syntactic”) web. Sites that cover everything from weather to shopping to reference to news display their data in a structured way, retrieving it from relational databases. In some cases, like Amazon, APIs are provided so that one’s application can retrieve this data directly. But even if there’s no API, or other semantic export of the data, it can be retrieved anyway, through web scraping. The NewYorkNabes, which I did the programming for, is one of maybe tens of thousands of examples - it gets its real-estate-price information by going to a set of URLs on newyork.backpage.com once a week, finding the relevant prices within the HTML, and taking their median. If Backpage were to additionally publish their data in RDF form, they would be a true semantic web site, and it would be easier for my code to get that same data. But functionally, things would look exactly the same to users as they do now. You could argue that the difference is that the semantic web data would be retrievable even if the look of the site changed: web scraping is a fragile endeavor, and in theory the system can break if any part of the HTML, like just a font color, is changed. But if you think about it, the same holds true for semantic data: if the owners decide to change a property name from “Price” to “Rental price”, the system will break just as easily. Neither approach offers a full guarantee, and they both require maintenance: the difference is only one of degree, not kind.
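
To illustrate the scraping approach and its fragility, here is a rough Python sketch. It is not the NewYorkNabes code: the HTML snippet, the `price` class name and the helper functions are all made up for the example, and a real scraper would fetch the pages over HTTP rather than use an inline string.

```python
import re
from statistics import median

# Hypothetical listing markup; the real Backpage HTML is not shown in this post.
SAMPLE_HTML = """
<div class="listing"><span class="price">$1,800</span> 1BR in Astoria</div>
<div class="listing"><span class="price">$2,400</span> 2BR in Astoria</div>
<div class="listing"><span class="price">$2,000</span> Studio in Astoria</div>
"""

# The scraper depends on this exact markup staying the same.
PRICE_PATTERN = re.compile(r'<span class="price">\$([\d,]+)</span>')


def extract_prices(html):
    """Pull every price out of the HTML as an integer number of dollars."""
    return [int(match.replace(",", "")) for match in PRICE_PATTERN.findall(html)]


def median_price(html):
    """Return the median listed price, or None if no prices were found."""
    prices = extract_prices(html)
    return median(prices) if prices else None


def price_from_record(record, property_name="Price"):
    """Look up a price in a semantic-style record by its property name."""
    return record.get(property_name)


print(median_price(SAMPLE_HTML))
```

Both kinds of breakage described above show up here. If the site renames the CSS class, `extract_prices` silently finds nothing; if a semantic export renames the property to “Rental price”, `price_from_record` finds nothing either. In both cases someone has to notice and update the code.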

I’ve sometimes thought that a good analogy for the value of storing data semantically is a well-organized kitchen: if a kitchen has all its tools and supplies logically arranged and in their place, then it’s easy to find any particular item, and, maybe just as importantly, to know if an item is missing, so that if you don’t see it you won’t end up spending an hour looking for it. If you walk into such a kitchen, even if you’ve never been in it before, you’ll probably be able to start cooking right away. By contrast, the regular web can be compared to a disorganized kitchen, where everything is strewn all around, mixed in haphazardly: the blender could be anywhere, and if there’s no baking soda in the kitchen, good luck determining that for sure.

But this analogy also highlights the gray area between “semantic” and “syntactic”. After all, there’s no such thing as a perfectly-organized kitchen, since any two people’s conceptions of how things should be organized will be different. If you’re looking for wine glasses, will you look near the ordinary glasses, or near the fancy plates? However you arrange things, some people still won’t be able to find what they’re looking for right away, because they’re expecting it elsewhere. Similarly, there are always ambiguities in data - to take one small example, retrieved from this fascinating list of “edit wars” that have emerged in Wikipedia over silly data ambiguities, if you’re displaying consumer products on your site, do you refer to a regular iPod as an “iPod” or (the new term) an “iPod classic”? Even the most carefully-laid-out semantic data will still need some human analysis, and “massaging” of the data, to be usable in an application, and to be aggregated with other data sources, because there’s always ambiguity or differences of opinion over how data should be structured.

But if semantic data can resemble syntactic data, the reverse is true as well. To go back to the analogy, even the messiest kitchen is still usable: if you found yourself having to use one, given enough time, you could figure out where everything is and muddle through. After a few months of working in one, you could probably accomplish everything that you could in a well-organized kitchen. It wouldn’t be nearly as enjoyable, of course, but it would be possible. The comparison can be made to a site like NewYorkNabes, which by its nature is a hack, but it works. The difference between syntactic and semantic, again, emerges as one of degree.

My point here is not that all the talk about the benefits that semantic technology like RDF and OWL will bring is overhyped: I won’t try to predict the changes that they will or won’t bring, but I would guess that there will be some substantial benefits to their adoption. I just think “semantic web” is a bad way to describe this technology, because it makes it seem like a goal to be accomplished, so that one day people can say, “the semantic web has been created”, instead of what I think is the more realistic description, which is a gradual process that began a long time ago of making data more accessible. Instead of “the semantic web”, I think I prefer the terms “semantic technology” or “semantic representation”, or even “semantic web technology”. Heck, even “Web 3.0” is fine with me, since people understand that “Web 2.0” is about a set of technologies and not a separate structure - an adjective, not a noun.

Presentation, and semantic applications as frameworks

March 28th, 2008

I gave a talk two weeks ago (I really need to update more frequently), with Sergey Chernyshev at the New York Semantic Web meetup, on the subject of Semantic MediaWiki, Semantic Forms, and some of the associated extensions (what I sometimes try to call the “MediaWiki semantic suite”). The audience was about 25-30 people, and I thought it went well - I got the sense that people understood the basic philosophy of structured semantic wikis by the end. In a bit of self-reference, you can also see the page for this presentation on Sergey’s wiki, techpresentations.org.

They say you always learn about the subject matter when you explain it to others, and for me this was no exception, perhaps surprisingly since it’s been over a year of working on the project. One thing I brought up during the talk, that I hadn’t fully thought of before I started preparing for the talk, was how a semantic representation makes creating generic software solutions for data very easy. One big thing in web programming lately has been frameworks - Ruby on Rails, most notably, but also Symfony for PHP, etc. These frameworks all make web development easier by looking at the structure of the tables in one’s database, and making all sorts of assumptions about how that data will be used - if you have a table called “Cars” with various fields, chances are that the application that uses it will need a class called “Car”, holding those same fields, and will need some web page to let someone add a car by filling in those fields, another one to display a single car and all those fields, another one to delete a car, etc. So the framework does this work for the programmer in advance, eliminating the need for a lot of low-level hacking. Well, extensions like Semantic Forms and Semantic Drilldown work in the same way, though their job is significantly easier because, instead of having to deal with numerous tables, with numerous fields in each one, there’s essentially only one database table, with just three fields, holding the full store of both data and meta-data (e.g. the type of each field) in semantic-triple form: no need to deal with all the complexities that a database structure can possibly have. Similarly, one can easily change the “data structure” of a semantic data set without needing to do any re-coding.

Semantic representation is usually described as useful because it lets you easily share data, but I think its flexibility as a data source for generic applications might be equally important.
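
Here is a small Python sketch of that idea, under some loud assumptions: the triples, the “Property:” naming and the “Has type” label are illustrative stand-ins rather than Semantic MediaWiki’s actual storage schema, and `form_fields` is a hypothetical helper, not anything from Semantic Forms. The point is only that data and meta-data sit in the same three-column store, so generic code can be driven entirely by what it finds there.

```python
# Hypothetical triple store: (subject, property, value).
# Type information about each property is stored as triples too.
triples = [
    ("Property:Make", "Has type", "String"),
    ("Property:Year", "Has type", "Number"),
    ("Honda Civic", "Make", "Honda"),
    ("Honda Civic", "Year", "2006"),
]


def property_types(store):
    """Map each declared property name to its type, read from the meta-data triples."""
    return {
        subject.split(":", 1)[1]: value
        for subject, prop, value in store
        if prop == "Has type" and subject.startswith("Property:")
    }


def form_fields(store):
    """Describe one form input per declared property, as (name, input kind) pairs."""
    widgets = {"String": "text", "Number": "number"}
    return [
        (name, widgets.get(type_name, "text"))
        for name, type_name in sorted(property_types(store).items())
    ]


def page_data(store, subject):
    """Collect all property values recorded for one subject."""
    return {prop: value for s, prop, value in store if s == subject}


print(form_fields(triples))
print(page_data(triples, "Honda Civic"))
```

Changing the “data structure” is then just a matter of adding a row: appending `("Property:Color", "Has type", "String")` to the list makes a Color field appear in `form_fields` with no change to the code.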

Goodbye, Bear Stearns

March 19th, 2008

Bear Stearns, the “storied 85-year-old brokerage”, has had a near-instant collapse and apparently no longer exists. I worked there as a programmer for a good while, before quitting two years ago to become a web programmer. It was a nice group of people I worked with; I hope they’re all making out alright. The company, as far as I knew, had a reputation for being conservative; they sat out the dot-com bubble of the late 90’s, and everyone was wearing suits and ties while other companies were still on “business casual”. But the financial world is full of merciless randomness and seems to reward and punish participants with no regard for their financial strategy.

I used to note that, due to my tendency to work at startups, every company I’ve worked at except for Bear Stearns no longer exists; bizarrely, now I have a near-perfect track record again. I never could have guessed that my job security was no more guaranteed at the bank with the 40-story building and cafeteria than it was anywhere else.

The web in one line

March 16th, 2008

Appropriately, since I just mentioned them, my friend Nick’s company, which is being “incubated” by Y Combinator, just launched: Wundrbar. (There’s a German-language pun in the title, which I think is intentional, but I can’t remember now.) It displays a single command line that lets you do so-called “deep searching” (going directly to the relevant page on a website) for various sites for weather, shopping, reference, etc., as well as actions like blogging and emailing. You could compare it to YubNub, which has a similar concept, but it’s less overwhelming in its options (in my opinion) and more geared toward consumer applications. It fits in exactly with my philosophy of making the web easier to use, and I wish them best of luck with it.

Fast, cheap, etc.

March 10th, 2008

Some interesting thoughts about software development and work in general, all via 37 Signals’ Signal vs. Noise:

“Programmer happiness is the most important factor in making quality software”. I completely agree. The author calls this approach “emo programming”, which - well, I like emo the music genre, so I can’t really complain.

Six Principles for Making New Things: “find (a) simple solutions (b) to overlooked problems (c) that actually need to be solved, and (d) deliver them as informally as possible, (e) starting with a very crude version 1, then (f) iterating rapidly.” Written by Paul Graham, whose “incubator”, Y Combinator, a friend of mine is currently working at; so I hope it’s good advice. I mean, I know it’s good advice; that’s been my philosophy for a while now.

In praise of lazy.

I am a hacker

March 6th, 2008

You can see my first-ever MediaWiki change, added earlier today right here.

Yes, Wikipedia runs on a few hundred thousand lines of code, and I wrote exactly one of those lines. You can thank me later.

And one more extension: Data Transfer

February 22nd, 2008

It seems a little odd to release my next MediaWiki extension already, since I released the last one, Semantic Calendar, only two weeks ago, but here it is: Data Transfer. It currently does only one thing, which is to display the data from one or more categories and/or namespaces of a wiki in XML form. Here’s an example, from the German-language “Verwaltungskooperation” wiki, of the XML for every project on the wiki. (I always like to link to wikis that aren’t my own, lest people start to suspect that my applications are really only interesting to me.) This XML was generated automatically, using the template calls within each page, with no need for customization work on the part of the administrator.
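
For a rough idea of how template calls can be turned into XML, here is a Python sketch. It is not Data Transfer’s code, and the element and attribute names (`Pages`, `Page`, `Template`, `Field`) are invented for the example rather than taken from the extension’s real output format. It also assumes simple, non-nested template calls with named parameters.

```python
import re
import xml.etree.ElementTree as ET

# Matches a simple, non-nested template call such as {{Project|Status=Active}}.
TEMPLATE_PATTERN = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def parse_template_calls(wikitext):
    """Return a list of (template name, {field: value}) pairs found in the text."""
    calls = []
    for body in TEMPLATE_PATTERN.findall(wikitext):
        parts = [part.strip() for part in body.split("|")]
        fields = {}
        for part in parts[1:]:
            if "=" in part:  # unnamed parameters are skipped in this sketch
                key, value = part.split("=", 1)
                fields[key.strip()] = value.strip()
        calls.append((parts[0], fields))
    return calls


def pages_to_xml(pages):
    """Build an XML string from a {page title: wikitext} mapping."""
    root = ET.Element("Pages")
    for title, wikitext in pages.items():
        page_el = ET.SubElement(root, "Page", {"Title": title})
        for name, fields in parse_template_calls(wikitext):
            template_el = ET.SubElement(page_el, "Template", {"Name": name})
            for key, value in fields.items():
                field_el = ET.SubElement(template_el, "Field", {"Name": key})
                field_el.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical page content.
example_pages = {"Project A": "{{Project|Status=Active|Lead=Anna}}\nFree text."}
xml_output = pages_to_xml(example_pages)
print(xml_output)
```

Because the structure comes straight from the template names and their fields, nothing here needs to know in advance what a “Project” is, which is the sense in which no customization is needed.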

This extension was actually developed in August and September of last year, so it’s not like I’ve been on a recent creation tear. I just never got around to packaging it and releasing it until now, with all the various other projects I was working on. I created it originally while talking to someone about the use of wikis as part of a set of applications, and how to transfer data between those applications. Yes, Semantic MediaWiki already has built-in export and import of data using RDF/OWL, and in terms of a generic solution those are very useful; but if one has set up the data structures ahead of time and simply wants to get the information from point A to point B, XML is the far easier way to go.

By the way, the plan is that Data Transfer will eventually handle importing data into a wiki as well, using the same XML format. It may also support other formats, like CSV (“comma-separated values”); that’s why I gave it such a generic name.

I consider this extension part of the Semantic MediaWiki “family”, even though it doesn’t require the presence of SMW (it can make use of SMW for one small additional feature, but that’s a minor detail). I see it as related to the other SMW extensions because of something I’ve been thinking about for a while - I strongly believe the following: structured information should be semantic, and semantic information should be structured. In other words, if you’re going to all the trouble to create structures for your data, you may as well store it in such a way that it can be queried and otherwise reused; there’s no reason not to. Conversely, if you’re going to store information in a way that can be reused, it should be in a structured, consistent form across your set of data; otherwise, you risk ending up with a chaotic mess of data that’s not reusable by anyone.

This idea is implicit when using a relational database to store one’s information, since those provide both structure and “meaning”/reusability at the same time. However, with the advent of self-publishing online on a mass scale, including the use of wikis, I think it becomes a bigger issue, since it enables a lot of data to be created that doesn’t fit this rule. The biggest example of structured data that isn’t semantic is probably Wikipedia, which has tens of thousands of data templates and tables assembled, but with no way to access that information. And the biggest example of semantic data without structure might be… the web itself, if you ask some people; some in the field of AI/natural-language processing would say that the web contains vast amounts of information, that just need to be deciphered correctly by the right language tools and then assembled into a queriable form. I believe that such information, not having been structured to begin with, is just too chaotic to do anything meaningful with.

Anyway, given all that, I think Data Transfer is a semantic-related extension because any wiki that stores data-holding templates across a lot of its pages really should be semantic, even if it’s not already.

New MediaWiki extension: Semantic Calendar

February 8th, 2008

My latest MediaWiki extension, released a few days ago: Semantic Calendar. This one’s my third, if you’re keeping track; it lets people create calendars that rely on semantic date information. This brings the Semantic MediaWiki setup, in my opinion, a step closer to fully mimicking the standard built-in utilities of regular CMS applications: calendars, to-do lists, etc., with all the additional benefits of being a wiki. This extension was developed for use by the aforementioned Region III Levees site (yes, I’m in communication with some of the people I write about); you can see its first usage here.

Two more SF sites

January 30th, 2008

Two notable sites have popped up recently that use my Semantic Forms extension to enable collaborative editing of data, pretty much on opposite ends of the spectrum.

First, the FEMA Region III Levee Accreditation Knowledgebase, which is currently being assessed by various people at FEMA’s “Region III”, which is essentially the mid-Atlantic states of the United States, as a possible system for recording the readiness of levees. Post-Hurricane Katrina, the state of the nation’s levees has obviously become much more portentous, and the fact that my technology could become part of the solution is thrilling.

The other site is the delightfully crassly-named Chickipedia, a wiki for information about attractive female celebrities. Launched by Break.com about two weeks ago, this is by far the highest-profile site so far to use Semantic Forms. The interesting thing is that, of all the many write-ups the site has gotten online, none have indicated awareness of the semantic technology that powers it, or the fact that, unlike Wikipedia (which it deliberately compares itself to) it provides forms for entering content; instead the focus has been, perhaps understandably, on the, er, ample content. On the whole, that’s fine by me, since my goal in the extensions I’ve created has always been to make the underlying semantic technology as invisible as possible. It’s true that it would be nice to get more recognition of the technology, although I suppose that’s what I’m helping to do now.

So, those are the two big new sites; I like the Apollonian/Dionysian, James Bond-esque duality of it.

Semantic wikis are the future of information

December 31st, 2007

I’d been thinking about writing this essay for a while, and initially I planned to phrase the title as a question, i.e., “Are semantic wikis the future of information?” But it’s important to have confidence in one’s own line of work, and, in any case, over time I’ve become more and more convinced about the truth of the statement.

First, some definitions: a wiki is a web site whose pages are editable by many people, directly through a browser. And semantic wikis are wikis that allow for storage of both regular text and queriable data. More on those later.

We all know that the easiest, fastest way to find information currently is on the web. (Whether it’s also the best way is a matter of opinion, but I think no one could argue with its speed.) There’s a staggering amount of information online, not just in written text but also as images, audio and video, and a good percentage of it is indexed by various search engines, so if I’m at a computer with an internet connection anywhere in the world, there are literally millions of interesting questions that I can have answered within a minute or less. That’s an incredible thing, and obviously a new state of affairs in human existence.

But for all the almost-magical power and scope of the web, there are rather severe limits to the kinds of information that can be gotten there. For an overview and basic facts on just about any subject, you can look it up on the web and generally find what you’re looking for (and sometimes much more), but to have a specific, non-basic question answered is a much trickier proposition. On the web, for instance, I can easily find a list of the highest-grossing films of all time, but not the highest-grossing films of, say, 1960; I can find the population of Nairobi, but not a list of the 50 largest cities in Kenya; I can find the current GDP of France, but not the GDP of France in 1850, let alone the percentage of GDP that taxes represented in that year. I can easily find quotes from any of William Shakespeare’s plays, but quotes about any of his plays are quite a bit harder to find. And, to take a more practical example, if I want to buy a pair of sneakers in New York City, I can find local footwear stores, but not a list of stores that carry, say, Gola sneakers and are still open at 8 PM.

In short, there are whole large branches of information that just can’t be found currently on the web. If you want the answer to such a question, you have a few options: you can look it up in some book, or ask someone who’s knowledgeable on the subject (this, it’s true, can be done online). But those of us who have been spoiled by the web prefer to have an answer in 30 seconds over one in several hours or even days. Consider all the questions that come up that we never get the answers to, because the amount of time and/or effort it would take is greater than what we think the importance of the answer is.

Ah, but the amount of information available online is still increasing rapidly; what will happen when there’s 10 or 100 times that amount available, in blog posts, Wikipedia entries, online articles and the like? And what if all those books at the library get scanned in? Actually, I don’t think that will change the basic situation very much. I contend that there are inherent properties of information that make answering non-trivial questions difficult:

  • Information is messy. Many questions require more than a one-word answer, because whatever answer is given needs clarification. Different terms mean different things to different people. What is the largest lake in the world, for instance? Three different bodies of water, the Caspian Sea, Lake Baikal and Lake Superior, all lay claim to that title, depending on what exactly you mean by size and what you mean by a lake. To take another example, currency inflation is ongoing, so what does it mean to say that the movie “Gone With the Wind”, made in 1939, earned a certain amount in today’s dollars? If someone bought a DVD of it for $20 five years ago, how exactly does that get added to the total? These are the kinds of issues that, if they can get resolved at all, can usually only be resolved by experts, and even then rarely definitively.
  • Information is controversial. Not only can information be messy, but in some cases people will have a vested interest in what the answer will be. That holds especially true when the information deals in some way with hot-button political issues. Infant mortality rates around the world are one example: different countries have very different standards for what should and shouldn’t be counted (whether prematurely-born babies should be counted, for instance), and that’s in part the case because those numbers are used to justify various public policy decisions. Civilian casualty rates and economic indicators are two other types of statistics with similarly highly-charged meaning. It means that any information provided about a controversial topic will be viewed with great suspicion by at least one side and maybe both, which means that you can’t just provide the data and be done with it; you need to show all the work taken to create that data.
  • Information is spotty. In many cases, especially for historical information, there are facts we don’t know, or for which the only sources are unreliable. In such cases, it’s good to know, to the extent possible, that such information can’t be found because it’s unknown, not because a search engine can’t find it or because no one’s bothered to enter it yet.
  • Information is distributed. Even if all the information exists and is relatively non-controversial, putting it all together in one place could be a challenge. In the very simple case, there’s a lot of information one can find out about any city in Kenya, but to put it together in such a way that one could answer complex questions like, say, what the largest city founded after 1920 is, is something someone would have to do themselves, for the moment. More complex data might involve putting together information in different formats, found in different places, in different languages, etc., and that is a task that would remain a challenge even if the web contained 100 times the information it did.
  • Information is always changing. The world is always changing. Thus, information that is meant to describe the world as it currently exists is always in danger of expiring, if it hasn’t expired already. Every aspect of our lives can change, from a restaurant changing its phone number to a country being split in two. Even scientific and historical facts, which people think of as more fixed, can change, as new evidence comes in that causes historians/scientists/etc. to change their minds about something.
  • Information is voluminous. This may be the most challenging aspect of all about information - its sheer size. Information is, at its heart, a description of everything that happens in the world, and so there’s no limit to how much information can be expressed. People have written multi-volume memoirs about their own lives, and even then the vast majority of what actually happened from day to day is left out; and that’s just a single person’s life. Sure, trying to include everything would make for an unreadable (in both senses) book, but on the other hand what’s considered interesting and what isn’t varies widely from reader to reader. Even the kind of information that could be described as worthy of being recorded is basically limitless.

So, given all these challenging aspects to information, what is the best way to make more of it accessible? Many people might say that the solution is to simply keep adding more information into Wikipedia. Wikipedia is a tremendous resource, it’s true, and in the eight years that it has been around, it has become, in some people’s minds, synonymous with the very concept of online information: when you want to find out about something, you look up its article on Wikipedia, using a web search only as a convenient way to get there. Wikipedia actually deals well with many of the challenges of information I listed earlier, because it’s a wiki: it handles messiness and controversiality of information by giving users a place to discuss any dispute over information, and by providing unlimited space for people to show the exact steps, and references, by which the information was obtained. And it handles the distributed, ever-changing and massive nature of information by letting anyone in the world change any piece of text at any time, while providing versioning tools to prevent that mass collaboration from leading to total chaos. There is simply no way for a small, paid group of professionals, however fast-working, to duplicate the work that Wikipedia’s hundreds of thousands of volunteers do.

Wikipedia, however, could not be a general solution to holding information for a few reasons. First of all, it’s an encyclopedia: it is meant to contain only information of general interest. A lot of text, and entire articles, routinely get deleted because they’re not considered important enough. To take one small example, a restaurant or store will not get included unless it is famous. That means that a lot of information important to a smaller, specialized group of people will not be included in Wikipedia. One solution to this problem has been to create many wikis for a whole variety of different subcategories and niches of information: there are food wikis, travel wikis and the ubiquitous pop culture wikis. These are all for the good, since they certainly increase the sum total of online information. But they have an important weakness, one shared by Wikipedia itself: they lack queriability.

What is queriability? It’s the ability to aggregate information, based on filters. So, for the example of wanting to know the largest cities in Kenya, it’s not enough to have information on each city, as Wikipedia does. There must be some way to combine that information, to generate new lists with the information we want. Wikipedia already does this to some extent, using categories and pages that are simply lists, but it’s not nearly enough. There’s already a list on Wikipedia of famous Swedish-Americans, but Wikipedia will never have a list of, say, politicians of Swedish ancestry from around the world. Similarly, on a more practical level, you can imagine a wiki that has detailed information on local stores, but it won’t let you easily find the list of stores that carry a certain product and are open at a certain time - for that, you’d need to hunt through the data to find the matching set on your own.
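To make the idea of filter-based aggregation concrete, here is a minimal sketch in Python. It is not how any particular wiki software works; the `Store` record, the store names, brands and opening hours are all invented for illustration. The point is only that once each page's facts are stored as structured fields rather than free text, the "stores that carry a product and are open at a given time" question becomes a short filter.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Store:
    """Hypothetical structured record for one store page."""
    name: str
    brands: frozenset
    opens: time
    closes: time


# Invented sample data, standing in for facts entered by wiki users.
STORES = [
    Store("Store A", frozenset({"Gola", "Brand X"}), time(9, 0), time(21, 0)),
    Store("Store B", frozenset({"Gola"}), time(10, 0), time(18, 0)),
    Store("Store C", frozenset({"Brand X"}), time(9, 0), time(22, 0)),
]


def stores_carrying(stores, brand, open_at):
    """Return the sorted names of stores that carry `brand` and are open at `open_at`."""
    return sorted(
        s.name
        for s in stores
        if brand in s.brands and s.opens <= open_at < s.closes
    )


# Which stores carry Gola and are still open at 8 PM?
print(stores_carrying(STORES, "Gola", time(20, 0)))
```

The same records could answer many other questions (all stores open before 10 AM, all brands carried somewhere, and so on) without anyone writing those lists by hand.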

That’s where semantic wikis come in: they allow for automatic aggregation of information and the creation of lists, and thus they provide answers to questions that were previously unanswerable online. The data entered by the users can function like data in a database: it can be sorted, graphed and analyzed in many different ways. There’s a side benefit to this newfound ability, which is that it provides a much greater incentive for users to enter this kind of data. There already exists a great deal of structured data in table and template form in Wikipedia, thanks to the tireless efforts of volunteers, but the GDP of France in 1850, for instance, is not (at the time of this writing) among them. If someone had that information, and they wanted to add it to Wikipedia, they could undoubtedly find a place to put it, but it has not yet been a priority for anyone. But what if this and other economic data were graphable, and able to be turned into a variety of list data, instead of just being a standalone number? All of a sudden, the value of having that number in place would increase, its absence would be felt more noticeably, and thus the chance of it appearing would increase. Thus, even someone looking for that number by itself would benefit from the wiki it was in being semantic, even though they weren’t using the data semantically.

This is not a strictly theoretical discussion: there already exists a way to make semantic wikis, using open-source software: MediaWiki (the application that Wikipedia uses), along with an extension, Semantic MediaWiki, that enables pages to store their data semantically, and a suite of other extensions that use Semantic MediaWiki to enable easy editing, filtering and mapping of data (some of which I created). I wouldn’t be presumptuous enough to say that MediaWiki and these extensions themselves constitute the future of information, but they do fit the criteria that I think any such solution has to have: they allow for many people to collaborate on data through versioning tools, they provide forms for people to enter and edit data with, so that data structures are pre-defined, and so that users can easily add and modify data, and they make it easy to modify the data structures themselves. This last property is important because of the inherent messiness of information mentioned earlier: for data of any complexity, there will always come cases that don’t quite fit into the structures provided, and when that happens it’s important to let users modify these structures, using the same processes of collaboration and discussion that already exist for modifying data.

Wikis that use MediaWiki and its semantic suite of extensions already exist, so in a sense the future as I’ve defined it is already here. My site, Discourse DB, uses these extensions (and was the testing ground for some of them), and has, among other interesting pages, a list of opinions (for, against and mixed) on whether U.S.-led coalition forces should pull out of Iraq. It’s far from a complete list, but it is, as far as I know, the only place where you can find anything even resembling a comprehensive list online, or perhaps anywhere. Mograph Wiki, another wiki that uses this combination of MediaWiki and plugins, has a listing of all companies and freelancers that do motion-graphics work in London, England; again, as far as I know, no similar list exists anywhere else in the world. And, to offer one more example, I don’t think there’s a list anywhere besides the Electronic Literature Organization’s Archive-It wiki of the electronic literature works published in 2004. The interesting thing about all of these lists is that none of them were created by hand; they are all generated automatically when the page is accessed, by querying other data in the wiki. Yes, it would be possible, in a few hours’ worth of time, to generate any of these lists by hand, using the other information found on the wiki, but to create the sheer volume of lists possible on each of these sites, and to keep them all constantly updated, quickly becomes impossible except by automation.
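The difference between a hand-made list and one generated on access can be shown with a small Python sketch. This is a toy model, not the actual query mechanism of Semantic MediaWiki: the `pages` dictionary, the page titles and the `list_page` function are all hypothetical. The list is recomputed from the current data every time it is requested, so a newly added page shows up without anyone editing the list itself.

```python
# Each wiki page is modeled as a dict of properties; all of this data is invented.
pages = {
    "Opinion 1": {"topic": "Topic X", "position": "for"},
    "Opinion 2": {"topic": "Topic X", "position": "against"},
}


def list_page(topic):
    """Build the list for `topic` from the current pages, grouped by position."""
    grouped = {}
    for title, props in pages.items():
        if props["topic"] == topic:
            grouped.setdefault(props["position"], []).append(title)
    return {position: sorted(titles) for position, titles in grouped.items()}


before = list_page("Topic X")

# A user adds a new page; nobody touches the list itself.
pages["Opinion 3"] = {"topic": "Topic X", "position": "for"}

after = list_page("Topic X")
print(before)
print(after)
```

Because the list is just a stored question, the cost of keeping it current is zero, which is what makes maintaining a large number of such lists feasible.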

These are three rather specific examples, but then again the number of semantic wikis in the world is still rather small, probably fewer than 100 at this point. As the number grows, it’s easy to imagine the possibilities for the same semantic treatment being applied to a whole variety of other branches of knowledge.

There’s another important usage for semantic wikis, which is to serve as a data source for other applications. Imagine, for instance, a mapping application that’s able to provide information on nearby establishments or other points of interest for a given location, with the data taken from different wikis (some would call such an application a mashup). Or picture an application that lets the user navigate through historical or economic data, or one that automatically creates music playlists based on song properties like tempo and genre, using information from a song wiki. In these and other cases, the open nature of a wiki’s data, along with the ease with which it can be kept up-to-date, make a semantic wiki ideal as a data source. It’s a usage that has so far eluded Wikipedia and other conventional wikis, because it’s just too hard to extract data from a text-only wiki. For the MediaWiki-based solution, this capability too already exists: Semantic MediaWiki provides export into RDF, a format for semantic data that’s easily queried externally, through standard query languages like SPARQL. The conversion from wiki data to RDF (or other, similar formats like OWL) is easy because the data is already stored semantically to begin with.
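As a rough illustration of why semantically stored data is easy for outside applications to consume, here is a Python sketch of the playlist idea. RDF expresses facts as subject-predicate-object triples; the sketch imitates that shape with plain tuples and a small pattern matcher. It is a toy stand-in, not real RDF and not SPARQL, and the song names, the `genre` and `tempo_bpm` property names and the `match` and `playlist` functions are all assumptions made up for the example.

```python
# Facts as (subject, predicate, object) triples, the general shape RDF uses.
# All songs and values here are invented.
TRIPLES = [
    ("Song A", "genre", "funk"),
    ("Song A", "tempo_bpm", 110),
    ("Song B", "genre", "funk"),
    ("Song B", "tempo_bpm", 92),
    ("Song C", "genre", "jazz"),
    ("Song C", "tempo_bpm", 120),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def playlist(triples, genre, min_bpm):
    """Return sorted song names of the given genre with tempo at least `min_bpm`."""
    in_genre = {s for (s, _, _) in match(triples, predicate="genre", obj=genre)}
    return sorted(
        s
        for (s, _, bpm) in match(triples, predicate="tempo_bpm")
        if s in in_genre and bpm >= min_bpm
    )


print(playlist(TRIPLES, "funk", 100))
```

A real application would load the wiki's exported RDF with an RDF library and express the same pattern as a SPARQL query, but the principle is the same: the application asks for facts by shape rather than scraping them out of prose.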

So this is the kind of information storage that semantic wikis can allow: public, flexible, allowing for easy modification, and easily exportable. In short, it seems like the ideal solution for handling the chaos that is real information.

Video roundup

December 24th, 2007

A few interesting online videos I’ve seen recently…

Creating a multi-touch interface using the Wii controller - apparently, thanks to the Nintendo “Wii-mote”, a Minority Report-style interface is already possible today, at a cost of around $40.

“Sensual Seduction”, the video for Snoop Dogg’s current single. This has to be one of the strangest videos I’ve ever seen for a number of reasons, but one is that the vocoder-style pitch correction makes it sound like he’s an amazingly good singer.

Trajan is the movie font.

“Heard ‘Em Say”, Kanye West - a video came out for this song two years ago that was just a black-and-white illustrated animation. Well, apparently there was also a real, live-action video filmed for the song, directed by the famed Michel Gondry, that was never released for some reason, but has since been released. And here it is - I think it’s quite nice.

Couldn’t finish this one without a plug - the Today Show’s holiday shopping roundup included a mention of Eye-Fi, my brother’s company. Matt Lauer was enthused.

The sites who would be Wikipedia

December 17th, 2007

A lot of sites have been getting press recently as attempted rivals for Wikipedia, attempting to correct one or another of its perceived flaws (that it’s too amateurish, unreliable, biased, etc.), or at least get in on the staggering amount of page views it receives. Below is my roundup of the ones I know about…

Citizendium - the most well-known rival to Wikipedia at the moment, I think. It’s run by Larry Sanger, Wikipedia’s less-famous co-founder, and it’s meant to be a more civil and better-educated version of Wikipedia, but with the same technology and the same multi-language support. To edit the site, you have to be registered with your real name, and regular users are meant to give deference to those who can prove that they have expertise on a particular subject.

Scholarpedia - it bills itself as a competitor to Wikipedia, though it’s not really. In Scholarpedia every article is written by experts and professionals, which means that only a small number of articles can get written, and presumably many topics can’t be addressed at all (who, after all, is an expert on, say, “Wheel of Fortune”?) Though you could make a credible argument that such topics don’t belong in an encyclopedia in the first place. Currently the articles that do exist are mostly on scientific topics. It does offer an interesting niche, of articles that are credible enough to be cited directly as references (I don’t know if anyone out there is citing Wikipedia - it seems somehow both implausible and inevitable.)

Knol (no link yet, but you can read more about it on Wikipedia, including a mockup page) - Google’s planned answer to Wikipedia, to some extent. Like Citizendium, everyone will have to use their real name, and if you want to start an article you have to submit your picture as well. Once you start an article, you’ll basically “own” it, though others can edit it. Thus, the focus will be more on individual pages than on trying to create some sort of unified site. My guess is that this particular site was at least partly inspired by Wikipedia’s uncanny ability to get to the top of Google search results on just about any search term; that’s somewhat confirmed by the Google executive in charge of the project stating that each page on their site “is meant to be the first thing someone who searches for this topic for the first time will want to read.”

Freebase - an encyclopedia run by Metaweb, using their proprietary semantic wiki technology, which means that, as far as I know, it’s the first and only semantic encyclopedia, i.e. the only encyclopedia whose data can be queried and aggregated. The interface is really geared toward its database-like style, with lots of field names everywhere for users to fill in. That makes it somewhat hard to read for general reference, and I don’t know if anyone uses it for that purpose. But as a proof-of-concept it’s neat, and for what it does it might be the only game in town, at least until Wikipedia itself bothers to add semantic capabilities…

Conservapedia - an attempt to counter a supposed left-wing bias on Wikipedia, by offering a version with a stated conservative and pro-American bias. Will they succeed? Who knows. They claim to have 20,000 articles, which is quite impressive given that they’ve only been around for about a year (since November 2006). That’s already comparable to the size of, say, the Greek-language Wikipedia. Who knows? Maybe the concept of an opinionated wiki has potential.

By the way, I don’t know if there’s any comparable site on the left - I know about dKosopedia, but they aim to be only a “political encyclopedia”, as opposed to a general-interest one.

Lohipedia - no, it’s not a site about Lindsay Lohan but rather an encyclopedia focused on user ratings and “karma”. The more other users trust you, the more permissions you have - though it appears that anyone who visits the site can still edit anything. This one uses the application Lohimedia, which runs on Ruby on Rails (MediaWiki, by contrast, which powers Wikipedia and some of the rest, runs on PHP). That alone gives me a certain amount of respect for it. Other than that, I have no opinion on it.

New MediaWiki extension: Semantic Drilldown

December 12th, 2007

The extension page, where it can be downloaded, is here. You can see it in action here. It’s meant to display all the data on one’s wiki, in an interface that allows for drilling down by both subcategories and semantic properties.

The impetus behind creating it is that a semantic wiki really should be able to have all the features of a traditional CMS (content-management system). A complete CMS offers interfaces to easily view individual data, add data, edit it, delete it (these first four actions are sometimes known as CRUD, for create/read/update/delete), search and drill down; and, with lower priority, data import and export as well. MediaWiki itself handles viewing (thanks to the use of templates), deletion and search; Semantic MediaWiki handles importing and exporting, to some extent, using RDF; and Semantic Forms (built on top of Semantic MediaWiki) handles adding and editing data. That left only drill-down as a needed feature, and that’s what Semantic Drilldown is meant to accomplish. We’ll see how much usage it gets, of course, but I’m quite pleased with the way it turned out.

Still here

December 7th, 2007

Yaron is not writing much these days, because he is busy right now. How busy? I’m working on four - yes, four - separate MediaWiki extensions in one way or another. Only one has been released already. Another one is probably coming early next week, and then the other ones will be… well, hopefully soon afterwards, but who knows.

I’ve had some interesting experiences recently, but no time to write about them right now. Hey, did you know Wikimania 2008 will be in Alexandria, Egypt? I’m probably going to this one, actually. It should be fun - I’ve never been to Africa before.

My friends are political

November 6th, 2007

Karol Sheinin will be speaking at a debate on the Lower East Side on Wednesday, on the subject “Did the Government Know in Advance About 9/11?” She’ll be representing the “No” side, as could be expected.

In other political news, my friend Ivan Lenin is running for President of the United States! And his campaign website is now up. I believe I’m in fact a minister in his party, so if anyone wants to know who I’m planning to vote for in ‘08, now you have your early answer. Check out his platform - it’s a little gulag-heavy, admittedly.

Another update

October 30th, 2007

Some developments recently…

  • In the biggest news, my brother’s company, Eye-Fi, launched their product, a wi-fi-enabled camera memory card, today - “Free your memories… quickly, easily, and wirelessly!” Their product sells for $99.99, and it’s currently #13 among Amazon’s electronic bestsellers. (!)
  • In less fortunate news, Metatopia, the conference I was supposed to speak at a week from now, has been cancelled - there was a hotel booking mixup that was discovered too late to find an alternate venue. The world will have to wait!
  • In happier academic-type news, my site, Discourse DB, was mentioned in this recent journal paper by the creators of Semantic MediaWiki. They also mentioned another site, the Protege Wiki, that also uses my extension, Semantic Forms.

That extension, by the way, is going well - the forms are now integrated in with the rest of the tabs on the page. You can click back and forth to your heart’s content between the regular “edit” tab and the “edit with form” tab - what could be better? Some more neat developments are planned.

UPDATE: Eye-Fi is now (the next day) at #3!

Sherman’s March never looked so clean

October 24th, 2007

“The Civil War in Four Minutes” - an animation showing the shifting battle lines, over time, of the U.S. Civil War. It was created to show visitors at the Abraham Lincoln Library/Museum. This was on YouTube for a while, then removed at some point, but you can still find it at this video site. It’s a great piece of data visualization, and I think semantic wikis will make such animations much easier to accomplish, by providing the large amount of raw data they need. Actually, I look forward to the day when semantic wikis are strongly tied in the public mind to this kind of visualization.