Archive for the 'Discourse DB' Category

Health care opinion: the breakdown

Thursday, August 13th, 2009

Discourse DB has a topic page for the current American health-care bill, AKA the “America’s Affordable Health Choices Act of 2009”. 82 columns and editorials on the subject are already entered, subdivided into “For”, “Against” and “Mixed” on the topic. If you’re curious about the current distribution of thinking among the pundit class, check it out. (And if you know of opinion items in notable sources not already included in the list, please add them!)

New MediaWiki extension: External Data

Tuesday, January 13th, 2009

I’m pleased to announce External Data, my new MediaWiki extension; this is somewhere between my sixth and ninth released extension, depending on how you count it. External Data allows wiki pages to use and display values that were retrieved from an outside URL that itself holds XML or CSV data. It’s a very simple extension (my smallest one, I think), but I think it has some important implications for SMW. Using it, one Semantic MediaWiki-based site can get the data from another, using a query with the ‘CSV’ format, and then store it semantically. See here for an example of that usage, on Discourse DB - it displays and then semantically stores data that was retrieved from this page on semanticweb.org. Check out the source code of the first page for the specifics of how it’s done. This means that now the information from two or more semantic wikis can be combined together in one place, then queried, mapped, etc., as if it were all just one wiki’s data.
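As a rough illustration of the idea, and not the extension's actual code or syntax, here is a minimal Python sketch of taking CSV text exported by one site and folding it into records held by another. The column names, the sample rows and both function names are hypothetical, and the CSV is an inline string standing in for what would really be fetched from a URL.

```python
import csv
import io

# Stand-in for the CSV text another wiki might return for a query.
# In practice this would be fetched from a URL; the columns and rows
# here are hypothetical placeholders.
REMOTE_CSV = """Name,Country
Example Event,Exampleland
Another Event,Otherland
"""


def parse_remote_rows(csv_text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def merge_with_local(local_rows, remote_rows, key="Name"):
    """Combine local and remote rows that share the same key value."""
    merged = {row[key]: dict(row) for row in local_rows}
    for row in remote_rows:
        # Remote values are added to (or overwrite) the local record.
        merged.setdefault(row[key], {}).update(row)
    return merged


if __name__ == "__main__":
    local = [{"Name": "Example Event", "Attendees": "40"}]
    combined = merge_with_local(local, parse_remote_rows(REMOTE_CSV))
    for name, fields in combined.items():
        print(name, fields)
```

Once the rows are merged like this, they can be treated as a single data set, which is the point of the exercise.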

This idea of pooling data from different websites is of course the main concept behind the so-called Semantic Web (not a term I like all that much, but that’s a different story). At the moment, I can’t imagine that this extension will be used much for the classic semantic-web example, of gathering data from completely unrelated wikis (or what could be called a “mashup”); but for wikis and other online data sources that have already coordinated among themselves to split up the handling of data, I think it’s a very reasonable solution for doing that.

Semantic wikis are the future of information

Monday, December 31st, 2007

I’d been thinking about writing this essay for a while, and initially I planned to phrase the title as a question, i.e., “Are semantic wikis the future of information?” But it’s important to have confidence in one’s own line of work, and, in any case, over time I’ve become more and more convinced about the truth of the statement.
First, some definitions: a wiki is a web site whose pages are editable by many people, directly through a browser. And semantic wikis are wikis that allow for storage of both regular text and queriable data. More on those later.
We all know that the easiest, fastest way to find information currently is on the web. (Whether it’s also the best way is a matter of opinion, but I think no one could argue with its speed.) There’s a staggering amount of information online, not just in written text but also as images, audio and video, and a good percentage of it is indexed by various search engines, so if I’m at a computer with an internet connection anywhere in the world, there are literally millions of interesting questions that I can have answered within a minute or less. That’s an incredible thing, and obviously a new state of affairs in the human existence.

But for all the almost-magical power and scope of the web, there are rather severe limits to the kinds of information that can be gotten there. For an overview and basic facts on just about any subject, you can look it up on the web and generally find what you’re looking for (and sometimes much more), but to have a specific, non-basic question answered is a much trickier proposition. On the web, for instance, I can easily find a list of the highest-grossing films of all time, but not the highest-grossing films of, say, 1960; I can find the population of Nairobi, but not a list of the 50 largest cities in Kenya; I can find the current GDP of France, but not the GDP of France in 1850, let alone the percentage of GDP that taxes represented in that year. I can easily find quotes from any of William Shakespeare’s plays, but quotes about any of his plays are quite a bit harder to find. And, to take a more practical example, if I want to buy a pair of sneakers in New York City, I can find local footwear stores, but not a list of stores that carry, say, Gola sneakers and are still open at 8 PM.

In short, there are whole large branches of information that just can’t be found currently on the web. If you want the answer to such a question, you have a few options: you can look it up in some book, or ask someone who’s knowledgeable on the subject (this, it’s true, can be done online). But for those of us who have been spoiled by the web, we prefer to have an answer in 30 seconds over one in several hours or even days. Consider all the questions that come up that we never get the answers to, because the amount of time and/or effort it would take is greater than what we think the importance of the answer is.

Ah, but the amount of information available online is still increasing rapidly; what will happen when there’s 10 or 100 times that amount available, in blog posts, Wikipedia entries, online articles and the like? And what if all those books at the library get scanned in? Actually, I don’t think that will change the basic situation very much. I contend that there are inherent properties of information that make answering non-trivial questions difficult:

  • Information is messy. Many questions require more than a one-word answer, because whatever answer is given needs clarification. Different terms mean different things to different people. What is the largest lake in the world, for instance? Three different bodies of water, the Caspian Sea, Lake Baikal and Lake Superior, all lay claim to that title, depending on what exactly you mean by size and what you mean by a lake. To take another example, currency inflation is ongoing, so what does it mean to say that the movie “Gone With the Wind”, made in 1939, earned a certain amount in today’s dollars? If someone bought a DVD of it for $20 five years ago, how exactly does that get added to the total? These are the kinds of issues that, if they can get resolved at all, can usually only be resolved by experts, and even then rarely definitively.
  • Information is controversial. Not only can information be messy, but in some cases people will have a vested interest in what the answer will be. That holds especially true when the information deals in some way with hot-button political issues. Infant mortality rates around the world are one example: different countries have very different standards for what should and shouldn’t be counted (whether prematurely-born babies should be counted, for instance), and that’s in part the case because those numbers are used to justify various public policy decisions. Civilian casualty rates and economic indicators are two other types of statistics with similarly highly-charged meaning. It means that any information provided about a controversial topic will be viewed with great suspicion by at least one side and maybe both, which means that you can’t just provide the data and be done with it; you need to show all the work taken to create that data.
  • Information is spotty. In many cases, especially for historical information, there are facts we don’t know, or for which the only sources are unreliable. In such cases, it’s good to know, to the extent possible, that such information can’t be found because it’s unknown, not because a search engine can’t find it or because no one’s bothered to enter it yet.
  • Information is distributed. Even if all the information exists and is relatively non-controversial, putting it all together in one place could be a challenge. In the very simple case, there’s a lot of information one can find out about any city in Kenya, but to put it together in such a way that one could answer complex questions like, say, what the largest city founded after 1920 is, is something someone would have to do themselves, for the moment. More complex data might involve putting together information in different formats, found in different places, in different languages, etc., and that is a task that would remain a challenge even if the web contained 100 times the information it did.
  • Information is always changing. The world is always changing. Thus, information that is meant to describe the world as it currently exists is always in danger of expiring, if it hasn’t expired already. Every aspect of our lives can change, from a restaurant changing its phone number to a country being split in two. Even scientific and historical facts, which people think of as more fixed, can change, as new evidence comes in that causes historians/scientists/etc. to change their minds about something.
  • Information is voluminous. This may be the most challenging aspect of all about information - its sheer size. Information is, at its heart, a description of everything that happens in the world, and so there’s no limit to how much information can be expressed. People have written multi-volume memoirs about their own lives, and even then the vast majority of what actually happened from day to day is left out; and that’s just a single person’s life. Sure, trying to include everything would make for an unreadable (in both senses) book, but on the other hand what’s considered interesting and what isn’t varies widely from reader to reader. Even the kind of information that could be described as worthy of being recorded is basically limitless.

So, given all these challenging aspects to information, what is the best way to make more of it accessible? Many people might say that the solution is to simply keep adding more information into Wikipedia. Wikipedia is a tremendous resource, it’s true, and in the eight years that it has been around, it has become, in some people’s minds, synonymous with the very concept of online information: when you want to find out about something, you look up its article on Wikipedia, using a web search only as a convenient way to get there. Wikipedia actually deals well with many of the challenges of information I listed earlier, because it’s a wiki: it handles messiness and controversiality of information by giving users a place to discuss any dispute over information, and by providing unlimited space for people to show the exact steps, and references, by which the information was obtained. And it handles the distributed, ever-changing and massive nature of information by letting anyone in the world change any piece of text at any time, while providing versioning tools to prevent that mass collaboration from leading to total chaos. There is simply no way for a small, paid group of professionals, however fast-working, to duplicate the work that Wikipedia’s hundreds of thousands of volunteers do.

Wikipedia, however, could not be a general solution to holding information for a few reasons. First of all, it’s an encyclopedia: it is meant to contain only information of general interest. A lot of text, and entire articles, routinely get deleted because they’re not considered important enough. To take one small example, a restaurant or store will not get included unless it is famous. That means that a lot of information important to a smaller, specialized group of people will not be included in Wikipedia. One solution to this problem has been to create many wikis for a whole variety of different subcategories and niches of information: there are food wikis, travel wikis and the ubiquitous pop culture wikis. These are all for the good, since they certainly increase the sum total of online information. But they have an important weakness, one shared by Wikipedia itself: they lack queriability.

What is queriability? It’s the ability to aggregate information, based on filters. So, for the example of wanting to know the largest cities in Kenya, it’s not enough to have information on each city, as Wikipedia does. There must be some way to combine that information, to generate new lists with the information we want. Wikipedia already does this to some extent, using categories and pages that are simply lists, but it’s not nearly enough. There’s already a list on Wikipedia of famous Swedish-Americans, but Wikipedia will never have a list of, say, politicians of Swedish ancestry from around the world. Similarly, on a more practical level, you can imagine a wiki that has detailed information on local stores, but it won’t let you easily find the list of stores that carry a certain product and are open at a certain time - for that, you’d need to hunt through the data to find the matching set on your own.
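To make "aggregating information, based on filters" a little more concrete, here is a minimal Python sketch. Everything in it is a made-up placeholder: the city names, populations and founding years are not real figures, and the `City` record and `largest_cities` function are just one hypothetical way to model the question.

```python
from dataclasses import dataclass


@dataclass
class City:
    name: str
    country: str
    population: int
    founded: int


# Placeholder records for illustration only; not real figures.
CITIES = [
    City("City A", "Kenya", 500_000, 1899),
    City("City B", "Kenya", 120_000, 1935),
    City("City C", "Kenya", 300_000, 1950),
    City("City D", "Uganda", 900_000, 1925),
]


def largest_cities(cities, country, founded_after=None, limit=50):
    """Filter by country (and optionally founding year), largest first."""
    matches = [
        c for c in cities
        if c.country == country
        and (founded_after is None or c.founded > founded_after)
    ]
    return sorted(matches, key=lambda c: c.population, reverse=True)[:limit]


if __name__ == "__main__":
    for city in largest_cities(CITIES, "Kenya", founded_after=1920):
        print(city.name, city.population)
```

The individual facts are the same ones a page-per-city wiki would hold; what changes is that they are stored in a form that a filter and a sort can operate on.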

That’s where semantic wikis come in: they allow for automatic aggregation of information and the creation of lists, and thus they provide answers to questions that were previously unanswerable online. The data entered by the users can function like data in a database: it can be sorted, graphed and analyzed in many different ways. There’s a side benefit to this newfound ability, which is that it provides a much greater incentive for users to enter this kind of data. There already exists a great deal of structured data in table and template form in Wikipedia, thanks to the tireless efforts of volunteers, but the GDP of France in 1850, for instance, is not (at the time of this writing) among them. If someone had that information, and they wanted to add it to Wikipedia, they could undoubtedly find a place to put it, but it has not yet been a priority for anyone. But what if this and other economic data were graphable, and able to be turned into a variety of list data, instead of just being a standalone number? All of a sudden, the value of having that number in place would increase, its absence would be felt more noticeably, and thus the chance of it appearing would increase. Thus, even someone looking for that number by itself would benefit from the wiki it was in being semantic, even though they weren’t using the data semantically.

This is not a strictly theoretical discussion: there already exists a way to make semantic wikis, using open-source software: MediaWiki (the application that Wikipedia uses), along with an extension, Semantic MediaWiki, that enables pages to store their data semantically, and a suite of other extensions that use Semantic MediaWiki to enable easy editing, filtering and mapping of data (some of which I created). I wouldn’t be presumptuous enough to say that MediaWiki and these two extensions themselves constitute the future of information, but they do fit the criteria that I think any such solution has to have: they allow for many people to collaborate on data through versioning tools, they provide forms for people to enter and edit data with, so that data structures are pre-defined, and so that users can easily add and modify data, and they make it easy to modify the data structures themselves. This last property is important because of the inherent messiness of information mentioned earlier: for data of any complexity, there will always come cases that don’t quite fit into the structures provided, and when that happens it’s important to let users modify these structures, using the same processes of collaboration and discussion that already exist for modifying data.

Wikis that use MediaWiki and its semantic suite of extensions already exist, so in a sense the future as I’ve defined it is already here. My site, Discourse DB, uses these extensions (and was the testing ground for some of them) and has, among other interesting pages, a list of opinions (for, against and mixed) on whether U.S.-led coalition forces should pull out of Iraq. It’s far from a complete list, but it is, as far as I know, the only place where you can find anything even resembling a comprehensive list online, or perhaps anywhere. Mograph Wiki, another wiki that uses this combination of MediaWiki and plugins, has a listing of all companies and freelancers that do motion-graphics work in London, England; again, as far as I know, no similar such list exists anywhere else in the world. And, to offer one more example, I don’t think there’s a list anywhere besides the Electronic Literature Organization’s Archive-It wiki of the electronic literature works published in 2004. The interesting thing about all of these lists is that none of them were created by hand; they are all generated automatically when the page is accessed, by querying other data in the wiki. Yes, it would be possible, in a few hours’ worth of time, to generate any of these lists by hand, using the other information found on the wiki, but to create the sheer volume of lists possible on each of these sites, and to keep them all constantly updated, quickly becomes impossible except by automation.

These are three rather specific examples, but then again the number of semantic wikis in the world is still rather small, probably fewer than 100 at this point. As the number grows, it’s easy to imagine the possibilities for the same semantic treatment being applied to a whole variety of other branches of knowledge.

There’s another important usage for semantic wikis, which is to serve as a data source for other applications. Imagine, for instance, a mapping application that’s able to provide information on nearby establishments or other points of interest for a given location, with the data taken from different wikis (some would call such an application a mashup). Or picture an application that lets the user navigate through historical or economic data, or one that automatically creates music playlists based on song properties like tempo and genre, using information from a song wiki. In these and other cases, the open nature of a wiki’s data, along with the ease with which it can be kept up-to-date, make a semantic wiki ideal as a data source. It’s a usage that has so far eluded Wikipedia and other conventional wikis, because it’s just too hard to extract data from a text-only wiki. For the MediaWiki-based solution, this capability too already exists: Semantic MediaWiki provides export into RDF, a format for semantic data, that’s easily queried externally, through standard tools like SPARQL. The conversion from wiki data to RDF (or other, similar formats like OWL) is easy because the data is already stored semantically to begin with.
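For a feel of why semantically stored data is easy to consume, here is a small Python sketch using the song-playlist example. It is a plain-Python stand-in, not RDF tooling or SPARQL itself: the facts are held as (subject, property, value) triples in the spirit of RDF, and the song names, property names and `subjects_matching` function are all hypothetical.

```python
# Each fact is a (subject, property, value) triple, in the spirit of RDF.
# The songs and their properties are made-up placeholders.
TRIPLES = [
    ("Song A", "genre", "jazz"),
    ("Song A", "tempo", "slow"),
    ("Song B", "genre", "jazz"),
    ("Song B", "tempo", "fast"),
    ("Song C", "genre", "rock"),
    ("Song C", "tempo", "fast"),
]


def subjects_matching(triples, **required):
    """Return subjects whose properties match every required value."""
    by_subject = {}
    for subject, prop, value in triples:
        by_subject.setdefault(subject, {})[prop] = value
    return sorted(
        subject
        for subject, props in by_subject.items()
        if all(props.get(p) == v for p, v in required.items())
    )


if __name__ == "__main__":
    # A "playlist" of fast jazz songs, built purely from the triples.
    print(subjects_matching(TRIPLES, genre="jazz", tempo="fast"))
```

A real application would run a query against the exported RDF instead, but the shape of the question (match subjects on several properties at once) is the same.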

So this is the kind of information storage that semantic wikis can allow: public, flexible, allowing for easy modification, and easily exportable. In short, it seems like the ideal solution for handling the chaos that is real information.

Assignment Zero

Monday, June 25th, 2007

I got interviewed for NewAssignment.net’s Assignment Zero; the organization/website does crowdsourced journalism, meaning that they suggest topics and anyone who wants to can do the research and write the article about it. “Assignment Zero” is their first such “assignment”, a set of interviews with lots of people about, appropriately, crowdsourcing.

I was going to wait to link to it, since most of the interviews will eventually be published on Wired.com, but I think they’ve already been “published” now, on NewAssignment.net (hard to tell, but I think so). Here’s the interview with me, in which I share my thoughts about Discourse DB, the Semantic Forms extension and semantic wikis, plus musings about crowdsourcing and democracy.

I’m quite pleased to be interviewed among such a group of heavy-hitters: there are personal heroes of mine like Wikipedia co-founder and head Jimmy Wales and “The Wisdom of Crowds” author James Surowiecki, and heavy thinkers like Clay Shirky; even one of my college professors is there - Henry Jenkins (I took “Introduction to Media Studies”).

Many thanks to Nate Olson for conducting the interview.

Semantic MediaWiki in five minutes

Tuesday, June 12th, 2007

I didn’t realize this was online: my Discourse DB partner, Nick Grandy, gave a five-minute talk to a bunch of drunk British geeks (well, it was at a bar) at Oxford Geek Night in April, on the subject of Semantic MediaWiki and Discourse DB; you can see the video here. I’m really surprised by how much he managed to pack into that five minutes; it’s as good an introduction to semantic wikis, both the motivation behind them and their mechanics, as anything I’ve seen.

Some of the other linked videos from that night are quite interesting too, by the way, with nice introductions to Amazon web service and the new HTML standard, among others.

New forms extension coming

Thursday, April 19th, 2007

Here’s an introduction of the project I’ve been working on on and off for the last few months: a form-creation extension for MediaWiki.

The basic idea is that, for semantic wikis to really be usable by a mass audience, they’ll have to allow adding and editing data using forms: anything more complex than filling out fields is, I think, a non-starter for the majority of potential users, despite all the usefulness of the concept. So this extension would allow for form-creation without the need for programming.

I’ve been developing and testing the code on Discourse DB, so that’s where it can be seen now. Here’s a sample opinion item page, and a sample publication page, and a sample author page, and a sample topic page. On each one, you can click on the new “edit with form” tab to get to a form that lets you edit that page. Note that in each case the form is customized for that specific page type; these are four different forms, each of which is defined separately.

And here’s the form for creating such a data-entry form page. A form page could also be written out by hand (it basically looks like HTML with some special tags), but this makes it easier to do.

There are a few more features and conveniences, but that’s the basic gist. I hope to release the extension sometime soon.

You say partition, I say partition

Friday, March 9th, 2007

I wrote a post about three months ago stating that an opinion piece by David Apgar in the Boston Globe, about partitioning Iraq into two countries, ignored some previous opinion pieces by other writers also advocating a partition; I wrote that this was a case where Discourse DB could have been helpful as a research tool to find that out. About a month ago (man, I’m behind on things), David Apgar himself wrote in a comment criticizing my statement, saying that the previous split-Iraq-in-two proposals don’t count because his proposal is substantially different. He also says Joe Biden has advocated a partition, though Discourse DB has Biden on the side of those against partition (Biden has written in favor of a decentralized Iraq, so it all depends on whether you view such a thing as one country or several). You can read the original post, and the comment, and the original opinion piece, and judge for yourself the merits of his argument. In any case, my point wasn’t to mock Apgar (though I can understand why he took it as such), since, of course, Discourse DB remains a relatively unknown site, but just to note its usefulness as a tool, both for writers and readers, to get a handle on what’s been previously written. My apologies if it came out harshly.

“I read the news today…”

Tuesday, January 9th, 2007

Check out Daylife, the new news-aggregation site - it’s pretty cool. At first glance it seems slow and pretentious - do they really need 3 tabs and 7 sub-tabs to show you the day’s news? Drudge Report manages fine with just one page, as does Google News with one page and a bunch of “next”s. The page is too weighed down with text and images - I can’t imagine clicking on an image that says “White House” just to read whatever top stories there are about the White House today.

So the site’s a bit overloaded. Also, it may or may not be named after the Beatles’ “A Day in the Life”. That could be just a wild guess.

Anyway, where Daylife really shines is in its search. Check out the Daylife search page for, say, Sarbanes-Oxley, the accounting act that’s still the most-read topic on Discourse DB. They have a whole range of articles and commentary from different newspapers and magazines (in theory they have blog posts too, but not for this issue), all of them relevant. On the right are photos of Chris Cox, the SEC commissioner, who’s the man most closely associated with the issue at the moment. Compare that with Google News’ search page on the same topic - there’s a lot of press releases, some really tangential articles, and obscure publications. The photos that appear are random and unrelated to anything.

For the sake of completeness, here’s Discourse DB’s Sarbanes-Oxley page - the easiest-to-navigate of the three, in my opinion, but then again this one’s not a search page so it’s not really a fair comparison. Just wanted to stick it in, to cleanse the palate a little.

I don’t know how Daylife manages to out-search Google, but they do. It’s a neat tool.

Attention not paid

Wednesday, December 27th, 2006

In a Boston Globe op-ed, “A two-state solution for Iraq?”, economist David Apgar suggests splitting Iraq into two countries, one in the northwest that would include Baghdad and the Sunni and Kurdish areas, the other in the southeast that would be exclusively Shi’ite. “Amazingly, no one has talked much about a two-way partition,” he states.

Of course, if he were a Discourse DB reader, he’d know that, as listed in the “Iraq should be split up” page, at least two journalists have already advocated just such a thing: Jonathan Last in The Philadelphia Inquirer, and James Kurth in The New Republic (more have advocated leaving it as one country, but that’s a different story).

Discourse DB: avoid preventable op-ed mistakes!

Form follows function

Friday, December 22nd, 2006

I created a form for entering new items (that would be editorials and other opinion pieces) into Discourse DB; you can see it here. The page should hopefully make it a lot easier for people to add content to the site. You no longer need to know anything about the MediaWiki templating syntax, which, though it’s not very complicated, I’m sure has scared off a lot of people. You just have to enter the relevant data into the form and hit two “OK” buttons. And thanks to the nifty “autocomplete” feature, you don’t need to worry about getting the names wrong (is that “Los Angeles Times” or “The Los Angeles Times”? “George Will” or “George F. Will” or “GEORGE WILL”?). And it makes it a whole lot easier to find the correct topics and positions for an item’s opinion, instead of having to do a search through the site and then copying and pasting. For any field, you just have to type the first few letters of the name, and, if what you’re looking for is already in the database, the correct version will show up somewhere in a dropdown below the entry box. The autocomplete feature was created using the Scriptaculous Javascript library, which has lived up to its name in every way.
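The matching behind an autocomplete field like this can be sketched in a few lines. Note that this is Python rather than the Scriptaculous Javascript actually used on the site, and that the name list and the `autocomplete` function are hypothetical; it only illustrates the "type the first few letters, get the canonical name" behavior.

```python
# Canonical names already in the database (a hypothetical sample).
KNOWN_NAMES = [
    "George Will",
    "Los Angeles Times",
    "The New Republic",
    "The Philadelphia Inquirer",
]


def autocomplete(prefix, names, limit=10):
    """Return known names that start with the typed prefix, ignoring case."""
    needle = prefix.strip().lower()
    if not needle:
        return []  # nothing typed yet, so suggest nothing
    matches = sorted(n for n in names if n.lower().startswith(needle))
    return matches[:limit]


if __name__ == "__main__":
    print(autocomplete("GEORGE", KNOWN_NAMES))
    print(autocomplete("the", KNOWN_NAMES))
```

Because the user picks from the suggestions instead of typing the whole name, variants like “GEORGE WILL” never make it into the data.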

I consider this another step forward in having wiki databases match the functionality of regular database-driven sites or CMSes (content-management systems). Discourse DB has multiple views of data, searches based on fields, data export to the outside world, and now forms for entering data. That’s all in addition to the natural benefits of wikis. I think it’s a strong combination, but we’ll see if others agree.

Discourse DB, now with timelines

Friday, December 15th, 2006

I finally upgraded all the MediaWiki code on Discourse DB to the latest version. I mention this because one of the neat features of the new Semantic MediaWiki code (which is essentially what our site runs on, with some of our own modifications) is timelines. It’s a plugin of the Timeline tool developed by MIT’s Simile project, and it lets you see items with dates along a scrollable graph of time, as opposed to just in a list or table.

As a test of the capability, I set it up for a timeline to appear by default at the bottom of the page for each “topic”. As an example, you can see it at Military Commissions Act of 2006; just scroll to the bottom. If you click on either one of the bars to scroll the timeline sideways, you can see all the editorials/posts/etc. that were written about the act, by publication date. Clicking on any name pops up the basic information about it.

When you show the data this way, various trends become apparent. You can see, for instance, that there’s a large “clump” in which almost all of the commentary was written, basically from early September to early October, 2006; that corresponds to when Congress was discussing and voting on the bill. The other topic pages all show this same pattern of clumping during a period of days or weeks. You can also see the days in which nearly everyone is talking about that topic; those tend to correspond with real-life news events.
Another way to visualize the data, for what it’s worth.
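The clumping that the timeline shows visually can also be found by simply counting items per day. Here is a minimal Python sketch of that; the dates are invented for illustration and are not real Discourse DB data, and the function names and the threshold of two items are arbitrary assumptions.

```python
from collections import Counter
from datetime import date

# Hypothetical publication dates for one topic; not real site data.
PUBLICATION_DATES = [
    date(2006, 9, 8),
    date(2006, 9, 15), date(2006, 9, 15),
    date(2006, 9, 29), date(2006, 9, 29), date(2006, 9, 29),
    date(2006, 10, 2),
    date(2006, 12, 1),
]


def items_per_day(dates):
    """Count how many items were published on each date."""
    return Counter(dates)


def busiest_days(dates, minimum=2):
    """Return the dates with at least `minimum` items, in date order."""
    counts = items_per_day(dates)
    return sorted(day for day, n in counts.items() if n >= minimum)


if __name__ == "__main__":
    for day in busiest_days(PUBLICATION_DATES):
        print(day.isoformat(), items_per_day(PUBLICATION_DATES)[day])
```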

ISG elicits controversy

Thursday, December 14th, 2006

The Iraq Study Group’s report has rapidly shot up to near the top of the “most controversial positions” list, on the Discourse DB analysis page (I have it set to refresh every day now). It’s also tied for third place among most-written-about positions. 7 authors are for its recommendations getting implemented, 14 are against and 12 are mixed on the subject (see the position itself here). The report supposedly has 79 recommendations, so it’s no surprise that there are so many in the “mixed” category. It’s why we made it policy when designing Discourse DB to break down topics by their most basic positions. Even the idea of negotiating with Iran and Syria over the war in Iraq, for instance, is really two positions, and some commentators have in fact felt more strongly about the idea of negotiating with one country than with the other. But we made a single position for the ISG report, since it was presented as a single document.

The single biggest point of contention appears to be the Israel issue: in brief, Baker, Hamilton et al. feel that countries like, well, Iran and Syria are more likely to help turn Iraq into a stable country (and presumably stop funding insurgents there) if the U.S. is able to negotiate an Israeli-Palestinian peace deal. This seems to strike the right-wing side as outrageously treacherous, the left-wing side as sensibly holistic.

Thoughts conveyed

Monday, December 11th, 2006

The appropriately-named Mike Linksvayer thinks that Semantic MediaWiki will be the “killer application for the Semantic Web”, and discusses Discourse DB. I agree that wikis are the best (and maybe only good) way to create online semantic data, although I think we disagree about whether that should come through a single, comprehensive site like Wikipedia or a large set of specialized sites (I go with the latter). On that note, he set up a site on Betocracy and I’m tickled at the market he created.

Alright, now I’m plugging two of my sites at once, which may be crossing the line in terms of self-promotion. Hey, at least no one asked about apartment rentals in New York…

The discourse is foiled again

Monday, December 4th, 2006

Anyone who’s been using Discourse DB to follow current events (I’d like to think maybe such a person exists) must have been surprised at the news today that John Bolton has given up his attempt at a Senate confirmation to be the U.S.’s UN ambassador. After all, according to the analysis page, the opinion that he should have been confirmed is the fifth most popular one on the site! It’s almost as much of a surprise as… the failure of the second- and third-most popular opinions. Those are both against the passage of the Military Commissions Act of 2006 (the two are about two slightly different versions of the act), and that act ultimately passed. Which might suggest that the volume of commentary in favor of a specific action is to some extent a function of the nervousness people have about that action not happening, and not necessarily an indication of how popular that action is.

Then again, maybe there’s only so much that can be extrapolated from this case; the rejection of Bolton was an issue of internal politics and didn’t even reflect the views of a majority of the Senate. On the two issues that have substantial commentary about them on the site and that have gone in front of a popular vote, the 2006 Connecticut Senate election and the 2006 California gubernatorial election, the direction of the commentary has matched the eventual outcome. So maybe it is an indicator of popularity.

Speaking of the analysis page, I neglected before to thank Greg Williams here, who not only wrote the RDF::Query Perl library that I used to create the page, but also was very personally helpful when I wrote him with some questions about using the library.

Sarbanes-Oxley, who knew?

Tuesday, November 28th, 2006

The most-read topic on Discourse DB in the last few weeks has been, surprisingly, the Sarbanes-Oxley Act. That’s the federal law, passed after the Enron & co. scandals, that instituted greater accounting requirements for U.S. public companies. Many people are now complaining that its requirements are too costly to implement, including, last week, the Secretary of the Treasury.

Is this really the biggest topic of the day? More likely, I’d guess it’s the one in which the level of reader interest most greatly exceeds the level of media reporting.

SPARQL vision

Monday, November 27th, 2006

I referred before to Discourse DB, a site I helped create, as “the first true wiki database site”, defining a wiki database as a set of data that is editable by the entire world but functions like a database. Well, there was certainly room to question that statement, since there are other, pre-existing, sites that combine wiki and database functionality in various ways. ITerating, a product-review site, and WikiTree, a genealogy site, are two examples, not to mention all the other sites that run on Semantic MediaWiki, the technology that Discourse DB itself is built on top of.

Well, now I’m on somewhat firmer ground with my statement, with the creation of the Discourse DB analysis page. This page uses data from the site that was obtained via Discourse DB’s data export, written in a format called RDF, using an RDF-specific query language called SPARQL. What does this mean? It means that anyone in the world can query Discourse DB to get its set of data. Even though the page is on the discoursedb.org domain, it’s going through the publicly-available interface to get the data, and in fact the querying to create this page was done on another server. And SPARQL is an open standard, so there’s nothing proprietary about the process.

If you check out the page you’ll also find some interesting information. Besides the basic type of information, like the political topics that appear most frequently, I programmed the script to get more in-depth information (the entire page was generated by a script). You can find out, for example, that:

  • the single most-popular opinion for a column or editorial in Discourse DB to espouse is that coalition troops should not pull out of Iraq; the site has 36 editorials or columns arguing that view. That’s followed closely by the opinion that the “Military Commissions Act of 2006”, the act on the treatment of enemy combatants that was passed by Congress, should not have been passed; that’s an interesting matchup that suggests that there’s a divergence between what matters most to the commentariat on the left and on the right.
  • the most controversial positions, meaning those with the closest split between authors arguing for and against them, are whether the United States should negotiate with Syria in order to improve the situation in Iraq, and whether the U.S. should build a fence along the Mexican border.
  • the least controversial position is that China should put pressure on North Korea to end its nuclear ambitions: 21 editorials or columns have been written arguing that, and none against or even mixed on the issue.
  • the two “authors” who have agreed on the most issues are The Wall Street Journal editorial board and The Washington Times editorial board, with 9 opinions in common. The individual authors who have agreed on the most issues are neoconservative writers William Kristol and Robert Kagan (not very interesting, since most of those columns were jointly-written).
  • the two “authors” who have disagreed on the most issues are, maybe not surprisingly, The New York Times editorial board and The Wall Street Journal editorial board. The individual authors who have disagreed on the most issues are right-wing Charles Krauthammer and left-wing Chicago Tribune columnist Steve Chapman.
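
The kind of tallying behind those last four bullets can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script that generated the analysis page (that one used SPARQL and the RDF::Query Perl library): the sample records, the author names "Board A" through "Board C", and the functions `stance_counts`, `controversy` and `agreement_pairs` are all made up for the example, and the controversy score is just one plausible way to measure "closest split".

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical sample records: (author, position, stance).
# Stance is "for", "against" or "mixed"; none of this is real Discourse DB data.
OPINIONS = [
    ("Board A", "Build a border fence", "for"),
    ("Board B", "Build a border fence", "against"),
    ("Board A", "Negotiate with Syria", "against"),
    ("Board B", "Negotiate with Syria", "for"),
    ("Board C", "Negotiate with Syria", "against"),
    ("Board A", "China should pressure North Korea", "for"),
    ("Board B", "China should pressure North Korea", "for"),
    ("Board C", "China should pressure North Korea", "for"),
]


def stance_counts(opinions):
    """Count for/against/mixed items per position."""
    counts = defaultdict(Counter)
    for _author, position, stance in opinions:
        counts[position][stance] += 1
    return counts


def controversy(counter):
    """Assumed score: 0.0 for a one-sided position, 1.0 for an even for/against split."""
    n_for, n_against = counter["for"], counter["against"]
    if n_for + n_against == 0:
        return 0.0
    return min(n_for, n_against) / max(n_for, n_against)


def agreement_pairs(opinions):
    """Count, per pair of authors, the positions on which they agreed and disagreed."""
    # Simplification: if an author wrote several items on one position,
    # only the last stance seen is kept.
    by_position = defaultdict(dict)
    for author, position, stance in opinions:
        by_position[position][author] = stance

    agree, disagree = Counter(), Counter()
    for stances in by_position.values():
        for a, b in combinations(sorted(stances), 2):
            if stances[a] == stances[b]:
                agree[(a, b)] += 1
            elif {stances[a], stances[b]} == {"for", "against"}:
                disagree[(a, b)] += 1
    return agree, disagree


counts = stance_counts(OPINIONS)
agree, disagree = agreement_pairs(OPINIONS)
most_controversial = max(counts, key=lambda p: controversy(counts[p]))
most_agreeing_pair = agree.most_common(1)[0][0]
```

With the sample records, `most_controversial` is the border-fence position (an even split) and `most_agreeing_pair` is Board A and Board C. "Mixed" stances are ignored when pairing authors here, which is a design choice rather than anything the site prescribes.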

Now, none of this is entirely scientific; I’m not planning to try to get these results published in a public policy journal. The biggest issue is the spottiness of the information; the site is built to be able to hold opinion columns and such from any time in the past, but in reality there isn’t much from before three months or so ago. So while I can’t really vouch for the amount of truth contained in the data, I think it’s a good proof-of-concept of wiki-database querying and maybe semantic web querying in general.

DHS to monitor opinions

Monday, October 16th, 2006

According to the New York Times, the U.S. Government is looking to create some interesting-sounding software to monitor foreign sentiment: Software Being Developed to Monitor Opinions of U.S.

A consortium of major universities, using Homeland Security Department money, is developing software that would let the government monitor negative opinions of the United States or its leaders in newspapers and other publications overseas.

Such a “sentiment analysis” is intended to identify potential threats to the nation, security officials said.

Researchers at institutions including Cornell, the University of Pittsburgh and the University of Utah intend to test the system on hundreds of articles published in 2001 and 2002 on topics like President Bush’s use of the term “axis of evil,” the handling of detainees at Guantánamo Bay, the debate over global warming and the coup attempt against President Hugo Chávez of Venezuela.

A $2.4 million grant will finance the research over three years.

Wow, software that would allow for the classification and archiving of published opinion pieces… sounds amazing. $2.4 million, eh?

Okay, to be fair, reading between the lines, it sounds like what the software would do (and the reason for that cash outlay) is analyze articles itself, using some kind of language-processing tools; as opposed to having humans do the classifying, which is what our site does. I’m quite skeptical of what’s known as natural language processing - getting computers to understand written language; it’s an intensely difficult process, full of complexities that 40 years of research haven’t been able to get around. The promise of any kind of natural language processing is that, once you’ve figured out how to do it, you can process a lot of data quite quickly; much faster than a person could. But Wikipedia, and other wikis, have demonstrated that there’s another approach that combines speed and accuracy without the need for artificial intelligence hacks - the collaborative approach. Create a good-enough site for the information to go, and, if you have enough contributors, it’ll get populated with the data you need quite quickly. In this case, the underlying newspaper articles are all out in the open, so the government doesn’t get any great benefit by keeping the project hidden - it might as well let the world be involved in creating the analysis and seeing the results.

Anyway, we’ll see if the people involved come around to what, for lack of a better term, I’ll call the “Discourse DB model”. Or maybe I or Nick should place a call over there ourselves… hm.

Just for the humor value, let me note a section from later in the article; no piece on a government intelligence initiative is complete without the requisite hand-wringing quote:

Marc Rotenberg, executive director of the Electronic Privacy Information Center in Washington, said the effort recalled the aborted 2002 push by a Defense Department agency to develop a tracking system called Total Information Awareness that was intended to detect terrorists by analyzing troves of information. “That is really chilling,” Mr. Rotenberg said. “And it seems far afield from the mission of homeland security.”

Yes, it’s come to this: even the contents of newspapers are now more information than some would like the government to have.

Gone commercial

Thursday, October 12th, 2006

We bought BlogAds ads for Discourse DB on four blogs: for ideological balance, we went with two on the right, Alarming News (a friend of mine) and Tim Blair, and two on the left, Lawyers, Guns and Money and Mahablog. You can see the ads there now; they’re all up for at least the next week. Since BlogAds lets you see where people are coming from when they click on an ad, it’ll be somewhat interesting to see the response levels from different sites.

The impersistence of memory

Thursday, October 5th, 2006

Google CEO Eric Schmidt had an interesting interview with the Financial Times (via Captain’s Quarters) in which he predicted that “within five years, ‘truth predictor’ software would ‘hold politicians to account’. Voters would be able to check the probability that apparently factual statements by politicians were actually correct, using programmes that automatically compared claims with historic data.” It echoes something I’m sure a lot of us have thought of: you’re watching some political debate, and one blowdried politician says, “my blah blah policy has created X number of jobs”, and the other blowdried politician says, “well, that’s just not true; actually, it’s cost us Y jobs.” The immediate impulse is to hit “Pause” (okay, for those of us who have a DVR), head to JustTheFacts.com or whatever and see what the truth is. Schmidt’s an engineer, and he’s clearly betraying an engineer’s view of the world: we have all the facts, statements and numbers at our fingertips; let’s summarize them all neatly, publish the results in tastefully-designed tables and voila, a system by which politicians are beholden to state the truth or suffer the consequences.

Of course, the problem with this view is fairly obvious to anyone who’s ever gotten into a political argument for more than five minutes: the truth is a very malleable and slippery thing. People who differ politically will disagree on the facts themselves, on the importance of different facts, even on whether some facts should be used at all. Statistics can be used to any end: on the economy, for instance, there are hundreds of indicators you can choose among; that’s how one analyst’s booming economy turns out to be another’s “worst economy since the Great Depression”. Turning to experts is no cure either: on military matters, make any sort of statement about how some current war or battle is going and chances are you can find some retired Army general who will agree with you 100%.

But what about a truly neutral source, a sort of global ombudsman, devoted to finding out the truth without biases? FactCheck.org is a notable attempt at this sort of service, and to many people it’s a useful site. But no matter how carefully they craft their answers, they’re always open to charges that they’re ignoring some set of facts: do a web search on “factcheck.org biased” and you can see various attacks on it from both the left and right (does that mean they’re in the middle? Maybe, but not necessarily). Even if there were such a thing as an objective truth, and the FactCheck team was onto it, there’s no guarantee that, sometime in the future, their staff wouldn’t get populated by others who did have an ideological slant; for that reason, their statements will always have to be regarded with vigilance.

Let me offer a real-life example of the difficulties involved in getting at the truth: when President Bush, before the 2003 invasion of Iraq, was making the case for war, he gave a number of rationales in its favor. One question is, was the importance of bringing democracy to Iraq one of them? It’s important to people because one popular anti-Iraq-war theory holds that the Bush administration turned to that democracy rationale only out of desperation, after all the other ones had been tried and failed. Still, the question isn’t a matter of ideology per se, just an issue of the facts; so one side should have conceded by now. But no: during one small flare-up last year, Instapundit said Bush did, Kevin Drum said Bush didn’t.

Think about it: both of those people have access to the web, just as anyone who’s reading this does. That means they have access to an amount of information, at a rapidity, that would have been unthinkable even 15 years ago. And these aren’t just any two people: they’re two of the most popular political bloggers in America, both renowned for the level of knowledge they bring to their commentary. Beyond what’s on the web, they have access to a variety of other sources, not to mention to the combined expertise of hundreds of thousands of readers. And this is not an obscure issue: it’s the Iraq War, among the most-discussed and -analyzed issues of our time, possibly (if you go by the number of commentators involved) of all time. And the war only began three years ago, so all of the original documents and commentary are still very much with us. In short, it’s an issue which, in terms of immediate data available, stands nearly at the pinnacle of human history; yet even our most informed commentators can’t agree on some of the basics.

Schmidt talks only about probabilities, which indicates to me that he’s thought about just this very thing. Presumably this software or web application, whatever it was, wouldn’t be digesting facts and making up opinions independently, but would instead be aggregating various opinions in some way. That’s already a large step away from an objective truth, but it’s still useful. Even then, having worked on it myself, I know there are two main pitfalls to such an effort:

  • Getting the full range of opinion on the facts. Google News itself has this problem: it’s been attacked by commentators on the right for supposedly removing right-wing news sites from its index due to charges of “hate speech”. No surprise if there have been similar attacks from the left. You shouldn’t block out important sources, but then again you have to block some people because you can’t allow everyone, which brings us to the second pitfall, which is…
  • Vulnerability to “gaming the system”. Any system designed to poll the views of a crowd can be subjected to efforts to sway the outcome to give inaccurate results. In real-life elections, that can include voter fraud; in most online polls, it’s as simple as a popular site pointing its readers to such a poll; given the slant of whatever site that is, they’ll know how to vote. And for basic web searches, there’s Google bombing and the like.

This whole issue is close to my heart because Nick and I created (sorry, shameless plug coming once again) Discourse DB in part to answer the exact same problem that Schmidt describes. Look up a topic on the site and you can see what notable commentators have written about it: with enough content on the site, viewing the breakdown of favorable and unfavorable commentary on some issue or candidate perhaps gets you to one approximation of that elusive creature, the truth.

The site has its own built-in protections against the two pitfalls. For the case of missing some part of the discourse, Discourse DB is open to additions and modifications by anyone at any time: if a reader sees some notable commentary that’s missing from the site, they can just add it in themselves. For the case of system-gaming, the site has rather strict rules about what constitutes an acceptable source and opinion item; a random observer couldn’t do anything to skew what shows up on the site, unless they were a writer at some major newspaper or blog; even then, the effort it would take to game the system (writing up a lot of commentary advocating a particular point) would far outweigh whatever small gains you’d get from it.

Still, from that precise way in which Mr. Schmidt phrased it, I get the sense he has something specific in mind. It’ll be interesting to see what, if anything, that is.

Connecting the dots

Wednesday, September 27th, 2006

One of the neat features I was thinking about as we were creating Discourse DB was that, because of its database structure, you could use it to spot a lot of patterns that would otherwise be difficult to find or prove. Do certain writers tend to mirror each other? Or to take opposite sides? Are they obsessed with certain topics? Are some newspapers not as uniformly left-wing or right-wing as popular perception would have them?

As an example, one interesting discovery that emerges even from the relatively minimal amount of information already on the site is the relentless way in which the Washington Post has been “flooding the zone” on President Bush’s proposed legislation on dealing with war-on-terror detainees, AKA the “Military Commissions Act of 2006”. I haven’t read anything that’s remarked on it, but the newspaper has been publishing 3-4 editorials a week attacking the legislation in its various forms, in addition to a handful of their columnists doing likewise. Look at the page for the Washington Post editorial board - just about half of the editorials included in Discourse DB from the last month or so are on the topic. You can also see the pattern in the page for the 2006 military commissions act itself - as could be expected, a lot of people are talking about the issue, but the Washington Post, and specifically its editorial board, are by far the single biggest contributor of commentary on the issue.

Anyway, I think it’s an interesting observation, and, though I’m sure there are a great many readers of the Washington Post who have privately remarked on the trend, this site allows even people who don’t read the paper to see it, and it allows the discovery of it to occur in a more structured, provable manner.