Tuesday, July 22, 2008

Wikipedia, information and knowledge

I think Wikipedia has failed, or maybe is headed towards failure in a certain sense. The whole idea of wikis is a success, in that a community can collate information relatively quickly. But Wikipedia is so large that the community can no longer maintain that information. Obvious vandalism on high-traffic pages gets caught quickly. But on little-visited pages it can last for years.
An even worse problem is vandalism that looks like a real edit (say, changing a date). It can very easily damage the information without anyone noticing.
The problem is that Wikipedia has no information model, and a knowledge model that is close to non-existent.
Since the content is just flat text, the information in Wikipedia is entirely opaque to the software. When you enter information, Wikipedia doesn't gain access to that information. So, for example, you can't tell Wikipedia the population of New York and then later ask it for that information. You have to provide it to Wikipedia every time you need it, and if it needs to be updated, it must be changed wherever it is used.
In a similar way, the talk pages are done completely wrong; instead of using the same content model that the article pages use, they should use a forum-like thread/message structure, which could provide notification of replies, etc. It would make communicating about the information so much more efficient.
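As a rough sketch of the difference, here is what "telling" and "asking" could look like if each fact were stored once and keyed by what it describes. The `FactStore` class and its `tell`/`ask` methods are made up for illustration, and the population figure is a placeholder, not real data.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Holds each fact once, keyed by (entity, property)."""
    facts: dict = field(default_factory=dict)

    def tell(self, entity: str, prop: str, value) -> None:
        # Telling the store a value again overwrites the single stored copy.
        self.facts[(entity, prop)] = value

    def ask(self, entity: str, prop: str):
        # Every page that needs the value reads it from the same place.
        # Returns None when the store has never been told this fact.
        return self.facts.get((entity, prop))


store = FactStore()
store.tell("New York", "population", 8_000_000)  # placeholder figure
```

With something like this, an update is a single `tell` call, and every later `ask` sees the new value without anyone hunting down the pages that quote it.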
Anyway, I think Wikipedia has failed on that front. How would I do it better?
You have to start with something that has a very strong information model. Freebase and DBpedia are very good starts. Add to that a forum system for discussing the information.
Now that you have a program that is semantically aware of its content, which can fully take advantage of wiki-like community information gathering, you run into the fundamental problem that is plaguing technology now: knowledge.
Technology is missing semantic awareness. That's the whole thing that Powerset is trying to change. Plus there's the whole semantic web thing. But OK, so how do you make a system like Freebase able to assert the veracity of its information?
I would do it with two parts. First, you have to have references. But since you have a strong information model, the references can be directly tied to pieces of information (unlike Wikipedia).
Second, you need a system for people to vouch for a reference. But that's incomplete. One of the biggest flaws in Wikipedia is that it assumes everyone's edits are equally trustworthy. So I would use a sort of web of trust.
People would be assigned a trust number between 0 and 1. If a person's trust number is 1, the system considers them completely trustworthy: when they vouch completely for a reference, the system believes that reference is flawless and therefore the information is verified. Only an admin can assign a trust number of 1. Everyone else gets trust that flows from those people. If a person provides a reference that gets vouched for by someone trustworthy, that person's trust level goes up.
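How much trust should go up is left open above. As one possible sketch, the step could be proportional to the voucher's own trust and to the room the author still has below 1. The function name `raise_trust`, the `rate` of 0.1, and the cap of 0.99 on earned trust are all assumptions made for illustration.

```python
EARNED_CAP = 0.99  # assumption: only an admin can assign exactly 1.0


def raise_trust(author_trust: float, voucher_trust: float, rate: float = 0.1) -> float:
    """Return the author's trust after a voucher backs one of their references.

    The increase scales with the voucher's trust, so a vouch from an
    untrusted person (trust 0) changes nothing.
    """
    if not (0.0 <= author_trust <= 1.0 and 0.0 <= voucher_trust <= 1.0):
        raise ValueError("trust values must lie in [0, 1]")
    new_trust = author_trust + rate * voucher_trust * (1.0 - author_trust)
    # Earned trust stops at the cap, but an existing value is never lowered.
    return max(author_trust, min(new_trust, EARNED_CAP))
```

Under these assumptions, an author at 0.5 who gets a vouch from a fully trusted person moves to about 0.55, and repeated vouches approach the cap without ever reaching 1.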
The confidence level of a piece of information is based on how many people have vouched for its references, prorated in some way by each voucher's trust level. If someone believes a piece of information needs to be changed, they submit the changed information, which can then have references provided and vouched for. If the confidence level of the new information meets some criteria based on the confidence level of the old information, it replaces it.
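The "prorated in some way" and "some criteria" parts are deliberately vague, so the following is only one way to fill them in. It assumes each voucher's trust is treated like an independent chance that the reference is good, which means a single voucher with trust 1 gives full confidence. The function names and the replacement rule (new must strictly beat old) are hypothetical.

```python
from math import prod
from typing import Iterable


def confidence(voucher_trusts: Iterable[float]) -> float:
    """Combine the trust numbers of everyone who vouched into one score in [0, 1].

    Computed as 1 - product(1 - trust), so extra vouchers never lower the
    score, and no vouchers at all gives 0.0.
    """
    return 1.0 - prod(1.0 - trust for trust in voucher_trusts)


def should_replace(old_confidence: float, new_confidence: float) -> bool:
    # Assumed criterion: the proposed value must be better supported
    # than the value it wants to replace.
    return new_confidence > old_confidence
```

Two vouchers at 0.5 each give a confidence of 0.75 under this rule, so a proposed change backed by them would replace a value sitting at 0.6 but not one sitting at 0.8.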
There are a couple of complexities to add to it. First, in vouching for a reference, you have to be able to say whether you are very confident in the reference, only vaguely confident in its veracity, only vouching that the referenced material does indeed support the information, etc. Second, when entering a piece of information, you need to be able to say what kind of information it is—basically, how likely it is to change. If it's something that should never change (e.g., the birth and death dates of a recent president), it should be very hard to change once it has a high confidence level. On the other hand, some information *will* change—for example, the population of New York. It should be relatively easy to change such information regardless of its confidence level. Because of the strong information model, it would be possible to mark data as time-varying and deal with changes accordingly.
