Archive for the 'I need NLP to help me categorize' Category

Occupy Boston marches with Oakland

Occupy Boston held its second of two trial Action Assemblies last night. The idea is to be more focused than the General Assembly, with the focus on action. A lot of future actions were discussed, but we adjourned early. We had a commitment.

Occupy Boston marchers sitting down in the street in front of the Prudential Center.

Occupiers carrying the banner ‘Occupy Oakland, Occupy Everything’. A single tent for old times’ sake at Dewey Square Park. The Federal Reserve Bank of Boston is in the background.

StudentsOccupy organizer Bea has a pointed question. Occupiers under court order to stay out of Dewey Square Park, reading “Why We Occupy.”

Marchers occupy Atlantic Ave. next to Dewey Square Park. The closeup shows an assortment of fashion statements from Boston Occupiers.

Semantic Web: Wikipedia and Natural Language Processing

Malvina and Zvi after the semantic web panel at Wikimania 2006.

Zvi and Malvina discuss fine points after the panel.
Malvina [right] was one of the panelists.

Suppose that you, like me, are a new Wikipedian. You’ve learned the wiki codes, which is not a big deal: they are rather easy compared to HTML, but they still take non-zero time. You’ve learned some of the conventions of the culture. You put “your” page together and put it up on the ‘pedia. What happens then? Well, if you, like me, didn’t read ALL the conventions of the culture, you will come back some time later and find “your” page emblazoned with banners informing you of the conventions that you didn’t read. One of these might be that you forgot to assign “your” page a category. So you then need to spend a chunk of time reading the tree of available categories. It’s not hard to find one or two quickly, but how do you know you’ve found the best categories? How do you know you’ve found all the relevant categories? It is a ‘barrier to entry’ for new Wikipedians and a problem even for some experienced Wikipedians.

Natural Language Processing (NLP) could automate this process to some extent. It is possible for programs to read bunches of categorized articles and collect a ‘signature’ for each category, which could then be matched against new articles to suggest categories for them. This could be done now. The Wikipedians are discussing whether it should be done now.
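
To make the ‘signature’ idea a little more concrete, here is a minimal sketch in Python. It is only my guess at one way such a tool might work, not anything the panel presented: each category’s signature is simply the summed word counts of its articles, and a new article is matched by cosine similarity. The function names, the tiny stop-word list, and the sample articles are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# A tiny, hand-picked stop-word list (an assumption; a real tool would need more).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "into", "through"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_signatures(categorized_articles):
    """Sum the word counts of every article filed under each category."""
    signatures = defaultdict(Counter)
    for text, categories in categorized_articles:
        counts = Counter(tokenize(text))
        for category in categories:
            signatures[category].update(counts)
    return signatures


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_categories(text, signatures, top_n=3):
    """Return up to top_n categories whose signature overlaps the new article."""
    counts = Counter(tokenize(text))
    scored = sorted(
        ((cosine(counts, sig), cat) for cat, sig in signatures.items()),
        reverse=True,
    )
    return [cat for score, cat in scored[:top_n] if score > 0]


# Made-up training articles, each already assigned a category.
training = [
    ("The river flows through the city and into the harbor.", ["Geography"]),
    ("The city council voted on the new harbor budget.", ["Politics"]),
    ("The election results surprised the council and the mayor.", ["Politics"]),
    ("Mountains and rivers shape the valley landscape.", ["Geography"]),
]

signatures = build_signatures(training)
suggestions = suggest_categories("A long river crosses the mountain valley.", signatures)
print(suggestions)
```

Even this toy version hints at the ‘heavy’ part: every signature has to be rebuilt or updated whenever categorized articles change, and every new article has to be compared against every category.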

On the one hand it would make creating new articles easier. Jimmy Wales mentioned in the morning plenary that with over 1,000,000 articles in the English Wikipedia, quality of existing articles is a higher priority than creating new ones. But NLP techniques can help here too. For example, a tool that can identify population numbers could check that a given city has the same population everywhere in the ‘pedia.
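
The population check could be sketched along these lines. Again, this is a hedged illustration under my own assumptions, not a description of any existing Wikipedia tool: it only recognizes the exact phrase “X has a population of N”, and the city names and figures below are invented sample data.

```python
import re
from collections import defaultdict

# Assumed phrasing: "<City> has a population of <number>". Real articles vary far more.
POPULATION_PATTERN = re.compile(
    r"(?P<city>[A-Z][a-z]+(?: [A-Z][a-z]+)*) has a population of (?P<number>\d[\d,]*)"
)


def extract_populations(articles):
    """Map each city to the population figures claimed for it and where they appear."""
    claims = defaultdict(lambda: defaultdict(list))
    for title, text in articles.items():
        for match in POPULATION_PATTERN.finditer(text):
            number = int(match.group("number").replace(",", ""))
            claims[match.group("city")][number].append(title)
    return claims


def find_conflicts(claims):
    """Keep only the cities for which more than one distinct figure was found."""
    return {
        city: dict(by_number)
        for city, by_number in claims.items()
        if len(by_number) > 1
    }


# Invented articles and figures, purely for illustration.
articles = {
    "Springfield": "Springfield has a population of 120,500.",
    "Example County": (
        "As of the last count, Springfield has a population of 121,000. "
        "Riverton has a population of 45,200."
    ),
    "Riverton": "Riverton has a population of 45,200.",
}

conflicts = find_conflicts(extract_populations(articles))
for city, by_number in conflicts.items():
    print(city, by_number)
```

Here only the made-up Springfield is flagged, because two articles disagree about its population, while Riverton’s figure matches everywhere it appears.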

On the other hand, NLP systems are complex and consume a lot of computing resources. They are ‘heavy’. Wikipedia currently is ‘light’, i.e., simple and fast. The Wikipedians would like to keep it that way, so NLP techniques will be introduced cautiously.

Why have I said “your” page throughout? That’s another aspect of Wikipedia culture. Articles do not belong to the originator, the most prolific contributor, or anyone else. It’s free content, baby! It belongs to the world.