Class 14: Knowledge and Metadata

Metadata is information about information

Passing around a copy of the New York Times, the class highlights what in the paper is metadata. David Weinberger, who is leading the class, wonders why no one chose to highlight the headline, which boldly proclaims Spitzer’s indiscretions, as metadata (John Palfrey suggests it was perhaps too seedy for us!). The problem is that the headline can be data itself; we even have headline news. But the headline also imparts information about the information in the article, which makes it a borderline case: it tells you about the article only if you choose to read it. Is the font size information? Metadata? Yes, it tells you in all caps that this guy screwed up big. Placement on the page is also metadata. So the newspaper itself is metadata, even the difference between the NYT online and the paper version in terms of space. The fact that something appears in the print version gives us information about the article because there is limited space (claims of all the news that’s fit to print notwithstanding). Does the space between words tell us something (other than the dominance of oppressive mainstream grammatical structures!)? Spaces are metadata because they show you the end of the information you care about: you are told this is the end of the word.

Existential Crisis Alert after the jump

It was suggested that words themselves can be seen as metadata: they are conceptual tools that tell us about something else, as does our choice in using them. Could you argue that everything encoded with words is metadata? Yes, but for now we want a class of information that is metadata, so we won’t take the bait. The problem is framed as follows: if one thinks everything is information and relates to everything else, it becomes impossible to differentiate information from information about information.

Organising information and metadata pre-web

Melvil Dewey, the inventor of the Dewey Decimal system, also founded societies to promote simple spelling, the metric system, and shorthand. He came from a tiny school in Massachusetts and had limited experience of the world. As a senior he decided to organise the world’s knowledge, using a conception of information/knowledge in the tradition of Hegel etc. He went with 10 top categories, then 10 within each, then 10 more. See the classification system here
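The "10 top categories, then 10 within each, then 10 more" structure means every three-digit number carries its own place in the hierarchy. Here is a minimal sketch of that idea, assuming only the digit layout described above; the function name dewey_levels is made up for illustration and no category labels are included.

```python
from typing import Tuple


def dewey_levels(call_number: str) -> Tuple[str, str, str]:
    """Split a Dewey number into its three nested levels."""
    # Keep only the digits left of the period and pad to three places.
    whole = call_number.split(".")[0].zfill(3)
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError(f"not a Dewey class number: {call_number!r}")
    top = whole[0] + "00"        # one of the 10 top categories
    division = whole[:2] + "0"   # one of the 10 divisions inside it
    section = whole              # one of the 10 sections inside that
    return top, division, section


print(dewey_levels("294.6"))  # ('200', '290', '294')
```

Anything to the right of the period is a further subdivision of the section, which is why a subject that only appears there sits four levels down.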

What guides the order of the categories? It runs vaguely from the conceptual/ethereal to the more practical. It follows the philosophers’ ordering of knowledge, which means philosophy is the top rank. The entire system is an ordered list, with the top always indicating the greatest importance.

The third most important topic is the paranormal, in Philosophy and psychology! Religion is all Christian! It has been updated: Islam now includes Bahaism and more. Where is Buddhism? 294.6; it didn’t make it to the left of the period. Only a billion Buddhists, but Dewey didn’t have many books about Buddhism; he led a cloistered life. Why hasn’t this been fixed? Mainly because it would be just too embarrassing. They did fix the computer stuff by putting it into the zeros. If they did make an update, think of all the problems: the ordering of Shiite and Sunni, where Jews for Jesus go, whether Palestine is a country, gender stuff (women’s education but not men’s). There doesn’t appear to be a way to fix Dewey. This comes up every time you have a taxonomy. When you decide how to divide things up, your hand is forced.

Organising information on the Web

Amazon gives you an unbelievable set of information about a book, even SIPs, Statistically Improbable Phrases, which you can use to search for other books. There is user-generated metadata. Amazon has multiple categories. An endless amount of metadata. What does it tell us? Sometimes it helps us decide whether we want the book; a SIP might give us a reason to buy from Amazon rather than someone else; maybe it helps us look for other books. The metadata ties together things that we otherwise never would.

JP poses the question:

If the idea is that by using metadata we put down markers that might be useful, might we go too far and have too much information?

Compare Amazon to an old-style library card system. Differences with a library card: less space, limited time, rules for how to structure entries, static rather than dynamic. Amazon can change over time, and anybody can change parts of it. Searchability suffers. Ability to sort suffers. Can’t have multiple copies really. No structure to adding metadata. The Social Life of Information says dog-eared is interesting. Most effort is given over to excluding information.

Recap: systems that rely on taxonomy are fixed systems and tend to be very rigid. This is in a sense mainly because when we work from paper we inevitably lose flexibility.

Tagging

Go to Flickr, where you can search on tags provided by other people. There are also groups that a picture can be added to, such as one for pictures of noses. People can even tag other people’s photos, depending on settings. What about when there is too much metadata broken out of a taxonomy? Compare Flickr with Corbis, which has a 70,000-word taxonomy with synonyms, a controlled vocabulary. Flickr has an open-ended system whereby any tag can be thought up by a user on the spot, no matter how idiosyncratic. Flickr can create clusters of images, automatically generated based only on the tags. This is an analysis of multiple tags. With enough tags you can create order.
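One way to picture how order can come out of free-form tags is to count which tags appear together on the same photo. The sketch below is only an illustration of that idea, not Flickr's actual clustering method; the photo ids, tags, and function names are all invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(photos):
    """Count how often each pair of tags appears on the same photo."""
    counts = Counter()
    for tags in photos.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def related_tags(tag, counts, min_count=2):
    """Tags that share at least min_count photos with the given tag."""
    related = []
    for (a, b), n in counts.items():
        if n >= min_count and tag in (a, b):
            related.append(b if a == tag else a)
    return sorted(related)


# Hypothetical photos and tags, purely for illustration.
PHOTOS = {
    "p1": ["jaguar", "cat", "zoo"],
    "p2": ["jaguar", "cat", "jungle"],
    "p3": ["jaguar", "car", "chrome"],
    "p4": ["jaguar", "car", "road"],
    "p5": ["cat", "zoo"],
}

counts = cooccurrence(PHOTOS)
print(related_tags("jaguar", counts))  # ['car', 'cat']
```

The ambiguous tag "jaguar" ends up linked to two separate neighbourhoods (cats and cars) without anyone having defined those categories in advance, which is the sense in which enough tags create order.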

Wikia, the search engine launched by the founder of Wikipedia, is a combination of human and machine. The web difference is that you can combine the human inputs with the technology on top. The web means you can update a taxonomy much more quickly: the Library of Congress has a flexible taxonomy and will make extra categories when it needs them. When you move online there is no limit to the number of categories that you can create: because you can use faceted structures to organise data, it doesn’t matter how many new categories you add to the taxonomy. The LoC has more stuff than it can categorise, 150 million objects, and training people in a taxonomy doesn’t scale. The LoC put 3000 photos from the 40s into Flickr in order to jump start the metadata. They put in the metadata they already had, and within days users had filled up the tag list. They ran out of tags on Flickr at 75, so people hacked it. You can put a tag on the picture annotating it. The comments section has lots of comments.
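To see why a faceted structure makes the number of categories matter less, here is a small sketch under made-up assumptions: each item is described by independent facets, and a query simply combines whichever facets it cares about. The item records, facet names, and facet_filter function are hypothetical.

```python
def facet_filter(items, **wanted):
    """Return the items whose facets match every requested value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in wanted.items())
    ]


# Hypothetical records; a facet an item lacks is simply absent.
ITEMS = [
    {"title": "photo A", "decade": "1940s", "medium": "photograph", "colour": "colour"},
    {"title": "photo B", "decade": "1940s", "medium": "photograph", "colour": "black and white"},
    {"title": "map C", "decade": "1860s", "medium": "map"},
]

matches = facet_filter(ITEMS, decade="1940s", colour="colour")
print([item["title"] for item in matches])  # ['photo A']
```

Adding a new facet to some items does not disturb the others, unlike a single tree where every new category forces a decision about where it sits.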

What do we gain and what do we lose from this system?

In Google Books we can search by “call me ishmael”. We get the book back; is it metadata? Someone argues that it is not: the search that leads to it is the metadata. Would Herman Melville be metadata? It depends on the context: it is when it gives you the information that he is the author. DW says everything is; is it the information age that makes it so, or is it the frequency of use that makes it so? DW suggests it has been useful to have a strict distinction between data and metadata because we have been stuck organising the real things: when we get too many things we need to separate the data and metadata. We had to reduce the amount of information. When we went into the digital world we started to replicate this with databases, but now we realise that anything can functionally become metadata. The importance of this is that we just got much better at finding things and will get much better at it - a huge web difference.
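The point that anything can functionally become metadata can be sketched with a toy search in which no field is privileged: a phrase from the text finds the book just as well as the author's name does. This is only an illustration with a two-book collection; it says nothing about how Google Books works, and the find function is invented.

```python
# A tiny illustrative collection; "text" holds only an opening phrase.
BOOKS = [
    {"title": "Moby-Dick", "author": "Herman Melville",
     "text": "Call me Ishmael."},
    {"title": "Pride and Prejudice", "author": "Jane Austen",
     "text": "It is a truth universally acknowledged"},
]


def find(books, query):
    """Return titles of books where any field contains the query."""
    q = query.lower()
    return [
        book["title"] for book in books
        if any(q in str(value).lower() for value in book.values())
    ]


print(find(BOOKS, "call me ishmael"))  # ['Moby-Dick']
print(find(BOOKS, "melville"))         # ['Moby-Dick']
```

Whichever piece you already know acts as the handle for the rest, so the data/metadata line depends on what you are using to find what.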

Creative Commons License
This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
