Chris Dent

L505 Essay 10

2001-03-26

Exclusion for the Sake of Inclusion

Why are the notions of concepts and categories relevant to Library and Information Science? When we organize and represent information we do concept analysis—distinguish one thing from another for the sake of identification—and then categorize—group according to similarities for the sake of accessibility. Once we have figured out what something is we place it within a framed collection of things that are like it. This makes the information graspable. We spread everything out to see what it is and then lump things back together to be able to move them around easily.

Take, for example, a collection of animals that live in people’s homes: cats and dogs and birds and snakes. When referring to this mass, saying “The cats and dogs and birds and snakes have left the building” is far more cumbersome than saying “The pets have left the building.” The concepts of “cats and dogs and birds and snakes” have been lumped, or framed, to form “pets,” a considerably more manageable term.

One of the tasks of information organizers is accurate concept analysis. This leads to the creation of accurate categories that allow their customers to effectively access the information that has been organized. There is, however, a trap here: at first glance, accurate analysis and organization appear to be the primary goal. They are not. Accessibility is the primary goal. In many cases accuracy will lead to accessibility, but in some cases the preconceived notions of the audience will diverge widely from those of the information organizer.

A fair amount of the literature on electronic information retrieval views retrieval as an exercise in set theory: given set A (the contents of a query) and sets B, C, D, E… (the system’s representation of each of the documents within the collection), the size of the intersection of A with B, compared with that of A with C, A with D, A with E, and so on,[1] indicates the relevance of a given document to the query. A non-empty intersection indicates only that some terms are shared, not necessarily that the query and the found documents are similar in meaning. This is a practical result of computers being essentially ignorant. They do not and cannot (yet) know what something means. They only know whether a document does or does not contain a certain word. This is an unfortunate limitation that is both compounded and improved by human thinking.
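The set-based view of retrieval described above can be sketched in a few lines of Python. This is only a minimal illustration, not any particular system’s method: the function names, the whitespace tokenizer, and the three sample documents are all assumptions made for the example, and the score is the simple unmodified count of shared terms.

```python
def tokenize(text):
    """Lowercase a string and split it on whitespace into a set of terms."""
    return set(text.lower().split())


def overlap_score(query, document):
    """Count the terms that the query and the document have in common."""
    return len(tokenize(query) & tokenize(document))


def rank(query, docs):
    """Return (name, score) pairs, highest score first."""
    scores = {name: overlap_score(query, text) for name, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical collection, invented for illustration only.
documents = {
    "doc_b": "my pets are two dogs and a bird",
    "doc_c": "cats sleep most of the day",
    "doc_d": "pets need food and water",
}

for name, score in rank("pets", documents):
    print(name, score)
```

The document about cats scores zero for the query “pets” because the word itself never appears in it. The score records shared words, not shared meaning.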

The problem: we make loose associations between things. We may assume that if we are looking for a resource about pet cats, searching on “pets” will get us what we want. However, if the author of a page is so enamored of cats that it never occurred to them to refer to their pet with the word “pet”, that page will not be returned.

The improvement: we make loose associations between things. Humans making and managing information retrieval resources can “infect” the accuracy of their categorizations with the loose associations that they themselves make between things. The traditional way to do this is to provide a thesaurus with the resource.[2] When a search on “pets” returns unsuitable results, the system can suggest, “consider searching on cats or dogs or birds or snakes”. That, though, is only halfway there. We are still looking for, and finding, only words that exist in the document, not meaning. At this point the option that is left is manual classification by humans. A human may distill the meaning of a document and categorize it in a group that associates it with concepts or terms that are not present in the document. By categorizing the document, we give it a frame that excludes irrelevant noise and provides a handle that makes it accessible, that includes it.
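The two remedies described above, a thesaurus and manual classification, can be sketched as follows. Everything here is hypothetical: the thesaurus entries, the sample pages, and the function names are invented for illustration, and matching is reduced to asking whether any term is shared.

```python
# Hypothetical thesaurus: a broad term mapped to narrower ones.
THESAURUS = {"pets": {"cats", "dogs", "birds", "snakes"}}


def expand_query(terms, thesaurus):
    """Add the thesaurus entries for each query term to the query."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded


def matches(query_terms, doc_text, doc_labels=frozenset()):
    """True if the query shares a term with the text or its assigned labels."""
    doc_terms = set(doc_text.lower().split()) | set(doc_labels)
    return bool(set(query_terms) & doc_terms)


# Invented pages: one names its subject, the other never does.
cat_page = "my cats nap in the sun"
whiskers_page = "whiskers naps in the sun all day"

query = {"pets"}
expanded = expand_query(query, THESAURUS)

print(matches(query, cat_page))                  # plain word match
print(matches(expanded, cat_page))               # thesaurus expansion
print(matches(expanded, whiskers_page))          # expansion still sees only words
print(matches(query, whiskers_page, {"pets"}))   # label assigned by a human
```

The thesaurus recovers the page that says “cats” but not the one that only says “Whiskers”. Only the label assigned by a human reader, a term that does not appear in the text, makes that second page reachable.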

REFERENCES

Margolis, E.; Laurence, S., eds. Concepts: Core Readings. Cambridge, MA: MIT Press, 1999.

Zerubavel, Eviatar. The Fine Line: Making Distinctions in Everyday Life. New York: Free Press, 1991.

[1] Frequently this score is modified in some fashion to compensate for document size, but for the sake of this discussion this simple idea will suffice.

[2] Oh, it’s downright clever how everything is falling together here at the end of the semester. This ought to be a paper about concepts and categories but it’s really about mental models and classification with a general thrust towards thesauri and controlled vocabularies. Each paragraph is a rubber band pointed in a different direction but never launched. This paper could do with some exclusion.