I. Organizing Knowledge: Hierarchical Structures

Sorted By Creation Time

20011023: Hammond, Toward a General Theory of Hierarchy

Hammond, T.H. (1993). Toward a general theory of hierarchy: books,
     bureaucrats, basketball tournaments and the administrative
     structure of the nation-state. _Journal of Public Administration
     Research and Theory 3_(1), 120-145.

Hammond describes how the hierarchical structure of institutions
affects how information in the hierarchy is transformed and used. This
happens because hierarchies inform how information is categorized and
thus how comparisons are made. Hierarchies control how information is
aggregated and transmitted, thus controlling how problems and solutions
are discovered and defined.  Some examples are given, including: in
different library classification schema adjacency is defined differently
because of different categorical relationships--meaning the results
of serendipitous browsing in the shelves or catalog will be different
from one scheme to another; in an intelligence organization how people
filter information, determining relevancy, controls what information
the final decision maker at the top of the hierarchy will see and act
upon. Hammond's conclusion is that since hierarchies are present in the
nation-state, as in any politicized institution, the organization of
the nation-state impacts the sort of problems that can be identified,
shared and worked upon by the state.  Knowledge of this will help in
the understanding of the behavior of nation-states.
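
The library example can be sketched in a few lines of Python. The
titles, subjects, and forms below are made up for illustration, and
neither ordering is a real classification scheme. The sketch only shows
that the same items get different shelf neighbors when the hierarchy
changes.

```python
# Hypothetical items: (title, subject, form). Not drawn from Hammond.
ITEMS = [
    ("Bird Atlas", "biology", "atlas"),
    ("Road Atlas", "geography", "atlas"),
    ("Bird Songs", "biology", "essay"),
    ("River Essays", "geography", "essay"),
]

def shelve(items, key_index):
    """Order items by one attribute first, then by title within it."""
    ordered = sorted(items, key=lambda it: (it[key_index], it[0]))
    return [item[0] for item in ordered]

def neighbors(shelf, title):
    """Return the titles shelved directly beside the given title."""
    i = shelf.index(title)
    return set(shelf[max(i - 1, 0):i] + shelf[i + 1:i + 2])

by_subject = shelve(ITEMS, 1)  # subject is the top level of the hierarchy
by_form = shelve(ITEMS, 2)     # form is the top level of the hierarchy

print(neighbors(by_subject, "Bird Atlas"))
print(neighbors(by_form, "Bird Atlas"))
```

A browser who pulls "Bird Atlas" off the shelf stumbles onto a different
book in each arrangement, which is the serendipity effect described above.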


Back to the Index

20011103: Still more on categories

Contact:cdent@burningchrome.com

The flip side to categories being resistant to definition is that they
are also difficult to enumerate. 

If one is able to define and enumerate, that's probably a
classification system.



Studer: Classification v. Categorization



Studer, P.A. (1977). Classification as a general systems construct. In
     B.M. Fry & C.A. Shepherd (Comp.) Information management in the
     1980's: Proceedings of the [40th] ASIS Annual Meeting, Chicago,
     Illinois, September 26-October 1, 1977 (pp. 67, C6-C14, A1-A9).
     White Plains, NY: Knowledge Industry for American Society for
     Information Science.

-=-=-

While reading the Studer article from session 9 it occurred to me that
there seems to be a lack of consistency in the literature between the
use of the terms classification and categorization. Studer seems to use
the terms almost interchangeably, especially when he is quoting. That is,
while he uses the term classification the quoted text uses category.

He makes it sound like the process of creating classifications is a step
following the creation or identification of categories.

This conflicts with how I've been thinking about the terms. Perhaps
somebody can confirm or reject the following views?

In my view classification is a sort of artificial process by which
we organize things for presentation or later access. It involves the
arbitrary creation of a group of classes, potentially arranged in a
hierarchy, which have explicit definitions. In other words a class is
strictly defined and once inhabited the inhabitants can be enumerated.

Categorization, on the other hand, is a natural process in the sense
that humans do it out of their cognitive fundament. It is, as Studer
reports, an act of simplification to make apprehension and comprehension
of the environment more efficient. Categories spring up out of necessity
and, because they are designed to replace the details of definition, are
themselves resistant to definition. When provided with a list of stuff we
are able to categorize the stuff, but when asked to list the full contents
of a category we can't.

So to put it more succinctly:

- a class is a defined grouping of entities in which the members
  fulfill the definition of the class and can be listed.
- a category is a cognitive label applied to a non-enumerable grouping
  of entities wherein membership is determined by typicality amongst
  the members and not some overarching definition.
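
One way to make the contrast concrete is a rough Python sketch.
Everything in it is an assumption for illustration: the feature names,
the prototype, and the similarity measure are invented. Scoring
resemblance to a prototype is itself a rule, so the second half only
imitates typicality; it does not demonstrate categorization.

```python
# A class: membership is decided by an explicit definition, and the
# members found in a known collection can be listed.
def is_even_under_ten(n):
    return 0 <= n < 10 and n % 2 == 0

def enumerate_class(definition, universe):
    return [x for x in universe if definition(x)]

# A stand-in for a category: membership is graded by resemblance to a
# typical member rather than decided by a definition.
# The features are hypothetical.
BIRD_PROTOTYPE = {"flies", "feathers", "sings", "small"}

def typicality(features, prototype=BIRD_PROTOTYPE):
    """Jaccard overlap with the prototype; 1.0 is most typical."""
    return len(features & prototype) / len(features | prototype)

robin = {"flies", "feathers", "sings", "small"}
penguin = {"feathers", "swims", "large"}

print(enumerate_class(is_even_under_ten, range(20)))
print(typicality(robin), typicality(penguin))
```

The class can be listed in full. The category stand-in can only say that
a robin is a better bird than a penguin, and there is no obvious place
to draw the line.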

This is important to me, in part, because I'm playing around with trying
to determine if computers can ever be actually intelligent or must always
fake it. I vote for the latter because computers cannot categorize.

The ability to categorize seems to be the basis for intelligence. On
the fly categorization allows us to place data in an informational
context. Once in that matrix we can do what seems to amount to an endless
recursive dialectic wherein each new synthesis becomes thesis.

Computers can presumably replicate this process but it is imitation. Their
distinctions must be made by definition, by classification, not
categorization. They can be made to appear to do categorization but the
alternate representations they provide are rules (definition) based.
Thus far the most promising research in creating seemingly intelligent
machines has used what can be called a brute force approach: supply
the computer with as much information as possible, related in as many
ways as possible. This is the method that IBM used to get Deep Blue to
become a chess champion, and it is the key to the Semantic Web.

If we want to create truly intelligent machines, a first step then is
determining how categorization really works. I wonder, though, why we want
intelligent machines. Don't we really just want machines that are
tools to augment our own intelligence? If that's the case, then we are
already there: we simply need to improve on what we have.



20011106: Conversation about identifiers, labels and categories


This is some email between my step-father and me. Walt has a long
history of thinking about databases and has done a great deal of
reading and writing on strictly unique identifiers. He's the source
for my own feelings about identifiers needing to be meaningless or
else they are not identifiers and are thus broken.

Discussion of identifiers and labels leads to some discussion of
categorization.

-=-=-
From: "Chris Dent" <cdent@burningchrome.com>
To: XXXXX
Sent: Tuesday, November 06, 2001 1:06 AM
Subject: Dewey decimal system

how does your notion of unique, persistent, essentially meaningless
identifiers interact with dewey wherein the call number is both a key to the
location of the book and a mini language which describes the content of the
book. For example 823 is english fiction. If you know the language you can
pick up a book and see from the spine what it is potentially about.

This is used as an example of how dewey is less bad than library of
congress. It exercises putting knowledge into the world, with language, so
you have the opportunity to process less.

This is a key feature of augmentation, which I'm keen on.

So is there some distinction between objects that need to have meaningful
labels and those that need identifiers?

I'm in class right now, writing this on my pilot so this may be  a bit
stilted.

-=-=-
Date: Wed, 31 Oct 2001 20:31:30 -0500
From: Walt Woolfolk <XXXXXX>
To: Chris Dent <cdent@burningchrome.com>
Subject: Re: Dewey decimal system

Dewey? Egregious!  Clear violation of the doctrine of strict
uniqueness (i.e., a stable identifier must be at least unique and at
most unique).

There are, of course, many examples of meaningful labels, but there
are no justifications I know of for them. That is, a thing possesses
a set of descriptive characteristics, some of which are identifying
characteristics. Since it is highly inconvenient (and unstable) to attempt
to always refer to a thing by its identifying characteristics (take
yourself, for example- what would it take to describe you sufficiently to
identify you? and how would it likely change over time), an identifier
is assigned to the thing. By making the id strictly unique it serves
the purpose of picking out a single thing and it is immune to change.
Any or all of the thing's characteristics remain available for descriptive
purposes. So a thing has an identifier (label) which is best strictly
unique plus one or more descriptive characteristics.

In a library system an item might have a strictly unique id (e.g.,
123456789), some location scheme, and many other descriptive and varying
characteristics. The id becomes the primary key to the item in the
database and any characteristics or combination of characteristics are
potential secondary keys (e.g., author last name).

The argument most often offered for meaningful ids is the convenience
of having certain information immediately available when one looks at
the id, so you don't have an additional look-up step to access that
information. The advantage of this is real, but completely trivial
when weighed against the disadvantages. In the case of Dewey, even this
minor advantage is offset by the fact that what is included in the id
is itself encoded, so you have to look up the meaning of each of the
codes anyway. The argument from convenience doesn't stand up. In fact
there are no good reasons for meaningful ids, and I suspect the real
reason for them is psychological. Finding out what that reason is would
make an interesting project for some grad student with an interest in
human cognition.

-=-=-
From cdent@burningchrome.com Tue Nov  6 01:08:10 2001
Date: Thu, 1 Nov 2001 23:59:37 -0500 (EST)
From: cdent@burningchrome.com
To: Walt Woolfolk <XXXXXX>
Subject: Re: Dewey decimal system

On Wed, 31 Oct 2001, Walt Woolfolk wrote:

> Dewey? Egregious!  Clear violation of the doctrine of strict
> uniqueness (i.e., a stable identifier must be at least unique and at
> most unique).

Yeah, that's what I thought you would say, which is why I thought I
would write.

After thinking about it in the bathtub, though, I'm still wondering if
the call number is an identifier and not a label.

If it is a label, the problem is not that it is meaningful, but that
people think it is an identifier (instead of a label).

While for you and me the label still requires a lookup for decoding,
for someone who knows the language, no external lookup is required.
The call number is a signifier with meaning. People like those sorts
of things because they are easy (small) reference chunks to
complicated (large) bits of info.

This goes back to our categories conversation: people make categories
so they don't have to remember all the qualities of a thing in a
category, but can instead refer to it by the category label (e.g.
bird).

Cognitive scaffolding a prof of mine calls them.

From a database system standpoint it would be an egregious error to
use the call number as the primary key to the book as, just like you
say, if the interpretation (and thus call number and location) changes
you're screwed, that change has to cascade around all over the place.

Presumably some people know this, but when it comes time to physically
identify the book (much different act than logically identifying it)
they don't want a unique ID because you'd have to go to some sort of
external (to the brain) device to find out where to put it on the
shelves (either of the library or the brain).

So, while you've just suggested some PhD research to find out why
people want meaningful ids, I find the case already mostly closed, in
that the problem is that people and computers don't think alike, and
shouldn't think alike. Let the tool do its job, it doesn't think like
you and you don't want it to...

Sort of like: computers are relational databases, humans are
associative databases. Attempts to model people as relational
databases have failed. Attempts to get computers to do associative
linking have mostly fallen on their faces. By association I mean the
ability to create undefinable categories. Computers have trouble with
that whole lack of definition thing. They want rules.

I might have to quote us into my readings journal for this particular
class, if you don't mind?

I just got back from an outdoor rock climbing trip to a nearby
roadcut that's been developed into a bit of a climbing area. We got
there early enough to get the rope set up before the sun went down,
and then the light of the moon through the clouds led the way. It was
fantastic.

-- 
Chris Dent  <cdent@burningchrome.com>  http://www.burningchrome.com/~cdent/

-=-=-
Date: Fri, 2 Nov 2001 10:28:09 -0500
From: XXXXXX
To: cdent@burningchrome.com
Subject: Dewey or not?

If people want the location (encoded or not) on the book, put it on the
book - no problem - just don't put it in the book's id

What is your distinction between id and label?

-=-=-
From cdent@burningchrome.com Tue Nov  6 01:08:27 2001
Date: Fri, 2 Nov 2001 14:28:20 -0500 (EST)
From: cdent@burningchrome.com
To: XXXXX
Subject: Re: Dewey or not?

On Fri, 2 Nov 2001 XXXXX wrote:

> If people want the location (encoded or not) on the book, put it on the
> book - no problem - just don't put it in the book's id

Right, that's basically the difference between a label and id in the
way I was saying it.

Unfortunately people seem to want to use the label as the id. For
example, although the cataloging software for the library here at IU
has a title control number which is a unique ID for a resource, its
value is so completely obscured by all kinds of crufty things people
try to do to get to stuff in non-referential ways.

> What is your distinction between id and label?

(Note I'm making this up as I go along)

Several different descriptions:

database:
id is primary key
label is one or more concatenated descriptive fields

categorization:
id is a _reference_ to something which fulfills a strict definition
label is a name of something which approaches some high (but
   undefined) level of typicality of a category (which is itself
   undefinable)

information architecture:
(LIS has this notion of a discipline called information architecture
which has a whole lot to do with wayfinding, navigation, context
generation, signage, etc)
id is a reference to an entity (say the URL of a web document)
label is a name for the entity so someone can identify it (somewhat
   oxymoronic...) (say the words which are the link button, indicating
   (or, ha, identifying) a link)

More generally I'd say what I'm thinking is that an ID is a unique
reference which points to something which fits into a strictly defined
class of entities. In a database you only put something in the books
table if it is a book or you have _declared_ it a book. When it is in
there you need a handle to it, that's the ID.

Labels, on the other hand, are handles to categories of one or more
entities which have been associated for some reason which is
beneficial into a grouping. The label indicates the group. You can
label a database table, but you can also label a bunch of stuff which
sort of, but maybe not completely, fits together well for the sake of
some exercise.

I'm not sure, does that hang together?

I'm potentially trying to shape the world to my brain and not the
other way round, which could be broken.  Or: I feel like I'm spewing a
bunch of stuff that is potentially interesting, or completely booboo,
and I'm not sure which it is.
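
-=-=-

The database reading of the thread above can be sketched with Python's
built-in sqlite3 module. The table layout, column names, and sample call
numbers are hypothetical and not taken from any real catalog. The
primary key is a meaningless integer, the call number is just another
descriptive column, and a loan record refers to the book by id only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,      -- meaningless identifier, never changes
        title TEXT NOT NULL,
        call_number TEXT NOT NULL    -- meaningful label, free to change
    );
    CREATE TABLE loans (
        book_id INTEGER NOT NULL REFERENCES books(id),
        borrower TEXT NOT NULL
    );
    INSERT INTO books (id, title, call_number)
        VALUES (123456789, 'Sample Novel', '823 SAM');
    INSERT INTO loans (book_id, borrower)
        VALUES (123456789, 'patron-1');
""")

# Reclassifying the book changes only the label. Nothing has to cascade,
# because no other table stores the call number.
conn.execute("UPDATE books SET call_number = '813 SAM' WHERE id = 123456789")

row = conn.execute("""
    SELECT books.title, books.call_number, loans.borrower
    FROM loans JOIN books ON books.id = loans.book_id
""").fetchone()
print(row)
```

Had the call number been the primary key, the same reclassification
would have required rewriting every loan record that pointed at it.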


20011106: Jacob & Albrechtsen, Constructing reality...



Jacob, E.K., & Albrechtsen, H. (1997). Constructing reality: the role
     of dialogue in the development of classificatory structures. In
     I.C. McIlwaine (Ed.), Knowledge organization for information retrieval:
     Proceedings of the 6th International Study Conference on
     Classification Research, 14-16 June 1997, London (pp. 42-50). The
     Hague, Netherlands: International Federation of Documentation.

-=-=-

Dovetails nicely with the discussion of ontologies and the semantic
web. Ontologies are epistemes. In the utopian view of the semantic
web, machines will be able to exchange ontologies to combat
heteroglossia. Sounds like dialogue.

Such dialogue, as stated, will need to be in unitary languages or at
least a close approximation. 

I fear there is a danger in the proliferation of unitary languages. If
a language is well defined inference is less fertile. Many a great
idea has come from skimming the connotative effluvia of
misunderstanding. Evolution results from mutation: from error. 

As an aside: this article points out some of the reasons for my
resistance to professionalization: In part a profession is achieved by
the establishment of a well-constructed language. Such a language can
create barriers between those who are considered in the know and those
who aren't. Often this is necessary for safety purposes (doctors) but
in other situations the creation of a well constructed language
appears to be an excuse to write more papers about the domain because
you can't figure out what the domain is (information science). 

(I'm aware of the paradox and irony.)



20011209: Bowker, The ICD as information infrastructure


Bowker, G.C. & Star, S.L. (1999). Chapter 3: The ICD as information
     infrastructure. In _Sorting things out: Classification and its
     consequences_ (pp. 107-133). Cambridge MA: MIT Press.

-=-=-

A whole slew of information on how systems of classification help to
create infrastructure in systems. In there, two items stood out for
me:

Quoting the League of Nations:

  Rather than omit from the beginning all which are not yet
  satisfactory, the authors have hoped, by including them and utilizing
  them for what they are worth, to create a demand for their
  improvement...

This models a solution to a frequent stumbling block for "Information
Architects" in this day and age. So often people want to come up with
a structure before they really know what the resource will be used
for. The search for structure becomes so intense that using the
resource is delayed and delayed until its eventual value is lost.

I advocate, instead, for situations where the structure is not
apparent, the following process:

- get the data
  - if it is already chunked in some fashion, give those chunks unique
    identifiers
- build an information retrieval system that does free text indexing
  to allow string matching
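
A minimal Python sketch of this first stage, under stated assumptions:
the sample chunks are invented, sequential integers stand in for
whatever unique identifier scheme is used, and a simple lowercase word
index stands in for a real free text retrieval system.

```python
import re
from collections import defaultdict
from itertools import count

def assign_ids(chunks):
    """Give each chunk a meaningless unique identifier."""
    ids = count(1)
    return {next(ids): text for text in chunks}

def build_index(docs):
    """Map each lowercase word to the set of chunk ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of chunks that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

# Hypothetical chunks, for illustration only.
docs = assign_ids([
    "Cholera deaths reported in the harbor district",
    "Harbor district census for the year",
    "Notes on reporting causes of death",
])
index = build_index(docs)
print(search(index, "harbor district"))
```

Exact word matching misses near variants ("deaths" will not find
"death"), which is the kind of gap that real searches expose.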

At this stage we now have a semi-useful resource where there was
nothing before. Next:

- as searches reveal user needs:
  - begin tagging resource with metadata
  - and/or reevaluate the chunking of the documents
  - use the metadata to create faceted retrieval systems
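
The second stage might look like the following Python sketch. The facet
names, values, and chunk ids are hypothetical; in practice they would
come from what the searches reveal about user needs.

```python
from collections import Counter

# Metadata added after the fact, keyed by chunk id. All values invented.
metadata = {
    1: {"type": "report", "place": "harbor"},
    2: {"type": "census", "place": "harbor"},
    3: {"type": "report", "place": "inland"},
}

def filter_by_facets(tagged, **wanted):
    """Return ids whose metadata matches every requested facet value."""
    return {
        doc_id
        for doc_id, tags in tagged.items()
        if all(tags.get(facet) == value for facet, value in wanted.items())
    }

def facet_counts(tagged, facet, ids=None):
    """Count the values of one facet, optionally within a result set."""
    ids = tagged.keys() if ids is None else ids
    return Counter(tagged[i][facet] for i in ids if facet in tagged[i])

reports = filter_by_facets(metadata, type="report")
print(reports)
print(facet_counts(metadata, "place", reports))
```

Each facet narrows the result set independently, so new facets can be
added as needs appear without rebuilding what is already there.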

As Wheatley suggested: information is a process that causes
organization. The organizational structures we impose upon
information can be revealed in how we use the information. They are
structures of convenience and as such we must be prepared to undertake
inconvenient work to create them.

There's a law of conservation of convenience in there somewhere.

The second interesting point:

On page 108 the sentence

   No knowledge system exists in a vacuum, it must be rendered
   compatible with other systems.

has been underlined and the comment "Not so!" is nearby. I can't agree
with the comment. What about the knowledge systems of the users and
the organizations that use the systems and within which the system
exists? The original system must be able to interoperate with those.



20011209: Bowker, Classification, coding and coordination


Bowker, G.C. & Star, S.L. (1999). Chapter 4: Classification, coding
     and coordination. In _Sorting things out: Classification and its
     consequences_ (pp. 135-161). Cambridge MA: MIT Press.

-=-=-

Laborious explication of the difficulties of communication between
cultures, including constructed cultures such as the ICD. Difficulties
are very noticeable in efforts (such as the ICD) to systematize what
would be flexible systems of categorization if there were no need for
the classification.

Underscores the notion of the suitably restricted domain discussed by
Suchman when considering the efficacy of interaction between humans
and technology. Technological solutions (of which a classification
system is a type) are only able to interact gracefully with a human or
group of humans if the domain under consideration is suitably
constrained. Constraint in this context applies to both breadth and depth.

The ICD is certainly not very constrained.

There's this ongoing discovery of a boundary between two things that
can be modelled in various ways:

   concept        |  theory
   categorization |  classification
   craft          |  science
   flexibility    |  rigidity
   adaptability   |  precision

Those "two things" are both of value and must be respected in the
design of any information system. Ignoring or deemphasizing either
will result in a failure of the system to be completely effective.

