2002-04-26 01:32:58
2002-04-26 01:31:31


From an ongoing ba-ohs-talk list discussion. Has some relevance for

Chris Dent  <cdent@burningchrome.com>  http://www.burningchrome.com/~cdent/
"Mediocrities everywhere--now and to come--I absolve you all! Amen!"
 -Salieri, in Peter Shaffer's Amadeus

---------- Forwarded message ----------
Date: Thu, 25 Apr 2002 18:43:41 -0500 (EST)
From: cdent@burningchrome.com
Reply-To: ba-ohs-talk@bootstrap.org
To: ba-ohs-talk@bootstrap.org
Subject: Re: Great? idea for improving this list (was Re: [ba-ohs-talk]
    Freezope learning environments)

On Thu, 25 Apr 2002, Peter Jones wrote:

> The other way to do things parallels (I think) some of the
> stuff that Chris Dent has done.

Please note that the project that Kathryn La Barre and I are
working on was started by Kathryn and really comes out of her
brain. I joined in as a technical resource but then found it
so interesting I wanted to be more involved. In case it wandered
into obscurity I'm referring to this:


> 1- Parse the existing archive for terms, recording locations of terms
> 2- Cull out anything useless like stop-words (e.g. 'the', 'and', etc.)
> 3- Parse any new mails against this growing index, recording locations
> of terms
> 4- Check the new terms list every now and again.
> (Repeat 2 as necessary.)
> 5- Make a topic map/semantic net out of the terms if you like,
> for future uses, graphical interfaces, paraphrase searches, whatever...
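
A rough sketch of steps 1 through 4 above, assuming each message is
available as plain text with some identifier. The stop-word list and the
names tokenize, add_message and cull are made up for illustration; none
of them comes from an existing tool.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption); a real one would be longer.
STOP_WORDS = {"the", "and", "a", "of", "to", "is", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z']+", text.lower())

def add_message(index, msg_id, text):
    """Steps 1 and 3: record (msg_id, position) for each non-stop-word term."""
    for position, term in enumerate(tokenize(text)):
        if term not in STOP_WORDS:
            index[term].append((msg_id, position))

def cull(index, useless_terms):
    """Step 2, repeated: drop terms later judged to be useless."""
    for term in useless_terms:
        index.pop(term, None)

index = defaultdict(list)
add_message(index, "msg-1", "The archive and the index of terms")
add_message(index, "msg-2", "Parse new mail against the index")
```

Reviewing sorted(index) every now and again corresponds to step 4; any
term that turns out to be noise can be passed to cull.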

I think there is value to three different methods and it isn't
clear which is best:

  1 fully automated term and cluster generation
  2 automated cluster generation with human labelling to give
    facets to messages
  3 human tagging

I'm not, at this time, aware of a system that will do 1 and label
the clusters by anything other than highest frequency terms. This
has limited value.

2 is what Kathryn and I are working on.

3 seems to be what Alex and others are suggesting.

Doing 3 would make 2 much more valuable: the human tag could be
one of several facets, and cluster membership could be another.

Identification/classification of new messages compared against an
existing archive is possible with a variety of methods. One might
be to create a vector that represents the incoming message and
compare it against vectors that represent pseudo-documents that
represent the prototypes of the already generated clusters. Then
you can say, "It is highly likely that this message is similar to
this cluster" and tag it as such.
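
A minimal sketch of that comparison, assuming plain bag-of-words term
counts and cosine similarity. The helper names and the two toy clusters
are invented for illustration; this is not the code Kathryn and I use.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for one message."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def prototype(messages):
    """Pseudo-document for a cluster: summed term counts of its members.
    Cosine ignores scale, so a sum behaves the same as a mean here."""
    total = Counter()
    for text in messages:
        total.update(vectorize(text))
    return total

def classify(text, prototypes):
    """Return (label, similarity) for the closest cluster prototype."""
    vec = vectorize(text)
    scored = ((label, cosine(vec, proto)) for label, proto in prototypes.items())
    return max(scored, key=lambda pair: pair[1])

# Toy clusters, purely hypothetical.
prototypes = {
    "indexing": prototype(["parse archive terms index", "index new mail terms"]),
    "tagging": prototype(["human keywords tagging", "keywords in message body"]),
}
label, score = classify("add keywords to the message body", prototypes)
```

The score is only a likelihood of similarity, so a threshold below which
a message is left untagged would probably be wise.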

Vector space models like that, though, are very easy to corrupt.
The unrev-ii archive is full of silly little footers from eGroup,
YahooGroups, etc. that can throw off the math (we have most of them
parsed out now). Message length makes a big difference (ours are not
at all consistent in length). A good stopword list is crucial but
is hard to create for a list as wide-ranging as this one and
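
To make the three problems concrete, here is a sketch of one possible
clean-up pass: cut the footer, drop stop words, and scale each vector to
unit length so long messages do not dominate. The footer patterns and
the stop-word list are placeholders, not the ones actually needed for
unrev-ii; those would have to be collected by inspecting the archive.

```python
import math
import re
from collections import Counter

# Hypothetical footer markers (assumptions, not taken from the archive).
FOOTER_PATTERNS = [
    re.compile(r"^-- $"),                           # common signature delimiter
    re.compile(r"^To unsubscribe", re.IGNORECASE),  # typical list footer opener
]
STOP_WORDS = {"the", "a", "and", "of", "to", "is"}  # illustrative only

def strip_footer(body):
    """Keep only the lines before the first line that looks like a footer."""
    kept = []
    for line in body.splitlines():
        if any(pattern.search(line) for pattern in FOOTER_PATTERNS):
            break
        kept.append(line)
    return "\n".join(kept)

def unit_vector(body):
    """Footer-free, stop-word-free term vector scaled to length 1."""
    words = re.findall(r"[a-z]+", strip_footer(body).lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}
```

With unit vectors, a one-line reply and a long essay on the same topic
end up pointing the same way, which is the property a cosine comparison
relies on.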

Kathryn and I will be moving into the next phase of our work in
the middle of May. Comments on directions or tools worth trying
are desired. This message from Peter should help us to focus
somewhat. Others like it would be wonderful.

I very firmly believe that augmentation != automation. If we want
to develop tools that allow us to work better (more effectively,
more efficiently, with more fun) the systems we develop in our
own behaviors for interacting with the tools are as important as,
or more important than, the tools themselves. Alex's idea:


is an excellent way for us to make a slight change in our own
behavior and gain a lot of flexibility in the tools we (the group
at large) are able to develop.

My personal preference would be to do something uncomplicated:
since we do not have aids to help add the keywords, we need to
make the barrier to use as low as possible. Murray's ideas:


are good but I might not do it if I had to type all that.
Something simple like:


would get the ball rolling (bootstrapping, yeah?).

I agree that the keywords should _not_ be in the subject or we
suffer from thread creep in bad mail readers and overly long
subjects. In the body is where we, the people, can put them and
read them. The computers can put them and read them anywhere, so
we may as well put them in the body. For now. In the future there
will be tools that let us do it, do it for us, anywhere in the
message or out of band.

(I imagine a document composer that allows you to compare your
text, prior to delivery, with a large net-wide classification
system, nominating keywords and other identifiers that you could
accept or reject. A system that used vector space style models
would preserve some degree of privacy (depending on how it was
done) because the text itself would not be transmitted to the

It is interesting to note that Usenet news messages have had a
"Keywords:" header for a _long_ time. I don't really see them
actually being used for much, though.
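
For what it's worth, reading such a header takes very little code. The
sketch below uses Python's standard email package; treating the value as
a comma-separated list is my assumption, since posters format the header
in different ways, and the sample message is invented.

```python
from email import message_from_string

def keywords_from_header(raw_message):
    """Return the terms in a Keywords: header, or [] if there is none.
    Assumes the terms are separated by commas."""
    msg = message_from_string(raw_message)
    value = msg.get("Keywords", "")
    return [term.strip().lower() for term in value.split(",") if term.strip()]

# Invented sample message for illustration.
raw = (
    "From: someone@example.org\n"
    "Subject: clustering the archive\n"
    "Keywords: clustering, facets, Stop-Words\n"
    "\n"
    "Body text goes here.\n"
)
terms = keywords_from_header(raw)
```

The resulting list could feed straight into whatever facet or index
structure the archive tools end up using.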

> (I must get around to reading the GATE manual.)

What's this?