From an ongoing ba-ohs-talk list discussion. Has some relevance for unrevdb.

--
Chris Dent  <cdent@burningchrome.com>  http://www.burningchrome.com/~cdent/

"Mediocrities everywhere--now and to come--I absolve you all! Amen!"
  -Salieri, in Peter Shaffer's Amadeus

---------- Forwarded message ----------
Date: Thu, 25 Apr 2002 18:43:41 -0500 (EST)
From: cdent@burningchrome.com
Reply-To: ba-ohs-talk@bootstrap.org
To: ba-ohs-talk@bootstrap.org
Subject: Re: Great? idea for improving this list (was Re: [ba-ohs-talk] Freezope learning environments)

On Thu, 25 Apr 2002, Peter Jones wrote:

> The other way to do things parallels (I think) some of the
> stuff that Chris Dent has done.

Please note that the project Kathryn La Barre and I are working on was started by Kathryn and really comes out of her brain. I joined in as a technical resource but then found it so interesting I wanted to be more involved. In case it wandered into obscurity, I'm referring to this:

http://www.bootstrap.org/lists/ba-ohs-talk/0204/msg00123.html

> 1- Parse the existing archive for terms, recording locations of terms
> 2- Cull out anything useless like stop-words (e.g. 'the', 'and', etc.)
> 3- Parse any new mails against this growing index, recording locations
>    of terms
> 4- Check the new terms list every now and again.
>    (Repeat 2 as necessary.)
>
> 5- Make a topic map/semantic net out of the terms if you like,
>    for future uses, graphical interfaces, paraphrase searches, whatever...

I think there is value in three different methods, and it isn't clear which is best:

1 fully automated term and cluster generation
2 automated cluster generation with human labelling to give facets to messages
3 human tagging

I'm not, at this time, aware of a system that will do 1 and label the clusters by anything other than highest-frequency terms. This has limited value. 2 is what Kathryn and I are working on. 3 seems to be what Alex and others are suggesting. Doing 3 would make 2 much more valuable: the human tag could be one of several facets. Cluster membership could be another.

Identification/classification of new messages against an existing archive is possible with a variety of methods. One might be to create a vector that represents the incoming message and compare it against vectors for pseudo-documents representing the prototypes of the already generated clusters. Then you can say, "It is highly likely that this message is similar to this cluster," and tag it as such.

Vector space models like that, though, are very easy to corrupt. The unrev-ii archive is full of silly little footers from eGroups, YahooGroups, etc. that can throw off the math (we have most of them parsed out now). Length makes a big difference (we don't have consistent lengths at all). A good stopword list is crucial but is hard to create for a list as wide-ranging as this one and unrev-ii.

Kathryn and I will be moving into the next phase of our work in the middle of May. Comments on directions or tools worth trying are desired. This message from Peter should help us focus somewhat; others like it would be wonderful.

I very firmly believe that augmentation != automation.
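A rough sketch of that cluster comparison, assuming plain bag-of-words vectors, cosine similarity, a toy stopword list, and made-up prototype text and threshold; none of those details are prescribed above:

import math
import re
from collections import Counter

# Toy stopword list; a real one would be much longer and, as noted above,
# hard to get right for a wide-ranging list.
STOPWORDS = {"the", "and", "a", "of", "to", "in", "is", "that", "it", "for"}

def vectorize(text):
    """Turn a message body into a bag-of-words vector, dropping stopwords."""
    terms = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in terms if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_cluster(message, prototypes, threshold=0.2):
    """Compare an incoming message against cluster prototype pseudo-documents;
    return the most similar cluster label, or None if nothing is close enough."""
    vec = vectorize(message)
    score, label = max((cosine(vec, vectorize(text)), name)
                       for name, text in prototypes.items())
    return label if score >= threshold else None

# Hypothetical prototypes, e.g. concatenations of the messages in each cluster.
prototypes = {
    "archive-access": "parse archive index terms locations search retrieval",
    "vector-space": "vector cosine similarity term weight cluster document",
}
print(best_cluster("How should we weight terms in the document vectors?",
                   prototypes))  # -> vector-space

In practice the prototypes would be built from the messages already assigned to each cluster, and some kind of term weighting or length normalization would be needed to cope with the footer and message-length problems mentioned above.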
If we want to develop tools that allow us to work better (more effectively, more efficiently, with more fun), the systems we develop in our own behaviors for interacting with the tools are as important as, or more important than, the tools themselves.

Alex's idea:

http://www.bootstrap.org/lists/ba-ohs-talk/0204/msg00120.html

is an excellent way for us to make a slight change in our own behavior and gain a lot of flexibility in the tools we (the group at large) are able to develop.

My personal preference would be to do something uncomplicated: since we do not have aids to help add the keywords, we need to make the barrier to use as low as possible. Murray's ideas:

http://www.bootstrap.org/lists/ba-ohs-talk/0204/msg00180.html#nid04

are good, but I might not do it if I had to type all that. Something simple like:

  [archive-access,vector-space]

would get the ball rolling (bootstrapping, yeah?).

I agree that the keywords should _not_ be in the subject, or we suffer from thread creep in bad mail readers and overly long subjects. The body is where we, the people, can put them and read them. The computers can put them and read them anywhere, so we may as well put them in the body. For now.

In the future there will be tools that let us do it, or do it for us, anywhere in the message or out of band. (I imagine a document composer that allows you to compare your text, prior to delivery, with a large net-wide classification system, nominating keywords and other identifiers that you could accept or reject. A system that used vector-space-style models would preserve some degree of privacy (depending on how it was done) because the text itself would not be transmitted to the service.)

It is interesting to note that Usenet news messages have had a "Keywords:" header for a _long_ time. I don't recall them actually being used for much, though.

> (I must get around to reading the GATE manual.)

What's this?
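A minimal sketch of how an archive parser might pick up the bracketed keyword convention above. The [tag,tag] line format is the one proposed in the message; the function name and the choice to skip quoted lines are my own assumptions:

import re

# A standalone line of the form [tag,other-tag], as proposed above.
TAG_LINE = re.compile(r"^\s*\[([a-z0-9-]+(?:\s*,\s*[a-z0-9-]+)*)\]\s*$")

def extract_keywords(body):
    """Return keyword tags found on standalone [tag,tag] lines in a message body."""
    keywords = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):  # skip quoted material
            continue
        match = TAG_LINE.match(line)
        if match:
            keywords.extend(t.strip() for t in match.group(1).split(","))
    return keywords

sample = """I agree that keywords belong in the body, not the subject.

[archive-access,vector-space]
"""
print(extract_keywords(sample))  # ['archive-access', 'vector-space']

Tags collected this way could then serve as one facet on a message, alongside automatically generated cluster membership as another.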