uvizjournal

Update on unrevdb activity, related to clustering.

---------- Forwarded message ----------
Date: Mon, 1 Jul 2002 22:43:02 -0500 (EST)
From: cdent@burningchrome.com
Subject: 594 related: error in clustering database structure and clustering in general

(John included here for the discussion of Ward's cluster analysis, near the end.)

(This message will be going into warp (my home page) under the uvizjournal word.)

I've spent some time this evening staring at the database schema and discovered that there is an error in the way I originally intended to store and import clusters. This is not a major setback; it just requires some tweaking of the database and a little rethinking. I'll lay out that thinking here so that:

a: you guys know that I'm up to something
b: Kathryn, you can comment if you see any flaws
c: John, you have some context

The current scheme thinks of clusters and of messages that are members of those clusters. That's only half the story. While an individual cluster is made up of one or more messages, that cluster is one of many clusters that together comprise the entire set of clusters (representing the entire data sample) that were processed in a certain style. As currently configured, what is being recorded is only similarity within small groups, with no way of saying, "these other documents, not in that group, are in these other groups, created by the same process."
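To make the gap concrete, here is a minimal sketch using Python's sqlite3 module. Every table and column name in it is hypothetical; this is not the actual unrevdb schema. It assumes one extra table, called slice here, that groups all the clusters produced by one pass over the data sample, so that a query can answer "which other documents are in the other groups created by the same process?"

```python
import sqlite3

# Hypothetical schema: names are illustrative only, not the real unrevdb layout.
# SQLite only enforces REFERENCES when PRAGMA foreign_keys is on; here the
# clauses simply document the intended relationships.
SCHEMA = """
CREATE TABLE slice (
    slice_id INTEGER PRIMARY KEY,
    method   TEXT NOT NULL          -- how the whole sample was processed
);
CREATE TABLE cluster (
    cluster_id INTEGER PRIMARY KEY,
    slice_id   INTEGER NOT NULL REFERENCES slice(slice_id)
);
CREATE TABLE cluster_message (
    cluster_id INTEGER NOT NULL REFERENCES cluster(cluster_id),
    message_id INTEGER NOT NULL,
    PRIMARY KEY (cluster_id, message_id)
);
"""


def build_example():
    """Create an in-memory database holding one slice with two clusters."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO slice VALUES (1, 'ward')")
    conn.executemany("INSERT INTO cluster VALUES (?, ?)", [(10, 1), (11, 1)])
    conn.executemany(
        "INSERT INTO cluster_message VALUES (?, ?)",
        [(10, 101), (10, 102), (11, 103)],
    )
    return conn


def sibling_messages(conn, message_id):
    """Return (cluster_id, message_id) pairs from the *other* clusters
    that belong to the same slice as the given message's cluster."""
    rows = conn.execute(
        """
        SELECT other.cluster_id, other.message_id
        FROM cluster_message AS mine
        JOIN cluster AS c1 ON c1.cluster_id = mine.cluster_id
        JOIN cluster AS c2 ON c2.slice_id = c1.slice_id
                          AND c2.cluster_id != c1.cluster_id
        JOIN cluster_message AS other ON other.cluster_id = c2.cluster_id
        WHERE mine.message_id = ?
        ORDER BY other.cluster_id, other.message_id
        """,
        (message_id,),
    )
    return rows.fetchall()
```

Without the slice_id column on cluster, the join between c1 and c2 has nothing to match on, which is the half of the story the current scheme misses.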
To record that, we have two related options:

- we need to record a cluster slice, cluster membership in that slice, and message membership in the cluster
- to make things more complete, we could or should record the cluster hierarchy:
  - cluster slices are members of a cluster hierarchy
  - an optimal cluster slice is the one that has the greatest similarity inside clusters and the greatest dissimilarity between clusters

Recording hierarchy membership adds a bit of complexity but is not outrageous. It may not always be necessary, as some clustering methods may not provide hierarchies, only the best slice.

We need to settle on an import format for getting clusters into the database. Since we don't yet know what the database will look like, or what the tools will output, we'll have to wait on that.

I'm in the process of building R for the machine on which the database interface (but not the database itself) lives (hot).

I went looking for some references to Ward's clustering methods. I could not find the Powers article, but found a few other things that seem relevant. As is usually the case, the amount of stuff to read is huge. The refs are listed below:

A simple overview of various clustering algorithms:
http://obelia.jde.aca.mmu.ac.uk/multivar/ca_alg.htm

A description of the algorithm, for a study of water sources:
http://www.kgs.ukans.edu/Dakota/vol1/geo/hodge4.htm

A company that sells cluster analysis software:
http://www.clustan.com/index.html

This (chemistry) paper includes the reference to Ward's original paper:
http://www.jchem.com/doc/admin/Ward.html#wardpaper

Document clustering for electronic meetings: an experimental comparison of two techniques
Abstract: In this article, we report our implementation and comparison of two text clustering techniques. One is based on Ward's clustering and the other on Kohonen's Self-organizing Maps. We have evaluated how closely clusters produced by a computer resemble those created by human experts.
We have also measured the time that it takes for an expert to "clean up" the automatically produced clusters. The technique based on Ward's clustering was found to be more precise. Both techniques have worked equally...
http://citeseer.nj.nec.com/450890.html

The R language reference discusses the hclust method (which can use Ward's) and provides some references at:
http://stat.ethz.ch/R-alpha/R-patched/library/mva/html/hclust.html

This page has some sample R code that uses hclust:
http://statwww.epfl.ch/davison/teaching/Microarrays/lab/classification.html
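For context on how a hierarchy of cluster slices comes out of Ward's method, here is a naive pure-Python sketch. It is not R's hclust and not the code used for this project. The function names and the representation of a slice as a list of index lists are assumptions made for illustration. At each step it merges the pair of clusters whose merge adds the least to the within-cluster sum of squares (Ward's minimum-variance criterion) and records the resulting slice.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if point sets a and b merge.

    For Ward's criterion this equals
    (|a| * |b| / (|a| + |b|)) * squared distance between the two centroids.
    """
    na, nb = len(a), len(b)
    ca = [sum(col) / na for col in zip(*a)]
    cb = [sum(col) / nb for col in zip(*b)]
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * dist2


def ward_slices(points):
    """Return every slice of the hierarchy, from singletons to one cluster.

    Each slice is a list of clusters; each cluster is a sorted list of
    indices into `points`. Naive search over all pairs at every step,
    so only suitable for small samples.
    """
    clusters = [[i] for i in range(len(points))]
    slices = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Pick the pair of clusters that is cheapest to merge.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(
                [points[k] for k in clusters[ij[0]]],
                [points[k] for k in clusters[ij[1]]],
            ),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        slices.append([list(c) for c in clusters])
    return slices


# Four made-up 2-D points forming two obvious groups.
example_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
example_slices = ward_slices(example_points)
```

Here example_slices holds four slices, and the one with two clusters groups points 0 and 1 together and points 2 and 3 together. Each entry corresponds to one "cluster slice" in the sense used in the message, and the whole list to the hierarchy.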