

Update on unrevdb activity, related to clustering.

---------- Forwarded message ----------
Date: Mon, 1 Jul 2002 22:43:02 -0500 (EST)
From: cdent@burningchrome.com
Subject: 594 related: error in clustering database structure and clustering
    in general

(John included here for the discussion of Ward's cluster
analysis, near the end.)

(This message will be going into warp (my home page) under the
uvizjournal word.)

I've spent some time this evening staring at the database schema
and discovered that there is an error in the way I originally
intended to store and import clusters. This is not a major
setback; it just requires some tweaking of the database and a
little rethinking. I'll lay out that thinking here so that:

 a: you guys know that I'm up to something
 b: Kathryn, you can comment if you see any flaws
 c: John, so you have some context

The current scheme thinks only of clusters and the messages that
are members of those clusters. That's only half the story. While
an individual cluster is made up of one or more messages, that
cluster is also one of many clusters that together comprise the
entire set of clusters (representing the entire data sample)
produced by processing in a certain style.

As currently configured, the database records only similarity
within small groups. It has no way of saying, "these other
documents, not in that group, are in these other groups, created
by the same process."

To record that, we have two related options:

- we need to record a cluster slice, cluster membership in that
  slice, and message membership in the cluster
- to make things more complete, we could or should record the
  cluster hierarchy:
  - cluster slices are members of a cluster hierarchy
  - an optimal cluster slice is the one that has the greatest
    similarity inside clusters and the greatest dis-similarity
    between clusters
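
One rough way to compare slices along these lines is sketched
below in Python. The scoring rule (mean between-cluster distance
minus mean within-cluster distance), the sample points, and the
names slice_score and candidates are assumptions made for
illustration; none of them come from the existing tools.

```python
from itertools import combinations
from math import dist


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def slice_score(points, clusters):
    """Crude quality score for one slice (an assumed measure).

    points   -- dict mapping message id to a coordinate tuple
    clusters -- list of lists of message ids (one slice)
    Higher is better: members are close to each other and far
    from the members of other clusters.
    """
    label = {msg: i for i, members in enumerate(clusters) for msg in members}
    within, between = [], []
    for a, b in combinations(sorted(label), 2):
        d = dist(points[a], points[b])
        (within if label[a] == label[b] else between).append(d)
    return mean(between) - mean(within)


# Hypothetical sample: four messages forming two obvious groups.
points = {1: (0, 0), 2: (0, 1), 3: (10, 0), 4: (10, 1)}
candidates = {
    "one": [[1, 2, 3, 4]],
    "two": [[1, 2], [3, 4]],
    "four": [[1], [2], [3], [4]],
}
best = max(candidates, key=lambda name: slice_score(points, candidates[name]))
print(best)
```

A real measure would probably work on message similarity rather
than plane coordinates, but the comparison across slices would
have the same shape.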

Recording hierarchy membership adds a bit of complexity but is not
outrageous. It may not always be necessary, as some clustering
methods may not provide hierarchies, only the best slice.
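
As a sketch of what such a structure might look like, here is one
possible layout using Python's sqlite3 module. All table and
column names are hypothetical, and SQLite is only a stand-in for
whatever the real database is. The hierarchy reference on a slice
is nullable, to cover methods that give only the best slice.

```python
import sqlite3

# Hypothetical schema: hierarchy -> slice -> cluster -> message.
SCHEMA = """
CREATE TABLE hierarchy (
    id     INTEGER PRIMARY KEY,
    method TEXT NOT NULL              -- e.g. 'ward'
);
CREATE TABLE slice (
    id           INTEGER PRIMARY KEY,
    hierarchy_id INTEGER REFERENCES hierarchy(id),  -- NULL if no hierarchy
    level        INTEGER              -- position within the hierarchy
);
CREATE TABLE cluster (
    id       INTEGER PRIMARY KEY,
    slice_id INTEGER NOT NULL REFERENCES slice(id)
);
CREATE TABLE cluster_message (
    cluster_id INTEGER NOT NULL REFERENCES cluster(id),
    message_id INTEGER NOT NULL,
    PRIMARY KEY (cluster_id, message_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample data: one hierarchy, one slice, two clusters.
conn.execute("INSERT INTO hierarchy (id, method) VALUES (1, 'ward')")
conn.execute("INSERT INTO slice (id, hierarchy_id, level) VALUES (1, 1, 2)")
conn.executemany("INSERT INTO cluster (id, slice_id) VALUES (?, ?)",
                 [(1, 1), (2, 1)])
conn.executemany(
    "INSERT INTO cluster_message (cluster_id, message_id) VALUES (?, ?)",
    [(1, 101), (1, 102), (2, 103)])


def sibling_clusters(conn, cluster_id):
    """Ids of the other clusters created by the same process (same slice)."""
    rows = conn.execute(
        "SELECT c2.id FROM cluster c1 "
        "JOIN cluster c2 ON c2.slice_id = c1.slice_id "
        "WHERE c1.id = ? AND c2.id != ? ORDER BY c2.id",
        (cluster_id, cluster_id)).fetchall()
    return [row[0] for row in rows]


print(sibling_clusters(conn, 1))
```

The sibling_clusters query answers the question the current
scheme cannot: given one group, which other groups came out of
the same run.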

We need to settle on an import format for getting clusters into
the database. Since we don't know what the database will look
like, nor what the tools will output, we'll have to wait on that.
I'm in the process of building R on hot, the machine on which the
database interface (but not the database itself) lives.

I went looking for some references to Ward's clustering methods. I
could not find the Powers article, but found a few other things
that seem relevant. As is usually the case, the amount of stuff
to read is huge. The refs are listed below:

A simple overview of various clustering algorithms:


A description of the algorithm, for a study of water sources:


A company that sells cluster analysis software:


This (chemistry) paper includes the reference to Ward's original paper:


Document clustering for electronic meetings: an experimental
comparison of two techniques

Abstract: In this article, we report our implementation and
comparison of two text clustering techniques. One is based on
Ward's clustering and the other on Kohonen's Self-organizing
Maps. We have evaluated how closely clusters produced by a
computer resemble those created by human experts. We have also
measured the time that it takes for an expert to "clean up" the
automatically produced clusters. The technique based on Ward's
clustering was found to be more precise. Both techniques have
worked equally...


The R language reference discusses the hclust method (which can
use Ward's) and provides some references at:


This page has some sample R code that uses hclust:
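
Separately from the R sample linked above, here is a rough
plain-Python sketch of the idea behind Ward's method: start with
every item in its own cluster and repeatedly merge the pair whose
merger adds the least to the within-cluster sum of squares. Each
step yields one slice of the hierarchy. This is not the hclust
implementation and is not taken from any of the references; the
function names and sample points are made up for illustration.

```python
from itertools import combinations


def centroid(members):
    """Coordinate-wise mean of a list of points."""
    n = len(members)
    return tuple(sum(axis) / n for axis in zip(*members))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b merge."""
    ca, cb = centroid(a), centroid(b)
    squared = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared


def ward_slices(points):
    """Return every slice, from all singletons down to one cluster."""
    clusters = [[p] for p in points]
    slices = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Pick the cheapest pair to merge under Ward's criterion.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        slices.append([list(c) for c in clusters])
    return slices


# Hypothetical sample points forming two obvious groups.
sample = [(0, 0), (0, 1), (10, 0), (10, 1)]
hierarchy = ward_slices(sample)
for level, clusters in enumerate(hierarchy):
    print(level, clusters)
```

This brute-force version recomputes every pairwise cost at each
step, so it is only suitable for small samples.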

