Clustering Data with Measurement Errors
by
Mahesh Kumar
Massachusetts Institute of Technology
Coauthors: Nitin R. Patel (Sloan School of Management, MIT), James B. Orlin (Sloan School of Management, MIT).
Clustering is a well-studied problem that attempts to group together similar data points from a large set of data. Most traditional clustering work assumes that the data is provided without measurement error. Real-world data, however, usually does contain such errors, and their magnitude can often be estimated using standard statistical methods. In the presence of such errors, popular clustering methods such as k-means and hierarchical clustering may produce unintuitive results. The fundamental question that this talk addresses is: "What is an appropriate clustering method in the presence of measurement errors associated with data?"
We propose using the maximum likelihood principle to obtain an objective criterion for clustering that incorporates information about the measurement errors associated with the data. This criterion provides a basis for several clustering algorithms that generalize the popular k-means and Ward's hierarchical clustering methods. It also has a scale-invariance property, so the clustering results are independent of the measurement units of the data. We also provide a heuristic for obtaining the correct number of clusters, which is itself a challenging problem. Finally, we show the effectiveness of our technique on simulated data, where it outperforms the k-means and hierarchical clustering methods.
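The abstract does not give the criterion itself, so the following is only an illustrative sketch of how a k-means generalization of this kind might look. It assumes independent Gaussian measurement errors with a known variance for every coordinate of every observation; under that assumption the maximum likelihood estimate of a cluster center is the inverse-variance weighted mean of its members, and each point is assigned to the center with the smallest error-scaled squared distance. The function name `error_weighted_kmeans` and its parameters are hypothetical and are not taken from the talk.

```python
import numpy as np


def error_weighted_kmeans(x, var, k, n_iter=50, seed=0):
    """Illustrative k-means variant that weights by measurement error.

    Assumptions (not from the abstract): errors are independent and
    Gaussian, and var[i, j] is the known error variance of x[i, j].
    x and var are (n, d) arrays; returns (labels, centers).
    """
    x = np.asarray(x, dtype=float)
    w = 1.0 / np.asarray(var, dtype=float)  # inverse-variance weights
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()

    def assign(c):
        # Error-scaled squared distance of every point to every center.
        d2 = (w[:, None, :] * (x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(k):
            members = labels == j
            if members.any():
                # Inverse-variance weighted mean of the cluster members.
                new_centers[j] = (w[members] * x[members]).sum(axis=0) / w[members].sum(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(centers), centers
```

Because every squared difference is divided by the corresponding error variance, rescaling a coordinate (and its error variance accordingly) leaves the assignments unchanged in this sketch, which is one way a criterion can be independent of measurement units.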
Date received: October 15, 2002
Copyright © 2002 by the author(s). The author(s) of this document and the organizers of the conference have granted their consent to include this abstract in Atlas Conferences Inc. Document # cais-49.