K-MODE ALGORITHM FOR CLUSTERING VERY LARGE DATA SETS IN DATA MINING

D. T. VIMALA

Abstract


Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited to this operation because it clusters large data sets efficiently. However, it works only on numeric values, which limits its use in data mining, where data sets often contain categorical values. In this paper we present an algorithm, called k-modes, that extends the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace the means of clusters with modes, and use a frequency-based method to update the modes during clustering so as to minimize the clustering cost function. Tested on the well-known soybean disease data set, the algorithm demonstrated very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
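The three ingredients named in the abstract (a categorical dissimilarity measure, modes in place of means, and a frequency-based mode update) can be illustrated with a minimal sketch. The abstract does not spell out the dissimilarity measures, so the code below assumes simple matching, i.e. the number of attributes on which two records differ. The function names `mismatch`, `column_modes` and `k_modes`, the random initialisation, and the stopping rule are illustrative choices and are not taken from the paper.

```python
import random
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity (assumed here): count of differing attributes."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Frequency-based mode: the most frequent category in each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, max_iter=100, seed=0):
    """Cluster categorical records (tuples of equal length) into k groups.

    Returns (labels, modes, cost), where cost is the total dissimilarity
    between each record and the mode of its cluster.
    """
    rng = random.Random(seed)
    # Assumption: initial modes are k records sampled at random.
    modes = rng.sample(records, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: mismatch(r, modes[j])) for r in records
        ]
        if new_labels == labels:
            break  # no record changed cluster
        labels = new_labels
        # Update step: recompute each mode from attribute frequencies.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    cost = sum(mismatch(r, modes[lab]) for r, lab in zip(records, labels))
    return labels, modes, cost


if __name__ == "__main__":
    # Toy records with three categorical attributes (made up for illustration).
    data = [
        ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
        ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
    ]
    labels, modes, cost = k_modes(data, k=2)
    print(labels, modes, cost)
```

Each iteration makes one pass over the records for assignment and one pass per cluster for the frequency counts, which is in keeping with the scalability the abstract reports; the result still depends on the initial modes, so in practice several seeds would be tried.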


Keywords


k-means, categorical data, soybean disease data, large data sets, clustering.



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2018 INTERNATIONAL EDUCATION AND RESEARCH JOURNAL