OPTIMAL NUMBER OF CLUSTERS
Clustering is an important analysis method in data analytics and pattern recognition. It divides the data into groups without supervision or external labels, and it is a subjective analysis because the definition of a cluster is context dependent. For this reason many algorithms, such as k-means, require the number of clusters to be fixed a priori. Each clustering algorithm depends on a distance metric to identify different groups in the data. Once the number of centers is fixed, each algorithm tries to find the best separation according to its distance measure by using an optimization algorithm. The distance metric determines the shape of the clusters generated. There are algorithms, such as Ward's method, that determine how many clusters a data set contains, but they depend on the same distance metrics, and many of these metrics, such as the Euclidean distance and its derivatives, generate hyperellipsoidal clusters and fail on data whose clusters are nonlinearly shaped. Another approach, which is computationally expensive, is to run a specific algorithm for different numbers of cluster centers and try to choose the best number. In this paper, we attempt to analyze the number of clusters using a previously developed information-theoretic metric called CEF which, in its original use, can separate nonlinear clusters. Data points that are more similar to each other are incrementally joined together using a distance measure, so that at each step the joined points form a subcluster that is compared against the rest of the data. The operation continues until all data elements are consumed. The CEF metric is used to measure the distance between the obtained clusters, where peaks in the measure indicate strong cluster separation. The method is tested on several artificial and real data sets, and its advantages and disadvantages are discussed.
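The incremental joining procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that points are joined by nearest-neighbor growth from a seed point, and that the between-cluster measure is the negative logarithm of the mean Gaussian kernel value over all cross-cluster pairs, a stand-in in the spirit of CEF whose exact form may differ from the published metric. The names `cross_kernel`, `separation_profile`, `sigma`, and `seed` are hypothetical.

```python
import numpy as np


def cross_kernel(a, b, sigma):
    """Mean Gaussian kernel value over all pairs (one point from a, one from b).

    Assumed stand-in for a CEF-like overlap; the kernel variance is 2*sigma**2
    and the normalization constant is omitted.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return float(np.exp(-d2 / (4.0 * sigma ** 2)).mean())


def separation_profile(x, sigma=1.0, seed=0):
    """Grow a joined set from `seed`, one nearest point at a time.

    Returns (profile, joined): profile[k] is the separation between the joined
    set of size k + 2 and the remaining points; joined is the joining order.
    """
    n = len(x)
    joined = [seed]
    rest = set(range(n)) - {seed}
    # dist[i] = distance from point i to its nearest already-joined point
    dist = np.linalg.norm(x - x[seed], axis=1)
    profile = []
    while rest:
        rest_idx = np.fromiter(rest, dtype=int)
        nxt = int(rest_idx[np.argmin(dist[rest_idx])])
        joined.append(nxt)
        rest.remove(nxt)
        dist = np.minimum(dist, np.linalg.norm(x - x[nxt], axis=1))
        if rest:
            overlap = cross_kernel(x[joined], x[list(rest)], sigma)
            # Assumption: -log(overlap) so that peaks mean strong separation.
            profile.append(-np.log(max(overlap, 1e-300)))
    return np.array(profile), joined


# Usage on synthetic data: two well-separated Gaussian blobs of 20 points each.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
profile, joined = separation_profile(x, sigma=0.5, seed=0)
split_size = int(np.argmax(profile)) + 2  # joined-set size at the peak
print("peak separation when the joined set holds", split_size, "points")
```

In this sketch the peak of the profile marks the step at which the joined set coincides with one blob. The kernel width `sigma` is a free parameter here, and the result can be sensitive to its value.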