Algorithms for Hierarchical Clustering: An Overview, II

Murtagh, Fionn and Contreras, Pedro (2017) Algorithms for Hierarchical Clustering: An Overview, II. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7 (6). e1219. ISSN 1942-4795

PDF (Submitted version, accepted.) - Accepted Version
Available under License Creative Commons Attribution No Derivatives.


We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations available in R and other software environments. We look at hierarchical self-organizing maps and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed, very efficient (linear-time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. This review extends the earlier version, Murtagh and Contreras (2012).

Item Type: Article
Additional Information: Article ID: WIDM1219
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
Z Bibliography. Library Science. Information Resources > ZA Information resources
Schools: School of Computing and Engineering
References:
1. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998, 94–105.
2. Albatineh AN, Niewiadomska-Bugaj M, Mihalko D. On similarity indices and correction for chance agreement. Journal of Classification 2006, 23(2): 301–313.
3. Anderberg MR. Cluster Analysis for Applications. Academic Press, New York, 1973.
4. Ankerst M, Breunig M, Kriegel H, Sander J. OPTICS: Ordering points to identify the clustering structure. In: ACM SIGMOD International Conference on Management of Data. ACM Press, 1999, 49–60.
5. Bader DA, Meyerhenke H, Sanders P, Wagner D. Graph Partitioning and Graph Clustering. Contemporary Mathematics Vol. 588, American Mathematical Society, Providence, RI, 2013.
6. Bécue-Bertaut M, Kostov B, Morin A, Naro G. Rhetorical strategy in forensic speeches: Multidimensional statistics based methodology. Journal of Classification 2014, 31: 85–106.
7. Benzécri JP. L’Analyse des Données. I. La Taxinomie, 3rd ed. Dunod, Paris, 1979.
8. Blashfield RK, Aldenderfer MS. The literature on cluster analysis. Multivariate Behavioral Research 1978, 13: 271–295.
9. Bolshakova N, Azuaje F. Cluster validation techniques for genome expression data. Signal Processing 2003, 83(4): 825–833.
10. Bruynooghe M. Méthodes nouvelles en classification automatique des données taxinomiques nombreuses. Statistique et Analyse des Données 1977, no. 3: 24–42.
11. Chang J-W, Jin D-S. A new cell-based clustering method for large, high-dimensional data in data mining applications. In: SAC ’02: Proceedings of the 2002 ACM Symposium on Applied Computing. New York: ACM, 2002, 503–507.
12. Dash M, Liu H, Xu X. 1 + 1 > 2: Merging distance and density based clustering. In: DASFAA ’01: Proceedings of the 7th International Conference on Database Systems for Advanced Applications. Washington, DC: IEEE Computer Society, 2001, 32–39.
13. Dash M, Liu H, Scheuermann P, Lee Tan K. Fast hierarchical clustering and its validation. Data and Knowledge Engineering 2003, 44(1): 109–138.
14. Day WHE, Edelsbrunner H. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification 1984, 1: 7–24.
15. Defays D. An efficient algorithm for a complete link method. Computer Journal 1977, 20: 364–366.
16. de Rham C. La classification hiérarchique ascendante selon la méthode des voisins réciproques. Les Cahiers de l’Analyse des Données 1980, V: 135–144.
17. Deza MM, Deza E. Encyclopedia of Distances. Springer, Berlin, 2009.
18. Dittenbach M, Rauber A, Merkl D. Uncovering the hierarchical structure in data using the growing hierarchical self-organizing map. Neurocomputing 2002, 48(1–4): 199–216.
19. Endo M, Ueno M, Tanabe T. A clustering method using hierarchical self-organizing maps. Journal of VLSI Signal Processing 2002, 32: 105–118.
20. Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996, 226–231.
21. Gan G, Ma C, Wu J. Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics (SIAM), 2007.
22. Gillet VJ, Wild DJ, Willett P, Bradshaw J. Similarity and dissimilarity methods for processing chemical structure databases. Computer Journal 1998, 41: 547–558.
23. Gondran M. Valeurs propres et vecteurs propres en classification hiérarchique. RAIRO Informatique Théorique 1976, 10(3): 39–46.
24. Gordon AD. Classification. Chapman and Hall, London, 1981.
25. Gordon AD. A review of hierarchical classification. Journal of the Royal Statistical Society A 1987, 150: 119–137.
26. Grabusts P, Borisov A. Using grid-clustering methods in data classification. In: PARELEC ’02: Proceedings of the International Conference on Parallel Computing in Electrical Engineering. Washington, DC: IEEE Computer Society, 2002.
27. Graham RL, Hell P. On the history of the minimum spanning tree problem. Annals of the History of Computing 1985, 7: 43–57.
28. Griffiths A, Robinson LA, Willett P. Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation 1984, 40: 175–205.
29. Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems 2001, 17(2–3): 107–145.
30. Halkidi M, Batistakis Y, Vazirgiannis M. Cluster validity methods: part I. ACM SIGMOD Record 2002, 31(2): 40–45.
31. Hinneburg A, Keim DA. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998, 58–68.
32. Hinneburg A, Keim D. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In: VLDB ’99: Proceedings of the 25th International Conference on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1999, 506–517.
33. Jain AK, Dubes RC. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, 1988.
34. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Computing Surveys 1999, 31: 264–323.
35. Janowitz MF. Ordinal and Relational Clustering. World Scientific, Singapore, 2010.
36. Juan J. Programme de classification hiérarchique par l’algorithme de la recherche en chaîne des voisins réciproques. Les Cahiers de l’Analyse des Données 1982, VII: 219–225.
37. Kohonen T. Self-Organization and Associative Memory. Springer, Berlin, 1984.
38. Kohonen T. Self-Organizing Maps, 3rd edn. Springer, Berlin, 2001.
39. Lampinen J, Oja E. Clustering properties of hierarchical self-organizing maps. Journal of Mathematical Imaging and Vision 1992, 2: 261–272.
40. Legendre P, Legendre L. Numerical Ecology, 3rd edn. Elsevier, Amsterdam, 2012.
41. Lerman IC. Classification et Analyse Ordinale des Données. Dunod, Paris, 1981.
42. Le Roux B, Rouanet H. Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis. Kluwer, Dordrecht, 2004.
43. von Luxburg U. A tutorial on spectral clustering. Statistics and Computing 2007, 17(4): 395–416.
44. March ST. Techniques for structuring database records. ACM Computing Surveys 1983, 15: 45–79.
45. Miikkulainen R. Script recognition with hierarchical feature maps. Connection Science 1990, 2: 83–101.
46. Mirkin B. Mathematical Classification and Clustering. Kluwer, Dordrecht, 1996.
47. Murtagh F. A survey of recent advances in hierarchical clustering algorithms. Computer Journal 1983, 26: 354–359.
48. Murtagh F. Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics Quarterly 1984, 1: 101–113.
49. Murtagh F. Multidimensional Clustering Algorithms. Physica-Verlag, Würzburg, 1985.
50. Murtagh F. Correspondence Analysis and Data Coding with Java and R. Chapman and Hall, Boca Raton, 2005.
51. Murtagh F. The Haar wavelet transform of a dendrogram. Journal of Classification 2007, 24: 3–32.
52. Murtagh F. Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics. Chapman and Hall/CRC Press, Boca Raton, FL, 2017.
53. Murtagh F, Hernández-Pajares M. The Kohonen self-organizing map method: an assessment. Journal of Classification 1995, 12: 165–190.
54. Murtagh F, Raftery AE, Starck JL. Bayesian inference for multiband image segmentation via model-based clustering trees. Image and Vision Computing 2005, 23: 587–596.
55. Murtagh F, Ganz A, McKie S. The structure of narrative: the case of film scripts. Pattern Recognition 2009, 42: 302–312.
56. Murtagh F, Downs G, Contreras P. Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding. SIAM Journal on Scientific Computing 2008, 30(2): 707–730.
57. Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2012, 2(1): 86–97.
58. Murtagh F, Legendre P. Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification 2014, 31: 274–295.
59. Park NH, Lee WS. Statistical grid-based clustering over data streams. SIGMOD Record 2004, 33(1): 32–37.
60. Rapoport A, Fillenbaum S. An experimental study of semantic structures. In: Romney AK, Shepard RN, Nerlove SB, eds., Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, Vol. 2, Applications. Seminar Press, New York, 1972, 93–131.
61. Rohlf FJ. Algorithm 76: Hierarchical clustering using the minimum spanning tree. Computer Journal 1973, 16: 93–95.
62. Sander J, Ester M, Kriegel H-P, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery 1998, 2(2): 169–194.
63. Schikuta E. Grid-clustering: An efficient hierarchical clustering method for very large data sets. In: ICPR ’96: Proceedings of the 13th International Conference on Pattern Recognition. Washington, DC: IEEE Computer Society, 1996, 101–105.
64. Sheikholeslami G, Chatterjee S, Zhang A. Wavecluster: a wavelet based clustering approach for spatial data in very large databases. The VLDB Journal 2000, 8(3–4): 289–304.
65. Sibson R. SLINK: an optimally efficient algorithm for the single link cluster method. Computer Journal 1973, 16: 30–34.
66. Sneath PHA, Sokal RR. Numerical Taxonomy. Freeman, San Francisco, 1973.
67. Tino P, Nabney I. Hierarchical GTM: constructing localized non-linear projection manifolds in a principled way. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(5): 639–656.
68. van Rijsbergen CJ. Information Retrieval, 2nd ed. Butterworths, London, 1979.
69. Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning. New York, NY: ACM, 2009, 1073–1080.
70. Wang L, Wang Z-O. CUBN: a clustering algorithm based on density and distance. In: Proceedings of the 2003 International Conference on Machine Learning and Cybernetics. IEEE Press, 2003, 108–112.
71. Wang W, Yang J, Muntz R. STING: A statistical information grid approach to spatial data mining. In: VLDB ’97: Proceedings of the 23rd International Conference on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1997, 186–195.
72. Wang Y, Freedman MI, Kung S-Y. Probabilistic principal component subspaces: A hierarchical finite mixture model for data visualization. IEEE Transactions on Neural Networks 2000, 11(3): 625–636.
73. White HD, McCain KW. Visualization of literatures. In: Williams ME, ed., Annual Review of Information Science and Technology (ARIST) 1997, 32: 99–168.
74. Vicente D, Vellido A. Review of hierarchical models for data clustering and visualization. In: Giráldez R, Riquelme JC, Aguilar-Ruiz JS, eds., Tendencias de la Minería de Datos en España. Red Española de Minería de Datos, 2004.
75. Willett P. Efficiency of hierarchic agglomerative clustering using the ICL distributed array processor. Journal of Documentation 1989, 45: 1–45.
76. Wishart D. Mode analysis: a generalization of nearest neighbour which reduces chaining effects. In: Cole AJ, ed., Numerical Taxonomy. Academic Press, New York, 1969, 282–311.
77. Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks 2005, 16: 645–678.
78. Xu R, Wunsch DC. Clustering. IEEE Computer Society Press, 2008.
79. Xu X, Ester M, Kriegel H-P, Sander J. A distribution-based clustering algorithm for mining in large spatial databases. In: ICDE ’98: Proceedings of the Fourteenth International Conference on Data Engineering. Washington, DC: IEEE Computer Society, 1998, 324–331.
80. Xu X, Jäger J, Kriegel H-P. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery 1999, 3(3): 263–290.
81. Zaïane OR, Lee C-H. Clustering spatial data in the presence of obstacles: a density-based approach. In: IDEAS ’02: Proceedings of the 2002 International Symposium on Database Engineering and Applications. Washington, DC: IEEE Computer Society, 2002, 214–223.
82. Zhongheng Zhang, Murtagh F, Van Poucke S, Su Lin, Peng Lan. Hierarchical cluster analysis in clinical research with heterogeneous study population: highlighting its visualization with R. Annals of Translational Medicine 2017, 5(4).
Depositing User: Fionn Murtagh
Date Deposited: 09 Aug 2017 14:03
Last Modified: 28 Aug 2021 15:46


University of Huddersfield, Queensgate, Huddersfield, HD1 3DH