Clustering by Compression

Type: Article
Publication Date: 2005-04-01
Citations: 1187
DOI: https://doi.org/10.1109/tit.2005.844059

Abstract

We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.

Locations

  • arXiv (Cornell University)
  • IEEE Transactions on Information Theory

Ask a Question About This Paper

Summary

Login to see paper summary

Similar Works

Action Title Date Authors
Clustering by compression 2003-01-01 Rudi Cilibrasi Paul Vitányi
Normalized Information Distance is Not Semicomputable 2010-01-01 Sebastiaan A. Terwijn Leen Torenvliet Paul Vitányi
Normalized Information Distance is Not Semicomputable 2010-01-01 Sebastiaan A. Terwijn Leen Torenvliet Paul Vitányi
The similarity metric 2001-11-20 Ming Li Xin Chen Xin Li Bin Ma Paul Vitányi
The similarity metric 2001-01-01 Ming Li Xin Chen Xin Li Bin Ma Paul Vitányi
Nonapproximablity of the Normalized Information Distance 2009-01-01 Sebastiaan A. Terwijn Leen Torenvliet Paul Vitányi
Algorithms for Estimating Information Distance with Application to Bioinformatics and Linguistics 2004-01-01 Alexei Kaltchenko
Normalized Information Distance 2008-11-23 Paul Vitányi Frank J. Balbach Rudi Cilibrasi Ming Li
Normalized Information Distance 2008-01-01 Paul Vitányi Frank J. Balbach Rudi Cilibrasi Ming Li
Generalized Compression Dictionary Distance as Universal Similarity Measure 2014-01-01 Andrey Bogomolov Bruno Lepri Fabio Pianesi
Assembly Theory is a weak version of algorithmic complexity based on LZ compression that does not explain or quantify selection or evolution 2024-03-11 Felipe S. Abrahão Santiago Hernández-Orozco Narsis A. Kiani Jesper Tegnér Héctor Zenil
Compression-based Similarity 2011-01-01 Paul Vitányi
Compression-based Similarity 2011-10-20 Paul Vitányi
Information Distance in Multiples 2009-01-01 Paul Vitányi
DNA Sequence Classification with Compressors 2024-01-01 Şükrü Ozan
How Compression and Approximation Affect Efficiency in String Distance Measures 2022-01-01 Arun Ganesh Tomasz Kociumaka Andrea Lincoln Barna Saha
A conditional compression distance that unveils insights of the genomic evolution 2014-01-16 Diogo Pratas Armando J. Pinho
A conditional compression distance that unveils insights of the genomic evolution 2014-01-01 Diogo Pratas Armando J. Pinho
A Conditional Compression Distance that Unveils Insights of the Genomic Evolution 2014-03-01 Diogo Pratas Armando J. Pinho
Normalized Compression Distance of Multisets with Applications 2014-11-26 Andrew R. Cohen Paul Vitányi

Cited by (40)

Action Title Date Authors
On the Value of Multiple Read/Write Streams for Data Compression 2013-01-01 Travis Gagie
Automatic analysis of artistic paintings using information-based measures 2021-02-01 Jorge Miguel Silva Diogo Pratas Rui Antunes Sérgio Matos Armando J. Pinho
Normalized information-based divergences 2007-09-01 J.-F. Coeurjolly Rémy Drouilhet Jean-François Robineau
+
The Normalized Compression Distance Is Resistant to Noise 2007-05-01 Manuel Cebrin Manuel Alfonseca Alfonso Ortega
Computable model discovery and high-level-programming approximations to algorithmic complexity 2022-06-02 V. Lemus Eduardo Acuña-Yeomans V. Zamora Francisco Hernández-Quiroz Héctor Zenil
Survey and Taxonomy of Lossless Graph Compression and Space-Efficient Graph Representations 2018-01-01 Maciej Besta Torsten Hoefler
Graph Summarization Methods and Applications: A Survey 2016-01-01 Yike Liu Tara Safavi Abhilash Dighe Danai Koutra
+
Causal Markov condition for submodular information measures 2010-02-22 Bastian Steudel Dominik Janzing Bernhard Schoelkopf
+
Detecting translations of the same text and data with common source 2006-10-18 Kostadin Koroutchev Manuel Cebrián
Authorship Verification based on Compression-Models 2017-01-01 Oren Halvani Christian Winter Lukas Graner
A Graph Summarization: A Survey. 2016-12-14 Yike Liu Abhilash Dighe Tara Safavi Danai Koutra
Compression ratios based on the Universal Similarity Metric still yield protein distances far from CATH distances 2006-01-01 Jairo Rocha Francesc Rosselló Joan Segura
SALZA: Soft algorithmic complexity estimates for clustering and causality inference 2016-01-01 Marion Revolle François Cayre Nicolas Le Bihan
Exploring programmable self-assembly in non-DNA based molecular computing 2013-10-01 Germán Terrazas Héctor Zenil Natalio Krasnogor
Restructuring Compressed Texts without Explicit Decompression 2011-01-01 Keisuke Goto Shirou Maruyama Shunsuke Inenaga Hideo Bannai Hiroshi Sakamoto Masayuki Takeda
A Minimal Active Inference Agent 2015-01-01 Simon McGregor Manuel Baltieri Christopher L. Buckley
On the Foundations of Universal Sequence Prediction 2006-01-01 Marcus Hütter
Contextual information retrieval based on algorithmic information theory and statistical outlier detection 2008-05-01 R. Martínez Manuel Cebrián Francisco B. Rodrı́guez David Camacho
+
Algorithmic Information Theory and Novelty Generation 2007-01-01 Simon McGregor
+
The Influence of Time Series Distance Functions on Climate Networks 2019-02-08 Leonardo N. Ferreira Nicole Costa Resende Maria Livia L. M. Gava Liang Zhao Elbert E. N. Macau
Open Problems in Universal Induction & Intelligence 2009-07-04 Marcus Hütter
Rare Speed-up in Automatic Theorem Proving Reveals Tradeoff Between Computational Time and Information Value 2015-01-01 Santiago Hernández-Orozco Francisco Hernández Quiroz Héctor Zenil Wilfried Sieg
Nonparametric density estimation for high‐dimensional data—Algorithms and applications 2019-04-02 Zhipeng Wang David W. Scott
Two-dimensional Kolmogorov complexity and an empirical validation of the Coding theorem method by compressibility 2015-09-30 Héctor Zenil Fernando Soler Toscano Jean‐Paul Delahaye Nicolas Gauvrit
Simultaneous Subspace Clustering and Cluster Number Estimating Based on Triplet Relationship 2019-03-06 Jie Liang Jufeng Yang Ming‐Ming Cheng Paul L. Rosin Liang Wang
Detecting Malware with Information Complexity 2015-02-26 Nadia Alshahwan Earl T. Barr David Clark George Danezis
Algorithmic complexity bounds on future prediction errors 2006-12-20 Alexey Chernov Marcus Hütter Jürgen Schmidhuber
Detecting Malware with Information Complexity 2020-05-20 Nadia Alshahwan Earl T. Barr David Clark George Danezis Héctor D. Menéndez
Black-Box Testing of Deep Neural Networks through Test Case Diversity 2023-02-09 Zohreh Aghababaeyan Manel Abdellatif Lionel Briand S. Ramesh Mojtaba Bagherzadeh
Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures 2013-01-01 Fernando Soler Toscano Héctor Zenil Jean‐Paul Delahaye Nicolas Gauvrit
Springer Handbook of Science and Technology Indicators 2019-01-01 Michel Zitt Alain Lelu Martine Cadot Guillaume Cabanac
+
Complexity Monotone in Conditions and Future Prediction Errors. 2006-01-01 Alexey Chernov Marcus Hütter Jürgen Schmidhuber
Estimating the algorithmic complexity of stock markets 2015-12-29 Olivier Brandouy Jean‐Paul Delahaye Lin Ma
Scalable and Effective Conductance-Based Graph Clustering 2023-06-26 Longlong Lin Rong-Hua Li Tao Jia
On the Salient Limitations of the Methods of Assembly Theory and their Classification of Molecular Biosignatures 2022-01-01 Abicumaran Uthamacumaran Felipe S. Abrahão Narsis A. Kiani Héctor Zenil
RePair in Compressed Space and Time 2018-01-01 Kensuke Sakai Tatsuya Ohno Keisuke Goto Yoshimasa Takabatake I Tomohiro Hiroshi Sakamoto
On universal prediction and Bayesian confirmation 2007-05-25 Marcus Hütter
Kernelized Supervised Dictionary Learning 2013-07-24 Mehrdad J. Gangeh Ali Ghodsi Mohamed S. Kamel
+
Authorship Verification with Compression Features. 2013-01-01 Cor J. Veenman Zhenshi Li
A Safe Approximation for Kolmogorov Complexity 2014-01-01 Peter Bloem Francisco Mota Steven de Rooij Luís Antunes Pieter Adriaans