A Nearly Tight Analysis of Greedy k-means++
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
A Nearly Tight Analysis of Greedy k-means++. / Grunau, Christoph; Özüdoğru, Ahmet Alper; Rozhoň, Václav; Tětek, Jakub.
Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). ed. / Nikhil Bansal; Viswanath Nagarajan. Society for Industrial and Applied Mathematics, 2023. p. 1012-1070.
RIS
TY - GEN
T1 - A Nearly Tight Analysis of Greedy k-means++
AU - Grunau, Christoph
AU - Özüdoğru, Ahmet Alper
AU - Rozhoň, Václav
AU - Tětek, Jakub
PY - 2023
Y1 - 2023
N2 - The famous k-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is the most popular way of solving the k-means problem in practice. The algorithm is very simple: it samples the first center uniformly at random, and each of the following k − 1 centers is then sampled proportional to its squared distance to the closest center chosen so far. Afterward, Lloyd's iterative algorithm is run. The k-means++ algorithm is known to return a Θ(log k)-approximate solution in expectation. In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the guarantees for its following greedy variant: in every step, we sample ℓ candidate centers instead of one and then pick the one that minimizes the new cost. This is also how k-means++ is implemented in, e.g., the popular Scikit-learn library [Pedregosa et al.; JMLR 2011]. We present nearly matching lower and upper bounds for greedy k-means++: we prove that it is an O(ℓ³ log³ k)-approximation algorithm. On the other hand, we prove a lower bound of Ω(ℓ³ log³ k / log²(ℓ log k)). Previously, only an Ω(ℓ log k) lower bound was known [Bhattacharya, Eube, Röglin, Schmidt; ESA 2020] and there was no known upper bound.
U2 - 10.1137/1.9781611977554.ch39
DO - 10.1137/1.9781611977554.ch39
M3 - Article in proceedings
SP - 1012
EP - 1070
BT - Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)
A2 - Bansal, Nikhil
A2 - Nagarajan, Viswanath
PB - Society for Industrial and Applied Mathematics
Y2 - 22 January 2023 through 25 January 2023
ER -
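The abstract fully specifies the seeding procedure being analyzed: the first center is drawn uniformly, each later round draws ℓ candidates by D²-sampling, and the greedy rule keeps the candidate minimizing the new cost. As an illustration only, here is a minimal Python sketch of that procedure; it is an independent reimplementation from the abstract's description, not the authors' or Scikit-learn's code, and all function and parameter names are invented for this sketch.

```python
import random

def greedy_kmeanspp_seeding(points, k, ell, rng=None):
    """Sketch of greedy k-means++ seeding as described in the abstract.

    points: list of equal-length coordinate tuples.
    k:      number of centers to choose.
    ell:    number of candidate centers drawn per round (ell = 1 recovers
            plain k-means++ seeding).
    """
    rng = rng or random.Random(0)

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def cost(centers):
        # k-means cost: sum of squared distances to the closest center.
        return sum(min(sqdist(p, c) for c in centers) for p in points)

    # First center: uniformly at random.
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # D^2 weights: squared distance of each point to its closest center.
        weights = [min(sqdist(p, c) for c in centers) for p in points]
        # Draw ell candidates proportional to the weights, then greedily
        # keep the candidate that minimizes the resulting cost.
        candidates = rng.choices(points, weights=weights, k=ell)
        best = min(candidates, key=lambda cand: cost(centers + [cand]))
        centers.append(best)
    return centers
```

After seeding, one would run Lloyd's iterations on the chosen centers, which this sketch omits.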