A Nearly Tight Analysis of Greedy k-means++
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
A Nearly Tight Analysis of Greedy k-means++. / Grunau, Christoph; Özüdoğru, Ahmet Alper; Rozhoň, Václav; Tětek, Jakub.
Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). ed. / Nikhil Bansal; Viswanath Nagarajan. Society for Industrial and Applied Mathematics, 2023. p. 1012-1070.
RIS
TY - GEN
T1 - A Nearly Tight Analysis of Greedy k-means++
AU - Grunau, Christoph
AU - Özüdoğru, Ahmet Alper
AU - Rozhoň, Václav
AU - Tětek, Jakub
PY - 2023
Y1 - 2023
N2 - The famous k-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is the most popular way of solving the k-means problem in practice. The algorithm is very simple: it samples the first center uniformly at random, and each of the following k − 1 centers is then sampled proportional to its squared distance to the closest center chosen so far. Afterward, Lloyd's iterative algorithm is run. The k-means++ algorithm is known to return a Θ(log k)-approximate solution in expectation. In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the guarantees for its following greedy variant: in every step, we sample ℓ candidate centers instead of one and then pick the one that minimizes the new cost. This is also how k-means++ is implemented in, e.g., the popular Scikit-learn library [Pedregosa et al.; JMLR 2011]. We present nearly matching lower and upper bounds for greedy k-means++: we prove that it is an O(ℓ³ log³ k)-approximation algorithm. On the other hand, we prove a lower bound of Ω(ℓ³ log³ k / log²(ℓ log k)). Previously, only an Ω(ℓ log k) lower bound was known [Bhattacharya, Eube, Röglin, Schmidt; ESA 2020] and there was no known upper bound.
U2 - 10.1137/1.9781611977554.ch39
DO - 10.1137/1.9781611977554.ch39
M3 - Article in proceedings
SP - 1012
EP - 1070
BT - Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)
A2 - Bansal, Nikhil
A2 - Nagarajan, Viswanath
PB - Society for Industrial and Applied Mathematics
Y2 - 22 January 2023 through 25 January 2023
ER -
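The abstract fully specifies the seeding procedure being analyzed: the first center is drawn uniformly, each later round draws ℓ candidates by D²-sampling, and the greedy rule keeps the candidate minimizing the new cost. As an illustration only, here is a minimal Python sketch of that procedure; it is an independent reimplementation from the abstract's description, not the authors' or Scikit-learn's code, and all function and parameter names are invented for this sketch.

```python
import random

def greedy_kmeanspp_seeding(points, k, ell, rng=None):
    """Sketch of greedy k-means++ seeding as described in the abstract.

    points: list of equal-length coordinate tuples.
    k:      number of centers to choose.
    ell:    number of candidate centers drawn per round (ell = 1 recovers
            plain k-means++ seeding).
    """
    rng = rng or random.Random(0)

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def cost(centers):
        # k-means cost: sum of squared distances to the closest center.
        return sum(min(sqdist(p, c) for c in centers) for p in points)

    # First center: uniformly at random.
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # D^2 weights: squared distance of each point to its closest center.
        weights = [min(sqdist(p, c) for c in centers) for p in points]
        # Draw ell candidates proportional to the weights, then greedily
        # keep the candidate that minimizes the resulting cost.
        candidates = rng.choices(points, weights=weights, k=ell)
        best = min(candidates, key=lambda cand: cost(centers + [cand]))
        centers.append(best)
    return centers
```

After seeding, one would run Lloyd's iterations on the chosen centers, which this sketch omits.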