CountSketches, Feature Hashing and the Median of Three

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Documents

  • Fulltext: Final published version, 450 KB, PDF document

In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector v to a vector of dimension (2t - 1)s, where t, s > 0 are integer parameters. It is known that a CountSketch allows estimating coordinates of v with variance bounded by ‖v‖₂²/s. For t > 1, the estimator takes the median of 2t - 1 independent estimates, and the probability that the estimate is off by more than 2‖v‖₂/√s is exponentially small in t. This suggests choosing t to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant t. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to O(min{‖v‖₁²/s², ‖v‖₂²/s}) when t > 1. That is, the variance decreases proportionally to s², asymptotically for large enough s.
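To make the construction described in the abstract concrete, the following is a minimal Python sketch of a CountSketch with 2t - 1 rows of s buckets and the median-of-estimates decoder. It is illustrative only, not the authors' implementation: the function names and parameters are made up here, and fully random bucket/sign assignments stand in for the pairwise-independent hash families assumed in the analysis.

```python
import numpy as np

def countsketch(v, s, t, seed=0):
    """Build a CountSketch of vector v: (2t - 1) rows, each with s buckets.

    For each row r, coordinate i is mapped to bucket h_r(i) in {0, ..., s-1}
    with a random sign g_r(i) in {-1, +1}; the bucket accumulates g_r(i) * v[i].
    (Fully random hashes are used here for simplicity; the paper's analysis
    only assumes limited independence.)
    """
    rng = np.random.default_rng(seed)
    n = len(v)
    rows = 2 * t - 1
    buckets = rng.integers(0, s, size=(rows, n))      # h_r(i)
    signs = rng.choice([-1.0, 1.0], size=(rows, n))   # g_r(i)
    table = np.zeros((rows, s))
    for r in range(rows):
        np.add.at(table[r], buckets[r], signs[r] * v)  # unbuffered accumulation
    return table, buckets, signs

def estimate(table, buckets, signs, i):
    """Estimate v[i] as the median of the 2t - 1 per-row estimates."""
    rows = table.shape[0]
    ests = [signs[r, i] * table[r, buckets[r, i]] for r in range(rows)]
    return np.median(ests)

# Toy usage: a heavy coordinate is recovered from a much smaller sketch.
v = np.zeros(10_000)
v[42] = 5.0
v += 0.01 * np.random.default_rng(1).standard_normal(v.shape)
table, b, g = countsketch(v, s=64, t=2)   # t = 2, i.e. the median of three rows
print(estimate(table, b, g, 42))          # close to v[42]
```

With t = 1 the sketch degenerates to feature hashing (a single row, no median), which is the other regime studied in the paper.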

Original language: English
Title of host publication: Proceedings of the 38th International Conference on Machine Learning
Editors: M. Meila, T. Zhang
Publisher: PMLR
Publication date: 2021
Pages: 6011-6020
Publication status: Published - 2021
Event: 38th International Conference on Machine Learning (ICML) - Virtual
Duration: 18 Jul 2021 – 24 Jul 2021

Conference

Conference: 38th International Conference on Machine Learning (ICML)
City: Virtual
Period: 18/07/2021 – 24/07/2021
Series: Proceedings of Machine Learning Research
Volume: 139
ISSN: 2640-3498

ID: 301135523