CountSketches, Feature Hashing and the Median of Three
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
CountSketches, Feature Hashing and the Median of Three. / Larsen, Kasper Green; Pagh, Rasmus; Tetek, Jakub.
Proceedings of the 38 th International Conference on Machine Learning. ed. / M Meila; T Zhang. PMLR, 2021. p. 6011-6020 (Proceedings of Machine Learning Research, Vol. 139).Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Harvard
APA
Vancouver
Author
Bibtex
}
RIS
TY - GEN
T1 - CountSketches, Feature Hashing and the Median of Three
AU - Larsen, Kasper Green
AU - Pagh, Rasmus
AU - Tetek, Jakub
PY - 2021
Y1 - 2021
N2 - In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector v to a vector of dimension (2t - 1)s, where t, s > 0 are integer parameters. It is known that a CountSketch allows estimating coordinates of v with variance bounded by parallel to v parallel to/s. For t > 1, the estimator takes the median of 2t - 1 independent estimates, and the probability that the estimate is off by more than 2 parallel to v parallel to/root s is exponentially small in t. This suggests choosing t to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant t. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to O(min{parallel to v parallel to, parallel to v parallel to/s}) when t > 1. That is, the variance decreases proportionally to s, asymptotically for large enough s.
AB - In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector v to a vector of dimension (2t - 1)s, where t, s > 0 are integer parameters. It is known that a CountSketch allows estimating coordinates of v with variance bounded by parallel to v parallel to/s. For t > 1, the estimator takes the median of 2t - 1 independent estimates, and the probability that the estimate is off by more than 2 parallel to v parallel to/root s is exponentially small in t. This suggests choosing t to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant t. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to O(min{parallel to v parallel to, parallel to v parallel to/s}) when t > 1. That is, the variance decreases proportionally to s, asymptotically for large enough s.
M3 - Article in proceedings
T3 - Proceedings of Machine Learning Research
SP - 6011
EP - 6020
BT - Proceedings of the 38 th International Conference on Machine Learning
A2 - Meila, M
A2 - Zhang, T
PB - PMLR
T2 - 38th International Conference on Machine Learning (ICML)
Y2 - 18 July 2021 through 24 July 2021
ER -
ID: 301135523