CountSketches, Feature Hashing and the Median of Three

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

CountSketches, Feature Hashing and the Median of Three. / Larsen, Kasper Green; Pagh, Rasmus; Tetek, Jakub.

Proceedings of the 38 th International Conference on Machine Learning. ed. / M Meila; T Zhang. PMLR, 2021. p. 6011-6020 (Proceedings of Machine Learning Research, Vol. 139).

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Harvard

Larsen, KG, Pagh, R & Tetek, J 2021, CountSketches, Feature Hashing and the Median of Three. in M Meila & T Zhang (eds), Proceedings of the 38 th International Conference on Machine Learning. PMLR, Proceedings of Machine Learning Research, vol. 139, pp. 6011-6020, 38th International Conference on Machine Learning (ICML), Virtual, 18/07/2021. <https://proceedings.mlr.press/v139/>

APA

Larsen, K. G., Pagh, R., & Tetek, J. (2021). CountSketches, Feature Hashing and the Median of Three. In M. Meila, & T. Zhang (Eds.), Proceedings of the 38 th International Conference on Machine Learning (pp. 6011-6020). PMLR. Proceedings of Machine Learning Research Vol. 139 https://proceedings.mlr.press/v139/

Vancouver

Larsen KG, Pagh R, Tetek J. CountSketches, Feature Hashing and the Median of Three. In Meila M, Zhang T, editors, Proceedings of the 38 th International Conference on Machine Learning. PMLR. 2021. p. 6011-6020. (Proceedings of Machine Learning Research, Vol. 139).

Author

Larsen, Kasper Green ; Pagh, Rasmus ; Tetek, Jakub. / CountSketches, Feature Hashing and the Median of Three. Proceedings of the 38 th International Conference on Machine Learning. editor / M Meila ; T Zhang. PMLR, 2021. pp. 6011-6020 (Proceedings of Machine Learning Research, Vol. 139).

Bibtex

@inproceedings{c5e9fe21a96b475b9cfded7928809b0e,

title = "CountSketches, Feature Hashing and the Median of Three",

abstract = "In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector v to a vector of dimension (2t - 1)s, where t, s > 0 are integer parameters. It is known that a CountSketch allows estimating coordinates of v with variance bounded by parallel to v parallel to/s. For t > 1, the estimator takes the median of 2t - 1 independent estimates, and the probability that the estimate is off by more than 2 parallel to v parallel to/root s is exponentially small in t. This suggests choosing t to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant t. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to O(min{parallel to v parallel to, parallel to v parallel to/s}) when t > 1. That is, the variance decreases proportionally to s, asymptotically for large enough s.",

author = "Larsen, {Kasper Green} and Rasmus Pagh and Jakub Tetek",

year = "2021",

language = "English",

series = "Proceedings of Machine Learning Research",

pages = "6011--6020",

editor = "M Meila and T Zhang",

booktitle = "Proceedings of the 38 th International Conference on Machine Learning",

publisher = "PMLR",

note = "38th International Conference on Machine Learning (ICML) ; Conference date: 18-07-2021 Through 24-07-2021",

}

RIS

TY - GEN

T1 - CountSketches, Feature Hashing and the Median of Three

AU - Larsen, Kasper Green

AU - Pagh, Rasmus

AU - Tetek, Jakub

PY - 2021

Y1 - 2021

N2 - In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector v to a vector of dimension (2t - 1)s, where t, s > 0 are integer parameters. It is known that a CountSketch allows estimating coordinates of v with variance bounded by parallel to v parallel to/s. For t > 1, the estimator takes the median of 2t - 1 independent estimates, and the probability that the estimate is off by more than 2 parallel to v parallel to/root s is exponentially small in t. This suggests choosing t to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant t. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to O(min{parallel to v parallel to, parallel to v parallel to/s}) when t > 1. That is, the variance decreases proportionally to s, asymptotically for large enough s.

AB - In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector v to a vector of dimension (2t - 1)s, where t, s > 0 are integer parameters. It is known that a CountSketch allows estimating coordinates of v with variance bounded by parallel to v parallel to/s. For t > 1, the estimator takes the median of 2t - 1 independent estimates, and the probability that the estimate is off by more than 2 parallel to v parallel to/root s is exponentially small in t. This suggests choosing t to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant t. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to O(min{parallel to v parallel to, parallel to v parallel to/s}) when t > 1. That is, the variance decreases proportionally to s, asymptotically for large enough s.

M3 - Article in proceedings

T3 - Proceedings of Machine Learning Research

SP - 6011

EP - 6020

BT - Proceedings of the 38 th International Conference on Machine Learning

A2 - Meila, M

A2 - Zhang, T

PB - PMLR

T2 - 38th International Conference on Machine Learning (ICML)

Y2 - 18 July 2021 through 24 July 2021

ER -

ID: 301135523

Department of Computer Science