I/O-Efficient Similarity Join

Research output: Contribution to journalJournal articleResearchpeer-review

Standard

I/O-Efficient Similarity Join. / Pagh, Rasmus; Pham, Ninh; Silvestri, Francesco; Stöckel, Morten.

In: Algorithmica, Vol. 78, No. 4, 01.08.2017, p. 1263-1283.

Research output: Contribution to journalJournal articleResearchpeer-review

Harvard

Pagh, R, Pham, N, Silvestri, F & Stöckel, M 2017, 'I/O-Efficient Similarity Join', Algorithmica, vol. 78, no. 4, pp. 1263-1283. https://doi.org/10.1007/s00453-017-0285-5

APA

Pagh, R., Pham, N., Silvestri, F., & Stöckel, M. (2017). I/O-Efficient Similarity Join. Algorithmica, 78(4), 1263-1283. https://doi.org/10.1007/s00453-017-0285-5

Vancouver

Pagh R, Pham N, Silvestri F, Stöckel M. I/O-Efficient Similarity Join. Algorithmica. 2017 Aug 1;78(4):1263-1283. https://doi.org/10.1007/s00453-017-0285-5

Author

Pagh, Rasmus ; Pham, Ninh ; Silvestri, Francesco ; Stöckel, Morten. / I/O-Efficient Similarity Join. In: Algorithmica. 2017 ; Vol. 78, No. 4. pp. 1263-1283.

Bibtex

@article{d4414b0dbde74eab9f132b3ef0c3eb6c,
title = "I/O-Efficient Similarity Join",
abstract = "We present an I/O-efficient algorithm for computing similarity joins basedon locality-sensitive hashing (LSH). In contrast to the filtering methods commonlysuggested our method has provable sub-quadratic dependency on the data size. Further,in contrast to straightforward implementations of known LSH-based algorithms onexternal memory, our approach is able to take significant advantage of the availableinternal memory:Whereas the time complexity of classical algorithms includes a factorof Nρ, where ρ is a parameter of the LSH used, the I/O complexity of our algorithmmerely includes a factor (N/M)ρ, where N is the data size and M is the size ofinternal memory. Our algorithm is randomized and outputs the correct result withhigh probability. It is a simple, recursive, cache-oblivious procedure, and we believethat it will be useful also in other computational settings such as parallel computation",
author = "Rasmus Pagh and Ninh Pham and Francesco Silvestri and Morten St{\"o}ckel",
year = "2017",
month = aug,
day = "1",
doi = "10.1007/s00453-017-0285-5",
language = "English",
volume = "78",
pages = "1263--1283",
journal = "Algorithmica",
issn = "0178-4617",
publisher = "Springer",
number = "4",

}

RIS

TY - JOUR

T1 - I/O-Efficient Similarity Join

AU - Pagh, Rasmus

AU - Pham, Ninh

AU - Silvestri, Francesco

AU - Stöckel, Morten

PY - 2017/8/1

Y1 - 2017/8/1

N2 - We present an I/O-efficient algorithm for computing similarity joins basedon locality-sensitive hashing (LSH). In contrast to the filtering methods commonlysuggested our method has provable sub-quadratic dependency on the data size. Further,in contrast to straightforward implementations of known LSH-based algorithms onexternal memory, our approach is able to take significant advantage of the availableinternal memory:Whereas the time complexity of classical algorithms includes a factorof Nρ, where ρ is a parameter of the LSH used, the I/O complexity of our algorithmmerely includes a factor (N/M)ρ, where N is the data size and M is the size ofinternal memory. Our algorithm is randomized and outputs the correct result withhigh probability. It is a simple, recursive, cache-oblivious procedure, and we believethat it will be useful also in other computational settings such as parallel computation

AB - We present an I/O-efficient algorithm for computing similarity joins basedon locality-sensitive hashing (LSH). In contrast to the filtering methods commonlysuggested our method has provable sub-quadratic dependency on the data size. Further,in contrast to straightforward implementations of known LSH-based algorithms onexternal memory, our approach is able to take significant advantage of the availableinternal memory:Whereas the time complexity of classical algorithms includes a factorof Nρ, where ρ is a parameter of the LSH used, the I/O complexity of our algorithmmerely includes a factor (N/M)ρ, where N is the data size and M is the size ofinternal memory. Our algorithm is randomized and outputs the correct result withhigh probability. It is a simple, recursive, cache-oblivious procedure, and we believethat it will be useful also in other computational settings such as parallel computation

U2 - 10.1007/s00453-017-0285-5

DO - 10.1007/s00453-017-0285-5

M3 - Journal article

VL - 78

SP - 1263

EP - 1283

JO - Algorithmica

JF - Algorithmica

SN - 0178-4617

IS - 4

ER -

ID: 194914537