Best-Ever Algorithm Found for Huge Streams of Data
Mikkel Thorup of University of Copenhagen and Kasper Green Larsen of University of Aarhus are co-authors of an article describing the fastest algorithm ever for huge streams of data.
It’s hard to measure water from a fire hose while it’s hitting you in the face. In a sense, that’s the challenge of analyzing streaming data, which comes at us in a torrent and never lets up. If you’re on Twitter watching tweets go by, you might like to declare a brief pause, so you can figure out what’s trending. That’s not feasible, though, so instead you need to find a way to tally hashtags on the fly.
Computer programs that perform these kinds of on-the-go calculations are called streaming algorithms. Because data comes at them continuously, and in such volume, they try to record the essence of what they’ve seen while strategically forgetting the rest. For more than 30 years computer scientists have worked to build a better streaming algorithm. Last fall a team of researchers invented one that is just about perfect.
“We developed a new algorithm that is simultaneously the best” on every performance dimension, said Jelani Nelson, a computer scientist at Harvard University and a co-author of the work with Kasper Green Larsen of Aarhus University in Denmark, Huy Nguyen of Northeastern University and Mikkel Thorup of the University of Copenhagen.
This best-in-class streaming algorithm works by remembering just enough of what it’s seen to tell you what it’s seen most frequently. It suggests that compromises that seemed intrinsic to the analysis of streaming data are not actually necessary. It also points the way forward to a new era of strategic forgetting.
No algorithm works perfectly every time you run it — even the best ones misfire some small percentage of the time. In the example we’ve been using, a misfire could mean that the second two-digit block, 34, gets assigned an incorrect tag, and as a result, when it goes looking for the block it’s supposed to be joined to, it doesn’t have the information it needs to find 56. And once one link in the chain fails, the entire effort falls apart.
To avoid this problem, the researchers use what’s called an “expander graph.” In an expander graph, each two-digit block forms a point. Points get connected by lines (according to the tagging process described above) to form a cluster. The important feature of an expander graph is that instead of merely connecting each point with its adjoining blocks, you connect each two-digit block with multiple other blocks. For example, with 12,345,678, you connect 12 with 34 but also with 56, so that you can still tell that 12 and 56 belong in the same number even if the link between 12 and 34 fails.
An expander graph doesn’t always come out perfectly. Sometimes it’ll fail to link two blocks that should be linked. Or it’ll link two blocks that don’t belong together. To counteract this tendency, the researchers developed the final step of their algorithm: a “cluster-preserving” sub-algorithm that can survey an expander graph and accurately determine which points are meant to be clustered together and which aren’t, even when some lines are missing and false ones have been added.
“This guarantees I can recover something that looks like the original clusters,” Thorup said.
And while Twitter isn’t going to plug in the expander sketch tomorrow, the techniques underlying it are applicable to a far wider range of computer science problems than tallying tweets. The algorithm also proves that certain sacrifices that previously seemed necessary to answer the frequent-items problem don’t need to be made. Previous algorithms always gave up something — they were accurate but memory-intensive, or fast but unable to determine which frequent items were trending. This new work shows that given the right way of encoding a lot of information, you can end up with the best of all possible worlds: You can store your frequent items and recall them, too.