Tolerating correlated failures in Massively Parallel Stream Processing Engines

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Tolerating correlated failures in Massively Parallel Stream Processing Engines. / Su, Li; Zhou, Yongluan.

2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE Press, 2016. p. 517-528.

Harvard

Su, L & Zhou, Y 2016, Tolerating correlated failures in Massively Parallel Stream Processing Engines. in 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE Press, pp. 517-528, 32nd IEEE International Conference on Data Engineering, Helsinki, Finland, 16/05/2016. https://doi.org/10.1109/ICDE.2016.7498267

APA

Su, L., & Zhou, Y. (2016). Tolerating correlated failures in Massively Parallel Stream Processing Engines. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE) (pp. 517-528). IEEE Press. https://doi.org/10.1109/ICDE.2016.7498267

Vancouver

Su L, Zhou Y. Tolerating correlated failures in Massively Parallel Stream Processing Engines. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE Press. 2016. p. 517-528 https://doi.org/10.1109/ICDE.2016.7498267

Author

Su, Li ; Zhou, Yongluan. / Tolerating correlated failures in Massively Parallel Stream Processing Engines. 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE Press, 2016. pp. 517-528

Bibtex

@inproceedings{1fe6df31464e409d96b06328415d690c,
title = "Tolerating correlated failures in Massively Parallel Stream Processing Engines",
abstract = "Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.",
author = "Li Su and Yongluan Zhou",
year = "2016",
month = may,
day = "1",
doi = "10.1109/ICDE.2016.7498267",
language = "English",
pages = "517--528",
booktitle = "2016 IEEE 32nd International Conference on Data Engineering (ICDE)",
publisher = "IEEE Press",
note = "32nd IEEE International Conference on Data Engineering, ICDE 2016 ; Conference date: 16-05-2016 Through 20-05-2016",
url = "http://icde2016.fi/",
}

RIS

TY - GEN

T1 - Tolerating correlated failures in Massively Parallel Stream Processing Engines

AU - Su, Li

AU - Zhou, Yongluan

N1 - Conference code: 32

PY - 2016/5/1

Y1 - 2016/5/1

N2 - Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.

U2 - 10.1109/ICDE.2016.7498267

DO - 10.1109/ICDE.2016.7498267

M3 - Article in proceedings

SP - 517

EP - 528

BT - 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

PB - IEEE Press

T2 - 32nd IEEE International Conference on Data Engineering

Y2 - 16 May 2016 through 20 May 2016

ER -

ID: 179277833