Passive and partially active fault tolerance for massively parallel stream processing engines
Research output: Contribution to journal › Journal article › peer-review
Standard
Passive and partially active fault tolerance for massively parallel stream processing engines. / Su, Li; Zhou, Yongluan.
In: IEEE Transactions on Knowledge and Data Engineering, Vol. 31, No. 1, 2019, p. 32-45.Research output: Contribution to journal › Journal article › peer-review
Harvard
APA
Vancouver
Author
Bibtex
}
RIS
TY - JOUR
T1 - Passive and partially active fault tolerance for massively parallel stream processing engines
AU - Su, Li
AU - Zhou, Yongluan
PY - 2019
Y1 - 2019
N2 - Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.
AB - Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.
KW - Data models
KW - Distributed Stream Processing
KW - Engines
KW - Fault Tolerance
KW - Fault tolerance
KW - Fault tolerant systems
KW - Semantics
KW - Storms
KW - Topology
U2 - 10.1109/TKDE.2017.2720602
DO - 10.1109/TKDE.2017.2720602
M3 - Journal article
VL - 31
SP - 32
EP - 45
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
SN - 1041-4347
IS - 1
ER -
ID: 182749479