Query-centric failure recovery for distributed stream processing engines
Research output: Contribution to journal › Conference article › Research › peer-review
Standard
Query-centric failure recovery for distributed stream processing engines. / Su, Li; Zhou, Yongluan.
In: Proceedings - International Conference on Data Engineering, Vol. 2018, 24.10.2018, p. 1280-1283.
RIS
TY - GEN
T1 - Query-centric failure recovery for distributed stream processing engines
AU - Su, Li
AU - Zhou, Yongluan
PY - 2018/10/24
Y1 - 2018/10/24
N2 - Correlated failures that usually involve a number of nodes failing simultaneously have a significant effect on systems' availability, especially for streaming applications that require real-time analysis. Most state-of-the-art distributed stream processing engines focus on recovering individual operator failures. By analyzing the existing recovery techniques, we identify the challenges and propose a fault-tolerance framework that can tolerate both individual and correlated failures with minimum overhead during the system's normal execution. Our progressive and query-centric recovery paradigm carefully schedules the recovery of failed operators based on the current availability of resources, such that the outputs of queries can be recovered as early as possible. We also formulate the new problem of recovery scheduling under correlated failures and design algorithms to optimize the recovery latency with a performance guarantee.
AB - Correlated failures that usually involve a number of nodes failing simultaneously have a significant effect on systems' availability, especially for streaming applications that require real-time analysis. Most state-of-the-art distributed stream processing engines focus on recovering individual operator failures. By analyzing the existing recovery techniques, we identify the challenges and propose a fault-tolerance framework that can tolerate both individual and correlated failures with minimum overhead during the system's normal execution. Our progressive and query-centric recovery paradigm carefully schedules the recovery of failed operators based on the current availability of resources, such that the outputs of queries can be recovered as early as possible. We also formulate the new problem of recovery scheduling under correlated failures and design algorithms to optimize the recovery latency with a performance guarantee.
KW - Correlated Failure
KW - Distributed Stream Processing
KW - Fault Tolerance
UR - http://www.scopus.com/inward/record.url?scp=85057124101&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2018.00129
DO - 10.1109/ICDE.2018.00129
M3 - Conference article
AN - SCOPUS:85057124101
VL - 2018
SP - 1280
EP - 1283
JO - Proceedings - International Conference on Data Engineering
JF - Proceedings - International Conference on Data Engineering
SN - 1084-4627
T2 - 34th IEEE International Conference on Data Engineering, ICDE 2018
Y2 - 16 April 2018 through 19 April 2018
ER -
ID: 222697433