Fast Recovery of Correlated Failures in Distributed Stream Processing Engines

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Fast Recovery of Correlated Failures in Distributed Stream Processing Engines. / Su, Li; Zhou, Yongluan.

Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems, DEBS 2021. Association for Computing Machinery, 2021. p. 66-77.

Harvard

Su, L & Zhou, Y 2021, Fast Recovery of Correlated Failures in Distributed Stream Processing Engines. in Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems, DEBS 2021. Association for Computing Machinery, pp. 66-77, 15th ACM International Conference on Distributed and Event-based Systems, Virtual, 28/06/2021. https://doi.org/10.1145/3465480.3466923

APA

Su, L., & Zhou, Y. (2021). Fast Recovery of Correlated Failures in Distributed Stream Processing Engines. In Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems, DEBS 2021 (pp. 66-77). Association for Computing Machinery. https://doi.org/10.1145/3465480.3466923

Vancouver

Su L, Zhou Y. Fast Recovery of Correlated Failures in Distributed Stream Processing Engines. In Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems, DEBS 2021. Association for Computing Machinery. 2021. p. 66-77 https://doi.org/10.1145/3465480.3466923

Author

Su, Li; Zhou, Yongluan. / Fast Recovery of Correlated Failures in Distributed Stream Processing Engines. Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems, DEBS 2021. Association for Computing Machinery, 2021. pp. 66-77.

Bibtex

@inproceedings{e3b8b632688b49ac9d1a9597f5b35d63,

title = "Fast Recovery of Correlated Failures in Distributed Stream Processing Engines",

abstract = "In a large-scale cluster, correlated failures usually involve a number of nodes failing simultaneously. Although correlated failures occur infrequently, they have significant effect on systems' availability, especially for streaming applications that require real-time analysis, as repairing the failed nodes or acquiring additional ones would take a significant amount of time. Most state-of-the-art distributed stream processing systems (DSPSs) focus on recovering individual failures and do not consider the optimization for recovering correlated failure. In this work, we propose an incremental and query-centric recovery paradigm where the recovery of failed operator partitions would be carefully scheduled based on the current availability of resources, such that the outputs of queries can be recovered as early as possible. By analyzing the existing recovery techniques, we identify the challenges and propose a fault-tolerance framework that can support incremental recovery with minimum overhead during the system's normal execution. We also formulate the new problem of recovery scheduling under correlated failures and design algorithms to optimize the recovery latency with a performance guarantee. A comprehensive set of experiments are conducted to study the validity of our proposal.",

author = "Li Su and Yongluan Zhou",

year = "2021",

doi = "10.1145/3465480.3466923",

language = "English",

isbn = "978-1-4503-8555-8",

pages = "66--77",

booktitle = "Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems, DEBS 2021",

publisher = "Association for Computing Machinery",

note = "15th ACM International Conference on Distributed and Event-based Systems ; Conference date: 28-06-2021 Through 02-07-2021",

}

RIS

TY - GEN

T1 - Fast Recovery of Correlated Failures in Distributed Stream Processing Engines

AU - Su, Li

AU - Zhou, Yongluan

PY - 2021

Y1 - 2021

N2 - In a large-scale cluster, correlated failures usually involve a number of nodes failing simultaneously. Although correlated failures occur infrequently, they have significant effect on systems' availability, especially for streaming applications that require real-time analysis, as repairing the failed nodes or acquiring additional ones would take a significant amount of time. Most state-of-the-art distributed stream processing systems (DSPSs) focus on recovering individual failures and do not consider the optimization for recovering correlated failure. In this work, we propose an incremental and query-centric recovery paradigm where the recovery of failed operator partitions would be carefully scheduled based on the current availability of resources, such that the outputs of queries can be recovered as early as possible. By analyzing the existing recovery techniques, we identify the challenges and propose a fault-tolerance framework that can support incremental recovery with minimum overhead during the system's normal execution. We also formulate the new problem of recovery scheduling under correlated failures and design algorithms to optimize the recovery latency with a performance guarantee. A comprehensive set of experiments are conducted to study the validity of our proposal.

AB - In a large-scale cluster, correlated failures usually involve a number of nodes failing simultaneously. Although correlated failures occur infrequently, they have significant effect on systems' availability, especially for streaming applications that require real-time analysis, as repairing the failed nodes or acquiring additional ones would take a significant amount of time. Most state-of-the-art distributed stream processing systems (DSPSs) focus on recovering individual failures and do not consider the optimization for recovering correlated failure. In this work, we propose an incremental and query-centric recovery paradigm where the recovery of failed operator partitions would be carefully scheduled based on the current availability of resources, such that the outputs of queries can be recovered as early as possible. By analyzing the existing recovery techniques, we identify the challenges and propose a fault-tolerance framework that can support incremental recovery with minimum overhead during the system's normal execution. We also formulate the new problem of recovery scheduling under correlated failures and design algorithms to optimize the recovery latency with a performance guarantee. A comprehensive set of experiments are conducted to study the validity of our proposal.

U2 - 10.1145/3465480.3466923

DO - 10.1145/3465480.3466923

M3 - Article in proceedings

SN - 978-1-4503-8555-8

SP - 66

EP - 77

BT - Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems, DEBS 2021

PB - Association for Computing Machinery

T2 - 15th ACM International Conference on Distributed and Event-based Systems

Y2 - 28 June 2021 through 2 July 2021

ER -

ID: 272137982

Department of Computer Science