An exploratory analysis of methods for real-time data deduplication in streaming processes

Research output: Chapter in Book/Report/Conference proceeding, Article in proceedings, Research, peer-review

Standard

An exploratory analysis of methods for real-time data deduplication in streaming processes. / Esteves, João; Costa, Rosa; Zhou, Yongluan; Brito De Almeida, Ana Carolina.

DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems. Association for Computing Machinery, 2023. p. 91–102.

Harvard

Esteves, J, Costa, R, Zhou, Y & Brito De Almeida, AC 2023, An exploratory analysis of methods for real-time data deduplication in streaming processes. in DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems. Association for Computing Machinery, pp. 91–102, 17th ACM International Conference on Distributed and Event-based Systems - DEBS '23, Neuchatel, Switzerland, 27/06/2023. https://doi.org/10.1145/3583678.3596898

APA

Esteves, J., Costa, R., Zhou, Y., & Brito De Almeida, A. C. (2023). An exploratory analysis of methods for real-time data deduplication in streaming processes. In DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems (pp. 91–102). Association for Computing Machinery. https://doi.org/10.1145/3583678.3596898

Vancouver

Esteves J, Costa R, Zhou Y, Brito De Almeida AC. An exploratory analysis of methods for real-time data deduplication in streaming processes. In DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems. Association for Computing Machinery. 2023. p. 91–102 https://doi.org/10.1145/3583678.3596898

Author

Esteves, João ; Costa, Rosa ; Zhou, Yongluan ; Brito De Almeida, Ana Carolina. / An exploratory analysis of methods for real-time data deduplication in streaming processes. DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems. Association for Computing Machinery, 2023. pp. 91–102

Bibtex

@inproceedings{5c6128788958464dbe13e477288ca066,
title = "An exploratory analysis of methods for real-time data deduplication in streaming processes",
abstract = "Modern stream processing systems typically require ingesting and correlating data from multiple data sources. However, these sources are out of control and prone to software errors and unavailability, causing data anomalies that must be necessarily remedied before processing the data. In this context, anomaly, such as data duplication, appears as one of the most prominent challenges of stream processing. Data duplication can hinder real-time analysis of data for decision making. This paper investigates the challenges and performs an experimental analysis of operators and auxiliary tools to help with data deduplication. The results show that there is an increase in data delivery time when using external mechanisms. However, these mechanisms are essential for an ingestion process to guarantee that no data is lost and that no duplicates are persisted.",
author = "Jo{\~a}o Esteves and Rosa Costa and Yongluan Zhou and {Brito De Almeida}, {Ana Carolina}",
year = "2023",
month = jun,
day = "27",
doi = "10.1145/3583678.3596898",
language = "English",
pages = "91--102",
booktitle = "DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems",
publisher = "Association for Computing Machinery",
note = "17th ACM International Conference on Distributed and Event-based Systems - DEBS '23 ; Conference date: 27-06-2023 Through 30-06-2023",

}

RIS

TY - GEN

T1 - An exploratory analysis of methods for real-time data deduplication in streaming processes

AU - Esteves, João

AU - Costa, Rosa

AU - Zhou, Yongluan

AU - Brito De Almeida, Ana Carolina

PY - 2023/6/27

Y1 - 2023/6/27

N2 - Modern stream processing systems typically require ingesting and correlating data from multiple data sources. However, these sources are out of control and prone to software errors and unavailability, causing data anomalies that must be necessarily remedied before processing the data. In this context, anomaly, such as data duplication, appears as one of the most prominent challenges of stream processing. Data duplication can hinder real-time analysis of data for decision making. This paper investigates the challenges and performs an experimental analysis of operators and auxiliary tools to help with data deduplication. The results show that there is an increase in data delivery time when using external mechanisms. However, these mechanisms are essential for an ingestion process to guarantee that no data is lost and that no duplicates are persisted.

AB - Modern stream processing systems typically require ingesting and correlating data from multiple data sources. However, these sources are out of control and prone to software errors and unavailability, causing data anomalies that must be necessarily remedied before processing the data. In this context, anomaly, such as data duplication, appears as one of the most prominent challenges of stream processing. Data duplication can hinder real-time analysis of data for decision making. This paper investigates the challenges and performs an experimental analysis of operators and auxiliary tools to help with data deduplication. The results show that there is an increase in data delivery time when using external mechanisms. However, these mechanisms are essential for an ingestion process to guarantee that no data is lost and that no duplicates are persisted.

U2 - 10.1145/3583678.3596898

DO - 10.1145/3583678.3596898

M3 - Article in proceedings

SP - 91

EP - 102

BT - DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems

PB - Association for Computing Machinery

T2 - 17th ACM International Conference on Distributed and Event-based Systems - DEBS '23

Y2 - 27 June 2023 through 30 June 2023

ER -

ID: 359260915