An exploratory analysis of methods for real-time data deduplication in streaming processes
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
An exploratory analysis of methods for real-time data deduplication in streaming processes. / Esteves, João; Costa, Rosa; Zhou, Yongluan; Brito De Almeida, Ana Carolina.
DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems. Association for Computing Machinery, 2023. p. 91–102.
RIS
TY - GEN
T1 - An exploratory analysis of methods for real-time data deduplication in streaming processes
AU - Esteves, João
AU - Costa, Rosa
AU - Zhou, Yongluan
AU - Brito De Almeida, Ana Carolina
PY - 2023/6/27
Y1 - 2023/6/27
N2 - Modern stream processing systems typically need to ingest and correlate data from multiple data sources. However, these sources are outside the system's control and prone to software errors and unavailability, causing data anomalies that must be remedied before the data is processed. In this context, anomalies such as data duplication are among the most prominent challenges of stream processing. Data duplication can hinder real-time analysis of data for decision making. This paper investigates these challenges and performs an experimental analysis of operators and auxiliary tools that help with data deduplication. The results show an increase in data delivery time when external mechanisms are used. However, these mechanisms are essential for an ingestion process to guarantee that no data is lost and that no duplicates are persisted.
AB - Modern stream processing systems typically need to ingest and correlate data from multiple data sources. However, these sources are outside the system's control and prone to software errors and unavailability, causing data anomalies that must be remedied before the data is processed. In this context, anomalies such as data duplication are among the most prominent challenges of stream processing. Data duplication can hinder real-time analysis of data for decision making. This paper investigates these challenges and performs an experimental analysis of operators and auxiliary tools that help with data deduplication. The results show an increase in data delivery time when external mechanisms are used. However, these mechanisms are essential for an ingestion process to guarantee that no data is lost and that no duplicates are persisted.
U2 - 10.1145/3583678.3596898
DO - 10.1145/3583678.3596898
M3 - Article in proceedings
SP - 91
EP - 102
BT - DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems
PB - Association for Computing Machinery
T2 - 17th ACM International Conference on Distributed and Event-based Systems - DEBS '23
Y2 - 27 June 2023 through 30 June 2023
ER -
ID: 359260915
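
Note: the abstract describes deduplicating records during stream ingestion so that no data is lost and no duplicates are persisted. As a rough illustration of that general idea (not the specific operators or auxiliary tools evaluated in the paper), the Python sketch below filters a keyed stream idempotently with bounded state; the DedupFilter name, the capacity parameter, and the eviction policy are illustrative assumptions, not taken from the publication.

from collections import OrderedDict

class DedupFilter:
    """Forwards a record only the first time its key is seen.
    State is bounded to `capacity` keys; the oldest keys are evicted,
    so very late duplicates may still slip through (a common trade-off)."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self._seen = OrderedDict()  # insertion-ordered set of seen keys

    def accept(self, key):
        """Return True if the record should be forwarded, False if it is a duplicate."""
        if key in self._seen:
            self._seen.move_to_end(key)      # refresh recency of the key
            return False
        self._seen[key] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)   # evict the oldest key to bound state
        return True

# Usage: deduplicate a small stream of (event_id, payload) records.
dedup = DedupFilter(capacity=1000)
events = [("e1", "a"), ("e2", "b"), ("e1", "a"), ("e3", "c")]
unique = [(k, v) for k, v in events if dedup.accept(k)]
print(unique)  # [('e1', 'a'), ('e2', 'b'), ('e3', 'c')]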