Maxson: Reduce duplicate parsing overhead on raw data

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Maxson : Reduce duplicate parsing overhead on raw data. / Shi, Xuanhua; Zhang, Yipeng; Huang, Hong; Hu, Zhenyu; Jin, Hai; Shen, Huan; Zhou, Yongluan; He, Bingsheng; Li, Ruibo; Zhou, Keyong.

Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020. IEEE, 2020. p. 1621-1632 9101499.

Harvard

Shi, X, Zhang, Y, Huang, H, Hu, Z, Jin, H, Shen, H, Zhou, Y, He, B, Li, R & Zhou, K 2020, Maxson: Reduce duplicate parsing overhead on raw data. in Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020., 9101499, IEEE, pp. 1621-1632, 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, United States, 20/04/2020. https://doi.org/10.1109/ICDE48307.2020.00144

APA

Shi, X., Zhang, Y., Huang, H., Hu, Z., Jin, H., Shen, H., Zhou, Y., He, B., Li, R., & Zhou, K. (2020). Maxson: Reduce duplicate parsing overhead on raw data. In Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020 (pp. 1621-1632). [9101499] IEEE. https://doi.org/10.1109/ICDE48307.2020.00144

Vancouver

Shi X, Zhang Y, Huang H, Hu Z, Jin H, Shen H et al. Maxson: Reduce duplicate parsing overhead on raw data. In Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020. IEEE. 2020. p. 1621-1632. 9101499 https://doi.org/10.1109/ICDE48307.2020.00144

Author

Shi, Xuanhua ; Zhang, Yipeng ; Huang, Hong ; Hu, Zhenyu ; Jin, Hai ; Shen, Huan ; Zhou, Yongluan ; He, Bingsheng ; Li, Ruibo ; Zhou, Keyong. / Maxson : Reduce duplicate parsing overhead on raw data. Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020. IEEE, 2020. pp. 1621-1632

Bibtex

@inproceedings{242ab9495bbd4d61adf2b463a1b249a1,
title = "Maxson: Reduce duplicate parsing overhead on raw data",
abstract = "JSON is a very popular data format in many applications in Web and enterprise. Recently, many data analytical systems support the loading and querying JSON data. However, JSON parsing can be costly, which dominates the execution time of querying JSON data. Many previous studies focus on building efficient parsers to reduce this parsing cost, and little work has been done on how to reduce the occurrences of parsing. In this paper, we start with a study with a real production workload in Alibaba, which consists of over 3 million queries on JSON. Our study reveals significant temporal and spatial correlations among those queries, which result in massive redundant parsing operations among queries. Instead of repetitively parsing the JSON data, we propose to develop a cache system named Maxson for caching the JSON query results (the values evaluated from JSONPath) for reuse. Specifically, we develop effective machine learning-based predictor with combining LSTM (long short-term memory) and CRF (conditional random field) to determine the JSONPaths to cache given the space budget. We have implemented Maxson on top of SparkSQL. We experimentally evaluate Maxson and show that 1) Maxson is able to eliminate the most of duplicate JSON parsing overhead, 2) Maxson improves end-to-end workload performance by 1.5-6.5×.",
keywords = "Data analytics system, JSON parsing, Semi-structured format",
author = "Xuanhua Shi and Yipeng Zhang and Hong Huang and Zhenyu Hu and Hai Jin and Huan Shen and Yongluan Zhou and Bingsheng He and Ruibo Li and Keyong Zhou",
year = "2020",
month = apr,
doi = "10.1109/ICDE48307.2020.00144",
language = "English",
pages = "1621--1632",
booktitle = "Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020",
publisher = "IEEE",
note = "36th IEEE International Conference on Data Engineering, ICDE 2020 ; Conference date: 20-04-2020 Through 24-04-2020",

}

RIS

TY - GEN

T1 - Maxson

T2 - 36th IEEE International Conference on Data Engineering, ICDE 2020

AU - Shi, Xuanhua

AU - Zhang, Yipeng

AU - Huang, Hong

AU - Hu, Zhenyu

AU - Jin, Hai

AU - Shen, Huan

AU - Zhou, Yongluan

AU - He, Bingsheng

AU - Li, Ruibo

AU - Zhou, Keyong

PY - 2020/4

Y1 - 2020/4

AB - JSON is a very popular data format in many applications in Web and enterprise. Recently, many data analytical systems support the loading and querying JSON data. However, JSON parsing can be costly, which dominates the execution time of querying JSON data. Many previous studies focus on building efficient parsers to reduce this parsing cost, and little work has been done on how to reduce the occurrences of parsing. In this paper, we start with a study with a real production workload in Alibaba, which consists of over 3 million queries on JSON. Our study reveals significant temporal and spatial correlations among those queries, which result in massive redundant parsing operations among queries. Instead of repetitively parsing the JSON data, we propose to develop a cache system named Maxson for caching the JSON query results (the values evaluated from JSONPath) for reuse. Specifically, we develop effective machine learning-based predictor with combining LSTM (long short-term memory) and CRF (conditional random field) to determine the JSONPaths to cache given the space budget. We have implemented Maxson on top of SparkSQL. We experimentally evaluate Maxson and show that 1) Maxson is able to eliminate the most of duplicate JSON parsing overhead, 2) Maxson improves end-to-end workload performance by 1.5-6.5×.

KW - Data analytics system

KW - JSON parsing

KW - Semi-structured format

UR - http://www.scopus.com/inward/record.url?scp=85085857423&partnerID=8YFLogxK

U2 - 10.1109/ICDE48307.2020.00144

DO - 10.1109/ICDE48307.2020.00144

M3 - Article in proceedings

AN - SCOPUS:85085857423

SP - 1621

EP - 1632

BT - Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020

PB - IEEE

Y2 - 20 April 2020 through 24 April 2020

ER -

ID: 245634371