Maxson: Reduce duplicate parsing overhead on raw data
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Maxson: Reduce duplicate parsing overhead on raw data. / Shi, Xuanhua; Zhang, Yipeng; Huang, Hong; Hu, Zhenyu; Jin, Hai; Shen, Huan; Zhou, Yongluan; He, Bingsheng; Li, Ruibo; Zhou, Keyong.
Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020. IEEE, 2020. p. 1621-1632 9101499.
RIS
TY - GEN
T1 - Maxson: Reduce duplicate parsing overhead on raw data
T2 - 36th IEEE International Conference on Data Engineering, ICDE 2020
AU - Shi, Xuanhua
AU - Zhang, Yipeng
AU - Huang, Hong
AU - Hu, Zhenyu
AU - Jin, Hai
AU - Shen, Huan
AU - Zhou, Yongluan
AU - He, Bingsheng
AU - Li, Ruibo
AU - Zhou, Keyong
PY - 2020/4
Y1 - 2020/4
N2 - JSON is a very popular data format in many Web and enterprise applications. Recently, many data analytical systems have added support for loading and querying JSON data. However, JSON parsing can be costly, dominating the execution time of queries on JSON data. Many previous studies focus on building efficient parsers to reduce this parsing cost, and little work has been done on how to reduce the occurrences of parsing. In this paper, we start with a study of a real production workload at Alibaba, which consists of over 3 million queries on JSON. Our study reveals significant temporal and spatial correlations among those queries, which result in massive redundant parsing operations across queries. Instead of repeatedly parsing the JSON data, we propose a cache system named Maxson that caches the JSON query results (the values evaluated from JSONPath) for reuse. Specifically, we develop an effective machine learning-based predictor that combines LSTM (long short-term memory) and CRF (conditional random field) to determine which JSONPaths to cache given the space budget. We have implemented Maxson on top of SparkSQL. We experimentally evaluate Maxson and show that 1) Maxson is able to eliminate most of the duplicate JSON parsing overhead, and 2) Maxson improves end-to-end workload performance by 1.5-6.5×.
KW - Data analytics system
KW - JSON parsing
KW - Semi-structured format
UR - http://www.scopus.com/inward/record.url?scp=85085857423&partnerID=8YFLogxK
U2 - 10.1109/ICDE48307.2020.00144
DO - 10.1109/ICDE48307.2020.00144
M3 - Article in proceedings
AN - SCOPUS:85085857423
SP - 1621
EP - 1632
BT - Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020
PB - IEEE
Y2 - 20 April 2020 through 24 April 2020
ER -
ID: 245634371