Maxson: Reduce duplicate parsing overhead on raw data

Publication: Contribution to book/anthology/report, Conference contribution in proceedings, Research, peer-reviewed

  • Xuanhua Shi
  • Yipeng Zhang
  • Hong Huang
  • Zhenyu Hu
  • Hai Jin
  • Huan Shen
  • Yongluan Zhou
  • Bingsheng He
  • Ruibo Li
  • Keyong Zhou

JSON is a very popular data format in many Web and enterprise applications. Recently, many data analytical systems have added support for loading and querying JSON data. However, JSON parsing can be costly and can dominate the execution time of queries on JSON data. Many previous studies focus on building efficient parsers to reduce this parsing cost, but little work has been done on reducing how often parsing occurs. In this paper, we start with a study of a real production workload at Alibaba, which consists of over 3 million queries on JSON. Our study reveals significant temporal and spatial correlations among those queries, which result in massive redundant parsing operations across queries. Instead of repeatedly parsing the JSON data, we propose a cache system named Maxson that caches the JSON query results (the values evaluated from JSONPath) for reuse. Specifically, we develop an effective machine learning-based predictor that combines LSTM (long short-term memory) and CRF (conditional random field) to determine which JSONPaths to cache given the space budget. We have implemented Maxson on top of SparkSQL. We experimentally evaluate Maxson and show that 1) Maxson is able to eliminate most of the duplicate JSON parsing overhead, and 2) Maxson improves end-to-end workload performance by 1.5-6.5×.
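
To make the caching idea in the abstract concrete, the following is a minimal sketch and not the Maxson implementation. It assumes a simplified dotted path syntax instead of full JSONPath, an in-memory dictionary instead of a cache integrated into SparkSQL, and a plain frequency count as a stand-in for the LSTM+CRF predictor. The names `eval_path`, `PathCache`, `choose_paths`, and the sample row are hypothetical.

```python
import json
from collections import Counter


def eval_path(doc, path):
    """Evaluate a simplified dotted path such as '$.user.city' (not full JSONPath)."""
    value = doc
    for key in path.lstrip("$").strip(".").split("."):
        value = value[key]
    return value


class PathCache:
    """Caches values of selected paths so repeated queries skip json.loads."""

    def __init__(self, budget):
        self.budget = budget      # assumed budget: max number of distinct paths to cache
        self.hot_paths = set()    # paths selected for caching
        self.values = {}          # (row_id, path) -> cached value
        self.parse_count = 0      # number of times raw JSON was parsed

    def choose_paths(self, query_log):
        # Stand-in for the paper's LSTM+CRF predictor: keep the most frequent paths.
        counts = Counter(query_log)
        self.hot_paths = {p for p, _ in counts.most_common(self.budget)}

    def get(self, row_id, raw_json, path):
        key = (row_id, path)
        if key in self.values:
            return self.values[key]          # cache hit: no parsing needed
        self.parse_count += 1                # cache miss: parse the raw JSON
        value = eval_path(json.loads(raw_json), path)
        if path in self.hot_paths:
            self.values[key] = value
        return value


# Hypothetical sample data and query history.
rows = {1: '{"user": {"city": "Wuhan", "age": 30}}'}
cache = PathCache(budget=1)
cache.choose_paths(["$.user.city", "$.user.city", "$.user.age"])
for _ in range(3):
    cache.get(1, rows[1], "$.user.city")
print(cache.parse_count)  # 1: only the first lookup parsed the raw JSON
```

In this sketch, a path outside the selected set is parsed on every access, which is the redundant work the abstract describes; the budget parameter only limits how many paths may be cached, whereas a real system would presumably account for the size of the cached values as well.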

Original language: English
Title: Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020
Publisher: IEEE
Publication date: Apr. 2020
Pages: 1621-1632
Article number: 9101499
ISBN (Electronic): 9781728129037
DOI
Status: Published - Apr. 2020
Event: 36th IEEE International Conference on Data Engineering, ICDE 2020 - Dallas, USA
Duration: 20 Apr. 2020 to 24 Apr. 2020

Conference

Conference: 36th IEEE International Conference on Data Engineering, ICDE 2020
Country: USA
City: Dallas
Period: 20/04/2020 to 24/04/2020

ID: 245634371