Maxson: Reduce duplicate parsing overhead on raw data

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings

  • Xuanhua Shi
  • Yipeng Zhang
  • Hong Huang
  • Zhenyu Hu
  • Hai Jin
  • Huan Shen
  • Yongluan Zhou
  • Bingsheng He
  • Ruibo Li
  • Keyong Zhou

JSON is a very popular data format in many Web and enterprise applications. Recently, many data analytics systems have added support for loading and querying JSON data. However, JSON parsing can be costly and dominate the execution time of queries on JSON data. Many previous studies focus on building efficient parsers to reduce this parsing cost, while little work has been done on reducing how often parsing occurs. In this paper, we start with a study of a real production workload at Alibaba, which consists of over 3 million queries on JSON. Our study reveals significant temporal and spatial correlations among those queries, which result in massive redundant parsing operations across queries. Instead of repeatedly parsing the JSON data, we propose a cache system named Maxson that caches JSON query results (the values evaluated from JSONPath) for reuse. Specifically, we develop an effective machine learning-based predictor that combines LSTM (long short-term memory) and CRF (conditional random field) to determine which JSONPaths to cache under a given space budget. We have implemented Maxson on top of SparkSQL. Our experimental evaluation shows that 1) Maxson eliminates most of the duplicate JSON parsing overhead, and 2) Maxson improves end-to-end workload performance by 1.5-6.5×.
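
The caching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `PathValueCache`, the helper `eval_path`, and the simplified dotted-path syntax (a small subset of JSONPath) are assumptions made for illustration, and the set of paths to cache is supplied by hand here rather than chosen by the LSTM and CRF predictor.

```python
import json


def eval_path(doc, path):
    """Evaluate a simplified dotted path such as '$.user.id'.

    Hypothetical helper: handles only nested object keys, not full JSONPath.
    """
    value = doc
    for key in path.lstrip("$").split("."):
        if key:  # skip the empty segment left by the leading '$.'
            value = value[key]
    return value


class PathValueCache:
    """Hypothetical cache of path values, keyed by (row id, path)."""

    def __init__(self, cached_paths):
        # Paths selected for caching; in this sketch they are given by hand.
        self.cached_paths = set(cached_paths)
        self.values = {}
        self.parse_count = 0  # number of times raw JSON had to be parsed

    def get(self, row_id, raw_json, path):
        key = (row_id, path)
        if key in self.values:
            return self.values[key]  # cache hit: no parsing needed
        self.parse_count += 1
        value = eval_path(json.loads(raw_json), path)
        if path in self.cached_paths:
            self.values[key] = value
        return value


raw = '{"user": {"id": 7, "name": "a"}}'
cache = PathValueCache(["$.user.id"])
first = cache.get(1, raw, "$.user.id")    # parses the raw JSON
second = cache.get(1, raw, "$.user.id")   # served from the cache
name = cache.get(1, raw, "$.user.name")   # not a cached path, parses again
```

In this sketch, repeated lookups of a cached path on the same row parse the raw JSON only once, while uncached paths are parsed on every access.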

Original language: English
Title of host publication: Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020
Publisher: IEEE
Publication date: Apr 2020
Pages: 1621-1632
Article number: 9101499
ISBN (Electronic): 9781728129037
Publication status: Published - Apr 2020
Event: 36th IEEE International Conference on Data Engineering, ICDE 2020 - Dallas, United States
Duration: 20 Apr 2020 to 24 Apr 2020

Conference

Conference: 36th IEEE International Conference on Data Engineering, ICDE 2020
Country: United States
City: Dallas
Period: 20/04/2020 to 24/04/2020

    Research areas

  • Data analytics system, JSON parsing, Semi-structured format

ID: 245634371