Scalable Real-Time Reporting from HBase NoSQL Databases using Optimized Spark SQL Frameworks
Keywords:
NoSQL, HBase, Spark SQL, real-time reporting, query optimization
Abstract
HBase is a NoSQL database, and the increasing dependence on this type of database for managing large-scale, high-velocity datasets presents a number of challenges for real-time analytical reporting. Traditional querying techniques often suffer significant performance bottlenecks because of the inherent architectural constraints of distributed storage systems. The objective of this research is to introduce a Spark SQL-based framework suited to optimizing query execution and report generation from HBase, using advanced Scala-based optimization to enhance computational efficiency.