Commenced in January 2007
Frequency: Monthly
Edition: International
Regression Approach for Optimal Purchase of Hosts Cluster in Fixed Fund for Hadoop Big Data Platform

Authors: Haitao Yang, Jianming Lv, Fei Xu, Xintong Wang, Yilin Huang, Lanting Xia, Xuewu Zhu


Given a fixed fund, building a Hadoop big data platform forces a trade-off between purchasing fewer hosts of higher capability and purchasing more hosts of lower capability. This exploratory study addresses that trade-off for a Housing Big Data Platform project (HBDP), where typical big data computing consists of SQL queries involving aggregation, joins, and space-time condition selections executed over massive data from more than 10 million housing units. In HBDP, an empirical formula was introduced to predict the performance of candidate host clusters on the intended typical big data computing, and it was shaped via a regression approach. With this empirical formula, it is easy to suggest an optimal cluster configuration. The investigation was based on a typical Hadoop computing ecosystem, HDFS+Hive+Spark. A metric was proposed to measure the performance of Hadoop clusters in HBDP, and the measured performance was compared with its predicted counterpart on three kinds of typical SQL query tasks. Tests were conducted with respect to CPU benchmark, memory size, virtual host division, and the number of physical hosts in the cluster. The research has been applied to practical cluster procurement for housing big data computing.

Keywords: Hadoop platform planning, optimal cluster scheme at fixed-fund, performance empirical formula, typical SQL query tasks.
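The abstract does not state the empirical formula itself, so the following is only a minimal sketch of the general idea: fit a performance model by regression, then use it to compare the clusters a fixed fund can buy. Everything specific here is an assumption made for illustration: the power-law form P = c * n^alpha * s^beta (n hosts, per-host CPU benchmark score s), the function names, the synthetic measurements, and the host prices. Memory size and virtual host division, which the study also tested, are left out for brevity.

```python
import math

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_power_law(samples):
    """Least-squares fit of log P = log c + alpha*log n + beta*log s.

    samples: iterable of (hosts, cpu_score, measured_performance).
    The power-law form is an assumed stand-in, not the paper's formula.
    """
    rows = [[1.0, math.log(n), math.log(s)] for n, s, _ in samples]
    ys = [math.log(p) for _, _, p in samples]
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    log_c, alpha, beta = solve_linear(xtx, xty)
    return math.exp(log_c), alpha, beta

def predict(params, hosts, cpu_score):
    """Predicted performance under the assumed power-law model."""
    c, alpha, beta = params
    return c * hosts ** alpha * cpu_score ** beta

def best_cluster(params, budget, host_options):
    """Pick the host type that maximizes predicted performance within budget.

    host_options: iterable of (name, unit_price, cpu_score); all hypothetical.
    """
    best = None
    for name, price, score in host_options:
        hosts = int(budget // price)  # spend the whole fund on one host type
        if hosts < 1:
            continue
        perf = predict(params, hosts, score)
        if best is None or perf > best[2]:
            best = (name, hosts, perf)
    return best

# Synthetic, noise-free "measurements" generated from known parameters,
# used only to show that the fit recovers them. Not data from the study.
TRUE_PARAMS = (2.0, 0.9, 0.7)
samples = [(n, s, predict(TRUE_PARAMS, n, s))
           for n in (2, 4, 8) for s in (1000, 2000, 4000)]

params = fit_power_law(samples)
choice = best_cluster(params, 100_000,
                      [("small", 5_000, 1000), ("large", 20_000, 4000)])
print(params, choice)
```

In this toy setup the price per benchmark point is the same for both host types, so the outcome depends only on the fitted exponents: because the assumed alpha exceeds beta, many small hosts are predicted to outperform a few large ones. With real measurements the exponents, and therefore the recommendation, could differ.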


