White Paper

Evaluating the Impact of SSDs and InfiniBand in Hadoop Cluster Performance and Costs

A report from the Barcelona Supercomputing Center evaluating the impact of SSDs and InfiniBand on performance and cost in Hadoop clusters. (9 pages)

Introduction

Over the past years, the exponential growth of data, its generation speed, and its expected consumption rate have presented one of the most important challenges in IT. Under this scenario, storage becomes a critical component to economically store and process data, and to scale rapidly as data grows. While Big Data has traditionally been a field for economical, large-capacity storage solutions such as rotational drives (e.g., SATA drives), Solid State Drive technology offers reduced latency and increased throughput over its rotational counterparts. NAND flash technology also has other benefits for improving data centers, such as increased density, higher reliability, and power savings. In recent years the cost of SSDs has been decreasing while their capacity increases [9, 11, 12]; not only do projections show that they are a viable storage solution for Big Data processing, but they can already provide a reduced Total Cost of Ownership (TCO) and speed up applications.

The traditional Hadoop architecture was designed with spinning disks in mind. In this distributed architecture, the assumption is that you can fetch data from another node faster than from a local, slow spinning hard disk. However, as SSDs are adopted in Hadoop clusters, their higher throughput and lower latency shift the bottleneck from storage to network. Adding more nodes to increase Hadoop scalability further increases the network traffic between the nodes. High-speed networking technologies such as 10 Gigabit Ethernet and InfiniBand therefore become almost compulsory to keep up with flash technology. Our results confirm this as well.

The intent of this technical report is to obtain concrete numbers for both technologies and to use them as a reference baseline in terms of job execution time (performance) and cost, and to evaluate how SSDs can work independently or together with traditional disks in commodity, entry-level Hadoop clusters. Along with the SSD evaluation, we test the impact of high-speed, low-latency networking such as InfiniBand, and compare it to its commodity counterpart, Gigabit Ethernet. We first evaluate the performance in terms of speedups, and second, the cost-effectiveness of each HW and SW combination. The final goal is to produce baseline numbers for the expected performance of market-available, consumer-grade technologies, to later contrast them with our current research into future datacenter technologies, and to make the results publicly available to other researchers and Big Data practitioners. This study is part of the ALOJA project, an open research initiative from the Barcelona Supercomputing Center (BSC) to increase cost-efficiency and the general understanding of Big Data systems via automation and learning.

Organization

The report is organized as follows: Section 2 presents the ALOJA project in which this study is framed. In Section 3, the technical specifications of the cluster are provided. Section 4 presents performance metrics, speedups, and cost-effectiveness data covering different subsets of executions. In Section 5 we highlight the main results and observations from the benchmarks. Finally, in Section 6 we make some final remarks and present the future lines for the project.

 

Figure 1: Main components and workflow in the ALOJA platform

 

The ALOJA Big Data benchmarking project

The ALOJA project [10] is an open initiative from the Barcelona Supercomputing Center (BSC) to explore and automate the characterization of cost-effectiveness for Big Data deployments. BSC is a center with over 8 years of research expertise in Hadoop environments, and counts on support from industrial partners in the area of Big Data technologies. ALOJA attempts to provide solutions to an increasingly important problem for the Big Data community: the lack of understanding of which parameters, either SW or HW, determine the performance of Big Data workloads. The selected configuration determines the speed at which data is processed and returned and, most importantly, the hosting budget.

Additionally, as open-source frameworks such as Hadoop and Spark become more common, they can be found in a diversity of operational environments, ranging from low-end commodity clusters and low-powered micro-servers to high-end data appliances, including all types of Cloud-based solutions at scale, i.e., IaaS and PaaS. Due to the large number of software configuration options (more than 100 interrelated parameters [4, 5, 6]) and the increasing number of deployment types, optimizing the performance of Big Data systems requires extensive manual benchmarking [5, 8]. These reasons, as well as the lack of automation tools for Big Data, have led us to progressively build an automated benchmarking platform to deal with defining cluster setups, server deployment, benchmarking execution plans, orchestration of configuration, and data management of results.

The ALOJA platform is composed of open-source tools to achieve automated benchmarking of Big Data deployments. These tools are used by researchers testing new features and algorithms, and by practitioners needing either to privately test their own systems and applications, or to improve benchmark results. The 3 main components are: the Big Data benchmarking scripts, which deploy servers and execute benchmarks; the online repository; and the Web Analytics tools. These feed each other in a continuous loop as benchmarks are executed, as shown in Figure 1.

Benchmarking methodology

Due to the large number of configuration options that have an effect on Hadoop's performance, this study makes use of the ALOJA platform to manage and update cluster configurations, orchestrate benchmark execution, and collect data, in order to automate the process of testing over 1,000 benchmark runs with different settings, as well as to automate the evaluation of the results through the online analytic tools [1], following the workflow previously presented in Figure 1.

Hadoop's distribution includes jobs that can be used to benchmark its performance, usually referred to as micro-benchmarks; however, these types of benchmarks usually have limitations in their representativeness and variety [7]. ALOJA currently features the HiBench open-source benchmark suite from Intel [7], which can be more realistic and comprehensive than the example jobs supplied with Hadoop. HiBench features several ready-to-use benchmarks from 4 categories: micro-benchmarks, Web search, Machine Learning, and HDFS benchmarks. While the figures in the next sections show the results from functional benchmarks, more specifically terasort (using 100GB), DFSIOE R/W, sort, and wordcount, the whole HiBench suite was run, and the results can be found in the online repository [1] detailed next.
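
As an illustration of how a single timed run could be reproduced outside the ALOJA scripts, the following sketch launches a roughly 100GB terasort using the examples jar bundled with Hadoop 1.x. HiBench ships its own driver scripts with additional data-generation and reporting steps; the jar path and HDFS directories below are assumptions for illustration only, not the exact commands used in this study.

```python
# Minimal sketch: generate and sort ~100 GB with Hadoop 1.x's bundled examples
# jar, timing each job the same way ALOJA reports runs (wall-clock seconds).
# The jar path and HDFS paths are illustrative assumptions.
import subprocess
import time

EXAMPLES_JAR = "/opt/hadoop-1.0.3/hadoop-examples-1.0.3.jar"  # assumed install path
ROWS = 1000000000  # teragen writes 100-byte rows, so ~100 GB in total

def timed_job(args):
    """Run a Hadoop job through the CLI and return its wall-clock time in seconds."""
    start = time.time()
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR] + args, check=True)
    return time.time() - start

gen_time = timed_job(["teragen", str(ROWS), "/benchmarks/terasort-in"])
sort_time = timed_job(["terasort", "/benchmarks/terasort-in", "/benchmarks/terasort-out"])
print("teragen: %.0fs, terasort: %.0fs" % (gen_time, sort_time))
```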

Online repository and Platform Tools

Part of the goals for ALOJA include the creation of a vendor-neutral, open, public Big Data benchmark repository. This effort currently features more than 50,000 Big Data job benchmark runs, along with their performance traces and logs. As few organizations have the time or the performance-profiling expertise, we expect the repository to benefit the Big Data community by complementing their cost-performance comparison needs without having to repeat benchmarks, as well as to bring more transparency to product claims and reports, since results are fully disclosed and experiments can be easily repeated, especially for public Cloud results.

In total, more than 1,000 executions were run on the cluster for this study, with different combinations of storage, networking, and Hadoop configuration options, i.e., the concurrent number of containers (mappers or reducers). The results repository keeps growing as we continue the project's benchmarking efforts. The results from this report can also be compared with online results from different clusters, including Cloud deployments.

 

Experimental setup

The system under test (SUT) is composed of 9 machines, of which 1 is used as the master node and includes the Hadoop NameNode and JobTracker services. The remaining 8 nodes have a worker role (slaves), and each includes the Hadoop DataNode and TaskTracker services. Each machine has 64GB of RAM and 12 CPU cores (2x6) with hyperthreading, for a total of 24 hardware threads. Storage-wise, each machine is equipped with 6x1TB SATA drives (1 for the OS, 5 for application use) and 2x240GB SanDisk CloudSpeed ECO SATA SSDs for application use. Finally, each node has a 4x1Gb Ethernet adapter (only one port used for the tests) and a 2xFDR 56Gb InfiniBand adapter (only one port used for the tests, using IPoIB).

Cluster
  Processor: Intel(R) Xeon(R) E5-2630L @ 2.00GHz
  # of nodes: 8 + 1 master
  Memory capacity: 64GB
  Storage (1): 1x1TB SATA 3.5" 7200RPM (OS)
  Storage (2): 5x1TB SATA 2.5" 5400RPM (apps)
  Storage (3): 2x240GB SATA SSD (apps)
  Interconnect (1): 4x1Gb Ethernet
  Interconnect (2): 2xFDR 56Gb InfiniBand
CPU
  Architecture: Sandy Bridge
  # of sockets: 2
  Cores per socket: 6
  Threads per core: 2
  # of cores: 12 (6x2)
  # of hw threads: 24 (2x6x2 = 12x2)
  Max frequency: 2.0 GHz
  Memory BW: 42.6 GB/s
Hadoop
  Version: Apache Hadoop v1.0.3
OS
  Type: GNU/Linux
  Version: Ubuntu Server 14.04 (Trusty Tahr)
  Kernel: 3.13.0-65-generic x86_64

Table 1: Specifications for the System Under Test

 

Table 1 summarizes the specific HW components and versions used in this study, as well as the Operating System (OS) version. The Hadoop version used for the tests is 1.0.3. This particular version was chosen as it is compatible with the Java Instrumentation Suite (JIS) [10] developed at BSC, which allows low-level performance analysis using BSC's HPC performance tools [2]. Results with other versions, including Hadoop v2, can be found in the online repository. The operating system was Ubuntu Linux 14.04 (Trusty Tahr) with kernel version 3.13.

Tested HW and SW configurations

With respect to storage, a JBOD (Just a Bunch Of Disks) setting was selected, as recommended by the Hadoop Guide [13], to increase Hadoop's disk throughput, at the expense of application-level reliability, as opposed to the hardware-level RAID technologies commonly used for database systems. The following disk combinations were tested:

  • From 1 to 5 SATA drives as JBOD, as opposed to more reliable RAID technologies as recommended by the Hadoop Guide [13], identified in test names by HDD, HD2, HD3, HD4, HD5 respectively.
  • 1 and 2 SSDs as JBOD (SSD, SS2).
  • 5 SATA drives as JBOD and Hadoop /tmp directory on 1 SSD (HS5).

With respect to networking, 1Gb Ethernet and InfiniBand (56Gb FDR using IPoIB) were compared. With respect to Hadoop, the number of parallel containers per node (i.e., the number of mappers/reducers) was tuned, testing 12, 16, and 24, along with different block sizes, I/O buffer sizes, and compression settings. More details on the methodology for selecting SW configurations can be found in [10].
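
To make the tested combinations more concrete, the following sketch shows how a disk layout and a container count could be mapped to standard Hadoop 1.x properties. The property names are standard Hadoop 1.x keys, but the mount points, the map/reduce slot split, and the values are illustrative assumptions rather than the actual configuration files used in the study.

```python
# Minimal sketch of how a tested disk layout and container count map to
# Hadoop 1.x properties. Mount points and values are illustrative assumptions.
def hadoop_props(data_dirs, tmp_dir, max_maps, max_reduces):
    return {
        # JBOD: HDFS stripes blocks across every listed directory
        "dfs.data.dir": ",".join(d + "/hdfs" for d in data_dirs),
        # intermediate (shuffle/spill) data -- the "/tmp on SSD" variants move this
        "mapred.local.dir": tmp_dir + "/mapred/local",
        "hadoop.tmp.dir": tmp_dir + "/hadoop-tmp",
        # concurrent containers per node (12, 16 or 24 in the tests);
        # the split between map and reduce slots here is an assumption
        "mapred.tasktracker.map.tasks.maximum": str(max_maps),
        "mapred.tasktracker.reduce.tasks.maximum": str(max_reduces),
    }

# HS5: 5 SATA drives for HDFS, Hadoop temporary directories on 1 SSD
hs5 = hadoop_props([f"/scratch/hdd{i}" for i in range(1, 6)],
                   "/scratch/ssd1", max_maps=16, max_reduces=8)
for key, value in hs5.items():
    print(f"{key} = {value}")
```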

 

Evaluations

This section presents the impact of different Hadoop configuration parameters, as well as hardware configurations, on both performance and costs. The following experiments can also be explored in ALOJA's online application [1], which provides further details and access to the logs.

 

Figure 2: Comparison of Hadoop processes and execution time for the fastest runs in different disk configuration categories

 

Performance metrics

The most relevant metric to determine the performance of a Hadoop application is the execution time: the time it takes a Hadoop job to complete, measured here in seconds. As an example of the different Hadoop internal processes per job, Figure 2 presents the different Hadoop phases of the fastest terasort runs for 3 different disk configurations: 2 SSDs; 5 SATA disks with /tmp on 1 SSD; and 5 SATA disks, respectively. All of them use IPoIB networking, as it gave the best results over GbE for each of the disk configurations.

As expected, the fastest configuration uses 2 SSDs, at 207s. However, by using a combination of 5 SATA drives and 1 SSD for temporary files, the second configuration achieves a close 228s, while increasing capacity at the cost of the extra SATA drives. The duration of the reduce phase is roughly the same for the 3 jobs in Figure 2, but when Hadoop's temporary directory resides on flash storage (first and second job) the duration of the shuffle phase is significantly shorter. The third configuration, at 348s, is the fastest among all those using only SATA disks (5 of them), but 68% slower than the 2-SSD option. The rest of the results for the other benchmarks can be seen online. Briefly, to summarize: wordcount, being CPU-bound, achieves similar numbers on all 3 configurations; DFSIOE does not make use of temporary storage, so the SSD-only solution is significantly faster than the combined or SATA-only solutions. DFSIOE results can be seen in the next sub-section, which details the speedup of selected benchmarks.

Figure 3: Speedup by storage configuration

 

Performance Speedups

The speedup is defined as the relative performance improvement, in this case with respect to the average execution time for the selected benchmark. The average execution time corresponds to 1 on the X axis; any value below 1 represents a slowdown and any value above 1 represents a speedup over the average. Figure 3 shows speedups comparing four storage configurations across 523 different runs, the aggregation groups being: 2 SATA, 1 SATA, 2 SSDs, and 1 SSD, respectively. Here we can see that the configuration with 2 SSDs is the best on all benchmarks, followed by the single-SSD one. The speedup is especially notable for terasort. We can also notice that while scaling from 1 to 2 SATA drives almost doubles the speedup, for SSDs the increase is limited to between 15% and 30% depending on the benchmark. Since SSDs scale as well as SATA drives or better, this is probably due either to the benchmark or Hadoop not pushing enough data to fully utilize the SSD performance, or to some hardware or OS bottleneck.
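
The normalization used in these speedup figures can be expressed with a small sketch: each configuration's average execution time is compared against the overall average of the selected benchmark. The timings below are partly invented and used for illustration only.

```python
# Minimal sketch of the speedup normalization used in Figures 3-6: each
# configuration is compared against the average execution time of the selected
# benchmark, so 1.0 is the average, >1 is a speedup and <1 is a slowdown.
# The second run of each configuration is invented for illustration.
def speedups(times_by_config):
    all_times = [t for times in times_by_config.values() for t in times]
    avg = sum(all_times) / len(all_times)
    # average time of a configuration relative to the overall average
    return {cfg: avg / (sum(times) / len(times))
            for cfg, times in times_by_config.items()}

example = {"2 SSD": [207, 215], "5 SATA + /tmp SSD": [228, 240], "5 SATA": [348, 360]}
print(speedups(example))  # e.g. {'2 SSD': ~1.26, '5 SATA + /tmp SSD': ~1.14, '5 SATA': ~0.75}
```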

 

Figure 4: Speedup by network configuration

 

Figure 4 shows speedups by network configuration (Ethernet vs. InfiniBand), aggregating 882 different runs of the selected benchmarks. As expected, InfiniBand is faster than Ethernet, but the gain differs by benchmark: DFSIOE Read gets up to a 60% improvement and terasort around 20%, both on average.

 

Figure 5: Speedup by storage + network + Hadoop processes (Ethernet)

 

Figures 5 (Ethernet) and 6 (InfiniBand) show the effect of varying the number of concurrent containers (i.e., mappers and reducers in parallel) for each disk technology. In Figure 5 (Ethernet-based), we can see that the optimal configuration for 1 and 2 SSDs uses 16 concurrent processes per node. For 5 SATA drives + /tmp on SSD, 24 concurrent processes is the most efficient, while for 5 SATA drives, 12 concurrent processes is the maximum concurrency allowed by the HW. In Figure 6 (IB-based), we can see that for 1 SSD the most efficient is also 16 containers, while having 2 SSDs allows stretching to 24 concurrent processes, giving 10% more speedup. For the SATA tests, the same values as with Ethernet are recommended.

 

Figure 6: Speedup by storage + network + Hadoop processes (InfiniBand)

 

Varying the number of concurrent processes per node according to the HW has the following implications: a higher number of disks in the system allows for running more concurrent processes, speeding up runs. This holds for 5 SATA drives as well as for SSDs. However, while SATA does not get a significant boost from InfiniBand, for SSDs fast networking is required to profit from the higher concurrency enabled by adding more disks, as shown in the InfiniBand case in Figure 6.

Cost-effectiveness

While the previous sub-section showed the best results in absolute performance terms, here we rank executions by the execution time relative to the total cost of executing the job, as if it were a cloud-based resource fully utilized at all times. The cluster costs are calculated by adding the HW costs of the particular setup, amortizing the HW over 3 years, and adding service and maintenance costs over this period to estimate the cost per hour of the setup. The costs used for the SUT can be seen at the http://aloja.bsc.es/clustercosteffectiveness page, along with the rest of the clusters benchmarked for ALOJA. The page also allows the user to change costs, to experiment with the impact of expected costs.
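
As an illustration of this cost model, the following sketch amortizes a hardware cost over 3 years, adds yearly maintenance, and converts a job's runtime into an execution cost. The dollar figures are placeholders chosen to roughly reproduce the $0.33 per run of Table 2, not the actual ALOJA cost inputs.

```python
# Minimal sketch of the execution-cost model described above: hardware is
# amortized over 3 years, service/maintenance is added, and a job's cost is
# its runtime times the resulting per-hour cluster price. The dollar figures
# below are placeholders, not the actual ALOJA cost inputs.
HOURS_3Y = 3 * 365 * 24  # amortization period in hours

def cost_per_hour(hw_cost, yearly_maintenance):
    total = hw_cost + 3 * yearly_maintenance
    return total / HOURS_3Y

def execution_cost(exec_seconds, hw_cost, yearly_maintenance):
    return (exec_seconds / 3600.0) * cost_per_hour(hw_cost, yearly_maintenance)

# e.g. a 9-node cluster: placeholder $120k HW + $10k/year maintenance
print("$%.2f per 207 s run" % execution_cost(207, 120000, 10000))  # ~$0.33
```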

 

Figure 7: Comparison of cost effectiveness

 

The cost-effectiveness diagram in Figure 7 places the different evaluated HW configurations as if they were different clusters, where each bubble represents a different cluster. It places each cluster at an (X,Y) location in a bidimensional space, where the X and Y axes represent the normalized execution time and cost, respectively. The figure is further divided into 4 quadrants to represent execution cost vs. performance: Fast-Economical, Fast-Expensive, Slow-Expensive, and Slow-Economical, with point (0,0) representing the most cost-effective execution. The bubble sizes in this case represent the number of executions that fit each HW configuration. This type of analysis allows the user to quickly filter out slow and expensive HW configurations, and to concentrate (run more benchmarks with different SW configurations) on the most beneficial ones, closer to point (0,0) in the chart.
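
One plausible way to compute such a quadrant placement is sketched below: execution time and cost are normalized over the set of configurations, so (0,0) is the fastest and cheapest corner. The exact normalization used by the ALOJA application may differ; the values are taken from Table 2 for illustration.

```python
# Minimal sketch of placing configurations in Figure 7's quadrant chart:
# X is execution time and Y is cost, both min-max normalized to [0, 1].
# This is one plausible normalization, not necessarily the one ALOJA uses.
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

configs = ["IB + 2 SSD", "ETH + 2 SSD", "ETH + 5 SATA"]
times = [207, 258, 348]      # seconds (from Table 2)
costs = [0.33, 0.35, 0.44]   # US$ per run (from Table 2)

for name, x, y in zip(configs, normalize(times), normalize(costs)):
    quadrant = ("Fast" if x < 0.5 else "Slow") + "-" + ("Economical" if y < 0.5 else "Expensive")
    print("%s: (%.2f, %.2f) -> %s" % (name, x, y, quadrant))
```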

To summarize Figure 7 numerically, Table 2 ranks the best configurations with respect to cost-effectiveness, together with their absolute costs and the performance obtained. The best configuration takes 207 seconds, at a cost of $0.33 USD per run; this best-ranked configuration uses IB networking and 2 SSD disks. This higher-end setup is also the most cost-effective in this case. Of course, the storage capacity of this configuration is not very large, so if higher capacity is needed, one of the configurations with 5 SATA disks would be more appropriate, although somewhat slower and more expensive.

 

Rank Cluster Exec cost Exec time Network Disk
1 minerva-100-10-18-21 0.33 US$ 207 s InfiniBand 2 SSD drives
4 minerva-100-10-18-21 0.35 US$ 258 s Ethernet 2 SSD drives
5 minerva-100-10-18-21 0.37 US$ 265 s InfiniBand 1 SSD drive
3 minerva-100-10-18-21 0.38 US$ 228 s InfiniBand 5 SATA /tmp to SSD
6 minerva-100-10-18-21 0.40 US$ 275 s Ethernet 5 SATA /tmp to SSD
8 minerva-100-10-18-21 0.41 US$ 348 s Ethernet 1 SSD drive
9 minerva-100-10-18-21 0.44 US$ 348 s Ethernet 5 SATA drives
10 minerva-100-10-18-21 0.48 US$ 325 s InfiniBand 5 SATA drives

 

Table 2: Summary of cost effectiveness when setting up HDFS on different disk configurations

 

Discussion

From the speedup diagrams it can be seen that, while SSD and InfiniBand always provide better performance, the actual gain depends on the specific benchmark under consideration. For terasort and DFSIOE the speedup is considerable, but for wordcount it is negligible as it is CPU-bound (not shown in the paper, results can be obtained online). terasort benefits the most from having Hadoop's temporary directory on SSD as it uses it heavily, while DFSIOE R/W do not use it. Notable speedup results for terasort with respect to storage (using IB networking):

  • 5 SATA disks in JBOD configuration are 80% faster than a single SATA disk.
  • A single SSD disk is 300% faster than a single SATA disk and at least 100% faster than 5 SATA disks (depending on number of processes)
  • Doubling the number of SSDs (2 vs. 1) gives only a 20% performance improvement, due to Hadoop or the benchmark not pushing enough data (since SSDs normally scale well). Less benefit is expected from adding a 3rd volume, especially in the Ethernet case.
  • 5 SATA disks in JBOD configuration + 1 SSD for Hadoop temporary directory are 300% faster than 1 SATA disk and more than 100% faster than 5 plain SATA disks (without SSD). This is the same performance as 2 SSDs, but with a much larger storage capacity.

Notable speedup results for terasort with respect to IB vs. Ethernet networking:

  • On average, IB is 60% faster than Ethernet.
  • The fastest SSD configuration gets a speedup of 13% with IB.
  • Due to the higher network speed, IB stresses the disks more in terms of IOPS and throughput, so when using IB the disk is more of a bottleneck than with Ethernet.
  • IB makes it possible to increase performance by adding more SSDs as a JBOD to the system.

Most cost-effective disk/net configurations for terasort (see also table 2):

  • First: 2 SSDs for HDFS + IB.
  • Second: 2 SSDs for HDFS + ETH.
  • Third: 1 SSD for HDFS + IB.
  • Fourth: 5 SATA for HDFS, 1 SSD for /tmp + IB
  • Fifth: 5 SATA for HDFS, 1 SSD for /tmp + ETH

 

Conclusions

This report presented the impact of introducing both SSDs and InfiniBand into a commodity-style Hadoop cluster: a setup typically found in many enterprises, with the addition of SSDs and IB for evaluation. We ran over 1,000 benchmarks from different application domains, varying the software configurations to test the distinct hardware scenarios. As an example, we have shown that for InfiniBand networks to be cost-effective, they need to be combined with SSDs or other fast disk options; otherwise, the improvement they provide is too small to justify the costs. We also presented the best Hadoop configuration options, such as the number of containers (mappers and reducers) to run in parallel according to the available CPU cores and job types, and the most cost-effective compression factor for the different workloads, showing that the best-performing concurrency varies with the chosen disk and network. Results show that, as expected, both technologies can speed up Big Data processing. However, contrary to common perception, the results also show that SSDs and InfiniBand can actually improve cost-effectiveness, even in small clusters, when they are busy most of the time. Ultimately, they speed up Big Data usage while reducing the TCO, and possibly space and energy costs as well.

These results are also used as baseline numbers for future work in the ALOJA project, to compare them with larger cluster sizes, Cloud-based deployments, and newer technologies, e.g., PCIe NVRAM disks (FusionIO, now part of SanDisk). We believe that a tiered storage layer consisting of different hierarchy levels (RAM, PCIe flash, SSDs, and rotational drives) can greatly reduce operation costs while significantly improving the availability of Big Data applications. Hadoop is already working towards this and supports tiered storage starting from version 2.3, although this has not yet been widely adopted and publicized by the various Hadoop distros; see [3] for an example.

ALOJA contributes to the Hadoop community by producing more knowledge and understanding of the underlying Hadoop runtime while it is executing. Our intent is that researchers and organizations evaluating or deploying the Hadoop solution stack will benefit from this growing database of performance results and configuration guidance.

 

References

[1] BSC. Aloja home page: http://aloja.bsc.es/, 2015.

[2] BSC. Performance tools research group page: http://www.bsc.es/computer-sciences/performance-tools, 2015.

[3] eBay tech blog. HDFS storage efficiency using tiered storage: http://www.ebaytechblog.com/2015/01/12/hdfs-storage-efficiency-using-tiered-storage, 2015.

[4] D. Heger. Hadoop Performance Tuning: https://hadoop-toolkit.googlecode.com/files/White paper-HadoopPerformanceTuning.pdf. Impetus, 2009.

[5] D. Heger. Hadoop Performance Tuning - A Pragmatic & Iterative Approach. DH Technologies, 2013.

[6] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning system for big data analytics. In CIDR, pages 261-272, 2011.

[7] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Data Engineering Workshops, 22nd International Conference on, pages 41-51, 2010.

[8] K. Kambatla, A. Pathak, and H. Pucha. Towards optimizing hadoop provisioning in the cloud. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, HotCloud'09, Berkeley, CA, USA, 2009. USENIX Association.

[9] J. O'Reilly. SSD prices in a free fall: http://www.networkcomputing.com/storage/ssd-prices-in-a-free-fall/a/d-id/1320958, 2015.
