Comparison of Hadoop and Spark performance using a stock portfolio analysis application
The world today is inundated with data amounting to terabytes and petabytes in size. In a real-world scenario, one terabyte can contain nearly 84 days of CD-quality music, and ten terabytes can hold the entire print collection of the United States Library of Congress. In the near future, the total amount of stored data will be measured in exabytes, zettabytes, and yottabytes. Transactional and operational systems, scanning systems, facility management systems, inbound and outbound customer management systems, the web, cloud computing, mobile devices, and web-based social media all contribute to the explosion of data generated and processed. This explosion is not new; it continues a trend that began in the 1970s. However, the velocity at which data are generated, the volume generated, and the variety of the data pose challenges for processing. For competitive edge and planning, businesses need to harvest every byte of data and use it to make better decisions. The need to efficiently process and analyze big data has led to the introduction of many tools that take advantage of newer technology trends such as distributed architectures and in-memory processing. Apache Hadoop is a project that develops an open-source tool for scalable, reliable, and distributed computing. Its technology library is based on the MapReduce framework and allows distributed processing of huge data sets over a cluster of computers using simple and conventional programming models. Spark is another big data tool. It is a powerful open-source processing engine that focuses on speed, sophisticated analytics, and ease of use. It is an in-memory data processing framework that is more reliable and faster than MapReduce. Previous studies have shown that Spark outperforms Hadoop when an application needs to reuse data iteratively, for example when it includes a machine learning component.
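The MapReduce programming model mentioned above can be illustrated with the canonical WordCount example: a map phase emits a (word, 1) pair for every word, and a reduce phase sums the counts per word. The following is a minimal single-machine Python sketch of those two phases; the function names and sample input are illustrative only and are not part of the Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Reducer: sum the counts for each word.
    # (In a real cluster, a shuffle step would first group pairs by key.)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark and hadoop", "hadoop writes to disk", "spark caches in memory"]
counts = reduce_phase(map_phase(lines))
print(counts["spark"], counts["hadoop"])  # each word appears twice above
```

In Hadoop, each phase runs as a distributed job with intermediate results written to disk; Spark expresses the same computation as chained in-memory transformations, which is the design difference this thesis measures.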
Unlike Hadoop, which is designed to write data to hard disk between iterations, Spark can cache the data in memory and reuse them across multiple stages of execution. In contrast to previous studies, this thesis aims to conduct a comparative performance analysis of Hadoop and Spark for non-iterative applications. In this thesis, I evaluate the efficiency of the Hadoop and Spark frameworks, which are typically used for big data processing in the cloud, using value at risk (VaR) estimation for a stock portfolio by the Monte Carlo simulation method, a benchmark not commonly used in big data tool evaluations. I also evaluate the performance of Hadoop and Spark using WordCount and Pi, which were used in prior studies. The reason for running multiple benchmarks on both tools is to analyze and compare the performance of Hadoop and Spark fairly. For VaR, WordCount, and Pi, Hadoop and Spark perform differently, for reasons that are specific to each application. The jobs I run in this experiment are considered small jobs, as the data used range from kilobytes to megabytes. The experiment shows that Spark outperforms Hadoop by up to 4x. The benefits of Spark over Hadoop stem from Spark's efficient way of reading small files as a whole directory, its low-overhead scheduler, and its lightweight design.
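To make the VaR benchmark's core computation concrete, the following is a minimal single-machine Python sketch of Monte Carlo VaR estimation. It assumes normally distributed daily returns, and all parameter values (portfolio value, mean return, volatility, horizon) are illustrative assumptions rather than the actual inputs used in the experiments; in the benchmark itself, the independent trials are what the frameworks distribute across the cluster.

```python
import random

def simulate_var(portfolio_value, mu, sigma, horizon_days,
                 n_trials, confidence=0.95, seed=42):
    """Estimate portfolio value at risk (VaR) by Monte Carlo simulation.

    Assumes daily returns are drawn from Normal(mu, sigma); each trial
    compounds `horizon_days` random daily returns and records the loss.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n_trials):
        value = portfolio_value
        for _ in range(horizon_days):
            value *= 1.0 + rng.gauss(mu, sigma)
        losses.append(portfolio_value - value)
    losses.sort()
    # VaR at the given confidence level is the loss that is not
    # exceeded in `confidence` fraction of the simulated trials.
    idx = int(confidence * n_trials) - 1
    return losses[idx]

# Illustrative parameters: $1M portfolio, 10-day horizon, 10,000 trials.
var_95 = simulate_var(1_000_000, mu=0.0005, sigma=0.02,
                      horizon_days=10, n_trials=10_000)
```

Because each trial is independent, this workload is embarrassingly parallel and non-iterative: every mapper (or Spark task) can run a batch of trials, and a single reduction collects the losses to compute the quantile, which is why it suits a Hadoop-versus-Spark comparison of non-iterative jobs.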