Hands-on Hive-on-Spark in the AWS Cloud

Interested in Hive-on-Spark progress? This new AMI gives you a hands-on experience.

Nearly one year ago, the Apache Hadoop community began to embrace Apache Spark as a powerful batch-processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this shift, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is tracked by HIVE-7292, one of the most popular JIRAs in the Hadoop ecosystem. Three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.

Since that demo, we have made tremendous progress: we have finished Map Join (HIVE-7613) and Bucket Map Join (HIVE-8638), integrated with HiveServer2 (HIVE-8993), and, importantly, integrated our Spark Client (HIVE-8548, aka Remote Spark Context). Remote Spark Context matters because it is not possible to run multiple SparkContexts within a single process. The RSC API allows us to run the SparkContext in a container on the cluster while using the Spark API on the client (in this case, HiveServer2), which reduces resource utilization on an already burdened component.

Many users have proactively started using the Spark branch and providing feedback. Today, we'd like to offer you the first chance to try Hive-on-Spark yourself. As this work is under active development, we do not recommend that most users attempt to run this code outside of the packaged Amazon Machine Image (AMI) provided. The AMI ami-35ffed70 (named hos-demo-4) is available in us-west-1; we recommend an instance type of m3.large or larger.
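For reference, launching the AMI from the AWS CLI looks roughly like the following. This is a sketch, not the only way to launch an instance: the key-pair name, security-group name, and instance address are placeholders you would replace with your own.

```shell
# Launch the Hive-on-Spark demo AMI (hos-demo-4) in us-west-1.
# --key-name and --security-groups are placeholders for your own settings.
aws ec2 run-instances \
  --region us-west-1 \
  --image-id ami-35ffed70 \
  --instance-type m3.large \
  --key-name my-keypair \
  --security-groups my-security-group

# Once the instance is running, SSH in as the ubuntu user
# (replace the hostname with your instance's public DNS name):
ssh -i my-keypair.pem ubuntu@ec2-xx-xx-xx-xx.us-west-1.compute.amazonaws.com
```

You can also launch the AMI from the EC2 console by searching for ami-35ffed70 in the us-west-1 region.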

After logging in as ubuntu, change to the hive user (sudo su - hive) and you will be greeted with instructions on how to start Hive on Spark. Pre-loaded on the AMI are a small TPC-DS dataset and some sample queries. Users are strongly encouraged to load their own sample datasets and try their own queries. We are hoping not only to showcase our progress delivering Hive-on-Spark but also to find areas for improvement early. As such, if you find any issues, please email hos-ami@cloudera.org and the cross-vendor team will do its best to investigate.
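As a sketch of what a session might look like once you are on the instance, selecting Spark as Hive's execution engine is a one-line setting (hive.execution.engine). The table name below is a hypothetical example from the TPC-DS schema; substitute whatever tables the on-box instructions point you at.

```shell
# Switch to the hive user (the AMI's startup instructions appear here)
sudo su - hive

# Run a query on the Spark engine non-interactively; store_sales is a
# hypothetical table name from the preloaded TPC-DS dataset.
hive -e "set hive.execution.engine=spark; select count(*) from store_sales;"

# Alternatively, start the Hive CLI and issue the same statements
# interactively at the hive> prompt.
```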

Despite spanning the globe, the cross-company engineering teams have become close. The team members would like to thank our employers for sponsoring this project: MapR, Intel, IBM, and Cloudera.

Rui Li is a software engineer at Intel and a contributor to Hive.

Na Yang is a staff software engineer at MapR and a contributor to Hive.

Brock Noland is an engineering manager at Cloudera and a Hive PMC member.



