Using Clojure to query AWS EMR via Hive JDBC
Background
In January, I launched an Amazon Elastic MapReduce (EMR) cluster for a migration project. Since then, I've been slowly adjusting and configuring it (with help) to:
- Use an external Hive metastore from a data lake
- Enable querying the cluster via Hive JDBC
- Enable querying the cluster via Simba JDBC (Spark Thrift)
It took a while to check off those bullets in a security-restricted environment. In the few weeks since everything's been copacetic, I've begun testing the Hive JDBC access.
The first successful test involved JRuby 9.1.15.0 on Rails 4.2 using a non-ActiveRecord model (PORO). This article illustrates my second test, using Clojure.
Prerequisites
This assumes you already have your AWS EMR cluster running with Hive et al. enabled.
Download the Hive JDBC Drivers
The first step is to download the Hive JDBC drivers Amazon provides. These were tricky to find since the links have changed over time. A 2014 blog post from Simba Technologies, on the AWS Big Data Blog, contains a link to the drivers. Despite my Google-fu, I couldn't find a more recent version of the drivers.
I'm using the JDBC 4.1 drivers found in the AmazonHiveJDBC/AmazonHiveJDBC41-1.0.9.1060 directory.
Create a New Project
Use Leiningen to create a new Clojure project:
lein new app emr-hive-jdbc-example
Add the Dependencies
Edit the project.clj file and add dependency entries for Logging and Java JDBC.
:dependencies [[org.clojure/clojure "1.8.0"]
[org.clojure/tools.logging "0.4.1"]
[org.clojure/java.jdbc "0.7.8"]]
Add the JAR files
I didn't care to mess with Maven for this example code, so I took the approach of placing the JAR files in the resources directory and adding the corresponding entries to the project.clj file.
:resource-paths ["resources/HiveJDBC41.jar"
"resources/commons-codec-1.3.jar"
"resources/commons-logging-1.1.1.jar"
"resources/hive_metastore.jar"
"resources/hive_service.jar"
"resources/httpclient-4.1.3.jar"
"resources/httpcore-4.1.3.jar"
"resources/libfb303-0.9.0.jar"
"resources/libthrift-0.9.0.jar"
"resources/log4j-1.2.14.jar"
"resources/ql.jar"
"resources/slf4j-api-1.5.11.jar"
"resources/slf4j-log4j12-1.5.11.jar"
"resources/TCLIServiceClient.jar"
"resources/zookeeper-3.4.6.jar"]
Here's what the full project.clj file should look like:
(defproject emr-hive-jdbc-example "0.1.0-SNAPSHOT"
:description "Example of how to connect to AWS EMR via Hive JDBC"
:url "https://gitlab.com/ejstembler/emr-hive-jdbc-example"
:license {:name "MIT"
:url "https://mit-license.org"}
:dependencies [[org.clojure/clojure "1.8.0"]
[org.clojure/tools.logging "0.4.1"]
[org.clojure/java.jdbc "0.7.8"]]
:resource-paths ["resources/HiveJDBC41.jar"
"resources/commons-codec-1.3.jar"
"resources/commons-logging-1.1.1.jar"
"resources/hive_metastore.jar"
"resources/hive_service.jar"
"resources/httpclient-4.1.3.jar"
"resources/httpcore-4.1.3.jar"
"resources/libfb303-0.9.0.jar"
"resources/libthrift-0.9.0.jar"
"resources/log4j-1.2.14.jar"
"resources/ql.jar"
"resources/slf4j-api-1.5.11.jar"
"resources/slf4j-log4j12-1.5.11.jar"
"resources/TCLIServiceClient.jar"
"resources/zookeeper-3.4.6.jar"]
:main ^:skip-aot emr-hive-jdbc-example.core
:target-path "target/%s"
:profiles {:uberjar {:aot :all}})
Add a log4j.properties file
Create a new log4j.properties file in the src directory. Add some basic logging configuration for log4j to pick up.
log4j.rootLogger=INFO, console
log4j.logger.example=DEBUG
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%-5p %c: %m%n
Edit the core source file
The core.clj file is the main and only source file for this example.
Add requires to namespace
At the top of the file, we need to add :require forms for logging and JDBC. I like to use succinct but logical aliases.
(ns emr-hive-jdbc-example.core
(:require [clojure.tools.logging :as log])
(:require [clojure.java.jdbc :as jdbc])
(:gen-class))
Some examples out there use j as an alias for JDBC. That's too short and not descriptive!
Read sensitive values from environment variables
Next, I'm reading some sensitive values from environment variables. You never want to hard-code sensitive values in your source code, or even in your config files! See also: The Twelve-Factor App: III. Config.
;; Define the user/password/subname from ENV variables
(def user (System/getenv "EMR_HIVE_JDBC_USER"))
(def password (System/getenv "EMR_HIVE_JDBC_PASSWORD"))
(def subname (System/getenv "EMR_HIVE_JDBC_SUBNAME"))
When dealing with JDBC in other JVM languages, I use a url. Clojure's java.jdbc, however, prefers a subname: it builds the full url from the parts you provide, and the subname is the trailing portion of that url anyway.
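If you'd rather hand java.jdbc a complete url, it also accepts one via the :connection-uri key in the db-spec Map. A sketch of that alternative; the host, port, database name, and the UID/PWD url properties are placeholder assumptions (check your driver's documentation for the exact property names it expects):

```clojure
;; Alternative db-spec: pass a full JDBC URL via :connection-uri instead of
;; the :subprotocol/:subname parts. Host, port, and UID/PWD properties below
;; are illustrative placeholders -- substitute your EMR master node's values.
(def hive-spec-uri
  {:connection-uri (str "jdbc:hive2://emr-master.example.com:10000/default"
                        ";UID=" (System/getenv "EMR_HIVE_JDBC_USER")
                        ";PWD=" (System/getenv "EMR_HIVE_JDBC_PASSWORD"))})
```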
Preconditions: Confirm the required values
I'm a fan of Bertrand Meyer's concept of Preconditions. I try to use them in nearly every language I write code in. Some languages have built-in support or third-party libraries, though sometimes it's simpler to hand-code them. See also: Design by contract.
;; Preconditions
(when (nil? user)
(throw (RuntimeException. "user is nil")))
(when (nil? password)
(throw (RuntimeException. "password is nil")))
(when (nil? subname)
(throw (RuntimeException. "subname is nil")))
Since the code cannot continue without valid values for the user, password, and subname, it makes sense to raise errors when they're missing. This way you control the message that's displayed to the end user.
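For comparison, Clojure is one of the languages with built-in precondition support: :pre metadata on a function throws an AssertionError when a condition fails. A sketch of the same checks wrapped in a hypothetical helper (the function name is mine, not from the example project):

```clojure
;; Clojure's built-in preconditions: the :pre vector is checked on every
;; call and throws AssertionError if any condition is false.
(defn check-config!
  "Returns :ok when all connection values are present; throws otherwise."
  [user password subname]
  {:pre [(some? user) (some? password) (some? subname)]}
  :ok)
```

The trade-off is that AssertionError messages are generic, whereas the hand-coded `when`/`throw` form lets you name the missing value.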
Define the connection specification
Defining the connection specification is pretty straightforward. It's only a Map.
;; Define the connection specification
(def hive-spec {:classname "com.amazon.hive.jdbc41.HS2Driver"
:subprotocol "hive2"
:subname subname
:user user
:password password})
I'm using the JDBC 4.1 HS2 driver with the hive2 subprotocol.
Define the SQL statement
This is pretty straightforward and simple too, though I was perplexed about why the Clojure JDBC query function requires the SQL statement in a Vector. Why not just a String?
;; Define the SQL statement to execute
(def sql ["SELECT
COUNT(*) AS row_count
FROM
my_table"]) ; TODO: Change to your desired SQL statement
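The Vector form exists to support parameterized SQL: the first element is the statement, and any remaining elements bind to ? placeholders in order. A sketch, where my_table and the year value are illustrative:

```clojure
;; Parameterized form: the statement comes first, followed by values that
;; bind to the ? placeholders in order.
(def parameterized-sql
  ["SELECT COUNT(*) AS row_count FROM my_table WHERE year >= ?" 2017])
```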
The main entry-point
Here's the main entry-point of this example, which does the following:
- Logs the SQL to execute
- Executes the SQL
- Returns the row_count value of the first row
;; The main entry-point
;; 1. Logs the SQL to execute
;; 2. Executes the SQL
;; 3. Returns the row_count value of the first row
(defn -main
[& args]
(log/info "Executing SQL:\n" (first sql))
(println (jdbc/query hive-spec sql {:row-fn :row_count :result-set-fn first})))
I'm ignoring any command-line arguments here. I log the SQL statement so I can see it, then call the query function, passing the spec, the sql, and an options Map which specifies:
- a result-set function which returns only the 1st row
- a row function which only returns the row_count column I aliased in my SQL statement
I noticed that if I use COUNT(*) without a column alias, Hive names the column something obscure like _c0. Not user-friendly at all.
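Conceptually, the two options compose like this (a rough sketch of the idea, not the library's actual internals): row-fn is mapped over each row, and result-set-fn consumes the resulting sequence.

```clojure
;; Sketch of how :row-fn and :result-set-fn combine: map row-fn over the
;; rows, then hand the sequence to result-set-fn.
(defn apply-query-options
  [rows {:keys [row-fn result-set-fn]}]
  (result-set-fn (map row-fn rows)))

;; With a pretend one-row result set:
(apply-query-options [{:row_count 42}]
                     {:row-fn :row_count :result-set-fn first})
;; => 42
```

Keywords double as functions in Clojure, which is why :row_count works directly as the row function.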
Here's the complete core.clj file:
(ns emr-hive-jdbc-example.core
(:require [clojure.tools.logging :as log])
(:require [clojure.java.jdbc :as jdbc])
(:gen-class))
;; Define the user/password/subname from ENV variables
(def user (System/getenv "EMR_HIVE_JDBC_USER"))
(def password (System/getenv "EMR_HIVE_JDBC_PASSWORD"))
(def subname (System/getenv "EMR_HIVE_JDBC_SUBNAME"))
;; Preconditions
(when (nil? user)
(throw (RuntimeException. "user is nil")))
(when (nil? password)
(throw (RuntimeException. "password is nil")))
(when (nil? subname)
(throw (RuntimeException. "subname is nil")))
;; Define the connection specification
(def hive-spec {:classname "com.amazon.hive.jdbc41.HS2Driver"
:subprotocol "hive2"
:subname subname
:user user
:password password})
;; Define the SQL statement to execute
(def sql ["SELECT
COUNT(*) AS row_count
FROM
my_table"]) ; TODO: Change to your desired SQL statement
;; The main entry-point
;; 1. Logs the SQL to execute
;; 2. Executes the SQL
;; 3. Returns the row_count value of the first row
(defn -main
[& args]
(log/info "Executing SQL:\n" (first sql))
(println (jdbc/query hive-spec sql {:row-fn :row_count :result-set-fn first})))
Running the project
While testing this project, I used lein to run it:
lein run
Later, after you've built the standalone JAR with lein uberjar, you can run it via Java:
java -jar emr-hive-jdbc-example-0.1.0-standalone.jar
Conclusion
In any case, I'm glad I got this working! I like the functional way Clojure + JDBC code turns out. Hopefully someone else finds this useful too.