Analyze Data with Hunk on Amazon EMR

In this post you will learn how to use Hunk to process data with an Amazon EMR cluster. We will go through the steps of:

Creating a Hunk EC2 instance
Creating an Amazon EMR cluster
Configuring Hunk with EMR to analyze data in an S3 bucket

Create a Hunk instance on AWS EC2

The most convenient way to create an EC2 instance with Hunk is to use the Hunk AMI directly from AWS Marketplace (https://aws.amazon.com/marketplace/pp/B00GIZK2QI). The AMI is public and free to use, although standard EC2 hourly fees apply. It includes an installed copy of Hunk, the Hunk installer package (needed later for distribution to the DataNodes), the Hadoop libraries, and Java, all on a Linux x64 base.

Build an instance with enough resources to fit your needs. I would recommend a minimum of m1.xlarge.

Proceed through the rest of the setup screens until the EC2 instance is up and running. Extra storage is not required, but feel free to add some according to your needs. Also, make sure you select and note the name of the key pair you will use to connect to the instance.

Connect to the Hunk instance

To connect to the Hunk instance you just provisioned, open a terminal and SSH to its Public DNS address.

$ ssh -i my_key.pem ec2-user@<public-dns-address>

Navigate to /opt and note the directory layout. A brief description of each directory:

/opt/hadoop contains the Hadoop libraries. For now, only vanilla Apache Hadoop 1.0.3 and 2.2.0 are located here.

/opt/java is where a modern version of Java resides. The latest installed in the AMI is Java 1.7 U45.

/opt/splunk contains the actual Hunk installation.

/opt/splunk_packages is where the Hunk install packages (.tgz) reside. The current package is splunk-6.0-184175-Linux-x86_64.tgz.

To make it easy to interact with EMR (that is, to read from HDFS/S3n and run MapReduce jobs), all of the above directories are recursively owned by the user and group hadoop.

Start Hunk

To start Hunk run the following commands from your SSH window:
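A minimal sketch of the start sequence, assuming the standard Splunk CLI lives at /opt/splunk/bin/splunk (inferred from the /opt/splunk install directory) and that Hunk should run as the hadoop user, since that user owns the directories. Both are assumptions on my part, so adjust them to your instance:

```shell
# Assumed location of the Hunk (Splunk) CLI, inferred from the /opt/splunk directory.
SPLUNK_BIN=/opt/splunk/bin/splunk
# Assumed run-as user, inferred from the hadoop ownership of the /opt directories.
HUNK_USER=hadoop

if [ -x "$SPLUNK_BIN" ]; then
    # On first run, "splunk start" walks through the license agreement
    # and then reports the port of the web interface.
    sudo -u "$HUNK_USER" "$SPLUNK_BIN" start
else
    # Guard so the script is harmless when run anywhere but the Hunk instance.
    echo "Hunk CLI not found at $SPLUNK_BIN; are you on the Hunk instance?" >&2
fi
```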

Go through the agreement steps and note the port where Hunk is running; the default is 8000.

Point your browser of choice to the Public DNS URL and log in:

http://<public-dns-address>:8000

The default credentials are admin/changeme. You will be asked to change the password, after which you will be presented with the classic Splunk 6 interface.

Create an EMR cluster

There are at least two ways to create an AWS EMR cluster: via the AWS Console or with the EMR command-line tools. Since I have the tools installed, I will launch the cluster using the latter method. To provide analytics and insights on data in Hadoop, Hunk does not need or use other applications such as Hive or Pig. Therefore, if you’re creating a cluster from the AWS Console, you can simply deselect them, as we only need an interactive EMR cluster. Enter a cluster name and select your desired logging and debugging options.

Software: Choose the Amazon Hadoop distribution with the latest AMI version, which at the time of writing is 2.4.2 (Hadoop 1.0.3).

Hardware: Select an m1.medium for Master, count=1, and m1.xlarge for Core/Slaves, count=3.

Security and Access: Make the selections appropriate for you. I chose the same key pair as for my EC2 instance above.

Bootstrap Actions and Steps: Make your own selections here. I chose not to have any.

The command-line equivalent is:

$ ./elastic-mapreduce --create --alive --name "my_hunk_emr_cluster" --ami-version latest --master-instance-type m1.medium --slave-instance-type m1.xlarge --num-instances 4 --key-pair my_key

This command creates an EMR cluster named “my_hunk_emr_cluster” from the “latest” AMI with one m1.medium master node and three m1.xlarge slave nodes, four instances in total.
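The instance count is the easiest argument to get wrong in that command, because --num-instances covers the master as well as the slaves. A minimal sketch that derives the count and prints the command for review; the variable names are my own, and the script only echoes the command rather than running it:

```shell
# Hypothetical helper variables; only the flags come from the command above.
MASTER_COUNT=1   # one m1.medium master node
CORE_COUNT=3     # three m1.xlarge core/slave nodes
NUM_INSTANCES=$((MASTER_COUNT + CORE_COUNT))

EMR_CMD="./elastic-mapreduce --create --alive --name \"my_hunk_emr_cluster\" --ami-version latest --master-instance-type m1.medium --slave-instance-type m1.xlarge --num-instances ${NUM_INSTANCES} --key-pair my_key"

# Print the command for review instead of executing it.
echo "$EMR_CMD"
```

Changing CORE_COUNT is then enough to resize the cluster definition without recomputing the total by hand.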

Configure Hunk with EMR cluster and S3n bucket

Hunk is able to work with data in both HDFS and S3. In this case we’re working with the assumption that the data resides in S3 and is accessed through s3n (the native S3 file system), although “local” HDFS performs much better.

In terms of security, make sure the Hunk instance can communicate freely with the EMR cluster nodes, both master and slaves. Modify the Security Groups on the EC2 Management page accordingly.
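If you prefer the command line to the EC2 Management page, the rules can be sketched with the AWS CLI. This is an assumption-laden example rather than the setup used in this post: the three security group IDs are hypothetical placeholders, and opening all TCP ports is simply my reading of “freely communicate”, which you may want to narrow. The script only prints the commands:

```shell
# Hypothetical placeholder security group IDs; substitute your own.
HUNK_SG="sg-hunk-placeholder"
EMR_MASTER_SG="sg-emr-master-placeholder"
EMR_SLAVE_SG="sg-emr-slave-placeholder"

# Build one ingress rule per EMR group, admitting all TCP traffic from the Hunk group.
RULES=""
for target in "$EMR_MASTER_SG" "$EMR_SLAVE_SG"; do
    RULES="${RULES}aws ec2 authorize-security-group-ingress --group-id ${target} --protocol tcp --port 0-65535 --source-group ${HUNK_SG}
"
done

# Print the commands for review; run them by hand once the IDs are real.
printf '%s' "$RULES"
```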

Let’s now configure Hunk with our freshly created EMR cluster. While logged in to Hunk, go to Settings, Virtual Indexes and click on New Provider. Enter a name of your liking. For Java Home and Hadoop Home you can use the values below. Modify the Job Tracker to correspond to your master address, and the File System and HDFS working directory to correspond to your S3 bucket:

Name: my-emr-provider

Java Home: /opt/java/latest

Hadoop Home: /opt/hadoop/apache/hadoop-1.0.3

Hadoop Version: Hadoop 1.x (MRv1)

Job Tracker: <internal master ip>:9001

File System: s3n://<AWS Access Key>:<AWS Secret>@<bucket name>

HDFS working dir: /working-dir (in my case this is a folder at the root of the bucket above)

Add a new setting, at the bottom, to tell Hunk what package to distribute to DataNodes:

vix.splunk.setup.package: /opt/splunk_packages/splunk-6.0-184175-Linux-x86_64.tgz
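For reference, the same provider can also be written as an indexes.conf stanza. The sketch below writes one to a scratch file for review. The vix.* key names follow the Hunk provider conventions as I understand them from Splunk's documentation, so verify them against your version. The Hadoop Version choice from the form is deliberately left out because I am not certain how it is stored. The placeholders in angle brackets are the same ones used above:

```shell
# Write the sketch to a scratch file; copy it into indexes.conf only after review.
PROVIDER_CONF="$(mktemp)"

cat > "$PROVIDER_CONF" <<'EOF'
[provider:my-emr-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /opt/java/latest
vix.env.HADOOP_HOME = /opt/hadoop/apache/hadoop-1.0.3
vix.mapred.job.tracker = <internal master ip>:9001
vix.fs.default.name = s3n://<AWS Access Key>:<AWS Secret>@<bucket name>
vix.splunk.home.hdfs = /working-dir
vix.splunk.setup.package = /opt/splunk_packages/splunk-6.0-184175-Linux-x86_64.tgz
EOF

# Show the result so it can be compared with the form values.
cat "$PROVIDER_CONF"
```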

Save the configuration, and proceed to create a new virtual index. In our case, we’re naming the index emr-index and configuring it to read the Apache web server logs.

Logs reside in a folder called logs in the base of our bucket and they are compressed in .tar.gz format.
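In configuration-file form, a virtual index like this one might look roughly as follows. This is a hedged sketch: the vix.input.1.* keys are the Hunk input settings as I know them, while the recursive path and the accept pattern are my own guesses based on the folder and file type described above, not values copied from the setup screen:

```shell
# Write the sketch to a scratch file for review.
INDEX_CONF="$(mktemp)"

cat > "$INDEX_CONF" <<'EOF'
[emr-index]
vix.provider = my-emr-provider
# Assumed: "..." makes Hunk descend recursively into the logs folder.
vix.input.1.path = /logs/...
# Assumed: only pick up the compressed archives.
vix.input.1.accept = \.tar\.gz$
EOF

cat "$INDEX_CONF"
```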

Click Save and return to the Search app.

In the search bar, enter “index=emr-index | head 10” and observe events streaming from our bucket.

Not many interesting fields are extracted from our events by default, so let’s add access-extractions to our source. This configuration applies field extractions for logs in the Apache access combined format. Go to Settings, Fields and create a new extraction.

Note that the named source field reads exactly /logs/access… with the ellipsis (…) indicating recursion, as described here: http://docs.splunk.com/Documentation/Splunk/latest/Admin/propsconf. Change it to fit your path, then save.
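The extraction created through the UI corresponds, as far as I know, to a props.conf stanza along these lines. The sketch writes one to a scratch file. The class name “access” after REPORT- is a hypothetical label of mine, and the source pattern assumes the same path as above, with three literal dots as the props.conf wildcard:

```shell
# Write the sketch to a scratch file for review.
PROPS_CONF="$(mktemp)"

cat > "$PROPS_CONF" <<'EOF'
[source::/logs/access...]
# "access" is an arbitrary class name; access-extractions is the transform applied.
REPORT-access = access-extractions
EOF

cat "$PROPS_CONF"
```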

Return to the search bar and run the same search again. Note the additional fields on the left-hand side of the screen.

Create a Dashboard with two panels

Let’s assume we need to see the top clients and an overall chart of traffic over time. For this we will need two searches to power the two panels in our dashboard:

(1) Top clientip search: index=emr-index | top clientip