Commissioning an EMR Spark cluster in AWS and accessing it via an Edge Node

In my journey as a data engineer, I came across Spark when the big data hype was at its fever pitch (it remains high today, although some of the myths have been replaced with reality). While I have developed production-grade Spark applications, I noticed that there is a lack of information when it comes to “spinning up a Spark cluster yourself”. Of course there are tutorials on various learning platforms which tell you how to download Spark and get going on your laptop, or how to set up an AWS environment in a very tailor-made way, but none of them (disclaimer: there may be some which I haven’t come across, I am not Gandalf the Great of course) charts a step-by-step course which sets you up with a working Spark cluster and then an edge node to submit jobs to it. You can of course set up a Spark cluster and run your job from the master node, but that’s not the preferred way if you want to schedule your workflows or make your setup as close to production-ready as you possibly can.

This step-by-step guide is an attempt to bring together, in one place, all the actions you need to perform to set up your EMR Spark cluster in AWS (I intend to do the same with Azure and GCP, but more on that in another story), then create your edge node, submit jobs from the edge node and track them within the cluster.

The target audience is absolute beginners and intermediate Spark users; however, experts are welcome to help me fine-tune or course-correct this article.

I will make the following assumptions

  1. The reader is familiar with basic Spark and basic Python
  2. The reader has a fair idea about the AWS Console and EC2 instances
  3. The reader knows how to set up an IAM role and has an EC2 key pair. If you want to know how to create an EC2 key pair, please refer to the AWS documentation
  4. The reader has a basic idea about AWS S3
  5. The reader has an AWS account or has access to one (be aware that this setup will cost you, but the amount should be small enough that you won’t curse me at the end of the month)

All right, enough preamble, let’s dive straight in

Log into AWS Console and from services, select EMR

Click on “Create Cluster” button

Click on “Go to advanced options” (you can keep the Logging location in S3 to default or use any other S3 location that you prefer). Don’t worry about the cluster name now, you will get to name it later.

On this page, you can choose the EMR release (I didn’t want to be extra adventurous and stayed one step below the highest available version, which is 6.2.0, but feel free to choose that) and the additional software (of course you will need Spark). There is no specific reason for choosing so many associated applications; feel free to choose only Spark.

Leave the rest in this page as-is and click “Next”

You will arrive at the hardware configuration page, where you can leave the initial few config as-is and focus on the Cluster Nodes & Instances details.

Choose spot instances to reduce your cost (remember, you are building a POC, not a production application right now)

Click on “Next”

On the next page, add a name to your cluster and uncheck “Termination Protection”. Leaving it unchecked means nothing will block the cluster from being terminated once you are done with your POC. A useful point to note is that although your cluster might get terminated (even during your work, since it’s based on spot instances which AWS can reclaim, but the chances of sudden termination are slim), you can still clone a terminated cluster, which is great since you won’t have to remember the steps you took earlier. In fact, I terminated, cloned and created my cluster 3 times before attempting this article

On clicking “Next”, AWS takes you to the final step of Security configurations. Add your EC2 key pair here (this will be used to SSH into the master node if you want to). Leave the rest as-is

Finally, hit “Create Cluster” and wait for the Cluster to be provisioned. When I say wait, I really mean it, be patient, we all know Rome wasn’t built in a day
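If you would rather script the console steps above than click through them, roughly the same cluster can be requested with boto3. Treat this as a minimal sketch and not the method I used for this article: the instance type, the node count, the cluster name, the log bucket and the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole, which must already exist in your account) are all assumptions that you should replace with your own values.

```python
def build_cluster_request(name, release_label, key_pair, log_uri,
                          instance_type="m5.xlarge", core_count=2):
    """Build keyword arguments for the EMR run_job_flow call.

    instance_type and core_count are illustrative defaults, not values
    taken from the article.
    """
    def group(role, count):
        # One instance group per role, all on spot pricing to keep the POC cheap.
        return {
            "Name": f"{role.title()} nodes",
            "InstanceRole": role,
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [group("MASTER", 1), group("CORE", core_count)],
            "Ec2KeyName": key_pair,
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False,
        },
        # Assumed default roles; they must exist before the call is made.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def create_cluster(request, region):
    """Submit the request and block until the cluster is running."""
    import boto3  # imported here so the builder above works without boto3

    emr = boto3.client("emr", region_name=region)
    cluster_id = emr.run_job_flow(**request)["JobFlowId"]
    emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    return cluster_id
```

A call such as create_cluster(build_cluster_request("my-first-spark-cluster", "emr-6.2.0", "my-key-pair", "s3://my-log-bucket/emr-logs/"), "us-east-1") would then stand in for the wizard; every name in that call is hypothetical.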

You should be able to see the below screen once your cluster is running

Before we start setting up the Edge node, there is some tidying up to do. First and foremost is that we name our Master and Core (Task) Nodes appropriately so that we can easily identify them in the EC2 Dashboard

Right now, you will be looking at this

To properly identify the master and workers (or core nodes in AWS terms), let’s go back to our EMR Cluster and click on the “Hardware” tab. Then click on the ID of the master node, which gives you the EC2 instance ID of the master node

Go back into the EC2 console and edit the name of this instance appropriately

Let’s name the rest as Worker 1 and 2
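The renaming can also be done in one go with boto3, which is handy if you clone the cluster often. The sketch below is an assumption-laden alternative to the console clicks: it asks EMR which running EC2 instances belong to the master and core groups, then sets their Name tags. The tag value "Master" is my own choice for illustration; "Worker 1", "Worker 2" follow the naming used above.

```python
def name_tags(master_ids, core_ids):
    """Map EC2 instance IDs to the Name tags used in this walkthrough."""
    names = {instance_id: "Master" for instance_id in master_ids}
    # Sort so that the numbering is stable between runs.
    for number, instance_id in enumerate(sorted(core_ids), start=1):
        names[instance_id] = f"Worker {number}"
    return names


def tag_cluster_nodes(cluster_id, region):
    """Look up the cluster's running EC2 instances and set their Name tags."""
    import boto3  # imported here so name_tags() works without boto3

    emr = boto3.client("emr", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    def instance_ids(group_type):
        response = emr.list_instances(
            ClusterId=cluster_id,
            InstanceGroupTypes=[group_type],
            InstanceStates=["RUNNING"],
        )
        return [item["Ec2InstanceId"] for item in response["Instances"]]

    names = name_tags(instance_ids("MASTER"), instance_ids("CORE"))
    for instance_id, name in names.items():
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[{"Key": "Name", "Value": name}],
        )
    return names
```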

Now we will have to enable the inbound ssh traffic to the master node

To do that, select the master node (now you understand why renaming them was important) and in the pane below, go to Security tab and scroll down till you see the security groups

Click on the security group, then once you are in the security group’s page, click on “Edit Inbound Rules”

Add SSH as type and My IP as source. AWS will detect your IP address

Click on “Save Rules” and go back to the EC2 dashboard
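For reference, the rule you just saved corresponds to a single inbound permission: TCP port 22 from your own address as a /32 range. A hedged boto3 sketch of the same step is shown here; the security group ID and the IP address are values you have to supply yourself, and the sketch only handles IPv4.

```python
import ipaddress


def ssh_ingress_rule(my_ip):
    """Build an IpPermissions entry allowing SSH (TCP 22) from one address."""
    address = ipaddress.ip_address(my_ip)  # raises ValueError on bad input
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [
            {"CidrIp": f"{address}/32", "Description": "SSH from my IP"}
        ],
    }


def allow_ssh(group_id, my_ip, region):
    """Add the SSH rule to the given security group."""
    import boto3  # imported here so the rule builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[ssh_ingress_rule(my_ip)],
    )
```

Restricting the source to a /32 is what “My IP” does in the console; opening port 22 to 0.0.0.0/0 would also work but exposes the master node to everyone.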

Now we arrive at a crucial juncture of our exercise where we need to create an image from the Master node. This image will be used to spin up another EC2 instance which will act as our Edge node.

To create an image from the master node, first select the master node from the EC2 dashboard and navigate to “Create Image” from the “Actions” dropdown

Give a meaningful name to the image (AMI in AWS terms) and remember

  1. select the checkbox against “Enable” under “No Reboot”
  2. Increase the size of the 2nd volume to 20 GiB (it’s 10 by default). This is to ensure that the Spark jobs you submit can run without disk space issues. I tried with 10 and after one run, it gave me a “no space left on device” error. At that point I had to increase the size and then extend the drive manually. The steps for that are not within the scope of this article

Hit “Create Image” and then select “AMI” under “Images” from left hand side of the AWS Console. The newly created image will be seen as pending. Once again, your patience will be tested here. Wait till the image is “available”
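The image step can be scripted as well. The sketch below is an approximation of the console actions above under a few assumptions: the device name of the second volume is passed in by you (look it up in the “Create Image” dialog, I am not guessing it here), and I am assuming that overriding the size in the block device mapping behaves the same way as editing it in the console.

```python
def build_image_request(instance_id, image_name, data_device, size_gib=20):
    """Build keyword arguments for the EC2 create_image call.

    data_device is the device name of the second volume as shown in the
    console; it is a required input because it varies between instances.
    """
    return {
        "InstanceId": instance_id,
        "Name": image_name,
        # Same as ticking "Enable" under "No Reboot": the master keeps running.
        "NoReboot": True,
        "BlockDeviceMappings": [
            {"DeviceName": data_device, "Ebs": {"VolumeSize": size_gib}}
        ],
    }


def create_edge_image(request, region):
    """Create the AMI and block until its state is 'available'."""
    import boto3  # imported here so the builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    image_id = ec2.create_image(**request)["ImageId"]
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])
    return image_id
```

The waiter does the patient part for you: it polls until the image leaves the pending state.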

Once the image becomes available, select it and hit “Launch”

In the next screen, start by choosing an Instance Type. Do not choose the free tier as it won’t be enough to run spark jobs

Click on “Next: Configure Instance Details” and keep the defaults in the next screen. Then move on to “Add Storage”, no change needed there as well

Continue on until you arrive at the “Configure Security Group” step, where you should choose “Select an existing security group” and pick the one that belongs to the EMR master

Finally, hit “Review and Launch”. At this point AWS will warn you that things are not free, but you already know that

So go ahead and “Launch”

I have chosen an existing key pair but you can create a new one at this stage if you do not have one. However, I strongly recommend that you create it before diving into this article and download the .pem file in a location from where you want to connect to the Edge Node

Hit “Launch Instances”

Go back to the EC2 dashboard and you will see that the instance is now running. Once again, give it a proper name

Once again repeat the steps to enable the inbound ssh traffic (as done above for the master node) to the edge node
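Here is how the launch wizard above might look as a boto3 call, in case you want to recreate the edge node without clicking. It is only a sketch: the instance type is left for you to choose (anything larger than the free tier), the name "Edge Node" is my own label, and the request assumes the instance lands in your default VPC. If your cluster lives in another VPC you would also need to pass a SubnetId.

```python
def build_edge_node_request(image_id, instance_type, key_pair,
                            security_group_id, name="Edge Node"):
    """Build keyword arguments for the EC2 run_instances call."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "KeyName": key_pair,
        # Reuse the EMR master's security group, as selected in the wizard.
        "SecurityGroupIds": [security_group_id],
        "MinCount": 1,
        "MaxCount": 1,
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "Name", "Value": name}],
            }
        ],
    }


def launch_edge_node(request, region):
    """Launch the instance and block until it is running."""
    import boto3  # imported here so the builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    instance_id = ec2.run_instances(**request)["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return instance_id
```

Tagging the instance at launch saves the separate renaming step in the EC2 dashboard.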

Now, we are all set, time to test out our setup

With the Edge Node EC2 instance selected, click on “Connect”

Copy the example command and run it from a terminal, remembering to replace the “root” user with “ec2-user”. Even if you forget this, AWS will remind you about the replacement when you try to connect the first time. Remember to run the command from a location where the .pem key is accessible, or modify the path in the command appropriately

If you are looking at the above screen, then voila, you have your edge node ready to run and connect to the EMR Spark Cluster

Now is the time to actually submit a spark script, but before that we need to take care of a small step so that we don’t run into permission issues while our job is being submitted to connect to the worker nodes

export HADOOP_USER_NAME=hdfs

Run the above at the terminal in your edge node

Next, we need some code to build our first Spark script in PySpark

Feel free to use the following snippet (but remember, I have an existing S3 bucket and a freely available data in there, so replace it with something similar)

from pyspark.sql import SparkSession

# A SparkSession already wraps a SparkContext, so no separate SparkConf/SparkContext is needed
spark = SparkSession.builder.appName('FirstSparkApp').master('yarn').getOrCreate()

# Read the CSV from S3, treating the first row as column names
my_first_dframe = spark.read.option("header", True).csv("s3://airflow-2.0-exploration/netflix_titles.csv")

# Keep only the titles whose country is Brazil
my_first_dframe_filtered = my_first_dframe.filter(my_first_dframe['country'] == 'Brazil')

# Spark writes a directory of part files at this path, not a single CSV file
my_first_dframe_filtered.write.option("header", True).csv("s3://airflow-2.0-exploration/netflix_titles_from_brazil.csv")
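Since you will be swapping in your own bucket and data, one option is to pass the paths as arguments instead of editing the script each time. The variant below is a sketch of that idea and not the script I ran: the --input, --output and --country options are names I made up for illustration.

```python
import argparse


def parse_args(argv=None):
    """Read the S3 paths and the country filter from the command line."""
    parser = argparse.ArgumentParser(description="Filter a CSV by country.")
    parser.add_argument("--input", required=True,
                        help="CSV to read, for example an s3:// path")
    parser.add_argument("--output", required=True,
                        help="directory to write the filtered CSV files to")
    parser.add_argument("--country", default="Brazil")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported here so argument parsing can be checked without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FirstSparkApp").getOrCreate()
    titles = spark.read.option("header", True).csv(args.input)
    filtered = titles.filter(titles["country"] == args.country)
    filtered.write.option("header", True).csv(args.output)
    spark.stop()


# When saving this as the script you submit, end the file with a call to main().
```

Arguments placed after the script name on the spark-submit command line are handed to the script, so the paths would go there.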

Once done, you are ready to fire your spark submit, so go ahead and run the following or an equivalent command

spark-submit --driver-memory 2g --executor-memory 2g --executor-cores 2 --num-executors 2 --deploy-mode cluster yourScriptName.py
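If you plan to schedule the job later, it can help to wrap the submit command in a small Python helper that also sets HADOOP_USER_NAME, so that neither step is forgotten. This is a sketch under the assumption that it runs on the edge node, where spark-submit is on the PATH; the function names are my own.

```python
import os
import shlex
import subprocess


def spark_submit_command(script, driver_memory="2g", executor_memory="2g",
                         executor_cores=2, num_executors=2):
    """Return the spark-submit command as an argument list."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
        "--deploy-mode", "cluster",
        script,
    ]


def submit(script, hadoop_user="hdfs"):
    """Run spark-submit with HADOOP_USER_NAME set for this process only."""
    env = dict(os.environ, HADOOP_USER_NAME=hadoop_user)
    command = spark_submit_command(script)
    print("Running:", shlex.join(command))
    # check=True raises CalledProcessError if spark-submit exits non-zero.
    return subprocess.run(command, env=env, check=True)
```

Passing the command as a list avoids shell quoting problems, and setting the variable through env keeps it out of your login shell.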

Once this is fired and if you have followed this article closely, you should be able to see the following

Congratulations, you now have a working Edge node which can communicate with the EMR Spark Cluster

To track the application’s progress, go to the “Application user interfaces” tab of the Spark cluster and scroll all the way down to the application history, where you can see the application you just launched
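The same information is also available from the YARN ResourceManager REST API, which can be convenient when you want to check a job from a script. The sketch below assumes that the ResourceManager listens on its default port 8088 on the master node and that the machine you run this from (the edge node, for instance) is allowed to reach that port; neither is something I have configured explicitly in this article.

```python
import json
from urllib.request import urlopen


def apps_url(master_host, port=8088):
    """URL of the YARN ResourceManager endpoint that lists applications."""
    return f"http://{master_host}:{port}/ws/v1/cluster/apps"


def summarize_apps(payload, name=None):
    """Reduce the REST response to (id, name, state, finalStatus) tuples."""
    # YARN returns {"apps": null} when no application has been submitted yet.
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        (app["id"], app["name"], app["state"], app["finalStatus"])
        for app in apps
        if name is None or app["name"] == name
    ]


def fetch_apps(master_host, name=None):
    """Query the ResourceManager and return the summarized applications."""
    with urlopen(apps_url(master_host), timeout=10) as response:
        return summarize_apps(json.load(response), name)
```

Calling fetch_apps with the master node’s private DNS name and name="FirstSparkApp" should list the runs of the script from this article.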

A little bit of investigation will reveal how to track down the log. I will leave that bit to you so that you can familiarise yourself more with the cluster.

Have fun running Spark jobs and do not forget to terminate the cluster once you are done.
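Keep in mind that the edge node is an ordinary EC2 instance and not part of the cluster, so terminating the cluster leaves it running. A small cleanup helper, sketched here with boto3 clients passed in as arguments, could take care of both; the function name and its parameters are my own invention.

```python
def terminate_poc(emr_client, ec2_client, cluster_id, edge_instance_id=None):
    """Terminate the EMR cluster and, optionally, the edge node instance."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    if edge_instance_id is not None:
        ec2_client.terminate_instances(InstanceIds=[edge_instance_id])
```

You would call it as terminate_poc(boto3.client("emr"), boto3.client("ec2"), cluster_id, edge_instance_id) with your own IDs. The AMI you created and its snapshots are not touched by this and, as far as I know, continue to be billed for storage until you deregister and delete them.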
