In my previous post, I described one of the many ways to set up your own Spark cluster (in AWS) and submit Spark jobs to that cluster from an edge node (also in AWS). However, we all know how quickly business requirements outgrow running jobs manually, and we end up on a quest to develop a data pipeline.

This post puts together a step-by-step guide to help set up a pipeline that automates running Spark jobs from an edge node on a Spark cluster, all within AWS. …

In my journey as a data engineer, I came across Spark when the big data hype was at fever pitch (it remains high today, although some of the myths have been replaced with reality). While I have developed production-grade Spark applications, I noticed a lack of information when it comes to “spinning up a Spark cluster yourself”. Of course, there are tutorials on various learning platforms that tell you how to download Spark and get going on your laptop, or how to set up an AWS environment in a very tailor-made way, but none…

