Exploratory Data Analysis using Spark

Suman Kumar Gangopadhyay
10 min read · Oct 31, 2021

Introduction

This blog presents a step-by-step methodology for performing exploratory data analysis (EDA) using Apache Spark. The target audience is beginner and intermediate-level data engineers who are starting to get their hands dirty with PySpark. More experienced Spark users are also welcome to review the material and suggest improvements.

Please note that there are multiple ways to perform exploratory data analysis, and this blog presents just one of them. The aim is to help beginners kick-start their journey with Spark and to give intermediate-level data engineers a ready reference. The steps shown here can be used as-is, or combined to build other ways of performing EDA or to build specific features.

Pre-requisites

The data used in this blog is taken from https://www.kaggle.com/new-york-city/nyc-parking-tickets. The data set has approximately 10 million records.

The Spark distribution can be downloaded from https://spark.apache.org/downloads.html

The distribution I used to develop the code presented here is spark-3.0.3-bin-hadoop2.7.tgz.

This blog assumes that the reader has access to a Spark cluster, either locally or via AWS EMR, Databricks, or Azure HDInsight.

Last but not least, the reader should bookmark the Spark API reference: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html

All the code used in this blog is available at https://github.com/sumaniitm/complex-spark-transformations

Without further ado, let us dig right in.

Pre-processing

The first step is to pre-process the data.

We will be referring to the notebook https://github.com/sumaniitm/complex-spark-transformations/blob/main/preprocessing.ipynb
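
The notebook is not reproduced here, so the snippet below is only a minimal sketch of what a first pre-processing pass could look like. It is not the notebook's actual code. It assumes the Kaggle CSV headers contain spaces (for example "Summons Number"), which are awkward to reference in Spark SQL, so it renames them to snake_case while loading. The function names, the application name, and the choice to infer the schema are all illustrative assumptions.

```python
import re


def normalize_column_name(name: str) -> str:
    """Turn a header such as 'Summons Number' into 'summons_number'."""
    # Collapse every run of non-alphanumeric characters into one underscore,
    # then drop any leading/trailing underscore and lowercase the result.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def load_tickets(csv_path: str):
    """Read a parking-tickets CSV into a DataFrame with normalized column names.

    Hypothetical helper for illustration; not taken from the notebook.
    """
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("nyc-parking-eda")  # hypothetical application name
        .getOrCreate()
    )

    # header=True uses the first row as column names; inferSchema=True asks
    # Spark to guess column types, at the cost of an extra pass over the file.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # toDF renames all columns positionally in a single call.
    return df.toDF(*[normalize_column_name(c) for c in df.columns])
```

Calling `load_tickets` with the local path of one of the downloaded CSV files returns a DataFrame; `printSchema()` and `count()` on the result are quick ways to confirm that the load worked before moving on. With roughly 10 million records, supplying an explicit schema instead of `inferSchema=True` may be worth considering once the column types are known.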
