Exploratory Data Analysis using Spark

Suman Kumar Gangopadhyay · Published in Analytics Vidhya · 10 min read · Oct 31, 2021


Introduction

This blog presents a step-by-step methodology for performing exploratory data analysis (EDA) using Apache Spark. The target audience is beginners and intermediate-level data engineers who are starting to get their hands dirty with PySpark. However, more experienced Spark users are also welcome to review the material and suggest improvements.

Please note that there are many ways to perform exploratory data analysis, and this blog presents just one of them. Its aim is to help beginners kick-start their Spark journey and to serve as a ready reference for intermediate-level data engineers. The information provided here can be used as-is, or combined to build other ways of performing EDA or to build specific features.

Pre-requisites

The data used in this blog is taken from https://www.kaggle.com/new-york-city/nyc-parking-tickets. The data set has approximately 10 million records.

The spark distribution is downloaded from https://spark.apache.org/downloads.html

The distribution I used for developing the code presented here is spark-3.0.3-bin-hadoop2.7.tgz

This blog assumes that the reader has access to a Spark cluster, either locally or via AWS EMR, Databricks, or Azure HDInsight.

Last but not least, the reader should bookmark the Spark API reference https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html

All the code used in this blog is available at https://github.com/sumaniitm/complex-spark-transformations

Without further ado, let us dig right in

Pre-processing

First and foremost is the pre-processing that we want to do on the data

We will be referring to the notebook https://github.com/sumaniitm/complex-spark-transformations/blob/main/preprocessing.ipynb

As a first step, we will keep a few prerequisites in a common place so that we can always come back and change them if necessary, e.g. the path where the data files are kept (both input and output) and the names of the various explanatory and response variables (to know what these variables mean, check out https://www.statisticshowto.com/probability-and-statistics/types-of-variables/explanatory-variable/). These are declared in a simple Python file: https://github.com/sumaniitm/complex-spark-transformations/blob/main/config.py

Configuration declared in the config.py file
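The exact contents of config.py are in the repository; as a rough illustration (the paths and variable names below are placeholders, not necessarily the author's actual ones), it could look something like this:

```python
# config.py -- illustrative sketch only; see the repository for the real file

# Paths where the raw input and the pre-processed output are kept (placeholders)
INPUT_PATH = "/path/to/nyc-parking-tickets/"
OUTPUT_PATH = "/path/to/output/preprocessed/"

# Response variables: the attributes whose behaviour we want to study
RESPONSE_VARIABLES = ["Law_Section", "Violation_County", "Registration_State", "Plate_Type"]

# Explanatory variables: the attributes we slice the responses by
EXPLANATORY_VARIABLES = ["Issue_Date", "Violation_Time"]
```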

As you can see, I have listed the data attributes as response and explanatory variables. The aim here is to study which values of the response variables are more common with respect to the explanatory variables. To make it clearer, let's ask questions such as: which Law_Section is violated most in a month, and which Plate_Type of vehicle violates more in a given week? The week and month in these questions will obviously come from Issue_Date and Violation_Time.

Coming back to the world of engineering from the world of statistics, the next step is to start a Spark session, make the config file available within the session, and then use the configurations from the config file to read in the data. We do this via the following:

Initial declarations and reading in file data
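The notebook screenshot covers this step; a minimal sketch of the same idea, assuming the config module sketched above is importable, would be:

```python
from pyspark.sql import SparkSession

import config  # the config.py sketched earlier; paths and names are placeholders

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("nyc-parking-eda").getOrCreate()

# Read the Kaggle CSV; the header row supplies column names, inferSchema guesses types
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(config.INPUT_PATH))
```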

Now that we have the data in a PySpark dataframe, we notice that there are spaces in the column names. This is going to be inconvenient later on, so to streamline our EDA, we replace the spaces with underscores, like below:

Removal of spaces in column names
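A common way to do this (a sketch, assuming df is the dataframe read above) is to rebuild the dataframe with cleaned-up column names:

```python
# Replace spaces in column names with underscores, e.g. "Plate Type" -> "Plate_Type"
df = df.toDF(*[c.replace(" ", "_") for c in df.columns])
```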

To start off the pre-processing, we first check how many unique values of the response variables exist in the dataframe; in other words, we want a sense of their cardinality. So we proceed with the following:

Distinct values of response variables
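One way to get these counts in a single pass (a sketch; the list of response variables follows the config assumption above) is:

```python
from pyspark.sql.functions import countDistinct

# Cardinality of each response variable, all computed in one job
response_vars = ["Law_Section", "Violation_County", "Registration_State", "Plate_Type"]
df.select([countDistinct(c).alias(c) for c in response_vars]).show()
```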

As you can see, some of the response variables have a significantly large number of distinct values whereas others have far fewer, e.g. Law_Section and Violation_County have few enough distinct values (8 and 12 respectively) to be visualised without a chart/plot. Hence, for the sake of simplicity, we will pick these two for further EDA.

Moving on, we will focus on the explanatory variables. As a first check on the quality of the chosen variables, we will find out how many Nulls or NaNs exist in the data:

Nulls/NaNs check in explanatory variables
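A sketch of such a check (using a plain null test, since the chosen explanatory variables are string columns where a NaN test does not apply) could be:

```python
from pyspark.sql.functions import col, count, when

# Count missing values per explanatory variable
explanatory_vars = ["Issue_Date", "Violation_Time"]
df.select([
    count(when(col(c).isNull(), c)).alias(c)
    for c in explanatory_vars
]).show()
```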

This is good; our chosen explanatory variables do not suffer from a high occurrence of Nulls or NaNs.

Looking at the Violation_Time explanatory variable, we see an opportunity to create another explanatory variable that can add a new dimension to our EDA, so we create it right now instead of waiting for the feature or transformation building phase.

New categorical explanatory variable AM/PM
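A minimal sketch of deriving such a flag, assuming the raw Violation_Time values look like "0752A" / "0143P" with a trailing A/P marker:

```python
from pyspark.sql.functions import col, substring, upper, when

# Derive an AM/PM flag from the trailing letter of Violation_Time
df = df.withColumn(
    "Violation_AM_PM",
    when(upper(substring(col("Violation_Time"), -1, 1)) == "A", "AM")
    .when(upper(substring(col("Violation_Time"), -1, 1)) == "P", "PM")
    .otherwise(None)
)
```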

This new explanatory variable gives a sense of the time of day when violations occur most: in the early hours, the late hours, or the middle of the day.

The next concern is the format of the dates in the Issue_Date column: it is currently MM/dd/yyyy and needs to be standardised to yyyy-MM-dd. So we perform the following:

Date format standardisation
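A sketch of the conversion with to_date (Spark renders DateType values as yyyy-MM-dd):

```python
from pyspark.sql.functions import col, to_date

# Parse the MM/dd/yyyy strings into a proper DateType column
df = df.withColumn("Issue_Date", to_date(col("Issue_Date"), "MM/dd/yyyy"))
```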

Note that the Issue_Date column has a large number of distinct values and is hence cumbersome to deal with in its current form (without the help of plotting). So we derive a few categorical explanatory variables from it, which have much lower cardinality. We will extract the year, month, day of week and day of month as shown below:

Categorical variables from Issue_Date
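A sketch of these derivations (Issue_Year and Issue_DayofWeek are names used later in the blog; the other two names are illustrative):

```python
from pyspark.sql.functions import col, dayofmonth, dayofweek, month, year

# Lower-cardinality categorical variables derived from Issue_Date
df = (df
      .withColumn("Issue_Year", year(col("Issue_Date")))
      .withColumn("Issue_Month", month(col("Issue_Date")))
      .withColumn("Issue_DayofWeek", dayofweek(col("Issue_Date")))
      .withColumn("Issue_DayofMonth", dayofmonth(col("Issue_Date"))))
```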

We also explore a few more columns of the dataframe to see whether they can qualify as response variables:

search for other response variables

significantly high Nulls/NaNs, hence rejected

search for other response variables, contd.

Apart from Feet_From_Curb the other two can be rejected

final search for more response variables

All three of these can be rejected

In order to reduce the size of our dataframe, let’s drop these columns which are of no apparent use to us.

drop unwanted columns
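Dropping is a one-liner; the exact list of rejected columns is in the notebook, so the names below are purely placeholders:

```python
# Drop the columns rejected above (placeholder names; see the notebook for the real list)
cols_to_drop = ["Violation_Legal_Code", "Violation_Post_Code", "Meter_Number"]
df = df.drop(*cols_to_drop)
```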

The next check tells us whether our data is within the expected time horizon. Remember that we have chosen the 2017 data from the NYC parking tickets dataset on Kaggle, so the Issue Dates are expected to fall within 2017. Let's see if that is indeed the case and, if not, get it corrected.

get rid of bad data
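A sketch of this check and fix, assuming Issue_Date has already been converted to a date type as above:

```python
from pyspark.sql.functions import col, year

# How many records fall outside 2017?
df.filter(year(col("Issue_Date")) != 2017).count()

# Keep only the 2017 records
df = df.filter(year(col("Issue_Date")) == 2017)
```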

As you can see, there are records with future issue dates, which doesn't really make sense, so we pare the data down to the year 2017 only. This makes the categorical explanatory variable Issue_Year (created earlier) redundant, but that is a trade-off we are willing to make. We will make up for the lost variable by deriving another one from Violation_Time:

new explanatory variable replacing Issue_Year
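The notebook has the exact derivation; one plausible sketch (an assumption, not necessarily the author's choice) is to pull the hour out of Violation_Time:

```python
from pyspark.sql.functions import col, substring

# Hour of day from Violation_Time, e.g. "0752A" -> "07" (illustrative column name)
df = df.withColumn("Violation_Hour", substring(col("Violation_Time"), 1, 2))
```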

The final record count stands at approximately 5 million

record count after restriction to within 2017

Finally, we finish pre-processing by persisting this dataframe as a CSV; this will be our dataset for further EDA.

write the final dataframe in disk
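A sketch of the write, using the placeholder output path from the config sketch:

```python
import config  # placeholder OUTPUT_PATH from the config sketch earlier

# Persist the pre-processed dataframe as CSV for the transformations notebook
(df.write
   .option("header", True)
   .mode("overwrite")
   .csv(config.OUTPUT_PATH))
```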

Transformations

In the below discussion we will refer to the notebook https://github.com/sumaniitm/complex-spark-transformations/blob/main/transformations.ipynb

We start off by reading the pre-processed dataset that we wrote to disk above and look for seasonality, i.e. whether violations are more frequent in particular months (remember we are dealing with the year 2017 only).

investigating seasonality
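A sketch of the seasonality check, assuming the Issue_Month column derived during pre-processing:

```python
# Violation counts per month of 2017
(df.groupBy("Issue_Month")
   .count()
   .orderBy("Issue_Month")
   .show())
```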

As we can see from the above, violations are more common in the first half of the year. This gives us an opportunity to trim our dataframe further, down to the first six months of 2017. Remember, this could be an error in the source data itself, but we have no way to verify that within the current scope of discussion.

data pared down to 1st 6 months of 2017

Next, we standardise/normalise the violation counts within a month. To understand the difference between standardisation and normalisation, check out https://medium.com/swlh/difference-between-standardization-normalization-99be0320c1b1

steps to arrive at standardised/normalised numbers
Standardised/Normalised violation count
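A sketch of the idea, assuming we are scaling the per-day-of-month violation counts (column names follow the earlier pre-processing sketch):

```python
from pyspark.sql import functions as F

# Violation counts per day of the month
daily = df.groupBy("Issue_DayofMonth").agg(F.count("*").alias("violations"))

# Overall statistics needed for standardisation ((x - mean) / stddev)
# and min-max normalisation ((x - min) / (max - min))
stats = daily.agg(
    F.avg("violations").alias("mean"),
    F.stddev("violations").alias("std"),
    F.min("violations").alias("min"),
    F.max("violations").alias("max"),
)

# Attach the single-row stats to every day and compute both scaled versions
scaled = (daily.crossJoin(stats)
          .withColumn("standardised", (F.col("violations") - F.col("mean")) / F.col("std"))
          .withColumn("normalised",
                      (F.col("violations") - F.col("min")) / (F.col("max") - F.col("min")))
          .orderBy("Issue_DayofMonth"))

scaled.show(31)
```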

Observe from the standardised numbers (the violation counts sit within 1, 2 or 3 units above or below the mean) that violations seem more common in the early and later days of the month, with a slight dip in the middle. Now that we have seen a monthly trend, let's narrow down to within a week.

Looking at a week

As you might expect, violations are more common during weekdays (the violation counts sit within 1 or 2 units above or below the mean).

Until now we were looking at only one response variable at a time; let's switch gears and observe a combination of response variables.

pivot with multiple response variables
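A sketch of what such a pivot can look like for this combination of response variables (the exact grouping in the notebook may differ):

```python
# Counts by Registration_State and Violation_County, with Law_Section values as columns
pivoted = (df.groupBy("Registration_State", "Violation_County")
             .pivot("Law_Section")
             .count())
pivoted.show()
```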

Pay close attention to the usage of the pivot function; it is a powerful tool in the Spark arsenal and can be used in a variety of ways. However, this view is not very useful for determining violation trends for this combination of response variables, so let us try something different.

Top 10 combinations

There is a clear indication that vehicles registered in NY are the most common violators; amongst them, violations are most common in NY county, and section 408 is the most commonly violated law. However, this is quite obvious since the whole dataset is from NYC. Let us remove NY county and NY as the registration state and see which combinations make the top 10.

filter out NY

Now, NJ-registered vehicles come out on top, with K county at the receiving end of the most violations of law section 408.

So far so good, but combinations of response variables pose a challenge to visual inspection (we are not using any plots, to keep ourselves purely within Spark), so we go back to studying single response variables.

One thing that sticks out is Issue_DayofWeek: it is currently stored as numerals, which can pose a challenge later on, so we prepend the string Day_ to the data in this column.

appending string to the column entries
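A sketch of the prefixing step (the cast is needed because the day of week was derived as an integer):

```python
from pyspark.sql.functions import col, concat, lit

# Prefix the numeric day of week with "Day_", e.g. 2 -> "Day_2"
df = df.withColumn("Issue_DayofWeek",
                   concat(lit("Day_"), col("Issue_DayofWeek").cast("string")))
```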

Now we focus our attention on one response variable at a time and see how it is distributed throughout the week.

Law Section distribution across week

Notice the how='any' in the call to the dropna function; this tells Spark to drop rows that have nulls in any one of the columns. This restricts our observations to those Law Sections that are violated throughout the week. As you can see, 408 is the most violated law section, and it is violated all through the week.
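A sketch of this transformation, with the days of the week pivoted into columns and dropna(how='any') applied afterwards:

```python
# Law_Section counts per weekday; dropna(how="any") keeps only the sections
# that have violations on every day of the week
law_by_day = (df.groupBy("Law_Section")
                .pivot("Issue_DayofWeek")
                .count()
                .dropna(how="any"))
law_by_day.show()
```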

Distribution of violation county in a week

Using a similar transformation to the one used for Law Section, we observe that K county registers the highest number of violations throughout the week. Remember that we have filtered NY out of this dataset; otherwise, NY county would have come out on top as it did before.

Finally, we look at the registration state. Remember the high cardinality of this variable: we will have to order by the violation counts across all the weekdays and then look at the top 10 data points. Let us go ahead and do it:

top 10 registration states

As seen before (while working with the combination of multiple response variables), vehicles registered in NJ are the most common violators throughout the week. Pay close attention to the variable colmToOrderBy: it saves us from writing out every day as a column to order the dataframe by. It is essential to learn these kinds of shorthand techniques to make your code more modular and readable and to avoid hard-coding as much as possible.
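A sketch of the ordering trick, building the list of weekday columns once instead of hard-coding each day (mirroring the colmToOrderBy idea; exact names may differ from the notebook):

```python
from pyspark.sql.functions import col

# Pivot weekdays into columns for each registration state
state_by_day = df.groupBy("Registration_State").pivot("Issue_DayofWeek").count()

# Collect the weekday columns once, rather than hard-coding "Day_1", "Day_2", ...
colmToOrderBy = [c for c in state_by_day.columns if c != "Registration_State"]

# Order by all weekday counts (descending) and keep the top 10 states
(state_by_day
   .orderBy(*[col(c).desc() for c in colmToOrderBy])
   .show(10))
```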

Final outcome of EDA

The overall conclusion based on the 2017 data is as follows: violations are most common in the first half of the year and occur more frequently at the beginning or end of a month. Weekdays are more prone to violations. Vehicles registered in NY and NJ are the most frequent violators, and these violations are observed most in NY and K counties.

Conclusion

If you have made it this far, thank you for spending your time, and I hope this has been valuable. Please share your views in the comments or by clapping (if you think it was worth your while).

Suman Kumar Gangopadhyay is a data engineer who loves to foray into uncharted domains of engineering.