How to start using Polars & DuckDB together for data analysis
Pandas has been around for a very long time and will be with us for the foreseeable future. However, that shouldn’t stop us engineers from trying out a new library, or a combination of new libraries, for our day-to-day development work.
This blog aims to demonstrate how Polars can be used with DuckDB to perform the kind of data transformations that are very often done using Pandas. Plenty of comparisons between Pandas and Polars have already been done and documented, so this blog will shy away from all comparisons.
As always, this blog is mostly beginner-friendly. However, to keep it short, I will not start all the way from installing Pandas and Polars. Both of these can be pip-installed.
I will use a dataset from Kaggle to demonstrate the various data transformation and exploration activities. You can explore the details of the dataset by visiting the URL tagged above.
From the rich API that Polars offers, we will focus on the following functions, since they are the starting point of any data analysis: getting the data loaded.
read_csv
scan_csv
read_csv_batched
Let’s first focus on read_csv, since it is very similar to its Pandas counterpart.
import polars as pl

# Read the CSV eagerly, then keep only the rows for the DMS_MaP experiment
train_data_dms_map = pl.read_csv(
    "/kaggle/input/stanford-ribonanza-rna-folding/train_data_QUICK_START.csv",
    low_memory=True,
).filter(pl.col("experiment_type") == "DMS_MaP")
A few things are going on here:
- Read the CSV file from a specific location (no surprises there).
- low_memory=True: as per the Polars documentation, this option reduces memory pressure at the expense of performance. I have not verified this, so please experiment with this and let me know.
- filter(pl.col('experiment_type')=='DMS_MaP'): as you might have guessed, this option filters the dataset on a specific value of the experiment_type column.
Once the required data is available in memory, let’s try to perform another very common activity: applying a specific transformation logic to the values of a particular column of the dataframe.
The sequence column of this dataset has RNA sequences of the form
GGGAACGACUCGAGUAGAGUCGAAAAAGAUCGCCACGCACUUACGAGUGCGUGGCGAUCACGCGUGUUGCAGCGCGUCUAACAUAGGCAGGGCAACCUGC…