How to start using Polars & DuckDB together for data analysis

Suman Kumar Gangopadhyay
7 min read · Mar 18, 2024

Pandas has been around for a very long time and will remain so for the foreseeable future. However, that shouldn’t stop us engineers from trying out a new library, or a combination of new libraries, for our day-to-day development work.

This blog aims to demonstrate how Polars can be used with DuckDB to perform the kinds of data transformations that are very often done with Pandas. Plenty of comparative studies between Pandas and Polars have already been done and documented, so this blog will shy away from all comparisons.

As always, this blog will be mostly beginner-friendly. However, to keep it short, I will not start all the way from installing Pandas and Polars; both can be pip-installed.

I will use a dataset from Kaggle to demonstrate the various data transformation and exploration activities. You can explore the details of the dataset by visiting the URL tagged above.

From the rich API offered by Polars, we will focus on the following functions, since they are the entry points for reading data and therefore for starting any analysis.

read_csv
scan_csv
read_csv_batched

Let’s first focus on read_csv, since it is very similar to its Pandas counterpart.

import polars as pl

# Eagerly read the CSV, then keep only the rows from the DMS_MaP experiment.
train_data_dms_map = (
    pl.read_csv(
        "/kaggle/input/stanford-ribonanza-rna-folding/train_data_QUICK_START.csv",
        low_memory=True,
    )
    .filter(pl.col("experiment_type") == "DMS_MaP")
)
