
How to start using Polars & DuckDB together for data analysis

Suman Kumar Gangopadhyay
7 min read · Mar 18, 2024

Python's Pandas has been around for a very long time and will continue to be for the foreseeable future. However, that shouldn't stop us engineers from trying out a new library, or a combination of new libraries, for our day-to-day development work.

This blog aims to demonstrate how Polars and DuckDB can be used together to perform the kinds of data transformations that are very often done with Pandas. A plethora of comparative studies between Pandas and Polars have already been done and documented, so this blog will shy away from all comparisons.

As always, this blog is meant to be mostly beginner-friendly. However, to keep it short, I have not started all the way from installing Pandas and Polars; both can be pip-installed.

I will use a dataset from Kaggle to demonstrate various data transformation and exploration activities. You can explore the details of the dataset by visiting the URL tagged above.

From the rich API that Polars offers, we will focus on the following functions, since they are the entry points for any data analysis:

read_csv
scan_csv
read_csv_batched

Let's first focus on read_csv, since it behaves much like its Pandas counterpart.

import polars as pl

train_data_dms_map = (
    pl.read_csv(
        "/kaggle/input/stanford-ribonanza-rna-folding/train_data_QUICK_START.csv",
        low_memory=True,
    )
    .filter(pl.col("experiment_type") == "DMS_MaP")
)

A few things are going on here:

  1. Read the CSV file from a specific location (no surprises there).
  2. low_memory=True: per the Polars documentation, this option reduces memory pressure at the expense of performance. I have not verified this, so please experiment with it and let me know.
  3. .filter(pl.col('experiment_type') == 'DMS_MaP'): as you might have guessed, this filters the dataset down to rows where the experiment_type column equals 'DMS_MaP'.

Once the required data is available in memory, let's try another very common activity: applying a transformation to the values of a particular column of the dataframe.

The sequence column of this dataset has RNA sequences of the form

GGGAACGACUCGAGUAGAGUCGAAAAAGAUCGCCACGCACUUACGAGUGCGUGGCGAUCACGCGUGUUGCAGCGCGUCUAACAUAGGCAGGGCAACCUGC…
