
How to start using Polars & DuckDB together for data analysis

7 min read · Mar 18, 2024

Python's Pandas has been around for a very long time and will remain so for the foreseeable future. However, that shouldn’t stop us engineers from trying out a new library, or a combination of new libraries, for our day-to-day development work.

This blog aims to demonstrate the usage of Polars with DuckDB to perform the kinds of data transformations that are very often done using Pandas. Plenty of comparative studies between Pandas and Polars have already been done and documented, so this blog will shy away from all comparisons.

As always, this blog is mostly beginner-friendly. To keep it short, however, I have not started all the way from installing the libraries. Pandas, Polars, and DuckDB can all be pip-installed.

I will use a dataset from Kaggle to demonstrate various data transformation and exploration activities. You can explore the details of the dataset by visiting the URL tagged above.

From the rich API offered by Polars, we will focus on the following functions, since they are the entry points for getting data into an analysis:

read_csv
scan_csv
read_csv_batched

Let’s first focus on read_csv, since it is very similar to its Pandas counterpart.

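import polars as pl

# Read the CSV eagerly, then keep only the DMS_MaP experiment rows.
train_data_dms_map = (
    pl.read_csv(
        "/kaggle/input/stanford-ribonanza-rna-folding/train_data_QUICK_START.csv",
        low_memory=True,  # trade some speed for lower memory pressure
    )
    .filter(pl.col("experiment_type") == "DMS_MaP")
)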

A few things are going on here:

Read the CSV file from a specific location (no surprises there)
low_memory=True: as per the Polars documentation, this option reduces memory pressure at the expense of performance. I have not verified this, so please experiment with it and let me know
.filter(pl.col('experiment_type')=='DMS_MaP'): as you might have guessed, this call keeps only the rows where the experiment_type column has a specific value

Once the required data is available in memory, let’s try another very common activity: applying a specific transformation to the values of a particular column of the dataframe.

The sequence column of this dataset has RNA sequences of the form

GGGAACGACUCGAGUAGAGUCGAAAAAGAUCGCCACGCACUUACGAGUGCGUGGCGAUCACGCGUGUUGCAGCGCGUCUAACAUAGGCAGGGCAACCUGC…

Written by Suman Kumar Gangopadhyay
