How to start using Polars & DuckDB together for data analysis
Python's Pandas library has been around for a very long time, and it will continue to be for the foreseeable future. That shouldn't stop us engineers, however, from trying out a new library, or a combination of new libraries, in our day-to-day development work.
This blog aims to demonstrate how to use Polars together with DuckDB to perform the kinds of data transformations that are very often done with Pandas. A plethora of comparative studies between Pandas and Polars have already been documented, so this blog will steer clear of comparisons.
As always, this blog aims to be mostly beginner-friendly; however, to keep it short, I have not started all the way from installing Polars and DuckDB. Both can be pip-installed.
I will use a dataset from Kaggle to demonstrate the various data transformation/exploration activities. You can explore the details of the dataset by visiting the URL tagged above.
From the rich API that Polars offers, we will focus on the following functions, since they are the entry points for any data analysis:
read_csv
scan_csv
read_csv_batched
Let's first focus on read_csv, since it is the closest to its Pandas counterpart.
import polars as pl

# Read the CSV eagerly, then keep only the DMS_MaP experiment rows
train_data_dms_map = (
    pl.read_csv(
        "/kaggle/input/stanford-ribonanza-rna-folding/train_data_QUICK_START.csv",
        low_memory=True,
    )
    .filter(pl.col("experiment_type") == "DMS_MaP")
)