
SQL for Pandas, Spark, and Dask DataFrames with Fugue

2 min read · Apr 8, 2023

Fugue is a framework that simplifies distributed computing by abstracting away the complexities of distributed systems. Users write high-level code with familiar tools such as Python, pandas, and SQL, and Fugue then optimizes and executes that code on distributed computing frameworks such as Spark, Dask, and Ray.

This approach not only reduces the amount of code required but also ensures that the code is optimized for the specific distributed computing framework being used, leading to more efficient execution and faster completion times for big data projects. Additionally, Fugue provides a range of tools and features for monitoring and debugging distributed applications, making it easier for developers to build, test, and maintain their code.

Installation

pip install fugue

Backend engines are installed separately through pip extras. For example, to install Spark:

pip install "fugue[spark]"

If Spark, Dask, or Ray is already installed on your machine, Fugue will be able to detect it. Spark also requires Java, which must be installed separately.
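Before choosing an engine, it can help to check which backends are importable in the current environment. The sketch below uses only the standard library. It is not how Fugue itself detects backends, and the `BACKEND_MODULES` mapping and `available_backends` helper are illustrative names.

```python
from importlib.util import find_spec

# Backend label -> top-level Python module that the backend provides.
BACKEND_MODULES = {"spark": "pyspark", "dask": "dask", "ray": "ray"}


def available_backends() -> dict:
    """Return {backend: bool}, True when the backend's module can be imported."""
    return {
        name: find_spec(module) is not None
        for name, module in BACKEND_MODULES.items()
    }


if __name__ == "__main__":
    for name, installed in available_backends().items():
        print(f"{name}: {'installed' if installed else 'not installed'}")
```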

Examples

# Register the %%fsql cell magic in the Jupyter notebook
from fugue_notebook import setup
setup()

import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "c"]})
df

   col1 col2
0     1    a
1     2    b
2     3    c
3     4    c

%%fsql 

SELECT *
FROM df 
WHERE col2="c"
PRINT

   col1:long col2:str
2          3        c
3          4        c
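As a cross-check, the same filter can be written directly in pandas. This sketch rebuilds the example DataFrame so that it runs on its own; it should select the same two rows as the FugueSQL query above.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "c"]})

# pandas equivalent of: SELECT * FROM df WHERE col2="c"
filtered = df[df["col2"] == "c"]
print(filtered)
```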

%%fsql 

SELECT col2, AVG(col1) AS avg_col1
FROM df 
GROUP BY col2
PRINT

   col2:str  avg_col1:double
0         a              1.0
1         b              2.0
2         c              3.5
PandasDataFrame: col2:str,avg_col1:double
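For comparison, the aggregation can also be expressed with pandas named aggregation. This is a sketch of the equivalent pandas code, not something FugueSQL requires; the DataFrame is rebuilt so that the snippet is self-contained.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "c"]})

# pandas equivalent of:
#   SELECT col2, AVG(col1) AS avg_col1 FROM df GROUP BY col2
averages = df.groupby("col2", as_index=False).agg(avg_col1=("col1", "mean"))
print(averages)
```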

Full code on GitHub. Thanks for reading.

Written by GeoSense ✅
