Python Speedups
Benjamin Hackl • June 12, 2024

Cython vs. Numba vs. Mojo

A Comparison of Different Approaches to Speeding Up Python Code Execution

ASHPC24 @ Grundlsee
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Toy Problem

Denoise an image $g \in \mathbb{R}^{M\times N} =: X$ by solving \[ \min_{u \in X} \frac{1}{2\lambda} \|u - g\|^2 + J(u), \] where $\lambda > 0$ is a fidelity weight and \[ J(u) := \sum_{\substack{1 \leq i \leq M\\ 1\leq j\leq N}} |(\nabla u)_{i, j}| \] is the discretized total variation (Rudin, Osher, Fatemi 1992).
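Here $(\nabla u)_{i,j}$ can be taken as the usual forward-difference operator (consistent with the `np.diff`-based implementation later on, which leaves the last row/column at zero): \[ ((\nabla u)_{i,j})_1 = \begin{cases} u_{i+1,j} - u_{i,j}, & i < M,\\ 0, & i = M, \end{cases} \qquad ((\nabla u)_{i,j})_2 = \begin{cases} u_{i,j+1} - u_{i,j}, & j < N,\\ 0, & j = N. \end{cases} \]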

Toy Problem: Algorithm

Simple iterative algorithm (Chambolle 2001):
  • working array $p \in \mathbb{R}^{2\times M\times N}$
  • out-image: $u = g - \operatorname{div}(p)$
  • update $p$: \[ p^{n+1} = \frac{p^n - \tau \nabla u}{1 + \frac{\tau}{\lambda} |\nabla u|} \]
  • stop: check variation $|p^n - p^{n+1}| < \varepsilon$ or $n > n_{\max}$

Toy Problem: (Core) Implementation

# result_channel and the work arrays p / result_gradient are initialized
# before the loop (result_channel starts as the noisy channel, p as zeros)
while num_iter < max_num_iter:
    if num_iter > 0:
        # divergence of p
        p_div = p.sum(axis=0)
        p_div[1:, :] -= p[0, 0:-1, :]
        p_div[:, 1:] -= p[1, :, 0:-1]

        # out-image u = g - div(p)
        result_channel = image_data[:, :, channel] - p_div

    # forward-difference gradient of the current image estimate
    result_gradient[0, 0:-1, :] = np.diff(result_channel, axis=0)
    result_gradient[1, :, 0:-1] = np.diff(result_channel, axis=1)

    # dual update: p <- (p - tau * grad u) / (1 + tau/lambda * |grad u|)
    factors = 1.0 + tau / weight * np.sqrt(
        (result_gradient**2).sum(axis=0)
    )
    factors = factors[np.newaxis, ...]
    p_new = p - tau * result_gradient
    p_new /= factors

    p = p_new
    num_iter += 1  # stopping check on |p_new - p| elided here
		
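For reference, the loop above can be packaged into a self-contained single-channel function; the name `denoise_channel`, the zero-initialization of $p$, and the default parameter values below are illustrative choices, not necessarily those of the benchmarked implementation:

```python
import numpy as np

def denoise_channel(g, weight=0.1, tau=0.125, eps=2e-4, max_num_iter=200):
    """TV-denoise a single 2-D channel g with Chambolle-style iterations."""
    M, N = g.shape
    p = np.zeros((2, M, N))   # dual variable
    grad = np.zeros_like(p)   # gradient workspace
    u = g                     # first iterate: the noisy image itself
    num_iter = 0
    while num_iter < max_num_iter:
        if num_iter > 0:
            # u = g - div(p)
            p_div = p.sum(axis=0)
            p_div[1:, :] -= p[0, :-1, :]
            p_div[:, 1:] -= p[1, :, :-1]
            u = g - p_div
        # forward-difference gradient of the current estimate
        grad[0, :-1, :] = np.diff(u, axis=0)
        grad[1, :, :-1] = np.diff(u, axis=1)
        # p <- (p - tau * grad u) / (1 + tau/lambda * |grad u|)
        factors = 1.0 + tau / weight * np.sqrt((grad**2).sum(axis=0))
        p_new = (p - tau * grad) / factors[np.newaxis, ...]
        # stop when the dual variable barely changes
        variation = np.abs(p_new - p).max()
        p = p_new
        num_iter += 1
        if variation < eps:
            break
    return u
```

Denoising a color image then amounts to calling this once per channel.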

In A Nutshell: Cython

  • "Cython is Python with C data types" — write C extensions using Python(y language)
  • Call C / C++ (stdlib) functions directly from Cython code
  • Given Python snippet: "cythonize" by adding static types:
    
    cdef int[1000] arr
    # or
    arr: cython.int[1000]
    				
  • Very mature, around since ~2007 (as a fork of Pyrex)

In A Nutshell: Numba

  • JIT compiler for Python-subset + NumPy code
  • Common usage: just add the @numba.jit decorator to functions
  • Built using LLVM:
    • Start with Python code
    • Translate to intermediate representation (e.g., type inference)
    • LLVM code generation plus optimization
  • Around since 2012, also pretty mature and stable
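The decorator usage can be sketched as follows; the function itself is just an illustrative loop in the style Numba compiles well, not the benchmark code, and the no-op fallback lets the snippet run even where Numba is not installed:

```python
import numpy as np

try:
    from numba import njit            # nopython-mode JIT decorator
except ImportError:                   # graceful fallback: plain Python
    def njit(func):
        return func

@njit
def squared_gradient_norm(u):
    # sum of squared forward differences over the image, written as
    # explicit loops instead of array expressions
    M, N = u.shape
    total = 0.0
    for i in range(M - 1):
        for j in range(N):
            d = u[i + 1, j] - u[i, j]
            total += d * d
    for i in range(M):
        for j in range(N - 1):
            d = u[i, j + 1] - u[i, j]
            total += d * d
    return total
```

The first call triggers compilation; subsequent calls run the compiled machine code.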

In A Nutshell: Mojo 🔥

  • "Programming Language for all AI Developers"
  • Want: "full compatibility with the Python ecosystem"; become a Python superset in the long term
  • First appearance 2023; incubated by Modular Inc., core modules of standard library open source (Apache 2) since March 2024
  • Built using MLIR from the LLVM infrastructure stack
  • Can target CPUs, GPUs, TPUs, ASICs, ... — broad hardware spectrum

Profiling: Test Environment

  • Test data: Drone shot of Lake Bled, 1500px × 1125px (colored) with added Gaussian noise (independently to each pixel and channel)
  • Each color channel is denoised separately
  • Executed on MacBook Air (M1, 2020)
  • Time for compilation: disregarded
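The noise model from the setup can be reproduced along these lines; the function name, $\sigma = 0.1$, and the seed are illustrative values, not the ones used for the benchmark:

```python
import numpy as np

def add_gaussian_noise(image, sigma=0.1, seed=0):
    """Add i.i.d. Gaussian noise to every pixel and channel, clip to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```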

Profiling: Results

  • All reference implementations available on GitHub

Notes on Cython

  • Cythonization without changes: same runtime as plain version
  • With some effort: speedup! (≈ 3.6× faster than plain)
    • "unrolling" vector operations via loops
    • array → element ops via memory views
    • C math functions
  • Fast implementation guided by Cython annotations!
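The "unrolling" step can be illustrated in plain Python; in the actual Cython version one would additionally declare typed memoryviews (e.g. `double[:, :, ::1]`) and use `sqrt` from `libc.math`. The function name is illustrative:

```python
from math import sqrt

def update_p_unrolled(p, grad, tau, weight):
    """Element-wise version of the vectorized dual update
    p <- (p - tau * grad) / (1 + tau/weight * |grad|), in place."""
    M, N = len(grad[0]), len(grad[0][0])
    for i in range(M):
        for j in range(N):
            norm = sqrt(grad[0][i][j] ** 2 + grad[1][i][j] ** 2)
            factor = 1.0 + tau / weight * norm
            p[0][i][j] = (p[0][i][j] - tau * grad[0][i][j]) / factor
            p[1][i][j] = (p[1][i][j] - tau * grad[1][i][j]) / factor
```

In pure Python this loop is far slower than the NumPy expression; the point is that once every operation is a scalar operation on typed data, Cython can compile it down to plain C loops.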

Notes on Numba

  • PIL interaction does not want to be numba'd
  • Parts of numpy API (np.diff(arr, axis=0)) not supported
  • After fixes: code runs ≈ 1.32× slower than the plain version
  • Array operations cause lots of large temporary arrays that slow us down — get rid of them!
  • Refactored (unrolled) version: ≈3.13× faster!
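The unsupported `np.diff(arr, axis=...)` call can be worked around with plain slicing, which Numba's nopython mode does handle; a sketch (with a fallback so it also runs without Numba installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:       # fallback: run as plain Python
    def njit(func):
        return func

@njit
def diff_axis0(a):
    # equivalent to np.diff(a, axis=0), written with supported slicing
    return a[1:, :] - a[:-1, :]

@njit
def diff_axis1(a):
    # equivalent to np.diff(a, axis=1)
    return a[:, 1:] - a[:, :-1]
```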

Notes on Mojo

  • Mojo is not a proper Python-superset yet
  • Only rudimentary interaction with numpy arrays — no slicing!
  • Efficient translation numpy-array ↔ Mojo-Tensor is (pretty much) black pointer magic
  • Mojo implementation of the "unrolled Cython" version is somewhat slow (≈ 3.5× the runtime of the plain version)
  • ... but (!) "unfair" setup (no parallelization, no vectorization, odd (?) data structure) — see their blog for $k$-means from scratch

Thank you!