Python Speedups
Benjamin Hackl • June 12, 2024

Cython vs. Numba vs. Mojo

A Comparison of Different Approaches to Speeding Up Python Code Execution

ASHPC24 @ Grundlsee
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Toy Problem

Denoise an image $g \in \mathbb{R}^{M\times N} =: X$ by solving \[ \min_{u \in X} \frac{1}{2\lambda} \|u - g\|^2 + J(u), \] where $\lambda > 0$ is a fidelity weight and \[ J(u) := \sum_{\substack{1 \leq i \leq M\\ 1\leq j\leq N}} |(\nabla u)_{i, j}| \] is the discretized total variation (Rudin, Osher, Fatemi 1992).
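Here $(\nabla u)_{i,j}$ can be taken as the usual forward-difference operator (consistent with the `np.diff`-based implementation later on, which leaves the last row/column at zero): \[ ((\nabla u)_{i,j})_1 = \begin{cases} u_{i+1,j} - u_{i,j}, & i < M,\\ 0, & i = M, \end{cases} \qquad ((\nabla u)_{i,j})_2 = \begin{cases} u_{i,j+1} - u_{i,j}, & j < N,\\ 0, & j = N. \end{cases} \]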

Toy Problem: Algorithm

Simple iterative algorithm (Chambolle 2001):
  • working array $p \in \mathbb{R}^{2\times M\times N}$
  • out-image: $u = g - \operatorname{div}(p)$
  • update $p$: \[ p^{n+1} = \frac{p^n - \tau \nabla u}{1 + \frac{\tau}{\lambda} |\nabla u|} \]
  • stop: check variation $|p^n - p^{n+1}| < \varepsilon$ or $n > n_{\max}$

Toy Problem: (Core) Implementation

# result_channel and the work arrays p / result_gradient are initialized
# before the loop (result_channel starts as the noisy channel, p as zeros)
while num_iter < max_num_iter:
    if num_iter > 0:
        # divergence of p
        p_div = p.sum(axis=0)
        p_div[1:, :] -= p[0, 0:-1, :]
        p_div[:, 1:] -= p[1, :, 0:-1]

        # out-image u = g - div(p)
        result_channel = image_data[:, :, channel] - p_div

    # forward-difference gradient of the current image estimate
    result_gradient[0, 0:-1, :] = np.diff(result_channel, axis=0)
    result_gradient[1, :, 0:-1] = np.diff(result_channel, axis=1)

    # dual update: p <- (p - tau * grad u) / (1 + tau/lambda * |grad u|)
    factors = 1.0 + tau / weight * np.sqrt(
        (result_gradient**2).sum(axis=0)
    )
    factors = factors[np.newaxis, ...]
    p_new = p - tau * result_gradient
    p_new /= factors

    p = p_new
    num_iter += 1  # stopping check on |p_new - p| elided here
		
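For reference, the loop above can be packaged into a self-contained single-channel function; the name `denoise_channel`, the zero-initialization of $p$, and the default parameter values below are illustrative choices, not necessarily those of the benchmarked implementation:

```python
import numpy as np

def denoise_channel(g, weight=0.1, tau=0.125, eps=2e-4, max_num_iter=200):
    """TV-denoise a single 2-D channel g with Chambolle-style iterations."""
    M, N = g.shape
    p = np.zeros((2, M, N))   # dual variable
    grad = np.zeros_like(p)   # gradient workspace
    u = g                     # first iterate: the noisy image itself
    num_iter = 0
    while num_iter < max_num_iter:
        if num_iter > 0:
            # u = g - div(p)
            p_div = p.sum(axis=0)
            p_div[1:, :] -= p[0, :-1, :]
            p_div[:, 1:] -= p[1, :, :-1]
            u = g - p_div
        # forward-difference gradient of the current estimate
        grad[0, :-1, :] = np.diff(u, axis=0)
        grad[1, :, :-1] = np.diff(u, axis=1)
        # p <- (p - tau * grad u) / (1 + tau/lambda * |grad u|)
        factors = 1.0 + tau / weight * np.sqrt((grad**2).sum(axis=0))
        p_new = (p - tau * grad) / factors[np.newaxis, ...]
        # stop when the dual variable barely changes
        variation = np.abs(p_new - p).max()
        p = p_new
        num_iter += 1
        if variation < eps:
            break
    return u
```

Denoising a color image then amounts to calling this once per channel.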

In A Nutshell: Cython

  • "Cython is Python with C data types" — write C extensions using Python(y language)
  • Call C / C++ (stdlib) functions directly from Cython code
  • Given Python snippet: "cythonize" by adding static types:
    
    cdef int[1000] arr
    # or
    arr: cython.int[1000]
    				
  • Very mature, around since ~2007 (as a fork of Pyrex)

In A Nutshell: Numba

  • JIT compiler for Python-subset + NumPy code
  • Common usage: just add the @numba.jit decorator to functions
  • Built using LLVM:
    • Start with Python code
    • Translate to intermediate representation (e.g., type inference)
    • LLVM code generation plus optimization
  • Around since 2012, also pretty mature and stable
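The decorator usage can be sketched as follows; the function itself is just an illustrative loop in the style Numba compiles well, not the benchmark code, and the no-op fallback lets the snippet run even where Numba is not installed:

```python
import numpy as np

try:
    from numba import njit            # nopython-mode JIT decorator
except ImportError:                   # graceful fallback: plain Python
    def njit(func):
        return func

@njit
def squared_gradient_norm(u):
    # sum of squared forward differences over the image, written as
    # explicit loops instead of array expressions
    M, N = u.shape
    total = 0.0
    for i in range(M - 1):
        for j in range(N):
            d = u[i + 1, j] - u[i, j]
            total += d * d
    for i in range(M):
        for j in range(N - 1):
            d = u[i, j + 1] - u[i, j]
            total += d * d
    return total
```

The first call triggers compilation; subsequent calls run the compiled machine code.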

In A Nutshell: Mojo 🔥

  • "Programming Language for all AI Developers"
  • Want: "full compatibility with the Python ecosystem"; become a Python superset in the long term
  • First appearance 2023; incubated by Modular Inc., core modules of standard library open source (Apache 2) since March 2024
  • Built using MLIR from the LLVM infrastructure stack
  • Can target CPUs, GPUs, TPUs, ASICs, ... — broad hardware spectrum

Profiling: Test Environment

  • Test data: Drone shot of Lake Bled, 1500px × 1125px (colored) with added Gaussian noise (independently to each pixel and channel)
  • Each color channel is denoised separately
  • Executed on MacBook Air (M1, 2020)
  • Time for compilation: disregarded
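The noise model from the setup can be reproduced along these lines; the function name, $\sigma = 0.1$, and the seed are illustrative values, not the ones used for the benchmark:

```python
import numpy as np

def add_gaussian_noise(image, sigma=0.1, seed=0):
    """Add i.i.d. Gaussian noise to every pixel and channel, clip to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```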

Profiling: Results

  • All reference implementations available on GitHub

Notes on Cython

  • Cythonization without changes: same runtime as plain version
  • With some effort: speedup! (≈ 3.6× faster than plain)
    • "unrolling" vector operations via loops
    • array → element ops via memory views
    • C math functions
  • Fast implementation guided by Cython annotations!
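The "unrolling" step can be illustrated in plain Python; in the actual Cython version one would additionally declare typed memoryviews (e.g. `double[:, :, ::1]`) and use `sqrt` from `libc.math`. The function name is illustrative:

```python
from math import sqrt

def update_p_unrolled(p, grad, tau, weight):
    """Element-wise version of the vectorized dual update
    p <- (p - tau * grad) / (1 + tau/weight * |grad|), in place."""
    M, N = len(grad[0]), len(grad[0][0])
    for i in range(M):
        for j in range(N):
            norm = sqrt(grad[0][i][j] ** 2 + grad[1][i][j] ** 2)
            factor = 1.0 + tau / weight * norm
            p[0][i][j] = (p[0][i][j] - tau * grad[0][i][j]) / factor
            p[1][i][j] = (p[1][i][j] - tau * grad[1][i][j]) / factor
```

In pure Python this loop is far slower than the NumPy expression; the point is that once every operation is a scalar operation on typed data, Cython can compile it down to plain C loops.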

Notes on Numba

  • PIL interaction does not want to be numba'd
  • Parts of numpy API (np.diff(arr, axis=0)) not supported
  • After fixes: code runs ≈ 1.32× slower than the plain version
  • Array operations cause lots of large temporary arrays that slow us down — get rid of them!
  • Refactored (unrolled) version: ≈3.13× faster!
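The unsupported `np.diff(arr, axis=...)` call can be worked around with plain slicing, which Numba's nopython mode does handle; a sketch (with a fallback so it also runs without Numba installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:       # fallback: run as plain Python
    def njit(func):
        return func

@njit
def diff_axis0(a):
    # equivalent to np.diff(a, axis=0), written with supported slicing
    return a[1:, :] - a[:-1, :]

@njit
def diff_axis1(a):
    # equivalent to np.diff(a, axis=1)
    return a[:, 1:] - a[:, :-1]
```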

Notes on Mojo

  • Mojo is not a proper Python-superset yet
  • Only rudimentary interaction with numpy arrays — no slicing!
  • Efficient translation numpy-array ↔ Mojo-Tensor is (pretty much) black pointer magic
  • Mojo implementation of the "unrolled Cython" version is somewhat slow (≈ 3.5× the runtime of the plain version)
  • ... but (!) "unfair" setup (no parallelization, no vectorization, odd (?) data structure) — see their blog for $k$-means from scratch

Thank you!