Cache-Friendly, Low-Memory Lanczos Algorithm in Rust

✨ Discover this awesome post from Hacker News 📖

📂 Category:

📌 Main takeaway:

The standard Lanczos method for computing matrix functions has a brutal memory requirement: storing an $n \times k$

In this post, we will explore one of the most straightforward solutions to this problem: a two-pass variant of the Lanczos algorithm that only requires $O(n)$

Open Table of Contents

Let’s consider the problem of computing the action of matrix functions on a vector:

$\mathbf⚡ = f(\mathbfShare your opinion below!)\mathbf🔥$

where $\mathbfTell us your thoughts in comments!$

Indeed, there are a lot problems with computing $f(\mathbf{A})$

However we know that given any matrix function $f$

This gives us a good and a bad news: the good news is that, well, we can express $f(\mathbf{A})$

$f(\mathbf{A})\mathbf{b} \approx p_k(\mathbf{A})\mathbf{b} = \sum_{i=0}^k c_i \mathbf{A}^i \mathbf{b}$

This polynomial only involves vectors within a specific subspace.

Krylov Projection

We can notice that $p_k(\mathbf{A})\mathbf{b}$

$\mathcal{K}_k(\mathbf{A}, \mathbf{b}) = \text{span}\{\mathbf{b}, \mathbf{Ab}, \mathbf{A}^2\mathbf{b}, \ldots, \mathbf{A}^{k-1}\mathbf{b}\}$

This is fortunate: we can compute an approximate solution by staying within this space, which only requires repeated matrix-vector products with $\mathbf{A}$

We don’t need to construct $\mathbf{A}^j$

But there’s a problem: the raw vectors $\{\mathbf{A}^j\mathbf{b}\}$

Building an Orthonormal Basis

The standard method is the Arnoldi process, which is Gram-Schmidt applied to Krylov subspaces. We start by normalizing $\mathbf{v}_1 = \mathbf{b} / \|\mathbf{b}\|_2$

Compute a new candidate: $\mathbf{w}_j = \mathbf{A}\mathbf{v}_j$
Orthogonalize against all existing basis vectors:

$\tilde{\mathbf{v}}_j = \mathbf{w}_j – \sum_{i=1}^j (\mathbf{v}_i^H \mathbf{w}_j) \mathbf{v}_i$

Normalize: $\mathbf{v}_{j+1} = \tilde{\mathbf{v}}_j / \|\tilde{\mathbf{v}}_j\|_2$

The coefficients $h_{ij} = \mathbf{v}_i^H \mathbf{w}_j$

$\mathbf{V}_k = [\mathbf{v}_1, \ldots, \mathbf{v}_k]$
$\mathbf{H}_k$

We can express this relationship with the Arnoldi decomposition:

$\mathbf{A}\mathbf{V}_k = \mathbf{V}_k \mathbf{H}_k + h_{k+1,k} \mathbf{v}_{k+1} \mathbf{e}_k^T$

Solving in the Reduced Space

Now we approximate our original problem by solving it in the small $k$

$\mathbf{x}_k = \mathbf{V}_k \mathbf{y}_k$

where $\mathbf{y}_k$

$\mathbf{y}_k = f(\mathbf{H}_k) \mathbf{e}_1 \|\mathbf{b}\|_2$

The heavy lifting is now on computing $f(\mathbf{H}_k)$

For $f(z) = z^{-1}$

When $\mathbf{A}$

In the literature, this projected matrix is denoted $\mathbf{T}_k$

$\mathbf{T}_k = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \beta_2 & \\ & \beta_2 & \ddots & \ddots \\ & & \ddots & \alpha_k \end{pmatrix}$

where $\alpha_j \in \mathbb{R}$

Three-Term Recurrence

This tridiagonal structure leads to a beautiful simplification. To build the next basis
vector $\mathbf{v}_{j+1}$

$\mathbf{A}\mathbf{v}_j = \beta_{j-1}\mathbf{v}_{j-1} + \alpha_j \mathbf{v}_j + \beta_j \mathbf{v}_{j+1}$

Rearranging gives us an algorithm to compute $\mathbf{v}_{j+1}$

Compute the candidate: $w_{j+1} = \mathbf{A}\mathbf{v}_j$
Extract the diagonal coefficient: $\alpha_j = \mathbf{v}_j^H w_{j+1}$
Orthogonalize against the two previous vectors:

$\tilde{\mathbf{v}}_{j+1} = w_{j+1} – \alpha_j \mathbf{v}_j – \beta_{j-1}\mathbf{v}_{j-1}$

Normalize: $\beta_j = \|\tilde{\mathbf{v}}_{j+1}\|_2$

This is known as the Lanczos algorithm. It’s more efficient than Arnoldi because each iteration only orthogonalizes against two previous vectors instead of all prior ones.

Reconstructing the Solution

After $k$

$\mathbf{x}_k = \mathbf{V}_k \mathbf{y}_k$

where $\mathbf{y}_k = f(\mathbf{T}_k) \mathbf{e}_1 \|\mathbf{b}\|_2$

There is a timing problem however: we cannot compute the coefficients $\mathbf{y}_k$

So we’re left with a choice: whether we store all the basis vectors and solve the problem in $k$

There are also techniques to compress the basis vectors, have a look here

Here’s where we break the timing deadlock. The insight that we don’t actually need to store the basis vectors if we can afford to compute them twice

Think about what we have after the first pass. We’ve computed all the $\alpha_j$

This comes at a cost. We run Lanczos twice, so we pay for $2k$

It sounds like a bad trade at first. But as we’ll see later, the cache behavior of this
two-pass approach can actually make it as fast (or even faster) on real hardware if well optimized.

First Pass: Compute the Projected Problem

We initialize $\mathbf{v}_1 = \mathbf{b} / \|\mathbf{b}\|_2$

$w_j = \mathbf{A}\mathbf{v}_j$

At each step, we record $\alpha_j$

After $k$

$\mathbf{y}_k = f(\mathbf{T}_k) \mathbf{e}_1 \|\mathbf{b}\|_2$

This is the solution in the reduced space. Now that we have the coefficients we need to build $\mathbf{x}_k$

Second Pass: Reconstruct and Accumulate

With $\mathbf{y}_k$

let mut x_k = vec![0.0; n];
let mut v_prev = vec![0.0; n];
let mut v_curr = b.clone() / b_norm;

for j in 1..=k {
    let w = A @ v_curr;  // Matrix-vector product

    // We don't recompute alpha/beta; we already have them from pass 1
    let alpha_j = alphas[j - 1];
    let beta_prev = j > 1 ? betas[j - 2] : 0.0;

    // Accumulate the solution
    x_k += y_k[j - 1] * v_curr;

    // Regenerate the next basis vector for the *next* iteration
    let v_next = (w - alpha_j * v_curr - beta_prev * v_prev) / betas[j - 1];

    // Slide the window forward
    v_prev = v_curr;
    v_curr = v_next;
}

This loop regenerates each $\mathbf{v}_j$

A Subtle Numerical Point

There is one detail worth noting: floating-point arithmetic is deterministic. When we replay the Lanczos recurrence in the second pass with the exact same inputs and the exact same order of operations, we get bitwise-identical vectors. The $\mathbf{v}_j$

However, the order in which we accumulate the solution differs. In a standard Lanczos,
$\mathbf{x}_k$

Building this in Rust forces us to think concretely about where data lives and how it flows through the cache hierarchy. We need to control memory layout, decide when allocations happen, and choose abstractions that cost us nothing at runtime.

For linear algebra, we reach for faer. Three design choices in this library matter for what we’re building:

Stack allocation via MemStack: Pre-allocated scratch space that lives for the entire computation. The hot path becomes allocation-free.
Matrix-free operators: The LinOp trait defines an operator by its action (apply) without materializing a matrix. For large sparse problems, this is the only viable approach.
SIMD-friendly loops: The zip! macro generates code that compiles to packed instructions.

Recurrence Step

Our starting point is the Lanczos three-term recurrence that we derived earlier:

$\beta_j \mathbf{v}_{j+1} = \mathbf{A}\mathbf{v}_j – \alpha_j \mathbf{v}_j – \beta_{j-1}\mathbf{v}_{j-1}$

We can translate this into a recurrence step function. The signature looks like this:

fn lanczos_recurrence_step<T: ComplexField, O: LinOp<T>>(
    operator: &O,
    mut w: MatMut<'_, T>,
    v_curr: MatRef<'_, T>,
    v_prev: MatRef<'_, T>,
    beta_prev: T::Real,
    stack: &mut MemStack,
) -> (T::Real, Option<T::Real>)

The function is generic over the field type T (f64, c64, etc.) and the operator type O. It operates on matrix views (MatMut and MatRef) to avoid unnecessary data copies. The return type gives us the diagonal element $\alpha_j$

Now we can implement the body by following the math. The first step is the most expensive:

// 1. Apply operator: w = A * v_curr
operator.apply(w.rb_mut(), v_curr, Par::Seq, stack);

The matrix-vector product dominates the computational cost. Everything else is secondary.

Next, we orthogonalize against $\mathbf{v}_{j-1}$

// 2. Orthogonalize against v_{j-1}: w -= β_{j-1} * v_{j-1}
let beta_prev_scaled = T::from_real_impl(&beta_prev);
zip!(w.rb_mut(), v_prev).for_each(|unzip!(w_i, v_prev_i)| {
    *w_i = sub(w_i, &mul(&beta_prev_scaled, v_prev_i));
});

With w partially orthogonalized, we can compute the diagonal coefficient via an inner product. Since $\mathbf{A}$

// 3. Compute α_j = v_j^H * w
let alpha = T::real_part_impl(&(v_curr.adjoint() * w.rb())[(0, 0)]);

We complete the orthogonalization against $\mathbf{v}_j$

// 4. Orthogonalize against v_j: w -= α_j * v_j
let alpha_scaled = T::from_real_impl(&alpha);
zip!(w.rb_mut(), v_curr).for_each(|unzip!(w_i, v_curr_i)| {
    *w_i = sub(w_i, &mul(&alpha_scaled, v_curr_i));
});

Now w holds the unnormalized next basis vector. We compute its norm to get $\beta_j$

// 5. Compute β_j = ||w||_2 and check for breakdown
let beta = w.rb().norm_l2();
let tolerance = breakdown_tolerance::::Real>();

if beta <= tolerance {
    (alpha, None)
} else {
    (alpha, Some(beta))
}

The function returns None for $\beta_j$

An Iterator for State Management

The recurrence step is a pure function, but calling it in a loop is both inefficient and awkward. We’d need to manually pass vectors in and out of each iteration. More critically, we’d create copies when we should be reusing memory.

The iterator pattern solves this. We create a struct that encapsulates the state:

struct LanczosIteration<'a, T: ComplexField, O: LinOp<T>> {
    operator: &'a O,
    v_prev: Mat<T>,       // v_{j-1}
    v_curr: Mat<T>,       // v_j
    work: Mat<T>,         // Workspace for the next vector
    beta_prev: T::Real,   // β_{j-1}
    // ... iteration counters
}

The main design choice here is that vectors are owned (Mat), not borrowed. This enables an optimization in the next_step method. After computing the next vector and normalizing it into work, we cycle the state without allocating or copying:

// Inside next_step, after normalization...
core::mem::swap(&mut self.v_prev, &mut self.v_curr);
core::mem::swap(&mut self.v_curr, &mut self.work);

On x86-64, swapping two Mat structures (fat pointers) compiles to three mov instructions. The pointers change, but no vector data moves. After the swap, v_prev points to what v_curr held, v_curr points to work’s allocation, and work points to the old v_prev data. In the next iteration, work gets reused.

We keep exactly three n-dimensional vectors live in memory. The same allocations cycle through the computation, staying hot in L1 cache. This is the core reason the two-pass method can be faster than expected, the working set never leaves cache.

First Pass: Computing the Decomposition

The first pass runs the Lanczos iteration and collects the coefficients $\{\alpha_j, \beta_j\}$

pub fn lanczos_pass_one<T: ComplexField>(
    operator: &impl LinOp<T>,
    b: MatRef<'_, T>,
    k: usize,
    stack: &mut MemStack,
) -> Result<LanczosDecomposition<T::Real>, LanczosError> {
    // ...
}

We allocate vectors for the coefficients with a capacity hint to avoid reallocations:

let mut alphas = Vec::with_capacity(k);
let mut betas = Vec::with_capacity(k - 1);

Then we construct the iterator. This allocates the three work vectors once. After this point, the hot path is allocation-free:

let mut lanczos_iter = LanczosIteration::new(operator, b, k, b_norm)?;

for i in 0..k {
    if let Some(step) = lanczos_iter.next_step(stack) {
        alphas.push(step.alpha);
        steps_taken += 1;

        let tolerance = breakdown_tolerance::::Real>();
        if step.beta <= tolerance {
            break;
        }

        if i < k - 1 {
            betas.push(step.beta);
        }
    } else {
        break;
    }
}

The check for breakdown stops the iteration when the residual becomes numerically zero. This means we’ve found an invariant subspace and there’s no value in continuing.

At the end, we collect the scalars into a LanczosDecomposition struct. The memory footprint throughout this pass is constant: three n-dimensional vectors plus two small arrays that grow to at most $k$

Second Pass: Reconstructing the Solution

Now we face a different problem. We have the $\{\alpha_j, \beta_j\}$

$\mathbf{x}_k = \sum_{j=1}^k (\mathbf{y}_k)_j \mathbf{v}_j$

without storing the full basis matrix $\mathbf{V}_k$

The recurrence step in this pass is structurally similar to the first pass, but with a key difference: we no longer compute inner products or norms. We already know the coefficients, so the step becomes pure reconstruction.

fn lanczos_reconstruction_step<T: ComplexField, O: LinOp<T>>(
    operator: &O,
    mut w: MatMut<'_, T>,
    v_curr: MatRef<'_, T>,
    v_prev: MatRef<'_, T>,
    alpha_j: T::Real,
    beta_prev: T::Real,
    stack: &mut MemStack,
) {
    // Apply operator
    operator.apply(w.rb_mut(), v_curr, Par::Seq, stack);

    // Orthogonalize using stored α_j and β_{j-1}
    let beta_prev_scaled = T::from_real_impl(&beta_prev);
    zip!(w.rb_mut(), v_prev).for_each(|unzip!(w_i, v_prev_i)| {
        *w_i = sub(w_i, &mul(&beta_prev_scaled, v_prev_i));
    });

    let alpha_scaled = T::from_real_impl(&alpha_j);
    zip!(w.rb_mut(), v_curr).for_each(|unzip!(w_i, v_curr_i)| {
        *w_i = sub(w_i, &mul(&alpha_scaled, v_curr_i));
    });
}

This is cheaper than the first-pass recurrence. We’ve eliminated the inner products that computed $\alpha_j$

lanczos_pass_two implements this reconstruction. We initialize the three work vectors and the solution accumulator:

pub fn lanczos_pass_two<T: ComplexField>(
    operator: &impl LinOp<T>,
    b: MatRef<'_, T>,
    decomposition: &LanczosDecomposition<T::Real>,
    y_k: MatRef<'_, T>,
    stack: &mut MemStack,
) -> Result<Mat<T>, LanczosError> {
    let mut v_prev = Mat::<T>::zeros(b.nrows(), 1);
    let inv_norm = T::from_real_impl(&T::Real::recip_impl(&decomposition.b_norm));
    let mut v_curr = b * Scale(inv_norm);  // v_1

    let mut work = Mat::<T>::zeros(b.nrows(), 1);

    // Initialize solution with first component
    let mut x_k = &v_curr * Scale(T::copy_impl(&y_k[(0, 0)]));

We build the solution incrementally by starting with the first basis vector scaled by its coefficient. The main loop then regenerates each subsequent vector: we regenerate each subsequent basis vector, normalize it using the stored $\beta_j$

for j in 0..decomposition.steps_taken - 1 {
    let alpha_j = T::Real::copy_impl(&decomposition.alphas[j]);
    let beta_j = T::Real::copy_impl(&decomposition.betas[j]);
    let beta_prev = if j == 0 {
        T::Real::zero_impl()
    } else {
        T::Real::copy_impl(&decomposition.betas[j - 1])
    };

    // 1. Regenerate the unnormalized next vector
    lanczos_reconstruction_step(
        operator,
        work.as_mut(),
        v_curr.as_ref(),
        v_prev.as_ref(),
        alpha_j,
        beta_prev,
        stack,
    );

    // 2. Normalize using stored β_j
    let inv_beta = T::from_real_impl(&T::Real::recip_impl(&beta_j));
    zip!(work.as_mut()).for_each(|unzip!(w_i)| {
        *w_i = mul(w_i, &inv_beta);
    });

    // 3. Accumulate: x_k += y_{j+1} * v_{j+1}
    let coeff = T::copy_impl(&y_k[(j + 1, 0)]);
    zip!(x_k.as_mut(), work.as_ref()).for_each(|unzip!(x_i, v_i)| {
        *x_i = add(x_i, &mul(&coeff, v_i));
    });

    // 4. Cycle vectors for the next iteration
    core::mem::swap(&mut v_prev, &mut v_curr);
    core::mem::swap(&mut v_curr, &mut work);
}

The accumulation x_k += y_{j+1} * v_{j+1} is implemented as a fused multiply-add in the zip! loop. On hardware with FMA support, this becomes a single instruction per element, not three separate operations.

Note that we accumulate the solution incrementally. After each iteration, x_k contains a partial result. We cycle through the same three vectors (v_prev, v_curr, work), keeping the working set small and resident in L1 cache.

Compare this to the standard method’s final reconstruction step: $\mathbf{x}_k = \mathbf{V}_k \mathbf{y}_k$

In our two-pass reconstruction, the operator $\mathbf{A}$ is applied $k$

This is the reason the two-pass method can be faster on real hardware despite performing twice as many matrix-vector products. The cache behavior of the reconstruction phase overwhelms the savings of storing the basis.

The Public API

We can wrap the two passes into a single entry point:

pub fn lanczos_two_pass<T, O, F>(
    operator: &O,
    b: MatRef<'_, T>,
    k: usize,
    stack: &mut MemStack,
    mut f_tk_solver: F,
) -> Result<Mat<T>, LanczosError>
where
    T: ComplexField,
    O: LinOp<T>,
    F: FnMut(&[T::Real], &[T::Real]) -> Result<Mat<T>, anyhow::Error>,
{
    // First pass: compute T_k coefficients
    let decomposition = lanczos_pass_one(operator, b, k, stack)?;

    if decomposition.steps_taken == 0 {
        return Ok(Mat::zeros(b.nrows(), 1));
    }

    // Solve projected problem: y_k' = f(T_k) * e_1
    let y_k_prime = f_tk_solver(&decomposition.alphas, &decomposition.betas)?;

    // Scale by ||b||
    let y_k = &y_k_prime * Scale(T::from_real_impl(&decomposition.b_norm));

    // Second pass: reconstruct solution
    lanczos_pass_two(operator, b, &decomposition, y_k.as_ref(), stack)
}

The design separates concerns. The f_tk_solver closure is where we inject the specific matrix function. We compute the Lanczos decomposition, then pass the coefficients to the user-provided solver, which computes $\mathbf{y}_k’ = f(\mathbf{T}_k) \mathbf{e}_1$

The caller provides f_tk_solver as a closure. It receives the raw $\{\alpha_j, \beta_j\}$

Example: Solving a Linear System

To see this in practice, consider solving $\mathbf{Ax} = \mathbf{b}$

Since $\mathbf{T}_k$

let f_tk_solver = |alphas: &[f64], betas: &[f64]| -> Result<Mat<f64>, anyhow::Error> {
    let steps = alphas.len();
    if steps == 0 {
        return Ok(Mat::zeros(0, 1));
    }

    // 1. Assemble T_k from coefficients using triplet format
    let mut triplets = Vec::with_capacity(3 * steps - 2);
    for (i, &alpha) in alphas.iter().enumerate() {
        triplets.push(Triplet { row: i, col: i, val: alpha });
    }
    for (i, &beta) in betas.iter().enumerate() {
        triplets.push(Triplet { row: i, col: i + 1, val: beta });
        triplets.push(Triplet { row: i + 1, col: i, val: beta });
    }
    let t_k_sparse = SparseColMat::try_new_from_triplets(steps, steps, &triplets)?;

    // 2. Construct e_1
    let mut e1 = Mat::zeros(steps, 1);
    e1.as_mut()[(0, 0)] = 1.0;

    // 3. Solve T_k * y' = e_1 via sparse LU
    Ok(t_k_sparse.as_ref().sp_lu()?.solve(e1.as_ref()))
};

The closure takes the coefficient arrays, constructs the sparse tridiagonal matrix, and solves the system. The triplet format lets us build the matrix efficiently without knowing its structure in advance. The sparse LU solver leverages the tridiagonal structure to avoid dense factorization.

Now that we have a working implementation we can run some tests. The core idea of what we have done is simple: trade flops for better memory access. But does this trade actually pay off on real hardware? To find out, we need a reliable way to benchmark it.

For the data, we know that the performance of any Krylov method is tied to the operator’s spectral properties. We need a way to generate a family of test problems where we can precisely control the size, sparsity, and numerical difficulty. A great way to do this is with Karush-Kuhn-Tucker (KKT) systems, which are sparse, symmetric, and have a specific block structure.

$A = \begin{pmatrix} D & E^T \\ E & 0 \end{pmatrix}$

This structure gives us two critical knobs to turn. First, with the netgen utility, we can control the $E$

For a symmetric matrix like $D$

Memory and Computation Trade-off

We measure the algorithm against two hypotheses on a large sparse problem with $n=500,000$

Hypothesis 1 (Memory): The one-pass method stores the full basis $\mathbf{V}_k$

Hypothesis 2 (Runtime): The two-pass method performs $2k$

Memory Usage

Memory vs Iterations

The memory data confirms Hypothesis 1 exactly. The one-pass method’s footprint scales as a straight line—each additional iteration adds one vector to the basis. The two-pass method remains flat. No allocation growth happens after initialization.

Runtime: Where Theory Breaks

Runtime vs Iterations

The runtime data contradicts Hypothesis 2. The two-pass method is slower, but never by a factor of two. For small $k$

This difference comes from memory access patterns. Both methods perform matrix-vector products, but they differ in how they reconstruct the solution.

The one-pass method computes $\mathbf{x}_k = \mathbf{V}_k \mathbf{y}_k$

The two-pass method reconstructs the solution incrementally. At each iteration, it operates on exactly three n-dimensional vectors: $\mathbf{v}_{\text{prev}}$

The cost of re-computing basis vectors is less than the latency cost of scanning an $n \times k$

Medium-Scale Behavior

Medium Scale Runtime vs Iterations
Medium Scale Memory Usage vs Iterations

At $n=50,000$

What About Dense Matrices?

To be sure of our hypothesis, we can test it directly using a dense matrix of size $n=10,000$

Dense Matrix Runtime vs Iterations

We can see that the two-pass method runs almost exactly twice as slow as the one-pass method. The slope ratio is exactly 2:1. In a compute-bound regime, the extra matrix-vector products cannot be hidden by cache effects. Here, the theoretical trade-off holds perfectly.

Scalability

Now, let’s fix the iteration count at $k=500$

Scalability Memory Usage

Here we have to use a logarithmic y-axis to show both curves; the two-pass line is so flat relative to the one-pass line that it’s otherwise invisible.

Scalability Runtime

Runtime scales linearly with $n$

As $n$

Well, that’s it. If you want to have a better look at the code or use it, it’s all open source:

This was more of an exploration than a production-ready library, so expect rough edges. But I hope it gives an interesting perspective on how algorithm engineering and low-level implementation details can alter what seems like a straightforward trade-off on a blackboard.

{💬|⚡|🔥} {What do you think?|Share your opinion below!|Tell us your thoughts in comments!}

#️⃣ #CacheFriendly #LowMemory #Lanczos #Algorithm #Rust

🕒 Posted on 1762883248