LLM Engine - C++23 Inference Engine | Nick Stradford

Overview

This is a learning project to get hands-on with C++23 while building an understanding of how LLM inference engines work under the hood. The code follows along with Building LLM Inference Engines with C++23 by William B. Birt — a code-first guide that starts from a blank C++ source file and works toward a production-grade inference engine capable of running models like Llama 3 and Mistral on edge devices and consumer hardware.

The build system uses CMake with Ninja, and tests run on GoogleTest (fetched automatically by CMake).

Follow the progress on GitHub.


Current Modules

Arena Allocator

The arena allocator is a bump/linear allocator that grabs one big block of memory up front and hands out pieces by advancing an offset. Individual frees aren't possible — instead, a reset() call releases all allocations at once. This pattern is common in inference engines because tensor lifetimes are predictable (allocated at layer entry, freed at layer exit), so the overhead of general-purpose allocation is unnecessary.

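The allocator aligns every allocation to 64 bytes, the cache-line size on most x86-64 CPUs. Because each allocation starts on its own cache line, two separate allocations never share one, which avoids false sharing in multi-threaded scenarios: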

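#include <cstddef>    // std::byte, size_t
#include <cstdlib>    // std::aligned_alloc, std::free
#include <limits>     // std::numeric_limits
#include <memory>     // std::align
#include <new>        // std::bad_alloc
#include <stdexcept>  // std::runtime_error

constexpr size_t ALIGNMENT = 64;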

class ArenaAllocator {
public:
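    explicit ArenaAllocator(size_t size_bytes)
        // aligned_alloc is specified for sizes that are a multiple of the
        // alignment, so round the requested capacity up to the next multiple
      : m_total_size((size_bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT),
        m_offset(0) {
        m_memory = static_cast<std::byte*>(
          std::aligned_alloc(ALIGNMENT, m_total_size)
        );
        if (!m_memory) {
            throw std::runtime_error("Failed to allocate memory for arena");
        }
    }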

    ~ArenaAllocator() {
        std::free(m_memory);  // Must use free, not delete, to match aligned_alloc
    }

    // Prevent double-free from copies
    ArenaAllocator(const ArenaAllocator&) = delete;
    ArenaAllocator& operator=(const ArenaAllocator&) = delete;

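    template<typename T>
    T* alloc(size_t count) {
        // Reject counts whose byte size would overflow size_t
        if (count > std::numeric_limits<size_t>::max() / sizeof(T)) {
            throw std::bad_alloc();
        }
        size_t bytes_needed = count * sizeof(T);
        std::byte* start = m_memory + m_offset;
        void* ptr = start;
        size_t space = m_total_size - m_offset;

        // Align to whichever is stricter: the type's natural alignment or ALIGNMENT.
        // On success std::align moves ptr forward and shrinks space by the padding.
        if (!std::align(
              alignof(T) > ALIGNMENT ? alignof(T) : ALIGNMENT,
              bytes_needed, ptr, space)) {
            throw std::bad_alloc();
        }

        size_t padding = static_cast<size_t>(static_cast<std::byte*>(ptr) - start);
        m_offset += padding + bytes_needed;
        // Returns raw storage; no T objects are constructed here
        return static_cast<T*>(ptr);
    }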

    void reset() { m_offset = 0; }
    size_t used() const { return m_offset; }
    size_t capacity() const { return m_total_size; }

private:
    std::byte* m_memory;
    size_t m_total_size;
    size_t m_offset;
};

Tensor

The tensor wrapper is a non-owning view over already-allocated memory, built on C++23's std::mdspan. The version shown here wraps a std::vector; wrapping arena memory would need a constructor that takes a pointer instead. Either way, the storage is exposed as a multidimensional view with compile-time rank and runtime extents:

template <typename T, size_t Rank>
class Tensor {
public:
  using Layout = std::layout_right; // Row-major
  using Extents = std::dextents<size_t, Rank>;
  using View = std::mdspan<T, Extents, Layout>;

  Tensor(std::vector<T>& storage, std::array<size_t, Rank> shape)
    : m_view(storage.data(), Extents(shape)) {
      size_t total = 1;
      for (auto s : shape) total *= s;
      assert(storage.size() >= total);
  }

  View view() const { return m_view; }

  // 2D element access, available only when Rank == 2.
  // The m_view[i, j] call uses C++23's multidimensional subscript.
  T& operator()(size_t i, size_t j) requires (Rank == 2) {
      return m_view[i, j];
  }

private:
  View m_view;
};

The requires (Rank == 2) clause on operator() is a C++20 constraint; the C++23 part is the multidimensional m_view[i, j] subscript inside the function. The clause removes the 2D accessor from tensors of any other rank, so calling it on one is a compile error at the call site.

Difficult Parts

Arena Allocator Memory Management

The alloc method has to handle alignment correctly for arbitrary types. std::align adjusts the pointer forward to meet alignment requirements, which means there can be padding between allocations. The allocator tracks the total consumed bytes (padding + data) to keep the offset correct. Getting this arithmetic right — especially the relationship between std::align's mutation of the ptr and space parameters and the actual offset advancement — required careful reasoning about pointer arithmetic.
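The interaction between std::align and the offset can be checked in isolation. The following is a minimal sketch, not the project's code: probe_align and AlignResult are hypothetical names introduced here to report what std::align does to ptr and space for a given starting offset.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical helper type: records the effect of one std::align call.
struct AlignResult {
    bool fits;                // did the aligned block fit in the remaining space?
    std::size_t padding;      // bytes skipped to reach the aligned address
    std::size_t space_after;  // value of `space` after the call
};

// Hypothetical helper: mirrors the arithmetic in alloc() without allocating.
inline AlignResult probe_align(std::byte* base, std::size_t offset,
                               std::size_t capacity, std::size_t alignment,
                               std::size_t bytes) {
    std::byte* start = base + offset;
    void* ptr = start;
    std::size_t space = capacity - offset;
    if (!std::align(alignment, bytes, ptr, space)) {
        // On failure std::align leaves ptr and space unchanged.
        return {false, 0, space};
    }
    // On success ptr has moved forward and space has shrunk by the padding
    // only; the allocation size itself is not subtracted.
    auto padding = static_cast<std::size_t>(static_cast<std::byte*>(ptr) - start);
    return {true, padding, space};
}
```

With a 64-byte-aligned buffer of 256 bytes and an offset of 10, a 100-byte request skips 54 bytes of padding, so the next offset would be 10 + 54 + 100 = 164. The allocator has to add both terms itself because std::align only accounts for the padding.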

Another subtlety: aligned_alloc must be paired with std::free, not delete. The destructor uses free, and the copy constructor/assignment are deleted to prevent double-free from accidental copies.
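One way to make the free/delete pairing hard to get wrong is to hand the pointer to a std::unique_ptr with a deleter that calls std::free. This is a sketch of that alternative, not what the arena above does: FreeDeleter, AlignedBuffer, round_up, and make_aligned_buffer are hypothetical names. It also rounds the size up, since std::aligned_alloc is specified for sizes that are a multiple of the alignment; some implementations accept other sizes, but relying on that is not portable.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Deleter that releases memory with std::free, matching std::aligned_alloc.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

// Move-only owner: copying is disabled by unique_ptr itself.
using AlignedBuffer = std::unique_ptr<std::byte, FreeDeleter>;

// Rounds n up to the next multiple of alignment (alignment must be non-zero).
constexpr std::size_t round_up(std::size_t n, std::size_t alignment) {
    return (n + alignment - 1) / alignment * alignment;
}

// Assumes a platform that provides std::aligned_alloc (not all do).
inline AlignedBuffer make_aligned_buffer(std::size_t size_bytes,
                                         std::size_t alignment = 64) {
    void* p = std::aligned_alloc(alignment, round_up(size_bytes, alignment));
    if (!p) {
        throw std::bad_alloc();
    }
    return AlignedBuffer(static_cast<std::byte*>(p));
}
```

Holding an AlignedBuffer as a member would make the enclosing class move-only by default, so the hand-written destructor and the deleted copy operations would no longer be needed.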

mdspan for Multidimensional Views

std::mdspan is one of C++23's most powerful additions but also one of its most complex. The template parameters encode the element type, extents (dimensions), and memory layout. Using std::dextents<size_t, Rank> means all dimensions are determined at runtime, while the rank is fixed at compile time. This gives flexibility (any shape of a given rank) while still allowing the compiler to optimize stride calculations.
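To make the stride calculation concrete, here is a hand-written sketch of the row-major offset that a layout_right mapping corresponds to. It deliberately avoids std::mdspan so it compiles on toolchains without the header; row_major_offset is a hypothetical helper, not part of the project.

```cpp
#include <array>
#include <cstddef>

// Row-major offset: the last index varies fastest.
// Evaluated with Horner's rule, so no stride table has to be stored.
// Rank is a compile-time constant, so the loop bound is known to the compiler.
template <std::size_t Rank>
constexpr std::size_t row_major_offset(const std::array<std::size_t, Rank>& shape,
                                       const std::array<std::size_t, Rank>& index) {
    std::size_t offset = 0;
    for (std::size_t d = 0; d < Rank; ++d) {
        offset = offset * shape[d] + index[d];
    }
    return offset;
}
```

For a shape of {2, 3, 4}, the index {1, 2, 3} maps to ((1 * 3) + 2) * 4 + 3 = 23, the last of the 24 elements.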

The non-owning nature of the view is intentional for inference engines — tensors are just windows into arena-allocated memory, so they can be created and destroyed cheaply without any allocation overhead.
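The cost of a non-owning view can be illustrated with a stripped-down stand-in. View2D below is a hypothetical type, not the project's Tensor: it holds only a pointer and two extents, which is roughly the state a rank-2 view with runtime extents has to carry.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a rank-2 tensor view: a pointer plus two extents.
template <typename T>
struct View2D {
    T* data;
    std::size_t rows;
    std::size_t cols;

    // Row-major element access; no bounds checking in this sketch.
    T& at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Copying a view copies the pointer and extents, never the elements.
static_assert(std::is_trivially_copyable_v<View2D<float>>);
```

Two views with different shapes can sit over the same buffer, and a write through one is visible through the other. That aliasing is the point of the design, but it also means a view must not outlive the storage it points into, for example across an arena reset().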

C++20 `requires` Clauses

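The requires (Rank == 2) clause on operator() uses the constraint system introduced in C++20. Without it, calling operator()(i, j) on a 3D tensor would still fail to compile, because std::mdspan's own subscript operator only accepts as many indices as the rank, but the error would surface inside the function body rather than at the call. The constraint moves the failure to the call site with a clearer diagnostic, and it lets other code detect whether the accessor exists. Understanding when to use requires clauses versus template specialization versus if constexpr is one of the learning curves of modern C++.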
