Workspace


I wanted to learn how a deep learning framework is structured.
So I built one.

Here’s the repo:

It’s slow. Maybe it’s inefficient. The code is messy.
But it was worth it.
It runs.

Trained on Tiny Shakespeare.

[prompt]

Darth Vader: Join me, and I will complete your training. With our combined strength, we can end this destructive conflict and bring order to the galaxy.

Luke: I’LL NEVER JOIN YOU!!

Darth Vader: If you only knew the power of the Dark Side. Obi-Wan never told you what happened to your father.

Luke: He told me enough! He told me you killed him!

Darth Vader: No. I, am your father.

[generated]

Darth Vader: Join me, and I will complete your training. With our combined strength, we can end this destructive conflict and bring order to the galaxy.

Luke: I’LL NEVER JOIN YOU!!

Darth Vader: If you only knew the power of the Dark Side. Obi-Wan never told you what happened to your father.

Luke: He told me enough! He told me you killed him!

Darth Vader: No. I, am your father.

I know how? But that is to the truth of war to return: Lord: I am king, might put What man so beaten as ever see your fair a paragon to crow, as it, he hath sorrow, be found him not thyself. His fault? And I banish’d away.

As cheap yea, welcome. Away with us, I continue BRUTUS: so be with mistress

Then are my mind, by the soul: be in a Of raise his state to the truth of any thing; and effect slain

Loss: 10.47 → 0.9. That’s it.

What inspired this


I started learning deep learning with the Deep Learning from Scratch series (volumes 1, 2, 3) — the Korean translation.
Here’s the author’s page on Amazon.
The books walk through neural networks using only NumPy, which makes them great for understanding the fundamentals.

But as I followed the code, it started to feel mechanical.
I wasn’t really thinking — just typing.

So I thought: maybe I need to switch languages, something that would force me to use my brain more.

The thing is, I’m a beginner at programming, and I only know a few languages.
Aside from Python, they’re all low-level: C, C++, Rust, and Zig.
Not exactly the easiest tools for building a deep learning framework —
but they were all I had.

Then I came across llm.c by Karpathy:
an entire GPT-2 implementation written in C.
It blew my mind.

If someone could build GPT-2 in C, maybe I could too.
That project gave me the motivation to try building one myself in a low-level language.

I chose Zig, since I’d recently become obsessed with it.

How I built it


Memory management


I decided to allocate everything up front and free it all at once.
(Think ArenaAllocator, or region-based memory management.)

But there was a catch:
I still needed to call destructors manually during deallocation,
because Functions and Variables own resources (like GPU tensors)
that freeing the pool alone wouldn’t release.

So I needed to be able to iterate over them.

I used Zig’s std.heap.MemoryPool.
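
A minimal sketch of the idea (my own example, not the project’s code): the pool hands out fixed-size nodes, and a single deinit() releases them all.

const std = @import("std");

test "memory pool: many creates, one deinit" {
    var pool = std.heap.MemoryPool(u64).init(std.testing.allocator);
    defer pool.deinit(); // releases every node at once

    const a = try pool.create();
    const b = try pool.create();
    a.* = 1;
    b.* = 2;
    // No per-node destroy() needed before deinit().
}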

To support iteration, I added prev and next pointers to both Function and Variable,
effectively making them doubly linked lists.

Now I can traverse the chain and call destructors before freeing the pool.

chain.zig (the prev/next pointers live in function.zig and variable.zig)

// Chain deallocates everything at once.
pub fn destroyFunctions(self: *Chain) void {
    while (self.func_chain) |head| {
        const next = head.next;
        head.destroy();
        self.func_chain = next;
    }
}
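
The linking side is the mirror image. Roughly (a sketch, not the actual code), a new function gets pushed onto the head of the chain:

// Sketch: pushing a newly created Function onto the chain head.
// Field names follow the snippet above; the function name is mine.
fn pushFunction(self: *Chain, new_func: *Function) void {
    new_func.prev = null;
    new_func.next = self.func_chain;
    if (self.func_chain) |head| head.prev = new_func;
    self.func_chain = new_func;
}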

Inheriting without inheritance


Zig has no inheritance — but deep learning operators tend to share a lot of logic.

My early implementation had too much duplication:
each function type (like Neg, Add, etc.) needed its own forward, backward, destroy logic.
Many of them were just slight variations on the same pattern.

So I reused the structure using something inspired by CRTP (Curiously Recurring Template Pattern).

Each function defines a Self type and passes it into a reusable decorator struct.
Inside that struct, I implemented common behavior once —
like forward, backward, destroy, and enqueue.
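
The underlying trick is small. Here’s a toy version with the framework details stripped away (Describable and Answer are illustrative names, not part of the project):

const std = @import("std");

// The mixin idea in miniature: a generic struct that calls back into
// the concrete type it was instantiated with, much like CRTP in C++.
fn Describable(comptime Self: type) type {
    return struct {
        pub fn describe(self: *const Self) void {
            std.debug.print("{s} = {d}\n", .{ Self.name, self.value() });
        }
    };
}

const Answer = struct {
    pub const name = "answer";
    pub usingnamespace Describable(Answer);

    pub fn value(_: *const Answer) i32 {
        return 42;
    }
};

pub fn main() void {
    const a = Answer{};
    a.describe(); // prints "answer = 42"
}

The real decorators below do the same thing, just with more shared behavior.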

function1in1out.zig (see also function.zig)

pub fn FuncDecorator1in1outBase(comptime Self: type) type {
    return struct {
        pub fn forwardDecorated(ctx: *anyopaque, args: []*TaggedVar, out: []?*TaggedVar) !void {
            const self: *Self = @ptrCast(@alignCast(ctx));
            self.in = args[0];
            var y = try self.forward(&self.in.?.asUntaggedConst(Self.In).data);
            self.out = try self.base.chain.createVariable(Self.Out, y.move(), null);
            out[0] = self.out.?;
        }

        pub fn backwardDecorated(ctx: *anyopaque) !void {
            const self: *Self = @ptrCast(@alignCast(ctx));
            const gx = try self.backward(self.out.?.asUntaggedConst(Self.Out).grad.?);
            self.in.?.setGrad(gx);
        }

        // destroy, enqueue, getGeneration... etc.
    };
}

pub fn FuncDecorator1in1out(comptime Self: type) type {
    return struct {
        const Base = FuncDecorator1in1outBase(Self);

        // create...
    };
}

Then each actual operator can reuse the implementation (here T, the tensor element type, is bound by an enclosing comptime function not shown):

pub const Neg = struct {
    in: ?*TaggedVar,
    out: ?*TaggedVar,
    base: FunctionBase,

    pub const In = T;
    pub const Out = T;

    pub usingnamespace FuncDecorator1in1out(Self);
    const Self = Neg;

    pub fn forward(self: *Self, x: *const GPUTensor(T)) !GPUTensor(T) {
        var y = try x.cloneAsync(self.base.context.stream);
        try y.scale(-1.0, self.base.context.stream);
        return y.move();
    }

    pub fn backward(self: *Self, gy: *TaggedVar) !*TaggedVar {
        return try negEx(T, gy, self.base.chain);
    }
};

This kept the interface clean, removed lots of duplicated code,
and still gave each function its own forward/backward logic.

Even more complex functions — like Linear, which takes three inputs —
use the same pattern with just a few customizations.

Layers are just data


In this project, I didn’t use type erasure for layers.

Unlike Function, layers don’t need to be queued, erased, or treated polymorphically.
So instead of dynamic dispatch, I relied on metaprogramming.

Each layer is just a struct containing its trainable parameters (?*TaggedVar fields) and its sublayers.

To avoid boilerplate, I use LayerFieldsFactory()
to generate these fields at compile time.

layer.zig

// Generates a struct with named fields for parameters and sublayers
pub fn LayerFieldsFactory(
    comptime param_names: []const [:0]const u8,
    comptime layer_names_types: []const std.meta.Tuple(&.{ [:0]const u8, type }),
) type {
    var fields_info: [param_names.len + layer_names_types.len]std.builtin.Type.StructField = undefined;

    var i: comptime_int = 0;
    for (param_names) |name| {
        fields_info[i] = .{
            .name = name,
            .type = ?*TaggedVar,
            .alignment = @alignOf(?*TaggedVar),
            .default_value_ptr = null,
            .is_comptime = false,
        };
        i += 1;
    }
    for (layer_names_types) |entry| {
        fields_info[i] = .{
            .name = entry[0],
            .type = entry[1],
            .alignment = @alignOf(entry[1]),
            .default_value_ptr = null,
            .is_comptime = false,
        };
        i += 1;
    }

    return @Type(.{ .@"struct" = .{
        .layout = .auto,
        .fields = &fields_info,
        .decls = &.{},
        .is_tuple = false,
        .backing_integer = null,
    } });
}
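
To make the expansion concrete, here’s what a call produces (Activation stands in for any sublayer type):

// This call...
const Fields = LayerFieldsFactory(
    &.{ "w", "b" },
    &.{.{ "act", Activation }},
);

// ...yields, at compile time, the equivalent of:
//
// const Fields = struct {
//     w: ?*TaggedVar,
//     b: ?*TaggedVar,
//     act: Activation,
// };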

Then LayerDecorator(Self) walks those fields at compile time to collect all parameters, destroy them recursively, or serialize to JSON/Binary.

There’s no dynamic dispatch — just plain Zig structs and compile-time recursion.

pub fn LayerDecorator(comptime Self: type) type {
    return struct {
        // Extracts all trainable parameters in order
        pub fn getParams(self: *Self) [@This().calcParamNum()]?*TaggedVar {
            const info = @typeInfo(@FieldType(Self, "fields")).@"struct";
            var params: [@This().calcParamNum()]?*TaggedVar = undefined;

            var i: usize = 0;
            inline for (info.fields) |field| {
                if (field.type == ?*TaggedVar) {
                    params[i] = @field(self.fields, field.name);
                    i += 1;
                } else {
                    var sub = @field(self.fields, field.name).getParams();
                    for (sub) |param| {
                        params[i] = param;
                        i += 1;
                    }
                }
            }
            return params;
        }
        
        // Other helpers: clearGrads, saveJsonStringField, saveBinary...
    };
}

Even complex modules like CausalSelfAttention use the exact same mechanism:

pub fn CausalSelfAttention(comptime T: type) type {
    return struct {
        pub usingnamespace LayerDecorator(Self);

        fields: LayerFieldsFactory(
            &.{},
            &.{
                .{ "c_attn", Linear(T) },
                .{ "c_proj", Linear(T) },
                .{ "attn_dropout", Dropout(T) },
                .{ "resid_dropout", Dropout(T) },
            },
        ),
        n_head: usize,
        n_embd: usize,
        bias: GPUTensor(T),
        context: *Context,
        // ...
    };
}
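
A layer that mixes direct parameters with sublayers works the same way (a sketch; Block and gamma are my names, not the project’s): getParams() yields gamma first, then recurses into each Linear.

pub fn Block(comptime T: type) type {
    return struct {
        pub usingnamespace LayerDecorator(Self);

        fields: LayerFieldsFactory(
            &.{"gamma"}, // one direct ?*TaggedVar parameter
            &.{
                .{ "c_fc", Linear(T) },
                .{ "c_proj", Linear(T) },
            },
        ),

        const Self = @This();
    };
}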

Everything is plain data. Recursion handles nesting. Serialization is automatic. And no dyn Layer needed.

Grand Theft Tokenizer


To train GPT-2, I needed a tokenizer.

But writing a Byte Pair Encoding (BPE) tokenizer from scratch —
especially one that’s fast enough — was out of reach for me at the time.
So I stole one. From Python.

export_tokenizer_zig.py

from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer(...)
tokenizer.train(...)
tokenizer.save_model(...)

I used HuggingFace’s ByteLevelBPETokenizer, trained it on my own dataset, and added some custom special tokens.

Then I exported the vocab, merges, and decoder map into raw .zig files using a Python script:

def export_zig_files():
    os.makedirs(zig_output_dir, exist_ok=True)
    with open(f"datas/{model_prefix}-vocab.json", "r", encoding="utf-8") as vf:
        vocab = json.load(vf)
    sorted_vocab_by_id = sorted(vocab.items(), key=lambda x: x[1])

    with open(f"{zig_output_dir}/merges_data.zig", "w", encoding="utf-8") as f:
        f.write('// ⚠️ This file is autogenerated. Do not modify directly.\n')
        f.write('// Generated by export_tokenizer_zig.py\n\n')
        f.write("pub const merges_data = [_]struct { []const u8, []const u8 }{\n")
        with open(f"datas/{model_prefix}-merges.txt", "r", encoding="utf-8") as mf:
            for line in mf:
                line = line.strip()
                # (...) write each pair as a .{ "left", "right" } entry

Then I got a nice statically defined Zig file like this:

merges_data.zig

// ⚠️ This file is autogenerated. Do not modify directly.
pub const merges_data = [_]struct { []const u8, []const u8 }{
    .{ "Ġ", "t" },
    .{ "h", "e" },
    .{ "Ġ", "a" },
    .{ "o", "u" },
    .{ "Ġ", "s" },
    ...
};

And I initialized the merge table using std.StaticStringMap — so it’s all compile-time, with zero runtime bloat!

merge_map.zig

const std = @import("std");

const merges_data = @import("merges_data.zig").merges_data;

pub const merge_entries = blk: {
    var entries: [merges_data.len]struct { []const u8, usize } = undefined;
    @setEvalBranchQuota(99999999);
    for (&merges_data, &entries, 0..) |pair, *entry, i| {
        const joined = pair[0] ++ " " ++ pair[1]; // concat with space
        entry.* = .{ joined, @intCast(i) };
    }
    break :blk entries;
};

pub const merge_map: std.StaticStringMap(usize) = .initComptime(merge_entries);
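
During encoding, the table might be consulted like this (a sketch; mergeRank is my name, not the project’s API):

const std = @import("std");
const merge_map = @import("merge_map.zig").merge_map;

// Returns the merge priority of a token pair, or null if the pair
// never merges. Lower rank means the pair merges earlier.
fn mergeRank(left: []const u8, right: []const u8) ?usize {
    var buf: [128]u8 = undefined; // enough for typical BPE pairs
    const key = std.fmt.bufPrint(&buf, "{s} {s}", .{ left, right }) catch return null;
    return merge_map.get(key);
}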

But then I realized something:

My custom Zig tokenizer was way too slow to tokenize large corpus files at training time.

So I took another shortcut: I tokenized everything in advance (using Python) and saved the token IDs to binary files.

Then in Zig, I just loaded them like this:

dataset.zig

pub fn init(allocator: Allocator, tokenizer: *const BpeTokenizer, token_paths: []const []const u8) !Self {
    var token_ids: std.ArrayList(usize) = .init(allocator);
    defer token_ids.deinit();

    for (token_paths) |path| {
        var file = try std.fs.cwd().openFile(path, .{});
        defer file.close();
        var buffered = std.io.bufferedReader(file.reader());
        const reader = buffered.reader();

        const count = try reader.readInt(usize, .little);
        try token_ids.appendNTimes(0, count);
        try reader.readNoEof(std.mem.sliceAsBytes(token_ids.items[token_ids.items.len - count ..]));
    }

    ...
}
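
The on-disk format is just a little-endian count followed by the raw IDs. A matching writer in Zig would look roughly like this (a sketch for test fixtures; the real files come from the Python script, and this assumes usize is 64-bit on both sides):

// Sketch: writer counterpart to the loader above.
fn writeTokens(path: []const u8, ids: []const usize) !void {
    var file = try std.fs.cwd().createFile(path, .{});
    defer file.close();
    try file.writer().writeInt(usize, ids.len, .little);
    try file.writer().writeAll(std.mem.sliceAsBytes(ids));
}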

I know, it’s not elegant. But,

“Good artists copy. Great artists steal.”

I stole a tokenizer.

Final thoughts


That’s it for now.

Maybe I’ll clean it up. Maybe I won’t.
But it runs. And I learned a lot.

And if you’re wondering — yes, you can build deep learning from scratch.
Even in Zig.