I wanted to learn how a deep learning framework is structured.
So I built one.
Here are my repos:
- Tensor Object → Tomo (Tomo: Object for Mathematical Operation)
- Core Framework → Tomorin (TOMO Runs In Neural network)
- GPT-2 Training → Nina (Nina Is Not AI)
It’s slow. Maybe it’s inefficient. The code is messy.
But it was worth it.
It runs.
Trained on Tiny Shakespeare.
[prompt]
Darth Vader: Join me, and I will complete your training. With our combined strength, we can end this destructive conflict and bring order to the galaxy.
Luke: I’LL NEVER JOIN YOU!!
Darth Vader: If you only knew the power of the Dark Side. Obi-Wan never told you what happened to your father.
Luke: He told me enough! He told me you killed him!
Darth Vader: No. I, am your father.
[generated]
Darth Vader: Join me, and I will complete your training. With our combined strength, we can end this destructive conflict and bring order to the galaxy.
Luke: I’LL NEVER JOIN YOU!!
Darth Vader: If you only knew the power of the Dark Side. Obi-Wan never told you what happened to your father.
Luke: He told me enough! He told me you killed him!
Darth Vader: No. I, am your father.
I know how? But that is to the truth of war to return: Lord: I am king, might put What man so beaten as ever see your fair a paragon to crow, as it, he hath sorrow, be found him not thyself. His fault? And I banish’d away.
As cheap yea, welcome. Away with us, I continue BRUTUS: so be with mistress
Then are my mind, by the soul: be in a Of raise his state to the truth of any thing; and effect slain
Loss: 10.47 → 0.9. That’s it.
What inspired this
I started learning deep learning with the Deep Learning from Scratch series (volumes 1, 2, 3) — the Korean translation.
Here’s the author’s page on Amazon.
The books walk through neural networks using only NumPy, which makes them great for understanding the fundamentals.
But as I followed the code, it started to feel mechanical.
I wasn’t really thinking — just typing.
So I thought: maybe I need to switch languages, something that would force me to use my brain more.
The thing is, I’m a beginner at programming, and I only know a few languages.
Aside from Python, they’re all low-level: C, C++, Rust, and Zig.
Not exactly the easiest tools for building a deep learning framework —
but they were all I had.
Then I came across llm.c by Karpathy:
an entire GPT-2 implementation written in C.
It blew my mind.
If someone could build GPT-2 in C, maybe I could too.
That project gave me the motivation to try building one myself in a low-level language.
I chose Zig, since I’d recently become obsessed with it.
How I built it
Memory management
I decided to allocate everything up front and free it all at once.
(Think ArenaAllocator, or region-based memory management.)
But there was a catch:
I still needed to call destructors manually during deallocation, because:
- Function uses heap allocation
- Variable uses GPU memory allocation

So I needed to be able to iterate over them.
I used MemoryPool:
- allocations were fast
- pointer lifetimes matched the container’s
To support iteration, I added prev and next pointers to both Function and Variable, effectively making them doubly linked lists.
Now I can traverse the chain and call destructors before freeing the pool.
→ chain.zig
→ function.zig
→ variable.zig
// Chain deallocates everything at once.
pub fn destroyFunctions(self: *Chain) void {
    while (self.func_chain) |head| {
        const next = head.next;
        head.destroy();
        self.func_chain = next;
    }
}
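Putting those pieces together, here’s a minimal standalone sketch of the idea, using a hypothetical Node type rather than the project’s actual Function/Variable:

const std = @import("std");

// Hypothetical pooled node: it links itself into an intrusive doubly
// linked list so destructors can run before the pool frees memory.
const Node = struct {
    prev: ?*Node = null,
    next: ?*Node = null,

    fn destroy(self: *Node) void {
        // Release owned resources here (heap buffers, GPU memory, ...).
        _ = self;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var pool = std.heap.MemoryPool(Node).init(gpa.allocator());
    defer pool.deinit(); // frees every node’s memory at once, arena-style

    // Create nodes and push them onto the chain.
    var head: ?*Node = null;
    for (0..3) |_| {
        const node = try pool.create(); // fast pool allocation
        node.* = .{ .next = head };
        if (head) |h| h.prev = node;
        head = node;
    }

    // Walk the chain and call destructors before the pool is freed.
    while (head) |h| {
        head = h.next;
        h.destroy();
    }
}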
Inheriting without inheritance
Zig has no inheritance — but deep learning operators tend to share a lot of logic.
My early implementation had too much duplication:
each function type (like Neg, Add, etc.) needed its own forward, backward, and destroy logic.
Many of them were just slight variations on the same pattern.
So I reused the structure using something inspired by CRTP (the Curiously Recurring Template Pattern).
Each function defines a Self type and passes it into a reusable decorator struct.
Inside that struct, I implemented common behavior once, like forward, backward, destroy, and enqueue.
→ function.zig
→ function1in1out.zig
pub fn FuncDecorator1in1outBase(comptime Self: type) type {
    return struct {
        pub fn forwardDecorated(ctx: *anyopaque, args: []*TaggedVar, out: []?*TaggedVar) !void {
            const self: *Self = @ptrCast(@alignCast(ctx));
            self.in = args[0];
            var y = try self.forward(&self.in.?.asUntaggedConst(Self.In).data);
            self.out = try self.base.chain.createVariable(Self.Out, y.move(), null);
            out[0] = self.out.?;
        }

        pub fn backwardDecorated(ctx: *anyopaque) !void {
            const self: *Self = @ptrCast(@alignCast(ctx));
            const gx = try self.backward(self.out.?.asUntaggedConst(Self.Out).grad.?);
            self.in.?.setGrad(gx);
        }

        // destroy, enqueue, getGeneration... etc.
    };
}

pub fn FuncDecorator1in1out(comptime Self: type) type {
    return struct {
        const Base = FuncDecorator1in1outBase(Self);
        // create...
    };
}
Then each actual operator can reuse the implementation with:
pub const Neg = struct {
    in: ?*TaggedVar,
    out: ?*TaggedVar,
    base: FunctionBase,

    pub const In = T;
    pub const Out = T;

    pub usingnamespace FuncDecorator1in1out(Self);

    const Self = Neg;

    pub fn forward(self: *Self, x: *const GPUTensor(T)) !GPUTensor(T) {
        var y = try x.cloneAsync(self.base.context.stream);
        try y.scale(-1.0, self.base.context.stream);
        return y.move();
    }

    pub fn backward(self: *Self, gy: *TaggedVar) !*TaggedVar {
        return try negEx(T, gy, self.base.chain);
    }
};
This kept the interface clean, removed lots of duplicated code,
and still gave each function its own forward/backward logic.
Even more complex functions, like Linear, which takes three inputs, use the same pattern with just a few customizations.
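If the pattern is unfamiliar, here’s the whole trick in miniature. This is a standalone sketch with hypothetical names (Mixin, Square), not code from the project:

const std = @import("std");

// A comptime "decorator": it receives the concrete type and returns a
// struct of methods written against that type, CRTP-style.
fn Mixin(comptime Self: type) type {
    return struct {
        pub fn describe(self: *const Self) void {
            // Self.name is a declaration the concrete type must provide.
            std.debug.print("{s}: {d}\n", .{ Self.name, self.value });
        }
    };
}

const Square = struct {
    value: i32,
    pub const name = "square";

    // Pull the shared method set into this type's namespace.
    pub usingnamespace Mixin(Square);
};

pub fn main() void {
    const s: Square = .{ .value = 4 };
    s.describe(); // prints "square: 4"
}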
Layers are just data
In this project, I didn’t use type erasure for layers.
Unlike Function, layers don’t need to be queued, erased, or treated polymorphically.
So instead of dynamic dispatch, I relied on metaprogramming.
Each layer is just a struct containing:
- trainable parameters (like w, b)
- other layers (like Linear, Dropout, etc.)
- extra fields (context, chain, shape info…)
To avoid boilerplate, I use LayerFieldsFactory() to generate these fields at compile time.
// Generates a struct with named fields for parameters and sublayers
pub fn LayerFieldsFactory(
    comptime param_names: []const [:0]const u8,
    comptime layer_names_types: []const std.meta.Tuple(&.{ [:0]const u8, type }),
) type {
    var fields_info: [param_names.len + layer_names_types.len]std.builtin.Type.StructField = undefined;
    var i: comptime_int = 0;

    // Parameter fields: all optional pointers to TaggedVar.
    for (param_names) |name| {
        fields_info[i] = .{
            .name = name,
            .type = ?*TaggedVar,
            .alignment = @alignOf(?*TaggedVar),
            .default_value_ptr = null,
            .is_comptime = false,
        };
        i += 1;
    }

    // Sublayer fields: each entry is a (name, layer type) pair.
    for (layer_names_types) |entry| {
        fields_info[i] = .{
            .name = entry[0],
            .type = entry[1],
            .alignment = @alignOf(entry[1]),
            .default_value_ptr = null,
            .is_comptime = false,
        };
        i += 1;
    }

    return @Type(.{ .@"struct" = .{
        .layout = .auto,
        .fields = &fields_info,
        .decls = &.{},
        .is_tuple = false,
        .backing_integer = null,
    } });
}
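For example, a call like the following (hypothetical field names, not from the project) expands into an ordinary struct type at compile time:

// Hypothetical usage: roughly equivalent to declaring
//   struct { w: ?*TaggedVar, b: ?*TaggedVar, drop: Dropout(T) }
const Fields = LayerFieldsFactory(
    &.{ "w", "b" },
    &.{ .{ "drop", Dropout(T) } },
);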
Then LayerDecorator(Self) walks those fields at compile time to collect all parameters, destroy them recursively, or serialize them to JSON or binary.
There’s no dynamic dispatch — just plain Zig structs and compile-time recursion.
pub fn LayerDecorator(comptime Self: type) type {
    return struct {
        // Extracts all trainable parameters in order,
        // recursing into sublayer fields at compile time.
        pub fn getParams(self: *Self) [@This().calcParamNum()]?*TaggedVar {
            const info = @typeInfo(@FieldType(Self, "fields")).@"struct";
            var params: [@This().calcParamNum()]?*TaggedVar = undefined;
            var i: usize = 0;
            inline for (info.fields) |field| {
                if (field.type == ?*TaggedVar) {
                    // A parameter field: take it directly.
                    params[i] = @field(self.fields, field.name);
                    i += 1;
                } else {
                    // A sublayer field: splice in its parameters.
                    const sub = @field(self.fields, field.name).getParams();
                    for (sub) |param| {
                        params[i] = param;
                        i += 1;
                    }
                }
            }
            return params;
        }

        // Other helpers: clearGrads, saveJsonStringField, saveBinary...
    };
}
Even complex modules like CausalSelfAttention use the exact same mechanism:
pub fn CausalSelfAttention(comptime T: type) type {
    return struct {
        const Self = @This();

        pub usingnamespace LayerDecorator(Self);

        fields: LayerFieldsFactory(
            &.{},
            &.{
                .{ "c_attn", Linear(T) },
                .{ "c_proj", Linear(T) },
                .{ "attn_dropout", Dropout(T) },
                .{ "resid_dropout", Dropout(T) },
            },
        ),
        n_head: usize,
        n_embd: usize,
        bias: GPUTensor(T),
        context: *Context,
        // ...
    };
}
Everything is plain data.
Recursion handles nesting.
Serialization is automatic.
And no dyn Layer needed.
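As a quick usage sketch (model here stands for a hypothetical instance built from these layers):

// Hypothetical usage: the array length is computed at compile time.
const params = model.getParams();
std.debug.print("trainable tensors: {d}\n", .{params.len});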
Grand Theft Tokenizer
To train GPT-2, I needed a tokenizer.
But writing a Byte Pair Encoding (BPE) tokenizer from scratch —
especially one that’s fast enough — was out of reach for me at the time.
So I stole one. From Python.
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer(...)
tokenizer.train(...)
tokenizer.save_model(...)
I used HuggingFace’s pretrained tokenizer and trained a new one on my own dataset, adding a few custom special tokens.
Then I exported the vocab, merges, and decoder map into raw .zig files using a Python script:
def export_zig_files():
    os.makedirs(zig_output_dir, exist_ok=True)

    with open(f"datas/{model_prefix}-vocab.json", "r", encoding="utf-8") as vf:
        vocab = json.load(vf)
    sorted_vocab_by_id = sorted(vocab.items(), key=lambda x: x[1])

    with open(f"{zig_output_dir}/merges_data.zig", "w", encoding="utf-8") as f:
        f.write('// ⚠️ This file is autogenerated. Do not modify directly.\n')
        f.write('// Generated by export_tokenizer_zig.py\n\n')
        f.write("pub const merges_data = [_]struct { []const u8, []const u8 }{\n")
        with open(f"datas/{model_prefix}-merges.txt", "r", encoding="utf-8") as mf:
            for line in mf:
                line = line.strip()
                # (...)
Then I got a nice statically defined Zig file like this:
// ⚠️ This file is autogenerated. Do not modify directly.
pub const merges_data = [_]struct { []const u8, []const u8 }{
    .{ "Ġ", "t" },
    .{ "h", "e" },
    .{ "Ġ", "a" },
    .{ "o", "u" },
    .{ "Ġ", "s" },
    ...
};
And I initialized the merge table using std.StaticStringMap — so it’s all compile-time, with zero runtime bloat!
const std = @import("std");
const merges_data = @import("merges_data.zig").merges_data;

pub const merge_entries = blk: {
    var entries: [merges_data.len]struct { []const u8, usize } = undefined;
    @setEvalBranchQuota(99999999);
    for (&merges_data, &entries, 0..) |pair, *entry, i| {
        const joined = pair[0] ++ " " ++ pair[1]; // concat with space
        entry.* = .{ joined, @intCast(i) };
    }
    break :blk entries;
};

pub const merge_map: std.StaticStringMap(usize) = .initComptime(merge_entries);
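Looking up a candidate pair’s merge rank is then a single map query. A hypothetical sketch, with keys in the space-joined format built above:

// Lower rank = merge earlier; a null result means the pair never merges.
if (merge_map.get("Ġ t")) |rank| {
    std.debug.print("rank: {d}\n", .{rank});
}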
But then I realized something:
My custom Zig tokenizer was way too slow to tokenize large corpus files at training time.
So I took another shortcut: I tokenized everything in advance (using Python) and saved the token IDs to binary files.
Then in Zig, I just loaded them like this:
pub fn init(allocator: Allocator, tokenizer: *const BpeTokenizer, token_paths: []const []const u8) !Self {
    var token_ids: std.ArrayList(usize) = .init(allocator);
    defer token_ids.deinit();

    for (token_paths) |path| {
        var file = try std.fs.cwd().openFile(path, .{});
        defer file.close();

        // The buffered reader must live in a variable so we can take
        // a mutable pointer to it via .reader().
        var buffered = std.io.bufferedReader(file.reader());
        var reader = buffered.reader();

        // Each file starts with a little-endian token count,
        // followed by the raw token IDs.
        const count = try reader.readInt(usize, .little);
        try token_ids.appendNTimes(0, count);
        try reader.readNoEof(std.mem.sliceAsBytes(token_ids.items[token_ids.items.len - count ..]));
    }
    ...
}
I know, it’s not elegant. But:
“Good artists copy. Great artists steal.”
I stole a tokenizer.
Final thoughts
That’s it for now.
Maybe I’ll clean it up. Maybe I won’t.
But it runs. And I learned a lot.
And if you’re wondering — yes, you can build deep learning from scratch.
Even in Zig.