Low-Latency Programming
Kit provides low-level primitives for performance-critical applications where nanoseconds matter. These features give you direct access to CPU timing, branchless algorithms, SIMD operations, and zero-copy I/O.
CPU Timing
For cycle-accurate timing, Kit exposes the CPU timestamp counter directly. This is essential for measuring latency in performance-critical code paths.
Reading the Timestamp Counter
# Read CPU timestamp counter (non-serializing)
start = Time.rdtsc()
# ... your code ...
end = Time.rdtsc()
cycles = end - start
println "Elapsed cycles: ${cycles}"
# Serializing variant - waits for prior instructions to complete
# More accurate for benchmarks, slightly higher overhead
start = Time.rdtscp()
# ... your code ...
cycles = Time.rdtscp() - start
When to use which:
- Time.rdtsc() - Lower overhead, good for measuring longer operations
- Time.rdtscp() - Serializing, more accurate for micro-benchmarks
Platform Support
On x86_64, these use the RDTSC/RDTSCP instructions directly. On aarch64 (Apple Silicon), they use the CNTVCT_EL0 register. Other architectures fall back to high-resolution monotonic time.
Benchmarking
Kit provides two levels of benchmarking: single-run timing and statistical profiling.
Single-Run Timing
# Time a single operation with Time.measure
result = Time.measure fn => expensive-operation()
println "Result: ${result.result}"
println "Elapsed: ${result.elapsed_ns} ns"
println "Cycles: ${result.cycles}"
Time.measure takes a zero-argument function (thunk) and returns a record
containing the function's return value along with timing information.
Statistical Benchmarking
# Statistical benchmarking with warmup and multiple iterations
stats = Bench.run "order-insert" 1000 10000 fn => insert-order order-book order
# Returns comprehensive statistics
println "Benchmark: ${stats.name}"
println "Iterations: ${stats.iterations}"
println "Mean: ${stats.mean_ns} ns"
println "Median: ${stats.median_ns} ns"
println "Min: ${stats.min_ns} ns"
println "Max: ${stats.max_ns} ns"
println "P99: ${stats.p99_ns} ns"
println "Std Dev: ${stats.std_dev}"
println "Ops/sec: ${stats.ops_per_sec}"
Bench.run takes a name, warmup count, iteration count, and a thunk.
It discards warmup iterations, then collects timing data across all iterations
to produce statistical metrics including percentiles.
Binary Search
Kit provides binary search functions including branchless variants that avoid branch misprediction penalties in latency-critical code.
Standard Binary Search
sorted = [1, 3, 5, 7, 9, 11, 13, 15]
# Find exact match (returns Option Int - index if found)
index = List.binary-search 7 sorted # => Some 3
index = List.binary-search 8 sorted # => None
# With custom key function for records
items = [{name: "apple", price: 100}, {name: "banana", price: 200}]
get-price = fn(item) => item.price
index = List.binary-search-by get-price 200 items # => Some 1
Bound Functions
sorted = [1, 3, 5, 5, 5, 7, 9, 11]
# Lower bound: first index where element >= target
idx = List.lower-bound 5 sorted # => 2
# Upper bound: first index where element > target
idx = List.upper-bound 5 sorted # => 5
# Count occurrences efficiently
count = (List.upper-bound 5 sorted) - (List.lower-bound 5 sorted)
# => 3
Branchless Variants
# Branchless lower/upper bound for latency-critical paths
# Same result as standard versions, but with lower latency variance
idx = List.lower-bound-branchless 5 sorted # => 2
idx = List.upper-bound-branchless 5 sorted # => 5
Branchless variants use conditional moves instead of branches, eliminating branch misprediction penalties. Use them when you need consistent latency and the slight fixed overhead of conditional moves is acceptable.
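To see the difference on your own workload, you can benchmark both variants with Bench.run from the section above and compare the tail percentiles rather than the mean. A small sketch using only the APIs shown earlier:

# Compare latency spread of the standard and branchless variants
sorted = [1, 3, 5, 5, 5, 7, 9, 11]
standard = Bench.run "lower-bound" 1000 100000 fn => List.lower-bound 5 sorted
branchless = Bench.run "lower-bound-branchless" 1000 100000 fn => List.lower-bound-branchless 5 sorted
# The branchless version typically shows a tighter median-to-P99 spread
println "Standard P99: ${standard.p99_ns} ns"
println "Branchless P99: ${branchless.p99_ns} ns"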
Atomics
The kit-atomic package provides user-facing atomic operations
with explicit memory orderings.
Atomic Integers
import Atomic
# Create atomic integer
counter = Atomic.int-new 0
# Atomic operations with memory ordering
Atomic.int-store counter 42 :seq-cst
value = Atomic.int-load counter :seq-cst
# Fetch-and-modify operations
old = Atomic.int-fetch-add counter 1 :seq-cst
old = Atomic.int-fetch-sub counter 1 :seq-cst
# Compare-and-swap
success = Atomic.int-compare-swap counter 42 100 :seq-cst :seq-cst
# => true if counter was 42, now 100
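If the stored value does not match the expected value, the swap fails and the value is left unchanged:

# Failed compare-and-swap: counter is not 999, so nothing is written
success = Atomic.int-compare-swap counter 999 0 :seq-cst :seq-cst
# => false, counter still holds its previous value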
Memory Orderings
# Available memory orderings (keywords)
:relaxed # No synchronization, just atomicity
:acquire # Acquire semantics (for loads)
:release # Release semantics (for stores)
:acq-rel # Both acquire and release
:seq-cst # Sequential consistency (strongest)
# Example: producer-consumer pattern
data = Atomic.int-new 0
ready = Atomic.bool-new false
# Producer
Atomic.int-store data 42 :relaxed
Atomic.bool-store ready true :release # Ensures data is visible
# Consumer
if Atomic.bool-load ready :acquire # Synchronizes with release
    value = Atomic.int-load data :relaxed # Safe to read
Memory Fences
# Explicit memory barriers
Atomic.fence :acquire # Acquire fence
Atomic.fence :release # Release fence
Atomic.fence :seq-cst # Full fence
SIMD Operations
The Vec module provides SIMD vector operations backed by
Zig's @Vector intrinsics for parallel numeric processing.
# Element-wise arithmetic
result = Vec.add vec1 vec2
result = Vec.sub vec1 vec2
result = Vec.mul vec1 vec2
result = Vec.div vec1 vec2
# Reductions (single value from vector)
total = Vec.sum vec
prod = Vec.product vec
min-val = Vec.min vec
max-val = Vec.max vec
# Linear algebra
dot = Vec.dot-product vec1 vec2
mag = Vec.magnitude vec
normalized = Vec.normalize vec
scaled = Vec.scale vec factor
The implementation uses a vector width of 4 for portability across platforms. Operations compile to native SIMD instructions (SSE/AVX on x86, NEON on ARM).
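As a sketch of how these operations compose, the example below computes a weighted sum with Vec.dot-product. The constructor Vec.from-list is hypothetical - check the Vec module reference for how vectors are actually created in your Kit version.

# Vec.from-list is a hypothetical constructor, used for illustration only
prices = Vec.from-list [101.5, 102.0, 99.75, 100.25]
weights = Vec.from-list [0.4, 0.3, 0.2, 0.1]
# Single dot product instead of an element-wise loop
weighted = Vec.dot-product prices weights
println "Weighted price: ${weighted}"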
Zero-Copy I/O
For large files, Kit supports memory-mapped I/O to avoid copying data between kernel and user space.
# Memory-map a file for zero-copy reading
match File.mmap "/path/to/large-file.dat"
| Ok content ->
    # content is directly mapped from disk - no copy
    process content
    # Explicitly unmap when done
    File.munmap content
| Err e ->
    println "Failed to map file: ${e}"
The mapped memory is read-only. This is ideal for reading large data files, configuration, or any scenario where you want to avoid the overhead of loading an entire file into a heap-allocated buffer.
FlatBuffers
The kit-flatbuf package provides zero-copy binary serialization
using the FlatBuffers format. Unlike JSON or other text formats, FlatBuffers
can be read directly without parsing or memory allocation - the serialized
buffer is the in-memory representation.
Why FlatBuffers for Low-Latency
- Zero-copy reads - Access data directly from the buffer without parsing
- No allocation on read - Reading fields doesn't allocate memory
- Random access - Jump directly to any field without scanning
- Cache-friendly - Compact binary format with good locality
Building FlatBuffers
import FlatBuf
# Create a builder
b = FlatBuf.new()
# Write a string (returns builder and offset)
(b, name-offset) = FlatBuf.create-string b "Alice"
# Build a table with 3 fields
b = FlatBuf.start-object b 3
b = FlatBuf.add-string b 0 name-offset # field 0: name
b = FlatBuf.add-int32 b 1 30 0 # field 1: age (default 0)
b = FlatBuf.add-bool b 2 true false # field 2: active (default false)
(b, root) = FlatBuf.end-object b
# Finish and get bytes
bytes = FlatBuf.finish b root
Pipe-Friendly API
For more ergonomic building, use the pipe-friendly context API with
the |>> operator:
import FlatBuf
Ctx = FlatBuf.Ctx
# Build using pipes - context flows through each operation
bytes = Ctx.new()
  |>> Ctx.string "name" "Alice"
  |>> Ctx.start-object 3
  |>> Ctx.add-string 0 "name"
  |>> Ctx.add-int32 1 30 0
  |>> Ctx.add-bool 2 true false
  |>> Ctx.end-object "person"
  |>> Ctx.finish "person"
Zero-Copy Reading
# Parse bytes into a buffer (validates structure)
match FlatBuf.from-bytes bytes
| Ok buf ->
    # Get root table position
    root = FlatBuf.get-root buf
    # Read fields - no allocation, direct buffer access
    match FlatBuf.read-string buf root 0
    | Ok (Some name) -> println "Name: ${name}"
    | Ok None -> println "Name not present"
    | Err e -> println "Error: ${e}"
    match FlatBuf.read-int32 buf root 1 0
    | Ok age -> println "Age: ${Int.to-string age}"
    | Err e -> println "Error: ${e}"
    match FlatBuf.read-bool buf root 2 false
    | Ok active -> println "Active: ${Bool.show active}"
    | Err e -> println "Error: ${e}"
| Err e ->
    println "Parse error: ${e}"
Vectors and Nested Tables
# Build with nested data
bytes = Ctx.new()
  |>> Ctx.string "name" "Bob"
  |>> Ctx.vector-int32 "scores" [95, 87, 92, 88]
  |>> Ctx.start-object 3
  |>> Ctx.add-string 0 "name"
  |>> Ctx.add-int32 1 25 0
  |>> Ctx.add-offset 2 "scores"
  |>> Ctx.end-object "person"
  |>> Ctx.finish "person"
# Read vector
match FlatBuf.from-bytes bytes
| Ok buf ->
    root = FlatBuf.get-root buf
    match FlatBuf.read-vector buf root 2
    | Ok (Some vec-pos) ->
        match FlatBuf.vector-length buf vec-pos
        | Ok len ->
            println "${Int.to-string len} scores"
            # Access individual elements
            match FlatBuf.vector-get-int32 buf vec-pos 0
            | Ok score -> println "First: ${Int.to-string score}"
            | Err e -> println "Error: ${e}"
        | Err e -> println "Error: ${e}"
    | Ok None -> println "No scores"
    | Err e -> println "Error: ${e}"
| Err e -> println "Parse error"
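Because reads never parse or allocate, FlatBuffers combines naturally with the memory-mapped I/O described above: map a serialized file and read fields directly out of the mapping. The sketch below assumes the mapped content can be handed to FlatBuf.from-bytes unchanged; verify that against your Kit version.

# Zero-copy end to end: mmap a serialized file, then read it in place.
# Assumes the mmap'd content is accepted by FlatBuf.from-bytes as-is.
match File.mmap "/path/to/person.fb"
| Ok content ->
    match FlatBuf.from-bytes content
    | Ok buf ->
        root = FlatBuf.get-root buf
        match FlatBuf.read-string buf root 0
        | Ok (Some name) -> println "Name: ${name}"
        | Ok None -> println "Name not present"
        | Err e -> println "Error: ${e}"
    | Err e -> println "Parse error: ${e}"
    File.munmap content
| Err e -> println "Failed to map file: ${e}"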
To use FlatBuffers, add the kit-flatbuf package to your project:
kit add flatbuf
See the kit-flatbuf documentation for the full API reference.
Cache Optimization
Kit's lock-free data structures use cache-line padding to prevent false sharing between CPU cores.
How It Works
When multiple cores access different variables that happen to reside on the same cache line, the cache line "ping-pongs" between cores, causing significant performance degradation. Kit's channel implementations pad atomic variables to separate cache lines:
# Cache-line sizes by architecture
# Apple Silicon: 128 bytes
# x86_64: 64 bytes
# In Kit's channel implementation, head and tail pointers
# are padded to separate cache lines to prevent false sharing
# between producer and consumer threads.
Contiguous Data Structures
Kit lists are backed by contiguous arrays (slices), not linked lists. This provides cache-friendly iteration with predictable memory access patterns.
| Structure | Memory Layout | Cache Behavior |
|---|---|---|
| List | Contiguous array | Excellent - sequential access, prefetching works well |
| Map | Hash table | Good for lookups, random access pattern |
| Trees | Node-based | Pointer chasing, less predictable |
For latency-critical lookups on mostly-static data, binary search on a sorted list often outperforms tree-based structures due to better cache locality.
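As an example, a mostly-static lookup table can be kept as a sorted list and probed with List.binary-search; Bench.run from the benchmarking section can confirm the tail latency on your own data:

# Mostly-static lookup table kept as a sorted, contiguous list
price-levels = [100, 105, 110, 115, 120, 125, 130, 135]
# Sequential, predictable memory layout keeps lookups cache-friendly
stats = Bench.run "level-lookup" 1000 100000 fn => List.binary-search 120 price-levels
println "Median: ${stats.median_ns} ns, P99: ${stats.p99_ns} ns"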