Low-Latency Programming

Kit provides low-level primitives for performance-critical applications where nanoseconds matter. These features give you direct access to CPU timing, branchless algorithms, SIMD operations, and zero-copy I/O.

CPU Timing

For cycle-accurate timing, Kit exposes the CPU timestamp counter directly. This is essential for measuring latency in performance-critical code paths.

Reading the Timestamp Counter

# Read CPU timestamp counter (non-serializing)
start = Time.rdtsc()
# ... your code ...
end = Time.rdtsc()
cycles = end - start
println "Elapsed cycles: ${cycles}"

# Serializing variant - waits for prior instructions to complete
# More accurate for benchmarks, slightly higher overhead
start = Time.rdtscp()
# ... your code ...
cycles = Time.rdtscp() - start

When to use which:

  • Time.rdtsc() - Lower overhead, good for measuring longer operations
  • Time.rdtscp() - Serializing, more accurate for micro-benchmarks
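One practical calibration step is to measure the counter's own read overhead with two back-to-back reads, then subtract that baseline from very short measurements so you don't overstate the cost of the code under test. A minimal sketch using only Time.rdtscp from above (the reported value varies by microarchitecture):

t0 = Time.rdtscp()
t1 = Time.rdtscp()
println "Timer read overhead: ${t1 - t0} cycles"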

Platform Support

On x86_64, these use the RDTSC/RDTSCP instructions directly. On aarch64 (Apple Silicon), they use the CNTVCT_EL0 register. Other architectures fall back to high-resolution monotonic time.

Benchmarking

Kit provides two levels of benchmarking: single-run timing and statistical profiling.

Single-Run Timing

# Time a single operation with Time.measure
result = Time.measure fn => expensive-operation()

println "Result: ${result.result}"
println "Elapsed: ${result.elapsed_ns} ns"
println "Cycles: ${result.cycles}"

Time.measure takes a zero-argument function (thunk) and returns a record containing the function's return value along with timing information.
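For example, you can compare two implementations side by side; sort-simple and sort-fast are hypothetical stand-ins for your own functions:

a = Time.measure fn => sort-simple data
b = Time.measure fn => sort-fast data
println "simple: ${a.elapsed_ns} ns, fast: ${b.elapsed_ns} ns"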

Statistical Benchmarking

# Statistical benchmarking with warmup and multiple iterations
stats = Bench.run "order-insert" 1000 10000 fn => insert-order order-book order

# Returns comprehensive statistics
println "Benchmark: ${stats.name}"
println "Iterations: ${stats.iterations}"
println "Mean: ${stats.mean_ns} ns"
println "Median: ${stats.median_ns} ns"
println "Min: ${stats.min_ns} ns"
println "Max: ${stats.max_ns} ns"
println "P99: ${stats.p99_ns} ns"
println "Std Dev: ${stats.std_dev}"
println "Ops/sec: ${stats.ops_per_sec}"

Bench.run takes a name, warmup count, iteration count, and a thunk. It discards warmup iterations, then collects timing data across all iterations to produce statistical metrics including percentiles.
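In low-latency work the percentiles usually matter more than the mean, since rare slow iterations hide in averages. A hedged sketch that enforces a latency budget (decode-packet, pkt, and the 500 ns threshold are illustrative):

stats = Bench.run "decode" 1000 10000 fn => decode-packet pkt
if stats.p99_ns > 500
  println "p99 over budget: ${stats.p99_ns} ns"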

Binary Search

Kit provides binary search functions, including branchless variants that avoid branch-misprediction penalties in latency-critical code.

Standard Binary Search

sorted = [1, 3, 5, 7, 9, 11, 13, 15]

# Find exact match (returns Option Int - index if found)
index = List.binary-search 7 sorted   # => Some 3
index = List.binary-search 8 sorted   # => None

# With custom key function for records
items = [{name: "apple", price: 100}, {name: "banana", price: 200}]
get-price = fn(item) => item.price
index = List.binary-search-by get-price 200 items  # => Some 1

Bound Functions

sorted = [1, 3, 5, 5, 5, 7, 9, 11]

# Lower bound: first index where element >= target
idx = List.lower-bound 5 sorted   # => 2

# Upper bound: first index where element > target
idx = List.upper-bound 5 sorted   # => 5

# Count occurrences efficiently
count = (List.upper-bound 5 sorted) - (List.lower-bound 5 sorted)
# => 3

Branchless Variants

# Branchless lower/upper bound for latency-critical paths
# Same result as standard versions, but with lower latency variance
idx = List.lower-bound-branchless 5 sorted  # => 2
idx = List.upper-bound-branchless 5 sorted  # => 5

Branchless variants use conditional moves instead of branches, eliminating branch misprediction. Use these when you need consistent latency and the slight overhead of conditional moves is acceptable.
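To verify the trade-off on your own data, benchmark both variants with Bench.run from the previous section (the key list and iteration counts here are illustrative):

keys = [1, 3, 5, 5, 5, 7, 9, 11]
std = Bench.run "lower-bound" 1000 100000 fn => List.lower-bound 5 keys
bl = Bench.run "lower-bound-branchless" 1000 100000 fn => List.lower-bound-branchless 5 keys
println "standard:   p99 ${std.p99_ns} ns, std dev ${std.std_dev}"
println "branchless: p99 ${bl.p99_ns} ns, std dev ${bl.std_dev}"

Expect the branchless variant to show a tighter spread (lower p99 and standard deviation) rather than a better mean.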

Atomics

The kit-atomic package provides user-facing atomic operations with explicit memory orderings.

Atomic Integers

import Atomic

# Create atomic integer
counter = Atomic.int-new 0

# Atomic operations with memory ordering
Atomic.int-store counter 42 :seq-cst
value = Atomic.int-load counter :seq-cst

# Fetch-and-modify operations
old = Atomic.int-fetch-add counter 1 :seq-cst
old = Atomic.int-fetch-sub counter 1 :seq-cst

# Compare-and-swap
success = Atomic.int-compare-swap counter 42 100 :seq-cst :seq-cst
# => true if counter was 42, now 100
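The two ordering arguments follow the usual compare-and-swap convention: the first applies when the exchange succeeds, the second when it fails. A common pattern built on compare-swap is the retry loop; a minimal sketch, assuming recursive fn definitions and matching on the boolean result:

# Lock-free add-ten: reload and retry if another thread won the race
add-ten = fn =>
  current = Atomic.int-load counter :relaxed
  match Atomic.int-compare-swap counter current (current + 10) :seq-cst :relaxed
    | true -> current
    | false -> add-ten()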

Memory Orderings

# Available memory orderings (keywords)
:relaxed    # No synchronization, just atomicity
:acquire    # Acquire semantics (for loads)
:release    # Release semantics (for stores)
:acq-rel    # Both acquire and release
:seq-cst    # Sequential consistency (strongest)

# Example: producer-consumer pattern
data = Atomic.int-new 0
ready = Atomic.bool-new false

# Producer
Atomic.int-store data 42 :relaxed
Atomic.bool-store ready true :release  # Ensures data is visible

# Consumer
if Atomic.bool-load ready :acquire   # Synchronizes with release
  value = Atomic.int-load data :relaxed  # Safe to read

Memory Fences

# Explicit memory barriers
Atomic.fence :acquire   # Acquire fence
Atomic.fence :release   # Release fence
Atomic.fence :seq-cst   # Full fence
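Fences let you rewrite the producer-consumer handshake above using only relaxed accesses, with the fences supplying the ordering. The same guarantees as the release/acquire version, sketched with the operations shown here:

# Producer: release fence orders the data store before the flag store
Atomic.int-store data 42 :relaxed
Atomic.fence :release
Atomic.bool-store ready true :relaxed

# Consumer: acquire fence orders the flag load before the data load
if Atomic.bool-load ready :relaxed
  Atomic.fence :acquire
  value = Atomic.int-load data :relaxed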

SIMD Operations

The Vec module provides SIMD vector operations backed by Zig's @Vector intrinsics for parallel numeric processing.

# Element-wise arithmetic
result = Vec.add vec1 vec2
result = Vec.sub vec1 vec2
result = Vec.mul vec1 vec2
result = Vec.div vec1 vec2

# Reductions (single value from vector)
total = Vec.sum vec
prod = Vec.product vec
min-val = Vec.min vec
max-val = Vec.max vec

# Linear algebra
dot = Vec.dot-product vec1 vec2
mag = Vec.magnitude vec
normalized = Vec.normalize vec
scaled = Vec.scale vec factor

The implementation uses a vector width of 4 for portability across platforms. Operations compile to native SIMD instructions (SSE/AVX on x86, NEON on ARM).
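Because these primitives compose, short definitions go a long way. For instance, cosine similarity falls out of dot-product and normalize; a sketch assuming a and b are existing Vec values (construction is not shown here):

# Cosine similarity: dot product of the two unit vectors
cosine-sim = fn(a, b) => Vec.dot-product (Vec.normalize a) (Vec.normalize b)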

Zero-Copy I/O

For large files, Kit supports memory-mapped I/O to avoid copying data between kernel and user space.

# Memory-map a file for zero-copy reading
match File.mmap "/path/to/large-file.dat"
  | Ok content ->
      # content is directly mapped from disk - no copy
      process content

      # Explicitly unmap when done
      File.munmap content
  | Err e ->
      println "Failed to map file: ${e}"

The mapped memory is read-only. This is ideal for reading large data files, configuration, or any scenario where you want to avoid the overhead of loading an entire file into a heap-allocated buffer.

FlatBuffers

The kit-flatbuf package provides zero-copy binary serialization using the FlatBuffers format. Unlike JSON or other text formats, FlatBuffers can be read directly without parsing or memory allocation - the serialized buffer is the in-memory representation.

Why FlatBuffers for Low-Latency

  • Zero-copy reads - Access data directly from the buffer without parsing
  • No allocation on read - Reading fields doesn't allocate memory
  • Random access - Jump directly to any field without scanning
  • Cache-friendly - Compact binary format with good locality
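These properties combine naturally with the memory-mapped I/O described above: a mapped FlatBuffers file can be read in place, with no parse step and nothing copied off the page cache. A sketch, assuming the mapped content is accepted by FlatBuf.from-bytes and the file holds a person table like the one built below (the path is illustrative):

match File.mmap "/data/person.fb"
  | Ok content ->
      match FlatBuf.from-bytes content
        | Ok buf ->
            root = FlatBuf.get-root buf
            match FlatBuf.read-int32 buf root 1 0   # field 1: age
              | Ok age -> println "Age: ${Int.to-string age}"
              | Err e -> println "Error: ${e}"
        | Err e -> println "Parse error: ${e}"
      File.munmap content
  | Err e -> println "Failed to map file: ${e}"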

Building FlatBuffers

import FlatBuf

# Create a builder
b = FlatBuf.new()

# Write a string (returns builder and offset)
(b, name-offset) = FlatBuf.create-string b "Alice"

# Build a table with 3 fields
b = FlatBuf.start-object b 3
b = FlatBuf.add-string b 0 name-offset   # field 0: name
b = FlatBuf.add-int32 b 1 30 0           # field 1: age (default 0)
b = FlatBuf.add-bool b 2 true false      # field 2: active (default false)
(b, root) = FlatBuf.end-object b

# Finish and get bytes
bytes = FlatBuf.finish b root

Pipe-Friendly API

For more ergonomic building, use the pipe-friendly context API with the |>> operator:

import FlatBuf

Ctx = FlatBuf.Ctx

# Build using pipes - context flows through each operation
bytes = Ctx.new()
  |>> Ctx.string "name" "Alice"
  |>> Ctx.start-object 3
  |>> Ctx.add-string 0 "name"
  |>> Ctx.add-int32 1 30 0
  |>> Ctx.add-bool 2 true false
  |>> Ctx.end-object "person"
  |>> Ctx.finish "person"

Zero-Copy Reading

# Parse bytes into a buffer (validates structure)
match FlatBuf.from-bytes bytes
  | Ok buf ->
      # Get root table position
      root = FlatBuf.get-root buf

      # Read fields - no allocation, direct buffer access
      match FlatBuf.read-string buf root 0
        | Ok (Some name) -> println "Name: ${name}"
        | Ok None -> println "Name not present"
        | Err e -> println "Error: ${e}"

      match FlatBuf.read-int32 buf root 1 0
        | Ok age -> println "Age: ${Int.to-string age}"
        | Err e -> println "Error: ${e}"

      match FlatBuf.read-bool buf root 2 false
        | Ok active -> println "Active: ${Bool.show active}"
        | Err e -> println "Error: ${e}"

  | Err e ->
      println "Parse error: ${e}"

Vectors and Nested Tables

# Build with nested data
bytes = Ctx.new()
  |>> Ctx.string "name" "Bob"
  |>> Ctx.vector-int32 "scores" [95, 87, 92, 88]
  |>> Ctx.start-object 3
  |>> Ctx.add-string 0 "name"
  |>> Ctx.add-int32 1 25 0
  |>> Ctx.add-offset 2 "scores"
  |>> Ctx.end-object "person"
  |>> Ctx.finish "person"

# Read vector
match FlatBuf.from-bytes bytes
  | Ok buf ->
      root = FlatBuf.get-root buf
      match FlatBuf.read-vector buf root 2
        | Ok (Some vec-pos) ->
            match FlatBuf.vector-length buf vec-pos
              | Ok len ->
                  println "${Int.to-string len} scores"
                  # Access individual elements
                  match FlatBuf.vector-get-int32 buf vec-pos 0
                    | Ok score -> println "First: ${Int.to-string score}"
                    | Err e -> println "Error: ${e}"
              | Err e -> println "Error: ${e}"
        | Ok None -> println "No scores"
        | Err e -> println "Error: ${e}"
  | Err e -> println "Parse error"

Installation

kit add flatbuf

See the kit-flatbuf documentation for the full API reference.

Cache Optimization

Kit's lock-free data structures use cache-line padding to prevent false sharing between CPU cores.

How It Works

When multiple cores access different variables that happen to reside on the same cache line, the cache line "ping-pongs" between cores, causing significant performance degradation. Kit's channel implementations pad atomic variables to separate cache lines:

# Cache-line sizes by architecture
# Apple Silicon: 128 bytes
# x86_64: 64 bytes

# In Kit's channel implementation, head and tail pointers
# are padded to separate cache lines to prevent false sharing
# between producer and consumer threads.

Contiguous Data Structures

Kit lists are backed by contiguous arrays (slices), not linked lists. This provides cache-friendly iteration with predictable memory access patterns.

Structure   Memory Layout       Cache Behavior
List        Contiguous array    Excellent - sequential access, prefetching works well
Map         Hash table          Good for lookups, random access pattern
Trees       Node-based          Pointer chasing, less predictable

For latency-critical lookups on mostly-static data, binary search on a sorted list often outperforms tree-based structures due to better cache locality.
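For example, a small static symbol table can be a list of records sorted by key and searched with List.binary-search-by from earlier (the data here is illustrative):

symbols = [{sym: "AAPL", id: 1}, {sym: "GOOG", id: 2}, {sym: "MSFT", id: 3}]
get-sym = fn(row) => row.sym
idx = List.binary-search-by get-sym "GOOG" symbols   # => Some 1

Every probe touches the same contiguous allocation, which is exactly the locality advantage the table above describes.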