cuda

CUDA GPU computing via cuBLAS, cuFFT, and CUDA runtime (requires NVIDIA GPU and CUDA toolkit)

Files

File                        Description
.editorconfig               Editor formatting configuration
.gitignore                  Git ignore rules for build artifacts and dependencies
.tool-versions              asdf tool versions (Zig, Kit)
examples/basic.kit          Basic usage example
examples/matmul.kit         Matrix multiplication example
kit.toml                    Package manifest with metadata and dependencies
LICENSE                     MIT license file
README.md                   This file
src/main.kit                Kit CUDA package source
tests/cuda.test.kit         Tests for cuda
tests/error-types.test.kit  Tests for error-types
zig/cuda.zig                Zig FFI module for CUDA runtime, cuBLAS, and cuFFT
zig/kit_ffi.zig             Zig FFI module for Kit common types and helpers

Requirements

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit installed (11.0+)
  • Linux (CUDA is not supported on macOS since CUDA 10.2)

Installation

Ubuntu/Debian:

# Add NVIDIA repository and install CUDA toolkit
# See: https://developer.nvidia.com/cuda-downloads

Fedora:

sudo dnf install cuda

Arch Linux:

sudo pacman -S cuda

Usage

import Cuda

main = fn =>
  # Check for CUDA devices
  match Cuda.device-count
    | Ok n -> print "Found ${n} CUDA device(s)"
    | Err e -> print "CUDA error: ${show e}"

  # Get device properties
  props = Cuda.device-properties 0 |> Result.unwrap
  print "GPU: ${props.name}"
  print "Memory: ${props.total-memory / 1024 / 1024} MB"
  print "Compute: ${props.compute-major}.${props.compute-minor}"

  # Vector dot product on GPU
  x = [1.0, 2.0, 3.0, 4.0]
  y = [5.0, 6.0, 7.0, 8.0]

  # Transfer to GPU
  gx = Cuda.to-device-f32 x |> Result.unwrap
  gy = Cuda.to-device-f32 y |> Result.unwrap

  # Compute on GPU
  dot = Cuda.blas-dot-f32 gx gy |> Result.unwrap
  print "Dot product: ${dot}"  # 70.0

  # Clean up
  Cuda.free-f32 gx
  Cuda.free-f32 gy
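
Result.unwrap aborts on failure, which is fine for a demo but not for code that should degrade gracefully. A sketch of the same dot product with explicit error handling, using only functions from the API overview below (it assumes Kit allows multi-expression match arms, as the examples in this README suggest):

```kit
import Cuda

# Dot product on the GPU, propagating errors and freeing buffers on all paths
safe-dot = fn(x, y) =>
  match Cuda.to-device-f32 x
    | Err e -> Err e
    | Ok gx ->
      match Cuda.to-device-f32 y
        | Err e ->
          Cuda.free-f32 gx
          Err e
        | Ok gy ->
          result = Cuda.blas-dot-f32 gx gy
          # Free GPU buffers regardless of the outcome
          Cuda.free-f32 gx
          Cuda.free-f32 gy
          result
```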

API Overview

Device Management

Cuda.device-count         # () -> Result Int CudaError
Cuda.set-device           # Int -> Result () CudaError
Cuda.get-device           # () -> Result Int CudaError
Cuda.device-properties    # Int -> Result DeviceProperties CudaError
Cuda.memory-info          # () -> Result MemoryInfo CudaError
Cuda.synchronize          # () -> Result () CudaError
Cuda.reset                # () -> Result () CudaError
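
As a sketch of how these fit together (assuming the record fields shown in the Types section below), one might select a device and report its memory state:

```kit
import Cuda

# Select device i, then print its name and free/total memory in MB
report-device = fn(i) =>
  Cuda.set-device i |> Result.unwrap
  props = Cuda.device-properties i |> Result.unwrap
  mem = Cuda.memory-info |> Result.unwrap
  print "Device ${i}: ${props.name}"
  print "Free: ${mem.free / 1024 / 1024} MB of ${mem.total / 1024 / 1024} MB"
```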

Memory Management

# Float32
Cuda.malloc-f32           # Int -> Result GpuArrayF32 CudaError
Cuda.free-f32             # GpuArrayF32 -> Result () CudaError
Cuda.to-device-f32        # [Float] -> Result GpuArrayF32 CudaError
Cuda.to-host-f32          # GpuArrayF32 -> Result [Float] CudaError

# Float64
Cuda.malloc-f64           # Int -> Result GpuArrayF64 CudaError
Cuda.free-f64             # GpuArrayF64 -> Result () CudaError
Cuda.to-device-f64        # [Float] -> Result GpuArrayF64 CudaError
Cuda.to-host-f64          # GpuArrayF64 -> Result [Float] CudaError

CUDA Streams

Cuda.stream-create        # () -> Result Stream CudaError
Cuda.stream-destroy       # Stream -> Result () CudaError
Cuda.stream-synchronize   # Stream -> Result () CudaError
Cuda.stream-query         # Stream -> Result Bool CudaError
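
A minimal stream lifecycle, sketched from the signatures above (how other operations are enqueued onto a particular stream is not covered by this API surface):

```kit
import Cuda

stream = Cuda.stream-create |> Result.unwrap

# Non-blocking poll: Ok True means all work in the stream has finished
done = Cuda.stream-query stream |> Result.unwrap
print "Stream idle: ${show done}"

# Block until the stream drains, then release it
Cuda.stream-synchronize stream |> Result.unwrap
Cuda.stream-destroy stream |> Result.unwrap
```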

cuBLAS Level 1 (Vector Operations)

# Float32
Cuda.blas-dot-f32         # GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError
Cuda.blas-norm-f32        # GpuArrayF32 -> Result Float CudaError
Cuda.blas-asum-f32        # GpuArrayF32 -> Result Float CudaError
Cuda.blas-iamax-f32       # GpuArrayF32 -> Result Int CudaError
Cuda.blas-scale-f32       # Float -> GpuArrayF32 -> Result () CudaError
Cuda.blas-axpy-f32        # Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError
Cuda.blas-copy-f32        # GpuArrayF32 -> GpuArrayF32 -> Result () CudaError

# Float64
Cuda.blas-dot-f64         # GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError
Cuda.blas-norm-f64        # GpuArrayF64 -> Result Float CudaError
Cuda.blas-axpy-f64        # Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError
Cuda.blas-scale-f64       # Float -> GpuArrayF64 -> Result () CudaError
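
Note that blas-axpy and blas-scale mutate their target buffer in place, so chained updates stay on the GPU. A sketch (assumes both vectors have equal length):

```kit
import Cuda

gx = Cuda.to-device-f32 [1.0, 2.0, 3.0] |> Result.unwrap
gy = Cuda.to-device-f32 [10.0, 20.0, 30.0] |> Result.unwrap

# y := 2.0 * x + y, performed in place on gy
Cuda.blas-axpy-f32 2.0 gx gy |> Result.unwrap
# y := 0.5 * y
Cuda.blas-scale-f32 0.5 gy |> Result.unwrap

print "${show (Cuda.to-host-f32 gy |> Result.unwrap)}"  # [6.0, 12.0, 18.0]

Cuda.free-f32 gx
Cuda.free-f32 gy
```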

cuBLAS Level 2 (Matrix-Vector)

Cuda.blas-gemv-f32        # Float -> A -> x -> Float -> y -> m -> n -> Result () CudaError
Cuda.blas-gemv-f64        # Float -> A -> x -> Float -> y -> m -> n -> Result () CudaError

cuBLAS Level 3 (Matrix-Matrix)

Cuda.blas-gemm-f32        # Float -> A -> B -> Float -> C -> m -> n -> k -> Result () CudaError
Cuda.blas-gemm-f64        # Float -> A -> B -> Float -> C -> m -> n -> k -> Result () CudaError
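
A small sketch, hedged on one point the signatures do not settle: cuBLAS itself is column-major, and whether this wrapper re-orders for row-major callers is not stated. The example sidesteps the question by using the 2x2 identity for A, so C = A * B equals B under either storage order:

```kit
import Cuda

# C = 1.0 * A * B + 0.0 * C with m = n = k = 2
ga = Cuda.to-device-f32 [1.0, 0.0, 0.0, 1.0] |> Result.unwrap  # identity
gb = Cuda.to-device-f32 [5.0, 6.0, 7.0, 8.0] |> Result.unwrap
gc = Cuda.malloc-f32 4 |> Result.unwrap  # beta = 0.0, so C need not be initialized

Cuda.blas-gemm-f32 1.0 ga gb 0.0 gc 2 2 2 |> Result.unwrap
print "${show (Cuda.to-host-f32 gc |> Result.unwrap)}"  # [5.0, 6.0, 7.0, 8.0]

Cuda.free-f32 ga
Cuda.free-f32 gb
Cuda.free-f32 gc
```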

cuFFT (Fast Fourier Transform)

Cuda.fft-forward          # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-inverse          # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-real-to-complex  # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-complex-to-real  # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
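
As a sketch (assuming the Int argument is the transform size n, with interleaved complex output as described under fft-real-to-complex below): transforming n = 4 real samples yields n/2 + 1 = 3 complex values, i.e. a buffer of 6 floats.

```kit
import Cuda

gsignal = Cuda.to-device-f32 [1.0, 2.0, 3.0, 4.0] |> Result.unwrap

# Real-to-complex FFT of n = 4 samples -> 3 complex bins (6 interleaved floats)
gspec = Cuda.fft-real-to-complex gsignal 4 |> Result.unwrap
spectrum = Cuda.to-host-f32 gspec |> Result.unwrap
print "Spectrum (interleaved re/im): ${show spectrum}"

Cuda.free-f32 gsignal
Cuda.free-f32 gspec
```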

High-Level Convenience Functions

High-level convenience functions that handle memory transfer automatically are planned for a future release. For now, use the lower-level API with explicit memory management (see examples below).

Types

type CudaError =
  | CudaError {code: Int, message: String}
  | DeviceError {message: String}
  | MemoryError {message: String}
  | InvalidArgument {message: String}
  | BlasError {code: Int, message: String}
  | FftError {code: Int, message: String}

type DeviceProperties = DeviceProperties {
  name: String,
  total-memory: Int,
  compute-major: Int,
  compute-minor: Int,
  multi-processor-count: Int,
  warp-size: Int,
  max-threads-per-block: Int,
  max-block-dim-x: Int,
  max-block-dim-y: Int,
  max-block-dim-z: Int,
  max-grid-dim-x: Int,
  max-grid-dim-y: Int,
  max-grid-dim-z: Int
}

type MemoryInfo = MemoryInfo {free: Int, total: Int}

type GpuArrayF32 = GpuArrayF32 {ptr: Int, len: Int}
type GpuArrayF64 = GpuArrayF64 {ptr: Int, len: Int}
type GpuArrayInt = GpuArrayInt {ptr: Int, len: Int}
type Stream = Stream {handle: Int}

Performance Notes

  • The planned high-level functions (dot, norm, matmul) will include memory transfer overhead on every call
  • For best performance with multiple operations, use the low-level API to keep data on the GPU
  • Streams enable overlapping computation and memory transfers

Example: Keeping Data on GPU

import Cuda

# For multiple operations, keep data on GPU
gpu-workflow = fn(data) =>
  # Transfer once
  gx = Cuda.to-device-f32 data |> Result.unwrap
  gy = Cuda.malloc-f32 (length data) |> Result.unwrap

  # Multiple GPU operations
  Cuda.blas-copy-f32 gx gy |> Result.unwrap
  Cuda.blas-scale-f32 2.0 gy |> Result.unwrap
  Cuda.blas-axpy-f32 1.0 gx gy |> Result.unwrap

  # Transfer result back
  result = Cuda.to-host-f32 gy |> Result.unwrap

  # Clean up
  Cuda.free-f32 gx
  Cuda.free-f32 gy

  result

License

MIT

Exported Functions & Types

CudaError

Error types for CUDA operations.

Variants

CudaError {code, message}
CUDA operation failed
DeviceError {message}
Device not found or not available
MemoryError {message}
Memory allocation failed on device
InvalidArgument {message}
Invalid argument passed to CUDA function
BlasError {code, message}
cuBLAS operation failed
FftError {code, message}
cuFFT operation failed

DeviceProperties

Properties of a CUDA device.

Variants

DeviceProperties {name, total-memory, compute-major, compute-minor, multi-processor-count, warp-size, max-threads-per-block, max-block-dim-x, max-block-dim-y, max-block-dim-z, max-grid-dim-x, max-grid-dim-y, max-grid-dim-z}

MemoryInfo

GPU memory information.

Variants

MemoryInfo {free, total}

GpuArrayF32

A handle to GPU-allocated memory for floats (f32). This is an opaque handle; do not modify it directly.

Variants

GpuArrayF32 {ptr, len}

GpuArrayF64

A handle to GPU-allocated memory for doubles (f64).

Variants

GpuArrayF64 {ptr, len}

GpuArrayInt

A handle to GPU-allocated memory for integers.

Variants

GpuArrayInt {ptr, len}

Stream

A CUDA stream for asynchronous operations.

Variants

Stream {handle}

device-count

Returns the number of CUDA-capable devices.

() -> Result Int CudaError

set-device

Sets the current CUDA device.

Int -> Result () CudaError

get-device

Gets the current CUDA device index.

() -> Result Int CudaError

device-properties

Gets properties of a CUDA device.

Int -> Result DeviceProperties CudaError

memory-info

Gets memory info for the current device.

() -> Result MemoryInfo CudaError

synchronize

Synchronizes the current device (waits for all operations to complete).

() -> Result () CudaError

reset

Resets the current device (frees all memory, destroys all streams).

() -> Result () CudaError

malloc-f32

Allocates memory on the GPU for f32 values.

Int -> Result GpuArrayF32 CudaError

free-f32

Frees GPU memory.

GpuArrayF32 -> Result () CudaError

to-device-f32

Copies data from host (CPU) to device (GPU).

[Float] -> Result GpuArrayF32 CudaError

to-host-f32

Copies data from device (GPU) to host (CPU).

GpuArrayF32 -> Result [Float] CudaError

malloc-f64

Allocates memory on the GPU for f64 values.

Int -> Result GpuArrayF64 CudaError

free-f64

Frees GPU memory for f64 array.

GpuArrayF64 -> Result () CudaError

to-device-f64

Copies f64 data from host to device.

[Float] -> Result GpuArrayF64 CudaError

to-host-f64

Copies f64 data from device to host.

GpuArrayF64 -> Result [Float] CudaError

stream-create

Creates a new CUDA stream for asynchronous operations.

() -> Result Stream CudaError

stream-destroy

Destroys a CUDA stream.

Stream -> Result () CudaError

stream-synchronize

Synchronizes a stream (waits for all operations in the stream to complete).

Stream -> Result () CudaError

stream-query

Queries if a stream has completed all operations.

Stream -> Result Bool CudaError

blas-dot-f32

Computes the dot product of two vectors on GPU: x . y

GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError

blas-norm-f32

Computes the Euclidean norm of a vector: ||x||_2

GpuArrayF32 -> Result Float CudaError

blas-asum-f32

Computes the sum of absolute values (L1 norm): sum(|x_i|)

GpuArrayF32 -> Result Float CudaError

blas-iamax-f32

Finds the index of the element with maximum absolute value.

GpuArrayF32 -> Result Int CudaError

blas-scale-f32

Scales a vector by a scalar: x = alpha * x

Float -> GpuArrayF32 -> Result () CudaError

blas-axpy-f32

AXPY: y = alpha * x + y

Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError

blas-copy-f32

Copies vector x to vector y: y = x

GpuArrayF32 -> GpuArrayF32 -> Result () CudaError

blas-dot-f64

Computes the dot product of two f64 vectors on GPU.

GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError

blas-norm-f64

Computes the Euclidean norm of an f64 vector.

GpuArrayF64 -> Result Float CudaError

blas-axpy-f64

AXPY for f64: y = alpha * x + y

Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError

blas-scale-f64

Scales an f64 vector: x = alpha * x

Float -> GpuArrayF64 -> Result () CudaError

blas-gemv-f32

Matrix-vector multiplication: y = alpha * A * x + beta * y. A is an m x n matrix, x is an n-element vector, and y is an m-element vector.

Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Result () CudaError

blas-gemv-f64

f64 matrix-vector multiplication: y = alpha * A * x + beta * y

Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Result () CudaError

blas-gemm-f32

General matrix-matrix multiplication: C = alpha * A * B + beta * C. A is m x k, B is k x n, and C is m x n.

Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Int -> Result () CudaError

blas-gemm-f64

f64 general matrix-matrix multiplication: C = alpha * A * B + beta * C

Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Int -> Result () CudaError

fft-forward

Computes 1D complex-to-complex FFT (forward). Input: interleaved real/imaginary pairs [r0, i0, r1, i1, ...]

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-inverse

Computes 1D complex-to-complex FFT (inverse).

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-real-to-complex

Computes 1D real-to-complex FFT. Input: n real values. Output: n/2+1 complex values (interleaved)

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-complex-to-real

Computes 1D complex-to-real FFT (inverse of r2c).

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError