cuda

CUDA GPU computing via cuBLAS, cuFFT, and CUDA runtime (requires NVIDIA GPU and CUDA toolkit)

Files

File                        Description
.editorconfig               Editor formatting configuration
.gitignore                  Git ignore rules for build artifacts and dependencies
.tool-versions              asdf tool versions (Zig, Kit)
examples/basic.kit          Basic usage example
examples/matmul.kit         Matrix multiplication example
kit.toml                    Package manifest with metadata and dependencies
LICENSE                     MIT license file
README.md                   This file
src/main.kit                Kit CUDA package source
tests/cuda.test.kit         Tests for cuda
tests/error-types.test.kit  Tests for error-types
zig/cuda.zig                Zig FFI module for CUDA runtime, cuBLAS, and cuFFT
zig/kit_ffi.zig             Zig FFI module for Kit common types and helpers

Requirements

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit installed (11.0+)
  • Linux (CUDA is not supported on macOS since CUDA 10.2)

Installation

Ubuntu/Debian:

# Add NVIDIA repository and install CUDA toolkit
# See: https://developer.nvidia.com/cuda-downloads

Fedora:

sudo dnf install cuda

Arch Linux:

sudo pacman -S cuda

Usage

import Cuda

main = fn =>
  # Check for CUDA devices
  match Cuda.device-count
    | Ok n -> print "Found ${n} CUDA device(s)"
    | Err e -> print "CUDA error: ${show e}"

  # Get device properties
  props = Cuda.device-properties 0 |> Result.unwrap
  print "GPU: ${props.name}"
  print "Memory: ${props.total-memory / 1024 / 1024} MB"
  print "Compute: ${props.compute-major}.${props.compute-minor}"

  # Vector dot product on GPU
  x = [1.0, 2.0, 3.0, 4.0]
  y = [5.0, 6.0, 7.0, 8.0]

  # Transfer to GPU
  gx = Cuda.to-device-f32 x |> Result.unwrap
  gy = Cuda.to-device-f32 y |> Result.unwrap

  # Compute on GPU
  dot = Cuda.blas-dot-f32 gx gy |> Result.unwrap
  print "Dot product: ${dot}"  # 70.0

  # Clean up
  Cuda.free-f32 gx
  Cuda.free-f32 gy
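
Result.unwrap aborts on failure, which is fine for a demo but not for code that should degrade gracefully. A sketch of the same dot product with explicit error handling, using only functions from the API overview below (it assumes Kit allows multi-expression match arms, as the examples in this README suggest):

```kit
import Cuda

# Dot product on the GPU, propagating errors and freeing buffers on all paths
safe-dot = fn(x, y) =>
  match Cuda.to-device-f32 x
    | Err e -> Err e
    | Ok gx ->
      match Cuda.to-device-f32 y
        | Err e ->
          Cuda.free-f32 gx
          Err e
        | Ok gy ->
          result = Cuda.blas-dot-f32 gx gy
          # Free GPU buffers regardless of the outcome
          Cuda.free-f32 gx
          Cuda.free-f32 gy
          result
```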

API Overview

Device Management

Cuda.device-count         # () -> Result Int CudaError
Cuda.set-device           # Int -> Result () CudaError
Cuda.get-device           # () -> Result Int CudaError
Cuda.device-properties    # Int -> Result DeviceProperties CudaError
Cuda.memory-info          # () -> Result MemoryInfo CudaError
Cuda.synchronize          # () -> Result () CudaError
Cuda.reset                # () -> Result () CudaError
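
As a sketch of how these fit together (assuming the record fields shown in the Types section below), one might select a device and report its memory state:

```kit
import Cuda

# Select device i, then print its name and free/total memory in MB
report-device = fn(i) =>
  Cuda.set-device i |> Result.unwrap
  props = Cuda.device-properties i |> Result.unwrap
  mem = Cuda.memory-info |> Result.unwrap
  print "Device ${i}: ${props.name}"
  print "Free: ${mem.free / 1024 / 1024} MB of ${mem.total / 1024 / 1024} MB"
```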

Memory Management

# Float32
Cuda.malloc-f32           # Int -> Result GpuArrayF32 CudaError
Cuda.free-f32             # GpuArrayF32 -> Result () CudaError
Cuda.to-device-f32        # [Float] -> Result GpuArrayF32 CudaError
Cuda.to-host-f32          # GpuArrayF32 -> Result [Float] CudaError

# Float64
Cuda.malloc-f64           # Int -> Result GpuArrayF64 CudaError
Cuda.free-f64             # GpuArrayF64 -> Result () CudaError
Cuda.to-device-f64        # [Float] -> Result GpuArrayF64 CudaError
Cuda.to-host-f64          # GpuArrayF64 -> Result [Float] CudaError

CUDA Streams

Cuda.stream-create        # () -> Result Stream CudaError
Cuda.stream-destroy       # Stream -> Result () CudaError
Cuda.stream-synchronize   # Stream -> Result () CudaError
Cuda.stream-query         # Stream -> Result Bool CudaError
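
A minimal stream lifecycle, sketched from the signatures above (how other operations are enqueued onto a particular stream is not covered by this API surface):

```kit
import Cuda

stream = Cuda.stream-create |> Result.unwrap

# Non-blocking poll: Ok True means all work in the stream has finished
done = Cuda.stream-query stream |> Result.unwrap
print "Stream idle: ${show done}"

# Block until the stream drains, then release it
Cuda.stream-synchronize stream |> Result.unwrap
Cuda.stream-destroy stream |> Result.unwrap
```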

cuBLAS Level 1 (Vector Operations)

# Float32
Cuda.blas-dot-f32         # GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError
Cuda.blas-norm-f32        # GpuArrayF32 -> Result Float CudaError
Cuda.blas-asum-f32        # GpuArrayF32 -> Result Float CudaError
Cuda.blas-iamax-f32       # GpuArrayF32 -> Result Int CudaError
Cuda.blas-scale-f32       # Float -> GpuArrayF32 -> Result () CudaError
Cuda.blas-axpy-f32        # Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError
Cuda.blas-copy-f32        # GpuArrayF32 -> GpuArrayF32 -> Result () CudaError

# Float64
Cuda.blas-dot-f64         # GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError
Cuda.blas-norm-f64        # GpuArrayF64 -> Result Float CudaError
Cuda.blas-axpy-f64        # Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError
Cuda.blas-scale-f64       # Float -> GpuArrayF64 -> Result () CudaError
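
Note that blas-axpy and blas-scale mutate their target buffer in place, so chained updates stay on the GPU. A sketch (assumes both vectors have equal length):

```kit
import Cuda

gx = Cuda.to-device-f32 [1.0, 2.0, 3.0] |> Result.unwrap
gy = Cuda.to-device-f32 [10.0, 20.0, 30.0] |> Result.unwrap

# y := 2.0 * x + y, performed in place on gy
Cuda.blas-axpy-f32 2.0 gx gy |> Result.unwrap
# y := 0.5 * y
Cuda.blas-scale-f32 0.5 gy |> Result.unwrap

print "${show (Cuda.to-host-f32 gy |> Result.unwrap)}"  # [6.0, 12.0, 18.0]

Cuda.free-f32 gx
Cuda.free-f32 gy
```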

cuBLAS Level 2 (Matrix-Vector)

Cuda.blas-gemv-f32        # Float -> A -> x -> Float -> y -> m -> n -> Result () CudaError
Cuda.blas-gemv-f64        # Float -> A -> x -> Float -> y -> m -> n -> Result () CudaError

cuBLAS Level 3 (Matrix-Matrix)

Cuda.blas-gemm-f32        # Float -> A -> B -> Float -> C -> m -> n -> k -> Result () CudaError
Cuda.blas-gemm-f64        # Float -> A -> B -> Float -> C -> m -> n -> k -> Result () CudaError
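
A small sketch, hedged on one point the signatures do not settle: cuBLAS itself is column-major, and whether this wrapper re-orders for row-major callers is not stated. The example sidesteps the question by using the 2x2 identity for A, so C = A * B equals B under either storage order:

```kit
import Cuda

# C = 1.0 * A * B + 0.0 * C with m = n = k = 2
ga = Cuda.to-device-f32 [1.0, 0.0, 0.0, 1.0] |> Result.unwrap  # identity
gb = Cuda.to-device-f32 [5.0, 6.0, 7.0, 8.0] |> Result.unwrap
gc = Cuda.malloc-f32 4 |> Result.unwrap  # beta = 0.0, so C need not be initialized

Cuda.blas-gemm-f32 1.0 ga gb 0.0 gc 2 2 2 |> Result.unwrap
print "${show (Cuda.to-host-f32 gc |> Result.unwrap)}"  # [5.0, 6.0, 7.0, 8.0]

Cuda.free-f32 ga
Cuda.free-f32 gb
Cuda.free-f32 gc
```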

cuFFT (Fast Fourier Transform)

Cuda.fft-forward          # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-inverse          # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-real-to-complex  # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-complex-to-real  # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
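
As a sketch (assuming the Int argument is the transform size n, with interleaved complex output as described under fft-real-to-complex below): transforming n = 4 real samples yields n/2 + 1 = 3 complex values, i.e. a buffer of 6 floats.

```kit
import Cuda

gsignal = Cuda.to-device-f32 [1.0, 2.0, 3.0, 4.0] |> Result.unwrap

# Real-to-complex FFT of n = 4 samples -> 3 complex bins (6 interleaved floats)
gspec = Cuda.fft-real-to-complex gsignal 4 |> Result.unwrap
spectrum = Cuda.to-host-f32 gspec |> Result.unwrap
print "Spectrum (interleaved re/im): ${show spectrum}"

Cuda.free-f32 gsignal
Cuda.free-f32 gspec
```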

High-Level Convenience Functions

High-level convenience functions that handle memory transfer automatically are planned for a future release. For now, use the lower-level API with explicit memory management (see examples below).

Types

type CudaError =
  | CudaError {code: Int, message: String}
  | DeviceError {message: String}
  | MemoryError {message: String}
  | InvalidArgument {message: String}
  | BlasError {code: Int, message: String}
  | FftError {code: Int, message: String}

type DeviceProperties = DeviceProperties {
  name: String,
  total-memory: Int,
  compute-major: Int,
  compute-minor: Int,
  multi-processor-count: Int,
  warp-size: Int,
  max-threads-per-block: Int,
  max-block-dim-x: Int,
  max-block-dim-y: Int,
  max-block-dim-z: Int,
  max-grid-dim-x: Int,
  max-grid-dim-y: Int,
  max-grid-dim-z: Int
}

type MemoryInfo = MemoryInfo {free: Int, total: Int}

type GpuArrayF32 = GpuArrayF32 {ptr: Int, len: Int}
type GpuArrayF64 = GpuArrayF64 {ptr: Int, len: Int}
type GpuArrayInt = GpuArrayInt {ptr: Int, len: Int}
type Stream = Stream {handle: Int}

Performance Notes

  • The planned high-level functions (dot, norm, matmul) will include memory transfer overhead on every call
  • For best performance with multiple operations, use the low-level API to keep data on the GPU
  • Streams enable overlapping computation and memory transfers

Example: Keeping Data on GPU

import Cuda

# For multiple operations, keep data on GPU
gpu-workflow = fn(data) =>
  # Transfer once
  gx = Cuda.to-device-f32 data |> Result.unwrap
  gy = Cuda.malloc-f32 (length data) |> Result.unwrap

  # Multiple GPU operations
  Cuda.blas-copy-f32 gx gy |> Result.unwrap
  Cuda.blas-scale-f32 2.0 gy |> Result.unwrap
  Cuda.blas-axpy-f32 1.0 gx gy |> Result.unwrap

  # Transfer result back
  result = Cuda.to-host-f32 gy |> Result.unwrap

  # Clean up
  Cuda.free-f32 gx
  Cuda.free-f32 gy

  result

License

MIT

Exported Functions & Types

CudaError

Error types for CUDA operations.

Variants

CudaError {code, message}
CUDA operation failed
DeviceError {message}
Device not found or not available
MemoryError {message}
Memory allocation failed on device
InvalidArgument {message}
Invalid argument passed to CUDA function
BlasError {code, message}
cuBLAS operation failed
FftError {code, message}
cuFFT operation failed

DeviceProperties

Properties of a CUDA device.

Variants

DeviceProperties {name, total-memory, compute-major, compute-minor, multi-processor-count, warp-size, max-threads-per-block, max-block-dim-x, max-block-dim-y, max-block-dim-z, max-grid-dim-x, max-grid-dim-y, max-grid-dim-z}

MemoryInfo

GPU memory information.

Variants

MemoryInfo {free, total}

GpuArrayF32

A handle to GPU-allocated memory for floats (f32). This is an opaque handle; do not modify it directly.

Variants

GpuArrayF32 {ptr, len}

GpuArrayF64

A handle to GPU-allocated memory for doubles (f64).

Variants

GpuArrayF64 {ptr, len}

GpuArrayInt

A handle to GPU-allocated memory for integers.

Variants

GpuArrayInt {ptr, len}

Stream

A CUDA stream for asynchronous operations.

Variants

Stream {handle}

device-count

Returns the number of CUDA-capable devices.

() -> Result Int CudaError

set-device

Sets the current CUDA device.

Int -> Result () CudaError

get-device

Gets the current CUDA device index.

() -> Result Int CudaError

device-properties

Gets properties of a CUDA device.

Int -> Result DeviceProperties CudaError

memory-info

Gets memory info for the current device.

() -> Result MemoryInfo CudaError

synchronize

Synchronizes the current device (waits for all operations to complete).

() -> Result () CudaError

reset

Resets the current device (frees all memory, destroys all streams).

() -> Result () CudaError

malloc-f32

Allocates memory on the GPU for f32 values.

Int -> Result GpuArrayF32 CudaError

free-f32

Frees GPU memory.

GpuArrayF32 -> Result () CudaError

to-device-f32

Copies data from host (CPU) to device (GPU).

[Float] -> Result GpuArrayF32 CudaError

to-host-f32

Copies data from device (GPU) to host (CPU).

GpuArrayF32 -> Result [Float] CudaError

malloc-f64

Allocates memory on the GPU for f64 values.

Int -> Result GpuArrayF64 CudaError

free-f64

Frees GPU memory for f64 array.

GpuArrayF64 -> Result () CudaError

to-device-f64

Copies f64 data from host to device.

[Float] -> Result GpuArrayF64 CudaError

to-host-f64

Copies f64 data from device to host.

GpuArrayF64 -> Result [Float] CudaError

stream-create

Creates a new CUDA stream for asynchronous operations.

() -> Result Stream CudaError

stream-destroy

Destroys a CUDA stream.

Stream -> Result () CudaError

stream-synchronize

Synchronizes a stream (waits for all operations in the stream to complete).

Stream -> Result () CudaError

stream-query

Queries if a stream has completed all operations.

Stream -> Result Bool CudaError

blas-dot-f32

Computes the dot product of two vectors on GPU: x . y

GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError

blas-norm-f32

Computes the Euclidean norm of a vector: ||x||_2

GpuArrayF32 -> Result Float CudaError

blas-asum-f32

Computes the sum of absolute values (L1 norm): sum(|x_i|)

GpuArrayF32 -> Result Float CudaError

blas-iamax-f32

Finds the index of the element with maximum absolute value.

GpuArrayF32 -> Result Int CudaError

blas-scale-f32

Scales a vector by a scalar: x = alpha * x

Float -> GpuArrayF32 -> Result () CudaError

blas-axpy-f32

AXPY: y = alpha * x + y

Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError

blas-copy-f32

Copies vector x to vector y: y = x

GpuArrayF32 -> GpuArrayF32 -> Result () CudaError

blas-dot-f64

Computes the dot product of two f64 vectors on GPU.

GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError

blas-norm-f64

Computes the Euclidean norm of an f64 vector.

GpuArrayF64 -> Result Float CudaError

blas-axpy-f64

AXPY for f64: y = alpha * x + y

Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError

blas-scale-f64

Scales an f64 vector: x = alpha * x

Float -> GpuArrayF64 -> Result () CudaError

blas-gemv-f32

Matrix-vector multiplication: y = alpha * A * x + beta * y. A is an m x n matrix, x is an n-element vector, and y is an m-element vector.

Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Result () CudaError

blas-gemv-f64

f64 matrix-vector multiplication: y = alpha * A * x + beta * y

Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Result () CudaError

blas-gemm-f32

General matrix-matrix multiplication: C = alpha * A * B + beta * C. A is m x k, B is k x n, and C is m x n.

Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Int -> Result () CudaError

blas-gemm-f64

f64 general matrix-matrix multiplication: C = alpha * A * B + beta * C

Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Int -> Result () CudaError

fft-forward

Computes 1D complex-to-complex FFT (forward). Input: interleaved real/imaginary pairs [r0, i0, r1, i1, ...]

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-inverse

Computes 1D complex-to-complex FFT (inverse).

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-real-to-complex

Computes 1D real-to-complex FFT. Input: n real values. Output: n/2+1 complex values (interleaved)

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-complex-to-real

Computes 1D complex-to-real FFT (inverse of r2c).

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError