# cuda
| Field | Value |
|---|---|
| Kind | ffi-zig |
| Capabilities | ffi |
| Categories | gpu parallel numeric ffi |
| Keywords | cuda gpu nvidia cublas cufft parallel compute zig-ffi |
CUDA GPU computing via cuBLAS, cuFFT, and the CUDA runtime (requires an NVIDIA GPU and the CUDA Toolkit).
## Files
| File | Description |
|---|---|
| .editorconfig | Editor formatting configuration |
| .gitignore | Git ignore rules for build artifacts and dependencies |
| .tool-versions | asdf tool versions (Zig, Kit) |
| examples/basic.kit | Basic usage example |
| examples/matmul.kit | Matrix multiplication example |
| kit.toml | Package manifest with metadata and dependencies |
| LICENSE | MIT license file |
| README.md | This file |
| src/main.kit | Main module of the Kit CUDA package |
| tests/cuda.test.kit | Tests for the CUDA bindings |
| tests/error-types.test.kit | Tests for the error types |
| zig/cuda.zig | Zig FFI module for the CUDA runtime, cuBLAS, and cuFFT |
| zig/kit_ffi.zig | Zig FFI module for common Kit types and helpers |
## Requirements
- NVIDIA GPU with CUDA support
- CUDA Toolkit 11.0 or newer
- Linux (NVIDIA dropped macOS support after CUDA 10.2)
## Installation
**Ubuntu/Debian:**

```sh
# Add the NVIDIA repository and install the CUDA toolkit
# See: https://developer.nvidia.com/cuda-downloads
```

**Fedora:**

```sh
sudo dnf install cuda
```

**Arch Linux:**

```sh
sudo pacman -S cuda
```

## Usage
```kit
import Cuda

main = fn =>
  # Check for CUDA devices
  match Cuda.device-count
  | Ok n -> print "Found ${n} CUDA device(s)"
  | Err e -> print "CUDA error: ${show e}"

  # Get device properties
  props = Cuda.device-properties 0 |> Result.unwrap
  print "GPU: ${props.name}"
  print "Memory: ${props.total-memory / 1024 / 1024} MB"
  print "Compute: ${props.compute-major}.${props.compute-minor}"

  # Vector dot product on the GPU
  x = [1.0, 2.0, 3.0, 4.0]
  y = [5.0, 6.0, 7.0, 8.0]

  # Transfer to GPU
  gx = Cuda.to-device-f32 x |> Result.unwrap
  gy = Cuda.to-device-f32 y |> Result.unwrap

  # Compute on GPU
  dot = Cuda.blas-dot-f32 gx gy |> Result.unwrap
  print "Dot product: ${dot}"  # 70.0

  # Clean up
  Cuda.free-f32 gx
  Cuda.free-f32 gy
```

## API Overview
### Device Management

```kit
Cuda.device-count # () -> Result Int CudaError
Cuda.set-device # Int -> Result () CudaError
Cuda.get-device # () -> Result Int CudaError
Cuda.device-properties # Int -> Result DeviceProperties CudaError
Cuda.memory-info # () -> Result MemoryInfo CudaError
Cuda.synchronize # () -> Result () CudaError
Cuda.reset # () -> Result () CudaError
```

### Memory Management

```kit
# Float32
Cuda.malloc-f32 # Int -> Result GpuArrayF32 CudaError
Cuda.free-f32 # GpuArrayF32 -> Result () CudaError
Cuda.to-device-f32 # [Float] -> Result GpuArrayF32 CudaError
Cuda.to-host-f32 # GpuArrayF32 -> Result [Float] CudaError
# Float64
Cuda.malloc-f64 # Int -> Result GpuArrayF64 CudaError
Cuda.free-f64 # GpuArrayF64 -> Result () CudaError
Cuda.to-device-f64 # [Float] -> Result GpuArrayF64 CudaError
Cuda.to-host-f64 # GpuArrayF64 -> Result [Float] CudaError
```

### CUDA Streams

```kit
Cuda.stream-create # () -> Result Stream CudaError
Cuda.stream-destroy # Stream -> Result () CudaError
Cuda.stream-synchronize # Stream -> Result () CudaError
Cuda.stream-query # Stream -> Result Bool CudaError
```

### cuBLAS Level 1 (Vector Operations)

```kit
# Float32
Cuda.blas-dot-f32 # GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError
Cuda.blas-norm-f32 # GpuArrayF32 -> Result Float CudaError
Cuda.blas-asum-f32 # GpuArrayF32 -> Result Float CudaError
Cuda.blas-iamax-f32 # GpuArrayF32 -> Result Int CudaError
Cuda.blas-scale-f32 # Float -> GpuArrayF32 -> Result () CudaError
Cuda.blas-axpy-f32 # Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError
Cuda.blas-copy-f32 # GpuArrayF32 -> GpuArrayF32 -> Result () CudaError
# Float64
Cuda.blas-dot-f64 # GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError
Cuda.blas-norm-f64 # GpuArrayF64 -> Result Float CudaError
Cuda.blas-axpy-f64 # Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError
Cuda.blas-scale-f64 # Float -> GpuArrayF64 -> Result () CudaError
```

### cuBLAS Level 2 (Matrix-Vector)

```kit
Cuda.blas-gemv-f32 # Float -> A -> x -> Float -> y -> m -> n -> Result () CudaError
Cuda.blas-gemv-f64 # Float -> A -> x -> Float -> y -> m -> n -> Result () CudaError
```

### cuBLAS Level 3 (Matrix-Matrix)

```kit
Cuda.blas-gemm-f32 # Float -> A -> B -> Float -> C -> m -> n -> k -> Result () CudaError
Cuda.blas-gemm-f64 # Float -> A -> B -> Float -> C -> m -> n -> k -> Result () CudaError
```

### cuFFT (Fast Fourier Transform)

```kit
Cuda.fft-forward # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-inverse # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-real-to-complex # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
Cuda.fft-complex-to-real # GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
```

### High-Level Convenience Functions
High-level convenience functions that handle memory transfer automatically are planned for a future release. For now, use the lower-level API with explicit memory management (see examples below).
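In the meantime, such a wrapper is straightforward to build on the documented low-level calls. A minimal sketch (the `dot` helper here is user code, not part of this package):

```kit
import Cuda

# Hypothetical convenience wrapper: transfer, compute, and free in one call.
# Every call pays two host-to-device copies plus cleanup, which is exactly
# the overhead the low-level API lets you avoid.
dot = fn(x, y) =>
  gx = Cuda.to-device-f32 x |> Result.unwrap
  gy = Cuda.to-device-f32 y |> Result.unwrap
  d = Cuda.blas-dot-f32 gx gy |> Result.unwrap
  Cuda.free-f32 gx
  Cuda.free-f32 gy
  d
```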
## Types
```kit
type CudaError =
  | CudaError {code: Int, message: String}
  | DeviceError {message: String}
  | MemoryError {message: String}
  | InvalidArgument {message: String}
  | BlasError {code: Int, message: String}
  | FftError {code: Int, message: String}

type DeviceProperties = DeviceProperties {
  name: String,
  total-memory: Int,
  compute-major: Int,
  compute-minor: Int,
  multi-processor-count: Int,
  warp-size: Int,
  max-threads-per-block: Int,
  max-block-dim-x: Int,
  max-block-dim-y: Int,
  max-block-dim-z: Int,
  max-grid-dim-x: Int,
  max-grid-dim-y: Int,
  max-grid-dim-z: Int
}

type MemoryInfo = MemoryInfo {free: Int, total: Int}

type GpuArrayF32 = GpuArrayF32 {ptr: Int, len: Int}

type GpuArrayF64 = GpuArrayF64 {ptr: Int, len: Int}

type Stream = Stream {handle: Int}
```
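Callers who want to degrade gracefully can match on the error variants instead of unwrapping. A sketch (assuming nested patterns destructure as in the `match` syntax shown above):

```kit
import Cuda

# Try a large allocation and report failures by variant.
report-alloc = fn =>
  match Cuda.malloc-f32 100000000
  | Ok buf ->
    Cuda.free-f32 buf
    print "allocation succeeded"
  | Err (MemoryError {message}) -> print "GPU allocation failed: ${message}"
  | Err e -> print "CUDA error: ${show e}"
```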
## Performance Notes

- The high-level convenience functions (`dot`, `norm`, `matmul`), once available, will include memory-transfer overhead on every call
- For best performance with multiple operations, use the low-level API to keep data on the GPU
- Streams enable overlapping computation with memory transfers (see the sketch below)
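A minimal sketch of the stream lifecycle, using only the stream API listed above. The current bindings do not expose stream-parameterized transfers or kernels, so the asynchronous work itself is left as a placeholder comment:

```kit
import Cuda

stream-demo = fn =>
  s = Cuda.stream-create |> Result.unwrap
  # ... enqueue asynchronous work on the stream here ...
  match Cuda.stream-query s
  | Ok true -> print "stream is idle"
  | Ok false -> print "work still in flight"
  | Err e -> print "stream error: ${show e}"
  Cuda.stream-synchronize s |> Result.unwrap
  Cuda.stream-destroy s
```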
### Example: Keeping Data on the GPU
```kit
import Cuda

# For multiple operations, keep data on the GPU
gpu-workflow = fn(data) =>
  # Transfer once
  gx = Cuda.to-device-f32 data |> Result.unwrap
  gy = Cuda.malloc-f32 (length data) |> Result.unwrap

  # Multiple GPU operations
  Cuda.blas-copy-f32 gx gy |> Result.unwrap
  Cuda.blas-scale-f32 2.0 gy |> Result.unwrap
  Cuda.blas-axpy-f32 1.0 gx gy |> Result.unwrap

  # Transfer the result back
  result = Cuda.to-host-f32 gy |> Result.unwrap

  # Clean up
  Cuda.free-f32 gx
  Cuda.free-f32 gy
  result
```

## License
MIT
## Exported Functions & Types
### CudaError
Error types for CUDA operations.
Variants:

- `CudaError {code, message}`
- `DeviceError {message}`
- `MemoryError {message}`
- `InvalidArgument {message}`
- `BlasError {code, message}`
- `FftError {code, message}`

### DeviceProperties
Properties of a CUDA device.
Variants:

- `DeviceProperties {name, total-memory, compute-major, compute-minor, multi-processor-count, warp-size, max-threads-per-block, max-block-dim-x, max-block-dim-y, max-block-dim-z, max-grid-dim-x, max-grid-dim-y, max-grid-dim-z}`

### MemoryInfo
GPU memory information.
Variants:

- `MemoryInfo {free, total}`

### GpuArrayF32
A handle to GPU-allocated memory for floats (f32). This is an opaque handle; do not modify it directly.
Variants:

- `GpuArrayF32 {ptr, len}`

### GpuArrayF64
A handle to GPU-allocated memory for doubles (f64).
Variants:

- `GpuArrayF64 {ptr, len}`

### GpuArrayInt
A handle to GPU-allocated memory for integers.
Variants:

- `GpuArrayInt {ptr, len}`

### Stream
A CUDA stream for asynchronous operations.
Variants:

- `Stream {handle}`

### device-count
Returns the number of CUDA-capable devices.
`() -> Result Int CudaError`
### set-device

Sets the current CUDA device.

`Int -> Result () CudaError`

### get-device

Gets the current CUDA device index.

`() -> Result Int CudaError`

### device-properties

Gets properties of a CUDA device.

`Int -> Result DeviceProperties CudaError`

### memory-info

Gets memory info for the current device.

`() -> Result MemoryInfo CudaError`
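For example (field names come from the `MemoryInfo` type above; the values are assumed to be in bytes, as the underlying CUDA runtime reports them):

```kit
import Cuda

show-memory = fn =>
  info = Cuda.memory-info |> Result.unwrap
  print "GPU memory: ${info.free / 1024 / 1024} MB free of ${info.total / 1024 / 1024} MB"
```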
### synchronize

Synchronizes the current device (waits for all operations to complete).

`() -> Result () CudaError`

### reset

Resets the current device (frees all memory, destroys all streams).

`() -> Result () CudaError`
### malloc-f32

Allocates memory on the GPU for f32 values.

`Int -> Result GpuArrayF32 CudaError`

### free-f32

Frees GPU memory.

`GpuArrayF32 -> Result () CudaError`

### to-device-f32

Copies data from host (CPU) to device (GPU).

`[Float] -> Result GpuArrayF32 CudaError`

### to-host-f32

Copies data from device (GPU) to host (CPU).

`GpuArrayF32 -> Result [Float] CudaError`

### malloc-f64

Allocates memory on the GPU for f64 values.

`Int -> Result GpuArrayF64 CudaError`

### free-f64

Frees GPU memory for an f64 array.

`GpuArrayF64 -> Result () CudaError`

### to-device-f64

Copies f64 data from host to device.

`[Float] -> Result GpuArrayF64 CudaError`

### to-host-f64

Copies f64 data from device to host.

`GpuArrayF64 -> Result [Float] CudaError`
### stream-create

Creates a new CUDA stream for asynchronous operations.

`() -> Result Stream CudaError`

### stream-destroy

Destroys a CUDA stream.

`Stream -> Result () CudaError`

### stream-synchronize

Synchronizes a stream (waits for all operations in the stream to complete).

`Stream -> Result () CudaError`

### stream-query

Queries whether a stream has completed all operations.

`Stream -> Result Bool CudaError`
### blas-dot-f32

Computes the dot product of two vectors on the GPU: x · y

`GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError`

### blas-norm-f32

Computes the Euclidean norm of a vector: ||x||_2

`GpuArrayF32 -> Result Float CudaError`

### blas-asum-f32

Computes the sum of absolute values (L1 norm): sum(|x_i|)

`GpuArrayF32 -> Result Float CudaError`

### blas-iamax-f32

Finds the index of the element with the maximum absolute value.

`GpuArrayF32 -> Result Int CudaError`

### blas-scale-f32

Scales a vector by a scalar: x = alpha * x

`Float -> GpuArrayF32 -> Result () CudaError`

### blas-axpy-f32

AXPY: y = alpha * x + y

`Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError`

### blas-copy-f32

Copies vector x to vector y: y = x

`GpuArrayF32 -> GpuArrayF32 -> Result () CudaError`
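A short sketch combining the Level 1 reductions on a single device array, so the data is transferred only once:

```kit
import Cuda

vector-stats = fn(data) =>
  gx = Cuda.to-device-f32 data |> Result.unwrap
  n = Cuda.blas-norm-f32 gx |> Result.unwrap   # ||x||_2
  a = Cuda.blas-asum-f32 gx |> Result.unwrap   # sum(|x_i|)
  i = Cuda.blas-iamax-f32 gx |> Result.unwrap  # index of max |x_i|
  Cuda.free-f32 gx
  print "norm=${n} asum=${a} argmax=${i}"
```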
### blas-dot-f64

Computes the dot product of two f64 vectors on the GPU.

`GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError`

### blas-norm-f64

Computes the Euclidean norm of an f64 vector.

`GpuArrayF64 -> Result Float CudaError`

### blas-axpy-f64

AXPY for f64: y = alpha * x + y

`Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError`

### blas-scale-f64

Scales an f64 vector: x = alpha * x

`Float -> GpuArrayF64 -> Result () CudaError`
### blas-gemv-f32

Matrix-vector multiplication: y = alpha * A * x + beta * y, where A is an m × n matrix, x an n-element vector, and y an m-element vector.

`Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Result () CudaError`

### blas-gemv-f64

f64 matrix-vector multiplication: y = alpha * A * x + beta * y

`Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Result () CudaError`

### blas-gemm-f32

General matrix-matrix multiplication: C = alpha * A * B + beta * C, where A is m × k, B is k × n, and C is m × n.

`Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Int -> Result () CudaError`
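A worked 2 × 2 sketch. The flattened layout is assumed to be column-major, as native cuBLAS expects; with beta = 0 the uninitialized contents of C are never read:

```kit
import Cuda

# C = 1.0 * A * B + 0.0 * C, with A = [[1,2],[3,4]] and B = I
gemm-demo = fn =>
  a = Cuda.to-device-f32 [1.0, 3.0, 2.0, 4.0] |> Result.unwrap  # A, column-major
  b = Cuda.to-device-f32 [1.0, 0.0, 0.0, 1.0] |> Result.unwrap  # B = identity
  c = Cuda.malloc-f32 4 |> Result.unwrap
  Cuda.blas-gemm-f32 1.0 a b 0.0 c 2 2 2 |> Result.unwrap       # m = n = k = 2
  print "${Cuda.to-host-f32 c |> Result.unwrap}"                # expect [1.0, 3.0, 2.0, 4.0]
  Cuda.free-f32 a
  Cuda.free-f32 b
  Cuda.free-f32 c
```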
### blas-gemm-f64

f64 general matrix-matrix multiplication: C = alpha * A * B + beta * C

`Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Int -> Result () CudaError`
### fft-forward

Computes a 1D complex-to-complex FFT (forward). Input: interleaved real/imaginary pairs [r0, i0, r1, i1, ...].

`GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError`

### fft-inverse

Computes a 1D complex-to-complex FFT (inverse).

`GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError`

### fft-real-to-complex

Computes a 1D real-to-complex FFT. Input: n real values. Output: n/2 + 1 complex values (interleaved).

`GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError`

### fft-complex-to-real

Computes a 1D complex-to-real FFT (inverse of r2c).

`GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError`
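A round-trip sketch with n = 4 real samples. The forward transform yields n/2 + 1 = 3 complex bins (6 interleaved floats). Note that cuFFT inverse transforms are unnormalized, so unless this wrapper divides by n internally, the restored signal comes back scaled by n:

```kit
import Cuda

fft-roundtrip = fn =>
  signal = Cuda.to-device-f32 [1.0, 2.0, 3.0, 4.0] |> Result.unwrap
  spectrum = Cuda.fft-real-to-complex signal 4 |> Result.unwrap   # 3 complex bins, interleaved
  restored = Cuda.fft-complex-to-real spectrum 4 |> Result.unwrap
  print "${Cuda.to-host-f32 restored |> Result.unwrap}"           # input, possibly scaled by n = 4
  Cuda.free-f32 signal
  Cuda.free-f32 spectrum
  Cuda.free-f32 restored
```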