cuda
| Kind | ffi-zig |
|---|---|
| Capabilities | ffi |
| Categories | gpu parallel numeric ffi |
| Keywords | cuda gpu nvidia cublas cufft parallel compute zig-ffi |
CUDA GPU computing via cuBLAS, cuFFT, and CUDA runtime (requires NVIDIA GPU and CUDA toolkit)
Files
| File | Description |
|---|---|
| .editorconfig | Editor formatting configuration |
| .gitignore | Git ignore rules for build artifacts and dependencies |
| .tool-versions | asdf tool versions (Zig, Kit) |
| LICENSE | MIT license file |
| README.md | This file |
| examples/basic.kit | Basic usage example |
| examples/matmul.kit | Example: matrix multiplication |
| kit.toml | Package manifest with metadata and dependencies |
| src/main.kit | kit-cuda: CUDA GPU computing for Kit |
| tests/cuda.test.kit | Tests for CUDA runtime and cuBLAS bindings |
| tests/error-types.test.kit | Tests for CUDA error types |
| zig/cuda.zig | Zig FFI module for CUDA runtime, cuBLAS, and cuFFT |
| zig/kit_ffi.zig | Zig FFI helpers for Kit interop |
Dependencies
No Kit package dependencies.
Native requirements:
- NVIDIA GPU with CUDA support
- CUDA Toolkit 11.0 or newer
- Linux with CUDA libraries available on the system library path
macOS is not supported: CUDA Toolkit 10.2 was the last release with macOS support, and this package requires 11.0 or newer.
Installation
```bash
kit add gitlab.com/kit-lang/packages/kit-cuda.git
```
Install the CUDA Toolkit for your platform before building this package.
Ubuntu/Debian:
```bash
# See https://developer.nvidia.com/cuda-downloads
```
Fedora:
```bash
sudo dnf install cuda
```
Arch Linux:
```bash
sudo pacman -S cuda
```
Usage
```kit
import Kit.Cuda as Cuda

main = fn =>
  match Cuda.device-count
    | Err e ->
      println "CUDA error: ${show e}"
    | Ok 0 ->
      println "No CUDA devices found"
    | Ok n ->
      println "Found ${n} CUDA device(s)"
      props = Cuda.device-properties 0 |> Result.unwrap
      println "GPU: ${props.name}"
      println "Memory: ${props.total-memory / 1024 / 1024} MB"
      println "Compute: ${props.compute-major}.${props.compute-minor}"
      x = [1.0, 2.0, 3.0, 4.0]
      y = [5.0, 6.0, 7.0, 8.0]
      gx = Cuda.to-device-f32 x |> Result.unwrap
      gy = Cuda.to-device-f32 y |> Result.unwrap
      dot = Cuda.blas-dot-f32 gx gy |> Result.unwrap
      println "Dot product: ${dot}"
      Cuda.free-f32 gx
      Cuda.free-f32 gy

main
```
Development
Running Examples
Run examples with the interpreter:
```bash
kit run examples/basic.kit
```
Compile examples to a native binary:
```bash
kit build examples/basic.kit && ./basic
```
Run the matrix multiplication example:
```bash
kit run examples/matmul.kit
```
Running Tests
Run the test suite:
```bash
kit test
```
Run the test suite with coverage:
```bash
kit test --coverage
```
Running kit dev
Run the standard development workflow (format, check, test):
```bash
kit dev
```
This will:
- Format and check source files in src/
- Run tests in tests/ with coverage
Generating Documentation
Generate API documentation from doc comments:
```bash
kit doc
```
Note: Kit sources with doc comments (##) generate HTML documents in docs/*.html.
Cleaning Build Artifacts
Remove generated files, caches, and build artifacts:
```bash
kit task clean
```
Note: The clean task is defined in kit.toml.
Local Installation
To install this package locally for development:
```bash
kit install
```
This installs the package to ~/.kit/packages/@kit/cuda/, making it available for import as Kit.Cuda in other projects.
License
This package is released under the MIT License - see LICENSE for details.
CUDA, cuBLAS, and cuFFT are NVIDIA technologies and are distributed under NVIDIA's own license terms.
Exported Functions & Types
CudaError
Error types for CUDA operations.
Variants
CudaError {code, message}
DeviceError {message}
MemoryError {message}
InvalidArgument {message}
BlasError {code, message}
FftError {code, message}
DeviceProperties
Properties of a CUDA device.
Variants
DeviceProperties {name, total-memory, compute-major, compute-minor, multi-processor-count, warp-size, max-threads-per-block, max-block-dim-x, max-block-dim-y, max-block-dim-z, max-grid-dim-x, max-grid-dim-y, max-grid-dim-z}
MemoryInfo
GPU memory information.
Variants
MemoryInfo {free, total}
GpuArrayF32
A handle to GPU-allocated memory for floats (f32). This is an opaque handle - do not modify directly.
Variants
GpuArrayF32 {ptr, len}
GpuArrayF64
A handle to GPU-allocated memory for doubles (f64).
Variants
GpuArrayF64 {ptr, len}
GpuArrayInt
A handle to GPU-allocated memory for integers.
Variants
GpuArrayInt {ptr, len}
Stream
A CUDA stream for asynchronous operations.
Variants
Stream {handle}
device-count
Returns the number of CUDA-capable devices.
() -> Result Int CudaError
set-device
Sets the current CUDA device.
Int -> Result () CudaError
get-device
Gets the current CUDA device index.
() -> Result Int CudaError
device-properties
Gets properties of a CUDA device.
Int -> Result DeviceProperties CudaError
memory-info
Gets memory info for the current device.
() -> Result MemoryInfo CudaError
synchronize
Synchronizes the current device (waits for all operations to complete).
() -> Result () CudaError
reset
Resets the current device (frees all memory, destroys all streams).
() -> Result () CudaError
malloc-f32
Allocates memory on the GPU for f32 values.
Int -> Result GpuArrayF32 CudaError
free-f32
Frees GPU memory for an f32 array.
GpuArrayF32 -> Result () CudaError
to-device-f32
Copies data from host (CPU) to device (GPU).
[Float] -> Result GpuArrayF32 CudaError
to-host-f32
Copies data from device (GPU) to host (CPU).
GpuArrayF32 -> Result [Float] CudaError
malloc-f64
Allocates memory on the GPU for f64 values.
Int -> Result GpuArrayF64 CudaError
free-f64
Frees GPU memory for an f64 array.
GpuArrayF64 -> Result () CudaError
to-device-f64
Copies f64 data from host to device.
[Float] -> Result GpuArrayF64 CudaError
to-host-f64
Copies f64 data from device to host.
GpuArrayF64 -> Result [Float] CudaError
stream-create
Creates a new CUDA stream for asynchronous operations.
() -> Result Stream CudaError
stream-destroy
Destroys a CUDA stream.
Stream -> Result () CudaError
stream-synchronize
Synchronizes a stream (waits for all operations in the stream to complete).
Stream -> Result () CudaError
stream-query
Queries if a stream has completed all operations.
Stream -> Result Bool CudaError
blas-dot-f32
Computes the dot product of two vectors on GPU: x . y
GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError
blas-norm-f32
Computes the Euclidean norm of a vector: ||x||_2
GpuArrayF32 -> Result Float CudaError
blas-asum-f32
Computes the sum of absolute values (L1 norm): sum(|x_i|)
GpuArrayF32 -> Result Float CudaError
blas-iamax-f32
Finds the index of the element with maximum absolute value.
GpuArrayF32 -> Result Int CudaError
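As a cross-check for the four reductions above, the following is a minimal host-side reference in Python. It is not part of this package: the helper names `dot`, `norm`, `asum`, and `iamax` are hypothetical, and the results use Python's double precision, so f32 GPU results may differ in the last digits. The sketch returns a 0-based index from `iamax`; cuBLAS itself reports a 1-based index, and this README does not say which convention the binding exposes.

```python
import math


def dot(x, y):
    """Dot product x . y of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


def norm(x):
    """Euclidean norm ||x||_2."""
    return math.sqrt(sum(xi * xi for xi in x))


def asum(x):
    """Sum of absolute values, sum(|x_i|)."""
    return sum(abs(xi) for xi in x)


def iamax(x):
    """0-based index of the first element with maximum absolute value.

    Assumption: 0-based for illustration only; the binding may differ.
    """
    return max(range(len(x)), key=lambda i: abs(x[i]))


if __name__ == "__main__":
    # Same vectors as the usage example earlier in this README.
    print(dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))  # 70.0
    print(norm([3.0, 4.0]))                                 # 5.0
```

Comparing GPU results against a reference like this with a small tolerance is one way to test a binding when exact f32 equality cannot be expected.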
blas-scale-f32
Scales a vector by a scalar: x = alpha * x
Float -> GpuArrayF32 -> Result () CudaError
blas-axpy-f32
AXPY: y = alpha * x + y
Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError
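The scale and AXPY formulas above can be sketched on the host as follows. This is an illustrative Python reference, not the package's code, and the helper names are hypothetical. One difference to keep in mind: the GPU operations return `Result ()` and update their last vector argument in place, while this sketch returns new lists.

```python
def scale(alpha, x):
    """Reference for x = alpha * x (returns a new list)."""
    return [alpha * xi for xi in x]


def axpy(alpha, x, y):
    """Reference for y = alpha * x + y (returns a new list)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [5.0, 6.0, 7.0, 8.0]
    print(scale(0.5, x))    # [0.5, 1.0, 1.5, 2.0]
    print(axpy(2.0, x, y))  # [7.0, 10.0, 13.0, 16.0]
```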
blas-copy-f32
Copies vector x to vector y: y = x
GpuArrayF32 -> GpuArrayF32 -> Result () CudaError
blas-dot-f64
Computes the dot product of two f64 vectors on GPU.
GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError
blas-norm-f64
Computes the Euclidean norm of an f64 vector.
GpuArrayF64 -> Result Float CudaError
blas-axpy-f64
AXPY for f64: y = alpha * x + y
Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError
blas-scale-f64
Scales an f64 vector: x = alpha * x
Float -> GpuArrayF64 -> Result () CudaError
blas-gemv-f32
Matrix-vector multiplication: y = alpha * A * x + beta * y, where A is an m x n matrix, x is an n-element vector, and y is an m-element vector.
Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Result () CudaError
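To make the GEMV formula concrete, here is a small host-side reference in Python. It is a sketch under stated assumptions, not the binding itself: the helper `gemv` is hypothetical, and it assumes A is passed as a flat list in row-major order. cuBLAS natively works with column-major storage, and this README does not state which layout the binding expects, so check the package source before relying on either.

```python
def gemv(alpha, a, x, beta, y, m, n):
    """Reference for y = alpha * A * x + beta * y.

    a is a flat list of m * n values.
    Assumption: row-major, so A[i][j] is a[i * n + j].
    """
    if len(a) != m * n or len(x) != n or len(y) != m:
        raise ValueError("dimension mismatch")
    return [
        alpha * sum(a[i * n + j] * x[j] for j in range(n)) + beta * y[i]
        for i in range(m)
    ]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0]   # 2 x 3 matrix
    x = [1.0, 1.0, 1.0]
    y = [10.0, 20.0]
    print(gemv(1.0, a, x, 0.5, y, 2, 3))  # [11.0, 25.0]
```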
blas-gemv-f64
f64 matrix-vector multiplication: y = alpha * A * x + beta * y
Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Result () CudaError
blas-gemm-f32
General matrix-matrix multiplication: C = alpha * A * B + beta * C, where A is m x k, B is k x n, and C is m x n.
Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Int -> Result () CudaError
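The GEMM formula and its shape rules can be illustrated with a host-side reference in Python. As before, this is only a sketch: `gemm` is a hypothetical helper, the flat row-major layout is an assumption (cuBLAS is column-major by default), and the order of the three `Int` arguments in the binding's signature is not documented here, so the sketch names them explicitly.

```python
def gemm(alpha, a, b, beta, c, m, n, k):
    """Reference for C = alpha * A * B + beta * C.

    A is m x k, B is k x n, C is m x n, each a flat list.
    Assumption: row-major storage for all three matrices.
    """
    if len(a) != m * k or len(b) != k * n or len(c) != m * n:
        raise ValueError("dimension mismatch")
    out = []
    for i in range(m):
        for j in range(n):
            acc = sum(a[i * k + p] * b[p * n + j] for p in range(k))
            out.append(alpha * acc + beta * c[i * n + j])
    return out


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]   # [[1, 2], [3, 4]]
    b = [5.0, 6.0, 7.0, 8.0]   # [[5, 6], [7, 8]]
    c = [0.0, 0.0, 0.0, 0.0]
    print(gemm(1.0, a, b, 0.0, c, m=2, n=2, k=2))  # [19.0, 22.0, 43.0, 50.0]
```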
blas-gemm-f64
f64 general matrix-matrix multiplication: C = alpha * A * B + beta * C
Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Int -> Result () CudaError
fft-forward
Computes 1D complex-to-complex FFT (forward). Input: interleaved real/imaginary pairs [r0, i0, r1, i1, ...]
GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
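The interleaved layout described above can be illustrated on the host. The Python sketch below packs complex values into a flat `[r0, i0, r1, i1, ...]` list and applies a naive O(n^2) forward DFT as a reference. The helper names are hypothetical and the code does not call this package; it only shows the data layout and the transform a forward FFT computes.

```python
import cmath


def interleave(values):
    """Pack complex numbers as [r0, i0, r1, i1, ...]."""
    flat = []
    for z in values:
        flat.extend([z.real, z.imag])
    return flat


def deinterleave(flat):
    """Unpack [r0, i0, r1, i1, ...] into complex numbers."""
    if len(flat) % 2 != 0:
        raise ValueError("interleaved data must have even length")
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def dft(flat):
    """Naive forward DFT on interleaved data, returned interleaved."""
    z = deinterleave(flat)
    n = len(z)
    out = [
        sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return interleave(out)


if __name__ == "__main__":
    # A unit impulse of 4 complex points occupies 8 floats.
    impulse = interleave([1 + 0j, 0j, 0j, 0j])
    print(len(impulse))   # 8
    print(dft(impulse))   # every bin is approximately 1 + 0i
```

Note that cuFFT leaves its transforms unnormalized, so a forward transform followed by an inverse scales the data by n; whether this binding rescales the result is not stated in this README.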
fft-inverse
Computes 1D complex-to-complex FFT (inverse).
GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
fft-real-to-complex
Computes 1D real-to-complex FFT. Input: n real values. Output: n/2+1 complex values (interleaved)
GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
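The output size stated above follows from conjugate symmetry: for real input, bin n - k is the complex conjugate of bin k, so only bins 0 through n/2 (integer division) need to be stored. The Python sketch below, with a hypothetical helper name, works out how many complex bins and how many interleaved floats that implies.

```python
def r2c_output_sizes(n):
    """Return (complex_bins, interleaved_floats) for n real input values."""
    if n <= 0:
        raise ValueError("n must be positive")
    bins = n // 2 + 1          # integer division, as in n/2+1
    return bins, 2 * bins      # each complex bin takes two floats


if __name__ == "__main__":
    print(r2c_output_sizes(8))  # (5, 10)
    print(r2c_output_sizes(7))  # (4, 8)
```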
fft-complex-to-real
Computes 1D complex-to-real FFT (inverse of r2c).
GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError