cuda

CUDA GPU computing via cuBLAS, cuFFT, and the CUDA runtime (requires an NVIDIA GPU and the CUDA Toolkit)

Files

File                        Description
.editorconfig               Editor formatting configuration
.gitignore                  Git ignore rules for build artifacts and dependencies
.tool-versions              asdf tool versions (Zig, Kit)
LICENSE                     MIT license file
README.md                   This file
examples/basic.kit          Basic usage example
examples/matmul.kit         Matrix multiplication example
kit.toml                    Package manifest with metadata and dependencies
src/main.kit                kit-cuda: CUDA GPU computing for Kit
tests/cuda.test.kit         Tests for CUDA runtime and cuBLAS bindings
tests/error-types.test.kit  Tests for CUDA error types
zig/cuda.zig                Zig FFI module for CUDA runtime, cuBLAS, and cuFFT
zig/kit_ffi.zig             Zig FFI helpers for Kit interop

Dependencies

No Kit package dependencies.

Native requirements:

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit 11.0 or newer
  • Linux with CUDA libraries available on the system library path

CUDA is not available on macOS; NVIDIA dropped macOS support after CUDA 10.2.

Installation

kit add gitlab.com/kit-lang/packages/kit-cuda.git

Install the CUDA Toolkit for your platform before building this package.

Ubuntu/Debian:

# See https://developer.nvidia.com/cuda-downloads

Fedora:

sudo dnf install cuda

Arch Linux:

sudo pacman -S cuda

Usage

import Kit.Cuda as Cuda

main = fn =>
  match Cuda.device-count
    | Err e ->
      println "CUDA error: ${show e}"
    | Ok 0 ->
      println "No CUDA devices found"
    | Ok n ->
      println "Found ${n} CUDA device(s)"

      props = Cuda.device-properties 0 |> Result.unwrap
      println "GPU: ${props.name}"
      println "Memory: ${props.total-memory / 1024 / 1024} MB"
      println "Compute: ${props.compute-major}.${props.compute-minor}"

      x = [1.0, 2.0, 3.0, 4.0]
      y = [5.0, 6.0, 7.0, 8.0]

      gx = Cuda.to-device-f32 x |> Result.unwrap
      gy = Cuda.to-device-f32 y |> Result.unwrap

      dot = Cuda.blas-dot-f32 gx gy |> Result.unwrap
      println "Dot product: ${dot}"

      Cuda.free-f32 gx
      Cuda.free-f32 gy

main

Development

Running Examples

Run examples with the interpreter:

kit run examples/basic.kit

Compile examples to a native binary:

kit build examples/basic.kit && ./basic

Run the matrix multiplication example:

kit run examples/matmul.kit

Running Tests

Run the test suite:

kit test

Run the test suite with coverage:

kit test --coverage

Running kit dev

Run the standard development workflow (format, check, test):

kit dev

This will:

  1. Format and check source files in src/
  2. Run tests in tests/ with coverage

Generating Documentation

Generate API documentation from doc comments:

kit doc

Note: doc comments (##) in Kit sources are rendered to HTML files in docs/.

Cleaning Build Artifacts

Remove generated files, caches, and build artifacts:

kit task clean

Note: the clean task is defined in kit.toml.

Local Installation

To install this package locally for development:

kit install

This installs the package to ~/.kit/packages/@kit/cuda/, making it available for import as Kit.Cuda in other projects.

License

This package is released under the MIT License - see LICENSE for details.

CUDA, cuBLAS, and cuFFT are NVIDIA technologies and are distributed under NVIDIA's own license terms.

Exported Functions & Types

CudaError

Error types for CUDA operations.

Variants

CudaError {code, message}
CUDA operation failed
DeviceError {message}
Device not found or not available
MemoryError {message}
Memory allocation failed on device
InvalidArgument {message}
Invalid argument passed to CUDA function
BlasError {code, message}
cuBLAS operation failed
FftError {code, message}
cuFFT operation failed

DeviceProperties

Properties of a CUDA device.

Variants

DeviceProperties {name, total-memory, compute-major, compute-minor, multi-processor-count, warp-size, max-threads-per-block, max-block-dim-x, max-block-dim-y, max-block-dim-z, max-grid-dim-x, max-grid-dim-y, max-grid-dim-z}

MemoryInfo

GPU memory information.

Variants

MemoryInfo {free, total}

GpuArrayF32

A handle to GPU-allocated memory for floats (f32). This is an opaque handle; do not modify it directly.

Variants

GpuArrayF32 {ptr, len}

GpuArrayF64

A handle to GPU-allocated memory for doubles (f64).

Variants

GpuArrayF64 {ptr, len}

GpuArrayInt

A handle to GPU-allocated memory for integers.

Variants

GpuArrayInt {ptr, len}

Stream

A CUDA stream for asynchronous operations.

Variants

Stream {handle}

device-count

Returns the number of CUDA-capable devices.

() -> Result Int CudaError

set-device

Sets the current CUDA device.

Int -> Result () CudaError

get-device

Gets the current CUDA device index.

() -> Result Int CudaError

device-properties

Gets properties of a CUDA device.

Int -> Result DeviceProperties CudaError

memory-info

Gets memory info for the current device.

() -> Result MemoryInfo CudaError

synchronize

Synchronizes the current device (waits for all operations to complete).

() -> Result () CudaError

reset

Resets the current device (frees all memory, destroys all streams).

() -> Result () CudaError
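
The device-management calls above compose into a short status check. A minimal sketch, assuming at least one CUDA device; real code should match on each Result as in the Usage section rather than unwrap:

import Kit.Cuda as Cuda

n = Cuda.device-count |> Result.unwrap
Cuda.set-device 0 |> Result.unwrap
info = Cuda.memory-info |> Result.unwrap
println "Device 0 of ${n}: ${info.free / 1024 / 1024} MB free of ${info.total / 1024 / 1024} MB"
Cuda.synchronize |> Result.unwrap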

malloc-f32

Allocates memory on the GPU for f32 values.

Int -> Result GpuArrayF32 CudaError

free-f32

Frees GPU memory.

GpuArrayF32 -> Result () CudaError

to-device-f32

Copies data from host (CPU) to device (GPU).

[Float] -> Result GpuArrayF32 CudaError

to-host-f32

Copies data from device (GPU) to host (CPU).

GpuArrayF32 -> Result [Float] CudaError
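
to-device-f32 and to-host-f32 form a round trip: whatever is uploaded comes back unchanged. A sketch (unwrapping assumes a device is available):

xs = [1.0, 2.0, 3.0]
gx = Cuda.to-device-f32 xs |> Result.unwrap
ys = Cuda.to-host-f32 gx |> Result.unwrap   -- ys == xs
Cuda.free-f32 gx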

malloc-f64

Allocates memory on the GPU for f64 values.

Int -> Result GpuArrayF64 CudaError

free-f64

Frees GPU memory for f64 array.

GpuArrayF64 -> Result () CudaError

to-device-f64

Copies f64 data from host to device.

[Float] -> Result GpuArrayF64 CudaError

to-host-f64

Copies f64 data from device to host.

GpuArrayF64 -> Result [Float] CudaError

stream-create

Creates a new CUDA stream for asynchronous operations.

() -> Result Stream CudaError

stream-destroy

Destroys a CUDA stream.

Stream -> Result () CudaError

stream-synchronize

Synchronizes a stream (waits for all operations in the stream to complete).

Stream -> Result () CudaError

stream-query

Queries if a stream has completed all operations.

Stream -> Result Bool CudaError
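
A typical stream lifecycle is create, enqueue work, poll or block, destroy. A sketch of the polling pattern, assuming a device is present:

s = Cuda.stream-create |> Result.unwrap

match Cuda.stream-query s
  | Err e ->
    println "CUDA error: ${show e}"
  | Ok true ->
    println "Stream finished"
  | Ok false ->
    Cuda.stream-synchronize s |> Result.unwrap

Cuda.stream-destroy s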

blas-dot-f32

Computes the dot product of two vectors on GPU: x . y

GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError

blas-norm-f32

Computes the Euclidean norm of a vector: ||x||_2

GpuArrayF32 -> Result Float CudaError

blas-asum-f32

Computes the sum of absolute values (L1 norm): sum(|x_i|)

GpuArrayF32 -> Result Float CudaError

blas-iamax-f32

Finds the index of the element with maximum absolute value.

GpuArrayF32 -> Result Int CudaError

blas-scale-f32

Scales a vector by a scalar: x = alpha * x

Float -> GpuArrayF32 -> Result () CudaError

blas-axpy-f32

AXPY: y = alpha * x + y

Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError

blas-copy-f32

Copies vector x to vector y: y = x

GpuArrayF32 -> GpuArrayF32 -> Result () CudaError
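
These level-1 routines operate in place on GPU memory. A sketch combining AXPY with a norm (note that blas-axpy-f32 overwrites y):

gx = Cuda.to-device-f32 [1.0, 2.0, 3.0] |> Result.unwrap
gy = Cuda.to-device-f32 [4.0, 5.0, 6.0] |> Result.unwrap

Cuda.blas-axpy-f32 2.0 gx gy |> Result.unwrap   -- gy is now [6.0, 9.0, 12.0]
nrm = Cuda.blas-norm-f32 gy |> Result.unwrap
println "||y||_2 = ${nrm}"

Cuda.free-f32 gx
Cuda.free-f32 gy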

blas-dot-f64

Computes the dot product of two f64 vectors on GPU.

GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError

blas-norm-f64

Computes the Euclidean norm of an f64 vector.

GpuArrayF64 -> Result Float CudaError

blas-axpy-f64

AXPY for f64: y = alpha * x + y

Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError

blas-scale-f64

Scales an f64 vector: x = alpha * x

Float -> GpuArrayF64 -> Result () CudaError

blas-gemv-f32

Matrix-vector multiplication: y = alpha * A * x + beta * y. A is an m x n matrix, x an n-element vector, and y an m-element vector.

Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Result () CudaError

blas-gemv-f64

f64 matrix-vector multiplication: y = alpha * A * x + beta * y

Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Result () CudaError

blas-gemm-f32

General matrix-matrix multiplication: C = alpha * A * B + beta * C. A is m x k, B is k x n, C is m x n.

Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Int -> Result () CudaError
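
A GEMM sketch for a 2x3 times 3x2 product. The three trailing Int arguments are assumed here to be m, n, k in that order, matching the description above; the element layout (row- vs column-major) is determined by the binding in zig/cuda.zig, so verify it against cuBLAS's column-major convention before relying on this:

-- C = 1.0 * A * B + 0.0 * C, with A 2x3, B 3x2, C 2x2
ga = Cuda.to-device-f32 [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] |> Result.unwrap
gb = Cuda.to-device-f32 [1.0, 0.0, 0.0, 1.0, 1.0, 1.0] |> Result.unwrap
gc = Cuda.malloc-f32 4 |> Result.unwrap

Cuda.blas-gemm-f32 1.0 ga gb 0.0 gc 2 2 3 |> Result.unwrap
c = Cuda.to-host-f32 gc |> Result.unwrap

Cuda.free-f32 ga
Cuda.free-f32 gb
Cuda.free-f32 gc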

blas-gemm-f64

f64 general matrix-matrix multiplication: C = alpha * A * B + beta * C

Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Int -> Result () CudaError

fft-forward

Computes 1D complex-to-complex FFT (forward). Input: interleaved real/imaginary pairs [r0, i0, r1, i1, ...]

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-inverse

Computes 1D complex-to-complex FFT (inverse).

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-real-to-complex

Computes 1D real-to-complex FFT. Input: n real values. Output: n/2+1 complex values (interleaved)

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-complex-to-real

Computes 1D complex-to-real FFT (the inverse of fft-real-to-complex).

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
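
An FFT round trip: forward then inverse on a 4-point complex signal. A sketch in which the Int argument is assumed to be the transform length n; note that cuFFT's inverse transform is unnormalized, so unless the binding rescales the result, each element comes back multiplied by n:

-- Impulse signal, interleaved [r0, i0, r1, i1, r2, i2, r3, i3]
gx = Cuda.to-device-f32 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] |> Result.unwrap

gf = Cuda.fft-forward gx 4 |> Result.unwrap
spectrum = Cuda.to-host-f32 gf |> Result.unwrap   -- flat spectrum for an impulse

gi = Cuda.fft-inverse gf 4 |> Result.unwrap
back = Cuda.to-host-f32 gi |> Result.unwrap

Cuda.free-f32 gx
Cuda.free-f32 gf
Cuda.free-f32 gi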