cuda

CUDA GPU computing via cuBLAS, cuFFT, and the CUDA runtime (requires an NVIDIA GPU and the CUDA Toolkit)

Files

File                        Description
.editorconfig               Editor formatting configuration
.gitignore                  Git ignore rules for build artifacts and dependencies
.tool-versions              asdf tool versions (Zig, Kit)
LICENSE                     MIT license file
README.md                   This file
examples/basic.kit          Basic usage example
examples/matmul.kit         Matrix multiplication example
kit.toml                    Package manifest with metadata and dependencies
src/main.kit                kit-cuda: CUDA GPU computing for Kit
tests/cuda.test.kit         Tests for CUDA runtime and cuBLAS bindings
tests/error-types.test.kit  Tests for CUDA error types
zig/cuda.zig                Zig FFI module for CUDA runtime, cuBLAS, and cuFFT
zig/kit_ffi.zig             Zig FFI helpers for Kit interop

Dependencies

No Kit package dependencies.

Native requirements:

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit 11.0 or newer
  • Linux with CUDA libraries available on the system library path

CUDA is not available on macOS; NVIDIA dropped macOS support after CUDA 10.2.

Installation

kit add gitlab.com/kit-lang/packages/kit-cuda.git

Install the CUDA Toolkit for your platform before building this package.

Ubuntu/Debian:

# See https://developer.nvidia.com/cuda-downloads

Fedora:

sudo dnf install cuda

Arch Linux:

sudo pacman -S cuda

Usage

import Kit.Cuda as Cuda

main = fn =>
  match Cuda.device-count
    | Err e ->
      println "CUDA error: ${show e}"
    | Ok 0 ->
      println "No CUDA devices found"
    | Ok n ->
      println "Found ${n} CUDA device(s)"

      props = Cuda.device-properties 0 |> Result.unwrap
      println "GPU: ${props.name}"
      println "Memory: ${props.total-memory / 1024 / 1024} MB"
      println "Compute: ${props.compute-major}.${props.compute-minor}"

      x = [1.0, 2.0, 3.0, 4.0]
      y = [5.0, 6.0, 7.0, 8.0]

      gx = Cuda.to-device-f32 x |> Result.unwrap
      gy = Cuda.to-device-f32 y |> Result.unwrap

      dot = Cuda.blas-dot-f32 gx gy |> Result.unwrap
      println "Dot product: ${dot}"

      Cuda.free-f32 gx
      Cuda.free-f32 gy

main

Development

Running Examples

Run examples with the interpreter:

kit run examples/basic.kit

Compile examples to a native binary:

kit build examples/basic.kit && ./basic

Run the matrix multiplication example:

kit run examples/matmul.kit

Running Tests

Run the test suite:

kit test

Run the test suite with coverage:

kit test --coverage

Running kit dev

Run the standard development workflow (format, check, test):

kit dev

This will:

  1. Format and check source files in src/
  2. Run tests in tests/ with coverage

Generating Documentation

Generate API documentation from doc comments:

kit doc

Note: doc comments (##) in Kit sources are rendered to HTML files in docs/.

Cleaning Build Artifacts

Remove generated files, caches, and build artifacts:

kit task clean

Note: the clean task is defined in kit.toml.

Local Installation

To install this package locally for development:

kit install

This installs the package to ~/.kit/packages/@kit/cuda/, making it available for import as Kit.Cuda in other projects.

License

This package is released under the MIT License - see LICENSE for details.

CUDA, cuBLAS, and cuFFT are NVIDIA technologies and are distributed under NVIDIA's own license terms.

Exported Functions & Types

CudaError

Error types for CUDA operations.

Variants

CudaError {code, message}
CUDA operation failed
DeviceError {message}
Device not found or not available
MemoryError {message}
Memory allocation failed on device
InvalidArgument {message}
Invalid argument passed to CUDA function
BlasError {code, message}
cuBLAS operation failed
FftError {code, message}
cuFFT operation failed

DeviceProperties

Properties of a CUDA device.

Variants

DeviceProperties {name, total-memory, compute-major, compute-minor, multi-processor-count, warp-size, max-threads-per-block, max-block-dim-x, max-block-dim-y, max-block-dim-z, max-grid-dim-x, max-grid-dim-y, max-grid-dim-z}

MemoryInfo

GPU memory information.

Variants

MemoryInfo {free, total}

GpuArrayF32

A handle to GPU-allocated memory for floats (f32). This is an opaque handle; do not modify it directly.

Variants

GpuArrayF32 {ptr, len}

GpuArrayF64

A handle to GPU-allocated memory for doubles (f64).

Variants

GpuArrayF64 {ptr, len}

GpuArrayInt

A handle to GPU-allocated memory for integers.

Variants

GpuArrayInt {ptr, len}

Stream

A CUDA stream for asynchronous operations.

Variants

Stream {handle}

device-count

Returns the number of CUDA-capable devices.

() -> Result Int CudaError

set-device

Sets the current CUDA device.

Int -> Result () CudaError

get-device

Gets the current CUDA device index.

() -> Result Int CudaError

device-properties

Gets properties of a CUDA device.

Int -> Result DeviceProperties CudaError

memory-info

Gets memory info for the current device.

() -> Result MemoryInfo CudaError

synchronize

Synchronizes the current device (waits for all operations to complete).

() -> Result () CudaError

reset

Resets the current device (frees all memory, destroys all streams).

() -> Result () CudaError
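
The device-management calls above compose into a short status check. A minimal sketch, assuming at least one CUDA device; real code should match on each Result as in the Usage section rather than unwrap:

import Kit.Cuda as Cuda

n = Cuda.device-count |> Result.unwrap
Cuda.set-device 0 |> Result.unwrap
info = Cuda.memory-info |> Result.unwrap
println "Device 0 of ${n}: ${info.free / 1024 / 1024} MB free of ${info.total / 1024 / 1024} MB"
Cuda.synchronize |> Result.unwrap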

malloc-f32

Allocates memory on the GPU for f32 values.

Int -> Result GpuArrayF32 CudaError

free-f32

Frees GPU memory.

GpuArrayF32 -> Result () CudaError

to-device-f32

Copies data from host (CPU) to device (GPU).

[Float] -> Result GpuArrayF32 CudaError

to-host-f32

Copies data from device (GPU) to host (CPU).

GpuArrayF32 -> Result [Float] CudaError
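
to-device-f32 and to-host-f32 form a round trip: whatever is uploaded comes back unchanged. A sketch (unwrapping assumes a device is available):

xs = [1.0, 2.0, 3.0]
gx = Cuda.to-device-f32 xs |> Result.unwrap
ys = Cuda.to-host-f32 gx |> Result.unwrap   -- ys == xs
Cuda.free-f32 gx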

malloc-f64

Allocates memory on the GPU for f64 values.

Int -> Result GpuArrayF64 CudaError

free-f64

Frees GPU memory for f64 array.

GpuArrayF64 -> Result () CudaError

to-device-f64

Copies f64 data from host to device.

[Float] -> Result GpuArrayF64 CudaError

to-host-f64

Copies f64 data from device to host.

GpuArrayF64 -> Result [Float] CudaError

stream-create

Creates a new CUDA stream for asynchronous operations.

() -> Result Stream CudaError

stream-destroy

Destroys a CUDA stream.

Stream -> Result () CudaError

stream-synchronize

Synchronizes a stream (waits for all operations in the stream to complete).

Stream -> Result () CudaError

stream-query

Queries if a stream has completed all operations.

Stream -> Result Bool CudaError
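
A typical stream lifecycle is create, enqueue work, poll or block, destroy. A sketch of the polling pattern, assuming a device is present:

s = Cuda.stream-create |> Result.unwrap

match Cuda.stream-query s
  | Err e ->
    println "CUDA error: ${show e}"
  | Ok true ->
    println "Stream finished"
  | Ok false ->
    Cuda.stream-synchronize s |> Result.unwrap

Cuda.stream-destroy s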

blas-dot-f32

Computes the dot product of two vectors on GPU: x . y

GpuArrayF32 -> GpuArrayF32 -> Result Float CudaError

blas-norm-f32

Computes the Euclidean norm of a vector: ||x||_2

GpuArrayF32 -> Result Float CudaError

blas-asum-f32

Computes the sum of absolute values (L1 norm): sum(|x_i|)

GpuArrayF32 -> Result Float CudaError

blas-iamax-f32

Finds the index of the element with maximum absolute value.

GpuArrayF32 -> Result Int CudaError

blas-scale-f32

Scales a vector by a scalar: x = alpha * x

Float -> GpuArrayF32 -> Result () CudaError

blas-axpy-f32

AXPY: y = alpha * x + y

Float -> GpuArrayF32 -> GpuArrayF32 -> Result () CudaError

blas-copy-f32

Copies vector x to vector y: y = x

GpuArrayF32 -> GpuArrayF32 -> Result () CudaError
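
These level-1 routines operate in place on GPU memory. A sketch combining AXPY with a norm (note that blas-axpy-f32 overwrites y):

gx = Cuda.to-device-f32 [1.0, 2.0, 3.0] |> Result.unwrap
gy = Cuda.to-device-f32 [4.0, 5.0, 6.0] |> Result.unwrap

Cuda.blas-axpy-f32 2.0 gx gy |> Result.unwrap   -- gy is now [6.0, 9.0, 12.0]
nrm = Cuda.blas-norm-f32 gy |> Result.unwrap
println "||y||_2 = ${nrm}"

Cuda.free-f32 gx
Cuda.free-f32 gy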

blas-dot-f64

Computes the dot product of two f64 vectors on GPU.

GpuArrayF64 -> GpuArrayF64 -> Result Float CudaError

blas-norm-f64

Computes the Euclidean norm of an f64 vector.

GpuArrayF64 -> Result Float CudaError

blas-axpy-f64

AXPY for f64: y = alpha * x + y

Float -> GpuArrayF64 -> GpuArrayF64 -> Result () CudaError

blas-scale-f64

Scales an f64 vector: x = alpha * x

Float -> GpuArrayF64 -> Result () CudaError

blas-gemv-f32

Matrix-vector multiplication: y = alpha * A * x + beta * y. A is an m x n matrix, x an n-element vector, and y an m-element vector.

Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Result () CudaError

blas-gemv-f64

f64 matrix-vector multiplication: y = alpha * A * x + beta * y

Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Result () CudaError

blas-gemm-f32

General matrix-matrix multiplication: C = alpha * A * B + beta * C. A is m x k, B is k x n, C is m x n.

Float -> GpuArrayF32 -> GpuArrayF32 -> Float -> GpuArrayF32 -> Int -> Int -> Int -> Result () CudaError
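
A GEMM sketch for a 2x3 times 3x2 product. The three trailing Int arguments are assumed here to be m, n, k in that order, matching the description above; the element layout (row- vs column-major) is determined by the binding in zig/cuda.zig, so verify it against cuBLAS's column-major convention before relying on this:

-- C = 1.0 * A * B + 0.0 * C, with A 2x3, B 3x2, C 2x2
ga = Cuda.to-device-f32 [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] |> Result.unwrap
gb = Cuda.to-device-f32 [1.0, 0.0, 0.0, 1.0, 1.0, 1.0] |> Result.unwrap
gc = Cuda.malloc-f32 4 |> Result.unwrap

Cuda.blas-gemm-f32 1.0 ga gb 0.0 gc 2 2 3 |> Result.unwrap
c = Cuda.to-host-f32 gc |> Result.unwrap

Cuda.free-f32 ga
Cuda.free-f32 gb
Cuda.free-f32 gc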

blas-gemm-f64

f64 general matrix-matrix multiplication: C = alpha * A * B + beta * C

Float -> GpuArrayF64 -> GpuArrayF64 -> Float -> GpuArrayF64 -> Int -> Int -> Int -> Result () CudaError

fft-forward

Computes 1D complex-to-complex FFT (forward). Input: interleaved real/imaginary pairs [r0, i0, r1, i1, ...]

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-inverse

Computes 1D complex-to-complex FFT (inverse).

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-real-to-complex

Computes 1D real-to-complex FFT. Input: n real values. Output: n/2+1 complex values (interleaved)

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError

fft-complex-to-real

Computes 1D complex-to-real FFT (the inverse of fft-real-to-complex).

GpuArrayF32 -> Int -> Result GpuArrayF32 CudaError
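
An FFT round trip: forward then inverse on a 4-point complex signal. A sketch in which the Int argument is assumed to be the transform length n; note that cuFFT's inverse transform is unnormalized, so unless the binding rescales the result, each element comes back multiplied by n:

-- Impulse signal, interleaved [r0, i0, r1, i1, r2, i2, r3, i3]
gx = Cuda.to-device-f32 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] |> Result.unwrap

gf = Cuda.fft-forward gx 4 |> Result.unwrap
spectrum = Cuda.to-host-f32 gf |> Result.unwrap   -- flat spectrum for an impulse

gi = Cuda.fft-inverse gf 4 |> Result.unwrap
back = Cuda.to-host-f32 gi |> Result.unwrap

Cuda.free-f32 gx
Cuda.free-f32 gf
Cuda.free-f32 gi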