arrow

Apache Arrow columnar data format for Kit

Files

| File | Description |
| --- | --- |
| .editorconfig | Editor formatting configuration |
| .gitignore | Git ignore rules for build artifacts and dependencies |
| .tool-versions | asdf tool versions (Zig, Kit) |
| LICENSE | MIT license file |
| README.md | This file |
| c/kit_arrow.cpp | C++ FFI wrapper for Apache Arrow |
| c/kit_arrow.h | C header for FFI wrapper |
| examples/analytics.kit | Example: aggregation and sorting |
| examples/basic.kit | Basic usage example |
| examples/file-io.kit | Example: reading and writing Arrow IPC files |
| examples/tables.kit | Example: creating and querying tables |
| kit.toml | Package manifest with metadata and dependencies |
| src/arrow.kit | kit-arrow: Apache Arrow Columnar Data Format for Kit |
| tests/arrow.test.kit | Tests for arrow |
| tests/types.test.kit | Tests for data types |

Dependencies

Native Dependencies

The Apache Arrow C++ library is required:

| Platform | Install Command |
| --- | --- |
| macOS | brew install apache-arrow |
| Ubuntu | sudo apt install libarrow-dev |
| Fedora | sudo dnf install libarrow-devel |

Installation

kit add gitlab.com/kit-lang/packages/kit-arrow.git

Usage

Table Builder

The easiest way to create Arrow tables is with the fluent builder:

import Kit.Arrow as Arrow

create-users = fn =>
  Arrow.build-table
    |> Arrow.add-column "id" Arrow.int64 [1, 2, 3]
    |> Arrow.add-column "name" Arrow.string-type ["Alice", "Bob", "Carol"]
    |> Arrow.add-column "score" Arrow.float64 [95.5, 87.3, 91.0]
    |> Arrow.finalize

main = fn =>
  match create-users
    | Ok tbl ->
      println "Rows: ${show (Arrow.table-num-rows tbl)}"
      println "Columns: ${show (Arrow.table-num-columns tbl)}"

      # DataFrame operations
      match Arrow.table-head 2 tbl
        | Ok top2 ->
          println "First 2 rows: ${show (Arrow.table-num-rows top2)}"
          Arrow.free-table top2
        | Err e -> println "Error: ${show e}"

      Arrow.free-table tbl
    | Err e -> println "Error: ${show e}"

main

The builder supports all primitive Arrow types (int8 through uint64, float32, float64, bool-type, string-type, binary-type, date-type, timestamp-type). Schema is derived automatically from the add-column calls.

Reading Files

import Kit.Arrow as Arrow

main = fn =>
  # Read a Parquet file
  match Arrow.read-parquet "data.parquet"
    | Ok tbl ->
      println "Loaded ${show (Arrow.table-num-rows tbl)} rows"

      total = Arrow.sum "id" tbl
      avg = Arrow.mean "id" tbl
      println "Sum: ${show total}, Mean: ${show avg}"

      Arrow.free-table tbl
    | Err e -> println "Error: ${show e}"

  # Also supports Arrow IPC and CSV
  # Arrow.read-ipc "data.arrow"
  # Arrow.read-csv "data.csv"

main

Development

Running Examples

Run examples with the interpreter:

kit run examples/basic.kit

Compile examples to a native binary:

kit build examples/basic.kit && ./basic

Running Tests

Run the test suite:

kit test

Run the test suite with coverage:

kit test --coverage

Running kit dev

Run the standard development workflow (format, check, test):

kit dev

This will:

  1. Format and check source files in src/
  2. Run tests in tests/ with coverage

Generating Documentation

Generate API documentation from doc comments:

kit doc

Note: Doc comments (##) in Kit sources are rendered to HTML files under docs/ (docs/*.html).

Cleaning Build Artifacts

Remove generated files, caches, and build artifacts:

kit task clean

Note: The clean task is defined in kit.toml.

Local Installation

To install this package locally for development:

kit install

This installs the package to ~/.kit/packages/@kit/arrow/, making it available for import as Kit.Arrow in other projects.

License

This package is released under the MIT License - see LICENSE for details.

Apache Arrow is an open-source columnar memory format maintained by the Apache Software Foundation under the Apache License 2.0.

Exported Functions & Types

ArrowError

Arrow error type for typed error handling.

Variants

ArrowCreateError {message}
ArrowIOError {message}
ArrowAccessError {message}

DataType

Arrow data types for columnar arrays.

These types correspond to Apache Arrow's type system and determine how data is stored in memory. Arrow uses fixed-width types for efficient SIMD operations and variable-width types for strings and binary data.

Primitive Types:

  • Int8Type, Int16Type, Int32Type, Int64Type: Signed integers
  • UInt8Type, UInt16Type, UInt32Type, UInt64Type: Unsigned integers
  • Float32Type, Float64Type: IEEE 754 floating-point numbers
  • BoolType: Boolean values (stored as bit-packed array)
  • StringType: UTF-8 encoded variable-length strings
  • BinaryType: Variable-length binary data

Temporal Types:

  • DateType: Date as days since Unix epoch (1970-01-01)
  • TimestampType: Timestamp as microseconds since Unix epoch

Nested Types:

  • ListType DataType: Variable-length list of elements
  • StructType List Field: Record with named fields

Variants

Int8Type
Int16Type
Int32Type
Int64Type
UInt8Type
UInt16Type
UInt32Type
UInt64Type
Float32Type
Float64Type
BoolType
StringType
BinaryType
DateType
TimestampType
ListType {DataType}
StructType {List Field}

int8

8-bit signed integer type. Range: -128 to 127

int16

16-bit signed integer type. Range: -32,768 to 32,767

int32

32-bit signed integer type. Range: -2,147,483,648 to 2,147,483,647

int64

64-bit signed integer type. Range: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

uint8

8-bit unsigned integer type. Range: 0 to 255

uint16

16-bit unsigned integer type. Range: 0 to 65,535

uint32

32-bit unsigned integer type. Range: 0 to 4,294,967,295

uint64

64-bit unsigned integer type. Range: 0 to 18,446,744,073,709,551,615

float32

32-bit IEEE 754 single-precision floating-point type. Precision: ~7 decimal digits

float64

64-bit IEEE 754 double-precision floating-point type. Precision: ~15-17 decimal digits

bool-type

Boolean type stored as bit-packed array. Values: true or false

string-type

UTF-8 encoded variable-length string type. Stored as offsets + data buffer for efficient slicing.

binary-type

Variable-length binary data type. Similar to string but without UTF-8 encoding requirement.

date-type

Date type representing days since Unix epoch (1970-01-01). Stored as 32-bit signed integer.

timestamp-type

Timestamp type representing microseconds since Unix epoch. Stored as 64-bit signed integer.

list-type

Creates a list type containing elements of the given type.

DataType -> DataType

# List of integers
int-list-type = Arrow.list-type Arrow.int64
# Nested list of strings
nested-type = Arrow.list-type (Arrow.list-type Arrow.string-type)

struct-type

Creates a struct type with the given fields.

List Field -> DataType

point-type = Arrow.struct-type [
  Arrow.field "x" Arrow.float64,
  Arrow.field "y" Arrow.float64
]

field

Creates a nullable field with the given name and data type. Nullable fields can contain None/null values in addition to values of the specified type.

NonEmptyString -> DataType -> Field

age-field = Arrow.field "age" Arrow.int32  # Can be None

field-not-null

Creates a non-nullable field with the given name and data type. Non-nullable fields must always contain a value; null is not permitted. Use for required columns that should never be missing.

NonEmptyString -> DataType -> Field

id-field = Arrow.field-not-null "id" Arrow.int64  # Required

int64-array

Creates an Int64 array from a list of integers. Allocates a new Arrow array in columnar format with 64-bit signed integers.

List Int -> Result Array ArrowError

match Arrow.int64-array [1, 2, 3, 4, 5]
  | Ok arr ->
    print "Created array with ${Arrow.array-length arr} elements"
    Arrow.free-array arr
  | Err msg -> print "Error: ${msg}"

float64-array

Creates a Float64 array from a list of floats. Allocates a new Arrow array in columnar format with 64-bit floating-point numbers.

List Float -> Result Array ArrowError

string-array

Creates a String array from a list of strings. Allocates a new Arrow array for UTF-8 encoded variable-length strings.

List String -> Result Array ArrowError

bool-array

Creates a Bool array from a list of booleans. Allocates a new Arrow array for boolean values stored as bit-packed data.

List Bool -> Result Array ArrowError

array-length

Returns the length of an array.

Array -> Int

get

Gets the value at the given index, returning None if out of bounds or null.

Array -> Int -> Option a
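The following is a minimal sketch of reading a single element with get. It assumes the argument order shown in the signature (array first, then index) and that a present value comes back as Some. The sample values are illustrative only.

```kit
import Kit.Arrow as Arrow

main = fn =>
  match Arrow.int64-array [10, 20, 30]
    | Ok arr ->
      # Index 1 is within bounds, so a value is expected here
      match Arrow.get arr 1
        | Some v -> println "Value: ${show v}"
        | None -> println "Out of bounds or null"

      Arrow.free-array arr
    | Err e -> println "Error: ${show e}"

main
```

Because None covers both the out-of-bounds case and the null case, is-null? can be used first when the two need to be told apart.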

is-null?

Checks if the value at the given index is null.

Parameters:

  • `arr - Array` - The Arrow array
  • `index - Int` - Zero-based index to check

Returns: true if the value at the index is null, false otherwise.

Array -> Int -> Bool

if Arrow.is-null? arr 5 then
  print "Element 5 is null"

null-count

Returns the count of null values in the array.

Parameters:

  • `arr - Array` - The Arrow array

Returns: The number of null values as an Int.

Array -> Int

nulls = Arrow.null-count arr
print "Array has ${nulls} null values"

to-list

Converts an array to a list of Option values.

Creates a Kit list from an Arrow array, preserving null information. Note: This copies all data from columnar to row format.

Parameters:

  • `arr - Array` - The Arrow array to convert

Returns: A list of Option values, with None in place of each null element.

Array -> List (Option a)

values = Arrow.to-list arr
List.iter (fn(v) => match v | Some x -> print x | None -> print "null") values

free-array

Frees the memory used by an array.

Releases the native memory allocated for the Arrow array. Must be called when the array is no longer needed to prevent memory leaks.

Parameters:

  • `arr - Array` - The Arrow array to free

Returns: Unit.

Warning: Do not use the array after calling this function.

Array -> Unit

schema

Creates a schema from a list of fields.

A schema defines the structure of a table or record batch by specifying the names, types, and nullability of columns.

Parameters:

  • `fields - List Field` - Ordered list of field definitions

Returns:

  • `Ok Schema` on success
  • `Err ArrowError` if creation fails

List Field -> Result Schema ArrowError

match Arrow.schema [
  Arrow.field "id" Arrow.int64,
  Arrow.field "name" Arrow.string-type,
  Arrow.field "price" Arrow.float64
]
  | Ok s -> Arrow.free-schema s  # use the schema, then free it
  | Err msg -> print "Error: ${msg}"

free-schema

Frees the memory used by a schema.

Parameters:

  • `s - Schema` - The schema to free

Returns: Unit.

Schema -> Unit

record-batch

Creates a record batch from a schema and list of column arrays.

A record batch is a contiguous chunk of columnar data. All arrays must have the same length and match the schema field types.

Parameters:

  • `schema - Schema` - Schema defining column structure
  • `columns - List Array` - Arrays for each column (must match schema)

Returns: `Ok RecordBatch` on success, or `Err ArrowError` on failure.

Errors:

  • Returns Err if arrays have mismatched lengths
  • Returns Err if column count doesn't match schema

Schema -> List Array -> Result RecordBatch ArrowError

num-rows

Returns the number of rows in a record batch.

Parameters:

  • `batch - RecordBatch` - The record batch

Returns: The row count as an Int.

RecordBatch -> Int

num-columns

Returns the number of columns in a record batch.

Parameters:

  • `batch - RecordBatch` - The record batch

Returns: The column count as an Int.

RecordBatch -> Int

column

Gets a column from a record batch by index.

Parameters:

  • `batch - RecordBatch` - The record batch
  • `index - Int` - Zero-based column index

Returns:

  • `Ok Array` with the column data
  • `Err ArrowError` if the index is out of bounds

RecordBatch -> Int -> Result Array ArrowError

free-batch

Frees the memory used by a record batch.

Parameters:

  • `batch - RecordBatch` - The batch to free

Returns: Unit.

RecordBatch -> Unit

table

Creates a table from a schema and list of column arrays.

Tables are the primary data structure for analytical operations. Unlike RecordBatch, tables can be backed by multiple chunks.

Parameters:

  • `schema - Schema` - Schema defining column structure
  • `columns - List Array` - Arrays for each column

Returns: `Ok Table` on success, or `Err ArrowError` if creation fails.

Schema -> List Array -> Result Table ArrowError

match Arrow.table schema [ids, names, prices]
  | Ok tbl ->
    print "Created table with ${Arrow.table-num-rows tbl} rows"
    Arrow.free-table tbl
  | Err msg -> print "Error: ${msg}"

table-column

Gets a table column by index.

Parameters:

  • `tbl - Table` - The table
  • `index - Int` - Zero-based column index

Returns: `Ok Array` with the column data, or `Err ArrowError` on failure.

Table -> Int -> Result Array ArrowError

table-column-by-name

Gets a table column by name.

Parameters:

  • `tbl - Table` - The table
  • `name - String` - Column name

Returns:

  • `Ok Array` with the column data
  • `Err` if column name not found

Table -> NonEmptyString -> Result Array ArrowError
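As a hedged sketch, a column can be pulled out of a table by name and inspected with the array functions documented above. The make-users helper and its data are hypothetical. The sketch also assumes the returned array is a separate handle that the caller should release with free-array, which this document does not state explicitly.

```kit
import Kit.Arrow as Arrow

# Hypothetical helper: a small table built with the fluent builder
make-users = fn =>
  Arrow.build-table
    |> Arrow.add-column "id" Arrow.int64 [1, 2, 3]
    |> Arrow.add-column "name" Arrow.string-type ["Alice", "Bob", "Carol"]
    |> Arrow.finalize

main = fn =>
  match make-users
    | Ok tbl ->
      # Table first, then the column name, as in the signature
      match Arrow.table-column-by-name tbl "name"
        | Ok col ->
          println "Length: ${show (Arrow.array-length col)}"
          println "Nulls: ${show (Arrow.null-count col)}"
          # Assumption: the column handle is owned by the caller
          Arrow.free-array col
        | Err e -> println "Error: ${show e}"

      Arrow.free-table tbl
    | Err e -> println "Error: ${show e}"

main
```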

table-num-rows

Returns the number of rows in a table.

Parameters:

  • `tbl - Table` - The table

Returns: The row count as an Int.

Table -> Int

table-num-columns

Returns the number of columns in a table.

Parameters:

  • `tbl - Table` - The table

Returns: The column count as an Int.

Table -> Int

free-table

Frees the memory used by a table.

Parameters:

  • `tbl - Table` - The table to free

Returns: Unit.

Table -> Unit

build-table

Creates an empty table builder.

Use with |> and add-column to incrementally build a table.

TableBuilder

Arrow.build-table
  |> Arrow.add-column "x" Arrow.int64 [1, 2]
  |> Arrow.finalize

add-column

Adds a column to the table builder.

Eagerly creates the Arrow array from the given values. If a previous column creation failed, this is a no-op (the error is preserved).

Supported types: all primitive types (int8 through uint64, float32, float64, bool-type, string-type, binary-type, date-type, timestamp-type). Nested types (list-type, struct-type) are not supported; create those arrays manually and use Arrow.table directly.

Parameters:

  • `builder - TableBuilder` - The builder (piped with `|>`)
  • `col-name - String` - Column name
  • `dtype - DataType` - Arrow data type for this column
  • `values - List a` - Column values

Returns: The updated TableBuilder.

TableBuilder -> String -> DataType -> List a -> TableBuilder

builder |> Arrow.add-column "score" Arrow.float64 [95.5, 87.3]

finalize

Finalizes the builder, creating a table from all accumulated columns.

If any add-column call failed, returns that error. Otherwise creates the schema and table from the accumulated fields and arrays.

Parameters:

  • `builder - TableBuilder` - The completed builder

Returns: `Ok Table` on success, or `Err ArrowError` if an add-column call or table creation failed.

TableBuilder -> Result Table ArrowError

Arrow.build-table
  |> Arrow.add-column "x" Arrow.int64 [1, 2, 3]
  |> Arrow.finalize

table-head

Returns the first n rows of a table.

Parameters:

  • `n - Int` - Number of rows to return
  • `tbl - Table` - Source table

Returns: `Ok Table` with the first n rows, or `Err ArrowError` on failure.

Int -> Table -> Result Table ArrowError

match Arrow.table-head 10 tbl
  | Ok top10 -> Arrow.free-table top10  # process the first 10 rows, then free them
  | Err e -> print "Error: ${show e}"

table-tail

Returns the last n rows of a table.

Parameters:

  • `n - Int` - Number of rows to return
  • `tbl - Table` - Source table

Returns: `Ok Table` with the last n rows, or `Err ArrowError` on failure.

Int -> Table -> Result Table ArrowError
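A short sketch of table-tail in the same style as the table-head usage in the README portion of this document. The make-events helper and its "seq" column are hypothetical; the result is treated as a separate table that is freed on its own, mirroring how the README example frees the table-head result.

```kit
import Kit.Arrow as Arrow

# Hypothetical helper: a five-row table with one integer column
make-events = fn =>
  Arrow.build-table
    |> Arrow.add-column "seq" Arrow.int64 [1, 2, 3, 4, 5]
    |> Arrow.finalize

main = fn =>
  match make-events
    | Ok tbl ->
      # Keep only the last two rows
      match Arrow.table-tail 2 tbl
        | Ok last2 ->
          println "Tail rows: ${show (Arrow.table-num-rows last2)}"
          Arrow.free-table last2
        | Err e -> println "Error: ${show e}"

      Arrow.free-table tbl
    | Err e -> println "Error: ${show e}"

main
```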

slice

Slices a table from start index with given length.

Parameters:

  • `start - Int` - Starting row index (0-based)
  • `length - Int` - Number of rows to include
  • `tbl - Table` - Source table

Returns: `Ok Table` with the selected rows, or `Err ArrowError` on failure.

Int -> Int -> Table -> Result Table ArrowError

# Get rows 100-199
match Arrow.slice 100 100 tbl
  | Ok page -> Arrow.free-table page  # process the page, then free it
  | Err e -> print "Error: ${show e}"

sum

Calculates the sum of values in a numeric column.

Computes: $\sum_{i=0}^{n-1} x_i$

Parameters:

  • `col-name - String` - Name of the numeric column
  • `tbl - Table` - Source table

Returns: The sum as a Float.

NonEmptyString -> Table -> Float

mean

Calculates the arithmetic mean of values in a numeric column.

Computes: $\bar{x} = \frac{1}{n}\sum_{i=0}^{n-1} x_i$

Parameters:

  • `col-name - String` - Name of the numeric column
  • `tbl - Table` - Source table

Returns: The mean as a Float.

NonEmptyString -> Table -> Float

min

Returns the minimum value in a numeric column.

Parameters:

  • `col-name - String` - Name of the numeric column
  • `tbl - Table` - Source table

Returns: The minimum value as a Float.

NonEmptyString -> Table -> Float

max

Returns the maximum value in a numeric column.

Parameters:

  • `col-name - String` - Name of the numeric column
  • `tbl - Table` - Source table

Returns: The maximum value as a Float.

NonEmptyString -> Table -> Float

count

Counts non-null values in a column.

Parameters:

  • `col-name - String` - Name of the column
  • `tbl - Table` - Source table

Returns: The number of non-null values as an Int.

NonEmptyString -> Table -> Int
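The aggregation functions share one calling convention (column name first, then table), so they combine naturally. The sketch below applies min, max, and count to one column; the make-scores helper and its values are hypothetical, and no particular output formatting is implied.

```kit
import Kit.Arrow as Arrow

# Hypothetical helper: a single-column table of scores
make-scores = fn =>
  Arrow.build-table
    |> Arrow.add-column "score" Arrow.float64 [95.5, 87.3, 91.0]
    |> Arrow.finalize

main = fn =>
  match make-scores
    | Ok tbl ->
      # Each aggregation takes the column name, then the table
      low = Arrow.min "score" tbl
      high = Arrow.max "score" tbl
      n = Arrow.count "score" tbl
      println "Min: ${show low}, Max: ${show high}, Count: ${show n}"

      Arrow.free-table tbl
    | Err e -> println "Error: ${show e}"

main
```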

table-sort

Sorts a table by column in ascending order.

Parameters:

  • `col-name - String` - Column to sort by
  • `tbl - Table` - Source table

Returns: `Ok Table` sorted in ascending order by the column, or `Err ArrowError` on failure.

NonEmptyString -> Table -> Result Table ArrowError

match Arrow.table-sort "price" products
  | Ok sorted -> Arrow.free-table sorted  # cheapest first; free when done
  | Err e -> print "Error: ${show e}"

table-sort-desc

Sorts a table by column in descending order.

Parameters:

  • `col-name - String` - Column to sort by
  • `tbl - Table` - Source table

Returns: `Ok Table` sorted in descending order by the column, or `Err ArrowError` on failure.

NonEmptyString -> Table -> Result Table ArrowError
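One common use of a descending sort is a "top N" query: sort, then take the head. The sketch below is one way to express that with table-sort-desc and table-head. The make-scores helper and its data are hypothetical, and each intermediate table is assumed to need its own free-table call, as in the README example.

```kit
import Kit.Arrow as Arrow

# Hypothetical helper: names with scores
make-scores = fn =>
  Arrow.build-table
    |> Arrow.add-column "name" Arrow.string-type ["Alice", "Bob", "Carol"]
    |> Arrow.add-column "score" Arrow.float64 [95.5, 87.3, 91.0]
    |> Arrow.finalize

main = fn =>
  match make-scores
    | Ok tbl ->
      match Arrow.table-sort-desc "score" tbl
        | Ok sorted ->
          # The two highest scores come first after the descending sort
          match Arrow.table-head 2 sorted
            | Ok top2 ->
              println "Top rows: ${show (Arrow.table-num-rows top2)}"
              Arrow.free-table top2
            | Err e -> println "Error: ${show e}"

          Arrow.free-table sorted
        | Err e -> println "Error: ${show e}"

      Arrow.free-table tbl
    | Err e -> println "Error: ${show e}"

main
```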

table-concat

Concatenates two tables vertically (row-wise union).

Tables must have compatible schemas.

Parameters:

  • `tbl1 - Table` - First table
  • `tbl2 - Table` - Second table

Returns: `Ok Table` containing the rows of both tables, or `Err ArrowError` on failure.

Table -> Table -> Result Table ArrowError
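A minimal sketch of a vertical concatenation, assuming that two tables built with identical column names and types count as having compatible schemas. The helper names and data are hypothetical, and the concatenated table is assumed to be freed independently of its inputs.

```kit
import Kit.Arrow as Arrow

# Hypothetical helpers: two tables with the same single-column schema
make-part-a = fn =>
  Arrow.build-table
    |> Arrow.add-column "id" Arrow.int64 [1, 2]
    |> Arrow.finalize

make-part-b = fn =>
  Arrow.build-table
    |> Arrow.add-column "id" Arrow.int64 [3, 4, 5]
    |> Arrow.finalize

main = fn =>
  match make-part-a
    | Ok part-a ->
      match make-part-b
        | Ok part-b ->
          match Arrow.table-concat part-a part-b
            | Ok both ->
              println "Combined rows: ${show (Arrow.table-num-rows both)}"
              Arrow.free-table both
            | Err e -> println "Error: ${show e}"

          Arrow.free-table part-b
        | Err e -> println "Error: ${show e}"

      Arrow.free-table part-a
    | Err e -> println "Error: ${show e}"

main
```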

rename-column

Renames a column in a table.

Parameters:

  • `old-name - String` - Current column name
  • `new-name - String` - New column name
  • `tbl - Table` - Source table

Returns: `Ok Table` with the column renamed, or `Err ArrowError` on failure.

NonEmptyString -> NonEmptyString -> Table -> Result Table ArrowError

column-name

Gets the column name at a given index.

Parameters:

  • `index - Int` - Zero-based column index
  • `tbl - Table` - The table

Returns: The column name as a String.

Int -> Table -> String
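rename-column and column-name can be combined to check that a rename took effect. The sketch below is illustrative only: the make-users helper and the column names are hypothetical, and the renamed table is assumed to be a new table that is freed separately from the source.

```kit
import Kit.Arrow as Arrow

# Hypothetical helper: a table whose first column is called "id"
make-users = fn =>
  Arrow.build-table
    |> Arrow.add-column "id" Arrow.int64 [1, 2, 3]
    |> Arrow.finalize

main = fn =>
  match make-users
    | Ok tbl ->
      match Arrow.rename-column "id" "user-id" tbl
        | Ok renamed ->
          # column-name takes the index first, then the table
          col0 = Arrow.column-name 0 renamed
          println "First column: ${col0}"
          Arrow.free-table renamed
        | Err e -> println "Error: ${show e}"

      Arrow.free-table tbl
    | Err e -> println "Error: ${show e}"

main
```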

write-ipc

Writes a table to an Arrow IPC (Feather) file.

Arrow IPC is a binary format for efficient storage and transfer of Arrow data. IPC files preserve the exact schema and support memory-mapping for zero-copy reads.

Parameters:

  • `tbl - Table` - Table to write
  • `path - String` - Output file path (typically .arrow or .feather extension)

Returns: `Ok` on success, or `Err ArrowError` if the write fails.

Table -> NonEmptyString -> Result Unit ArrowError

match Arrow.write-ipc tbl "data.arrow"
  | Ok -> print "Saved successfully"
  | Err msg -> print "Failed: ${msg}"

read-ipc

Reads a table from an Arrow IPC (Feather) file.

Parameters:

  • `path - String` - Path to the IPC file

Returns: `Ok Table` on success, or `Err ArrowError` if the read fails.

NonEmptyString -> Result Table ArrowError

match Arrow.read-ipc "data.arrow"
  | Ok tbl ->
    print "Loaded ${Arrow.table-num-rows tbl} rows"
    Arrow.free-table tbl
  | Err msg -> print "Failed: ${msg}"
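
Writing and reading can be chained into a round trip, which is a quick way to confirm that a file was written as expected. The sketch below assumes a writable working directory; the file name "scores.arrow" and the make-scores helper are hypothetical. The payload-free `| Ok ->` pattern follows the write-ipc example in this document.

```kit
import Kit.Arrow as Arrow

# Hypothetical helper: a small table to persist
make-scores = fn =>
  Arrow.build-table
    |> Arrow.add-column "score" Arrow.float64 [95.5, 87.3, 91.0]
    |> Arrow.finalize

main = fn =>
  match make-scores
    | Ok tbl ->
      # write-ipc takes the table first, then the path
      match Arrow.write-ipc tbl "scores.arrow"
        | Ok ->
          match Arrow.read-ipc "scores.arrow"
            | Ok loaded ->
              println "Round-trip rows: ${show (Arrow.table-num-rows loaded)}"
              Arrow.free-table loaded
            | Err e -> println "Read failed: ${show e}"
        | Err e -> println "Write failed: ${show e}"

      Arrow.free-table tbl
    | Err e -> println "Error: ${show e}"

main
```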

write-parquet

Writes a table to a Parquet file.

Parquet is a columnar storage format optimized for analytics workloads. It provides efficient compression and encoding for large datasets.

Parameters:

  • `tbl - Table` - Table to write
  • `path - String` - Output file path (typically .parquet extension)

Returns: `Ok` on success, or `Err ArrowError` if the write fails.

Table -> NonEmptyString -> Result Unit ArrowError

match Arrow.write-parquet tbl "data.parquet"
  | Ok -> print "Saved to Parquet"
  | Err msg -> print "Failed: ${msg}"

read-parquet

Reads a table from a Parquet file.

Parameters:

  • `path - String` - Path to the Parquet file

Returns: `Ok Table` on success, or `Err ArrowError` if the read fails.

NonEmptyString -> Result Table ArrowError

match Arrow.read-parquet "data.parquet"
  | Ok tbl -> Arrow.free-table tbl  # process the table, then free it
  | Err msg -> print "Failed: ${msg}"

read-csv

Reads a CSV file into a table.

Automatically infers column types from the data. For explicit schema control, read the CSV and then cast columns as needed.

Parameters:

  • `path - String` - Path to the CSV file

Returns: `Ok Table` on success, or `Err ArrowError` if the read fails.

NonEmptyString -> Result Table ArrowError

match Arrow.read-csv "data.csv"
  | Ok tbl ->
    print "Loaded CSV with ${Arrow.table-num-columns tbl} columns"
    Arrow.free-table tbl
  | Err msg -> print "Failed: ${msg}"