# dataframe
| Kind | ffi-zig |
|---|---|
| Capabilities | ffi file |
| Categories | data-structures analytics ffi |
| Keywords | dataframe columnar data analytics simd zig-ffi |

High-performance columnar DataFrame for Kit with SIMD acceleration.

## Features

- Core DataFrame Operations: Select, filter, sort, group by, join, aggregate
- Statistical Functions: Variance, standard deviation, quantiles, correlation, skewness, kurtosis
- Reshaping: Pivot, melt, crosstab, stack, unstack, transpose
- Window Functions: Row number, rank, lead/lag, cumulative sums, rolling aggregations
- Column Expressions: Polars-style column transformations and comparisons
- Parallel Operations: Partition-aware aggregations using Kit's concurrency primitives
- Extended I/O: Integration with Parquet, Arrow IPC, and SQLite
- DateTime Support: Parse, extract, format, and compare timestamp columns
- Lazy Evaluation: Query optimization with predicate pushdown and operation fusion

## Files

| File | Description |
|---|---|
| .editorconfig | Editor formatting configuration |
| .gitignore | Git ignore rules for build artifacts and dependencies |
| .tool-versions | asdf tool versions (Zig, Kit) |
| LICENSE | MIT license file |
| README.md | This file |
| dev/gapminder.kit | Development script for gapminder |
| dev/iris.kit | Development script for iris |
| dev/mtcars.kit | Development script for mtcars |
| dev/penguins.kit | Development script for penguins |
| dev/tips.kit | Development script for tips |
| dev/titanic.kit | Development script for titanic |
| examples/dataframe.kit | Example: dataframe |
| kit.toml | Package manifest with metadata and dependencies |
| src/col.kit | Module for col |
| src/dataframe.kit | DataFrame error type for typed error handling |
| src/datetime.kit | Module for datetime |
| src/duration.kit | Module for duration |
| src/eval.kit | Module for eval |
| src/expr.kit | Join type enumeration |
| src/io.kit | Module for io |
| src/optimize.kit | Module for optimize |
| src/parallel.kit | Module for parallel |
| src/reshape.kit | Create a pivot table from a DataFrame |
| src/rolling.kit | Module for rolling |
| src/stats.kit | Calculate sample variance of a column (ddof=1) |
| src/str.kit | Convert column values to lowercase |
| src/window.kit | Add 1-indexed row numbers as a new column |
| tests/col.test.kit | Tests for col |
| tests/dataframe.test.kit | Tests for dataframe |
| tests/io.test.kit | Tests for io |
| tests/parallel.test.kit | Tests for parallel |
| tests/reshape.test.kit | Tests for reshape |
| tests/stats.test.kit | Tests for stats |
| zig/dataframe.zig | Zig FFI module for dataframe |
| zig/kit_ffi.zig | Zig FFI module for kit ffi |
| zig/reshape.zig | Zig FFI module for reshape |
| zig/stats.zig | Zig FFI module for stats |
| zig/string_ops.zig | Zig FFI module for string ops |
| zig/window_ops.zig | Zig FFI module for window ops |

## Dependencies

- CSV - CSV parsing for `from-csv`/`to-csv`
- kit-arrow - Apache Arrow in-memory format (`read-arrow`/`write-arrow`)
- kit-parquet - Apache Parquet columnar storage (`read-parquet`/`write-parquet`)
- kit-sqlite - SQLite database access (`read-sql`/`to-sql`)

## Installation

```
kit add gitlab.com/kit-lang/packages/kit-dataframe.git
```

## Usage

### Basic Operations

```
import Kit.Dataframe as DataFrame

# Create from records
df = DataFrame.from-records [
  {name: "Alice", age: 30, salary: 75000},
  {name: "Bob", age: 25, salary: 55000},
  {name: "Carol", age: 35, salary: 85000}
]

# Basic operations
filtered = DataFrame.filter (fn(row) => row.age > 28) df
sorted = DataFrame.sort "salary" df
selected = DataFrame.select ["name", "salary"] df

# Aggregations
total = DataFrame.sum df "salary"
avg = DataFrame.mean df "age"
```

### Statistical Functions

```
import DataFrame.Stats as Stats

# Variance and standard deviation
variance = Stats.var df "returns"
std-dev = Stats.std-sample df "returns"

# Quantiles and percentiles
median = Stats.quantile df "price" 0.5
q1 = Stats.percentile df "price" 25.0

# Correlation and covariance
corr = Stats.corr df "x" "y"
cov = Stats.cov df "x" "y"
```

### Reshaping

```
import DataFrame.Reshape as Reshape

# Pivot table
pivoted = Reshape.pivot df {
  index: ["date"],
  columns: "product",
  values: "sales",
  aggfunc: :sum
}

# Melt (unpivot)
melted = Reshape.melt df {
  id-vars: ["date"],
  value-vars: ["q1", "q2", "q3"],
  var-name: "quarter",
  value-name: "sales"
}

# Crosstab
cross = Reshape.crosstab df "category" "status"
```

### Column Expressions

```
import DataFrame.Col as Col

# Scale and offset
df2 = df
  |> Col.scale-col "price" 1.1 "marked_up"
  |> Col.offset-col "score" 10 "adjusted"

# Comparisons
df3 = df
  |> Col.gt-col "salary" 60000.0 "is_high_earner"
  |> Col.eq-col "status" "active" "is_active"

# Categorize
df4 = Col.categorize "age" "age_group" [
  {max: 18, label: "child"},
  {max: 65, label: "adult"},
  {max: Float.infinity, label: "senior"}
] df
```

### Parallel Operations

```
import DataFrame.Parallel as Par

# Parallel aggregations
total = Par.par-sum df "amount"
avg = Par.par-mean df "score"

# Partitioned operations (map-reduce pattern)
sum = Par.partitioned-sum df "value" 4  # 4 partitions
```

### DateTime Operations

```
import DataFrame.DateTimeCol as DT

# Parse datetime strings
df2 = df
  |> DT.parse-iso-col "timestamp" "parsed_ts"

# Extract components
df3 = df
  |> DT.year-col "parsed_ts" "year"
  |> DT.month-col "parsed_ts" "month"
  |> DT.weekday-col "parsed_ts" "day_of_week"

# Format timestamps
df4 = DT.format-col "parsed_ts" "%Y-%m-%d" "date_string" df
```

### Extended I/O

```
import DataFrame.IO as IO

# Parquet (requires kit-parquet)
df = IO.read-parquet "data.parquet" |> Result.unwrap
IO.write-parquet df "output.parquet"

# Arrow IPC
df = IO.read-arrow "data.arrow" |> Result.unwrap
IO.write-arrow df "output.arrow"

# SQLite (requires kit-sqlite)
db = SQLite.connect "data.db"
df = IO.read-sql db "SELECT * FROM users" |> Result.unwrap
IO.to-sql df db "users" :replace
```

## Interactive REPL
kit-dataframe ships with preloaded REPL sessions for exploring classic datasets interactively. Each preload creates a ready-to-use DataFrame with pre-built subsets and helper functions.

### Available Datasets

| Dataset | Module | Rows | Description |
|---|---|---|---|
| dev/iris.kit | Iris | 150 | Fisher's Iris flower measurements (sepal/petal dimensions by species) |
| dev/mtcars.kit | Mtcars | 32 | Motor Trend 1974 car road tests (mpg, hp, weight, etc.) |
| dev/titanic.kit | Titanic | 100 | Titanic passenger survival data (class, sex, age, fare) |
| dev/penguins.kit | Penguins | 150 | Palmer Penguins morphometrics (bill, flipper, mass by species) |
| dev/tips.kit | Tips | 50 | Restaurant tipping data (total bill, tip, day, time) |
| dev/gapminder.kit | Gapminder | 66 | Global development indicators (life expectancy, GDP, population) |

### Running a REPL Session

From the kit-dataframe package directory:

```
kit repl --preload dev/iris.kit
```

The REPL prompt shows the module name (e.g., `Iris≫`) and prints available variables and helpers on startup:

```
Iris≫ preview iris
   sepal-length  sepal-width  petal-length  petal-width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5.0          3.6           1.4          0.2   setosa
[150 rows x 5 columns]
```

Each preload provides:

- Pre-built subsets — filtered views by category (e.g., `setosa`, `auto`, `survived`)
- `preview df` — show first 5 rows as a formatted table
- `info df` — shape, columns, and summary statistics
- `col-stats col df` — mean, std, min, max, median for a column
- `compare-by-* col` — compare a measurement across groups
- `corr col1 col2` — Pearson correlation between two columns
- `top n col` / `bottom n col` — top/bottom n rows by a column
- `sorted col` — sort by any column

## Tests

Run the test suite:

```
cd packages/kit-dataframe
kit dev
```

## License

MIT License - see LICENSE for details.

## Exported Functions & Types
parse-col
Parse string column to timestamp using Kit's Time.parse. Creates a new integer column with Unix timestamps (milliseconds).
NonEmptyString -> String -> NonEmptyString -> DataFrame -> DataFrame
parse-iso-col
Parse ISO 8601 datetime string column. Format: "2024-01-15T10:30:00Z" or "2024-01-15 10:30:00"
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
parse-date-col
Parse date-only string column (no time component). Format: "2024-01-15"
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
year-col
Extract year from timestamp column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
month-col
Extract month (1-12) from timestamp column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
day-col
Extract day of month (1-31) from timestamp column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
hour-col
Extract hour (0-23) from timestamp column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
minute-col
Extract minute (0-59) from timestamp column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
second-col
Extract second (0-59) from timestamp column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
weekday-col
Extract day of week (0=Sunday, 6=Saturday) from timestamp column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
add-days-col
Add days to timestamp column.
NonEmptyString -> Int -> DataFrame -> DataFrame
add-months-col
Add months to timestamp column.
NonEmptyString -> Int -> DataFrame -> DataFrame
add-years-col
Add years to timestamp column.
NonEmptyString -> Int -> DataFrame -> DataFrame
diff-col
Calculate difference between two timestamp columns in milliseconds.
NonEmptyString -> NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
diff-days-col
Calculate difference in days between two timestamp columns.
NonEmptyString -> NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
diff-hours-col
Calculate difference in hours between two timestamp columns.
NonEmptyString -> NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
format-col
Format timestamp column to string using strftime format.
NonEmptyString -> String -> NonEmptyString -> DataFrame -> DataFrame
format-iso-col
Format timestamp column to ISO 8601 string.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
format-date-col
Format timestamp column to date string (YYYY-MM-DD).
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
format-time-col
Format timestamp column to time string (HH:MM:SS).
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
is-before-col
Check if timestamp column values are before a reference timestamp.
NonEmptyString -> Int -> NonEmptyString -> DataFrame -> DataFrame
is-after-col
Check if timestamp column values are after a reference timestamp.
NonEmptyString -> Int -> NonEmptyString -> DataFrame -> DataFrame
is-between-col
Check if timestamp column values are between two timestamps.
NonEmptyString -> Int -> Int -> NonEmptyString -> DataFrame -> DataFrame
now-col
Add current timestamp column to DataFrame.
NonEmptyString -> DataFrame -> DataFrame
components-col
Convert timestamp to components record column. Returns a column where each value is {year, month, day, hour, minute, second}.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
EvalError
Evaluation errors that can occur during expression evaluation
Variants
- EvalDataFrameError {message}
- EvalColumnNotFound {column}
- EvalTypeMismatch {expected, got}
- EvalInvalidOperation {operation, reason}

eval
Evaluate a DataFrame expression tree. Recursively traverses the tree, executing operations bottom-up.
DFExpr -> Result a EvalError
eval-optimized
Evaluate with optimization. Applies the optimizer before evaluation for better performance.
DFExpr -> Result a EvalError
expr-hash
Compute a hash key for an expression. Used for memoization to identify repeated subexpressions.
DFExpr -> String
eval-memoized
Evaluate with memoization. Caches results of subexpressions to avoid redundant computation.
DFExpr -> Result a EvalError
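The interplay of expr-hash and eval-memoized can be illustrated with a minimal Python sketch over a toy arithmetic tree (a hypothetical tuple encoding, not the real DFExpr type): each node gets a string hash key, and a repeated subtree is computed once and then served from the cache.

```python
def expr_hash(expr):
    """Stable string key for a node, built from its children's keys."""
    if isinstance(expr, (int, float)):
        return f"lit:{expr}"
    op, left, right = expr
    return f"{op}({expr_hash(left)},{expr_hash(right)})"

def eval_memoized(expr, cache):
    """Evaluate bottom-up, caching each subexpression by its hash key."""
    if isinstance(expr, (int, float)):
        return expr
    key = expr_hash(expr)
    if key in cache:
        return cache[key]          # repeated subtree: cache hit
    op, left, right = expr
    a = eval_memoized(left, cache)
    b = eval_memoized(right, cache)
    result = a + b if op == "+" else a * b
    cache[key] = result
    return result

cache = {}
shared = ("+", 1, 2)               # this subtree appears twice below
value = eval_memoized(("*", shared, shared), cache)
# value == 9; cache holds one entry per distinct subexpression
```

The cache size afterwards is what eval-memoized-cache-size reports: one entry per distinct non-literal subexpression.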
eval-memoized-cache-size
Get the cache size after memoized evaluation.
DFExpr -> Int
eval!
Evaluate and unwrap, panicking on error. Use for scripts where errors should halt execution.
DFExpr -> a
eval-optimized!
Evaluate optimized and unwrap, panicking on error.
DFExpr -> a
collect
Synonym for eval-optimized; named to match Polars/Spark terminology.
DFExpr -> Result a EvalError
collect!
Collect and unwrap, panicking on error.
DFExpr -> a
read-parquet
Read a Parquet file into a DataFrame. Uses kit-parquet to read the file, then converts to DataFrame via records.
NonEmptyString -> Result DataFrame String
match IO.read-parquet "data.parquet"
| Ok df -> DataFrame.print df
| Err e -> print "Error: ${e}"
write-parquet
Write a DataFrame to a Parquet file. Converts DataFrame to records, then writes via kit-parquet.
DataFrame -> NonEmptyString -> Result Unit String
IO.write-parquet df "output.parquet"
write-parquet-compressed
Write a DataFrame to Parquet with compression options.
Compression options: :snappy, :gzip, :lz4, :zstd, :uncompressed
DataFrame -> NonEmptyString -> Symbol -> Result Unit String
IO.write-parquet-compressed df "output.parquet" :zstd
read-arrow
Read an Arrow IPC file into a DataFrame. Arrow IPC (Feather) format is efficient for temporary storage and IPC.
NonEmptyString -> Result DataFrame String
match IO.read-arrow "data.arrow"
| Ok df -> DataFrame.print df
| Err e -> print "Error: ${e}"
write-arrow
Write a DataFrame to an Arrow IPC file.
DataFrame -> NonEmptyString -> Result Unit String
IO.write-arrow df "data.arrow"
read-sql
Execute a SQL query and return results as a DataFrame. The query results are converted to DataFrame records.
{query: String -> Result List a, ..} -> String -> Result DataFrame String
db = SQLite.connect "data.db"
match IO.read-sql db "SELECT id, name, age FROM users"
| Ok df -> DataFrame.print df
| Err e -> print "Error: ${e}"
to-sql
Write a DataFrame to a SQLite table. Creates the table if it doesn't exist, or inserts into existing table.
Options for if-exists:
- :replace - Drop and recreate table
- :append - Insert rows into existing table
- :fail - Return error if table exists (default)
DataFrame -> {execute: String -> Result Int a, query: String -> Result List b, ..} -> String -> Symbol -> Result Int String
db = SQLite.connect "data.db"
IO.to-sql df db "users" :replace
read-csv-chunked
Read a CSV file in chunks, applying a processor to each chunk. Useful for processing files larger than available memory.
NonEmptyString -> PositiveInt -> (DataFrame -> a) -> Result Unit String
IO.read-csv-chunked "huge.csv" 10000 (fn(chunk) =>
chunk
|> DataFrame.filter (fn(row) => row.valid?)
|> process-and-save
)
mean
Rolling mean with specified window size. Returns None for first (window-1) rows.
String -> Int -> String -> DataFrame -> DataFrame
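The None-padding semantics described above can be sketched in Python (a minimal illustration with hypothetical data, not the SIMD-backed implementation): positions without a full window, i.e. the first window - 1 rows, yield None.

```python
def rolling_mean(values, window):
    """Rolling mean; None until a full window of values is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)                               # incomplete window
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out

rolling_mean([1, 2, 3, 4, 5], 3)
# -> [None, None, 2.0, 3.0, 4.0]
```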
sum
Rolling sum with specified window size.
String -> Int -> String -> DataFrame -> DataFrame
std
Rolling standard deviation with specified window size (sample std, ddof=1).
String -> Int -> String -> DataFrame -> DataFrame
min
Rolling minimum with specified window size.
String -> Int -> String -> DataFrame -> DataFrame
max
Rolling maximum with specified window size.
String -> Int -> String -> DataFrame -> DataFrame
JoinKind
Join type enumeration
Variants
- Inner
- LeftOuter
- RightOuter
- FullOuter

DFExpr
Lazy DataFrame expression tree. Constructing an expression does not perform any computation - it builds a tree structure that can be optimized and then evaluated.
Type parameters:
- a: The DataFrame value type (opaque)
- b: Predicate/mapper function type (opaque)
Variants
- Lit {a}
- Select {DFExpr, _1}
- Drop {DFExpr, _1}
- Filter {DFExpr, a}
- MapCol {DFExpr, String, a}
- Sort {DFExpr, String, Bool}
- Slice {DFExpr, Int, Int}
- Head {DFExpr, Int}
- Tail {DFExpr, Int}
- GroupBy {DFExpr, _1}
- Agg {DFExpr, a}
- GroupByAgg {DFExpr, _1, a}
- Join {DFExpr, DFExpr, String, JoinKind}
- Concat {DFExpr, DFExpr}
- WithColumn {DFExpr, String, a}
- Rename {DFExpr, a}
- Unique {DFExpr, _1}
- Sample {DFExpr, Int}
- FillNone {DFExpr, String, a}
- DropNone {DFExpr, String}
- SortDesc {DFExpr, String}
- TopN {DFExpr, String, Int}

of
Create a literal expression from a DataFrame. This is the entry point for building lazy expressions.
a -> DFExpr
select
Select specific columns from the DataFrame.
[String] -> DFExpr -> DFExpr
drop
Drop specific columns from the DataFrame.
[String] -> DFExpr -> DFExpr
filter
Filter rows using a predicate function.
a -> DFExpr -> DFExpr
map-column
Transform a column using a function.
NonEmptyString -> a -> DFExpr -> DFExpr
sort
Sort by column in ascending order.
NonEmptyString -> DFExpr -> DFExpr
sort-desc
Sort by column in descending order.
NonEmptyString -> DFExpr -> DFExpr
sort-by
Sort by column with explicit direction.
NonEmptyString -> Bool -> DFExpr -> DFExpr
slice
Slice rows from start to end (exclusive).
NonNegativeInt -> NonNegativeInt -> DFExpr -> DFExpr
head
Take first n rows.
PositiveInt -> DFExpr -> DFExpr
tail
Take last n rows.
PositiveInt -> DFExpr -> DFExpr
group-by
Group by specified columns. Must be followed by an aggregate operation.
[String] -> DFExpr -> DFExpr
aggregate
Apply aggregations to grouped DataFrame.
a -> DFExpr -> DFExpr
group-by-agg
Combined group-by and aggregate in one operation.
[String] -> a -> DFExpr -> DFExpr
inner-join
Inner join with another DataFrame expression.
DFExpr -> NonEmptyString -> DFExpr -> DFExpr
left-join
Left outer join with another DataFrame expression.
DFExpr -> NonEmptyString -> DFExpr -> DFExpr
right-join
Right outer join with another DataFrame expression.
DFExpr -> NonEmptyString -> DFExpr -> DFExpr
outer-join
Full outer join with another DataFrame expression.
DFExpr -> NonEmptyString -> DFExpr -> DFExpr
concat
Concatenate two DataFrame expressions vertically.
DFExpr -> DFExpr -> DFExpr
with-column
Add or replace a column with values.
NonEmptyString -> a -> DFExpr -> DFExpr
rename
Rename columns using a mapping record.
a -> DFExpr -> DFExpr
unique
Get unique rows based on specified columns.
[String] -> DFExpr -> DFExpr
sample
Take a random sample of n rows.
PositiveInt -> DFExpr -> DFExpr
fill-none
Fill missing values in a column.
NonEmptyString -> a -> DFExpr -> DFExpr
drop-none
Drop rows with missing values in a column.
NonEmptyString -> DFExpr -> DFExpr
top-n
Optimized top-N operation (head of sorted data). More efficient than sort followed by head.
NonEmptyString -> PositiveInt -> DFExpr -> DFExpr
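Why a fused top-N beats sort-then-head can be seen in this Python sketch (hypothetical row data): a bounded heap tracks only the n best candidates, costing O(rows * log n) rather than the O(rows * log rows) of a full sort.

```python
import heapq

# Hypothetical rows for illustration.
rows = [{"name": "a", "score": 5}, {"name": "b", "score": 9},
        {"name": "c", "score": 1}, {"name": "d", "score": 7}]

# nlargest keeps a 2-element heap while scanning, never sorting all rows.
top2 = heapq.nlargest(2, rows, key=lambda r: r["score"])
# -> the rows with scores 9 and 7, in descending order
```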
is-literal?
Check if expression is a literal (base case).
DFExpr -> Bool
depth
Get the depth of the expression tree.
DFExpr -> Int
node-count
Count the number of nodes in the expression tree.
DFExpr -> Int
with-fn
Add a column computed from row values using a function. The function receives each row as a record.
NonEmptyString -> (Record -> a) -> DataFrame -> Result DataFrame String
df |> Col.with-fn "total" (fn(row) => row.price * row.qty)
with-many
Add multiple computed columns sequentially. Each spec is a record with {name: String, fn: fn(row) => value}.
[{name: String, fn: Record -> a}] -> DataFrame -> Result DataFrame String
df |> Col.with-many [
{name: "total", fn: fn(row) => row.price * row.qty},
{name: "tax", fn: fn(row) => row.price * 0.08}
]
scale-col
Scale a column by a factor, storing in new column.
NonEmptyString -> Float -> NonEmptyString -> DataFrame -> Result DataFrame String
offset-col
Offset a column by an amount, storing in new column.
NonEmptyString -> Float -> NonEmptyString -> DataFrame -> Result DataFrame String
gt-col
Add boolean column for values > threshold.
NonEmptyString -> Float -> NonEmptyString -> DataFrame -> Result DataFrame String
lt-col
Add boolean column for values < threshold.
NonEmptyString -> Float -> NonEmptyString -> DataFrame -> Result DataFrame String
ge-col
Add boolean column for values >= threshold.
NonEmptyString -> Float -> NonEmptyString -> DataFrame -> Result DataFrame String
le-col
Add boolean column for values <= threshold.
NonEmptyString -> Float -> NonEmptyString -> DataFrame -> Result DataFrame String
eq-col
Add boolean column for values equal to target.
NonEmptyString -> a -> NonEmptyString -> DataFrame -> Result DataFrame String
ne-col
Add boolean column for values not equal to target.
NonEmptyString -> a -> NonEmptyString -> DataFrame -> Result DataFrame String
trim
Trim whitespace from column values.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
str-len
Get string length for each value.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
contains
Check if values contain a substring.
NonEmptyString -> String -> NonEmptyString -> DataFrame -> Result DataFrame String
starts-with
Check if values start with a prefix.
NonEmptyString -> String -> NonEmptyString -> DataFrame -> Result DataFrame String
ends-with
Check if values end with a suffix.
NonEmptyString -> String -> NonEmptyString -> DataFrame -> Result DataFrame String
fill-empty
Fill empty strings with a default value.
NonEmptyString -> String -> NonEmptyString -> DataFrame -> Result DataFrame String
is-empty
Create boolean column for empty values.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
is-not-empty
Create boolean column for non-empty values.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
categorize
Categorize numeric values into bins. bins is a list of {max: Float, label: String} in ascending order.
NonEmptyString -> NonEmptyString -> [{max: Float, label: String}] -> DataFrame -> Result DataFrame String
df |> Col.categorize "age" "age_group" [
{max: 18, label: "child"},
{max: 65, label: "adult"},
{max: Float.infinity, label: "senior"}
]
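The bin-selection rule behind categorize can be sketched in Python (assuming inclusive upper bounds, which the signature does not specify): each value takes the label of the first bin whose max is at least the value, which is why bins must be listed in ascending order.

```python
import math

def categorize(value, bins):
    """Return the label of the first bin whose max covers the value."""
    for b in bins:
        if value <= b["max"]:       # assumption: upper bound is inclusive
            return b["label"]
    return None                     # value above every bin's max

bins = [
    {"max": 18, "label": "child"},
    {"max": 65, "label": "adult"},
    {"max": math.inf, "label": "senior"},
]
categorize(30, bins)   # -> "adult"
```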
indicator
Create indicator column (1 for true, 0 for false).
NonEmptyString -> a -> NonEmptyString -> DataFrame -> Result DataFrame String
indicator-gt
Create indicator from comparison.
NonEmptyString -> Float -> NonEmptyString -> DataFrame -> Result DataFrame String
abs-col
Apply absolute value to column.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
round-col
Round column values.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
floor-col
Floor column values.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
ceil-col
Ceiling column values.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
log-col
Apply natural log to column.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
sqrt-col
Apply square root to column.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
exp-col
Apply exponential to column.
NonEmptyString -> NonEmptyString -> DataFrame -> Result DataFrame String
pow-col
Apply power to column values.
NonEmptyString -> Float -> NonEmptyString -> DataFrame -> Result DataFrame String
DataFrameError
DataFrame error type for typed error handling. Variants distinguish between different failure modes.
Variants
- DataFrameParseError {message}
- DataFrameColumnError {message}
- DataFrameRowError {message}
- DataFrameIOError {message}
- DataFrameConversionError {message}

parse-csv
Parse a CSV string into a DataFrame. The first line is treated as column headers. Returns Result with Ok(DataFrame) or Err(message).
String -> Result a b
read-csv
Read a CSV file into a DataFrame. Returns Result with Ok(DataFrame) or Err(message).
String -> Result a IOError
optimize
Recursively optimize an expression tree. First optimizes all sub-expressions bottom-up, then applies rewrite rules.
DFExpr -> DFExpr
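One rewrite rule of the kind such an optimizer applies can be sketched in Python (a hypothetical tuple encoding, not the real DFExpr): pushing a Filter beneath a Select so rows are discarded before column projection, which is valid when the predicate only uses columns the Select keeps.

```python
def push_filter_down(expr):
    """Rewrite Filter(Select(child)) into Select(Filter(child))."""
    if expr[0] == "filter" and expr[1][0] == "select":
        _, (_, cols, child), pred = expr
        return ("select", cols, ("filter", child, pred))
    return expr  # no rule matched: leave the node unchanged

expr = ("filter", ("select", ["a", "b"], "source"), "a > 0")
push_filter_down(expr)
# -> ("select", ["a", "b"], ("filter", "source", "a > 0"))
```

A full optimizer would apply a set of such rules bottom-up until none fires, which is what the recursive description above amounts to.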
count-rewrites
Count the number of optimizations applied. Useful for debugging and profiling the optimizer.
DFExpr -> DFExpr -> Int
stats
Get optimization statistics as a record.
DFExpr -> DFExpr -> {original_nodes: Int, optimized_nodes: Int, reduction: Int}
from-millis-col
Create duration from milliseconds column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
from-seconds-col
Create duration from seconds column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
from-minutes-col
Create duration from minutes column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
from-hours-col
Create duration from hours column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
from-days-col
Create duration from days column.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
parse-col
Parse duration string column (e.g., "2h30m").
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
to-millis-col
Convert duration to milliseconds.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
to-seconds-col
Convert duration to seconds.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
to-minutes-col
Convert duration to minutes.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
to-hours-col
Convert duration to hours.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
to-days-col
Convert duration to days.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
add-col
Add two duration columns.
NonEmptyString -> NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
subtract-col
Subtract second duration column from first.
NonEmptyString -> NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
multiply-col
Multiply duration column by a scalar.
NonEmptyString -> Int -> DataFrame -> DataFrame
divide-col
Divide duration column by a scalar.
NonEmptyString -> Int -> DataFrame -> DataFrame
negate-col
Negate duration values in column.
NonEmptyString -> DataFrame -> DataFrame
abs-col
Get absolute value of duration column.
NonEmptyString -> DataFrame -> DataFrame
format-col
Format duration as human-readable string (e.g., "2h 30m").
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
format-long-col
Format duration with full unit names (e.g., "2 hours 30 minutes").
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
format-abbrev-col
Format duration with abbreviated units (e.g., "2h30m").
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
is-zero-col
Check if duration values are zero.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
is-negative-col
Check if duration values are negative.
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
is-positive-col
Check if duration values are positive (not zero or negative).
NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
add-to-time-col
Add duration column to timestamp column.
NonEmptyString -> NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
subtract-from-time-col
Subtract duration column from timestamp column.
NonEmptyString -> NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
between-times-col
Calculate duration between two timestamp columns.
NonEmptyString -> NonEmptyString -> NonEmptyString -> DataFrame -> DataFrame
par-map-column
Apply a function to each value in a column.
NonEmptyString -> (a -> b) -> DataFrame -> Result DataFrame String
Par.par-map-column "price" (fn(x) => x * 1.1) df
par-filter
Filter rows using a predicate function.
(Record -> Bool) -> DataFrame -> Result DataFrame String
Par.par-filter (fn(row) => row.age > 30) df
par-sum
Calculate sum of a column.
DataFrame -> NonEmptyString -> Result Float String
Par.par-sum df "amount"
par-mean
Calculate mean of a column.
DataFrame -> NonEmptyString -> Result Float String
Par.par-mean df "score"
par-min
Find minimum value in a column.
DataFrame -> NonEmptyString -> Result Float String
Par.par-min df "temperature"
par-max
Find maximum value in a column.
DataFrame -> NonEmptyString -> Result Float String
Par.par-max df "revenue"
par-count
Count rows matching a predicate.
(Record -> Bool) -> DataFrame -> Result Int String
Par.par-count (fn(row) => row.status == "active") df
partitioned-sum
Partition-aware sum: splits data into chunks and aggregates. This pattern is ready for parallel execution.
DataFrame -> NonEmptyString -> PositiveInt -> Result Float String
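The partition-aware pattern can be sketched in Python (a minimal single-threaded illustration; the real version would hand each partial to a Kit concurrency worker): split the column into n chunks, sum each chunk independently, then combine the partials.

```python
def partitioned_sum(values, partitions):
    """Map-reduce style sum: per-chunk partials, then a final combine."""
    size = max(1, -(-len(values) // partitions))   # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [sum(c) for c in chunks]            # each could run in parallel
    return sum(partials)

partitioned_sum(list(range(10)), 4)   # -> 45, same as sum(range(10))
```

Sum is associative, so the split-combine gives the same result as a sequential pass; mean, min, and max below follow the same shape with different combine steps.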
partitioned-mean
Partition-aware mean: partitioned sum divided by count.
DataFrame -> NonEmptyString -> PositiveInt -> Result Float String
partitioned-min
Partition-aware min: finds minimum in each partition then combines.
DataFrame -> NonEmptyString -> PositiveInt -> Result Float String
partitioned-max
Partition-aware max: finds maximum in each partition then combines.
DataFrame -> NonEmptyString -> PositiveInt -> Result Float String