The comparison format here was inspired by "Data Manipulation in Clojure Compared to R and Python", but this post uses kit-dataframe and adds Polars and Julia to the comparison.
Most people evaluating Kit for data work already know dplyr, pandas, Polars, or Julia’s DataFrames.jl. The useful question is not what kit-dataframe can do in the abstract. It is how to do everyday data manipulation in Kit.
This post focuses on a small set of common tasks: read a CSV, inspect it, filter it, summarize it, reshape it, and build a derived column. That is enough to show both the everyday workflow and some of the design choices behind kit-dataframe.
All of the examples use the same local penguins.csv file. The dataset is just a reference point so the style of each library is easy to compare.
Reading a CSV
The first place Kit visibly diverges from the others is error handling. Reading a file returns a Result, so the failure case is part of the normal shape of the code rather than something implicit or deferred.
Kit
import Kit.Dataframe as DataFrame
penguins = match DataFrame.read-csv "penguins.csv"
| Ok df -> df
| Err _ -> panic "could not read penguins.csv"
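For readers without Kit installed, the same Result-shaped boundary can be sketched in plain Python by returning an explicit success/failure pair instead of raising. The helper name below is hypothetical, not part of any library; it only illustrates making the failure case part of the normal shape of the code.

```python
import csv
import io

def read_csv_result(text):
    """Parse CSV text into a list of dict rows, returning an explicit
    ("Ok", rows) or ("Err", message) pair instead of raising --
    a rough analogue of matching on Kit's Result."""
    try:
        rows = list(csv.DictReader(io.StringIO(text)))
        return ("Ok", rows)
    except csv.Error as err:
        return ("Err", str(err))

tag, value = read_csv_result("species,body_mass_g\nGentoo,5000\n")
if tag == "Ok":
    penguins = value
else:
    raise SystemExit(f"could not read penguins.csv: {value}")
```

The caller has to look at the tag before touching the data, which is the point: the unhappy path is visible at the call site.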
CSV support in Kit does not depend on kit-dataframe. The standard library already has Encoding.CSV for parsing, formatting, file I/O, header-based parsing, custom delimiters, and automatic delimiter detection. In the Kit source, the shared CSV parser also has a SIMD-accelerated path for larger files when the input is a good fit for it. DataFrame.read-csv and DataFrame.parse-csv build on top of that stdlib CSV support.
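Python's standard library has a comparable building block for one of those features: csv.Sniffer detects the delimiter from a sample of the input, which is the same kind of automatic detection described above (the SIMD-accelerated path has no stdlib analogue).

```python
import csv

# A small semicolon-delimited sample; Sniffer guesses the dialect from it.
sample = "species;island;body_mass_g\nGentoo;Biscoe;5000\nAdelie;Dream;3750\n"
dialect = csv.Sniffer().sniff(sample, delimiters=";,|\t")
print(dialect.delimiter)  # ';' for this sample
```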
R
library(readr)
library(dplyr)
penguins <- read_csv("penguins.csv", na = "NA")
Python
import pandas as pd
penguins = pd.read_csv("penguins.csv", na_values=["NA"])
Polars
import polars as pl
penguins = pl.read_csv("penguins.csv", null_values="NA")
Julia
using CSV, DataFrames
penguins = CSV.read("penguins.csv", DataFrame)
Quick Inspection
After loading the file, the operations are the usual ones: look at the top, inspect names, choose columns, drop columns, and sort. The point is not whether Kit can do these things, but how the verbs are shaped. They are mostly function-first and DataFrame-last, which makes them read naturally in a pipeline.
This gives a quick sense of how close the basic operations are across the libraries.
| Task | Kit | R | Python | Polars | Julia |
|---|---|---|---|---|---|
| First 5 rows | DataFrame.head 5 penguins | slice_head(penguins, n = 5) | penguins.head(5) | penguins.head(5) | first(penguins, 5) |
| Column names | DataFrame.columns penguins | colnames(penguins) | penguins.columns | penguins.columns | names(penguins) |
| Select columns | DataFrame.select ["species", "body_mass_g"] penguins | select(penguins, species, body_mass_g) | penguins[["species", "body_mass_g"]] | penguins.select(["species", "body_mass_g"]) | select(penguins, [:species, :body_mass_g]) |
| Drop columns | DataFrame.drop ["sex"] penguins | select(penguins, -sex) | penguins.drop(columns=["sex"]) | penguins.drop("sex") | select(penguins, Not(:sex)) |
| Sort descending | DataFrame.sort-desc "body_mass_g" penguins | arrange(penguins, desc(body_mass_g)) | penguins.sort_values("body_mass_g", ascending=False) | penguins.sort("body_mass_g", descending=True) | sort(penguins, :body_mass_g, rev=true) |
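To make the table concrete without any of the libraries installed, here is what three of those verbs mean in plain Python over rows-as-dicts. The names are illustrative only, not any library's API.

```python
penguins = [
    {"species": "Adelie",    "island": "Torgersen", "body_mass_g": 3750},
    {"species": "Gentoo",    "island": "Biscoe",    "body_mass_g": 5700},
    {"species": "Chinstrap", "island": "Dream",     "body_mass_g": 3800},
]

head = penguins[:2]  # first n rows
selected = [{k: r[k] for k in ("species", "body_mass_g")} for r in penguins]
by_mass_desc = sorted(penguins, key=lambda r: r["body_mass_g"], reverse=True)

print(by_mass_desc[0]["species"])  # Gentoo
```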
Row Filters and Column Projection
This is where the differences between the libraries become clearer. Kit does not lean on a data-specific mini-language here. The filter condition is an ordinary function over rows, and selection is a separate transform in the same pipeline.
Kit
heavy-gentoo = penguins
|> DataFrame.filter (fn(row) =>
row.body_mass_g > 5000 and row.species == "Gentoo")
|> DataFrame.select ["species", "island", "body_mass_g"]
R
heavy_gentoo <- penguins |>
filter(body_mass_g > 5000, species == "Gentoo") |>
select(species, island, body_mass_g)
Python
heavy_gentoo = (
penguins
.loc[
(penguins["body_mass_g"] > 5000) & (penguins["species"] == "Gentoo"),
["species", "island", "body_mass_g"],
]
)
Polars
heavy_gentoo = (
penguins
.filter((pl.col("body_mass_g") > 5000) & (pl.col("species") == "Gentoo"))
.select(["species", "island", "body_mass_g"])
)
Julia
heavy_gentoo = select(
    subset(
        penguins,
        :body_mass_g => ByRow(>(5000)),
        :species => ByRow(==("Gentoo"));
        skipmissing = true,
    ),
    [:species, :island, :body_mass_g],
)
That makes the code slightly more explicit than the shortest dplyr or Polars version, but also more uniform. The same Kit language constructs that work in the rest of the language are doing the work here too.
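Kit's ordinary-predicate style translates almost word for word into plain Python: the condition is a normal function over row dicts, and projection is a separate step. This is a sketch of the idea, not a library API.

```python
rows = [
    {"species": "Gentoo", "island": "Biscoe", "body_mass_g": 5550, "sex": "male"},
    {"species": "Gentoo", "island": "Biscoe", "body_mass_g": 4700, "sex": "female"},
    {"species": "Adelie", "island": "Dream",  "body_mass_g": 5200, "sex": "male"},
]

def is_heavy_gentoo(row):
    # An ordinary predicate, usable anywhere a function is usable.
    return row["body_mass_g"] > 5000 and row["species"] == "Gentoo"

heavy_gentoo = [
    {k: r[k] for k in ("species", "island", "body_mass_g")}
    for r in rows
    if is_heavy_gentoo(r)
]
print(heavy_gentoo)  # one Gentoo row over 5000 g
```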
Aggregate by Species
Grouped summaries are usually where tabular libraries stop feeling interchangeable. Different ecosystems make very different bets about how aggregation specs should be expressed.
Kit
by-species = DataFrame.group-by-agg penguins ["species"] {
body_mass_g: "mean",
flipper_length_mm: "mean"
}
R
by_species <- penguins |>
group_by(species) |>
summarise(
body_mass_g = mean(body_mass_g, na.rm = TRUE),
flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE)
)
Python
by_species = (
penguins
.groupby("species", dropna=False)[["body_mass_g", "flipper_length_mm"]]
.mean(numeric_only=True)
.reset_index()
)
Polars
by_species = (
penguins
.group_by("species")
.agg(
pl.col("body_mass_g").mean(),
pl.col("flipper_length_mm").mean(),
)
)
Julia
using Statistics
by_species = combine(
groupby(penguins, :species),
:body_mass_g => (x -> mean(skipmissing(x))) => :body_mass_g,
:flipper_length_mm => (x -> mean(skipmissing(x))) => :flipper_length_mm,
)
In Kit, the grouping keys and aggregation spec stay close together. I like that because it keeps the “group by this, calculate that” intent visible in one place.
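For reference, the "group by this, calculate that" shape that every version above expresses can be spelled out in stdlib Python, including the skip-missing behavior the R and Julia versions opt into. This is illustrative only, not any library's implementation.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"species": "Adelie", "body_mass_g": 3750},
    {"species": "Adelie", "body_mass_g": 3800},
    {"species": "Gentoo", "body_mass_g": 5700},
    {"species": "Gentoo", "body_mass_g": None},  # missing, skipped like na.rm/skipmissing
]

groups = defaultdict(list)
for r in rows:
    if r["body_mass_g"] is not None:
        groups[r["species"]].append(r["body_mass_g"])

# Adelie -> 3775, Gentoo -> 5700
by_species = {sp: mean(vals) for sp, vals in groups.items()}
```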
Unpivoting Measurements
Reshape operations are another good stress test because they tend to expose whether a library is internally consistent or just a collection of verbs. In Kit, the wide-to-long operation lives in DataFrame.Reshape as melt.
Kit
import DataFrame.Reshape as Reshape
measurements = Reshape.melt
penguins
["species", "island", "sex"]
["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
"measurement"
"value"
R
measurements <- penguins |>
tidyr::pivot_longer(
cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
names_to = "measurement",
values_to = "value"
)
Python
measurements = pd.melt(
penguins,
id_vars=["species", "island", "sex"],
value_vars=["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"],
var_name="measurement",
value_name="value",
)
Polars
measurements = penguins.melt(
id_vars=["species", "island", "sex"],
value_vars=["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"],
variable_name="measurement",
value_name="value",
)
Julia
measurements = stack(
penguins,
[:bill_length_mm, :bill_depth_mm, :flipper_length_mm, :body_mass_g],
[:species, :island, :sex];
variable_name = :measurement,
value_name = :value,
)
The Kit call is explicit without being noisy. You can read off the identifier columns, the value columns, and the target names without needing much library-specific knowledge.
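The melt contract itself is small: repeat the id columns, turn each value column into a (name, value) pair. It is compact enough to write out in plain Python; the argument names mirror the calls above, but this is a sketch, not any library's implementation.

```python
def melt(rows, id_vars, value_vars, var_name, value_name):
    """Turn each wide row into one long row per value column."""
    long_rows = []
    for r in rows:
        for col in value_vars:
            out = {k: r[k] for k in id_vars}
            out[var_name] = col
            out[value_name] = r[col]
            long_rows.append(out)
    return long_rows

wide = [{"species": "Gentoo", "island": "Biscoe", "sex": "male",
         "bill_length_mm": 50.0, "body_mass_g": 5700}]
long_rows = melt(wide, ["species", "island", "sex"],
                 ["bill_length_mm", "body_mass_g"], "measurement", "value")
print(len(long_rows))  # 2: one wide row times two value columns
```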
Building New Columns
Derived columns are a nice dividing line between “spreadsheet-like convenience” and “explicit data programming.” Kit leans toward the latter: compute the new vector, then attach it to the frame.
Kit
import List
bill-ratio = List.zip-with
(fn(len, depth) => len / depth)
(DataFrame.col penguins "bill_length_mm")
(DataFrame.col penguins "bill_depth_mm")
penguins-with-ratio = DataFrame.with-column "bill_ratio" bill-ratio penguins
R
penguins_with_ratio <- penguins |>
mutate(bill_ratio = bill_length_mm / bill_depth_mm)
Python
penguins_with_ratio = penguins.assign(
bill_ratio=penguins["bill_length_mm"] / penguins["bill_depth_mm"]
)
Polars
penguins_with_ratio = penguins.with_columns(
(pl.col("bill_length_mm") / pl.col("bill_depth_mm")).alias("bill_ratio")
)
Julia
penguins_with_ratio = transform(
penguins,
[:bill_length_mm, :bill_depth_mm] => ByRow(/) => :bill_ratio,
)
That extra explicitness is not just aesthetic. It keeps the value flow visible: read two columns, derive a third, produce a new DataFrame. It stays clear what is being allocated or transformed.
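Kit's zip-with step is essentially Python's zip: pull two columns, combine them elementwise, then attach the result. A plain-Python sketch of the same value flow, with a columns-as-lists frame that is illustrative rather than any library's representation:

```python
bill_length = [50.0, 46.0]
bill_depth = [16.0, 14.5]

# Derive the new column, then attach it alongside its sources.
bill_ratio = [length / depth for length, depth in zip(bill_length, bill_depth)]
frame = {
    "bill_length_mm": bill_length,
    "bill_depth_mm": bill_depth,
    "bill_ratio": bill_ratio,
}
print(frame["bill_ratio"][0])  # 3.125
```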
What kit-dataframe Was Designed to Avoid
The most interesting design choice in kit-dataframe is not any one verb. It is that the package is trying to avoid a common failure mode in data tooling: either everything is eagerly materialized step by step, or the “serious” path lives behind a separate lazy system that feels like a different product.
Here, the direct DataFrame API and the expression API are related on purpose. You can write small, direct transforms when that is enough, and move into DFExpr plus collect! when the pipeline is large enough that planning and reuse matter.
The local source and docs are explicit about what the optimizer is for: predicate pushdown, projection pushdown, operation fusion, and memoization of repeated subexpressions. In other words, it is trying to avoid wasted intermediate work without demanding that users learn a wholly separate style first.
Kit Lazy/Optimized Pipeline
top-heavy = penguins
|> DataFrame.of
|> DataFrame.filter (fn(row) => row.body_mass_g > 4500)
|> DataFrame.select ["species", "body_mass_g", "flipper_length_mm"]
|> DataFrame.sort-desc "body_mass_g"
|> DataFrame.head 10
|> DataFrame.collect!
That is a useful middle ground. The same package scales from straightforward scripts to more optimized pipelines without turning into a different programming model.
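The benefit the optimizer is chasing — filter early, project early, materialize once — can be felt even in stdlib Python by chaining generators, so nothing intermediate is built until the final step collects. This sketches the idea, not kit-dataframe's planner.

```python
import heapq

rows = [
    {"species": "Gentoo", "body_mass_g": 5700, "flipper_length_mm": 230, "sex": "male"},
    {"species": "Adelie", "body_mass_g": 3750, "flipper_length_mm": 181, "sex": "female"},
    {"species": "Gentoo", "body_mass_g": 5000, "flipper_length_mm": 217, "sex": "male"},
]

# Each stage is lazy: nothing runs until the final collection step.
heavy = (r for r in rows if r["body_mass_g"] > 4500)                # predicate first
projected = ({k: r[k] for k in ("species", "body_mass_g", "flipper_length_mm")}
             for r in heavy)                                        # then projection
top = heapq.nlargest(10, projected, key=lambda r: r["body_mass_g"]) # sort-desc + head fused
print([r["body_mass_g"] for r in top])  # [5700, 5000]
```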
Preloaded REPL Sessions
There is also a practical feature in kit-dataframe that is easy to overlook if you only read the API docs: the package ships with several preloaded REPL sessions for classic datasets, including Iris, Mtcars, Titanic, Penguins, Tips, and Gapminder.
That matters for this post because the Penguins dataset used here is already available as a preload. Instead of wiring up a CSV manually, you can jump into a ready-made session with pre-built subsets and helper commands for inspection, summary stats, correlations, and top/bottom queries.
$ kit repl --preload dev/penguins.kit
Kit REPL
Type ':exit' or Ctrl+D to exit, ':info <Module>' for docs, ':help' for help
The 'env' binding is available with system capabilities and info
Loading: dev/penguins.kit
kit-dataframe REPL helpers loaded!
Palmer Penguins dataset ready (150 samples, 3 species, 3 islands, 4 measurements)
penguins - Full dataset (150 rows x 7 cols)
adelie - Adelie subset (50 rows)
chinstrap - Chinstrap subset (50 rows)
gentoo - Gentoo subset (50 rows)
torgersen - Torgersen island subset (17 rows)
biscoe - Biscoe island subset (67 rows)
dream - Dream island subset (66 rows)
males - Male subset (75 rows)
females - Female subset (75 rows)
measurements - All rows, numeric columns only
Helper functions:
preview df - Show first 5 rows
info df - Shape, columns, and stats
col-stats col df - Mean/std/min/max/median for a column
compare-by-species col - Compare a measurement across species
compare-by-island col - Compare a measurement across islands
corr col1 col2 - Pearson correlation
sorted col - Sort penguins by column
top n col - Top n rows by column (descending)
bottom n col - Bottom n rows by column (ascending)
Try: preview penguins
Try: compare-by-species "bill-length"
Try: corr "flipper-length" "body-mass"
Try: top 10 "body-mass" |> DataFrame.to-string |> println
Preload complete
Penguins≫
Each preload provides helpers like preview df, info df, col-stats col df, corr col1 col2, and top n col. That makes the package useful not only as a library you script against, but also as an interactive environment for exploring the same datasets you may later process in code.
Why the Kit Version Reads Differently
After the first few examples, the syntax stops being the interesting part. The real distinction is the programming model underneath the syntax.
- Kit pushes you toward pipelines made out of plain functions, not hidden mutation.
- Errors are explicit at the boundaries, especially when reading or writing data.
- Row filters are ordinary predicates, which keeps the language surface smaller.
- DataFrame operations return new values, so the flow of state is easier to reason about.
- The same package also supports lazy expressions, optimized collection, and memoized evaluation when a pipeline gets large enough to care.
- When you need more than the basics, kit-dataframe already has reshape, stats, joins, parallel operations, and helper modules for column work.
If you come from R, the main shift is that Kit does less metaprogramming and more plain-language function composition. If you come from pandas or Polars, the shift is toward immutable value flow. If you come from Julia, the change is mostly one of syntax and defaults rather than expressiveness.
Final Thoughts
kit-dataframe is still new, but the core operations already feel coherent. The package covers the common manipulation tasks you reach for first, and the API is consistent with the rest of Kit: explicit, functional, pipe-friendly, and able to scale into optimized expression pipelines when that becomes useful.
For people evaluating Kit as a serious language for analytics, that matters more than trying to exactly mimic dplyr, pandas, Polars, or Julia’s DataFrames. Familiarity helps, but coherence helps more.
Continue with the package documentation: