Data Manipulation in Kit Compared to R, Python, Polars, and Julia


The comparison format here was inspired by Data Manipulation in Clojure Compared to R and Python, but this post uses kit-dataframe and adds Polars and Julia to the comparison.

Most people evaluating Kit for data work already know dplyr, pandas, Polars, or Julia’s DataFrames.jl. The useful question is not what kit-dataframe can do in the abstract. It is how to do everyday data manipulation in Kit.

This post focuses on a small set of common tasks: read a CSV, inspect it, filter it, summarize it, reshape it, and build a derived column. That is enough to show both the everyday workflow and some of the design choices behind kit-dataframe.

All of the examples use the same local penguins.csv file. The dataset is just a reference point so the style of each library is easy to compare.

Reading a CSV

The first place Kit visibly diverges from the others is error handling. Reading a file returns a Result, so the failure case is part of the normal shape of the code rather than something implicit or deferred.

Kit

import Kit.Dataframe as DataFrame

penguins = match DataFrame.read-csv "penguins.csv"
  | Ok df -> df
  | Err _ -> panic "could not read penguins.csv"

CSV support in Kit does not depend on kit-dataframe. The standard library already has Encoding.CSV for parsing, formatting, file I/O, header-based parsing, custom delimiters, and automatic delimiter detection. In the Kit source, the shared CSV parser also has a SIMD-accelerated path for larger files when the input is a good fit for it. DataFrame.read-csv and DataFrame.parse-csv build on top of that stdlib CSV support.

R

library(readr)
library(dplyr)

penguins <- read_csv("penguins.csv", na = "NA")

Python

import pandas as pd

penguins = pd.read_csv("penguins.csv", na_values=["NA"])

Polars

import polars as pl

penguins = pl.read_csv("penguins.csv", null_values="NA")

Julia

using CSV, DataFrames

penguins = CSV.read("penguins.csv", DataFrame; missingstring = "NA")
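Of the four comparison libraries, none makes the failure case part of the return value; pandas, for instance, simply raises. To mirror Kit's Result shape in Python you would have to build it yourself. A minimal stdlib-only sketch, using a hypothetical `try_read_csv` helper that returns a tagged tuple instead of raising:

```python
import csv

def try_read_csv(path):
    """Return ("ok", rows) on success or ("err", message) on failure,
    mirroring a Result-style return instead of raising."""
    try:
        with open(path, newline="") as f:
            return ("ok", list(csv.DictReader(f)))
    except OSError as exc:
        return ("err", str(exc))

# The caller must handle both arms, just like the match on Result above
tag, value = try_read_csv("penguins.csv")
penguins = value if tag == "ok" else []
```

The point is not that Python cannot express this, but that nothing in the ecosystem pushes you toward it; in Kit the failure arm is part of the normal shape of the code.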

Quick Inspection

After loading the file, the operations are the usual ones: look at the top, inspect names, choose columns, drop columns, and sort. The point is not whether Kit can do these things, but how the verbs are shaped. They are mostly function-first and DataFrame-last, which makes them read naturally in a pipeline.

This gives a quick sense of how close the basic operations are across the libraries.

First 5 rows
  Kit:     DataFrame.head 5 penguins
  R:       slice_head(penguins, n = 5)
  Python:  penguins.head(5)
  Polars:  penguins.head(5)
  Julia:   first(penguins, 5)

Column names
  Kit:     DataFrame.columns penguins
  R:       colnames(penguins)
  Python:  penguins.columns
  Polars:  penguins.columns
  Julia:   names(penguins)

Select columns
  Kit:     DataFrame.select ["species", "body_mass_g"] penguins
  R:       select(penguins, species, body_mass_g)
  Python:  penguins[["species", "body_mass_g"]]
  Polars:  penguins.select(["species", "body_mass_g"])
  Julia:   select(penguins, [:species, :body_mass_g])

Drop columns
  Kit:     DataFrame.drop ["sex"] penguins
  R:       select(penguins, -sex)
  Python:  penguins.drop(columns=["sex"])
  Polars:  penguins.drop("sex")
  Julia:   select(penguins, Not(:sex))

Sort descending
  Kit:     DataFrame.sort-desc "body_mass_g" penguins
  R:       arrange(penguins, desc(body_mass_g))
  Python:  penguins.sort_values("body_mass_g", ascending=False)
  Polars:  penguins.sort("body_mass_g", descending=True)
  Julia:   sort(penguins, :body_mass_g, rev=true)
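As a quick check that the pandas verbs really do chain this way, here is a sketch on an inline toy frame (made-up data so the snippet runs without penguins.csv):

```python
import pandas as pd

# Toy stand-in for the penguins frame
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "sex": ["male", "female", "male"],
    "body_mass_g": [3750, 5500, 3800],
})

top = (
    penguins[["species", "body_mass_g"]]          # select columns
    .sort_values("body_mass_g", ascending=False)  # sort descending
    .head(2)                                      # first rows
)
```

The Kit column of the table composes the same way with |>, just function-first instead of method-first.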

Row Filters and Column Projection

This is where the differences between the libraries become clearer. Kit does not lean on a data-specific mini-language here. The filter condition is an ordinary function over rows, and selection is a separate transform in the same pipeline.

Kit

heavy-gentoo = penguins
  |> DataFrame.filter (fn(row) =>
    row.body_mass_g > 5000 and row.species == "Gentoo")
  |> DataFrame.select ["species", "island", "body_mass_g"]

R

heavy_gentoo <- penguins |>
  filter(body_mass_g > 5000, species == "Gentoo") |>
  select(species, island, body_mass_g)

Python

heavy_gentoo = (
    penguins
    .loc[
        (penguins["body_mass_g"] > 5000) & (penguins["species"] == "Gentoo"),
        ["species", "island", "body_mass_g"],
    ]
)

Polars

heavy_gentoo = (
    penguins
    .filter((pl.col("body_mass_g") > 5000) & (pl.col("species") == "Gentoo"))
    .select(["species", "island", "body_mass_g"])
)

Julia

heavy_gentoo = subset(
    select(penguins, [:species, :island, :body_mass_g]),
    :body_mass_g => ByRow(>(5000)),
    :species => ByRow(==("Gentoo"));
    skipmissing = true,
)

That makes the code slightly more explicit than the shortest dplyr or Polars version, but also more uniform. The same Kit language constructs that work in the rest of the language are doing the work here too.
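For what it is worth, pandas can also take the ordinary-function route, at a real speed cost: a row predicate is just a Python function handed to apply. A sketch on a toy frame (inline data, hypothetical `is_heavy_gentoo` helper):

```python
import pandas as pd

penguins = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Adelie"],
    "island": ["Biscoe", "Biscoe", "Dream"],
    "body_mass_g": [5700, 4800, 3700],
})

def is_heavy_gentoo(row):
    # A plain predicate over one row, no expression mini-language
    return row["body_mass_g"] > 5000 and row["species"] == "Gentoo"

heavy_gentoo = penguins[penguins.apply(is_heavy_gentoo, axis=1)][
    ["species", "island", "body_mass_g"]
]
```

The difference is that in pandas this is the slow path you avoid; in Kit it is the ordinary path.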

Aggregate by Species

Grouped summaries are usually where tabular libraries stop feeling interchangeable. Different ecosystems make very different bets about how aggregation specs should be expressed.

Kit

by-species = DataFrame.group-by-agg penguins ["species"] {
    body_mass_g: "mean",
    flipper_length_mm: "mean"
  }

R

by_species <- penguins |>
  group_by(species) |>
  summarise(
    body_mass_g = mean(body_mass_g, na.rm = TRUE),
    flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE)
  )

Python

by_species = (
    penguins
    .groupby("species", dropna=False)[["body_mass_g", "flipper_length_mm"]]
    .mean(numeric_only=True)
    .reset_index()
)

Polars

by_species = (
    penguins
    .group_by("species")
    .agg(
        pl.col("body_mass_g").mean(),
        pl.col("flipper_length_mm").mean(),
    )
)

Julia

using Statistics

by_species = combine(
    groupby(penguins, :species),
    :body_mass_g => (x -> mean(skipmissing(x))) => :body_mass_g,
    :flipper_length_mm => (x -> mean(skipmissing(x))) => :flipper_length_mm,
)

In Kit, the grouping keys and aggregation spec stay close together. I like that because it keeps the “group by this, calculate that” intent visible in one place.
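pandas can get close to that shape too, by passing the aggregation spec as a single dict so the keys and the reductions sit together (toy frame inline so the snippet runs standalone):

```python
import pandas as pd

penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3700, 3900, 5500],
    "flipper_length_mm": [181, 185, 217],
})

# One dict: "group by species, take the mean of these two columns"
by_species = (
    penguins.groupby("species")
    .agg({"body_mass_g": "mean", "flipper_length_mm": "mean"})
    .reset_index()
)
```

The dict-spec style is closest to Kit's; the chained version at the top of this section trades that locality for method-chaining flexibility.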

Unpivoting Measurements

Reshape operations are another good stress test because they tend to expose whether a library is internally consistent or just a collection of verbs. In Kit, the wide-to-long operation lives in DataFrame.Reshape as melt.

Kit

import DataFrame.Reshape as Reshape

measurements = Reshape.melt
    penguins
    ["species", "island", "sex"]
    ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
    "measurement"
    "value"

R

measurements <- penguins |>
  tidyr::pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    names_to = "measurement",
    values_to = "value"
  )

Python

measurements = pd.melt(
    penguins,
    id_vars=["species", "island", "sex"],
    value_vars=["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"],
    var_name="measurement",
    value_name="value",
)

Polars

measurements = penguins.unpivot(
    index=["species", "island", "sex"],
    on=["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"],
    variable_name="measurement",
    value_name="value",
)

Julia

measurements = stack(
    penguins,
    [:bill_length_mm, :bill_depth_mm, :flipper_length_mm, :body_mass_g],
    [:species, :island, :sex];
    variable_name = :measurement,
    value_name = :value,
)

The Kit call is explicit without being noisy. You can read off the identifier columns, the value columns, and the target names without needing much library-specific knowledge.
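One quick way to sanity-check any of these calls is the row-count invariant: the long frame has exactly (original rows) x (number of value columns) rows. A small pandas demonstration on a toy frame:

```python
import pandas as pd

penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 47.5],
    "body_mass_g": [3750, 5500],
})

long = pd.melt(
    penguins,
    id_vars=["species"],
    value_vars=["bill_length_mm", "body_mass_g"],
    var_name="measurement",
    value_name="value",
)

# Invariant: rows_long == rows_wide * number of melted columns
assert len(long) == len(penguins) * 2
```

The same invariant should hold for the Kit, R, Polars, and Julia versions above.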

Building New Columns

Derived columns are a nice dividing line between “spreadsheet-like convenience” and “explicit data programming.” Kit leans toward the latter: compute the new vector, then attach it to the frame.

Kit

import List

bill-ratio = List.zip-with
    (fn(len, depth) => len / depth)
    (DataFrame.col penguins "bill_length_mm")
    (DataFrame.col penguins "bill_depth_mm")

penguins-with-ratio = DataFrame.with-column "bill_ratio" bill-ratio penguins

R

penguins_with_ratio <- penguins |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

Python

penguins_with_ratio = penguins.assign(
    bill_ratio=penguins["bill_length_mm"] / penguins["bill_depth_mm"]
)

Polars

penguins_with_ratio = penguins.with_columns(
    (pl.col("bill_length_mm") / pl.col("bill_depth_mm")).alias("bill_ratio")
)

Julia

penguins_with_ratio = transform(
    penguins,
    [:bill_length_mm, :bill_depth_mm] => ByRow(/) => :bill_ratio,
)

That extra explicitness is not just aesthetic. It keeps the value flow visible: read two columns, derive a third, produce a new DataFrame. It stays clear what is being allocated or transformed.
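The same compute-the-vector-then-attach flow is available in pandas if you want it, and assign returns a new frame, so the original stays untouched there too (toy frame inline):

```python
import pandas as pd

penguins = pd.DataFrame({
    "bill_length_mm": [39.1, 47.5],
    "bill_depth_mm": [18.7, 15.0],
})

# Step 1: derive the new vector explicitly
bill_ratio = penguins["bill_length_mm"] / penguins["bill_depth_mm"]

# Step 2: attach it, producing a new frame; penguins itself is unchanged
penguins_with_ratio = penguins.assign(bill_ratio=bill_ratio)
```

The difference is that pandas merely permits this style; kit-dataframe makes it the default.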

What kit-dataframe Was Designed to Avoid

The most interesting design choice in kit-dataframe is not any one verb. It is that the package is trying to avoid a common failure mode in data tooling: either everything is eagerly materialized step by step, or the “serious” path lives behind a separate lazy system that feels like a different product.

Here, the direct DataFrame API and the expression API are related on purpose. You can write small, direct transforms when that is enough, and move into DFExpr plus collect! when the pipeline is large enough that planning and reuse matter.

The package's source and docs are explicit about what the optimizer is for: predicate pushdown, projection pushdown, operation fusion, and memoization of repeated subexpressions. In other words, it tries to avoid wasted intermediate work without demanding that users first learn a wholly separate style.
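These are the same ideas behind other lazy engines, such as Polars' LazyFrame. To show why deferring helps, here is a toy sketch of the fusion part in plain Python (a hypothetical `ToyLazyFrame`, not Kit's actual machinery): predicates recorded up front are applied together in one pass at collect time, instead of materializing an intermediate frame per step.

```python
class ToyLazyFrame:
    """Minimal deferred pipeline: record filters/projection, run once at collect."""

    def __init__(self, rows):
        self.rows = rows          # list of dicts
        self.predicates = []      # fused filter chain
        self.projection = None    # columns to keep, if any

    def filter(self, pred):
        self.predicates.append(pred)   # nothing executes yet
        return self

    def select(self, cols):
        self.projection = cols         # projection deferred too
        return self

    def collect(self):
        out = []
        for row in self.rows:                          # single pass over the data
            if all(p(row) for p in self.predicates):   # all filters fused together
                if self.projection is not None:        # projection applied last
                    row = {c: row[c] for c in self.projection}
                out.append(row)
        return out

rows = [
    {"species": "Gentoo", "body_mass_g": 5700, "island": "Biscoe"},
    {"species": "Adelie", "body_mass_g": 3700, "island": "Dream"},
]
heavy = (
    ToyLazyFrame(rows)
    .filter(lambda r: r["body_mass_g"] > 4500)
    .select(["species", "body_mass_g"])
    .collect()
)
```

A real optimizer does much more (reordering, pushdown into the scan, subexpression caching), but the shape is the same: build a plan, then execute it once.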

Kit Lazy/Optimized Pipeline

top-heavy = penguins
  |> DataFrame.of
  |> DataFrame.filter (fn(row) => row.body_mass_g > 4500)
  |> DataFrame.select ["species", "body_mass_g", "flipper_length_mm"]
  |> DataFrame.sort-desc "body_mass_g"
  |> DataFrame.head 10
  |> DataFrame.collect!

That is a useful middle ground. The same package scales from straightforward scripts to more optimized pipelines without turning into a different programming model.

Preloaded REPL Sessions

There is also a practical feature in kit-dataframe that is easy to overlook if you only read the API docs: the package ships with several preloaded REPL sessions for classic datasets, including Iris, Mtcars, Titanic, Penguins, Tips, and Gapminder.

That matters for this post because the Penguins dataset used here is already available as a preload. Instead of wiring up a CSV manually, you can jump into a ready-made session with pre-built subsets and helper commands for inspection, summary stats, correlations, and top/bottom queries.

$ kit repl --preload dev/penguins.kit
Kit REPL
Type ':exit' or Ctrl+D to exit, ':info <Module>' for docs, ':help' for help
The 'env' binding is available with system capabilities and info

Loading: dev/penguins.kit
kit-dataframe REPL helpers loaded!

Palmer Penguins dataset ready (150 samples, 3 species, 3 islands, 4 measurements)
  penguins     - Full dataset (150 rows x 7 cols)
  adelie       - Adelie subset (50 rows)
  chinstrap    - Chinstrap subset (50 rows)
  gentoo       - Gentoo subset (50 rows)
  torgersen    - Torgersen island subset (17 rows)
  biscoe       - Biscoe island subset (67 rows)
  dream        - Dream island subset (66 rows)
  males        - Male subset (75 rows)
  females      - Female subset (75 rows)
  measurements - All rows, numeric columns only

Helper functions:
  preview df              - Show first 5 rows
  info df                 - Shape, columns, and stats
  col-stats col df        - Mean/std/min/max/median for a column
  compare-by-species col  - Compare a measurement across species
  compare-by-island col   - Compare a measurement across islands
  corr col1 col2          - Pearson correlation
  sorted col              - Sort penguins by column
  top n col               - Top n rows by column (descending)
  bottom n col            - Bottom n rows by column (ascending)

Try: preview penguins
Try: compare-by-species "bill-length"
Try: corr "flipper-length" "body-mass"
Try: top 10 "body-mass" |> DataFrame.to-string |> println
Preload complete

Penguins≫

Each preload provides helpers like preview df, info df, col-stats col df, corr col1 col2, and top n col. That makes the package useful not only as a library you script against, but also as an interactive environment for exploring the same datasets you may later process in code.
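The corr helper computes a Pearson correlation, which is just covariance divided by the product of the standard deviations. For reference, the formula is small enough to sketch directly in plain Python (this says nothing about Kit's actual implementation):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Perfectly linear data gives 1.0 (or -1.0 for a negative slope), which is a handy spot check against whatever corr reports in the REPL.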

Why the Kit Version Reads Differently

After the first few examples, the syntax stops being the interesting part. The real distinction is the programming model underneath the syntax.

If you come from R, the main shift is that Kit does less metaprogramming and more plain-language function composition. If you come from pandas or Polars, the shift is toward immutable value flow. If you come from Julia, the change is mostly one of syntax and defaults rather than expressiveness.

Final Thoughts

kit-dataframe is still new, but the core operations already feel coherent. The package covers the common manipulation tasks you reach for first, and the API is consistent with the rest of Kit: explicit, functional, pipe-friendly, and able to scale into optimized expression pipelines when that becomes useful.

For people evaluating Kit as a serious language for analytics, that matters more than trying to exactly mimic dplyr, pandas, Polars, or Julia’s DataFrames. Familiarity helps, but coherence helps more.

Continue with the package documentation.