Visual data summaries
datavis, composition, operators
When exploring a new dataset, we face an immediate challenge: How do we quickly understand the structure and distribution of all our columns?
This notebook explores a “show everything” approach:
- Visual summaries for each column (distributions, categories)
- Summary statistics paired with visualizations
- A scatterplot-matrix-like view showing all column combinations
The goal is to enable rapid visual discovery of patterns and relationships.
Starting with a Complete Dataset
Let’s load a well-known dataset and explore how to present its columns effectively:
;; Namespace aliases used throughout (namespaces assumed from the alias names in the code):
(require '[clojure.string :as str]
         '[fastmath.stats :as fms]
         '[scicloj.kindly.v4.kind :as kind]
         '[scicloj.metamorph.ml.rdatasets :as rdatasets]
         '[tablecloth.api :as tc]
         '[tablecloth.column.api :as tcc])
(def penguins
  (tc/drop-missing (rdatasets/palmerpenguins-penguins)))
Option 1: Print the data
We could just print the dataset, but that shows only a handful of its 333 rows:
penguins
https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv [333 9]:
| :rownames | :species | :island | :bill-length-mm | :bill-depth-mm | :flipper-length-mm | :body-mass-g | :sex | :year |
|---|---|---|---|---|---|---|---|---|
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
| 13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 |
| 14 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male | 2007 |
| 15 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male | 2007 |
| … | … | … | … | … | … | … | … | … |
| 334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
| 335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
| 336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
| 337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
| 338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
| 339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
| 340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
| 341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
| 342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
| 343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
| 344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Option 2: Summary statistics
We could compute statistics for a single column:
(fms/stats-map (:bill-length-mm penguins))
{:MAD 4.700000000000003,
:UOF 76.40000000000003,
:Skewness 0.045340470420402026,
:Max 59.6,
:Variance 29.906333441875624,
:Size 333,
:LAV 32.1,
:UIF 62.52500000000002,
:Mode 41.1,
:Mean 43.99279279279281,
:Q1 39.4,
:Q3 48.650000000000006,
:Min 32.1,
:LIF 25.524999999999988,
:Range 27.5,
:Total 14649.600000000006,
:SD 5.468668342647561,
:IQR 9.250000000000007,
:Outliers (),
:UAV 59.6,
:LOF 11.649999999999977,
:SEM 0.29968117914670855,
:Kurtosis -0.8834182330572031,
:Median 44.5}
But turning these numbers into a mental picture of the distribution takes effort.
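To illustrate that effort: several of the keys above are derived from the quartiles. Assuming the usual box-plot conventions (IQR = Q3 - Q1, inner fences at 1.5 × IQR, outer fences at 3 × IQR), which agree with the printed :LIF, :UIF, :LOF and :UOF values, the arithmetic can be sketched as follows. The function name `tukey-fences` is made up for this example and is not part of fastmath.

```clojure
;; Sketch: derive the box-plot fences from the quartiles printed above.
;; `tukey-fences` is a hypothetical helper, not a fastmath function.
(defn tukey-fences [q1 q3]
  (let [iqr (- q3 q1)]
    {:IQR iqr
     :LIF (- q1 (* 1.5 iqr))    ; lower inner fence
     :UIF (+ q3 (* 1.5 iqr))    ; upper inner fence
     :LOF (- q1 (* 3.0 iqr))    ; lower outer fence
     :UOF (+ q3 (* 3.0 iqr))})) ; upper outer fence

;; Q1 and Q3 of :bill-length-mm, copied from the map above.
(tukey-fences 39.4 48.650000000000006)
```

The results agree with the printed map up to floating-point rounding. Since :Min (32.1) and :Max (59.6) both lie between the inner fences, it is consistent that :Outliers is empty.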
Option 3: Visual summaries
What if we automatically plot the distribution of every column? This lets us see patterns at a glance.
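The idea can be sketched without any plotting library: pick a chart type from the values, then turn counts into bar heights relative to the tallest bar. The snippet below is a simplified illustration on plain Clojure collections rather than dataset columns; the names `choose-geometry` and `bar-heights` and the sample data are invented for the example.

```clojure
;; Sketch on plain Clojure collections (not tablecloth columns).
(defn choose-geometry
  "Histogram for all-numeric values, bar chart otherwise."
  [values]
  (if (every? number? (remove nil? values))
    :histogram
    :bar))

(defn bar-heights
  "Maps each distinct value to a bar height, scaled so that the most
  frequent value gets `max-height`. Returns nil for empty input."
  [values max-height]
  (let [counts (frequencies (remove nil? values))]
    (when (seq counts)
      (let [max-count (apply max (vals counts))]
        (into {}
              (for [[v n] counts]
                [v (* max-height (/ n (double max-count)))]))))))

(choose-geometry ["a" "b" "a"]) ; => :bar
(choose-geometry [1.5 2 nil])   ; => :histogram

;; "a" occurs three times, "b" twice, "c" once:
;; "a" gets the full height, "b" about two thirds, "c" about one third.
(bar-heights ["a" "b" "a" "c" "a" "b"] 100)
```

Dividing by the largest count rather than the total keeps the tallest bar at full height, which makes small plots easier to read at the cost of hiding absolute frequencies.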
Visualization inference
(def plot-width 100)
(def plot-height 100)
Type detection: determines whether to show a histogram (numeric) or a bar chart (categorical).
(defn is-numeric-type? [col]
  (tcc/typeof? col :numerical))
(defn plot-basic [g]
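  (let [{:keys [data mappings geometry]} (g 1)
        {:keys [x y]} mappings
        ;; Linear scale from the range of vs onto [0, extent], as doubles.
        scale (fn [vs extent]
                (let [lo (apply min vs)
                      span (- (apply max vs) lo)]
                  (fn [v]
                    (if (zero? span)
                      (/ extent 2.0)
                      (double (* extent (/ (- v lo) span)))))))
        ;; Row-wise [x y] pairs scaled into the plot area. SVG's y axis
        ;; points down, so y is flipped.
        scaled-xys (fn []
                     (let [xys (remove #(some nil? %)
                                       (map vector (data x) (data y)))]
                       (when (seq xys)
                         (let [sx (scale (map first xys) plot-width)
                               sy (scale (map second xys) plot-height)]
                           (for [[xv yv] xys]
                             [(sx xv) (- plot-height (sy yv))])))))]
    (for [geom geometry]
      (case geom
        :bar (let [x-vals (remove nil? (data x))
                   categories (distinct x-vals)
                   counts (frequencies x-vals)
                   max-count (when (seq counts) (apply max (vals counts)))]
               (when max-count
                 ;; Doubles keep ratios such as 100/3 out of the SVG attributes.
                 (let [bar-width (double (/ plot-width (count categories)))]
                   (for [[i category] (map-indexed vector categories)]
                     (let [n (get counts category 0)
                           bar-height (* (/ n (double max-count)) plot-height)]
                       [:rect {:x (* i bar-width)
                               :y (- plot-height bar-height)
                               :width bar-width
                               :height bar-height
                               :fill "lightblue"
                               :stroke "gray"
                               :stroke-width 0.5}])))))
        :histogram (let [values (remove nil? (data x))
                         hist-result (when (seq values) (fms/histogram values))
                         bins (:bins-maps hist-result)]
                     (when (seq bins)
                       (let [max-count (apply max (map :count bins))
                             bin-width (double (/ plot-width (count bins)))]
                         (for [[i bin] (map-indexed vector bins)]
                           (let [bar-height (* (/ (:count bin) (double max-count)) plot-height)]
                             [:rect {:x (* i bin-width)
                                     :y (- plot-height bar-height)
                                     :width bin-width
                                     :height bar-height
                                     :fill "lightblue"
                                     :stroke "gray"
                                     :stroke-width 0.5}])))))
        ;; Points and lines read the columns row by row and use scaled
        ;; coordinates, so raw values such as 4000 g stay inside the plot.
        :point (for [[cx cy] (scaled-xys)]
                 [:circle {:r 2, :cx cx, :cy cy, :fill "lightblue"}])
        :line (when-let [xys (seq (scaled-xys))]
                [:path {:d (str "M " (str/join "," (first xys))
                                " L " (str/join " " (map #(str/join "," %) (rest xys))))}])))))
(defn plot-distribution [ds column geom]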
  ^:kind/hiccup
  [:svg {:width 100
         :viewBox (str/join " " [0 0 plot-width plot-height])
         :xmlns "http://www.w3.org/2000/svg"
         :style {:border "solid 1px gray"}}
   [:g {:stroke "gray", :fill "none"}
    (plot-basic [:graphic {:data ds
                           :mappings {:x column}
                           :geometry geom}])]])
(plot-distribution penguins :bill-length-mm [:histogram])
Single Column Summaries
The summarize function automatically selects the right visualization type:
- Numeric columns → histogram (shows distribution shape)
- Categorical columns → bar chart (shows frequencies)
(defn summarize [ds column]
  (if (is-numeric-type? (ds column))
    (plot-distribution ds column [:histogram])
    (plot-distribution ds column [:bar])))
A companion function provides numeric summaries alongside the visualizations:
- Numeric columns → count, mean, standard deviation, min, and max
- Categorical columns → count and number of distinct values
(defn get-summary-stats [ds column]
  (let [col (ds column)]
    (if (is-numeric-type? col)
      (let [stats (tcc/descriptive-statistics col)]
        (format "n: %d, μ: %.2f, σ: %.2f, min: %.2f, max: %.2f"
                (:n-elems stats)
                (:mean stats)
                (:standard-deviation stats)
                (:min stats)
                (:max stats)))
      (let [values (tcc/drop-missing col)
            counts (frequencies values)]
        (str "n: " (count values) ", unique: " (count counts))))))
Summary Table: All Columns at a Glance
The summary table pairs a visualization with statistics for every column, giving a complete overview of the dataset’s structure.
(defn visual-summary [ds]
  (kind/table
   (doall (for [column-name (tc/column-names ds)]
            [column-name (summarize ds column-name) (get-summary-stats ds column-name)]))))
(visual-summary penguins)
| rownames | (histogram) | n: 333, μ: 174.32, σ: 98.39, min: 1.00, max: 344.00 |
| species | (bar chart) | n: 333, unique: 3 |
| island | (bar chart) | n: 333, unique: 3 |
| bill-length-mm | (histogram) | n: 333, μ: 43.99, σ: 5.47, min: 32.10, max: 59.60 |
| bill-depth-mm | (histogram) | n: 333, μ: 17.16, σ: 1.97, min: 13.10, max: 21.50 |
| flipper-length-mm | (histogram) | n: 333, μ: 200.97, σ: 14.02, min: 172.00, max: 231.00 |
| body-mass-g | (histogram) | n: 333, μ: 4207.06, σ: 805.22, min: 2700.00, max: 6300.00 |
| sex | (bar chart) | n: 333, unique: 2 |
| year | (histogram) | n: 333, μ: 2008.04, σ: 0.81, min: 2007.00, max: 2009.00 |
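The numeric entries in this table can be cross-checked by hand. The σ reported for bill-length-mm (5.47) matches the :SD value printed earlier (5.4687), which suggests the sample standard deviation (dividing by n - 1). Under that assumption, the same kind of summary line can be sketched on a plain vector. `summary-stats`, `summary-line`, and the sample values are made up for illustration.

```clojure
;; Sketch: n, mean, sample standard deviation, min and max of a plain vector.
(defn summary-stats [values]
  (let [n (count values)
        mean (/ (reduce + values) (double n))
        ;; sample variance: sum of squared deviations divided by n - 1
        variance (/ (reduce + (map #(let [d (- % mean)] (* d d)) values))
                    (dec n))]
    {:n n
     :mean mean
     :sd (Math/sqrt variance)
     :min (apply min values)
     :max (apply max values)}))

(defn summary-line
  "Formats the statistics in the same layout as the table above."
  [values]
  (let [{:keys [n mean sd] lo :min hi :max} (summary-stats values)]
    (format "n: %d, μ: %.2f, σ: %.2f, min: %.2f, max: %.2f"
            n mean sd (double lo) (double hi))))

;; Eight made-up values with mean 5 and sample variance 32/7.
(summary-line [2 4 4 4 5 5 7 9])
```

Note that `format` follows the JVM's default locale, so the decimal separator in the resulting string may differ between machines.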
Matrix View: All Column Combinations
The next step: instead of showing each column separately, what if we show how every column relates to every other column? This is the idea behind the scatterplot matrix.
The matrix automatically chooses the right chart for each combination:
- Numeric × Numeric → scatter plot (reveals relationships)
- Otherwise → bar chart (shows the value counts of the x column)
(defn matrix [ds]
  (let [column-names (tc/column-names ds)
        c (count column-names)]
    ^:kind/hiccup
    [:svg {:width "100%"
           :viewBox (str/join " " [0 0 (* plot-width c) (* plot-height c)])
           :xmlns "http://www.w3.org/2000/svg"
           :style {:border "solid 1px gray"}}
     [:g {:stroke "gray", :fill "none"}
      (for [[a-idx a] (map-indexed vector column-names)
            [b-idx b] (map-indexed vector column-names)]
        (let [col-a (ds a)
              col-b (ds b)
              a-numeric? (is-numeric-type? col-a)
              b-numeric? (is-numeric-type? col-b)]
          [:g {:transform (str "translate(" (* a-idx plot-width) "," (* b-idx plot-height) ")")}
           [:rect {:x 0 :y 0 :width plot-width :height plot-height
                   :fill "none" :stroke "gray" :stroke-width 1}]
           (plot-basic [:graphic {:data ds
                                  :mappings {:x a :y b}
                                  :geometry (cond
                                              (and a-numeric? b-numeric?) [:point]
                                              :else [:bar])}])]))]]))
(matrix penguins)
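To see how the grid is laid out, the cell offsets and the chart choice can be computed on their own. This is a simplified sketch: `cell-layout` is a hypothetical helper, the cell size is passed in as a parameter, and the numeric flags come from a plain map instead of being detected from dataset columns.

```clojure
;; Sketch: compute each cell's position and geometry without drawing anything.
(defn cell-layout
  "One entry per (a, b) column pair: the SVG translate offset of the cell
  and the geometry chosen for it. `numeric?` maps column name -> boolean."
  [column-names numeric? cell-size]
  (for [[a-idx a] (map-indexed vector column-names)
        [b-idx b] (map-indexed vector column-names)]
    {:x a
     :y b
     ;; column a picks the horizontal offset, column b the vertical one
     :translate [(* a-idx cell-size) (* b-idx cell-size)]
     :geometry (if (and (numeric? a) (numeric? b)) :point :bar)}))

;; Three columns produce a 3 × 3 grid of nine cells.
(cell-layout [:species :bill-length-mm :body-mass-g]
             {:species false, :bill-length-mm true, :body-mass-g true}
             100)
```

With c columns the grid has c × c cells, so the nine-column penguins dataset yields 81 small plots. Cells on the diagonal pair a column with itself.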