From Correlations to Recommendations
A Publisher’s Journey into Data-Driven Book Sales
When you run an indie publishing house with over 160 titles and sell thousands of books each month, one question keeps coming back: Which books do our customers buy together? This seemingly simple question led me down a fascinating path from basic correlation analysis to building a more robust recommendation system using association rule mining—all with Clojure and the SciCloj ecosystem.

The Starting Point: Understanding Our Data
As a publisher at Jan Melvil Publishing, I had access to rich data: about 58,000 orders from 34,000 customers spanning several years. But the data wasn’t structured for analysis. Orders looked like this:
zakaznik | datum | produkt-produkty |
---|---|---|
customer-21289 | 2023-10-15 09:06:41 | 1× book-047 |
customer-23196 | 2023-08-17 23:58:28 | 1× book-000, 1× book-000, 1× book-000 |
customer-1090 | 2023-09-11 14:54:33 | 1× book-048, 1× book-048, 1× book-048, 1× book-014, 1× book-014, 1× book-014 |
customer-22600 | 2023-09-05 11:56:20 | 1× book-032 |
customer-21937 | 2023-10-16 12:34:43 | 1× book-047 |
(For clarity, many columns are omitted here; the rows were generated with (tc/random 5) from the anonymized dataset. The Czech column names translate as zakaznik = customer, datum = date, produkt-produkty = product(s).)
Each row represented one order, with books listed as comma-separated values. There were many exceptions, inconsistencies, and format-based differences.¹ To analyze purchasing patterns, I needed to transform this into a format where each book becomes a binary feature: did a customer buy it (1) or not (0)? This transformation is called one-hot encoding.
¹ We sell e-books and audiobooks too – readers can use our fantastic app.
The Transformation: Making Data Analysis-Ready
The transformation from raw orders to an analysis-ready format was crucial. Using Tablecloth, the pipeline was straightforward (and could be simplified even further).
;; From customer orders with book lists...
(map
 (fn [customer-row]
   (let [customer-name    (:zakaznik customer-row)
         books-bought-set (set (parse-books-from-list (:all-products customer-row)))
         one-hot-map      (reduce (fn [acc book]
                                    (assoc acc book (if (contains? books-bought-set book) 1 0)))
                                  {}
                                  all-titles)]
     (merge {:zakaznik customer-name}
            one-hot-map)))
 (tc/rows customer+orders :as-maps))
;; ...to a binary matrix where each column is a book
After transformation, each customer became a row, and each book a column with 1 or 0:
zakaznik | book-006 | book-004 | book-007 | book-003 | book-005 | book-000 | book-008 | book-001 |
---|---|---|---|---|---|---|---|---|
customer-1642 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
customer-8113 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
customer-16443 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
customer-21874 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Now I could start asking interesting questions about co-purchase patterns.
First Insights: The Correlation Matrix
My first instinct was to calculate correlations between all books. A correlation tells you how often two books appear together compared to what you’d expect by chance. When I visualized this as a heatmap, with books ordered chronologically, something fascinating emerged:
(kind/plotly
 {:data [{:type "heatmap"
          :z (tc/columns data/corr-matrix-precalculated)
          :x (tc/column-names data/corr-matrix-precalculated)
          :y (tc/column-names data/corr-matrix-precalculated)
          :colorscale "RdBu"
          :zmid 0}]
  :layout {:title "Book Purchase Correlations (Chronological Order)"
           :xaxis {:tickangle 45}
           :margin {:l 200 :b 50}
           :width 800 :height 600
           :shapes [{:type "rect" :x0 -0.5 :y0 -0.5 :x1 80 :y1 80
                     :line {:color "yellow" :width 3}}]
           :annotations [{:x 40 :y 80 :text "Recently published books show <br>much stronger co-purchase patterns"
                          :showarrow true :arrowhead 2 :arrowsize 1 :arrowwidth 2 :arrowcolor "yellow"
                          :ax 60 :ay -60
                          :font {:size 12 :color "black"}
                          :bgcolor "rgba(255, 255, 200, 0.9)" :bordercolor "yellow" :borderwidth 2}]}})
The bright red square in the upper-left corner revealed that recently published books have much stronger co-purchase patterns than older titles. This made intuitive sense—customers discovering our catalog tend to buy multiple new releases together.
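For intuition about what each cell means: assuming a plain Pearson correlation over the 0/1 purchase columns (the matrix above is precalculated elsewhere in the pipeline), a minimal standalone version looks like this:
;; Pearson correlation of two equal-length 0/1 vectors.
;; Undefined when a column is constant (zero variance).
(defn pearson [xs ys]
  (let [n   (count xs)
        mx  (/ (reduce + xs) n)
        my  (/ (reduce + ys) n)
        cov (/ (reduce + (map #(* (- %1 mx) (- %2 my)) xs ys)) n)
        sd  (fn [vs m]
              (Math/sqrt (/ (reduce + (map #(let [d (- % m)] (* d d)) vs)) n)))]
    (/ cov (* (sd xs mx) (sd ys my)))))

(pearson [1 0 1 1] [1 0 1 0]) ;; => ~0.577, a moderate co-purchase signal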
The Limitation: Correlations Weren’t Enough
Correlations told me what books appeared together, but they couldn’t answer crucial business questions:
- Direction: Does buying Book A lead to buying Book B, or vice versa?
- Confidence: If a customer buys Book A, what’s the probability they’ll buy Book B?
- Practical significance: Is this pattern strong enough to base recommendations on?
I needed something more powerful: association rules.
Enter the Apriori Algorithm
Association rule mining, powered by the Apriori algorithm, discovers patterns like “Customers who bought Book A tend to buy Book B with 65% confidence and 2.3× lift.” These rules are directional and measurable—perfect for building a recommendation system.
The Apriori algorithm’s elegance lies in its core insight: “Any subset of a frequent itemset must also be frequent.” If the combination {Beer, Chips, Salsa} appears often, then {Beer, Chips} must appear at least as often. This simple observation allows the algorithm to prune billions of potential combinations efficiently.
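In code, that insight becomes a cheap pruning test: before counting a k-item candidate’s support over all orders, check that every one of its (k-1)-subsets survived the previous pass. A minimal sketch (the function name and argument shapes are mine, not the article’s implementation):
;; frequent: a set of frequent (k-1)-itemsets, each itself a set.
;; candidate: a k-itemset to test before the expensive counting pass.
(defn all-subsets-frequent? [frequent candidate]
  (every? (fn [item]
            (contains? frequent (disj candidate item)))
          candidate))

(all-subsets-frequent? #{#{:beer :chips} #{:beer :salsa} #{:chips :salsa}}
                       #{:beer :chips :salsa})
;; => true, so {Beer, Chips, Salsa} is worth counting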
The Clever Part: Avoiding Duplicates
One of the most elegant aspects of the Apriori implementation is how it generates larger itemsets from smaller ones without creating duplicates. Consider generating 3-item sets from 2-item sets:
We want: [:book-a :book-b :book-c] ✓
Not:     [:book-a :book-c :book-b] ✗ (same set, different order)
Not:     [:book-b :book-a :book-c] ✗ (same set, different order)
(defn join-itemsets
  [frequent-itemsets k]
  (let [k-1 (dec k)
        ;; Only process itemsets of the correct size
        valid-sets (filter #(= (count %) k-1) frequent-itemsets)
        ;; Group by prefix (first k-2 elements) for efficiency
        by-prefix (group-by #(vec (take (- k 2) %)) valid-sets)]
    (mapcat
     (fn [[prefix items]]
       (for [set1 items
             set2 items
             :let [last1 (last set1)
                   last2 (last set2)]
             ;; Only join if last2 > last1 (enforces canonical order)
             :when (and (not= last1 last2)
                        (pos? (compare last2 last1)))]
         (concat prefix [last1 last2])))
     by-prefix)))
The magic is in that (pos? (compare last2 last1)) check—it ensures items are always combined in alphabetical order, preventing duplicates from ever being generated.
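A quick REPL check with toy 2-item sets confirms that only the canonical ordering survives:
;; Three frequent 2-itemsets; only one canonical 3-item candidate emerges.
(join-itemsets [[:book-a :book-b] [:book-a :book-c] [:book-b :book-c]] 3)
;; => ((:book-a :book-b :book-c))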
Understanding the Metrics
Association rules are evaluated using three key metrics:
Support measures how frequently an itemset appears:
\[ \text{Support}(\{A, B\}) = \dfrac{\text{orders with A and B}}{\text{total orders}} \]
Confidence measures how often B appears when A is purchased:
\[ \text{Confidence}(A \rightarrow B) = \dfrac{\text{Support}(\{A, B\})}{\text{Support}(\{A\})} \]
Lift measures whether this happens more than random chance:
\[ "\text{Lift}(A \rightarrow B) = \dfrac{\text{Confidence}(A \rightarrow B)}{\text{Support}(\{B\})} \]
A lift greater than 1 indicates positive association—the items are purchased together more often than if they were independent. A lift of 2.3 means the combination is 2.3 times more likely than chance.
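For intuition, all three metrics fit in a few lines of Clojure over a seq of per-order book sets (a standalone sketch with names of my choosing, not the article’s mining code):
(require '[clojure.set :as set])

;; orders: a seq of sets of purchased books, e.g. [#{:a :b} #{:a} #{:a :b :c}]
(defn support [orders itemset]
  (/ (count (filter #(set/subset? itemset %) orders))
     (count orders)))

(defn confidence [orders antecedent consequent]
  (/ (support orders (set/union antecedent consequent))
     (support orders antecedent)))

(defn lift [orders antecedent consequent]
  (/ (confidence orders antecedent consequent)
     (support orders consequent)))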
The Results: Actionable Rules
Running the Apriori algorithm on our book sales data produced rules like these:
(kind/table
 (tc/head (:rules-grouped quick-formatted) 15)
 {:element/max-height "500px"})
antecedent | consequent | confidence | lift | support |
---|---|---|---|---|
K365-anglickych-cool-frazi-a-vyrazu | Zamilujte-se-do-anglictiny | 0.36633663 | 6.12428668 | 0.01962865 |
K365-anglickych-cool-frazi-a-vyrazu | Let-your-english-september | 0.79207921 | 1.20389535 | 0.04244032 |
Zamilujte-se-do-anglictiny | K365-anglickych-cool-frazi-a-vyrazu | 0.37373737 | 6.12428668 | 0.01962865 |
Zamilujte-se-do-anglictiny | Let-your-english-september | 0.47474747 | 0.72220179 | 0.02493369 |
Genialni-potraviny | Poridte-si-druhy-mozek | 0.22727273 | 3.39078054 | 0.01061008 |
Genialni-potraviny | Let-your-english-september | 0.21590909 | 0.33001936 | 0.01007958 |
Poridte-si-druhy-mozek | Genialni-potraviny | 0.18018018 | 3.39078054 | 0.01061008 |
Prezit | Sport-je-bolest | 0.10752688 | 2.25627980 | 0.01061008 |
Prezit | Stastnejsi | 0.15591398 | 1.75209041 | 0.01538462 |
Sport-je-bolest | Prezit | 0.26315789 | 2.25627980 | 0.01061008 |
Stastnejsi | Prezit | 0.21167883 | 1.75209041 | 0.01538462 |
Let-your-english-september | K365-anglickych-cool-frazi-a-vyrazu | 0.10362694 | 1.20389535 | 0.04244032 |
Let-your-english-september | Zamilujte-se-do-anglictiny | 0.06088083 | 0.72220179 | 0.02493369 |
Reading these rules is straightforward. For example, if a customer buys Yuval Noah Harari’s “Sapiens,” there’s a 72% chance they’ll also purchase “Nexus” (another Harari title), and this combination is nearly twice as likely as random chance (lift = 1.99).
Visualizing the Network
Perhaps the most compelling representation of these patterns is a network graph, where books are nodes and rules are edges:
This visualization reveals clusters of books that customers buy together, forming natural “reading paths” through our catalog. The thickness of edges represents lift (stronger associations), while node darkness indicates support (popularity).
(Remember that this data comes from a subset of the dataset, mined with particular parameters and thresholds.)
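Structurally, the graph is just a reshaping of the mined rules: each rule becomes a directed edge weighted by lift, and the node set is whatever books the rules mention. A hypothetical sketch of that reshaping (key names are my assumption, not the article’s plotting code):
;; Turn mined rules into graph data; assumes single-book antecedents.
(defn rules->graph [rules]
  {:edges (map (fn [{:keys [antecedent consequent lift]}]
                 {:source antecedent :target consequent
                  :weight lift}) ;; edge thickness ~ lift
               rules)
   :nodes (->> rules
               (mapcat (juxt :antecedent :consequent))
               distinct
               (map (fn [book] {:id book})))}) ;; shade by support later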
From Analysis to Production
The final piece was building a prediction function that could recommend books based on a customer’s purchase history:
(defn predict-next-book-choice
  "Predicts customer's next book based on their purchase history"
  [rules customer-books & {:keys [top-n min-confidence]}]
  ;; Business-oriented relevance score:
  ;; - 80% weight on confidence (practicality)
  ;; - 20% weight on lift (interestingness)
  ;; - Bonus for support (popularity signal)
  ...)
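The body is elided above; purely as illustration, here is one way those comments could translate into a score. The 80/20 weights come from the comments, but the lift cap and support weight are invented, and this sketch does not reproduce the article’s exact relevance numbers:
;; Hypothetical relevance score -- NOT the elided production body.
(defn relevance-score
  [{:keys [confidence lift support]}]
  (+ (* 0.8 confidence)              ;; practicality
     (* 0.2 (min 1.0 (/ lift 5.0))) ;; interestingness, crudely capped
     (* 0.5 support)))              ;; small popularity bonus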
For example, if a customer has purchased Přežít (Peter Attia’s Outlive) and Skrytý potenciál (Adam Grant’s Hidden Potential), the system recommends:
book | confidence-% | lift-factor | relevance | supporting-rules | example-antecedent |
---|---|---|---|---|---|
Jeste-to-promysli | 26.0% | 2.7× | 0.78 | 2 | Skryty-potencial |
Ultrazpracovani-lide | 26.8% | 2.0× | 0.77 | 2 | Skryty-potencial |
Genialni-potraviny | 25.7% | 1.7× | 0.70 | 1 | Prezit |
Stastnejsi | 20.4% | 2.5× | 0.69 | 1 | Prezit |
Stvoreni-k-pohybu | 18.2% | 2.5× | 0.67 | 1 | Prezit |
Bystrejsi-mozek | 19.6% | 1.9× | 0.67 | 1 | Prezit |
Ctyri-tisice-tydnu | 32.7% | 1.1× | 0.66 | 2 | Skryty-potencial |
Zazracna-imunita | 17.0% | 2.8× | 0.65 | 1 | Prezit |
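A table like the one above would come from a call shaped roughly like the following; the exact book identifiers and option values are my assumption:
;; Hypothetical invocation of the predictor above.
(predict-next-book-choice rules
                          #{"Prezit" "Skryty-potencial"}
                          :top-n 8
                          :min-confidence 0.1)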
These recommendations are now powering a new “Customers Also Bought” section on our website (still in “manual” mode for now), complementing our existing “Topically Similar” recommendations with data-driven insights.
Why This Matters for the Clojure Community
This project demonstrates several strengths of Clojure and the SciCloj ecosystem for real-world data science:
1. Readable transformations: Clojure’s threading macro (->) made complex data pipelines read like narratives. Each step tells a story, making the code understandable to technical and business stakeholders alike.
(-> data/orders
    (tc/group-by :zakaznik)                   ;; Per customer
    (tc/aggregate {:total-books count-books}) ;; Count their purchases
    (tc/order-by :total-books :desc)          ;; Best customers first
    (tc/head 5))                              ;; Top 5
2. Interactive development: Working in a REPL with Clay notebooks meant I could explore, visualize, and validate each step immediately. This tight feedback loop was essential for discovery.
3. A complete stack: From data manipulation (Tablecloth) to visualization (Tableplot) to presentation (Clay and Kindly), the SciCloj ecosystem provided everything I needed without leaving Clojure.
4. Production-ready code: The same code that powers my exploratory analysis can be run in production later, generating live recommendations for our website (I hope!).
The Impact
This project is still under construction and tangible business results have yet to be seen. But:
- We are already stopping less effective cross-selling campaigns and starting to target author communities instead
- Our website now features more data-driven “Customers Also Bought” recommendations
- We use these insights to optimize B2B offers for corporate clients
- Our social media campaigns are being better targeted based on purchase pattern clusters
Most importantly, I learned that you don’t need a data science team or expensive tools to extract value from your data. With curiosity, the right tools, and a supportive community (shout out to the SciCloj folks on Zulip!), even a beginner can turn raw data into actionable insights.