From Correlations to Recommendations

A Publisher’s Journey into Data-Driven Book Sales – exploring how association rule mining can transform business insights using the SciCloj stack.
Published: October 13, 2025


When you run an indie publishing house with over 160 titles and sell thousands of books each month, one question keeps coming back: Which books do our customers buy together? This seemingly simple question led me down a fascinating path from basic correlation analysis to building a more robust recommendation system using association rule mining—all with Clojure and the SciCloj ecosystem.

The Starting Point: Understanding Our Data

As a publisher at Jan Melvil Publishing, I had access to rich data: about 58,000 orders from 34,000 customers spanning several years. But the data wasn’t structured for analysis. Orders looked like this:

zakaznik        datum                produkt-produkty
customer-21289  2023-10-15 09:06:41  1× book-047
customer-23196  2023-08-17 23:58:28  1× book-000, 1× book-000, 1× book-000
customer-1090   2023-09-11 14:54:33  1× book-048, 1× book-048, 1× book-048, 1× book-014, 1× book-014, 1× book-014
customer-22600  2023-09-05 11:56:20  1× book-032
customer-21937  2023-10-16 12:34:43  1× book-047

(For clarity, many columns are omitted here; the rows were generated with (tc/random 5) from an anonymized dataset.)

Each row represented one order, with books listed as comma-separated values. There were many exceptions, inconsistencies, and format-based differences.¹ To analyze purchasing patterns, I needed to transform this into a format where each book became a binary feature: did a customer buy it (1) or not (0)? This transformation is called one-hot encoding.


1: We sell e-books and audiobooks too – readers can use our fantastic app.

The Transformation: Making Data Analysis-Ready

The transformation from raw orders to an analysis-ready format was crucial. With Tablecloth, the pipeline was straightforward (and could be simplified even further).

;; From customer orders with book lists...
(map
 (fn [customer-row]
   (let [customer-name (:zakaznik customer-row)
         books-bought-set (set (parse-books-from-list (:all-products customer-row)))
         one-hot-map (reduce (fn [acc book]
                               (assoc acc book (if (contains? books-bought-set book) 1 0)))
                             {}
                             all-titles)]
     (merge {:zakaznik customer-name}
            one-hot-map)))
 (tc/rows customer+orders :as-maps))
;; ...to binary matrix where each column is a book

After transformation, each customer became a row, and each book a column with 1 or 0:

zakaznik        book-006  book-004  book-007  book-003  book-005  book-000  book-008  book-001
customer-1642   0         0         0         1         0         0         0         0
customer-8113   0         0         0         0         0         0         0         0
customer-16443  0         0         0         0         0         0         0         0
customer-21874  0         0         0         0         0         0         0         0

Now I could start asking interesting questions about co-purchase patterns.

First Insights: The Correlation Matrix

My first instinct was to calculate correlations between all books. A correlation tells you how often two books appear together compared to what you’d expect by chance. When I visualized this as a heatmap, with books ordered chronologically, something fascinating emerged:

(kind/plotly
 {:data [{:type "heatmap"
          :z (tc/columns data/corr-matrix-precalculated)
          :x (tc/column-names data/corr-matrix-precalculated)
          :y (tc/column-names data/corr-matrix-precalculated)
          :colorscale "RdBu"
          :zmid 0}]
  :layout {:title "Book Purchase Correlations (Chronological Order)"
           :xaxis {:tickangle 45}
           :margin {:l 200 :b 50}
           :width 800 :height 600
           :shapes [{:type "rect" :x0 -0.5 :y0 -0.5 :x1 80 :y1 80
                     :line {:color "yellow" :width 3}}]
           :annotations [{:x 40 :y 80 :text "Recently published books show <br>much stronger co-purchase patterns"
                          :showarrow true :arrowhead 2 :arrowsize 1 :arrowwidth 2 :arrowcolor "yellow"
                          :ax 60 :ay -60
                          :font {:size 12 :color "black"}
                          :bgcolor "rgba(255, 255, 200, 0.9)" :bordercolor "yellow" :borderwidth 2}]}})

The bright red square in the upper-left corner revealed that recently published books have much stronger co-purchase patterns than older titles. This made intuitive sense—customers discovering our catalog tend to buy multiple new releases together.
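Under the hood, each cell of that heatmap is just a correlation between two one-hot columns. Here's a hand-rolled Pearson sketch for intuition (in the real pipeline a stats library does this work; the function name is mine):

```clojure
;; Pearson correlation between two one-hot columns — an illustrative
;; sketch, not the article's actual pipeline code.
(defn pearson [xs ys]
  (let [n   (count xs)
        mx  (/ (reduce + xs) n)
        my  (/ (reduce + ys) n)
        dx  (map #(- % mx) xs)
        dy  (map #(- % my) ys)
        cov (reduce + (map * dx dy))
        sx  (Math/sqrt (reduce + (map #(* % %) dx)))
        sy  (Math/sqrt (reduce + (map #(* % %) dy)))]
    (/ cov (* sx sy))))

;; Two books bought by exactly the same customers correlate perfectly:
(pearson [1 1 0 0] [1 1 0 0])
;; => 1.0
```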

A Surprising Discovery: Czech vs. Foreign Authors

When I analyzed each book’s correlation profile—calculating the mean correlation between each book and all others—another pattern emerged:

Foreign bestsellers (marked in orange) showed consistently higher correlations with other books. They had broad appeal and were purchased alongside many different titles. Czech authors (in blue), however, showed lower correlations, suggesting their readers were more focused. Many customers would buy just one Czech title, often using it as a “gateway” into our catalog (a pattern strongly reinforced by Czech authors’ local campaigns), while foreign bestsellers were part of larger, more diverse purchases.
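The per-book profile itself is a one-liner once the matrix exists. A minimal sketch, assuming (purely for illustration) the matrix is held as a map of book → {book → correlation}:

```clojure
;; Mean correlation of one book against all the others (excluding itself).
;; The map-of-maps representation here is a hypothetical choice for the
;; sketch, not the article's actual data structure.
(defn mean-correlation [corr-matrix book]
  (let [others (dissoc (get corr-matrix book) book)]
    (/ (reduce + (vals others))
       (count others))))

(mean-correlation {:book-a {:book-a 1.0 :book-b 0.5 :book-c 0.25}} :book-a)
;; => 0.375
```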

This insight immediately changed our marketing approach. We stopped using generic cross-sell recommendations for Czech authors, focused instead on building author-specific communities, and began cooperating on cross-selling directly with local Czech authors, who are easy for us to reach.

The Limitation: Correlations Weren’t Enough

Correlations told me what books appeared together, but they couldn’t answer crucial business questions:

  • Direction: Does buying Book A lead to buying Book B, or vice versa?
  • Confidence: If a customer buys Book A, what’s the probability they’ll buy Book B?
  • Practical significance: Is this pattern strong enough to base recommendations on?

I needed something more powerful: association rules.

Enter the Apriori Algorithm

Association rule mining, powered by the Apriori algorithm, discovers patterns like “Customers who bought Book A tend to buy Book B with 65% confidence and 2.3× lift.” These rules are directional and measurable—perfect for building a recommendation system.

The Apriori algorithm’s elegance lies in its core insight: “Any subset of a frequent itemset must also be frequent.” If the combination {Beer, Chips, Salsa} appears often, then {Beer, Chips} must appear at least as often. This simple observation allows the algorithm to prune billions of potential combinations efficiently.
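One level of that search fits in a few lines of Clojure. This is a toy helper for illustration (the name and shapes are mine, not the article's implementation): count each candidate's support over the transactions and keep only the frequent ones — anything infrequent is never extended to a larger itemset.

```clojure
(require '[clojure.set :as set])

;; One Apriori pass: keep only the candidate itemsets whose support
;; meets the threshold. Transactions and candidates are seqs of sets.
(defn frequent-itemsets [transactions candidates min-support]
  (let [n (count transactions)]
    (->> candidates
         (map (fn [itemset]
                [itemset
                 (/ (count (filter #(set/subset? itemset %) transactions)) n)]))
         (filter (fn [[_ support]] (>= support min-support)))
         (into {}))))

(frequent-itemsets [#{:beer :chips} #{:beer :chips :salsa} #{:beer}]
                   [#{:beer} #{:chips} #{:beer :chips}]
                   2/3)
;; => {#{:beer} 1, #{:chips} 2/3, #{:beer :chips} 2/3}
```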

The Clever Part: Avoiding Duplicates

One of the most elegant aspects of the Apriori implementation is how it generates larger itemsets from smaller ones without creating duplicates. Consider generating 3-item sets from 2-item sets:

We want: [:book-a :book-b :book-c] ✓  
Not:     [:book-a :book-c :book-b] ✗ (same set, different order)  
         [:book-b :book-a :book-c] ✗ (same set, different order)  
(defn join-itemsets
  [frequent-itemsets k]
  (let [k-1 (dec k)
        ;; Only process itemsets of the correct size
        valid-sets (filter #(= (count %) k-1) frequent-itemsets)
        ;; Group by prefix (first k-2 elements) for efficiency
        by-prefix (group-by #(vec (take (- k 2) %)) valid-sets)]
    (mapcat
     (fn [[prefix items]]
       (for [set1 items
             set2 items
             :let [last1 (last set1)
                   last2 (last set2)]
             ;; Only join if last2 > last1 (enforces canonical order)
             :when (and (not= last1 last2)
                        (pos? (compare last2 last1)))]
         (concat prefix [last1 last2])))
     by-prefix)))

The magic is in that (pos? (compare last2 last1)) check—it ensures items are always combined in alphabetical order, preventing duplicates from ever being generated.
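A quick REPL check makes this concrete (the join-itemsets definition from above is repeated here so the snippet runs on its own): three frequent 2-item sets yield exactly one canonical 3-item candidate, with no reordered duplicates.

```clojure
;; Same definition as shown above, repeated so this snippet is standalone.
(defn join-itemsets
  [frequent-itemsets k]
  (let [k-1 (dec k)
        valid-sets (filter #(= (count %) k-1) frequent-itemsets)
        by-prefix (group-by #(vec (take (- k 2) %)) valid-sets)]
    (mapcat
     (fn [[prefix items]]
       (for [set1 items
             set2 items
             :let [last1 (last set1)
                   last2 (last set2)]
             :when (and (not= last1 last2)
                        (pos? (compare last2 last1)))]
         (concat prefix [last1 last2])))
     by-prefix)))

(join-itemsets [[:book-a :book-b] [:book-a :book-c] [:book-b :book-c]] 3)
;; => ((:book-a :book-b :book-c))
```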

Understanding the Metrics

Association rules are evaluated using three key metrics:

Support measures how frequently an itemset appears:

\[ \text{Support}(\{A, B\}) = \dfrac{\text{orders with A and B}}{\text{total orders}} \]

Confidence measures how often B appears when A is purchased:

\[ \text{Confidence}(A \rightarrow B) = \dfrac{\text{Support}(\{A, B\})}{\text{Support}(\{A\})} \]

Lift measures whether this happens more than random chance:

\[ \text{Lift}(A \rightarrow B) = \dfrac{\text{Confidence}(A \rightarrow B)}{\text{Support}(\{B\})} \]

A lift greater than 1 indicates positive association—the items are purchased together more often than if they were independent. A lift of 2.3 means the combination is 2.3 times more likely than chance.
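Translated directly into Clojure over a seq of order sets, the three metrics look like this (a hedged sketch — the helper names are mine, not the production code's):

```clojure
(require '[clojure.set :as set])

;; Fraction of orders containing every item in the itemset.
(defn support [transactions itemset]
  (/ (count (filter #(set/subset? itemset %) transactions))
     (count transactions)))

;; P(consequent | antecedent), expressed via support.
(defn confidence [transactions antecedent consequent]
  (/ (support transactions (set/union antecedent consequent))
     (support transactions antecedent)))

;; How much more often than chance the pair occurs together.
(defn lift [transactions antecedent consequent]
  (/ (confidence transactions antecedent consequent)
     (support transactions consequent)))

;; 3 of 4 orders contain book-a; 2 of those also contain book-b:
(def orders [#{:book-a :book-b} #{:book-a :book-b} #{:book-a :book-c} #{:book-c}])
(support orders #{:book-a :book-b})        ;; => 1/2
(confidence orders #{:book-a} #{:book-b})  ;; => 2/3
(lift orders #{:book-a} #{:book-b})        ;; => 4/3
```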

The Results: Actionable Rules

Running the Apriori algorithm on our book sales data produced rules like these:

(kind/table
 (tc/head (:rules-grouped quick-formatted) 15)
 {:element/max-height "500px"})
antecedent                           consequent                           confidence  lift        support
K365-anglickych-cool-frazi-a-vyrazu  Zamilujte-se-do-anglictiny           0.36633663  6.12428668  0.01962865
K365-anglickych-cool-frazi-a-vyrazu  Let-your-english-september           0.79207921  1.20389535  0.04244032
Zamilujte-se-do-anglictiny           K365-anglickych-cool-frazi-a-vyrazu  0.37373737  6.12428668  0.01962865
Zamilujte-se-do-anglictiny           Let-your-english-september           0.47474747  0.72220179  0.02493369
Genialni-potraviny                   Poridte-si-druhy-mozek               0.22727273  3.39078054  0.01061008
Genialni-potraviny                   Let-your-english-september           0.21590909  0.33001936  0.01007958
Poridte-si-druhy-mozek               Genialni-potraviny                   0.18018018  3.39078054  0.01061008
Prezit                               Sport-je-bolest                      0.10752688  2.25627980  0.01061008
Prezit                               Stastnejsi                           0.15591398  1.75209041  0.01538462
Sport-je-bolest                      Prezit                               0.26315789  2.25627980  0.01061008
Stastnejsi                           Prezit                               0.21167883  1.75209041  0.01538462
Let-your-english-september           K365-anglickych-cool-frazi-a-vyrazu  0.10362694  1.20389535  0.04244032
Let-your-english-september           Zamilujte-se-do-anglictiny           0.06088083  0.72220179  0.02493369

Reading these rules is straightforward. For example, if a customer buys Yuval Noah Harari’s “Sapiens,” there’s a 72% chance they’ll also purchase “Nexus” (another Harari title), and this combination is nearly twice as likely as random chance (lift = 1.99).

Visualizing the Network

Perhaps the most compelling representation of these patterns is a network graph, where books are nodes and rules are edges:

This visualization reveals clusters of books that customers buy together, forming natural “reading paths” through our catalog. The thickness of edges represents lift (stronger associations), while node darkness indicates support (popularity).

(Remember that this data comes from a subset of the dataset, with particular parameters and thresholds.)

From Analysis to Production

The final piece was building a prediction function that could recommend books based on a customer’s purchase history:

(defn predict-next-book-choice
  "Predicts customer's next book based on their purchase history"
  [rules customer-books & {:keys [top-n min-confidence]}]
  ;; Business-oriented relevance score:
  ;; - 80% weight on confidence (practicality)
  ;; - 20% weight on lift (interestingness)
  ;; - Bonus for support (popularity signal)
  ...)
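The docstring comments describe the score's shape; one possible reading of those weights (my illustrative guess, not the exact production formula) is:

```clojure
;; Hypothetical relevance score mirroring the weights described above:
;; 80% confidence, 20% capped lift, plus a small support bonus.
;; This is a sketch of one plausible formula, not the production code.
(defn relevance-score [{:keys [confidence lift support]}]
  (+ (* 0.8 confidence)
     (* 0.2 (min 1.0 (/ lift 5.0)))  ;; cap lift so outliers don't dominate
     (* 0.1 support)))               ;; popularity bonus

(relevance-score {:confidence 0.5 :lift 2.0 :support 0.1})
```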

For example, if a customer has purchased Přežít (Peter Attia’s Outlive) and Skrytý potenciál (Adam Grant’s Hidden Potential), the system recommends:

book                  confidence-%  lift-factor  relevance  supporting-rules  example-antecedent
Jeste-to-promysli     26.0%         2.7×         0.78       2                 Skryty-potencial
Ultrazpracovani-lide  26.8%         2.0×         0.77       2                 Skryty-potencial
Genialni-potraviny    25.7%         1.7×         0.70       1                 Prezit
Stastnejsi            20.4%         2.5×         0.69       1                 Prezit
Stvoreni-k-pohybu     18.2%         2.5×         0.67       1                 Prezit
Bystrejsi-mozek       19.6%         1.9×         0.67       1                 Prezit
Ctyri-tisice-tydnu    32.7%         1.1×         0.66       2                 Skryty-potencial
Zazracna-imunita      17.0%         2.8×         0.65       1                 Prezit

These recommendations are now powering a new “Customers Also Bought” section on our website (still in “manual” mode for now), complementing our existing “Topically Similar” recommendations with data-driven insights.

Why This Matters for the Clojure Community

This project demonstrates several strengths of Clojure and the SciCloj ecosystem for real-world data science:

1. Readable transformations: Clojure’s threading macro (->) made complex data pipelines read like narratives. Each step tells a story, making the code understandable to both technical and business stakeholders.

(-> data/orders
    (tc/group-by :zakaznik)                    ;; Per customer
    (tc/aggregate {:total-books count-books})  ;; Count their purchases
    (tc/order-by :total-books :desc)           ;; Best customers first
    (tc/head 5))                               ;; Top 5

2. Interactive development: Working in a REPL with Clay notebooks meant I could explore, visualize, and validate each step immediately. This tight feedback loop was essential for discovery.

3. A complete stack: From data manipulation (Tablecloth) to visualization (Tableplot) to presentation (Clay and Kindly), the SciCloj ecosystem provided everything I needed without leaving Clojure.

4. Production-ready code: The same code that powers my exploratory analysis can be run in production later, generating live recommendations for our website (I hope!).

The Impact

This project is still a work in progress, and tangible business results are yet to come. But:

  • We are already winding down less effective cross-selling campaigns and starting targeted author communities
  • Our website now features more data-driven “Customers Also Bought” recommendations
  • We use these insights to optimize B2B offers for corporate clients
  • Our social media campaigns are being better targeted based on purchase pattern clusters

Most importantly, I learned that you don’t need a data science team or expensive tools to extract value from your data. With curiosity, the right tools, and a supportive community (shout out to the SciCloj folks on Zulip!), even a beginner can turn raw data into actionable insights.


About the Author

Tomáš Baránek is a publisher at Jan Melvil Publishing and co-founder of Servantes, developing software for publishers worldwide. He’s a computer science graduate and a Clojure enthusiast exploring data science, learning by doing on real publishing challenges. You can find him on Bluesky or read his blog.

Resources:

  • Author: https://barys.me
  • Full presentation code: to be published
  • SciCloj community: scicloj.github.io


This article is based on a presentation at Macroexpand conference, October 2025.

source: src/data_analysis/book_sales_analysis/about_apriori.clj
