Market Basket Analysis
Discovering Customer Purchase Patterns
Why Market Basket Analysis?
Correlations told us which books are bought together, but not why or in what order.
Market Basket Analysis reveals directional rules:

- “Customers who bought Book A tend to also buy Book B”
- Confidence: How often does this happen?
- Lift: Is this more than random chance?
This lets us make actionable recommendations for our website.
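To make these two metrics concrete before diving into the algorithm, here is a minimal self-contained sketch (toy data; the orders and book keywords are hypothetical, not from our dataset):

```clojure
;; Toy data: six orders, each a set of book keywords.
(def orders [#{:a :b} #{:a :b :c} #{:a :c} #{:b :c} #{:a :b} #{:c}])

(defn support
  "Fraction of orders containing every item in `items`."
  [items]
  (/ (count (filter #(every? % items) orders))
     (count orders)))

;; Rule :a -> :b
(def confidence (/ (support [:a :b]) (support [:a]))) ;; P(b | a) => 3/4
(def lift (/ confidence (support [:b])))              ;; vs. chance => 9/8
```

A lift above 1 (here 9/8) means the pairing happens more often than independent purchases would predict.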
The Apriori Algorithm
The Apriori algorithm is our workhorse. It finds “frequent itemsets” - combinations of books that appear together often enough to be interesting.
The key insight: “Any subset of a frequent itemset must also be frequent.”
This lets us prune billions of possible combinations efficiently.
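A hedged illustration of how that insight becomes pruning (the helper below is illustrative, not the notebook's actual code): a candidate k-itemset can be discarded immediately if any of its (k-1)-subsets failed to be frequent.

```clojure
;; Illustrative helper: a candidate k-itemset survives only if every
;; (k-1)-subset is already in the frequent set.
(defn all-subsets-frequent? [frequent candidate]
  (every? (fn [i]
            (contains? frequent
                       ;; candidate with the i-th element removed
                       (vec (concat (take i candidate)
                                    (drop (inc i) candidate)))))
          (range (count candidate))))

(def frequent-2 #{[:a :b] [:a :c] [:b :c] [:a :d]})

(all-subsets-frequent? frequent-2 [:a :b :c]) ;; => true
(all-subsets-frequent? frequent-2 [:a :b :d]) ;; => false ([:b :d] is not frequent)
```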
Step 1: Join Itemsets Efficiently
The challenge: Generate all k-item combinations from (k-1)-item combinations without creating duplicates.
Example: from `[[:A :B] [:A :C]]`, generate `[:A :B :C]` but not `[:A :C :B]` (they’re the same set!).
The solution: Enforce canonical ordering. Only generate combinations where items are in alphabetical order. The `(pos? (compare last2 last1))` check ensures we only create `[:A :B :C]` and skip redundant permutations.
This isn’t about finding duplicates after creation - it prevents them from ever being created and later evaluated for support.
(defn join-itemsets
  "Itemset joining for Apriori algorithm.
  Generates k-itemsets from (k-1)-itemsets efficiently by:
  1. Grouping by common prefix (first k-2 elements)
  2. Joining pairs that share this prefix
  3. Enforcing ordering to prevent duplicate sets
  Example: [:A :B] + [:A :C] → [:A :B :C] (but not [:A :C :B])"
  [frequent-itemsets k]
  (let [k-1 (dec k)
        ;; Only process itemsets of the correct size
        valid-sets (filter #(= (count %) k-1) frequent-itemsets)
        ;; Group by prefix (first k-2 elements) for efficiency
        by-prefix (group-by #(vec (take (- k 2) %)) valid-sets)]
    (mapcat
     (fn [[prefix items]]
       (for [set1 items
             set2 items
             :let [last1 (last set1)
                   last2 (last set2)]
             ;; Only join if last2 > last1 (enforces canonical order)
             :when (and (not= last1 last2)
                        (pos? (compare last2 last1)))]
         (concat prefix [last1 last2])))
     by-prefix)))
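The `:when` guard is easiest to see in isolation, as a quick REPL check (toy keywords):

```clojure
;; Keep only pairs in canonical (ascending) order, so each
;; 2-combination is generated exactly once.
(def candidate-pairs
  (for [a [:b :c :d]
        b [:b :c :d]
        :when (pos? (compare b a))]
    [a b]))
;; => ([:b :c] [:b :d] [:c :d])
```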
Step 2: Generate Frequent Itemsets
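The Step 2 code is not reproduced here, so as a rough, self-contained sketch of the idea (simplified; not the notebook's actual implementation): start from frequent single items, then repeatedly join and filter by minimum support until no level survives.

```clojure
;; Simplified sketch: grow frequent itemsets level by level,
;; filtering each level by minimum support.
(defn frequent-itemsets [transactions min-support]
  (let [n (count transactions)
        support (fn [items]
                  (/ (count (filter #(every? % items) transactions)) n))
        frequent (fn [candidates]
                   (filterv #(>= (support %) min-support) candidates))
        ;; prefix-join with canonical ordering, as in join-itemsets
        grow (fn [level]
               (for [s1 level, s2 level
                     :when (and (= (butlast s1) (butlast s2))
                                (pos? (compare (last s2) (last s1))))]
                 (conj (vec s1) (last s2))))
        singletons (->> transactions (apply concat) distinct sort (map vector))]
    (loop [level (frequent singletons), acc level]
      (let [next-level (frequent (grow level))]
        (if (empty? next-level)
          acc
          (recur next-level (into acc next-level)))))))

(frequent-itemsets [#{:a :b} #{:a :b :c} #{:a :c} #{:b :c} #{:a :b}] 3/5)
;; => [[:a] [:b] [:c] [:a :b]]
```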
Step 3: Generate Association Rules
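As with Step 2, here is only a simplified sketch of rule generation (single-item consequents; toy data; not the notebook's code): split each frequent itemset into antecedent → consequent and keep the rule when its confidence clears the threshold.

```clojure
(def orders [#{:a :b} #{:a :b :c} #{:a :c} #{:b :c} #{:a :b}])

(defn support [items]
  (/ (count (filter #(every? % items) orders)) (count orders)))

(defn rules-from-itemset [itemset min-confidence]
  (for [c itemset
        :let [antecedent (vec (remove #{c} itemset))
              confidence (/ (support itemset) (support antecedent))
              lift (/ confidence (support [c]))]
        :when (and (seq antecedent) (>= confidence min-confidence))]
    {:antecedent antecedent :consequent [c]
     :confidence confidence :lift lift}))

(rules-from-itemset [:a :b] 0.7)
;; => two rules, :b -> :a and :a -> :b, each with confidence 3/4
```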
Stage 1: Generate Analysis (The 15-Minute Step)
Stage 2: Format Results for Display
(defn format-analysis-results
  "Formats analysis results into presentation-ready tables.
  Takes the output from `generate-market-basket-analysis` and creates:
  - Nice table with grouped antecedents
  - Summary statistics
  - Top rules for quick viewing
  This is instant - you can call it repeatedly with different options."
  [analysis-results & {:keys [top-n] :or {top-n 10}}]
  (let [rules (:rules analysis-results)
        meta-info (:meta analysis-results)
        rules-nice (-> (tc/dataset rules)
                       (tc/map-columns :consequent :consequent #(into [] %))
                       (tc/map-columns :antestr :antecedent
                                       (fn [ant] (str/join " + " (map (comp str/capitalize name) ant))))
                       (tc/map-columns :consequents :consequent
                                       (fn [con] (str/join " + " (map (comp str/capitalize name) con))))
                       (tc/map-columns :confidence-display :confidence
                                       #(String/format "%.2f %%" (to-array [(* 100 %)])))
                       (tc/map-columns :lift-display :lift
                                       #(str (String/format "%.1f " (to-array [%]))
                                             (when (< % 1) "❗") "×"))
                       (tc/select-columns [:antestr :consequents :confidence-display :lift-display
                                           :support :confidence :lift])
                       (tc/rename-columns {:confidence-display :confidence
                                           :lift-display :lift}))]
    {:rules-table rules-nice
     :rules-grouped (-> (tc/group-by rules-nice :antestr)
                        (tc/drop-columns :antestr))
     :top-rules (take top-n rules)
     :summary {:itemsets (:itemsets-count meta-info)
               :rules (:rules-count meta-info)
               :min-support (:min-support meta-info)
               :min-confidence (:min-confidence meta-info)}
     :raw-rules rules}))
Stage 3: Create Visualizations
(defn create-graphviz-visualization
  "Creates GraphViz visualization of association rules network.
  This creates the network diagram that reveals hidden patterns in purchase behavior.
  Parameters:
  - `analysis-results` – output from generate-market-basket-analysis
  - `limit` – number of rules to include (default 200, try 20-50 for clearer view)
  - `scale` – graph scale factor (default 0.1, increase for bigger output)
  The visualization shows:
  - **Nodes** = book combinations (color intensity = support)
  - **Edges** = rules (thickness = lift strength, labels = confidence %)"
  [analysis-results & {:keys [limit scale] :or {limit 200 scale 0.1}}]
  (let [dataset (:onehot-dataset analysis-results)
        rules (:rules analysis-results)]
    (if (empty? rules)
      "digraph G {\n node [style=filled, shape=ellipse];\n \"No rules found\" [fillcolor=\"#f0f0f0\"];\n}"
      (let [top-rules (->> rules
                           (sort-by :lift >)
                           (take limit)
                           (map #(update % :consequent vec))
                           (mapv #(update % :antecedent (fn [a] (sort-by name a))))
                           (mapv #(update % :consequent (fn [c] (sort-by name c)))))
            nodes-set (into []
                            (comp
                             (mapcat (juxt :antecedent :consequent))
                             (remove nil?)
                             (distinct))
                            top-rules)
            supports (reduce (fn [acc items]
                               (assoc acc (vec items) (helpers/calculate-support dataset items)))
                             {}
                             nodes-set)
            support-values (vals supports)
            min-support (if (seq support-values) (apply min support-values) 0.0)
            max-support (if (seq support-values) (apply max support-values) 1.0)
            min-lift (apply min (map :lift top-rules))
            max-lift (apply max (map :lift top-rules))
            node-defs (map #(str " \"" %
                                 "\" [label=\"" (str/join " + \\n" (map (comp str/capitalize name) %)) "\""
                                 ", fillcolor=\"" (helpers/color-hex (helpers/calculate-support dataset %) min-support max-support) "\""
                                 ", color=\"" (helpers/color-hex (helpers/calculate-support dataset %) min-support max-support) "\"" ;; Match border to fill color
                                 ", width=" (+ 0.8 (* 0.12 (count %)))
                                 ", height=0.4"
                                 ", margin=\"0.15,0.1\""
                                 ", pin=false"
                                 "];")
                           nodes-set)
            edge-defs (map #(let [normalized-lift (/ (- (:lift %) min-lift)
                                                     (max 0.1 (- max-lift min-lift)))
                                  edge-len (+ 6.0 (* 12.0 normalized-lift))
                                  edge-weight (max 0.05 (- 0.5 (* 0.3 normalized-lift)))]
                              (str " \"" (:antecedent %) "\" -> \"" (:consequent %) "\""
                                   " [label=\"" (format "%.0f%%" (* 100 (:confidence %))) "\\n"
                                   (format "%.1f×" (:lift %)) "\""
                                   ", len=" edge-len
                                   ", weight=" edge-weight
                                   ", fontsize=8"
                                   ", fontname=\"Helvetica\""
                                   ", penwidth=" (max 0.5 (* 1.5 normalized-lift))
                                   ", minlen=3"
                                   "];"))
                           top-rules)
            graphviz-def (str "digraph AssociationRules {\n"
                              " layout=neato;\n"
                              " bgcolor=transparent;\n"
                              " overlap=false;\n"
                              " sep=\"+100,100\";\n"
                              " splines=true;\n"
                              " concentrate=false;\n"
                              " scale=" scale ";\n"
                              " pad=\"1.0,1.0\";\n"
                              " size=\"20,16\";\n"
                              " ratio=auto;\n"
                              " start=random;\n"
                              " epsilon=0.0001;\n"
                              " maxiter=5000;\n"
                              " node [shape=rectangle, style=filled, fontname=\"Helvetica\", fontsize=9];\n"
                              " edge [color=\"#c1556b\", arrowhead=vee, arrowsize=0.5, splines=curved];\n"
                              "\n" (str/join "\n" node-defs) "\n"
                              (str/join "\n" edge-defs) "\n"
                              "}\n")]
        graphviz-def))))
Prediction Function
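A sketch of what a prediction step can look like (hypothetical; the notebook's actual function isn't shown, and the rule maps and book keywords below are made up): recommend the consequents of rules whose antecedent is fully contained in the customer's basket, ranked by lift.

```clojure
(defn recommend
  "Top-n recommendations for `basket`, from rules whose antecedent
  the customer has already bought, ranked by lift."
  [rules basket n]
  (->> rules
       (filter #(every? (set basket) (:antecedent %)))   ;; rule applies
       (remove #(some (set basket) (:consequent %)))     ;; already owned
       (sort-by :lift >)
       (mapcat :consequent)
       distinct
       (take n)))

(def demo-rules
  [{:antecedent [:dune] :consequent [:foundation] :lift 2.1}
   {:antecedent [:dune] :consequent [:hyperion] :lift 3.4}
   {:antecedent [:dune :hyperion] :consequent [:ubik] :lift 1.8}])

(recommend demo-rules [:dune] 3)
;; => (:hyperion :foundation)
```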
# Presentation: Live Examples
Quick Demo (6,000 orders - runs in ~1 minute)
Full Analysis (All orders - runs in ~15 minutes)
Show Results
Summary Statistics
Top Rules (Grouped by Antecedent)
Network Visualization
Start with small, clear view
Then show full complexity
Prediction Demo
🎯 LIVE DEMO: Show how the website should make recommendations