Market Basket Analysis

Discovering Customer Purchase Patterns

Why Market Basket Analysis?

Correlations told us which books are bought together, but not why or in what order.

Market Basket Analysis reveals directional rules:

“Customers who bought Book A tend to also buy Book B”
Confidence: How often does this happen?
Lift: Is this more than random chance?

This lets us make actionable recommendations for our website.

The Apriori Algorithm

The Apriori algorithm is our workhorse. It finds “frequent itemsets” - combinations of books that appear together often enough to be interesting.

The key insight: “Any subset of a frequent itemset must also be frequent.”

This lets us prune billions of possible combinations efficiently.

Step 1: Join Itemsets Efficiently

The challenge: Generate all k-item combinations from (k-1)-item combinations without creating duplicates.

Example: From [[:A :B], [:A :C]] generate [:A :B :C] but not [:A :C :B] (they’re the same set!)

The solution: Enforce canonical ordering. Only generate combinations where items are in alphabetical order. The (pos? (compare last2 last1)) check ensures we only create [:A :B :C] and skip redundant permutations.

This isn’t about finding duplicates after creation - it prevents them from ever being created and later evaluated for support.

(defn join-itemsets
  "Itemset joining for Apriori algorithm.

   Generates k-itemsets from (k-1)-itemsets efficiently by:
   1. Grouping by common prefix (first k-2 elements)
   2. Joining pairs that share this prefix
   3. Enforcing ordering to prevent duplicate sets
   
   Example: [:A :B] + [:A :C] → [:A :B :C] (but not [:A :C :B])"
  [frequent-itemsets k]
  (let [k-1 (dec k)
        ;; Only process itemsets of the correct size
        valid-sets (filter #(= (count %) k-1) frequent-itemsets)
        ;; Group by prefix (first k-2 elements) for efficiency
        by-prefix (group-by #(vec (take (- k 2) %)) valid-sets)]
    (mapcat
     (fn [[prefix items]]
       (for [set1 items
             set2 items
             :let [last1 (last set1)
                   last2 (last set2)]
             ;; Only join if last2 > last1 (enforces canonical order)
             :when (and (not= last1 last2)
                        (pos? (compare last2 last1)))]
         (concat prefix [last1 last2])))
     by-prefix)))

Step 2: Generate Frequent Itemsets

Step 3: Generate Association Rules

Stage 1: Generate Analysis (The 15-Minute Step)

Stage 2: Format Results for Display

(defn format-analysis-results
  "Formats analysis results into presentation-ready tables.
   
   Takes the output from `generate-market-basket-analysis` and creates:

   - Nice table with grouped antecedents
   - Summary statistics
   - Top rules for quick viewing
   
   This is instant - you can call it repeatedly with different options."
  [analysis-results & {:keys [top-n] :or {top-n 10}}]
  (let [rules (:rules analysis-results)
        meta-info (:meta analysis-results)
        rules-nice (-> (tc/dataset rules)
                       (tc/map-columns :consequent :consequent #(into [] %))
                       (tc/map-columns :antestr :antecedent 
                                      (fn [ant] (str/join " + " (map (comp str/capitalize name) ant))))
                       (tc/map-columns :consequents :consequent 
                                      (fn [con] (str/join " + " (map (comp str/capitalize name) con))))
                       (tc/map-columns :confidence-display :confidence 
                                      #(String/format "%.2f %%" (to-array [(* 100 %)])))
                       (tc/map-columns :lift-display :lift 
                                      #(str (String/format "%.1f " (to-array [%])) 
                                           (when (< % 1) "❗") "×"))
                       (tc/select-columns [:antestr :consequents :confidence-display :lift-display :support :confidence :lift])
                       (tc/rename-columns {:confidence-display :confidence 
                                          :lift-display :lift}))]
    
    {:rules-table rules-nice
     :rules-grouped (-> (tc/group-by rules-nice :antestr)
                        (tc/drop-columns :antestr))
     :top-rules (take top-n rules)
     :summary {:itemsets (:itemsets-count meta-info)
               :rules (:rules-count meta-info)
               :min-support (:min-support meta-info)
               :min-confidence (:min-confidence meta-info)}
     :raw-rules rules}))

Stage 3: Create Visualizations

(defn create-graphviz-visualization
  "Creates GraphViz visualization of association rules network.
   
   This creates the network diagram that reveals hidden patterns in purchase behavior.
   
   Parameters:

   - `analysis-results` – output from generate-market-basket-analysis
   - `limit` – number of rules to include (default 200, try 20-50 for clearer view)
   - `scale` – graph scale factor (default 0.1, increase for bigger output)
   
   The visualization shows:

   - **Nodes** = book combinations (color intensity = support)
   - **Edges** = rules (thickness = lift strength, labels = confidence %)"
  [analysis-results & {:keys [limit scale] :or {limit 200 scale 0.1}}]
  (let [dataset (:onehot-dataset analysis-results)
        rules (:rules analysis-results)]

    (if (empty? rules)
      "digraph G {\n  node [style=filled, shape=ellipse];\n  \"No rules found\" [fillcolor=\"#f0f0f0\"];\n}"
      (let [top-rules (->> rules
                           (sort-by :lift >)
                           (take limit)
                           (map #(update % :consequent vec))
                           (mapv #(update % :antecedent (fn [a] (sort-by name a))))
                           (mapv #(update % :consequent (fn [c] (sort-by name c)))))

            nodes-set (into []
                            (comp
                             (mapcat (juxt :antecedent :consequent))
                             (remove nil?)
                             (distinct))
                            top-rules)

            supports (reduce (fn [acc items]
                               (assoc acc (vec items) (helpers/calculate-support dataset items)))
                             {}
                             nodes-set)
            support-values (vals supports)
            min-support (if (seq support-values) (apply min support-values) 0.0)
            max-support (if (seq support-values) (apply max support-values) 1.0)
            min-lift (apply min (map :lift top-rules))
            max-lift (apply max (map :lift top-rules))

            node-defs (map #(str "  \""
                                 %
                                 "\" [label=\"" (str/join " + \\n" (map (comp str/capitalize name) %)) "\""
                                 ", fillcolor=\"" (helpers/color-hex (helpers/calculate-support dataset %) min-support max-support) "\""
                                 ", color=\"" (helpers/color-hex (helpers/calculate-support dataset %) min-support max-support) "\""  ;; Match border to fill color
                                 ", width=" (+ 0.8 (* 0.12 (count %)))
                                 ", height=0.4"
                                 ", margin=\"0.15,0.1\""
                                 ", pin=false"
                                 "];")
                           nodes-set)

            edge-defs (map #(let [normalized-lift (/ (- (:lift %) min-lift)
                                                     (max 0.1 (- max-lift min-lift)))
                                  edge-len (+ 6.0 (* 12.0 normalized-lift))
                                  edge-weight (max 0.05 (- 0.5 (* 0.3 normalized-lift)))]
                              (str "  \"" (:antecedent %) "\" -> \"" (:consequent %) "\""
                                   " [label=\"" (format "%.0f%%" (* 100 (:confidence %))) "\\n"
                                   (format "%.1f×" (:lift %)) "\""
                                   ", len=" edge-len
                                   ", weight=" edge-weight
                                   ", fontsize=8"
                                   ", fontname=\"Helvetica\""
                                   ", penwidth=" (max 0.5 (* 1.5 normalized-lift))
                                   ", minlen=3"
                                   "];"))
                           top-rules)

            graphviz-def (str "digraph AssociationRules {\n"
                              "  layout=neato;\n"
                              "  bgcolor=transparent;\n"
                              "  overlap=false;\n"
                              "  sep=\"+100,100\";\n"
                              "  splines=true;\n"
                              "  concentrate=false;\n"
                              "  scale=" scale ";\n"
                              "  pad=\"1.0,1.0\";\n"
                              "  size=\"20,16\";\n"
                              "  ratio=auto;\n"
                              "  start=random;\n"
                              "  epsilon=0.0001;\n"
                              "  maxiter=5000;\n"
                              "  node [shape=rectangle, style=filled, fontname=\"Helvetica\", fontsize=9];\n"
                              "  edge [color=\"#c1556b\", arrowhead=vee, arrowsize=0.5, splines=curved];\n"
                              (str/join "\n" node-defs) "\n"
                              (str/join "\n" edge-defs) "\n"
                              "}\n")]
        graphviz-def))))

Prediction Function

# Presentation: Live Examples

Quick Demo (6,000 orders - runs in ~1 minute)

Full Analysis (All orders - runs in ~15 minutes)

Show Results

Summary Statistics

Top Rules (Grouped by Antecedent)

Network Visualization

Start with small, clear view

Then show full complexity

Prediction Demo

🎯 LIVE DEMO: Show how the website should make recommendations

source: src/data_analysis/book_sales_analysis/market_basket_analysis_v2.clj