Market Basket Analysis with SQL

Overview

Market Basket Analysis (or in short: MBA) is a technique using analytics to explore customer purchase behaviors, thus retailers can efficiently employ purchase information to effectively market and optimize sales of diverse products. MBA goes with several other names such as Product Association or Product Recommendation and is one of the most popular and best applications of machine learning, looking for the products that frequently go with each other in purchases.

For example, if a customer buys a pair of pants, how likely are they going to buy a shirt on the same trip? How likely is a customer going to buy a drink following a purchase of a sandwich? This analysis stands at the forefront in the retail industry to solve such cases to make better recommendations for customers, which has been applied widely in such as: cinemas, fast food restaurants, clothing shops.


MBA with AdventureWorks2019 database

To better understand how MBA works, let’s look at the following demonstration in Microsoft SQL Server 2019 on AdventureWorks2019 database, specifically FactInternetSales table with two attributes: SalesOrderNumber and ProductKey.

Description

  • FactInternetSales table: Sales record through internet channel
    1. SalesOrderNumber: The code indentifies the order
    2. ProductKey: The code indentifies the product
SELECT TOP 10 -- see the first 10 rows of the table
	SalesOrderNumber,
	ProductKey
FROM FactInternetSales;

1. Select the orders having at least two different products

SELECT
	SalesOrderNumber,
	COUNT(ProductKey) AS NumberofProducts
FROM FactInternetSales
GROUP BY SalesOrderNumber
HAVING COUNT(ProductKey) >= 2;

We have the following result:


2. List out the SalesOrderNumber and ProductKey of orders having at least two product keys

SELECT
	OrderList.SalesOrderNumber,
	FIS.ProductKey
FROM
	(SELECT
		SalesOrderNumber,
		COUNT(ProductKey) AS NumberofProducts
	FROM FactInternetSales
	GROUP BY SalesOrderNumber
	HAVING COUNT(ProductKey) >= 2) AS OrderList
JOIN FactInternetSales AS FIS ON Orderlist.SalesOrderNumber = FIS.SalesOrderNumber;

We have the following result:

Notice that all the orders returned have more than 1 product key. Before seeing which combination of two products occurs most frequent, let’s use common table expression (CTE) to use this result for further uses.


3. Common Table Expression (CTE)

WITH Info AS
(SELECT
	OrderList.SalesOrderNumber,
	FIS.ProductKey
FROM
	(SELECT
		SalesOrderNumber,
		COUNT(ProductKey) AS NumberofProducts
	FROM FactInternetSales
	GROUP BY SalesOrderNumber
	HAVING COUNT(ProductKey) >= 2) AS OrderList
JOIN FactInternetSales AS FIS ON Orderlist.SalesOrderNumber = FIS.SalesOrderNumber);

4. Create combinations of two products in the same order

WITH Info AS
(SELECT
	OrderList.SalesOrderNumber,
	FIS.ProductKey
FROM
	(SELECT
		SalesOrderNumber,
		COUNT(ProductKey) AS NumberofProducts
	FROM FactInternetSales
	GROUP BY SalesOrderNumber
	HAVING COUNT(ProductKey) >= 2) AS OrderList
JOIN FactInternetSales AS FIS ON Orderlist.SalesOrderNumber = FIS.SalesOrderNumber)

SELECT 
	Info1.SalesOrderNumber,
	Info1.ProductKey AS Product1,
	Info2.ProductKey AS Product2
FROM Info AS Info1
JOIN Info AS Info2 ON Info1.SalesOrderNumber = Info2.SalesOrderNumber
WHERE Info1.ProductKey != Info2.ProductKey 
  AND Info1.ProductKey < Info2.ProductKey;

We self join the temporary “Info” table to view every 2 products that occur in the same order. Pay attention to the “WHERE” statement where we specify two conditions to avoid duplicates.

  • Info1.ProductKey != Info2.ProductKey
    This is to make sure that the products we account for are different.
  • Info1.ProductKey < Info2.ProductKey
    This is to avoid duplicates. If this condition does not exist, the result will be:

Look at the first and the third record where Product1 equals Product2 and vice versa. This makes the result is twice as long as it should be. Therefore, removing duplicates by using second condition is a must.

This is the result without duplicates:


5. Calculate the frequency of a pair of two products

WITH Info AS
(SELECT
	OrderList.SalesOrderNumber,
	FIS.ProductKey
FROM
	(SELECT
		SalesOrderNumber,
		COUNT(ProductKey) AS NumberofProducts
	FROM FactInternetSales
	GROUP BY SalesOrderNumber
	HAVING COUNT(ProductKey) >= 2) AS OrderList
JOIN FactInternetSales AS FIS ON Orderlist.SalesOrderNumber = FIS.SalesOrderNumber)

SELECT
	Info1.ProductKey AS Product1,
	Info2.ProductKey AS Product2,
	COUNT(*) AS Frequency
FROM Info AS Info1
JOIN Info AS Info2 ON Info1.SalesOrderNumber = Info2.SalesOrderNumber
WHERE Info1.ProductKey != Info2.ProductKey 
  AND Info1.ProductKey < Info2.ProductKey
GROUP BY
	Info1.ProductKey,
	Info2.ProductKey
ORDER BY COUNT(*);

We have the following result:

From the result, we can interpret that if a customer buys product 477, they are most likely to buy product 478 and 479. Therefore, recommending customers to buy these two products would be likely to improve sales revenue.


Conclusion

MBA is an extremely powerful technique that a number of companies now are using to anticipate the interest of customers in order to suggest them the most relevant product besides what they chose to optimize the revenue. With only a few lines of code, SQL helps us to extract the data we need and make the right business decisions.


Reference

Thanh Cong