Market Basket Analysis with SQL
Overview
Market Basket Analysis (or in short: MBA) is a technique using analytics to explore customer purchase behaviors, thus retailers can efficiently employ purchase information to effectively market and optimize sales of diverse products. MBA goes with several other names such as Product Association or Product Recommendation and is one of the most popular and best applications of machine learning, looking for the products that frequently go with each other in purchases.
For example, if a customer buys a pair of pants, how likely are they going to buy a shirt on the same trip? How likely is a customer going to buy a drink following a purchase of a sandwich? This analysis stands at the forefront in the retail industry to solve such cases to make better recommendations for customers, which has been applied widely in such as: cinemas, fast food restaurants, clothing shops.
MBA with AdventureWorks2019 database
To better understand how MBA works, let’s look at the following demonstration in Microsoft SQL Server 2019 on AdventureWorks2019 database, specifically FactInternetSales table with two attributes: SalesOrderNumber and ProductKey.
Description
- FactInternetSales table: Sales record through internet channel
- SalesOrderNumber: The code indentifies the order
- ProductKey: The code indentifies the product
SELECT TOP 10 -- see the first 10 rows of the table
SalesOrderNumber,
ProductKey
FROM FactInternetSales;
1. Select the orders having at least two different products
SELECT
SalesOrderNumber,
COUNT(ProductKey) AS NumberofProducts
FROM FactInternetSales
GROUP BY SalesOrderNumber
HAVING COUNT(ProductKey) >= 2;
We have the following result:
2. List out the SalesOrderNumber and ProductKey of orders having at least two product keys
SELECT
OrderList.SalesOrderNumber,
FIS.ProductKey
FROM
(SELECT
SalesOrderNumber,
COUNT(ProductKey) AS NumberofProducts
FROM FactInternetSales
GROUP BY SalesOrderNumber
HAVING COUNT(ProductKey) >= 2) AS OrderList
JOIN FactInternetSales AS FIS ON Orderlist.SalesOrderNumber = FIS.SalesOrderNumber;
We have the following result:
Notice that all the orders returned have more than 1 product key. Before seeing which combination of two products occurs most frequent, let’s use common table expression (CTE) to use this result for further uses.
3. Common Table Expression (CTE)
WITH Info AS
(SELECT
OrderList.SalesOrderNumber,
FIS.ProductKey
FROM
(SELECT
SalesOrderNumber,
COUNT(ProductKey) AS NumberofProducts
FROM FactInternetSales
GROUP BY SalesOrderNumber
HAVING COUNT(ProductKey) >= 2) AS OrderList
JOIN FactInternetSales AS FIS ON Orderlist.SalesOrderNumber = FIS.SalesOrderNumber);
4. Create combinations of two products in the same order
WITH Info AS
(SELECT
OrderList.SalesOrderNumber,
FIS.ProductKey
FROM
(SELECT
SalesOrderNumber,
COUNT(ProductKey) AS NumberofProducts
FROM FactInternetSales
GROUP BY SalesOrderNumber
HAVING COUNT(ProductKey) >= 2) AS OrderList
JOIN FactInternetSales AS FIS ON Orderlist.SalesOrderNumber = FIS.SalesOrderNumber)
SELECT
Info1.SalesOrderNumber,
Info1.ProductKey AS Product1,
Info2.ProductKey AS Product2
FROM Info AS Info1
JOIN Info AS Info2 ON Info1.SalesOrderNumber = Info2.SalesOrderNumber
WHERE Info1.ProductKey != Info2.ProductKey
AND Info1.ProductKey < Info2.ProductKey;
We self join the temporary “Info” table to view every 2 products that occur in the same order. Pay attention to the “WHERE” statement where we specify two conditions to avoid duplicates.
- Info1.ProductKey != Info2.ProductKeyThis is to make sure that the products we account for are different.
- Info1.ProductKey < Info2.ProductKeyThis is to avoid duplicates. If this condition does not exist, the result will be:
Look at the first and the third record where Product1 equals Product2 and vice versa. This makes the result is twice as long as it should be. Therefore, removing duplicates by using second condition is a must.
This is the result without duplicates:
5. Calculate the frequency of a pair of two products
WITH Info AS
(SELECT
OrderList.SalesOrderNumber,
FIS.ProductKey
FROM
(SELECT
SalesOrderNumber,
COUNT(ProductKey) AS NumberofProducts
FROM FactInternetSales
GROUP BY SalesOrderNumber
HAVING COUNT(ProductKey) >= 2) AS OrderList
JOIN FactInternetSales AS FIS ON Orderlist.SalesOrderNumber = FIS.SalesOrderNumber)
SELECT
Info1.ProductKey AS Product1,
Info2.ProductKey AS Product2,
COUNT(*) AS Frequency
FROM Info AS Info1
JOIN Info AS Info2 ON Info1.SalesOrderNumber = Info2.SalesOrderNumber
WHERE Info1.ProductKey != Info2.ProductKey
AND Info1.ProductKey < Info2.ProductKey
GROUP BY
Info1.ProductKey,
Info2.ProductKey
ORDER BY COUNT(*);
We have the following result:
From the result, we can interpret that if a customer buys product 477, they are most likely to buy product 478 and 479. Therefore, recommending customers to buy these two products would be likely to improve sales revenue.
Conclusion
MBA is an extremely powerful technique that a number of companies now are using to anticipate the interest of customers in order to suggest them the most relevant product besides what they chose to optimize the revenue. With only a few lines of code, SQL helps us to extract the data we need and make the right business decisions.