Market Basket Analysis with Pentaho

Statement of Problem
Given a list of transaction in the form…

  • Transaction number
  • Item code
  • Qty
  • Price

What is the probability that Item A is associated with item B in a transaction?

This question is interesting to a retailer to understand the behaviour of the shopper.
How strongly an item is associated with another item can tell the retailer the following…

  • how to layout the store
  • what items should be placed next to which item
  • items that are presented as a logical grouping but is not bought that way
  • ideas for bundling promotion
  • what is the decision tree of the shopper?

This question is a subset of the data mining domain.
It is an association problem.


For each transaction,
    For each unique unordered pair of items 
        calculate the sum of value (sum(qty*price))
        calculate percentage to total value for the date

The percentage to total is the probability that an item A is associated with item B.
This list of unordered pairs forms the fact table for market basket analysis.

The grain is

  • pair of items
  • weight (sum of value)
  • transaction date

This simple algorithm grows exponentially with the number of lines per transaction.
In a grocery store where a basket can have 50 items, the number of pairs is 50×49.

The performance of this algorithm is n*(n-1) where n is number of items in basket.

In a general merchandise store where number of items are not so many, this algorithm can be managed.

One more
Given that an item belongs to a heirarchy of product categories, what are the probability that an item or an element of the heirarchy is associated with another element of the heirarchy or item?

This query answers questions like…

  • what items are associated with dental care in a typical shopper basket?
  • what other categories are closely associated with the health and beauty care category?

The difference with the previous question is now we add the elements in the heirarchy in the mix.


For each transaction,
    For each unique unordered pair of items or elements in heirarchy 
        calculate the sum of value (sum(qty*price))

Say we have a catalog of 8 items with a 2 level heirarchy.
Starting at the root, you have 2 level_1 heirarchy.
Next at each level_1 element we have 2 children, therefor at level_2 we have 4 elements.
Then the 8 items are the leaves of the tree.
The total number of elements is 15.

When we started at item level association, the total permutations is 8*7 = 56.
At the levels, we have 7 elements.
We remove the root as all items can be related to itself.
So we have 6.
Each of these 6 can be associated with any other 6 including itself.
This makes it 6*6 = 36.
At level_1, the 2 elements can be permutated with 4 leaves not covered by each element.
Permutations at level_1 with leaves = 2*4 = 8
At level_2, the 4 elements can be permutated with 6 leaves not covered by each element.
Permutations at level_2 with leaves = 4*6 = 24

Total permutations now = 56+36+24 = 116.

Doing an association including levels of 8 leaves in a 2 level binary tree, increases the permutations from 56 to 116.

The more levels and the bigger the basket; the number of permutations increases exponentially.

Market Basket Transformation
The strategy of the implementation is to have a transformation that takes line items in 1 transaction (a market basket) and produces a list of all unordered pairs from the basket.

The next step is to produce the unordered pairs of all elements of the heirarchy from the list of unordered pairs of products (leaves) from all transactions.

Then we form a job that has 3 transformations.

  1. Trans that generates list of transaction numbers
  2. Pass the result rows of previous trans to the args of the market basket trans and runs it for each row and dumps the results in a text file
  3. The next transformation takes the results from the text file and generates unordered list of heirarchy elements, assigns probability weight to each unordered pair and dumps it to a fact table

In the transformation that generates list of transaction numbers, we can control the grain of the market basket analysis.
If it is daily grained irrespective of stores then we generate list of transactions from all stores on that date.

Posted in Uncategorized | Leave a comment

Dimensional Data Model

Dimensional data model is most often used in data warehousing systems. This is
different from the 3rd normal form, commonly used for transactional (OLTP) type systems. The same data would then be structured and stored differently in a dimensional model than in a 3rd normal form model.

Fig 1: Example of OLTP Master Tables

Fig 2: Star Schema Dimensional Design

To understand dimensional data modeling, let’s define some of the terms commonly used in
this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents relationship between different attributes within a hierarchy. For example, one possible hierarchy in the Time dimension is

  • Year >
  • Quarter >
  • Month >
  • Day>

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the
appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as “Q1 2001” or “2001 Q1”).

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or
more lookup tables, but fact tables do not have direct relationships to one another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the nonkey
columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables.

Whether one uses a star or a snowflake largely depends on personal preference and
business needs. Personally, I am partial to snowflakes, when there is
a business case to analyze the information at that particular level.

Fact Table Granularity


The first step in designing a fact table is to determine the granularity of the fact table. By
granularity, we mean the lowest level of information that will
be stored in the fact table. This constitutes two steps:

1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information will be kept.

The determining factors usually goes back to the requirements.

Which Dimensions To Include

Determining which dimensions to include is usually a straightforward process, because
business processes will often dictate clearly what are the relevant dimensions.

For example, in an offline retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all offline retailers. A supermarket with a Rewards Card program, where customers provide some personal information in exchange for a rewards card, and the supermarket would offer lower prices for certain items for customers who present a rewards card at checkout, will also have the ability to track the customer dimension. Whether
the data warehousing system includes the customer dimension will then be a decision that needs to be made.

What Level Within Each Dimensions To Include

Determining which part of hierarchy the information is stored along each dimension is a bit
more tricky. This is where user requirement (both stated and possibly future) plays a major role.

In the above example, will the supermarket wanting to do analysis along at the hourly level?
(i.e., looking at how certain products may sell by different hours of the day.)
If so, it makes sense to use ‘hour’ as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then ‘day’ can be used as the lowest level of granularity. Since the lower the level of detail, the larger the data amount in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage.

Note that sometimes the users will not specify certain requirements, but based on the industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need of additional details. In such cases, it is prudent for the data warehousing team to design the fact table such that lowerlevel
information is included. This will avoid possibly needing to redesign
the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge of the “dumping the lowest level of detail into the data warehouse” symptom, and only includes what is practically needed. Sometimes this can be more of an art than science, and prior experience will become invaluable here.

Fact and Fact Table Types

Types of Facts

There are three types of facts:

  • Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
  • SemiAdditive: Semiadditive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
  • NonAdditive: Nonadditive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

  • Date
  • Store
  • Product
  • Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a
daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact,
because you can sum up this fact along any of the three dimensions present in the fact table date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week.

Say we are a bank with the following fact table:

  • Date
  • Account
  • Current_Balance
  • Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and
Profit_Margin are the facts.

Current_Balance is a semiadditive
fact, as it makes sense to add them up for all accounts (what’s the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information).

Profit_Margin is a non-additive fact, for it does not make sense to add them up for the
account level or the day level.

Types of Fact Tables

Based on the above classifications, there are two types of fact tables:

Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example
presented here is a cumulative fact table.

Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Posted in Uncategorized | 1 Comment

Entry System

Extracting data from a source typically pulls data from transaction tables and the master data reference tables.  An entry system is a set of inter-related tables where referential integrity is maintained.  The set of tables represents the transactions of a business process.

The entry system is typically in a database and usually a packaged software for a departmental function.  Examples are accounting, payroll, inventory and point-of-sale.

The set of tables that has referential integrity and defines the master data is the entry system tables that should be targeted for extraction into dimension tables.

The set of tables that records the transactions at the lowest level of detail should be targeted for extraction into fact tables.

Posted in Uncategorized | Leave a comment

Product Dimension

In a retail datawarehouse, the product dimension is by far the most important.

In a typical ERP system, the product master data is kept in a set of inter-related tables.

The most important is how a product is represented in an ERP system.
The view of the product model will dictate how a product is represented in the tables.

Product Object Model
There are 2 components of the object model.

Product Heirarchy
The product heirarchy is the object model of the reference to the product object.
It is how the product is classified, statistics is keep and maintained.
Typically, the heirarchy is mapped to how the inventory subledger is kept.

Retail Method of Inventory Costing
Many retailers have adopted the retail method of inventory costing.
This costing method is tried and true and adheres to US GAAP accounting standards.
This method tracks purchasing records and aggregates per period per classification per site group.
The margin is then averaged from the previous period and kept at the a higher heirarchy level above the product level.
More detailed explanation of the retail method will not be discussed.

The consequence of this method is that the margin percent is kept in the heirarchy and applied against the retail amounts at the product level in aggregate amounts.

This results in a very rigid merchandise structure.

The merchandise structure in accounting is very critical as changes in merchandise structure affects the cost of goods as well as the account structure in the inventory subledger.

This works against the requirements of the retail business which needs to react quickly to the changes in the demand of customers.

A major change of structure usually requires a laborious process of change that permeates throughout the organisation and affects the financial numbers directly.

Direct Cost Perpetual Method
The direct cost perpetual method calculates the cost of products on the fly.
There are 2 parts in specifying the direct cost.

The first is the method how the cost is calculated, of which there are 3.

  1. Weighted Average
  2. First In First Out
  3. Last In First Out

The second is the periodicity of the aggregation used.
Typical configuration is either Perpetual or Monthly.

In perpetual period, cost price is calculated for every transaction.
In Monthly period, cost is not calculated until a period closing process is done.

The details of the calculations of how cost method is done will not be discussed here.

The purpose of explaining the direct cost method is because the cost information is kept at the product level and not at the heirarchy.

This is significant as when the product object model is separated between the product and the heirarchy, the retail enterprise can be liberated to keep a rigid structure for financial book-keeping but also react to the changes in the customer demand patterns.

In Financial book-keeping, performance statistics can be kept in the natural heirarchy of the product like the NRF standard merchandise classification.
It can also be kept at the organisation structure view and updated at fixed period when staff assignment is changed and performance statistics is kept and follow the person as he/she moves.

The most important view is how the customer views the product assortment in the physical space.
The heirarchy should be broken down by the layout of the product on the selling floor.
However, this assumes that a product is only located in one location in the store.
The product can be assigned to the heirarchy based on its MAIN classification.

Another view of the product heirarchy is that of store assortment.
In a retail enterprise with multiple formats, each format may create a sub-assortment from a main-assortment that is sold in the store.

There are many advantages where the product heirarchy can be separated fromthe product itself.

Product Model
In retail, the product modelcan get complicated very quickly.

Take a can of Coca-cola.

Product is purchased in cartons or pallets.
It can be stored in pallets or cartons.
It can be sold in individual cans, pack of 6, carton of 24.

When it is sold as a can, it has a barcode on the can.
When sold as a six-pack, it has an outer wrapping covering the 6 cans and a barcode on the outer wrapping.
When sold as a carton, a different barcode is on the carton.
These 3 barcodes identified 3 different retail products.

Products are also sold in different context.

If a can of coke is sold off-the shelf at $1.00 and it can also be sold from a fridge chilled next to the exit for $1.20, they are the same product with the same code but sold under different context and thus at a different price.
If it is scanned at the same POS, an additional code could identify the context where an additional charge of $0.20 is charged for the same product because it is sold chilled.
If it is scanned at a different location, like a convenience corner where canned drinks are all sold as chilled, the POS will scan and have the price listed at a different price.

The can of Coke could also be assembled as a set of items comprising of 1 can of Coke, 1 T-shirt with Coke printed on it.
The set is packed as 1 item.
The set is then given a separate code attached to the packaging and separately identified.
It is scanned at POS and a price for the set is presented.

All the above are examples of a fixed price scenario.

Conditional Pricing
Conditional pricing is often referred to as “mix-and-match”.
It is where a discount is applied to a product based on certain match patterns of the basket of products that it is a part of.

Say a can of coke is sold at a discount if it is sold together with other products bottled by Coca-Cola company.

Stock Keeping Unit
The stick-keeping-unit is the smallest indivisible unit of a product that can be kept in storage.
In our case above, the SKU is the can of Coke.
The other logistic-units are composed of different unit-packs of the SKU unit.

Discrete and Process units
Products come in different forms.
The key difference in how products are accounted for is Discrete and Process types.
Discrete products are products that can be counted.
Process products are products that cannot be counted but only measured.
A type of this product is petrol.

A product can be sold by supplier in discrete as well as process unit.
An example is purchasing of whole chicken.
3 pieces of chicken that is priced at $1 per kg.
It is priced in process unit but counted as discrete unit.

The inventory can be kept in several ways…

  • process units
  • discrete units of specific weight
  • discrete units of calculated average weight

Set of items
As mentioned before we can sell set of items.
Sets come in various forms.

  • Set of same items
  • Set of different items
  • Set of items configured at time of sale
Posted in Uncategorized | Leave a comment

Surrogate keys

For the past 3 years I have built the datawarehouse of a company with 3 retail businesses.
The lessons learned were huge.

I will start with the use of a surrogate key.

The surrogate key is a ‘meaningless key’ that is merely a sequence id.
In building a dimension table, extracting master data from an existing operational system, we will encounter a primary key.
The PK is used by the OLTP to maintain referential integrity WITHIN its system.

In the case where there can be multiple systems in several business units, the keys can be duplicates when merged into 1.

Therefore we create the natural key of the table as a composite of 2 fields – “entry system” and key from source system.

The keys must be unique across the entire enterprise.

I used the following approach to create the surrogate key.

Create a table with auto-id and PK on the id.
As new rows are inserted, ids increment and keeps everything unique.

Create exception keys to denote the reason of an exception.
The following are obvious…

Not applicable
This is the key value of the position of the dimension when it is not applicable.
Usually this is hard code not using a lookup step as it is known beforehand that this field will be left as ‘null’ therefore filled with “Not Apllicable” value.

Not found
This is the key value where it is mandatory to have a value but the value is not found in the dimension table.
It is a statement of fact.
Nothing is known why the key is not found.
Below are several possibilities.

Late arriving
This is the key value where the fact table has a valid value but is not found in the DW dimension table.
This is usually due to sync issues where the dimension table is synced with source but right after that, someone creates a new item and enters a transaction. The transaction table is then extracted to the fact table, resulting in the above condition.
When we know that the source system has referential integrity, we can confidently put the value as late arriving.

Invalid value
The key value is ridiculously out of bounds.
This can be assured as invalid.

The exceptional values of the surrogate keys are inserted after ‘releasing’ the auto-id field to enable update/insert on it.

This composite key of system and natural key will have a one-to-one correlation to the surrogate key.

The ‘system’ means the OLTP entry system that the master data is extracted from.
The entry system is a set of inter-related tables that maintains referential integrity.

Posted in Uncategorized | 1 Comment

Hello world!

Welcome to This is your first post. Edit or delete it and start blogging!

Posted in Uncategorized | 1 Comment