Summary: If you are familiar with subset ratios in Qlik, you may not find much new in this post. But if you are new to Qlik or are unfamiliar with subset ratios in your data model, please read on.
When loading data into Qlik and building a new data model, inspecting the Subset Ratio of key fields is an important exercise to ensure data quality.
Subset Ratio is displayed in the Preview Pane (Qlik Sense Data model viewer) or Table Viewer (QlikView) when a key field (field linking two or more tables) is selected.
After clicking a key field, in the Preview pane you will see three important numbers:
Total distinct values: The count of all distinct values for this field (CustomerId) from all tables (Orders” and “Customers”) in the model.
Present distinct values: The count of distinct values for this field (CustomerId) in the currently selected table (Customers).
Subset ratio: “Present distinct values” divided by “Total Distinct Values”. What percentage of total field values are represented in this table? In this case, 100% of all CustomerId values in the data model are represented in the Customers table. This is good. We will typically expect to see 100% Subset ratio in a dimension table.
Let’s take a look at the Subset ratio for the same field in our fact table — the Orders table,
It’s less than 100%. Our Customer table is our “customer master” and represents all of our potential customers. Our Orders table represents a limited time period, perhaps 12 months. Only 44 distinct customers, or 22%, are represented in the set of Orders we have loaded.
Less than %100 Subset ratio is a normal condition for a Fact table. If we don’t want to include Customer data for those customers who have no orders, we can filter the Customer load with a “Where Exists(CustomerId)” clause.
So far we’ve seen “normal” subset ratios. Let’s look at some exceptions. What does it mean when the dimension table (Customers) has less than 100% subset ratio?
It means we have an order(s) that has no link to a Customer row. That’s a data quality problem. In the example above we can see that we have 1 missing CustomerId (201 – 200).
Why do we have a missing Customer? We would have to dig into the data to find out why. It could be that we have loaded historical orders and “inactive” customers are archived from the Customers table. It could be that we have some bad data due to a bug. We have to analyze the data and decide on the best path to remediate.
By the way, what is the specific value(s) of the missing CustomerId? A simple way to make this determination is to create a table chart with two columns — The key field and a field that has 100% density (every row has a value) from the Customers table. I’ll use the CustomerName field. Sort the table by CustomerName and the key value in question will show at the bottom of the table with a null value for CustomerName.
What does it mean when the sum of the subset ratios for two tables equals 100%? It means there are no matching values between the two tables. This can happen for instance when the keys come from two different systems that use slightly different nomenclature. Perhaps in your ERP all ProductId values start with “P” but in the spreadsheet that someone provided for additional part info the “P” is excluded because none of the humans use the “P” when identifying parts.
Examining Subset ratio as you build up your data model is an important quality step. Validating the quality of your data model will make the process of creating visualizations go much smoother.
-Rob