Understanding and Using Subset Ratio

Summary: If you are familiar with subset ratios in Qlik, you may not find much new in this post. But if you are new to Qlik or are unfamiliar with subset ratios in your data model, please read on. 

When loading data into Qlik and building a new data model, inspecting the Subset Ratio of key fields is an important exercise to ensure data quality.

Subset Ratio is displayed in the Preview Pane (Qlik Sense Data model viewer) or Table Viewer (QlikView) when a key field (field linking two or more tables) is selected.

After clicking a key field, in the Preview pane you will see three important numbers:

Total distinct values:  The count of all distinct values for this field (CustomerId) from all tables (Orders” and “Customers”) in the model.

Present distinct values:  The count of distinct values for this field (CustomerId) in the currently selected table (Customers).

Subset ratio: “Present distinct values” divided by  “Total Distinct Values”.  What percentage of total field values are represented in this table?  In this case, 100% of all CustomerId values in the data model are represented in the Customers table.  This is good. We will typically expect to see 100% Subset ratio in a dimension table.

Let’s take a look at the Subset ratio for the same field in our fact table — the Orders table,

It’s less than 100%.  Our Customer table is our “customer master” and represents all of our potential customers.  Our Orders table represents a limited time period, perhaps 12 months.  Only 44 distinct customers, or 22%,  are represented in the set of Orders we have loaded.

Less than %100 Subset ratio is a normal condition for a Fact table.  If we don’t want to include Customer data for those customers who have no orders, we can filter the Customer load with a “Where Exists(CustomerId)” clause.

So far we’ve seen “normal” subset ratios.  Let’s look at some exceptions.  What does it mean when the dimension table (Customers) has less than 100% subset ratio?

It means we have an order(s) that has no link to a Customer row.  That’s a data quality problem. In the example above we can see that we have 1 missing CustomerId (201 – 200).

Why do we have a missing Customer?  We would have to dig into the data to find out why.  It could be that we have loaded historical orders and “inactive” customers are archived from the Customers table.  It could be that we have some bad data due to a bug.  We have to analyze the data and decide on the best path to remediate.

By the way, what is the specific value(s) of the missing CustomerId?  A simple way to make this determination is to create a table chart with two columns — The key field and a field that has 100% density (every row has a value) from the Customers table.  I’ll use the CustomerName field. Sort the table by CustomerName and the key value in question will show at the bottom of the table with a null value for CustomerName.

What does it mean when the sum of the subset ratios for two tables equals 100%? It means there are no matching values between the two tables.   This can happen for instance when the keys come from two different systems that use slightly different nomenclature.  Perhaps in your ERP all ProductId values start with “P” but in the spreadsheet that someone provided for additional part info the “P” is excluded because none of the humans use the “P” when identifying parts.

Examining Subset ratio as you build up your data model is an important quality step.  Validating the quality of your data model will make the process of creating visualizations go much smoother.

-Rob

 

Share

8 thoughts on “Understanding and Using Subset Ratio”

  1. Hi, would often use “Left Keep(Orders)” instead of Exists(CustomerId). I like the extra control this gives. CustomerId might be present in some other table in memory at the point of execution.

  2. Thanks for explaining Rob. I never paid much attention to these ratios in the past but certainly will going forward. I appreciate you taking the time to educate us all on various areas of QlikSense. Please keep the info coming.

  3. Gooda!!! I like easy and simple things, like youur explanation for this subject. Long blogs are boring.

  4. “What does it mean when the sum of the subset ratios for two tables equals 100%? It means there are no matching values between the two tables.”

    This is only true if the key you are looking at only exists in those two tables, right? If you had a third table linked with that key, it could be possible to have two of the tables have subset ratios that add up to 100% but still have some overlap?

    1. Possible yes, but highly improbable I think. I’ll admit the question makes my head hurt.

      Given three tables, two with the same key values (M) and one with completely different values (U). The sum of M -> U will be 100% as long as there are zero matches and M is the larger of the two M tables. M -> M may be greater or less than 100% but not likely to be exactly 100%.

      It gets more complicated if there is a partial overlap between M -> U. The 100% rule only applies when there is zero match. But even with a partial match the subset ratio can be used to detect a bad smell.

  5. I was wanting to expose this in the GUI the other day and wrote a quick sub to do it (sorry format will prob be bonkers)

    Sub InfoDensity (vTableName,vKeyField,vFieldName,vID)

    $(vFieldName)_InfoDensity:
    LOAD $(vKeyField) AS KEY_$(vKeyField)_$(vID), NullCount($(vFieldName)) as NullCount_$(vFieldName),
    count($(vKeyField)) as Count_$(vKeyField)_$(vID)
    Resident $(vTableName) group by $(vKeyField);

    End Sub

    You call it like

    CALL InfoDensity(‘CMP_SALES_ORDER_INT_STEP_1′,’CUSTOMER_NUMBER’,’CUSTOMER_TYPE’,’1′);

    Then in the GUI

    =’CUSTOMER_TYPE Info Density ‘ &num( 1 – sum(NullCount_CUSTOMER_TYPE) /sum(Count_CUSTOMER_NUMBER_1),’#,##0%’)

    1. try a straight table with
      dimensions:
      $Table, $Field

      Expression:
      num(FieldValueCount($Field)/$Rows,’#,##0%’)

      Ratios >100% mean that there is a mismatch key 🙂

Leave a Reply

Your email address will not be published.