Dual Storage vs Dual Behavior

Summary: The Dual() function stores both string and numeric representations of a value.  “Implied Duals”  such as Dates, store only the numeric portion and apply the string mask as needed. In some circumstances such as un-optimized QVD loads, implied duals can get converted to “full duals” using storage unnecessarily.

In QlikView and Qlik Sense you can create a Dual field using the Dual() function such as:

if(ShipDate = OrderDate, Dual('Yes',1), Dual('No', 0)) as SameDayShip

Dual fields have both string and numeric representations and Qlik is smart about using the correct representation based on context.

In a listbox or filter, SameDayShip will show the string values as:

Yes
 No

We can also write expressions such as:

Sum(SameDayShip)

which will smartly and automatically use the numeric value of SameDayShip.

Internally, the values will be stored in the symbol table like this:

Y e s 1
N o 0

The numeric portion, 1 or 0 in this case, will always occupy 8 bytes. The average symbol length will be 10.5 —  (11 + 10) / 2 values. You can display the symbol length by using a tool like Document Analyzer.

What about Date() or Num() fields, which are also Dual fields?  When properly scripted, these are what I call “Implied Dual fields”. They have dual behavior, but do not occupy the full dual storage.

Dates are represented as the number of days since Dec 31, 1899.  Today’s date (March 12, 2017) number is 42806.  A properly optimized date stores only the numeric value and does not store the  string value. Instead , the format mask is stored once as an attribute of the field.

Format: M/D/YYYY
ShipDate
42804
42802
42800

On demand, when the string representation is required (like in a listbox) the format mask is applied.  The symbol length in this case is always 8, only the numeric value.

Sometimes — such as in an un-optimized QVD load — the field is converted to what I call a “full dual” (like the “SameDayShip ” example) and both the string and numeric values are stored in the symbol table.  This can greatly increase the storage used for the symbol table.

3/10/2017 42804
3/8/2017 42802
3/6/2017 42800

 

An example of an un-optimized load that will create the “full dual” representation:

LOAD
 DateField
 FROM Dates.qvd (qvd)
 Where Year(DateField) >= 2016;

In QlikView, you can “fix” this problem by going into the Document Properties, Number pane and changing the field format from “Mixed” to to “Date” format.  QV will immediately release the string storage.

Qlik Sense does not provide a Number Format pane, so you must apply corrections in the script like this:

LOAD
 Date(Num(DateField)) as DateField
 FROM Dates.qvd (qvd)
 Where Year(DateField) >= 2016;

To be fair, this is usually not a big deal for something like Dates, which have a relatively small number of values.  It can become more significant with something like Timestamps or other numeric fields that have many unique values.

The “Recommendations” sheet of Document Analyzer identifies these “Numeric Size” opportunities and quantifies the memory savings if you were to apply a correction.

-Rob

 

 

 

Share

6 thoughts on “Dual Storage vs Dual Behavior”

  1. Hi Rob,
    Is there a script-command, which may enforce the formatting? Think that the setting in the document is not permanent, but seems to get lost, once there is an unsuccessful reload.

    1. Hi Peter,
      You can use the script snippet I showed above to preserve the format in the script.

  2. Hi Rob,
    Interesting piece. Out of curiosity do you reverse engineer this knowledge using the mem files or is this info “leaked” from Qlik? Your knowledge of how Qlik uses memory is quite impressive.

    Also, is there any good reason why Qlik developers had to convert implicit dual values to full dual values when running a un-optimized load?

    Karl

  3. I reverse engineered this specific issue. Qlik, particularly Henric Cronström , have published some details about memory over the years.

    I imagine the conversion to full dual is part of the where clause testing, but I don’t know that they remain that way. I’ve been aware of this for some time — it was actually the topic of my very first QlikView blog post in May 2008 http://qlikviewcookbook.com/2008/05/when-less-data-means-more-ram/.

    There is a bit of tortured history trying to get this fixed as a “bug”. It’s been described as WAD in the past. I’ve pretty much learned to live with it, it has less significance in the 64bit world.

  4. Hello Rob.
    Its a very interesting article.
    I am working with hundreds of millions of records with more than 30 fields per record.
    There are some items with two fields, the description and the code. The code is used to do sorts.
    What do you think that is better. Load the two fields separately or create a dual field with description and code?
    Thanks

    1. Interesting question. My guess is that the Dual() method would be a better overall performer, but I would test to find out.

      If the code is used only for Sort, you might consider loading the data in Load order and using Load Order for sort instead, which would be the fastest.

Leave a Reply

Your email address will not be published.