Category Archives: Performance

Dual Storage vs Dual Behavior

Summary: The Dual() function stores both string and numeric representations of a value.  “Implied Duals”  such as Dates, store only the numeric portion and apply the string mask as needed. In some circumstances such as un-optimized QVD loads, implied duals can get converted to “full duals” using storage unnecessarily.

In QlikView and Qlik Sense you can create a Dual field using the Dual() function such as:

if(ShipDate = OrderDate, Dual('Yes',1), Dual('No', 0)) as SameDayShip

Dual fields have both string and numeric representations and Qlik is smart about using the correct representation based on context.

In a listbox or filter, SameDayShip will show the string values as:

Yes
 No

We can also write expressions such as:

Sum(SameDayShip)

which will smartly and automatically use the numeric value of SameDayShip.

Internally, the values will be stored in the symbol table like this:

Y e s 1
N o 0

The numeric portion, 1 or 0 in this case, will always occupy 8 bytes. The average symbol length will be 10.5 —  (11 + 10) / 2 values. You can display the symbol length by using a tool like Document Analyzer.

What about Date() or Num() fields, which are also Dual fields?  When properly scripted, these are what I call “Implied Dual fields”. They have dual behavior, but do not occupy the full dual storage.

Dates are represented as the number of days since Dec 31, 1899.  Today’s date (March 12, 2017) number is 42806.  A properly optimized date stores only the numeric value and does not store the  string value. Instead , the format mask is stored once as an attribute of the field.

Format: M/D/YYYY
ShipDate
42804
42802
42800

On demand, when the string representation is required (like in a listbox) the format mask is applied.  The symbol length in this case is always 8, only the numeric value.

Sometimes — such as in an un-optimized QVD load — the field is converted to what I call a “full dual” (like the “SameDayShip ” example) and both the string and numeric values are stored in the symbol table.  This can greatly increase the storage used for the symbol table.

3/10/2017 42804
3/8/2017 42802
3/6/2017 42800

 

An example of an un-optimized load that will create the “full dual” representation:

LOAD
 DateField
 FROM Dates.qvd (qvd)
 Where Year(DateField) >= 2016;

In QlikView, you can “fix” this problem by going into the Document Properties, Number pane and changing the field format from “Mixed” to to “Date” format.  QV will immediately release the string storage.

Qlik Sense does not provide a Number Format pane, so you must apply corrections in the script like this:

LOAD
 Date(Num(DateField)) as DateField
 FROM Dates.qvd (qvd)
 Where Year(DateField) >= 2016;

To be fair, this is usually not a big deal for something like Dates, which have a relatively small number of values.  It can become more significant with something like Timestamps or other numeric fields that have many unique values.

The “Recommendations” sheet of Document Analyzer identifies these “Numeric Size” opportunities and quantifies the memory savings if you were to apply a correction.

-Rob

 

 

 

Q-On Courses in January

Just a heads up to get in your planning before taking a holiday break.  I’ll be teaching two on-line courses in early January:

January 7  QlikView™ Document Performance Tuning

Learn how to measure and optimize the performance of your QlikView apps. By the end of the session, you will understand how the calculation process in QlikView works, and how data model, expression and chart design impact response times and resource usage.

You will come away with the skills to analyze your own apps and make them run faster.

 

January 6  QlikView™ Components Scripting Library

Speed up your QlikView™ development workflow by making use of the powerful QlikView Components (QVC) scripting library.

QVC can help you build your QlikView™ projects quicker and ensuring a high level of quality in your scripts. The set of ready-made subroutines that QVC provides can be used to perform common script operations, implemented in a flexible way and incorporating best practices.

 

Hope to see you there !

-Rob

The Impact of Data Islands on Cache and CPU

Well that’s a wordy title isn’t it?

I’m in the midst of  writing a new QlikView Document  Optimization course to be delivered at  Q-On Training .  This work has reminded me of a not-so-obvious issue I sometimes see in  Performance Tuning engagements with customers.

You might be thinking I’m going to write today about how heavy a calculation can be as a result of the cartesian product of disconnected fields in an expression . No…that’s not what I’m thinking of.

What I’m thinking about today is the impacts of selecting  a field that is not used by any expression on the sheet. For example, a Currency listbox may be present on the sheet.  The Currency field is not connected (“Data Island”) to other tables in the model . Some, or none, of the objects on this sheet may reference that Currency field.

What happens when you click a Currency value? Everything on the sheet gets recalculated.  E v e ry t h i n g.  Whether it uses Currency or not. Why? Because the data has changed.

Since the data used in my chart has not changed, the results will be fetched from cache, right? Probably not.  Let’s look at an example:

Dimension:  Customer
Expression:  Sum(Sales)

The current selection in Currency  is “USD”.  The chart has been calculated and the results stored in cache, available for speedy retrieval if the same expression is calculated over the same set of data.

Select “EUR” in Currency. The cached result will not be used even though no change has been made in the data used by the chart. Cache evaluation considers the entire data model, not just data referenced by the object. If you now select “USD”, the previously cached result will be used.

If your application is large enough that you consider this behavior to be a problem, a leaner alternative for the Currency example is to use a variable. When a variable changes, only objects that reference the variable get recalculated. Another great alternative is to put the Currency listbox in a different Alternate State.

Let’s look at another case, the idea of a “universal listbox” that has been published by a several authors. I think it’s a very cool idea and I use it myself for data exploration. The common idea is that you let the user pick any field and then make selections in that field. This can be built using the system fields $Table and $Field.

Let’s consider the app is idle and all the sheet objects are occupying what I’ll term “relative cache slot #1” – the first cache entry for each object.  Now we’ll use the universal listbox.

1. Select a $Table value. Everything is recalculated, consuming CPU resources and the results are stored in “cache slot #2”.

2. Select a $Field value, “ProductGroup”.  Everything is recalculated, consuming CPU resources and the results are stored in “cache slot #3”.

3. Make some selections in  “ProductGroup”.  Everything is recalculated, consuming CPU resources and the results are stored in “cache slot #4”.  This result, with the new data, is what the user is interested in.

We have used three times the CPU and cache resources to achieve the end result.  A leaner alternative would be to place the universal listbox on it’s own sheet, thus avoiding extra recalculations.  You can move back and forth with buttons to make it feel like it’s integrated with each sheet.

I don’t mean to discourage the use of  these data island techniques. They can be incredibly useful to the #1 goal  — getting accurate  information to your business in a timely and cost efficient manner. However, when you are dealing a specific performance concern in larger apps this is an area you may want to examine and optimize.

-Rob

Watch the Q-On Training site or subscribe to our Q-On newsletter to be notified when the new Performance course is available. For more performance related discussions and tips, join me at the Masters Summit for QlikView Sept 23-25 NYC or 29 Sep-1 Oct Copenhagen

 

Better Calendar Scripts

TLDR: A new Qlik Sense training video uses a tired old-school Master Calendar script. I propose that updated training materials should use an updated script.

I just watched  a new  video by Michael Tarallo of Qlik titled “Understanding the Master Calendar – Qlik Sense and QlikView“.  Before I  pick on the contents of this particular video, I want to  credit  Michael for  producing many excellent training videos that are worth watching and learning from. I  highly recommend them.

The video does a great job of explaining the need for and function of a Master Calendar in your data model. It then goes on to show an actual script.

I can’t discuss Master Calendar without expressing disappointment that Calendar generation is not yet a builtin function in Sense. Something like QlikView Components (QVC) does with the single script line:

CALL Qvc.CalendarFromField('OrderDate');

On to the script used in this new video. I’ll reproduce the entire script below and then comment on the techniques used and suggest some more “modern” approaches.

The video script is  similar to the script used in the current QlikView Developer course . I acknowledge that this script works as is and produces correct results. But I don’t think it should be taught to newbies as good scripting. Here’s the script from the video:

Obsolete Code

1. Why is this field created  and where is it used?

It’s not used. It’s left over from a very old version of the exercise and it doesn’t serve any purpose.

2. Why are we sorting the table? Is this statement useful?

Even if I could think of a good reason why  the Calendar table should be in order, it already is in this order because the TempCalendar was generated in a loop. Statement unnecessary.

Inefficient Code

Loading a Resident table can be very slow for a large table.

 

Experienced scripters use the FieldValues array instead.

What’s the difference? FieldValues only reads the distinct values of a field — maybe a thousand or so for several years of dates. Resident reads every row of the table. For a 10M row fact table, that’s 10M reads and the time scales up linearly. The difference can be dramatic.

Error Prone and Extra Work

Peek(), used on lines 13 & 14,  is one of those functions that fails silently. That is, if you misspell a field or table,  you won’t get a script error. Misspelling a variable will also not generate  a script error. Maybe. Or maybe not. Or maybe you will get a Calendar that starts in year 1899.  Depends on which line you make the spelling error on. If your misspelling does result in a script syntax error, it will be downstream from where you created the problem. There are multiple ways to mess this one up and some very curious potential results.

Don’t forget to DROP those temp tables on lines 15 and 35.

And those varMinDate and varMaxDate variables really should be cleared as well.

You can avoid all the syntax traps and extra cleanup by coding this a as a Preceding Load.  Here’s the same script written as a Preceding Load:

Nothing to remember (or forget) to clean up.  If you misspell a fieldname, you will get an understandable error at the right place. This is the calendar script I wish we would provide to newcomers.

Of course if you’ve attended the Masters Summit for QlikView, you’ve learned all about FieldValues, Preceding Loads and more. If not, attend a summit this Fall in NYC or Copenhagen.

QVC users don’t even get out of bed to generate Calendars. But they know that all that best practice stuff is happening under the covers. If you want to learn more about QVC, join my online class June 4 or a future class at Q-on.bi.

-Rob

Entire script suitable for copying:

MasterCalendar: 
Load 
 TempDate AS OrderDate, 
 week(TempDate) As Week, 
 Year(TempDate) As Year, 
 Month(TempDate) As Month, 
 Day(TempDate) As Day, 
 'Q' & ceil(month(TempDate) / 3) AS Quarter, 
 Week(weekstart(TempDate)) & '-' & WeekYear(TempDate) as WeekYear, 
 WeekDay(TempDate) as WeekDay 
;

//=== Generate a temp table of dates === 
LOAD 
 date(mindate + IterNo()) AS TempDate
 ,maxdate // Used in InYearToDate() above, but not kept 
WHILE mindate + IterNo() <= maxdate;

//=== Get min/max dates from Field ===/
LOAD
 min(FieldValue('OrderDate', recno()))-1 as mindate,
 max(FieldValue('OrderDate', recno())) as maxdate
AUTOGENERATE FieldValueCount('OrderDate');

Document Analyzer Numeric Field Recommendations

In the latest release 2.4  of QV Document Analyzer I’ve started the process of adding “Recommendations”, highlighting specific areas where potential improvement may be made to your document.

The recommendation included in this release is “Numeric field sizes”. Numeric fields that occupy more than 8 bytes of storage per value will be flagged. They will be highlighted on the Fields sheet and additional details will be provided on the Recommendation sheet.  The potential savings from “fixing” the field is also shown.

So what is this 8 byte thing? Ideally, most numeric fields (which includes timestamps) can be represented by 8 bytes of storage. In the Number format property or the script num() function you assign a formatting mask to be used for string representation.  QV stores the 8 byte number and applies the mask at display time.

Occasionally, particularly when doing an un-optimized load from QVD, the data can wind up being saved as “Mixed” which means both the numeric and string representations are stored for each value. So you may see Symbol widths of 17, 20, 24 or even greater for this field.

In larger apps with many unique numeric values, I’ve found significant memory savings by restoring these Mixed fields to a numeric format.

The usual tuning and optimizing caveat applies here. Don’t spend effort recovering a few megabytes. You have better things to do. But if you are wrestling with the megabeast app, fixing the biggest of those fields may be worth the effort.

-Rob

How to Choose an Expression

This is a follow on to my post “How Not to Choose an Expression” that described the performance problem sometimes experienced in large apps when choosing one of several expressions.

I received a number of questions about the scalability of my suggested solution to define multiple Expressions using the Expression Conditional property. In this post I’ll present an alternative for when you have a large number of option combinations.

Before I dive in,  an important comment about coding for performance. First, you should code for clarity and maintainability. If your document contains only a few million rows, it probably won’t matter if you use if() or an alternative technique.  I’m fond of the  Donald Knuth quote “Premature optimization is the root of all evil (or at least most of it) in programming”. The techniques presented in this post are meant to solve an identifiable performance problem. I wouldn’t bother implementing them unless I have a need.

Let’s use a scenario where calculations should reflect user selected options.

  • US Dollars or Local Currency  — USD | LC
  • Include VAT?  — Y|N

I’m only dealing with two options to keep my example manageable. You should be able to extend the concepts into many options.

The if method chart Expression for our choice may look like this:

if(CurrencyType='USD' AND [Include VAT?]='N'
 ,Sum(SalesAmount_USD)
,if(CurrencyType='USD' AND [Include VAT?]='Y'
 ,Sum(SalesAmount_USD + VAT_USD)
,if(CurrencyType='LC' AND [Include VAT?]='N'
 ,Sum(SalesAmount_LC)
,if(CurrencyType='LC' AND [Include VAT?]='Y'
 ,Sum(SalesAmount_LC + VAT_LC)
))))

The [CurrencyType] field controls which field we will sum() and the [Include VAT?] field controls if the associated VAT field is included in the sum(). What’s the difference between the alternatives? Only the fields listed in the sum().

Our Document contains 100M rows and we’ve identified this expression as a bottleneck. What are some alternatives to make this perform better?

In my previous post, I discussed dividing this into four expressions and making a choice using the Expression Conditional property. I won’t repeat the details of it here.  In this case, I don’t want to create multiple expressions in the charts. What is another possible technique?

Start by loading a table that reflects the choice fields and the associated sum() fields.

This is an island table that is not linked to any other tables in our model.

Create Listboxes for [CurrencyType] and [Include VAT?]. Set the “Always One Selected” property in the Listboxes.  This will force the user to make selections and at any given time only one row will be possible in our table.

We will then reference the SalesExprField in our chart using Dollar Sign Expansion (DSE). DSE is performed before the expression is evaluated. Our chart expression is now:

Sum($(=SalesExprField))

The “=” within the DSE says to evaluate this as an expression. In other words, substitute in the value of the SalesExprField and then evaluate the Sum().

Do you want to see what DSE is actually substituting? In a Straight Table, clear the Label field. The substituted expression will be visible in the column heading.

If we are not sure that “Always On Selected” is guaranteed, we should use some type of aggregation function to ensure a single SalesExprField is selected. For example, to take the first possible value:

Sum($(=FirstSortedValue(DISTINCT SalesExprField,1)))

Instead of just parameters to  the sum() function, we could have included the entire expression in our island table,  eg “Sum(SalesAmount_LC + VAT_LC)”. In that case our chart expression would be:

$(=SalesExprField)

What if want to control options via variables instead of fields? Use a Set expression in the Only() function.

Sum(
$(=only({<CurrencyType={$(vCurrencyType)},[Include VAT?]={$(vIncludeVAT)}>}SalesExprField))
)

If you’ve read this far, you are probably wondering “Is there a downloadable example”. Yes, it can be found here. Qlikview Cookbook: How to Choose an Expression.

-Rob

Want more performance tips?  Come see Oleg Troyansky’s Performance Tuning session and  additional tips  from other presenters at the “Masters Summit for Qlikvew”.

How Not to Choose an Expression

We sometimes have a  requirement to select between two or more expressions  dependant on user selections or some other condition. For example, this chart Expression.

if(vWithCommission=1
 // Calculate with Commision
 ,sum({<CommissionFlag={1}>}
     SalesAmount - SalesAmount * CommissionRate)
 // Calculate without Commission
 ,sum(SalesAmount)
 )

The  if()  will select one or the other sum() expression based on the value of the vWithCommmision variable.  Because a variable is being tested —  not a row value — only one sum() will be chosen for the entire chart.

If performance is important,  this is not a good way to implement the requirement. QlikView does not “short circuit”. Short circuit means to stop evaluating when the condition is true. QlikView will calculate both sums and then return one of those results.

Some people mistakenly believe that putting the above expression in a variable changes the behavior. This is incorrect. The if() evaluation is still done at the chart level and the performance impact remains.

A performant way to solve the requirement is to put each calculation into a separate chart Expression and use the  Conditional property to select the Expression.

An Expression with  a Conditional evaluating to False will not be calculated or displayed.

There is also the possibility of choosing the calculation in a variable, but you have to follow a few rules.

  1. The variable should return the string representation of the formula. Note the single quotes in the example below.
  2. The variable definition should begin with “=”. This causes the if() to be evaluated only once.
  3. In the chart Expression, reference to the variable should be made with $(). eg  $(vChooseCalc)
=if(vWithCommission=1
    ,'sum({<CommissionFlag={1}>}
        SalesAmount - SalesAmount * CommissionRate)'
    ,'sum(SalesAmount)'
)

-Rob

For more performance tips, join us at the upcoming “Masters Summit for Qlikvew”  in San Francisco May 2015 where Oleg Troyansky presents an always enlightening session on Performance Tuning. Can’t make SF? Check out the other dates and locations on the website.

 

Performance Tip – Using Dual() and Chart Visual Cues

I recently diagnosed a slow Straight Table chart. The chart contained 100K+ rows. One column contained a complex expression that returned a Y/N string flag for the column. Something like:

=if(complex expression, ‘Y’, ‘N’)

They also wanted to set the background color of the cell, green for Y, red for N. So the Expression Background Color property repeated the same complex expression to assign a color:

=if(complex expression, green(), red())

I surmised the expression was being calculated twice for each row. I changed  the main expression to set a Dual().

=if(complex expression, dual(‘Y’,1), dual(‘N’,0))

The chart cell still displays the Y/N text. But now I could use 1 and 0 values on the Visual Cues pane and eliminate the Background Color expression entirely. Much faster!

-Rob

Speed up Script Development with BUFFER

A BUFFER prefix on a LOAD or SQL statement creates and maintains an automatic QVD for that statement. Subsequent executions of the LOAD/SELECT statement will read from the QVD, avoiding another (slower) trip to the database. A read from QVD is generally 5-10 times faster than fetching from database.

TranTab:
BUFFER LOAD 
TranID,
Amount,
CustomerID,
etc…
;
SQL SELECT * FROM TRANSACTIONS
;

On first execution, the SELECT will fetch rows from the database and the resulting TranTab will be stored in a specially named QVD on the local machine. On subsequent reloads, TranTab will automatically be loaded from the local QVD.

If you make a change to the TranTab LOAD/SQL statement, QV Reload will detect the change and fetch from the database again and update the local QVD.

During script development it’s not uncommon to perform a reload several times. You can greatly reduce the duration of a script run by adding BUFFER to your statements. Any script changes/adds you make will automatically invalidate that buffer and re-fetch from the database.

Don’t forget to remove the BUFFER keyword before moving to production!

You can read more about BUFFER and some optional parameters in the Qlikview Help.

-Rob

DROP FIELD Does Not Release All Space

During the “Performance Tuning” session at the Barcelona Masters Summit, Oleg Troyansky demonstrated using Document Analyzer to identify unused fields followed by DROP FIELD statements to remove those unused fields from the data model. Roland Vecera offered an interesting discovery.  DROP FIELD after a BINARY LOAD does not free the expected amount of memory.

For Example:
Binary dropfieldstest_dm.qvw;
DROP FIELD A,B;

Roland has found that a subsequent LOAD RESIDENT of each affected table is required to fully reduce disk and RAM consumption to the expected level.

A field in a QVW is represented by three storage structures:
1. Symbol table, which stores the unique values of the field.
2. Record pointers, a pointer on each table row to the symbol value.
3. State space, where field selections are tracked.

Based on testing and calculation, my interpretation is that in this scenario (BINARY LOAD/DROP FIELD), the Symbols and State space is released. However, the space occupied by the Record pointers is not released, i.e. the records are not rewritten. This may be a significant amount of space, particularly when a table contains many rows.

For most developers this will be an obscure issue. But for people tuning large applications, this may provide an “aha”moment.

Thanks Roland!