Category Archives: Performance

Does Data Sort Order Impact Chart Calc Time?

Sep 19, 2018 Rob Wunderlich 8 Comments

Lately I’ve been digging into an old Qlik performance question. How much impact, if any, does the order of Qlik data tables have on chart calc time? My experience is that for a chart or aggr() cube with a lot of dimension values, ordering of rows by dimension values can have a significant and measurable effect.

Mike Steedle of Axis Group blogged about the issue a couple of years ago. Mike’s post includes a useful subroutine to organize any table by a specific field.

I’ve added my own study and sample files on the topic in this QlikCommunity post.

Mike and I are working together on the next update to Qlik Sense Document Analyzer. Mike is keen on analyzing the data model and making useful recommendations. One of the optimization questions we are studying is whether it is possible to make a solid recommendation on data table organization.

I’m curious to hear what others have discovered on the topic. Do you have any rules you follow in ordering table rows? Any thresholds or object/expression scenarios where you find it’s worth the trouble to manage the ordering?

-Rob

Performance, Scripting

AutoNumber vs AutoNumberHash128

Apr 15, 2018 Rob Wunderlich 16 Comments

Summary: AutoNumberHash128(A, B) runs about 30% faster than AutoNumber(A & '-' & B).

It’s a common practice to use the script AutoNumber() function to reduce the storage required for large compound keys in a Qlik data model. For example:

AutoNumber(A & '-' & B) as %KeyField

As a standard practice, we generally include a separator like '-' to ensure that '1' & '11' does not get confused with '11' & '1'.
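To see why the separator matters, here is a minimal sketch using a hypothetical inline table. The table name, field names and sample values are assumptions for illustration only.

```qlik
// Two different (A, B) pairs that collide when concatenated without a separator
KeyDemo:
LOAD
    A & B        as NoSeparatorKey,   // '111' for both rows
    A & '-' & B  as SeparatorKey      // '1-11' and '11-1'
INLINE [
A, B
1, 11
11, 1
];
```

Autonumbering NoSeparatorKey would assign both rows the same integer; SeparatorKey keeps them distinct.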

The AutoNumber process can add significant run time to a script with many rows.

I’ve always wondered what the AutoNumberHash128() function was good for.

AutoNumberHash128(A,B) as %KeyField

This function first hashes A & B and then autonumbers the result. The end result is the same as the first example given using AutoNumber(). I find the AutoNumberHash128 syntax a bit simpler as a separator is not required.

What surprised me is that the AutoNumberHash128() function runs faster. Typically about 30% faster than a plain AutoNumber with a concatenated string parameter.

Why is it faster? The difference is in the function used to create the single value to be autonumbered. Hash128 is considerably faster than string concatenation (&).

AutoNumberHash128() can take any number of fields, but it does not have an “AutoId” parameter. The “AutoId” (second parameter) in AutoNumber() is recommended to ensure we get sequential integers when autonumbering more than one key field. Sequential integers are the most memory efficient storage for keys.

Don’t despair. AutoNumberHash128() will use the “default” AutoId. That is fine if you are autonumbering only one key. If you are doing more than one key, use AutoNumberHash128() for your largest — most rows — key and use AutoNumber() with AutoId for the rest. You will improve the script run time of one key.
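A sketch of that mixed approach might look like the following. The table, file and field names (Orders, OrderID, LineNumber, CustomerID, Region, Amount) are hypothetical, and which key counts as the "largest" depends on your own data.

```qlik
Orders:
LOAD
    // Largest key: hashed and autonumbered using the default AutoId
    AutoNumberHash128(OrderID, LineNumber) as %OrderLineKey,
    // Smaller key: plain AutoNumber with its own AutoId (second parameter)
    AutoNumber(CustomerID & '-' & Region, '%CustomerKey') as %CustomerKey,
    Amount
FROM Orders.qvd (qvd);
```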

Another possible tradeoff when you have many large keys is to use AutoNumberHash128 for all keys and forgo the sequential integer optimization. You will use only 8 bytes per key value which could be significantly less than the original string keys.

-Rob

Update 20 Sept 2022

Things have changed somewhat with the addition of the AutoNumber statement, which is the recommended method to autonumber keys. AutoId is no longer a problem. In my recent testing creating compound key fields, I still find that Hash128() is somewhat faster than the & operator. Here are the results. Option 4 is creating the key with the & operator and AutoNumber statement. Option 5 is creating the key with Hash128() and AutoNumber statement.
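For readers who have not used the AutoNumber statement, Option 5 could be sketched roughly like this. The table, file and field names are assumptions, and the exact script used in the test is not shown here.

```qlik
Facts:
LOAD
    Hash128(A, B) as %KeyField,   // build the compound key with a hash
    Amount                        // hypothetical measure field
FROM Facts.qvd (qvd);

// After the loads: replace the key values with sequential integers
AutoNumber %KeyField;
```

Option 4 would differ only in the key expression, using A & '-' & B in place of Hash128(A, B).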

Performance

Preceding Load Performance Update

Mar 27, 2018 Rob Wunderlich 17 Comments

Summary: Preceding load used to slow down your script, but no more. Beginning with QV Nov 2017, preceding load has no performance penalty.

I’ve posted several times about the elegance of preceding load. I’ve also written about how preceding load can make your script run significantly slower. Good news! Beginning with QV release Nov 2017 (12.20) the performance penalty has been eliminated.

To demonstrate the improvement, let me start with some test results from QV12.10 SR8, prior to the improvement.

Test 0, the first bar, indicates the time in seconds to perform an optimized load of a 20 million row QVD. Test 1, which follows, is loading the same QVD but with the addition of two new calculated fields in the same LOAD statement. The calculations are trivial, so the increase in elapsed time is mostly due to the loss of the optimized load.

Test 2 creates the same calculated fields using preceding load and you can see the dramatic increase in elapsed time. Test 5 adds a “LOAD *” to the preceding load stack and again shows a large increase in duration.

Tests 3, 4 & 6 repeat the same tests using Resident as the source instead of QVD. Once again, a significant increase in duration when preceding is used.
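As a rough illustration of the shape of these tests (not the actual test script), the two QVD variants might be written as follows. The file name and the calculated fields are hypothetical.

```qlik
// Variant A (like Test 1): calculated fields in the LOAD that reads the QVD
Facts:
LOAD
    *,
    Amount * 2       as DoubleAmount,   // hypothetical trivial calculation
    Year(OrderDate)  as OrderYear       // hypothetical trivial calculation
FROM Facts.qvd (qvd);

DROP TABLE Facts;   // discard variant A before timing variant B

// Variant B (like Test 2): same fields, created in a preceding load
Facts:
LOAD
    *,
    Amount * 2       as DoubleAmount,
    Year(OrderDate)  as OrderYear
;
LOAD * FROM Facts.qvd (qvd);
```

The Resident variants would replace the FROM clause with RESIDENT and the name of a table that is already loaded.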

I’ve been running this same test suite for several years across multiple QV releases, different machines and varying datasets. The results are generally the same.

The problem, as explained to me by Henric Cronström and confirmed by my own observations, is that the preceding load code uses only a single processing thread. So while tests 1 & 3 above will use multiple threads, tests 2,4,5,6 will use only a single thread. One way to think of this is not that preceding load runs slower, but that non-preceding load runs faster.

I never did understand why Preceding-Resident ran slower than Preceding-QVD, but I no longer care!

Here I add test results (in red) for QV Nov 2017 SR1 (QV 12.20).

You can see optimized QVD (test 0) is about the same. Adding calculated fields (test 1) is slightly better between releases.

What is really significant is there is no longer any increase when using preceding load. Further, Resident performs faster than QVD as I would expect. (Note both tests used an SSD drive).

This is all great news as there are many cases where preceding load can help make your code more maintainable and understandable. I hated to choose between clarity and performance.

What about Qlik Sense? I’ve confirmed that Feb 2018 Desktop exhibits the new “no-penalty” performance. I don’t know about previous releases.

No reason to fear preceding load!

-Rob

Performance, QlikView, Tools

Document Analyzer Batch Analysis

Jul 18, 2017 Rob Wunderlich 15 Comments

I’ve received several requests to provide a batch interface to the popular QV Document Analyzer tool that will provide for analyzing multiple QVWs with a single command. It’s now available in the QV Document Analyzer bundle available for download here.

The script is a windows cmd file. The usage from the command line is:

DaBatch.cmd somedir

where “somedir” is a directory of QVWs to be analyzed. Each QVW in the directory will be processed by Document Analyzer and the results will be saved for later review.

Before running, there are a number of configuration variables in DABatch.cmd you will want to review and modify as necessary.

REM *** Path to QV.exe executable ***

SET qvDir=C:\Program Files\QlikView\Qv.exe

This is the location of the QV Desktop executable. The provided value is the default location for most users and is typically appropriate as-is.

REM *** Path to DocumentAnalyzer.qvw. Note that v3.6 or later is required! ***

SET DaPath=DocumentAnalyzer_V3.6.qvw

Where is the Document Analyzer.qvw to be found? Note that DA V3.6 or later is required by DABatch.

REM *** Directory to store DocumentAnalyzerResults QVDs and QVWs. Will be created if it doesn't exist ***

SET DaResultsDir=C:\temp\MyDaResults

Specify the directory where analysis results will be saved. If this directory does not exist, it will be created.

REM *** Should the analyzer results be stored in a QVD (YES/NO)? ***
SET SaveResultsInQVD=YES

Do you want to save the DA results in a QVD for later analysis by the DaCompareTool.qvw? The default of “YES” is usually appropriate here. QVD result files include a timestamp so you will always get new files for each run. Change to “NO” if you don’t want result QVDs.

REM *** Should the analyzer results be stored in a QVW (YES/NO)? ***

SET SaveResultsInQVW=YES

If “YES”, a DA QVW will be saved for each analysis and named “DocumentAnalyzer_your_qvwname.qvw”. If a file exists with this name, it will be overwritten. If you don’t want individual DA QVWs, change this variable to “NO”.

After launching DABatch, you will receive one prompt:

Analysis Title? <ENTER> for 'Baseline'

The prompt is requesting a title to be assigned to the Result QVDs that will be consumed by DaCompareTool. To accept the default of “Baseline”, press <Enter>. Otherwise type a new value and press <Enter>.

If you have set “SET SaveResultsInQVD=NO” as a configuration option, the title value is irrelevant. (Perhaps I should not prompt in that case; next version?).

While the script is running, a Document Analyzer window will be launched for each QVW and progress messages displayed. It’s best to keep your hands off the keyboard to get proper timings. Good time to get that coffee.

When execution is complete you’ll see a summary message.

*****************************************************
Batch Analysis complete. 3 QVWs analyzed.
*****************************************************

You can now review each “DocumentAnalyzer_your_qvwname.qvw” file or load the result QVDs into DaCompareTool.qvw for comparative analysis.

Please let me know in the comments section if you can think of enhancements that support your use case for DA batch analysis.

-Rob

Performance

Dual Storage vs Dual Behavior

Mar 13, 2017 Rob Wunderlich 6 Comments

Summary: The Dual() function stores both string and numeric representations of a value. “Implied Duals” such as Dates, store only the numeric portion and apply the string mask as needed. In some circumstances such as un-optimized QVD loads, implied duals can get converted to “full duals” using storage unnecessarily.

In QlikView and Qlik Sense you can create a Dual field using the Dual() function such as:

if(ShipDate = OrderDate, Dual('Yes',1), Dual('No', 0)) as SameDayShip

Dual fields have both string and numeric representations and Qlik is smart about using the correct representation based on context.

In a listbox or filter, SameDayShip will show the string values as:

Yes
 No

We can also write expressions such as:

Sum(SameDayShip)

which will smartly and automatically use the numeric value of SameDayShip.

Internally, the values will be stored in the symbol table like this:

Y	e	s	1
N	o	0

The numeric portion, 1 or 0 in this case, will always occupy 8 bytes. The average symbol length will be 10.5 — (11 + 10) / 2 values. You can display the symbol length by using a tool like Document Analyzer.

What about Date() or Num() fields, which are also Dual fields? When properly scripted, these are what I call “Implied Dual fields”. They have dual behavior, but do not occupy the full dual storage.

Dates are represented as the number of days since Dec 30, 1899. Today’s date (March 12, 2017) number is 42806. A properly optimized date stores only the numeric value and does not store the string value. Instead, the format mask is stored once as an attribute of the field.

Format: M/D/YYYY

ShipDate

42804

42802

42800

On demand, when the string representation is required (like in a listbox) the format mask is applied. The symbol length in this case is always 8, only the numeric value.
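The two halves of a date, the stored number and the text produced by the mask, can be inspected with a couple of script lines. This is only a sketch; the variable names are made up for illustration.

```qlik
// Numeric value of a date: the part that is always stored
LET vDateNum  = Num(MakeDate(2017, 3, 10));   // 42804

// Text produced by applying a format mask to that number
LET vDateText = Date(42804, 'M/D/YYYY');      // 3/10/2017
```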

Sometimes, such as in an un-optimized QVD load, the field is converted to what I call a “full dual” (like the “SameDayShip” example) and both the string and numeric values are stored in the symbol table. This can greatly increase the storage used for the symbol table.

3/10/2017	42804
3/8/2017	42802
3/6/2017	42800

An example of an un-optimized load that will create the “full dual” representation:

LOAD
 DateField
 FROM Dates.qvd (qvd)
 Where Year(DateField) >= 2016;

In QlikView, you can “fix” this problem by going into the Document Properties, Number pane and changing the field format from “Mixed” to “Date” format. QV will immediately release the string storage.

Qlik Sense does not provide a Number Format pane, so you must apply corrections in the script like this:

LOAD
 Date(Num(DateField)) as DateField
 FROM Dates.qvd (qvd)
 Where Year(DateField) >= 2016;

To be fair, this is usually not a big deal for something like Dates, which have a relatively small number of values. It can become more significant with something like Timestamps or other numeric fields that have many unique values.

The “Recommendations” sheet of Document Analyzer identifies these “Numeric Size” opportunities and quantifies the memory savings if you were to apply a correction.

-Rob

Performance, Scripting

Q-On Courses in January

Dec 14, 2015 Rob Wunderlich Leave a comment

Just a heads up to get in your planning before taking a holiday break. I’ll be teaching two on-line courses in early January:

January 7 QlikView™ Document Performance Tuning

Learn how to measure and optimize the performance of your QlikView apps. By the end of the session, you will understand how the calculation process in QlikView works, and how data model, expression and chart design impact response times and resource usage.

You will come away with the skills to analyze your own apps and make them run faster.

January 6 QlikView™ Components Scripting Library

Speed up your QlikView™ development workflow by making use of the powerful QlikView Components (QVC) scripting library.

QVC can help you build your QlikView™ projects quicker and ensure a high level of quality in your scripts. The set of ready-made subroutines that QVC provides can be used to perform common script operations, implemented in a flexible way and incorporating best practices.

Hope to see you there!

-Rob

Performance

The Impact of Data Islands on Cache and CPU

Jun 10, 2015 Rob Wunderlich 18 Comments

Well that’s a wordy title isn’t it?

I’m in the midst of writing a new QlikView Document Optimization course to be delivered at Q-On Training. This work has reminded me of a not-so-obvious issue I sometimes see in Performance Tuning engagements with customers.

You might be thinking I’m going to write today about how heavy a calculation can be as a result of the cartesian product of disconnected fields in an expression. No…that’s not what I’m thinking of.

What I’m thinking about today is the impact of selecting a field that is not used by any expression on the sheet. For example, a Currency listbox may be present on the sheet. The Currency field is not connected (“Data Island”) to other tables in the model. Some, or none, of the objects on this sheet may reference that Currency field.

What happens when you click a Currency value? Everything on the sheet gets recalculated. E v e r y t h i n g. Whether it uses Currency or not. Why? Because the data has changed.

Since the data used in my chart has not changed, the results will be fetched from cache, right? Probably not. Let’s look at an example:

Dimension: Customer
Expression: Sum(Sales)

The current selection in Currency is “USD”. The chart has been calculated and the results stored in cache, available for speedy retrieval if the same expression is calculated over the same set of data.

Select “EUR” in Currency. The cached result will not be used even though no change has been made in the data used by the chart. Cache evaluation considers the entire data model, not just data referenced by the object. If you now select “USD”, the previously cached result will be used.

If your application is large enough that you consider this behavior to be a problem, a leaner alternative for the Currency example is to use a variable. When a variable changes, only objects that reference the variable get recalculated. Another great alternative is to put the Currency listbox in a different Alternate State.
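As a rough sketch of the variable approach, assume a hypothetical variable vExchangeRate that the user changes through an Input Box or button instead of selecting a value in a Currency listbox. A chart that needs the rate would reference the variable in its expression:

```qlik
// vExchangeRate is a hypothetical variable holding the rate for the chosen currency
Sum(Sales) * $(vExchangeRate)
```

Charts that only use Sum(Sales) do not reference the variable, so changing it should leave them alone.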

Let’s look at another case, the idea of a “universal listbox” that has been published by several authors. I think it’s a very cool idea and I use it myself for data exploration. The common idea is that you let the user pick any field and then make selections in that field. This can be built using the system fields $Table and $Field.

Let’s consider the app is idle and all the sheet objects are occupying what I’ll term “relative cache slot #1” – the first cache entry for each object. Now we’ll use the universal listbox.

1. Select a $Table value. Everything is recalculated, consuming CPU resources and the results are stored in “cache slot #2”.

2. Select a $Field value, “ProductGroup”. Everything is recalculated, consuming CPU resources and the results are stored in “cache slot #3”.

3. Make some selections in “ProductGroup”. Everything is recalculated, consuming CPU resources and the results are stored in “cache slot #4”. This result, with the new data, is what the user is interested in.

We have used three times the CPU and cache resources to achieve the end result. A leaner alternative would be to place the universal listbox on its own sheet, thus avoiding extra recalculations. You can move back and forth with buttons to make it feel like it’s integrated with each sheet.

I don’t mean to discourage the use of these data island techniques. They can be incredibly useful to the #1 goal: getting accurate information to your business in a timely and cost-efficient manner. However, when you are dealing with a specific performance concern in larger apps, this is an area you may want to examine and optimize.

-Rob

Watch the Q-On Training site or subscribe to our Q-On newsletter to be notified when the new Performance course is available. For more performance related discussions and tips, join me at the Masters Summit for QlikView Sept 23-25 NYC or 29 Sep-1 Oct Copenhagen.

Performance, Scripting

Better Calendar Scripts

May 28, 2015 Rob Wunderlich 62 Comments

TLDR: A new Qlik Sense training video uses a tired old-school Master Calendar script. I propose that updated training materials should use an updated script.

I just watched a new video by Michael Tarallo of Qlik titled “Understanding the Master Calendar – Qlik Sense and QlikView“. Before I pick on the contents of this particular video, I want to credit Michael for producing many excellent training videos that are worth watching and learning from. I highly recommend them.

The video does a great job of explaining the need for and function of a Master Calendar in your data model. It then goes on to show an actual script.

I can’t discuss Master Calendar without expressing disappointment that Calendar generation is not yet a builtin function in Sense. Something like QlikView Components (QVC) does with the single script line:

CALL Qvc.CalendarFromField('OrderDate');

On to the script used in this new video. I’ll reproduce the entire script below and then comment on the techniques used and suggest some more “modern” approaches.

The video script is similar to the script used in the current QlikView Developer course. I acknowledge that this script works as is and produces correct results. But I don’t think it should be taught to newbies as good scripting. Here’s the script from the video:

Obsolete Code

1. Why is this field created and where is it used?

It’s not used. It’s left over from a very old version of the exercise and it doesn’t serve any purpose.

2. Why are we sorting the table? Is this statement useful?

Even if I could think of a good reason why the Calendar table should be in order, it already is in this order because the TempCalendar was generated in a loop. Statement unnecessary.

Inefficient Code

Loading a Resident table can be very slow for a large table.

Experienced scripters use the FieldValues array instead.

What’s the difference? FieldValues only reads the distinct values of a field — maybe a thousand or so for several years of dates. Resident reads every row of the table. For a 10M row fact table, that’s 10M reads and the time scales up linearly. The difference can be dramatic.
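To make the contrast concrete, here is a hedged sketch of both ways to find the min and max dates. The generated Orders table and its size are assumptions for the demo; the FieldValue() variant follows the pattern used in the full script at the end of this post.

```qlik
// Hypothetical sample data: 1000 rows but only 30 distinct dates
Orders:
LOAD Date(MakeDate(2017, 1, 1) + Mod(RecNo(), 30)) as OrderDate
AUTOGENERATE 1000;

// Variant A: Resident reads all 1000 rows
MinMax:
LOAD
    Min(OrderDate) as mindate,
    Max(OrderDate) as maxdate
RESIDENT Orders;

DROP TABLE MinMax;   // discard variant A before trying variant B

// Variant B: FieldValue() reads only the 30 distinct values
MinMax:
LOAD
    Min(FieldValue('OrderDate', RecNo())) as mindate,
    Max(FieldValue('OrderDate', RecNo())) as maxdate
AUTOGENERATE FieldValueCount('OrderDate');
```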

Error Prone and Extra Work

Peek(), used on lines 13 & 14, is one of those functions that fails silently. That is, if you misspell a field or table, you won’t get a script error. Misspelling a variable will also not generate a script error. Maybe. Or maybe not. Or maybe you will get a Calendar that starts in year 1899. Depends on which line you make the spelling error on. If your misspelling does result in a script syntax error, it will be downstream from where you created the problem. There are multiple ways to mess this one up and some very curious potential results.

Don’t forget to DROP those temp tables on lines 15 and 35.

And those varMinDate and varMaxDate variables really should be cleared as well.

You can avoid all the syntax traps and extra cleanup by coding this as a Preceding Load. Here’s the same script written as a Preceding Load:

Nothing to remember (or forget) to clean up. If you misspell a fieldname, you will get an understandable error at the right place. This is the calendar script I wish we would provide to newcomers.

Of course if you’ve attended the Masters Summit for QlikView, you’ve learned all about FieldValues, Preceding Loads and more. If not, attend a summit this Fall in NYC or Copenhagen.

QVC users don’t even get out of bed to generate Calendars. But they know that all that best practice stuff is happening under the covers. If you want to learn more about QVC, join my online class June 4 or a future class at Q-on.bi.

-Rob

Entire script suitable for copying:

MasterCalendar: 
Load 
 TempDate AS OrderDate, 
 week(TempDate) As Week, 
 Year(TempDate) As Year, 
 Month(TempDate) As Month, 
 Day(TempDate) As Day, 
 'Q' & ceil(month(TempDate) / 3) AS Quarter, 
 Week(weekstart(TempDate)) & '-' & WeekYear(TempDate) as WeekYear, 
 WeekDay(TempDate) as WeekDay 
;

//=== Generate a temp table of dates === 
LOAD 
 date(mindate + IterNo()) AS TempDate
 ,maxdate // Available to the LOAD above (e.g. for InYearToDate()), but not kept 
WHILE mindate + IterNo() <= maxdate;

//=== Get min/max dates from Field ===/
LOAD
 min(FieldValue('OrderDate', recno()))-1 as mindate,
 max(FieldValue('OrderDate', recno())) as maxdate
AUTOGENERATE FieldValueCount('OrderDate');

Performance

Document Analyzer Numeric Field Recommendations

Apr 2, 2015 Rob Wunderlich 4 Comments

In the latest release 2.4 of QV Document Analyzer I’ve started the process of adding “Recommendations”, highlighting specific areas where potential improvement may be made to your document.

The recommendation included in this release is “Numeric field sizes”. Numeric fields that occupy more than 8 bytes of storage per value will be flagged. They will be highlighted on the Fields sheet and additional details will be provided on the Recommendation sheet. The potential savings from “fixing” the field is also shown.

So what is this 8 byte thing? Ideally, most numeric fields (which includes timestamps) can be represented by 8 bytes of storage. In the Number format property or the script num() function you assign a formatting mask to be used for string representation. QV stores the 8 byte number and applies the mask at display time.

Occasionally, particularly when doing an un-optimized load from QVD, the data can wind up being saved as “Mixed” which means both the numeric and string representations are stored for each value. So you may see Symbol widths of 17, 20, 24 or even greater for this field.

In larger apps with many unique numeric values, I’ve found significant memory savings by restoring these Mixed fields to a numeric format.
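One way to restore such a field in script, by analogy with the common Date(Num()) pattern for dates, might be the following sketch. The table, file and field names are hypothetical, and whether the fix helps should be confirmed by checking the symbol width in Document Analyzer afterwards.

```qlik
Events:
LOAD
    // Strip the stored string and re-apply a format to the pure number
    Timestamp(Num(EventTime)) as EventTime,
    EventType
FROM Events.qvd (qvd)
WHERE EventType = 'Login';   // the WHERE clause makes this load un-optimized
```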

The usual tuning and optimizing caveat applies here. Don’t spend effort recovering a few megabytes. You have better things to do. But if you are wrestling with the megabeast app, fixing the biggest of those fields may be worth the effort.

-Rob

Performance

How to Choose an Expression

Dec 19, 2014 Rob Wunderlich 8 Comments

This is a follow-on to my post “How Not to Choose an Expression” that described the performance problem sometimes experienced in large apps when choosing one of several expressions.

I received a number of questions about the scalability of my suggested solution to define multiple Expressions using the Expression Conditional property. In this post I’ll present an alternative for when you have a large number of option combinations.

Before I dive in, an important comment about coding for performance. First, you should code for clarity and maintainability. If your document contains only a few million rows, it probably won’t matter if you use if() or an alternative technique. I’m fond of the Donald Knuth quote “Premature optimization is the root of all evil (or at least most of it) in programming”. The techniques presented in this post are meant to solve an identifiable performance problem. I wouldn’t bother implementing them unless I have a need.

Let’s use a scenario where calculations should reflect user selected options.

US Dollars or Local Currency — USD | LC
Include VAT? — Y|N

I’m only dealing with two options to keep my example manageable. You should be able to extend the concepts into many options.

The if() method chart expression for our choice may look like this:

if(CurrencyType='USD' AND [Include VAT?]='N'
 ,Sum(SalesAmount_USD)
,if(CurrencyType='USD' AND [Include VAT?]='Y'
 ,Sum(SalesAmount_USD + VAT_USD)
,if(CurrencyType='LC' AND [Include VAT?]='N'
 ,Sum(SalesAmount_LC)
,if(CurrencyType='LC' AND [Include VAT?]='Y'
 ,Sum(SalesAmount_LC + VAT_LC)
))))

The [CurrencyType] field controls which field we will sum() and the [Include VAT?] field controls if the associated VAT field is included in the sum(). What’s the difference between the alternatives? Only the fields listed in the sum().

Our Document contains 100M rows and we’ve identified this expression as a bottleneck. What are some alternatives to make this perform better?

In my previous post, I discussed dividing this into four expressions and making a choice using the Expression Conditional property. I won’t repeat the details of it here. In this case, I don’t want to create multiple expressions in the charts. What is another possible technique?

Start by loading a table that reflects the choice fields and the associated sum() fields.

This is an island table that is not linked to any other tables in our model.
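A minimal sketch of such an island table, built from the four combinations in the if() expression above, could be loaded inline. The table name SalesExprOptions is an assumption; the field names come from the expressions in this post.

```qlik
// One row per option combination; SalesExprField holds the text to place inside Sum()
SalesExprOptions:
LOAD * INLINE [
CurrencyType, Include VAT?, SalesExprField
USD, N, SalesAmount_USD
USD, Y, SalesAmount_USD + VAT_USD
LC, N, SalesAmount_LC
LC, Y, SalesAmount_LC + VAT_LC
];
```

Because none of these field names exist in the other tables, the table stays unlinked.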

Create Listboxes for [CurrencyType] and [Include VAT?]. Set the “Always One Selected” property in the Listboxes. This will force the user to make selections and at any given time only one row will be possible in our table.

We will then reference the SalesExprField in our chart using Dollar Sign Expansion (DSE). DSE is performed before the expression is evaluated. Our chart expression is now:

Sum($(=SalesExprField))

The “=” within the DSE says to evaluate this as an expression. In other words, substitute in the value of the SalesExprField and then evaluate the Sum().

Do you want to see what DSE is actually substituting? In a Straight Table, clear the Label field. The substituted expression will be visible in the column heading.

If we are not sure that “Always One Selected” is guaranteed, we should use some type of aggregation function to ensure a single SalesExprField is selected. For example, to take the first possible value:

Sum($(=FirstSortedValue(DISTINCT SalesExprField,1)))

Instead of just parameters to the sum() function, we could have included the entire expression in our island table, eg “Sum(SalesAmount_LC + VAT_LC)”. In that case our chart expression would be:

$(=SalesExprField)

What if you want to control options via variables instead of fields? Use a Set expression in the Only() function.

Sum(
$(=only({<CurrencyType={$(vCurrencyType)},[Include VAT?]={$(vIncludeVAT)}>}SalesExprField))
)
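The variables referenced in that Set expression need values from somewhere. As a sketch, they could be given defaults in the script and then changed by the user through an Input Box or button; the default values shown are assumptions.

```qlik
// Default option values; the names match the Set expression above
SET vCurrencyType = USD;
SET vIncludeVAT   = Y;
```

With these values, the combination is USD with VAT, which corresponds to Sum(SalesAmount_USD + VAT_USD) in the if() version.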

If you’ve read this far, you are probably wondering “Is there a downloadable example?” Yes, it can be found here: Qlikview Cookbook: How to Choose an Expression.

-Rob

Want more performance tips? Come see Oleg Troyansky’s Performance Tuning session and additional tips from other presenters at the “Masters Summit for QlikView”.

Recipes for Qlikview Success