Category Archives: Performance

Document Analyzer Numeric Field Recommendations

In the latest release (2.4) of QV Document Analyzer I’ve started the process of adding “Recommendations”, highlighting specific areas where your document could potentially be improved.

The recommendation included in this release is “Numeric field sizes”. Numeric fields that occupy more than 8 bytes of storage per value will be flagged. They will be highlighted on the Fields sheet and additional details will be provided on the Recommendation sheet. The potential savings from “fixing” the field are also shown.

So what is this 8 byte thing? Ideally, most numeric fields (which include timestamps) can be represented by 8 bytes of storage. In the Number format property or the script num() function you assign a formatting mask to be used for string representation. QV stores the 8 byte number and applies the mask at display time.

Occasionally, particularly when doing an un-optimized load from QVD, the data can wind up being saved as “Mixed” which means both the numeric and string representations are stored for each value. So you may see Symbol widths of 17, 20, 24 or even greater for this field.

In larger apps with many unique numeric values, I’ve found significant memory savings by restoring these Mixed fields to a numeric format.

The usual tuning and optimizing caveat applies here. Don’t spend effort recovering a few megabytes. You have better things to do. But if you are wrestling with the megabeast app, fixing the biggest of those fields may be worth the effort.

-Rob


How to Choose an Expression

This is a follow-on to my post “How Not to Choose an Expression”, which described the performance problem sometimes experienced in large apps when choosing one of several expressions.

I received a number of questions about the scalability of my suggested solution to define multiple Expressions using the Expression Conditional property. In this post I’ll present an alternative for when you have a large number of option combinations.

Before I dive in, an important comment about coding for performance. First, you should code for clarity and maintainability. If your document contains only a few million rows, it probably won’t matter if you use if() or an alternative technique. I’m fond of the Donald Knuth quote “Premature optimization is the root of all evil (or at least most of it) in programming”. The techniques presented in this post are meant to solve an identifiable performance problem. I wouldn’t bother implementing them unless I had a need.

Let’s use a scenario where calculations should reflect user selected options.

  • US Dollars or Local Currency  — USD | LC
  • Include VAT?  — Y|N

I’m only dealing with two options to keep my example manageable. You should be able to extend the concepts into many options.

The if() method chart Expression for our choice may look like this:

if(CurrencyType='USD' AND [Include VAT?]='N'
 ,Sum(SalesAmount_USD)
,if(CurrencyType='USD' AND [Include VAT?]='Y'
 ,Sum(SalesAmount_USD + VAT_USD)
,if(CurrencyType='LC' AND [Include VAT?]='N'
 ,Sum(SalesAmount_LC)
,if(CurrencyType='LC' AND [Include VAT?]='Y'
 ,Sum(SalesAmount_LC + VAT_LC)
))))

The [CurrencyType] field controls which field we will sum() and the [Include VAT?] field controls if the associated VAT field is included in the sum(). What’s the difference between the alternatives? Only the fields listed in the sum().

Our Document contains 100M rows and we’ve identified this expression as a bottleneck. What are some alternatives to make this perform better?

In my previous post, I discussed dividing this into four expressions and making a choice using the Expression Conditional property. I won’t repeat the details of it here.  In this case, I don’t want to create multiple expressions in the charts. What is another possible technique?

Start by loading a table that reflects the choice fields and the associated sum() fields.

This is an island table that is not linked to any other tables in our model.
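As a minimal sketch, assuming the field names used in the if() expression above, the island table could be loaded inline as follows. The table name ExprChoices is hypothetical, and SalesExprField holds the text that will later be placed inside the Sum().

```qlikview
// Island table: one row per combination of user options.
// ExprChoices is a hypothetical table name. None of these fields
// should exist in the rest of the data model, so no link is created.
ExprChoices:
LOAD * INLINE [
CurrencyType,Include VAT?,SalesExprField
USD,N,SalesAmount_USD
USD,Y,SalesAmount_USD + VAT_USD
LC,N,SalesAmount_LC
LC,Y,SalesAmount_LC + VAT_LC
];
```

Each row pairs one combination of options with the fields to be summed, so adding a new option means adding rows here rather than nesting another if().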

Create Listboxes for [CurrencyType] and [Include VAT?]. Set the “Always One Selected” property in the Listboxes.  This will force the user to make selections and at any given time only one row will be possible in our table.

We will then reference the SalesExprField in our chart using Dollar Sign Expansion (DSE). DSE is performed before the expression is evaluated. Our chart expression is now:

Sum($(=SalesExprField))

The “=” within the DSE says to evaluate this as an expression. In other words, substitute in the value of the SalesExprField and then evaluate the Sum().

Do you want to see what DSE is actually substituting? In a Straight Table, clear the Label field. The substituted expression will be visible in the column heading.

If we are not sure that “Always One Selected” is guaranteed, we should use some type of aggregation function to ensure a single SalesExprField is selected. For example, to take the first possible value:

Sum($(=FirstSortedValue(DISTINCT SalesExprField,1)))

Instead of just the parameters to the sum() function, we could have included the entire expression in our island table, e.g. “Sum(SalesAmount_LC + VAT_LC)”. In that case our chart expression would be:

$(=SalesExprField)

What if you want to control options via variables instead of fields? Use a Set expression in the Only() function.

Sum(
$(=only({<CurrencyType={$(vCurrencyType)},[Include VAT?]={$(vIncludeVAT)}>}SalesExprField))
)

If you’ve read this far, you are probably wondering “Is there a downloadable example?” Yes, it can be found here: Qlikview Cookbook: How to Choose an Expression.

-Rob

Want more performance tips? Come see Oleg Troyansky’s Performance Tuning session and additional tips from other presenters at the “Masters Summit for Qlikview”.


How Not to Choose an Expression

We sometimes have a requirement to select between two or more expressions dependent on user selections or some other condition. For example, this chart Expression:

if(vWithCommission=1
 // Calculate with Commission
 ,sum({<CommissionFlag={1}>}
     SalesAmount - SalesAmount * CommissionRate)
 // Calculate without Commission
 ,sum(SalesAmount)
 )

The if() will select one or the other sum() expression based on the value of the vWithCommission variable. Because a variable is being tested, not a row value, only one sum() will be chosen for the entire chart.

If performance is important, this is not a good way to implement the requirement. QlikView does not “short circuit”. Short circuiting means evaluating only the branch that the condition selects. QlikView will calculate both sums and then return one of those results.

Some people mistakenly believe that putting the above expression in a variable changes the behavior. This is incorrect. The if() evaluation is still done at the chart level and the performance impact remains.

A performant way to solve the requirement is to put each calculation into a separate chart Expression and use the  Conditional property to select the Expression.

An Expression with  a Conditional evaluating to False will not be calculated or displayed.

There is also the possibility of choosing the calculation in a variable, but you have to follow a few rules.

  1. The variable should return the string representation of the formula. Note the single quotes in the example below.
  2. The variable definition should begin with “=”. This causes the if() to be evaluated only once.
  3. In the chart Expression, reference to the variable should be made with $(), e.g. $(vChooseCalc)

=if(vWithCommission=1
    ,'sum({<CommissionFlag={1}>}
        SalesAmount - SalesAmount * CommissionRate)'
    ,'sum(SalesAmount)'
)

-Rob

For more performance tips, join us at the upcoming “Masters Summit for Qlikview” in San Francisco May 2015 where Oleg Troyansky presents an always enlightening session on Performance Tuning. Can’t make SF? Check out the other dates and locations on the website.

Performance Tip – Using Dual() and Chart Visual Cues

I recently diagnosed a slow Straight Table chart. The chart contained 100K+ rows. One column contained a complex expression that returned a Y/N string flag for the column. Something like:

=if(complex expression, 'Y', 'N')

They also wanted to set the background color of the cell, green for Y, red for N. So the Expression Background Color property repeated the same complex expression to assign a color:

=if(complex expression, green(), red())

I surmised the expression was being calculated twice for each row. I changed  the main expression to set a Dual().

=if(complex expression, dual('Y',1), dual('N',0))

The chart cell still displays the Y/N text. But now I could use 1 and 0 values on the Visual Cues pane and eliminate the Background Color expression entirely. Much faster!

-Rob


Speed up Script Development with BUFFER

A BUFFER prefix on a LOAD or SQL statement creates and maintains an automatic QVD for that statement. Subsequent executions of the LOAD/SELECT statement will read from the QVD, avoiding another (slower) trip to the database. A read from QVD is generally 5-10 times faster than fetching from the database.

TranTab:
BUFFER LOAD 
TranID,
Amount,
CustomerID,
etc…
;
SQL SELECT * FROM TRANSACTIONS
;

On first execution, the SELECT will fetch rows from the database and the resulting TranTab will be stored in a specially named QVD on the local machine. On subsequent reloads, TranTab will automatically be loaded from the local QVD.

If you make a change to the TranTab LOAD/SQL statement, QV Reload will detect the change and fetch from the database again and update the local QVD.

During script development it’s not uncommon to perform a reload several times. You can greatly reduce the duration of a script run by adding BUFFER to your statements. Any script changes/adds you make will automatically invalidate that buffer and re-fetch from the database.

Don’t forget to remove the BUFFER keyword before moving to production!

You can read more about BUFFER and some optional parameters in the Qlikview Help.
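As a brief sketch of those optional parameters, the two options I am aware of are "stale after" and "incremental". The table names, the log file name and the 12 hour limit below are assumptions chosen for illustration.

```qlikview
// Reuse the buffered QVD for up to 12 hours, then fetch from the database again.
TranTab:
BUFFER (stale after 12 hours)
SQL SELECT * FROM TRANSACTIONS;

// For a text file that only grows by appended records,
// read just the records added since the previous reload.
// AppLog.txt is a hypothetical file name.
LogTab:
BUFFER (incremental)
LOAD * FROM AppLog.txt (txt, utf8, embedded labels, delimiter is ',');
```

The stale option is handy when the source data changes during a long development session and you want the buffer to refresh itself occasionally.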

-Rob


DROP FIELD Does Not Release All Space

During the “Performance Tuning” session at the Barcelona Masters Summit, Oleg Troyansky demonstrated using Document Analyzer to identify unused fields, followed by DROP FIELD statements to remove those unused fields from the data model. Roland Vecera offered an interesting discovery: DROP FIELD after a BINARY LOAD does not free the expected amount of memory.

For example:
Binary dropfieldstest_dm.qvw;
DROP FIELD A,B;

Roland has found that a subsequent LOAD RESIDENT of each affected table is required to fully reduce disk and RAM consumption to the expected level.
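A minimal sketch of that workaround, as I read it, for a single table. The table name Facts is hypothetical; the same three statements would be repeated for each table that contained a dropped field.

```qlikview
Binary dropfieldstest_dm.qvw;
DROP FIELD A, B;

// Rewrite the records of an affected table (hypothetical name: Facts).
// NOCONCATENATE prevents the copy from being appended to the original table.
Facts_tmp:
NOCONCATENATE LOAD * RESIDENT Facts;

DROP TABLE Facts;
RENAME TABLE Facts_tmp TO Facts;
```

The extra RESIDENT pass costs reload time, so it is probably only worth adding for tables with many rows.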

A field in a QVW is represented by three storage structures:
1. Symbol table, which stores the unique values of the field.
2. Record pointers, a pointer on each table row to the symbol value.
3. State space, where field selections are tracked.

Based on testing and calculation, my interpretation is that in this scenario (BINARY LOAD/DROP FIELD), the Symbols and State space are released. However, the space occupied by the Record pointers is not released, i.e. the records are not rewritten. This may be a significant amount of space, particularly when a table contains many rows.

For most developers this will be an obscure issue. But for people tuning large applications, this may provide an “aha” moment.

Thanks Roland!


Autonumber() Key Fields and Sequential Integer Optimization

Today let’s reinforce some QV data model principles for us old timers and introduce a few tricks for newbies to Qlikview.

#1. Keys are not data. Key Fields in a Qlikview model should serve the data modeler, and not be used by chart Designers as data.

Consider two tables, OrderHeader (one row for each Order) and OrderDetails (one row for each Order Line), linked together by the field OrderID.

OrderID may be a value that you need to display in your charts. However, problems arise when you try to do things like count(OrderID). Which end of the connection should the expression count? It’s unreliable as discussed in detail here:
http://qlikviewnotes.blogspot.com/2010/01/best-way-to-count-keys-dont.html

The solution is to create a counter field on the table that represents the correct cardinality for the counter. If we are counting Orders, that would be the OrderHeader table.

In the LOAD of the OrderHeader table:

1 as OrderCounter

Part two of the recommendation is to isolate the key field so it is not mistakenly used as a data field. We do this by prefixing the key field name with a special character and SETing the QV system variable “HidePrefix” to that character.

SET HidePrefix='%';

In the LOAD of both OrderHeader and OrderDetails:
OrderID as %OrderID

Fields that begin with the HidePrefix value will not show up in:
–  Current Selections.
– Dimension or Expression property dialog (unless “Show System Fields” is checked).

Of course, the specific values of OrderID may be useful to display in UI charts. In that case we must preserve it as a data item in one and only one table. We will include it in the LOAD of the OrderHeader table. Our data model now looks like this:

OrderID is available as a data field, but appropriately only from the OrderHeader table.

OrderCounter is now available as a field such that
=sum(OrderCounter)
will yield the correct Order count.

Now we (the Data Modelers!) own those “%” key fields! They are ours; we told the UI designers explicitly that “it is not data”.

Part three, and a very important part indeed, is to autonumber() the key fields. Autonumber() is a Qlikview lookup function that translates parameter values into integers. The sequential integers returned by autonumber() will reduce the RAM requirements and increase the linkage efficiency as detailed here:
http://qlikviewnotes.blogspot.com/2008/05/memory-sizes-for-data-types.html
and
http://community.qlikview.com/blogs/qlikviewdesignblog/2012/11/20/symbol-tables-and-bit-stuffed-pointers

Admittedly, it gets a bit geeky. Bottom line, here is what you want to do for your keys:
autonumber(OrderID, '%OrderID') as %OrderID

The autonumber() function converts values to sequential integers. The second parameter, '%OrderID', names a separate counter for this key. It is important when more than one key is autonumbered in your script, so that each key gets its own sequence of integers.
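Putting the three parts together, the two LOAD statements might look like the sketch below. The QVD file names and the fields OrderDate, ProductID and Quantity are assumptions added for illustration.

```qlikview
SET HidePrefix='%';

OrderHeader:
LOAD
    AutoNumber(OrderID, '%OrderID') as %OrderID,  // hidden key, sequential integers
    OrderID,                                      // kept as data in this table only
    1 as OrderCounter,                            // sum(OrderCounter) counts Orders
    OrderDate                                     // hypothetical data field
FROM OrderHeader.qvd (qvd);

OrderDetails:
LOAD
    AutoNumber(OrderID, '%OrderID') as %OrderID,  // same counter name, so the keys match
    ProductID,                                    // hypothetical data fields
    Quantity
FROM OrderDetails.qvd (qvd);
```

Both tables pass the same second parameter to AutoNumber(), which is what makes a given OrderID map to the same integer on each side of the link.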

To summarize:
#1. Keys are not data.
#2. Isolate keys using the “SET HidePrefix=x;” variable. Establish counter fields on the appropriate table.
#3. Use the Autonumber() function to convert key field values to sequential integers. This will minimize the memory footprint of the application and improve the efficiency of cross table look-ups.

My friend and colleague Barry Harmsen, author of QlikView 11 for Developers, who is a much wiser data modeler than I, will be discussing QV data modeling in depth with me at the Masters Summit for Qlikview in Europe this October. I hope you can join us!


Super Fast Method to Retrieve QVD High Value

Delta Load, sometimes called Incremental Load, is the technique of pulling only changed or new rows from a database and then merging those rows with a master QVD. The Delta Load pattern follows these general steps:

1. Determine high water mark (“last reload”)
2. Build WHERE predicate in syntax of target DB.
3. SQL SELECT delta rows.
4. Merge delta rows with QVD.
5. If Deletes, INNER JOIN entire set of Keys from DB with QVD
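The steps above can be sketched roughly as follows. This assumes a unique key field named PrimaryKey (a hypothetical name) and a variable vMaxModified that already holds the high water mark from step 1; the exact WHERE syntax depends on your database.

```qlikview
// Steps 2-3: fetch only the new or changed rows.
Facts:
SQL SELECT *
FROM mytable WHERE LastModified >= '$(vMaxModified)';

// Step 4: merge with the existing QVD, keeping only rows
// whose key was not already loaded in the delta.
CONCATENATE (Facts)
LOAD * FROM Facts.qvd (qvd)
WHERE NOT Exists(PrimaryKey);

// Step 5 (only if rows can be deleted in the source):
// keep just the keys that still exist in the database.
INNER JOIN (Facts)
SQL SELECT PrimaryKey FROM mytable;

STORE Facts INTO Facts.qvd (qvd);
```

Loading the delta rows first matters: Exists() then rejects the older copy of any row that has been modified.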

The first step is to determine what cutoff value — delta point — do we want to pass in the SQL SELECT WHERE clause to identify new rows. This value is usually the highest value in the existing QVD.

The most robust and reliable method for step one is loading with max() from the existing QVD. For example:
LOAD max(LastModified) as MaxModified FROM myqvd.qvd (qvd);

This works well, but reading the entire QVD can take a very looooong time for a large QVD.

A much faster method is to aggregate the max value for the Delta rows only and then save that value on the first row of the QVD. In subsequent delta loads, only the first row of the QVD is read to get the value. This is extremely fast and is not affected by QVD size. For example:

Facts:
SQL SELECT *
FROM mytable WHERE LastModified >= '$(vMaxModified)';

JOIN (Facts)
LOAD max(LastModified) as HighWaterValue
RESIDENT Facts;
// Do QVD Merge…and STORE Facts INTO Facts.qvd

Retrieve value later with:

FIRST 1 LOAD HighWaterValue FROM Facts.qvd (qvd);

The “HighWaterValue” field will contain a relatively small number of values and will therefore have a negligible impact on the size of the QVD. You will of course have to create the field the first time before trying to load it with the “FIRST 1…”.
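To feed the retrieved value back into the next SELECT, one option is to read it into a variable with Peek(). This is a sketch only: the temporary table name HighWater and the timestamp format string are assumptions, and the format should match whatever your database expects.

```qlikview
// Read only the first row of the QVD into a temporary table.
HighWater:
FIRST 1 LOAD HighWaterValue FROM Facts.qvd (qvd);

// Peek() reads row 0 of that table; Timestamp() formats the number
// as text for use in the SQL WHERE clause (format is an assumption).
LET vMaxModified = Timestamp(Peek('HighWaterValue', 0, 'HighWater'), 'YYYY-MM-DD hh:mm:ss');

DROP TABLE HighWater;
```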

If you are using Qlikview Components (QVC) V8+ for delta loading, you can enable this high performance strategy by setting the variable:

SET Qvc.Loader.v.StoreMaxModFieldValue = -1;

QVC will automatically create a high water field named Qvc.MaxModFieldValue and detect the availability of the field on subsequent loads. There is no need to pre-create the field before turning on this feature.

The technique is part of the Advanced Scripting material I’ll be presenting at the upcoming Masters Summit for Qlikview in Europe this October. I’ll be discussing more about Delta Loads, including strategies for merging large QVDs and tables with non-unique keys. Review the full agenda here and decide if the Masters Summit might help take your Qlikview skills to the next level.


Document Compression

Today I offer up a discussion of Qlikview “compression”. That is, the Qlikview features that make overall data get smaller, and in some cases, larger.

Should you care? In most cases no. But understanding what “knobs you can turn” can be a useful tool for capacity planning and application tuning. Let’s look at the practices and parameters that affect data size.

Script Execution: Data read from sources, such as database tables, is read into memory (RAM) by the script execution (reload) process. Duplicate values are reduced to the unique set of values for each column. A “Gender” column has only two values, “Female” and “Male”, so the storage required for this column is minimal compared to a column that has a wide range (cardinality) of values such as a timestamp. This is not really “compression” but rather what I call “de-duplication”.

The ratio of database storage to document storage is dependent on the data content as well as the use of common script techniques like separating timestamps into date and time fields. A typical database to document ratio is 10:1. For example, 2GB of database tables might require 200MB of document RAM.

QVW write to Disk: After reload, the Qlikview document (data tables and screen objects) is written from RAM to Disk as a *.qvw file. If compression is set on (default) for the document, the qvw will be compressed as it is written to disk. The compression results will vary depending on data content, but are typically in the range of 2-5 times. For example, a document that requires 200MB of RAM will require somewhere between 40MB and 100MB of Disk storage.

If compression is set to “None”, the document will be written to disk in the same format it existed in RAM and will occupy the same storage on disk as it utilized in RAM.

The Compression option for each Document is set in the Document Properties, General tab. The default compression for new documents is defined in the User Settings, Save tab.

The compression option will of course impact the amount of disk storage used. But it also affects the amount of time it takes to read or write a qvw. I find that for most documents, an uncompressed document will write and read significantly faster than a compressed document. Some documents, especially large ones with high compression ratios, will read faster if compressed. The other factor is speed of the disk being used – local disk or network disk.

I typically do my development with compression off and then do a timing test with both options before migrating to the server.

QVW read from Disk: The *.qvw is loaded to RAM by a developer or on the Server by a user session. The amount of RAM required is the uncompressed size, regardless of whether compression was used to write the *.qvw to disk. As discussed in the previous section, my experience is that uncompressed documents read from a local disk typically load faster, but this is not always true and is worth testing on large documents.

What is the compression factor for QVD files?

Zero.

A QVD file contains the physical representation of an in-memory Qlikview Table. This “RAM image” format is what allows an optimized QVD load to be so quick. The physical blocks of disk are read directly into Qlikview RAM, “ready to go”. Because QVD is the RAM image, there is no compression.

A QVD read with an optimized load will require the same RAM size as its size on disk (1:1). A QVD read with an un-optimized load may require significantly more RAM, due to some numeric fields being converted to strings. The expansion is typically about 2:1 but varies considerably.
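As an illustration of the difference, here are two alternative ways to load the same QVD. The file and field names are hypothetical. As I understand it, a load stays optimized when fields are only selected or renamed, and becomes un-optimized as soon as a field is transformed.

```qlikview
// Alternative 1: optimized. Fields are only selected and renamed.
Sales:
LOAD
    OrderID,
    Amount as SalesAmount
FROM Sales.qvd (qvd);

// Alternative 2: un-optimized. The calculation forces every row
// to be unpacked and processed.
// Sales:
// LOAD
//     OrderID,
//     Amount * 1.2 as GrossAmount
// FROM Sales.qvd (qvd);
```

The reload progress window reports “qvd optimized” for the first form, which is a quick way to check which kind of load you are getting.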

Here is a summary of the various “compression points” and typical results.
Source       | Destination  | Ratio | Example Result | Notes
-------------|--------------|-------|----------------|------------------------
Source DB    |              |       | 2GB            | Raw Data
Source DB    | Document RAM | 10:1  | 200MB          | Data de-duplication
Document RAM | QVW Disk     | 3:1   | 67MB           | Save Compression=High
Document RAM | QVW Disk     | 1:1   | 200MB          | Save Compression=None
QVW Disk     | Document RAM | 1:3   | 200MB          | Save Compression=High
QVW Disk     | Document RAM | 1:1   | 200MB          | Save Compression=None
Document RAM | QVD Disk     | 1:1   | 200MB          | QVD always uncompressed
QVD Disk     | Document RAM | 1:1   | 200MB          | Optimized load
QVD Disk     | Document RAM | 1:2   | 400MB          | Non-Optimized load

If your documents are small and you are not experiencing performance issues, worry about none of this.

Compressed documents occupy less disk space and their smaller size makes them easier to manage for moving, backup, etc.

If you are trying to get a document to load faster, try turning off document compression and benchmark your results. Consider the type of disk when making this decision. Compression may be more important in a network storage environment where reducing the amount of data transferred is a significant performance factor.

It’s important to understand that the document compression option has no impact on RAM usage. It only impacts the amount of data read and written to disk.

Memory sizes for data types

An earlier post of mine, “When less data means more RAM”, discussed the ways in which storage (“Symbol” space) needed for field values can increase depending on how a field is loaded or manipulated. This generated some followup questions on the QlikCommunity forum about the optimal storage sizes for fields of various data types.

What’s presented below is information gleaned from the documentation, QT Support and experimentation. The numbers come from the document memory statistics file. I hope someone from QT will help me correct any errors.

QV fields have both an internal and external representation. There is a video “Datatype Handling in Qlikview” available on QlikAcademy that explores this subject. This post is concerned with the internal storage of fields.

Numbers

I’ve found that the storage size appears to be related to the number of total digits. Storage size in bytes, for various digit ranges:

1-10 digits, size=4
11 or more digits, size=13

The above sizes assume that the internal storage format is numeric, which is usually the case if loading from a database. Numbers loaded as text such as from a text file or inline, may be stored as strings which will occupy different sizes.

Dates, Times and Timestamps

Different Database systems provide various degrees of precision in timestamps and I assume the ODBC driver is also involved with the exact value provided to QV during the load. QV times are the fractional part of a day, using up to 9 digits to the right of the decimal point.

– Best size for a Date, 4 bytes.
– Best size for a full Time, 13 bytes.
– Best size for a full Timestamp, 13 bytes.

These sizes can increase when the field is manipulated. Want to get the date portion of a timestamp? Don’t use

date(aTimestamp)

date() is a formatting function; it doesn’t “extract” the underlying date portion. In many cases, it actually increases storage size because the result may be a string. Instead, use

floor(aTimestamp)

This will produce a 4 byte integer result.

A common technique for reducing the memory footprint of timestamps is to separate the timestamp into two fields, integer date and fractional time. You can further reduce the number of unique time values by eliminating the hundredths of seconds, or even eliminating the seconds if your application is ok with minute precision.
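A minimal sketch of that split, truncating the time to minute precision. The table, file and output field names are hypothetical; aTimestamp is the field from the examples above.

```qlikview
Events:
LOAD
    Floor(aTimestamp)               as EventDate,  // whole days: integer date number
    Floor(Frac(aTimestamp), 1/1440) as EventTime   // fraction of a day, truncated to the minute
FROM Events.qvd (qvd);
```

With minute precision EventTime can hold at most 1,440 distinct values regardless of how many rows are loaded. Both fields are plain numbers here; a display format can be assigned in the Number format property rather than in the script.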

Strings

Thanks to QT support for providing this detail on Strings.

“The representation is that each symbol has a pointer (4/8 bytes on 32/64-bit platform) + the actual symbol space. This space is the number of bytes (UTF-8 representation) + 2 (1 is a flag byte and 1 is a terminating 0) + 0, 4 or 8 bytes that store the numeric representation of the field.”

So on the 32-bit version, a non-numeric string occupies 6 bytes more than the length of the string itself. A numeric string occupies 10 more bytes. For example:

“a” uses 7 bytes
“1” uses 11 bytes

The only way to reduce the string footprint is to reduce the number of unique values. This can be done by breaking the string into component parts if that makes sense in the application. For example, the first 3 characters of a 10 character product code may be a product class. Breaking the field into ProductClass and ProductNumber fields may reduce the number of unique values.

If the strings are keys that don’t need to be displayed, the autonumber() or autonumberhash128() functions can be used to transform the values to 4 byte integers. With these functions you can also get the “sequential integer optimization” which reduces the symbols space to zero.

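I’ve found that concatenating fields in autonumber like
autonumber(f1 & f2)
can sometimes produce false duplicates. One way this can happen is when different field pairs concatenate to the same string: f1='1', f2='23' and f1='12', f2='3' both yield '123'. Better to instead use autonumberhash128 like
autonumberhash128(f1, f2)
This seems to always produce correct results.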

Sequential Integer Optimization

For each field, QV maintains both a Symbol table — the unique values of a field — and a State array that tracks which values are selected. If the symbol values are consecutive integers, a very clever optimization takes place. The Symbol space is eliminated and the State array is used to represent both selection state and value. This is a very beneficial effect of using the autonumber functions.

The values need not begin at zero for the optimization to take place; they only need to be consecutive. A set of 5000 consecutive dates will occupy no Symbol space. Take one date out of the middle and the storage reverts to the standard 4 bytes for each date.

It’s not always necessary to be concerned about memory usage. But when it is, I hope this information proves useful.
