Lately I’ve been digging into an old Qlik performance question. How much impact, if any, does the order of Qlik data tables have on chart calc time? My experience is that for a chart or aggr() cube with a lot of dimension values, ordering of rows by dimension values can have a significant and measurable effect.
Mike Steedle of Axis Group blogged about the issue a couple of years ago. Mike’s post includes a useful subroutine to organize any table by a specific field.
I’ve added my own study and sample files on the topic in this QlikCommunity post.
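For readers who haven't seen Mike's post, the general shape of that kind of subroutine is a resident reload with an ORDER BY, followed by a drop and rename. This is only a minimal sketch of the pattern, not Mike's actual code; the table and field names are placeholders:

```
// Rebuild the Facts table with rows ordered by CustomerID.
// NoConcatenate prevents auto-concatenation with the identically
// structured source table.
Sorted:
NoConcatenate
LOAD * RESIDENT Facts
ORDER BY CustomerID;

DROP TABLE Facts;
RENAME TABLE Sorted TO Facts;
```

In a reusable subroutine the table and field names would be passed in as parameters.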
Mike and I are working together on the next update to Qlik Sense Document Analyzer. Mike is keen on analyzing the data model and making useful recommendations. One of the optimization questions we are studying is whether it is possible to make a solid recommendation on data table organization.
I’m curious to hear what others have discovered on the topic. Do you have any rules you follow in ordering table rows? Any thresholds or object/expression scenarios where you find it’s worth the trouble to manage the ordering?
8 thoughts on “Does Data Sort Order Impact Chart Calc Time?”
Interesting question. I guess the answer, as ever, must be, “it depends”!
If the data coming from the fact table is large and contiguous, probably the best way is to have the index reflect that. Just speculating though. I also guess that if the data will be accessed more randomly and in small pieces, then it may not matter as much.
I am of the mind that if the data is sorted in the script and not in the UI (with expressions), sorting shouldn’t really affect performance. I say this without any test to back up my thought.
There is a popular thread on the Qlik community about the impact of presorting on group by operations. The consensus was that the cost of the sort negated any improvement, but the group by itself was noticeably faster.
Thanks for the comment. “Group by” is the script analog for what I’m posting about here in charts.
Thanks for the shoutout, Rob. On a recent project, the QS Document Analyzer made it very easy to measure the results of test cases when searching for the sort field that would yield the best calculation results.
Hi, this is very useful information. I’m currently investigating how to optimize a model with a 500M row fact table and a 3M row intermediate dimension table. I’m experimenting with different options (joining the large dimension table to the fact table, concatenating, etc.). Aggregating records does not result in a lot of gain.
When a table has many keys, and charts in the dashboard actually use various keys, what is the recommendation on sorting? Sort on all the keys of the table? But which field first, second, etc.?
In my experience sorting on only the first field makes a difference. And then that will result in improvement only on charts that use that field as the first dim and only when there are many values.
I’m not absolutely sure about it, but I think the dimensional scope on which the aggregations are performed is built from the system tables (symbol tables). That means it doesn’t matter whether the fields come from a fact table or a dimension table, or how many records there are. Of course the shape of the data model and the size of the dataset will have a big impact on the execution time, but more relevant to this topic is the number of distinct values in the system tables, especially given that building this dimensional scope (which could be considered a virtual table on top of the data model) is to this day performed single-threaded (the aggregations applied to it are then performed multi-threaded).
The question now is whether a sorted field, or system table, leads to faster collection of the possible field values (as compared with the state tables). I don’t know how Qlik performs this task technically, and I must also admit that I have never tried to measure it. But I could imagine that reading the system tables, or picking the relevant values, in a more or less sequential way may save some hopping around in memory, reduce the latencies that hopping would probably cost, and so end up a bit faster than without sorting.
Regarding the posting David mentioned, I think there really is a benefit in aggregating sorted values versus unsorted values, assuming the effect is very similar in script and UI. The sorting itself, though, does not come without effort and cost, so it makes no sense to try to sort every field, but in some scenarios it could be the essential advantage. A quite simple and effective approach is to apply the wanted sorting within a pre-load of the field values, dropping the temp tables at the end of the script. The idea behind this method is quite similar to this approach: only loading the field values in the wanted order.
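The pre-loading idea described above might be sketched as follows (the field, table, and file names are made up for illustration): load the field values first, in the wanted order, so they are registered in the symbol table in that order, then drop the temp table once the real data has loaded:

```
// Register the values of SortField in the wanted order before
// the main load (example values only).
TmpOrder:
LOAD * INLINE [
SortField
A
B
C
];

Facts:
LOAD SortField, Amount
FROM facts.qvd (qvd);

// The temp table is no longer needed once the values are registered.
DROP TABLE TmpOrder;
```

In a real script the TmpOrder values would typically come from a sorted load of the source data rather than an inline list.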