AutoNumber vs AutoNumberHash128

Apr 15, 2018 Rob Wunderlich 16 Comments

Summary: AutoNumberHash128(A, B) runs about 30% faster than AutoNumber(A &’-‘ & B).

It’s a common practice to use the script AutoNumber() function to reduce the storage required for large compound keys in a Qlik data model. For example:

AutoNumber(A & '-' & B) as %KeyField

As a standard practice, we generally include a separator like ‘-‘ to ensure ‘1’ & ’11’ does not get confused with ’11’ & ‘1’.

The AutoNumber process can add significant run time to a script with many rows.

I’ve always wondered what the AutoNumberHash128() function was good for.

AutoNumberHash128(A,B) as %KeyField

This function first hashes A & B and then autonumbers the result. The end result is the same as the first example given using AutoNumber(). I find the AutoNumberHash128 syntax a bit simpler as a separator is not required.

What surprised me is that the AutoNumberHash128() function runs faster. Typically about 30% faster than a plain AutoNumber with a concatenated string parameter.

Why is it faster? The difference is in the function used to create the single value to be autonumbered. Hash128 is considerably faster than string concatenation (&).

AutoNumberHash128() can take any number of fields, but it does not have an “AutoId” parameter. The “AutoId” (second parameter) in AutoNumber() is recommended to ensure we get sequential integers when autonumbering more than one key field. Sequential integers are the most memory efficient storage for keys.

Don’t despair. AutoNumberHash128() will use the “default” AutoId. That is fine if you are autonumbering only one key. If you are doing more than one key, use AutoNumberHash128() for your largest — most rows — key and use AutoNumber() with AutoId for the rest. You will improve the script run time of one key.

Another possible tradeoff when you have many large keys is to use AutoNumberHash128 for all keys and forgo the sequential integer optimization. You will use only 8 bytes per key value which could be significantly less than the original string keys.

-Rob

Update 20 Sept 2022

Things have changed somewhat with the addition of the AutoNumber statement, which is the recommended method to autonumber keys. AutoId is no longer a problem. In my recent testing creating compound key fields, I still find that Hash128() is somewhat faster than the & operator. Here’s the results. Option 4 is creating the key with the & operator and AutoNumber statement. Option 5 is creating the key with Hash128() and AutoNumber statement.

16 thoughts on “AutoNumber vs AutoNumberHash128”

DV says:

Apr 15, 2018 at 8:48 pm

Nice one, Rob. Good to know hashing is faster. Funnily, I have seen loads of developers using the Hash and Autonumber functions to join the fields and still keeping the original long-concatenated field. I see why they need it but keeping in the QVW file doesn’t make sense. Rather I would have it stored in the QVD file.

Reply
Herbert says:

Apr 16, 2018 at 4:00 pm

Hi Rob, great post.
I was recently was informed of the existence of the function AutoNumberHash256, which since then I started using to concatenate dimension tables to account for primary keys being changed in the source system tables resulting in missing associations with older rows in my fact table. For this the following approach was used:

// Transform Source Table

PreviousTable:
NoConcatenate
LOAD *
FROM [Table.QVD] (qvd);

STORE PreviousTable into Table_Previous.QVD (QVD);

NewTable:
NoConcatenate LOAD
AutonumberHash256(PK1, PK2, …….) as PrimaryKey,
*
FROM [Table_New.QVD] (qvd);

Concatenate (NewTable)
LOAD *
Resident PreviousTable
Where not Exists(PrimaryKey, AutonumberHash256(PK1, PK2, …….));

DROP Table PreviousTable;
DROP Field PrimaryKey From NewTable;

STORE NewTable into Table_New.QVD (QVD);

DROP Table NewTable;

QUESTION: What would be the advantage of AutonumberHash128 be over AutonumberHash256 or vice versa?

Reply
Tanel says:

Apr 18, 2018 at 11:15 am

Thanks for posting, Rob.
I Agree, AutoNumber() itself is fast but string concatenation slows it down.
A possible work-around is to create the concatentated strings beforehand, in Extract stage where incremental load is used. This way we can still use AutoNumber() in the final script, avoiding slow-down due to “string&string” processing.

When load speed is important, it’s worth to point out that any string concatenation in the load statement will slow down the processing – be it with or without AutoNumber().
I compared different functions (AutoNumber, Hash, RowNo, RecNo etc) some time ago – results published in the community:
https://community.qlik.com/thread/283671

Reply
Michael says:

May 9, 2018 at 5:31 pm

I was just recently curious about why autonumber was so slow. Did a google, and this came up right quick!

Reply
Clever Anjos says:

May 28, 2018 at 7:55 pm

Rob,
I´ve faced the same situation a time ago when I was testing those functions.
In my tests, Autonumber() with concatenaded fields were even worse than 30%, maybe 100% sometimes

In Brazilian Portuguese https://community.qlik.com/groups/qlikview-brasil/blog/2017/09/20/autonumber-x-autonumberhash128-x-autonumberhash256-em-chaves-compostas

Reply
Christian says:

Jun 7, 2018 at 2:03 pm

You can use the advantages of both methods simply by using:

AutoNumber(Hash128(A,B),’AutoId’).

Reply
Davd Dumas says:

Nov 26, 2018 at 2:36 pm

Rob,
I have always used autonumberhash128(). I read that autonumber() is only good (gauranteed) in the qvw that it is used in, whereas (I believe that) autonumberhash128() produces an integer that be used accross multiple qvws. Example. In one qvd loader script autonumberhash128(a,b) will produce the same integer as autonumberhash128(a,b) in another qvd loader script, as long as a and b are the same values. Is that correct?
Dave

Reply
1. Davd Dumas says:
  
  Nov 26, 2018 at 2:53 pm
  
  Just tested that autonumberhash128(a,b) does not produce (true hash key) integers that can be used accross multiple qvws, So forget my last post. It seems that autonumberhash128(a,b) is just more convenient to use and can perform better as Rob says.
  
  Reply
  1. Davd Dumas says:
    
    Nov 26, 2018 at 4:17 pm
    
    Also, in a single qvw, I did an autonumberhash128(a,b,c) and then an autonumberhash128((hash128(a,b,c))).
    
    I expected them to be the same but they were not. So, it seems that AutoNumberHash128() is not simply an autonumber() of a hash128().
    
    Reply
    1. Rob Wunderlich says:
      
      Nov 26, 2018 at 5:32 pm
      
      I think you were on the right track, but one small error. AutonumberHash128(Hash128(a,b,c,)) would re-hash the Hash128 result yielding a different value for Autonumber. What you may have meant to test was that “AutoNumber(Hash128(a,b,c)) = AutoNumberHash128(a,b,c)” which is true.
      
      Reply
      1. David A Dumas says:
        
        Nov 26, 2018 at 6:01 pm
        
        Yes, that makes sense. I will try the AutoNumber(hash128(a,b,c)) to confirm that is the same as AutoNumberhash128(a,b,c). Thanks Rob.
Davd Dumas says:

Nov 26, 2018 at 6:40 pm

Yes, they are produce the same numbers with your suggestion. So, now I am wondering if it always best to use AutoNumber(hash128()) , since you have the opportunity of specifiyng an AutoNumber ID as in AutoNumber(hash128(a,b,c),’MyID’).

While this may not prove as performant, it gives you the extra flexibility if you need that AutoNumber ID to guarantee it own unique set of integers.

Reply
Karl Pover says:

Sep 20, 2022 at 4:44 pm

How has the autonumber versus autonumberhash128 comparison changed with the addition of the AutoNumber script statement? I might think that the original issue with how fast string concatenation is performed still exists, but the ultimate convenience of using a single AutoNumber statement instead of multiple autonumberhash128() function makes this a viable solution only when 30% faster can be measured in minutes and/or reload time is critical.

Reply
1. Rob Wunderlich says:
  
  Sep 20, 2022 at 5:40 pm
  
  I always use the AutoNumber statement, as it’s incredibly fast and convenient. The AutoNumber statement will perform the same irrespective of how the key values were created. So the comparison is now “A&’-‘&B” vs Hash128(A,B). (This question came up at the Masters Summit this week). My recent tests conclude that Hash128 is about 15% faster, even more so if A and B are numeric values, I also find Hash128 to be easier syntax, and AutoId is no longer an issue if you use the AutoNumber statement. Downside to Hash128() is that you can’t see the individual key components and there could be a possibility of a key collision. I’ll update the post with test results using AutoNumber statement.
  
  Reply
  1. Karl Pover says:
    
    Sep 20, 2022 at 7:31 pm
    
    I think Barry has a great hashing article https://knowledge.axisgroup.com/wiki/Customer_Facing_Content/Autonumber_vs._AutonumberHash128_in_Qlik_Sense on the low probability of a key collision. That being said, does the performance improvement change if you use Hash 256 to reduce the chances of a key collision even further? I imagine that the memory usage would increase during the data load, but it’s also not uncommon to have a concatenated keys be longer than 43 characters either. Also, is the performance improvement solely due to the simple act of concatenating two strings or could the length of the strings being concatenated also affect the performance difference?
    
    Reply
    1. Rob Wunderlich says:
      
      Sep 21, 2022 at 6:48 am
      
      I have not tested, but my guess is string length — unless extreme — has little impact on the performance of the key creation.
      
      Reply

Qlikview Cookbook

AutoNumber vs AutoNumberHash128

16 thoughts on “AutoNumber vs AutoNumberHash128”

Leave a Reply Cancel reply

Recipes for Qlikview Success