Summary: AutoNumberHash128(A, B) runs about 30% faster than AutoNumber(A &’-‘ & B).
It’s a common practice to use the script AutoNumber() function to reduce the storage required for large compound keys in a Qlik data model. For example:
AutoNumber(A & '-' & B) as %KeyField
As a standard practice, we generally include a separator like ‘-‘ to ensure ‘1’ & ’11’ does not get confused with ’11’ & ‘1’.
The AutoNumber process can add significant run time to a script with many rows.
I’ve always wondered what the AutoNumberHash128() function was good for.
AutoNumberHash128(A,B) as %KeyField
This function first hashes A & B and then autonumbers the result. The end result is the same as the first example given using AutoNumber(). I find the AutoNumberHash128 syntax a bit simpler as a separator is not required.
What surprised me is that the AutoNumberHash128() function runs faster. Typically about 30% faster than a plain AutoNumber with a concatenated string parameter.
Why is it faster? The difference is in the function used to create the single value to be autonumbered. Hash128 is considerably faster than string concatenation (&).
AutoNumberHash128() can take any number of fields, but it does not have an “AutoId” parameter. The “AutoId” (second parameter) in AutoNumber() is recommended to ensure we get sequential integers when autonumbering more than one key field. Sequential integers are the most memory efficient storage for keys.
Don’t despair. AutoNumberHash128() will use the “default” AutoId. That is fine if you are autonumbering only one key. If you are doing more than one key, use AutoNumberHash128() for your largest — most rows — key and use AutoNumber() with AutoId for the rest. You will improve the script run time of one key.
Another possible tradeoff when you have many large keys is to use AutoNumberHash128 for all keys and forgo the sequential integer optimization. You will use only 8 bytes per key value which could be significantly less than the original string keys.
Update 20 Sept 2022
Things have changed somewhat with the addition of the AutoNumber statement, which is the recommended method to autonumber keys. AutoId is no longer a problem. In my recent testing creating compound key fields, I still find that Hash128() is somewhat faster than the & operator. Here’s the results. Option 4 is creating the key with the & operator and AutoNumber statement. Option 5 is creating the key with Hash128() and AutoNumber statement.
16 thoughts on “AutoNumber vs AutoNumberHash128”
Nice one, Rob. Good to know hashing is faster. Funnily, I have seen loads of developers using the Hash and Autonumber functions to join the fields and still keeping the original long-concatenated field. I see why they need it but keeping in the QVW file doesn’t make sense. Rather I would have it stored in the QVD file.
Hi Rob, great post.
I was recently was informed of the existence of the function AutoNumberHash256, which since then I started using to concatenate dimension tables to account for primary keys being changed in the source system tables resulting in missing associations with older rows in my fact table. For this the following approach was used:
// Transform Source Table
FROM [Table.QVD] (qvd);
STORE PreviousTable into Table_Previous.QVD (QVD);
AutonumberHash256(PK1, PK2, …….) as PrimaryKey,
FROM [Table_New.QVD] (qvd);
Where not Exists(PrimaryKey, AutonumberHash256(PK1, PK2, …….));
DROP Table PreviousTable;
DROP Field PrimaryKey From NewTable;
STORE NewTable into Table_New.QVD (QVD);
DROP Table NewTable;
QUESTION: What would be the advantage of AutonumberHash128 be over AutonumberHash256 or vice versa?
Thanks for posting, Rob.
I Agree, AutoNumber() itself is fast but string concatenation slows it down.
A possible work-around is to create the concatentated strings beforehand, in Extract stage where incremental load is used. This way we can still use AutoNumber() in the final script, avoiding slow-down due to “string&string” processing.
When load speed is important, it’s worth to point out that any string concatenation in the load statement will slow down the processing – be it with or without AutoNumber().
I compared different functions (AutoNumber, Hash, RowNo, RecNo etc) some time ago – results published in the community:
I was just recently curious about why autonumber was so slow. Did a google, and this came up right quick!
I´ve faced the same situation a time ago when I was testing those functions.
In my tests, Autonumber() with concatenaded fields were even worse than 30%, maybe 100% sometimes
In Brazilian Portuguese https://community.qlik.com/groups/qlikview-brasil/blog/2017/09/20/autonumber-x-autonumberhash128-x-autonumberhash256-em-chaves-compostas
You can use the advantages of both methods simply by using:
I have always used autonumberhash128(). I read that autonumber() is only good (gauranteed) in the qvw that it is used in, whereas (I believe that) autonumberhash128() produces an integer that be used accross multiple qvws. Example. In one qvd loader script autonumberhash128(a,b) will produce the same integer as autonumberhash128(a,b) in another qvd loader script, as long as a and b are the same values. Is that correct?
Just tested that autonumberhash128(a,b) does not produce (true hash key) integers that can be used accross multiple qvws, So forget my last post. It seems that autonumberhash128(a,b) is just more convenient to use and can perform better as Rob says.
Also, in a single qvw, I did an autonumberhash128(a,b,c) and then an autonumberhash128((hash128(a,b,c))).
I expected them to be the same but they were not. So, it seems that AutoNumberHash128() is not simply an autonumber() of a hash128().
I think you were on the right track, but one small error. AutonumberHash128(Hash128(a,b,c,)) would re-hash the Hash128 result yielding a different value for Autonumber. What you may have meant to test was that “AutoNumber(Hash128(a,b,c)) = AutoNumberHash128(a,b,c)” which is true.
Yes, that makes sense. I will try the AutoNumber(hash128(a,b,c)) to confirm that is the same as AutoNumberhash128(a,b,c). Thanks Rob.
Yes, they are produce the same numbers with your suggestion. So, now I am wondering if it always best to use AutoNumber(hash128()) , since you have the opportunity of specifiyng an AutoNumber ID as in AutoNumber(hash128(a,b,c),’MyID’).
While this may not prove as performant, it gives you the extra flexibility if you need that AutoNumber ID to guarantee it own unique set of integers.
How has the autonumber versus autonumberhash128 comparison changed with the addition of the AutoNumber script statement? I might think that the original issue with how fast string concatenation is performed still exists, but the ultimate convenience of using a single AutoNumber statement instead of multiple autonumberhash128() function makes this a viable solution only when 30% faster can be measured in minutes and/or reload time is critical.
I always use the AutoNumber statement, as it’s incredibly fast and convenient. The AutoNumber statement will perform the same irrespective of how the key values were created. So the comparison is now “A&’-‘&B” vs Hash128(A,B). (This question came up at the Masters Summit this week). My recent tests conclude that Hash128 is about 15% faster, even more so if A and B are numeric values, I also find Hash128 to be easier syntax, and AutoId is no longer an issue if you use the AutoNumber statement. Downside to Hash128() is that you can’t see the individual key components and there could be a possibility of a key collision. I’ll update the post with test results using AutoNumber statement.
I think Barry has a great hashing article https://knowledge.axisgroup.com/wiki/Customer_Facing_Content/Autonumber_vs._AutonumberHash128_in_Qlik_Sense on the low probability of a key collision. That being said, does the performance improvement change if you use Hash 256 to reduce the chances of a key collision even further? I imagine that the memory usage would increase during the data load, but it’s also not uncommon to have a concatenated keys be longer than 43 characters either. Also, is the performance improvement solely due to the simple act of concatenating two strings or could the length of the strings being concatenated also affect the performance difference?
I have not tested, but my guess is string length — unless extreme — has little impact on the performance of the key creation.