AutoNumber vs AutoNumberHash128

Summary:  AutoNumberHash128(A, B) runs about 30% faster than AutoNumber(A &’-‘ & B).

It’s a common practice to use the script AutoNumber() function to reduce the storage required for large compound keys in a Qlik data model. For example:

AutoNumber(A & '-' & B) as %KeyField

As a standard practice, we generally include a separator like ‘-‘ to ensure ‘1’ & ’11’ does not get confused with ’11’ & ‘1’.

The AutoNumber process can add significant run time to a script with many rows.

I’ve always wondered what the AutoNumberHash128() function was good for.

AutoNumberHash128(A,B) as %KeyField

This function first hashes A & B and then autonumbers the result. The end result is the same as the first example given using AutoNumber().  I find the AutoNumberHash128 syntax a bit simpler as a separator is not required.

What surprised me is that the AutoNumberHash128() function runs faster.  Typically about 30% faster than a plain AutoNumber with a concatenated string parameter.

Why is it faster?  The difference is in the function used to create the single value to be autonumbered.  Hash128 is considerably faster than string concatenation (&).

AutoNumberHash128() can take any number of fields, but it does not have an “AutoId” parameter.  The “AutoId” (second parameter) in AutoNumber() is recommended to ensure we get sequential integers when autonumbering more than one key field.  Sequential integers are the most memory efficient storage for keys.

Don’t despair.  AutoNumberHash128() will use the “default” AutoId.  That is fine if you are autonumbering only one key.  If you are doing more than one key, use AutoNumberHash128() for your largest — most rows — key and use AutoNumber() with AutoId for the rest.  You will improve the script run time of one key.

Another possible tradeoff when you have many large keys is to use AutoNumberHash128 for all keys and forgo the sequential integer optimization.  You will use only 8 bytes per key value which could be significantly less than the original string keys.

-Rob

 

Share

12 thoughts on “AutoNumber vs AutoNumberHash128”

  1. Nice one, Rob. Good to know hashing is faster. Funnily, I have seen loads of developers using the Hash and Autonumber functions to join the fields and still keeping the original long-concatenated field. I see why they need it but keeping in the QVW file doesn’t make sense. Rather I would have it stored in the QVD file.

  2. Hi Rob, great post.
    I was recently was informed of the existence of the function AutoNumberHash256, which since then I started using to concatenate dimension tables to account for primary keys being changed in the source system tables resulting in missing associations with older rows in my fact table. For this the following approach was used:

    // Transform Source Table

    PreviousTable:
    NoConcatenate
    LOAD *
    FROM [Table.QVD] (qvd);

    STORE PreviousTable into Table_Previous.QVD (QVD);

    NewTable:
    NoConcatenate LOAD
    AutonumberHash256(PK1, PK2, …….) as PrimaryKey,
    *
    FROM [Table_New.QVD] (qvd);

    Concatenate (NewTable)
    LOAD *
    Resident PreviousTable
    Where not Exists(PrimaryKey, AutonumberHash256(PK1, PK2, …….));

    DROP Table PreviousTable;
    DROP Field PrimaryKey From NewTable;

    STORE NewTable into Table_New.QVD (QVD);

    DROP Table NewTable;

    QUESTION: What would be the advantage of AutonumberHash128 be over AutonumberHash256 or vice versa?

  3. Thanks for posting, Rob.
    I Agree, AutoNumber() itself is fast but string concatenation slows it down.
    A possible work-around is to create the concatentated strings beforehand, in Extract stage where incremental load is used. This way we can still use AutoNumber() in the final script, avoiding slow-down due to “string&string” processing.

    When load speed is important, it’s worth to point out that any string concatenation in the load statement will slow down the processing – be it with or without AutoNumber().
    I compared different functions (AutoNumber, Hash, RowNo, RecNo etc) some time ago – results published in the community:
    https://community.qlik.com/thread/283671

  4. You can use the advantages of both methods simply by using:

    AutoNumber(Hash128(A,B),’AutoId’).

  5. Rob,
    I have always used autonumberhash128(). I read that autonumber() is only good (gauranteed) in the qvw that it is used in, whereas (I believe that) autonumberhash128() produces an integer that be used accross multiple qvws. Example. In one qvd loader script autonumberhash128(a,b) will produce the same integer as autonumberhash128(a,b) in another qvd loader script, as long as a and b are the same values. Is that correct?
    Dave

    1. Just tested that autonumberhash128(a,b) does not produce (true hash key) integers that can be used accross multiple qvws, So forget my last post. It seems that autonumberhash128(a,b) is just more convenient to use and can perform better as Rob says.

      1. Also, in a single qvw, I did an autonumberhash128(a,b,c) and then an autonumberhash128((hash128(a,b,c))).

        I expected them to be the same but they were not. So, it seems that AutoNumberHash128() is not simply an autonumber() of a hash128().

        1. I think you were on the right track, but one small error. AutonumberHash128(Hash128(a,b,c,)) would re-hash the Hash128 result yielding a different value for Autonumber. What you may have meant to test was that “AutoNumber(Hash128(a,b,c)) = AutoNumberHash128(a,b,c)” which is true.

          1. Yes, that makes sense. I will try the AutoNumber(hash128(a,b,c)) to confirm that is the same as AutoNumberhash128(a,b,c). Thanks Rob.

  6. Yes, they are produce the same numbers with your suggestion. So, now I am wondering if it always best to use AutoNumber(hash128()) , since you have the opportunity of specifiyng an AutoNumber ID as in AutoNumber(hash128(a,b,c),’MyID’).

    While this may not prove as performant, it gives you the extra flexibility if you need that AutoNumber ID to guarantee it own unique set of integers.

Leave a Reply

Your email address will not be published. Required fields are marked *