Observations on a popular string hashing method

20 Mar 2016, updated 18 Dec 2018

On page 144 of The C Programming Language, 2nd Ed, we see this hash method:

unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

This way of calculating the hash of a string seems to have a long history.

At least some implementations of Java's String.hashCode() method use a similar calculation:

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

The comment at the top of Java's String.hashCode() method says:

    /**
     * Returns a hash code for this string. The hash code for a
     * {@code String} object is computed as
     * <blockquote><pre>
     * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * </pre></blockquote>
     * using {@code int} arithmetic, where {@code s[i]} is the
     * <i>i</i>th character of the string, {@code n} is the length of
     * the string, and {@code ^} indicates exponentiation.
     * (The hash value of the empty string is zero.)
     *
     * @return  a hash code value for this object.
     */

There is a lot to notice in such a small but long-lived hashing method.

Multiplication and addition are both fast operations on computers (compared to division), which is presumably why they were chosen.

The number 31 is prime. Multiplying by a prime number tends to give a nicer spread of values than multiplying by, say, an even number such as 10. When we put values into a hashmap, we want them spread across the buckets rather than clustered into some buckets while other buckets stay empty.
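
One way to see the trouble with an even multiplier, sketched under the assumption of 32-bit unsigned arithmetic: with a multiplier of 32, every step shifts the running value left by five bits, so characters far enough from the end of the string are pushed out of the hash entirely. The names hash_with and compare_multipliers and the two sample strings are illustrative only; they come from neither the book nor the JDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the same loop as the book's hash(), but with the
 * multiplier passed in, the width pinned to 32 bits, and no % HASHSIZE. */
uint32_t hash_with(uint32_t multiplier, const char *s)
{
    uint32_t h = 0;

    for (; *s != '\0'; s++)
        h = (uint32_t)(unsigned char)*s + multiplier * h;
    return h;
}

/* Compare two strings that differ only near the front ("one" vs "two"). */
void compare_multipliers(void)
{
    const char *a = "prefix_one_12345678";
    const char *b = "prefix_two_12345678";

    /* 32 = 2^5: the differing characters sit at 32^9 and above, which is
     * a multiple of 2^32, so they vanish and the two strings collide. */
    assert(hash_with(32, a) == hash_with(32, b));

    /* 31 is odd, so its powers never become 0 modulo 2^32 and the
     * differing characters still influence the result. */
    assert(hash_with(31, a) != hash_with(31, b));
}
```

This only shows that an odd multiplier keeps every character in play; it does not by itself show that 31 is better than other odd numbers.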

Math people will notice that the Java comment's description of the hash function is a polynomial.

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

becomes

s_0 x^{n-1} + s_1 x^{n-2} + ... + s_{n-1}

where x = 31, and n = the length of the string.

There is no exponentiation in the C or Java source code we have seen, and yet the Java description of the algorithm uses exponentiation, and the polynomial notation clearly has exponentiation too.

It turns out the C and Java code are still calculating a polynomial; they are just using Horner's method to do so.
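
For a four-character string, the rewriting that Horner's method performs can be written out as follows (same x and s as in the polynomial above). Each step factors one x out of the terms that still contain it:

```latex
\begin{aligned}
s_0 x^3 + s_1 x^2 + s_2 x + s_3
  &= (s_0 x^2 + s_1 x + s_2)\,x + s_3 \\
  &= ((s_0 x + s_1)\,x + s_2)\,x + s_3
\end{aligned}
```

The nested form never computes a power of x explicitly. It matches the loop in the code: start with 0, and for each character multiply the running value by x and add the character.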

The characters of the string "bake", in ASCII, are 98, 97, 107, 101.

The hash of that would therefore be

98x^{n-1} + 97x^{n-2} + 107x^{n-3} + 101

or, filling in x and n...

98 * 31^3 + 97 * 31^2 + 107 * 31 + 101
= 98 * 29791 + 97 * 961 + 107 * 31 + 101
= 2919518 + 93217 + 3317 + 101
= 3016153

But the code actually uses Horner's method, so what gets computed is this:

31 * (31 * (31 * 98 + 97) + 107) + 101
= 31 * (31 * (3038 + 97) + 107) + 101
= 31 * (31 * 3135 + 107) + 101
= 31 * (97185 + 107) + 101
= 31 * 97292 + 101
= 3016052 + 101
= 3016153
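
The two calculations above can be checked with a short C sketch. It assumes unsigned int is at least 32 bits wide, and the names power_of, hash_by_powers, hash_by_horner and check_bake are made up for this illustration. Neither hash function applies the final % HASHSIZE, so the results can be compared with the arithmetic above.

```c
#include <assert.h>
#include <string.h>

/* Illustrative helper: x raised to the power e by repeated multiplication. */
unsigned power_of(unsigned x, size_t e)
{
    unsigned result = 1;

    while (e-- > 0)
        result *= x;
    return result;
}

/* The polynomial as written in the Java comment:
 * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] */
unsigned hash_by_powers(const char *s)
{
    size_t n = strlen(s);
    unsigned sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += (unsigned char)s[i] * power_of(31, n - 1 - i);
    return sum;
}

/* Horner's method: the same loop as the book's hash(). */
unsigned hash_by_horner(const char *s)
{
    unsigned hashval = 0;

    for (; *s != '\0'; s++)
        hashval = (unsigned char)*s + 31 * hashval;
    return hashval;
}

/* Both routes should reach the value worked out by hand for "bake". */
void check_bake(void)
{
    assert(hash_by_powers("bake") == 3016153u);
    assert(hash_by_horner("bake") == 3016153u);
}
```

Calling check_bake() from a main function should pass both assertions without output. The Horner version does one multiplication per character, while the explicit version repeats the work of building each power.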