Building hash functions
Under the simple uniform hashing assumption, each key should be equally likely to hash to any of the \(m\) possible hash values, independently of where any other key hashes to. But in practise, we do not necessarily know the distribution of the keys inserted nor are they necessarily drawn independently. A good approach computes hash values in a way that we expect to be independent of the patterns that appear in the keys. As a rough guide we thus try follow the principle “the more random the hash value looks like, the better”. When we compute hash values to be used as indices in hash tables for representing sets and maps, the hash value should also be very efficient to compute.
Naturally, a lot of hash functions for different purposes have been proposed. See, for instance, the following links:
Wikipedia on Murmur hash and a C++ implementation in the GNU C++ standard library.
In the following, we go through some hash function implementations found in Scala and Java. These should never be used in cryptographic applications!
Hashing integers
We first show some ways to produce hash functions for integer keys. We have already seen the “division method” in which the hash value is the remainder when dividing the key with the hash table size \(m\):
This is fast to compute as only one modulo operation is required. But it may be a bad choice if \(m\) is a power of two with \(m = 2^p\): only the \(p\) least significant bits of the key influence the hash value. If possible, \(m\) should rather be a prime number not too close to a power of two.
The “multiplication method” produces a hash value for a \(w\)-bit integer key \(k\) by
multiplying it with another, well chosen constant \(w\)-bit integer \(A\), to a \(2w\)-bit integer \(r = r_{2w-1} r_{2w-2} ... r_0 = kA\) and
taking the hash value from the most significant bits of the least significant half \(r_{w-1} ... r_0\).
Here \(m\) is usually a power of two, \(m = 2^p\), so that one can simply take the \(p\) most significant bits in \(r_{w-1} ... r_0\) by shifting and masking. For instance, consider the Scala function
def h(x: Int): Int = (x * 2654435769L).toInt
Now
h(1) = 9e3779b9
,
h(2) = 3c6ef372
,
h(3) = daa66d2b
and so on, looking quite random (the 32-bit results are shown in hexadecimal so that it is easier to compare their bit-wise structures).
Hashing strings
To translate a string \(s = c_0 ... c_{n-1}\) into an integer, characters \(c_i\) in the string can be processed one-by-one. For instance, the current Java implementations use the hash function
as can be observed in the openjdk Java 8 source code. Observe the “software caching” of the computed hash value; if the hash code of the same string instance is later requested again, it is not recomputed but read from a stored value. !JAVA codes/java8StringHashCode.java
Hashing compound objects
Computing hash values for compound objects (objects with fields, arrays, and so on) can be done by combining the hash values of the components. One of the simplest examples is computing hash code for integer arrays in openjdk Java 8. !JAVA codes/java6ArrayHashCode.java
Currently Scala uses a bit more complex hash function based on MurMurHash 3. It tries to both produce good hash values (few collisions) and to be fast. From the Scala source code:
final def arrayHash[@specialized T](a: Array[T], seed: Int): Int = {
var h = seed
var i = 0
while (i < a.length) {
h = mix(h, a(i).##) // ## is hashCode
i += 1
}
finalizeHash(h, a.length)
}
final def mix(hash: Int, data: Int): Int = {
var h = mixLast(hash, data)
h = rotl(h, 13) // rotl is Interger.rotateLeft
h * 5 + 0xe6546b64
}
final def mixLast(hash: Int, data: Int): Int = {
var k = data
k *= 0xcc9e2d51
k = rotl(k, 15)
k *= 0x1b873593
hash ^ k
}
/** Finalize a hash to incorporate the length and make sure all bits avalanche. */
final def finalizeHash(hash: Int, length: Int): Int = avalanche(hash ^ length)
/** Force all bits of the hash to avalanche. Used for finalizing the hash. */
private final def avalanche(hash: Int): Int = {
var h = hash
h ^= h >>> 16
h *= 0x85ebca6b
h ^= h >>> 13
h *= 0xc2b2ae35
h ^= h >>> 16
h
}