CS 240: Lecture 14 - Hashing

Dear students:

This course is really a marathon of one-upping. We introduce a data structure only to knock it down with the next one. And then we knock that one down when we introduce a third. At the bottom we find tradeoffs. Let's do it again.

Consider the problem of searching for something in a collection. With a plain array, we find the item using a linear search. A linear search runs in linear time, which is only moderately fast. We get better performance by sorting the array and using binary search, which runs in logarithmic time. That's really good. Could we possibly get the cost of search down to constant time? Yes, we can—for a certain class of problems.

Arrays as Lookup Tables

You already know a data structure that has many constant time operations: the array. Its fast operations include:

getting the array's size
placing an element at a particular index
retrieving an element at a particular index
removing an element at a particular index

What makes an array slow is treating it like a list. If its elements form a sequence, then any disruption to the sequence causes costly structural changes. But not every problem uses an array as a list. Suppose you are trying make a histogram of how many siblings people have. On your phone you have a counter program that looks like this:

# of siblings	count
0
1
2
3
4
5
6
…	…

You canvas the campus and ask everyone you see how many siblings they have. The first person says 2, so you tick the spinner in row 2. The second person says 0, so you tick row 0. And so on. Later a friends wagers that there are fewer people with no siblings than people with a single sibling. You open the counter app and immediately see the first two numbers to resolve the wager.

Everything about this array of counts is blazingly fast. That's because it's being used not as a list but as a lookup table. You know exactly where the data you want is supposed to live in the array, so all your interactions are constant time operations.

Hashing

If arrays are the means of achieving constant time lookups, then it would seem that your data has to be keyed on small integer indices in order to be managed by a lookup table. This is a very limiting restriction. It worked well for the sibling count, where the sibling counts aligned very neatly with the array's indices. But what if you wanted to track the phone numbers of your friends? You don't want to have label your friends Friend 0, Friend 1, Friend 2, and so on just so you can record and look up their numbers quickly.

You can allow keys of any type if you pair the array with a magical function that turns any key into an int. That function is called a hash function. It gets its name from its job: it stirs together the bytes of a key into a soup of bits called a hashcode. In most programming languages, a hashcode is a 32-bit integer.

If we have a hash function, we could insert an element into an array with code like this:

int hash = key.hashCode();
items[hash] = new KeyValuePair(key, value);

int hash = key.hashCode();
items[hash] = new KeyValuePair(key, value);

However, it's not quite this simple. First, how many different hashcodes might be generated? About 4 billion, since that's how many possible int values there are. Only 2 billion of those are positive and therefore legal array indices. The array would have to be very large to support all those possible indices. We can get by with smaller arrays if we wrap hashcodes that exceed the bounds back to the beginning of the array. That's a job for the % operator.

Second, what if two different keys produce the same hashcode and therefore map to the same cell of the array? There are several ways we might address this concern, and you'll learn more about them in the textbook.

If we can get this hashing business to work, here's the headline for tomorrow's newspaper: "SEARCH AND INSERT IN CONSTANT TIME!" Because that's how fast this code is, assuming that the hashcode itself can be computed in constant time. A collection whose performance doesn't degrade as its size increases is a breakthrough.

Map Interface

If we assume that a hash function can be written, then we have effectively generalized arrays so that they can be indexed by more than just integers. Any data can be a key as long as it can produce a hashcode. The keys could be state names that map to populations. They could be xy-coordinate pairs that map to colors. They could be arbitrary, multi-attribute objects that map to arbitrary values.

A collection that allows keys of an arbitrary type to be associated with values of an arbitrary type is called a map. A map is an ADT that generally supports these operations:

interface Map<K, V> {
  boolean has(K key);
  V get(K key);
  void put(K key, V value);
  void remove(K key);
}

interface Map<K, V> {
  boolean has(K key);
  V get(K key);
  void put(K key, V value);
  void remove(K key);
}

There are many ways that this interface might be implemented. An implementation that uses an array and a hash function to locate items in the array is called a hashtable. If you want to collect your friends' phone numbers in a hashtable, you use their names as the keys and their numbers as the values:

HashTable<String, Phone> namesToNumbers = new HashTable<>();

HashTable<String, Phone> namesToNumbers = new HashTable<>();

Each element in the backing array is a key-value pair:

class KeyValuePair<K, V> {
  K key;
  V value;
}

class KeyValuePair<K, V> {
  K key;
  V value;
}

To look up a phone number, you need only the key:

Phone number = namesToNumbers.get("Bronwyn");

Phone number = namesToNumbers.get("Bronwyn");

From Hash Code To Index

The hash function's job is to turn an arbitrary object into an integer. It must achieve these goals:

Be fast to compute.
Distribute the keys widely to minimize the chance of two different keys mapping to the same index.

From the hashcode we must produce a legal index for the hashtable. It must be in $[0, \mathrm{table.size})$. But it's not the job of the hash function to perform this compression. Only the hash table itself knows its size, so it will compute the index. The OpenJDK implementation of Java transforms the hashcode into an index like so:

int hash = key.hashCode();
int index = (hash & 0x7FFFFFFF) % table.length;
table[index] = new KeyValuePair(key, value);

int hash = key.hashCode();
int index = (hash & 0x7FFFFFFF) % table.length;
table[index] = new KeyValuePair(key, value);

The bit masking shaves off the leading 1 that may make the number negative. An index must be positive.

Hashcode Function

There are many schemes we could use to map an object to its integer hashcode. In OpenJDK, the Integer class computes a hashcode like this:

public class Integer {
  // ...
  public int hashCode() {
    return Integer.hashCode(value);
  }

  public static int hashCode(int value) {
    return value;
  }
}

public class Integer {
  // ...
  public int hashCode() {
    return Integer.hashCode(value);
  }

  public static int hashCode(int value) {
    return value;
  }
}

The hashcode is just the integer itself. The other, smaller integer types probably work similarly. That leads me to wonder how Boolean computes its hashcode. Here's the OpenJDK implementation:

public class Boolean {
  // ...
  public int hashCode() {
    return Boolean.hashCode(value);
  }

  public static int hashCode(boolean value) {
    return value ? 1231 : 1237;
  }
}

public class Boolean {
  // ...
  public int hashCode() {
    return Boolean.hashCode(value);
  }

  public static int hashCode(boolean value) {
    return value ? 1231 : 1237;
  }
}

I was not expecting those two seemingly random numbers.

Suppose our objects are strings. We could implement our hashcode function by just returning the first letter:

class String {
  // ...
  public int hashCode() {
    return characters[0];
  }
}

class String {
  // ...
  public int hashCode() {
    return characters[0];
  }
}

The string "empathy" would map to "e". The string "authenticity" would map to "a". The string "rhythmic" would map to "r". How do you feel about this scheme? Is it a good one? Not really. There are many possible strings, but far fewer hashcodes. If all of the entries began with lowercase letters, they would all collide at the same 26 cells of the array.

The String class implements hashcode with more sophistication than just the first letter:

public class String {
  // ...
  private int hashCache = 0;
  public int hashCode() {
    int hash = hashCache;
    if (hash == 0 && value.length > 0) {
      char[] val = value;
      for (int i = 0; i < value.length; i++) {
        hash = 31 * hash + val[i];
      }
      hashCache = hash;
    }
    return hash;
  }
}

public class String {
  // ...
  private int hashCache = 0;
  public int hashCode() {
    int hash = hashCache;
    if (hash == 0 && value.length > 0) {
      char[] val = value;
      for (int i = 0; i < value.length; i++) {
        hash = 31 * hash + val[i];
      }
      hashCache = hash;
    }
    return hash;
  }
}

This code requires some interpretation. The assignments involving hashCache reveal that the hashcode is cached. Instead of computing the hashcode each time the function is called, it is stored in an instance variable. This is a wise strategy in Java because strings are immutable. The characters won't change and thus invalidate the hashcode.

The hashcode isn't computed until it's first needed. It starts off as 0. This technique is called lazy initialization.

The summation may be easier to understand in its mathematical notation:

$$\textrm{hash} = s[0] \cdot 31^{n - 1} + s[1] \cdot 31^{n - 2} + s[2] \cdot 31^{n-3} + \ldots + s[n - 1] \cdot 31^{0}$$

This hashing function is engineered to prevent strings with similar prefixes or with the same characters in a different order from having hashcodes that cluster together.

The hashcode for ArrayList is defined similarly:

public abstract class AbstractList<E> {
  // ...
  public int hashCode() {
    int hashCode = 1;
    for (E e : this) {
      hashCode = 31 * hashCode + (e == null ? 0 : e.hashCode());
    }
    return hashCode;
  }
}

public abstract class AbstractList<E> {
  // ...
  public int hashCode() {
    int hashCode = 1;
    for (E e : this) {
      hashCode = 31 * hashCode + (e == null ? 0 : e.hashCode());
    }
    return hashCode;
  }
}

When you create your own objects, you are encouraged to implement your own hashCode function. It's not a bad idea to mimic the the strategy of computing a weighted sum of the instance variables' hashcodes:

public class SomeClass {
  // ...
  public int hashCode() {
    int hashCode = 7;
    hashCode = 31 * hashCode + instanceVariableA.hashCode();
    hashCode = 31 * hashCode + instanceVariableB.hashCode();
    hashCode = 31 * hashCode + instanceVariableC.hashCode();
    // ...
    return hashCode;
  }
}

public class SomeClass {
  // ...
  public int hashCode() {
    int hashCode = 7;
    hashCode = 31 * hashCode + instanceVariableA.hashCode();
    hashCode = 31 * hashCode + instanceVariableB.hashCode();
    hashCode = 31 * hashCode + instanceVariableC.hashCode();
    // ...
    return hashCode;
  }
}

The Object class does provide a default implementation of hashCode, but it doesn't uphold an important property: all objects that are considered equal should produce the same hashcode. This is an important property because if you want to look something up later, you want the key that you provide to match the key that's associated with the element. The default hashCode builds the hash out of some unique identifier that is different for every instance of an object, even if it has the same state.

Collisions

As a hash table fills up, you are increasingly likely to find that two different keys resolve to the same index. This is a major issue. You'll read about a couple of strategies for dealing with these collisions in the textbook.

TODO

You've got a few things to do before we meet again:

Start working on PA3. Delays bring grief.

Read chapter 9 in the textbook for Wednesday.

Read chapter 10 in the textbook for Friday.

See you next time!

Sincerely,