CS 240: Lecture 15 - Hash Tables
Last time we introduced a way of generalizing an array to allow indices to be non-integers. We focused on the idea of hashing—which is the process of turning an object into an integer. Today we continue that discussion by implementing a hash table.
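To recap the hashing step, here is a minimal sketch, assuming Java's built-in String.hashCode and a hypothetical table of 100 cells:

```java
public class HashDemo {
    public static void main(String[] args) {
        // Every Java object has a hashCode method that maps it to an int.
        String key = "cat";
        int hashcode = key.hashCode();        // 'c'*31*31 + 'a'*31 + 't' = 98262
        // A hash table then squeezes that integer into a legal array index.
        int index = (hashcode & 0x7FFFFFFF) % 100;
        System.out.println(hashcode + " -> " + index);  // 98262 -> 62
    }
}
```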
Collisions
As a hash table fills up, you are increasingly likely to find that two different keys resolve to the same index. We can't totally eliminate these collisions; we can only hope to minimize them. So, what happens when we do have a collision? We have a few options:
- We could replace the original item. Bad idea.
- We could fail to add the new item. Bad idea.
- We could allow multiple items to be stored in a cell. Each cell could be a linked list of items whose hashcodes collide. This is called open hashing or separate chaining.
- We could do something else, like probe ahead in the table to find an empty cell. This is called closed hashing or open addressing.
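The probing option can be sketched in a few lines. This is a hypothetical toy, not the table we build below; it uses linear probing, the simplest probing scheme, and it ignores deletion, which complicates closed hashing considerably:

```java
import java.util.Arrays;

public class ProbeDemo {
    // Find the cell where key belongs in a closed-hashing table: start at
    // the hashed index and step forward past occupied cells. Assumes the
    // table always has at least one empty cell.
    static int probe(String[] table, String key) {
        int i = (key.hashCode() & 0x7FFFFFFF) % table.length;
        while (table[i] != null && !table[i].equals(key)) {
            i = (i + 1) % table.length;  // wrap around at the end
        }
        return i;
    }

    public static void main(String[] args) {
        String[] table = new String[7];
        table[probe(table, "a")] = "a";  // hashes to cell 6
        table[probe(table, "h")] = "h";  // also hashes to cell 6; probes to 0
        System.out.println(Arrays.toString(table));
    }
}
```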
Because of these collisions, looking up a value is not just a matter of finding the right cell and giving back the item we find there. We must inspect each item in the search sequence and find the one whose key matches. That means the table must store both the keys and values.
Separate chaining is less complex than probing. Together we will implement a hash table abstraction that uses separate chaining. We'll arrive at something like this:
import java.util.Iterator;
import java.util.LinkedList;

public class HashTable<K, V> {
  private LinkedList<KeyValuePair>[] pairs;

  @SuppressWarnings("unchecked")
  public HashTable() {
    // Java forbids creating a generic array directly, so we create a raw
    // array and cast it.
    pairs = (LinkedList<KeyValuePair>[]) new LinkedList[100];
  }

  private int indexFor(K key) {
    int hashcode = key.hashCode();
    // Mask off the sign bit so a negative hashcode can't yield a
    // negative index.
    int index = (hashcode & 0x7FFFFFFF) % pairs.length;
    return index;
  }

  public void put(K key, V value) {
    int i = indexFor(key);
    if (pairs[i] == null) {
      pairs[i] = new LinkedList<>();
    }

    // Either update an existing pair...
    for (KeyValuePair pair : pairs[i]) {
      if (pair.key.equals(key)) {
        pair.value = value;
        return;
      }
    }

    // ...or add a brand new entry.
    pairs[i].add(new KeyValuePair(key, value));
  }

  public V get(K key) {
    int i = indexFor(key);
    if (pairs[i] != null) {
      for (KeyValuePair pair : pairs[i]) {
        if (pair.key.equals(key)) {
          return pair.value;
        }
      }
    }
    return null;
  }

  // Note: a key that was stored with a null value will appear absent.
  public boolean has(K key) {
    return get(key) != null;
  }

  public V remove(K key) {
    int i = indexFor(key);
    if (pairs[i] != null) {
      Iterator<KeyValuePair> it = pairs[i].iterator();
      while (it.hasNext()) {
        KeyValuePair pair = it.next();
        if (pair.key.equals(key)) {
          it.remove();
          return pair.value;
        }
      }
    }
    return null;
  }

  private class KeyValuePair {
    K key;
    V value;

    public KeyValuePair(K key, V value) {
      this.key = key;
      this.value = value;
    }
  }
}
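Our table mirrors a slice of the java.util.HashMap API. For comparison, the same sequence of operations against the standard library looks like this, with HashMap's containsKey playing the role of our has:

```java
import java.util.HashMap;

public class MapDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> ages = new HashMap<>();
        ages.put("alice", 34);
        ages.put("bob", 27);
        ages.put("alice", 35);                         // put on an existing key updates it
        System.out.println(ages.get("alice"));         // 35
        System.out.println(ages.containsKey("carol")); // false
        ages.remove("bob");                            // remove hands back the old value
        System.out.println(ages.get("bob"));           // null
    }
}
```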
In the worst case, what will be the cost of this table's operations? The worst case happens when all the pairs resolve to the same index. The collection degenerates to a linked list, so every operation runs in linear time. Good performance is only possible if the hash function distributes the items evenly across the array.
The OpenJDK Hashtable uses separate chaining.
Table Size
We've seen how the hash gets turned into an index in the array. What size should this array have? There's no easy answer, but there are certain sizes you should avoid. Suppose you have an array of size 10 and someone inserts items with hashcodes 5, 10, 15, 20, 25, 30, 35, … These will all map to just two cells in the array: cells 0 and 5.
This pattern of collisions will happen whenever the data and the array size share common factors. An array of size 12 will suffer when the data comes in multiples of 2, 3, 4, or 6. An array of size 48 will suffer when the data comes in multiples of 2, 3, 4, 6, 8, 12, 16, and 24. Without knowing the distribution of keys ahead of time, the only surefire way to prevent this clustering is to give the table a size that is a prime number. A prime length shares no factors with the data's step size except the length itself.
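We can check that claim empirically. This sketch counts how many distinct cells the hashcodes 5, 10, 15, …, 100 occupy under a composite table size and a prime one:

```java
import java.util.HashSet;
import java.util.Set;

public class ClusterDemo {
    // Count how many distinct cells the hashcodes 5, 10, 15, ..., 100
    // land in for a given table size.
    static int distinctCells(int tableSize) {
        Set<Integer> cells = new HashSet<>();
        for (int hashcode = 5; hashcode <= 100; hashcode += 5) {
            cells.add(hashcode % tableSize);
        }
        return cells.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctCells(10)); // 2: everything lands in cells 0 and 5
        System.out.println(distinctCells(11)); // 11: the prime size uses every cell
    }
}
```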
In our earlier example, what number yields the first collision if we pick an array size of 11? We'll solve this as a peer instruction exercise.
Load Factor
Some of the behaviors of a hashed collection examine the collection's load factor, which has this definition:

  load factor = number of stored pairs / number of cells in the array
A load factor of 0.5 means that 50% of the cells are occupied. When the load factor exceeds 0.5, a new entry is more likely to collide than land in an empty cell. For this reason, we want to avoid large load factors. We are not trying to be efficient with space but rather with lookup time. What can we do to ensure that the load factor stays small? If the load factor goes above 0.5, we create a bigger array and move all the elements over, much like we did with an array list. One difference: each pair must be rehashed into the new array, because an item's index depends on the array's length.
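Here's a sketch of that growth step, simplified to a table of plain integer keys rather than key-value pairs, and doubling the capacity rather than jumping to the next prime as a careful implementation would. The crucial detail is that every key gets rehashed on the way over:

```java
import java.util.LinkedList;

public class ResizeDemo {
    // Rehash every key from an old chained table into one twice as large.
    // Keys can't simply be copied to the same slot: indices depend on the
    // table length, so each key must be rehashed.
    static LinkedList<Integer>[] grow(LinkedList<Integer>[] old) {
        @SuppressWarnings("unchecked")
        LinkedList<Integer>[] bigger = (LinkedList<Integer>[]) new LinkedList[old.length * 2];
        for (LinkedList<Integer> chain : old) {
            if (chain == null) continue;
            for (int key : chain) {
                int i = (key & 0x7FFFFFFF) % bigger.length;
                if (bigger[i] == null) bigger[i] = new LinkedList<>();
                bigger[i].add(key);
            }
        }
        return bigger;
    }

    public static void main(String[] args) {
        @SuppressWarnings("unchecked")
        LinkedList<Integer>[] table = (LinkedList<Integer>[]) new LinkedList[4];
        for (int key : new int[] {3, 7, 11}) {  // 3, 7, and 11 all map to cell 3 mod 4
            int i = key % table.length;
            if (table[i] == null) table[i] = new LinkedList<>();
            table[i].add(key);
        }
        table = grow(table);                    // length 8; the keys spread back out
        System.out.println(table.length);       // 8
    }
}
```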
TODO
You've got a few things to do before we meet again:
See you next time!
Sincerely,
P.S. It's time for a haiku: