Hash tables are often implemented as arrays of linked lists. If we imagine a table storing people's names, then after a few insertions it might be laid out in memory as below, where the parenthesized numbers, like (42), are hash values of the names.
```
bucket#  bucket content / linked list
[0]  --> "sue"(780) --> null
[1]  null
[2]  --> "fred"(42) --> "bill"(9282) --> "jane"(42) --> null
[3]  --> "mary"(73) --> null
[4]  null
[5]  --> "masayuki"(75) --> "sarwar"(105) --> null
[6]  --> "margaret"(2626) --> null
[7]  null
[8]  --> "bob"(308) --> null
[9]  null
```
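To make the bucket indexing concrete, here is a minimal Python sketch that rebuilds the layout above. Python lists stand in for the linked lists, and the hash values are hard-coded from the diagram rather than computed by a real hash function:

```python
# (name, hash) pairs from the diagram, in an insertion order that
# reproduces the layout shown; the hashes are hard-coded for illustration.
entries = [("sue", 780), ("fred", 42), ("bill", 9282), ("jane", 42),
           ("mary", 73), ("masayuki", 75), ("sarwar", 105),
           ("margaret", 2626), ("bob", 308)]

num_buckets = 10
buckets = [[] for _ in range(num_buckets)]   # ten empty "linked lists"

for name, h in entries:
    buckets[h % num_buckets].append(name)    # bucket index = hash % number_of_buckets

for i, bucket in enumerate(buckets):
    print(f"[{i}]", " --> ".join(f'"{name}"' for name in bucket) or "null")
```

Running it prints "sue" in bucket [0], "fred" --> "bill" --> "jane" in bucket [2], and so on, matching the table above.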
A few points:
- each of the array entries (indices `[0]`, `[1]`…) is known as a bucket, and starts a possibly empty linked list of values (aka elements; in this example, people's names)
- each value (e.g. `"fred"` with hash `42`) is linked from bucket `[hash % number_of_buckets]`, e.g. `42 % 10 == [2]`; `%` is the modulo operator, i.e. the remainder after dividing by the number of buckets
- multiple data values may collide at, and be linked from, the same bucket, most often because their hash values collide after the modulo operation (e.g. `42 % 10 == [2]` and `9282 % 10 == [2]`), but occasionally because the hash values themselves are equal (e.g. `"fred"` and `"jane"` are both shown with hash `42` above)
- most hash tables handle collisions, with slightly reduced performance but no functional confusion, by comparing the full value (here, the text) of a value being sought or inserted against each value already in the linked list at the hashed-to bucket; the first sketch after this list shows this in code
- if the number of stored values grows, hash tables implemented as above tend to resize themselves (i.e. create a bigger array of buckets, re-link the values into new linked lists there, then delete the old array) to keep the ratio of values to buckets (aka the load factor) somewhere in the 0.5 to 1.0 range; the first sketch below resizes this way
- the buckets don't have to be linked lists or arrays; they can be any container type, such as a balanced binary search tree, which bounds searching a bucket at `O(log n)` even when many values collide there. Regardless, it's still important to choose a good hash function that spreads values evenly, so that no single bucket ends up with too many elements (the second sketch below approximates this)
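Putting these points together, here is a minimal sketch of a separate-chaining hash set in Python. The class and method names are my own invention; it uses Python's built-in hash(), compares full values to resolve collisions, and doubles the bucket array whenever the load factor exceeds 1.0, per the range mentioned above:

```python
class ChainedHashSet:
    """Minimal hash set using separate chaining (Python lists as buckets)."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def _bucket_for(self, value):
        # bucket index = hash % number_of_buckets
        return self.buckets[hash(value) % len(self.buckets)]

    def contains(self, value):
        # On a collision, compare the full value against every value
        # already chained in the hashed-to bucket.
        return value in self._bucket_for(value)

    def insert(self, value):
        bucket = self._bucket_for(value)
        if value in bucket:                      # already present
            return
        bucket.append(value)
        self.size += 1
        if self.size / len(self.buckets) > 1.0:  # load factor check
            self._resize(2 * len(self.buckets))

    def _resize(self, new_num_buckets):
        # Create a bigger array of buckets, re-link every value into it
        # (bucket indices change because the modulus changed), and let
        # the old array be garbage-collected.
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(new_num_buckets)]
        for bucket in old_buckets:
            for value in bucket:
                self.buckets[hash(value) % new_num_buckets].append(value)
```

After `s = ChainedHashSet(); s.insert("fred")`, `s.contains("fred")` is True, and inserting `"fred"` again changes nothing.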
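On the last point: Java 8's HashMap does exactly this, converting a bucket's linked list into a red-black tree once the chain grows past a threshold. Python's standard library has no balanced BST, so the sketch below approximates the idea with sorted buckets searched by binary search; lookups within a bucket are `O(log n)`, though inserting into a sorted list is `O(n)`, so this only illustrates the lookup side of the trade-off:

```python
import bisect

class SortedBucketHashSet:
    """Hash set whose buckets stay sorted, standing in for BST buckets."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, value):
        bucket = self.buckets[hash(value) % len(self.buckets)]
        i = bisect.bisect_left(bucket, value)
        if i == len(bucket) or bucket[i] != value:
            bucket.insert(i, value)              # keep the bucket sorted

    def contains(self, value):
        # Binary search within the bucket: O(log n) even if many values
        # collided here, versus O(n) for walking a linked list.
        bucket = self.buckets[hash(value) % len(self.buckets)]
        i = bisect.bisect_left(bucket, value)
        return i < len(bucket) and bucket[i] == value
```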