Explain in simple terms how Hash Tables are implemented?

VietMX Staff asked 3 years ago

Hash tables are often implemented as arrays of linked lists. If we imagine a table storing people’s names, after a few insertions it might be laid out in memory as below, where numbers in parentheses, like (42), are hash values of the name.

bucket#  bucket content / linked list

[0]      --> "sue"(780) --> null
[1]      null
[2]      --> "fred"(42) --> "bill"(9282) --> "jane"(42) --> null
[3]      --> "mary"(73) --> null
[4]      null
[5]      --> "masayuki"(75) --> "sarwar"(105) --> null
[6]      --> "margaret"(2626) --> null
[7]      null
[8]      --> "bob"(308) --> null
[9]      null

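The bucket assignments in the diagram can be checked with a quick sketch (the hash values here are the ones shown above, not the ones Python's built-in `hash()` would produce):

```python
# Hash values as shown in the diagram (illustrative, not Python's hash()).
hashes = {"sue": 780, "fred": 42, "bill": 9282, "jane": 42,
          "mary": 73, "masayuki": 75, "sarwar": 105,
          "margaret": 2626, "bob": 308}

NUM_BUCKETS = 10  # the diagram has buckets [0]..[9]

for name, h in hashes.items():
    # Each value goes into bucket [hash % number_of_buckets].
    print(f'"{name}" ({h}) -> bucket [{h % NUM_BUCKETS}]')
```

Running this reproduces the layout above: "fred" (42), "bill" (9282), and "jane" (42) all land in bucket [2], while buckets [1], [4], [7], and [9] stay empty.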
A few points:

  • each of the array entries (indices [0][1]…) is known as a bucket, and starts a – possibly empty – linked list of values (aka elements, in this example – people’s names)
  • each value (e.g. "fred" with hash 42) is linked from bucket [hash % number_of_buckets], e.g. 42 % 10 == [2]; % is the modulo operator – the remainder when divided by the number of buckets
  • multiple values may collide at, and be linked from, the same bucket – most often because their hash values collide after the modulo operation (e.g. 42 % 10 == [2], and 9282 % 10 == [2]), but occasionally because the hash values themselves are the same (e.g. "fred" and "jane" are both shown with hash 42 above)
    • most hash tables handle collisions – with slightly reduced performance but no functional confusion – by comparing the full value (here text) of a value being sought or inserted to each value already in the linked list at the hashed-to bucket
  • as the number of stored values grows, hash tables implemented as above tend to resize themselves (i.e. create a bigger array of buckets, build new/updated linked lists from the old entries, then delete the old array) to keep the ratio of values to buckets (aka the load factor) somewhere in the 0.5 to 1.0 range
  • the buckets don’t necessarily have to be linked lists or arrays – they can be any container type, such as a balanced BST, which gives O(log n) worst-case lookups within a bucket. Even so, it’s important to choose a good hash function, to avoid funnelling too many elements into any one bucket
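As a rough sketch of how the points above fit together – bucketing by hash % number_of_buckets, comparing full keys to resolve collisions, and resizing when the load factor gets too high – here is a minimal separate-chaining table (the names `HashTable`, `put`, `get`, and `_resize` are my own, not from any particular library):

```python
class HashTable:
    """Minimal separate-chaining hash table (illustrative sketch only)."""

    def __init__(self, num_buckets=8):
        # Each bucket is a Python list standing in for a linked list
        # of (key, value) pairs.
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0  # number of stored values

    def _index(self, key):
        # Map the key's hash to a bucket: hash % number_of_buckets.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # full-key comparison handles collisions
                bucket[i] = (key, value) # key already present: overwrite
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > 1.0:  # load factor above 1.0
            self._resize()

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self):
        # Create a bigger bucket array and re-link every entry into it.
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(2 * len(old_buckets))]
        for bucket in old_buckets:
            for key, value in bucket:
                self.buckets[self._index(key)].append((key, value))
```

Usage is the usual put/get shape: `t = HashTable(); t.put("fred", 42); t.get("fred")` returns 42. Note that `_resize` must recompute every bucket index, because hash % number_of_buckets changes when the number of buckets changes.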