Python DSA Series 04: The Magic of Hash Tables (Dictionaries)
Python DSA Series 04: The Magic of Hash Tables (Dictionaries)
So far, we’ve seen data structures that store items sequentially. Whether it’s a list or a LinkedList, to find an element, we often have to search through them one by one, leading to O(n) search times.
What if we could have near-instantaneous lookups? What if we could find a value just by knowing its key, without searching at all? This is the magic of the Hash Table. In Python, you know and love this data structure as the dictionary (dict).
How Do Hash Tables Work?
A hash table is a data structure that maps keys to values. The core idea is to use a special function, called a hash function, to compute an index into an underlying array where the value is stored.
The Three Core Components
- The Array (Buckets): The hash table is built on top of an array. Each slot in the array is called a bucket.
- The Hash Function: This is a deterministic function that takes a key as input and returns an integer index (a hash code). A good hash function should distribute keys uniformly across the array buckets.
- Key-Value Pairs: The data you want to store.
The Process:
- You want to store a
(key, value)pair. - The
keyis passed to the hash function, which generates anindex. - The
valueis stored in the array bucket at thatindex.
When you want to retrieve the value, you just repeat the process: hash the key, get the index, and look up that index in the array. Since array access is O(1), this is incredibly fast.
1
2
Key -> [Hash Function] -> Index -> [Array of Buckets]
"name" -> hash("name") -> 3 -> buckets[3] = "Alice"
The Inevitable Problem: Collisions
What happens if two different keys produce the same index? For example, hash("name") and hash("age") both return the index 3. This is called a collision.
A good hash function minimizes collisions, but they are impossible to avoid entirely. The most common way to handle them is a technique called Separate Chaining.
Separate Chaining
With separate chaining, each bucket in the array is not a single value, but another data structure—typically a linked list.
The Process with Collisions:
hash("name")returns index3. The bucketbuckets[3]is empty, so a new linked list is created with the node("name", "Alice").hash("age")also returns index3. This is a collision.- The hash table simply traverses the linked list at
buckets[3]and appends a new node,("age", 30).
Visual Representation:
1
2
3
4
5
buckets[0]: None
buckets[1]: None
buckets[2]: None
buckets[3]: [("name", "Alice")] -> [("age", 30)] -> None
buckets[4]: None
When you look up the value for the key "age", the hash table hashes the key to get index 3, then searches the small linked list at that bucket to find the matching key. As long as the linked lists remain short, the performance is excellent.
Python’s dict in Action
The Python dictionary is a highly optimized and mature implementation of a hash table.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Creating a dictionary
person = {"name": "Alice", "age": 30, "city": "New York"}
# Accessing a value (O(1) on average)
print(person["name"]) # Output: Alice
# Adding/Updating a value (O(1) on average)
person["email"] = "alice@example.com" # Add
person["age"] = 31 # Update
# Deleting a value (O(1) on average)
del person["city"]
print(person)
# Output: {'name': 'Alice', 'age': 31, 'email': 'alice@example.com'}
Time Complexity
This is where hash tables shine.
| Operation | Average Case | Worst Case | Why? |
|---|---|---|---|
| Search | O(1) | O(n) | On average, hashing and indexing are instant. In the worst case, all n keys collide into one bucket, turning the hash table into a single linked list. |
| Insert | O(1) | O(n) | Same reason. |
| Delete | O(1) | O(n) | Same reason. |
The worst case is extremely rare in practice, especially with Python’s robust dict implementation. For all intents and purposes in an interview setting, you can state that dictionary operations are O(1).
Interview Problem: Two Sum (Revisited)
Let’s look back at the two_sum problem from our first post. The optimized solution used a dictionary. Now we know exactly why it’s so fast.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
def two_sum_optimized(nums, target):
num_map = {} # This is our hash table
for i, num in enumerate(nums):
complement = target - num
# 'complement in num_map' is a hash table lookup.
# This is O(1) on average!
if complement in num_map:
return [num_map[complement], i]
# 'num_map[num] = i' is a hash table insertion.
# This is also O(1) on average.
num_map[num] = i
return []
By using a hash table (dict), we reduced the search for the complement from a linear scan (O(n)) to a constant time lookup (O(1)), bringing the overall algorithm from O(n²) down to O(n).
Conclusion
Hash tables are arguably one of the most important data structures in computer science. They power dictionaries, sets, caches, and more. Their ability to provide average-case O(1) time complexity for lookups, insertions, and deletions makes them the default choice for a huge number of problems.
Key Trade-off: Hash tables sacrifice order for speed. If you need to maintain elements in a specific sequence, a hash table (or a standard Python dict before version 3.7) is not the right choice.
Next, we’ll venture into non-linear data structures, starting with Trees, which are essential for representing hierarchical data.