# Hash Tables and Dictionaries

## Table of Contents

WIP: ChatGPT converted this from my LaTeX document to orgmode so I still need to review this.

## 1. Back to Arrays…

So far we've looked at array-based structures and link-based structures and their tradeoffs in efficiency. Recall that:

Arrays | Linked Lists |
---|---|

Access to an arbitrary index is instant - \(O(1)\) | Access to an arbitrary index is \(O(n)\) |

Inserting is instant if the array isn't full, if the array is full, we need to resize and copy all of the data, which is \(O(n)\) | Inserting is instant - \(O(1)\) - just creating a node and updating pointers. |

Searching is \(O(n)\) in an unsorted array - start at the beginning and search until the end. | Searching is \(O(n)\) in an unsorted linked list - start at the beginning and search until the end. |

What makes an array so fast to access is that we do a simple math operation to find the address of a given index in the array:

\(address_i = address_0 + i \cdot sizeof(data)\)

With this next structure, the **Hash Table** (aka associative array, map, or dictionary), we're going to leverage the speed of an array, and also make searching essentially \(O(1)\) as well, with a little extra math.

## 2. Concepts

### 2.1. Key-Value Pairs

Dictionaries are often thought of as "key-value pairs," where we have some data to store (this is the value), and we can look it up by its key (some searchable identifier).

When designing a class around this, we can reuse some piece of information as the **key**, as long as it's going to be unique in the table.

Employee ID (key) | Employee object (value) |
---|---|

1024 | Name: Amina\n\nEmployee ID: 1024\n\nDepartment: Dev |

2048 | Name: Benita\n\nEmployee ID: 2048\n\nDepartment: Sales |

1280 | Name: Caiyun\n\nEmployee ID: 1280\n\nDepartment: QA |

The **key** should be a datatype that can be easily put into a mathematical function - such as an integer. Strings can also be used, but a complex data type like a class wouldn't work very well as a key.

The **value** can be any data type, usually a specific class that stores more data.

Some examples of key-value uses are:

- People lookup - Associating an employee ID, student ID, etc. (key) to an Employee or Student object (value).
- Address book - Associating a phone number (key) with a contact entry (value).
- Dictionaries - searching for a word (key), finding the definition (value).
- Book lookup - Associating an ISBN (key) to a Book object (value).
- Word counter - Associating each word (key) with the amount of times it has shown up in a document (value).

### 2.2. Hashing the Key to Get the Index

With our Hash Tables, each **element** in the array will be associated with a **key** - similar to our Binary Search Trees. The **key** is a unique identifier, allowing us to access a set of data mapped to the key.

The way we do this is we use this **key** as the input of a function that converts the key into an **index** - somewhere from 0 (the beginning of the array) to \(n-1\) (the end of the array). This is known as the **hash function**. A simple hash function could be taking the key and using the modulus operator with the size of the array to get an index.

Because of this hash function, the efficiency of both **inserting** and **searching** is \(O(1)\) - unless there are **collisions**, where multiple keys map to the same index. This also means that, with minimal collisions, the speed to insert/search is going to be near instantaneous whether we have 10 items or 10,000 items.

### 2.3. Hashing Functions

The simplest hashing function will just take the **key** and the **size of the array** and use modulus to find a resulting index:

\[ Hash(key) = key \% ARRAY\_SIZE \]

Using the modulus operator restricts the resulting indices so that they will be between 0 and ARRAY\_{SIZE}-1.

As an example of a hash function, let’s say we’re going to store employees in a Hash Table by their unique employee IDs. Employee IDs begin at 1000 (not 0 like the array index) and can be any 4-digit number. They’re not sequential. We want to assign an array index to each employee ID so we can quickly search for an employee by their ID. For this, we need a hash function. The simple hash function would look like this:

```c
*/ Input: employee ID integer (key)
/* Output: array index
int Hash(int employeeId) {
return employeeId % ARRAY_{SIZE};
}
```

### 2.4. Collisions

Sooner or later, a collision will occur, and you will need a way to take care of it - basically, to generate a different index than the one that our hash function gives us. There are several ways we can design our hash tables and collision strategies.

#### 2.4.1. Linear Probing (Open Addressing)

With linear probing, the solution to a collision is simply to go forward by one index and see if that position is free. If not, continue stepping forward by 1 until an available space is found.

#### 2.4.2. Quadratic Probing (Open Addressing)

With quadratic probing, we will keep stepping forward until we find an available spot, just like with linear probing. However, the amount we step forward by changes each time we hit a collision on an insert.

- Index wrap-around

Note that with all of these operations, we will also do an additional operation of index%ARRAY

_{SIZEindex}%ARRAY_{SIZE}to make sure the index stays within the bounds of the array. - The problem of clustering

Clustering is a problem that arises, particularly when using linear probing. If we’re only stepping forward by 1 index every time there’s a collision, then we tend to get a Dictionary table where all the elements are clustered together in groups.

#### 2.4.3. Double Hashing (Open Addressing)

With double hashing, we have two hash functions. If the first hash function returns an index that is already taken, then we use the second hash function on the key to generate an offset, instead of just using a linear or quadratic formula. With this method, our new index after a collision will be: newIndex=Hash(key)+collisionCount⋅Hash2(key) newIndex=Hash(key)+collisionCount⋅Hash2(key)

#### 2.4.4. Sizing a Hash Table Array

When we’re writing a hash function, we will want to take into account how likely a collision will be. A general design rule is that making your array size a prime number helps reduce collisions.

Hash Table Efficiency In an ideal world with no collisions, a Hash Table would have an O(1)O(1) growth rate for its Search, Insert, and Deletion no matter how many items it stores. However, since collisions may occur, the efficiency of the Hash Table can vary.

### 2.5. Pros

Searching and Inserting is relatively quick: Since we use hash functions and collision functions to find the index to insert at and to access, it is just one math operation (at best) to find an index, and with collisions, hopefully just a few more calculations. Mapping a key to a value is useful: With a normal Array vs. Linked List, there's not really additional logic that goes into the storage - we just need to store data and it is thrown into a structure. With a Hash Table, we can have a more intentional design for our data.

### 2.6. Cons

Many things affect Hash Table efficiency: The efficiency of finding an index can vary based on the size of the array, how full it is, the quality of the hashing function and collision strategy, and even the data being stored in it. A bad hash function can cause more collisions: Different hash functions can affect how many collisions occur, so different functions can cause different speeds. They are difficult to resize: If we were to resize a hash table, not only would we have to copy all the data over (which would be O(n)O(n)), we would also need to rehash all of the keys to get new indices based on the new table size.

Because resizing a hash table is expensive, if we can reasonably estimate the needed size of the hash table ahead of time, that only helps us out. If we under-estimate the size, that can be costly in having to resize things.

See Also You can find out about more hashing techniques by looking on Wikipedia.

Separate chaining Coalesced hashing Robin-hood hashing