Back in college, it seemed like all I cared about was the inner workings of fundamental data structures, but I didn’t do any Haskell at all. Nowadays, I work with Haskell a lot, but rarely have a need to peer into the inner workings of basic data structures. So, I figured why not combine the two somehow?

Eventually, I landed on the idea of building a Robin Hood HashMap. Not only was it something new to me, but I wasn’t able to find many other implementations of Robin Hood hashing in Haskell. I also knew that pure HashMaps tend to take an \(\mathrm{O}(\log n)\) runtime hit, and I wanted to push that to \(\mathrm{O}(1)\), which eventually meant doing things in `ST`

. However, I didn’t want to worry much about inlining, including every possible instance of strictness, etc., and didn’t do any actual profiling at the end—this project was much more about learning how to write such a thing than tuning it up for real usage. I also did fairly barebones testing, so there are likely some bugs lurking around in the final product.

Of course, there’s plenty of background to dive into before implementing any well-known data structure. For me, this mostly came from reading the hashtables library, which promises “mutable hash tables in the ST monad”, as well as Robin Hood Hashing should be your default Hash Table implementation, which has a nicely readable reference implementation of the basic Robin-Hood-related algorithms in C++. The original 1986 paper by Celis is also worth a skim, but it’s fairly hefty on basic background information and usage simulation/profiling.

Most of this post is about specific Haskell implementation details, but it’d be good to first explain what Robin Hood hashing is, and how it differs from other algorithms. I’ll assume you know what hash functions do, and understand vaguely how hash tables with open addressing work. If you already know all this, feel free to skip ahead.

When we want to insert an element into our HashMap, we need to “probe”. This just means looking for a slot we can put the element into.

When you hash something, you get some array index back. Let’s call that the “desired position”. “Probe length” means the distance that an element is from its desired position. If the HashMap were completely empty, no matter where the hash tells the first element to go, it can go in its desired position, so it’d have a probe length of 0. Robin Hood hashing is about reducing this probe length, on average, by taking from “rich” elements (elements close to their desired positions) and giving to “poor” elements (those far away, with a high probe length).

As an example, let’s say I have a HashMap containing the mapping `{ a: 1, b: 2 }`

, with an underlying array of `[a:1][b:2][_][_]`

(where `[_]`

means an empty slot).

I want to insert `{algorithm: 3}`

. My hash function is really dumb: it just reads the first character and maps `a`

to the first slot, `b`

to the second slot, etc. So the key `algorithm`

hashes to the first slot, currently occupied by `a:1`

.

We can’t insert into an occupied slot, so we compare probe lengths. `a`

is distance 0 from its preferred slot, but so is `algorithm`

at this point. So let’s move on to the slot containing `b:2`

. In this scenario, `b:2`

is “rich”—it also has a probe distance of 0, since it’s in its desired position. `algorithm`

is relatively poor, because if we put it in slot 2, it’ll be 1 slot away from its desired position. Because we want to take from the rich, and give to the poor, we swap elements, putting `algorithm:3`

in slot 2, and now probing for a place to put `b:2`

. Our array currently looks like `[a:1][algorithm:3][_][_]`

and we’re trying to insert `b:2`

.

The next slot is open, so `b:2`

goes there. Our final underlying array is `[a:1][algorithm:3][b:2][_]`

. Note that no element has a probe distance of more than 1. In traditional linear probing, we would have ended up with `[a:1][b:2][algorithm:3][_]`

, which means if we want to look up `algorithm`

again, we have to go a relatively long way. Robin Hood hashing helps average out probe distances, so no element ends up too far from its desired position.

Now on to the implementation! Starting from the top and moving down, the first code I wrote was the interface. I came up with a bunch of functions which were simple to build from a small set of basic parts, pretending that I already had those basic parts in hand. Many of the decisions about what to include came from reading through the interface of unordered-containers.

```
-- Derived functions
null :: HashMap k v -> Bool
null h = size h == 0
member :: (Eq k, Hashable k) => k -> HashMap k a -> Bool
member k h = isJust $ lookup k h
(!?) :: (Eq k, Hashable k) => HashMap k v -> k -> Maybe v
(!?) = flip lookup
findWithDefault :: (Eq k, Hashable k) => v -> k -> HashMap k v -> v
findWithDefault v k h = fromMaybe v $ lookup k h
singleton :: Hashable k => k -> v -> Map k v
singleton k v = insert k v empty
fromList :: (Eq k, Hashable k) => [(k, v)] -> HashMap k v
fromList = foldl (flip $ uncurry insert) empty
toList :: HashMap k v -> [(k, v)]
toList = foldl (flip (:)) []
```

After writing out these interface functions, the core set ended up being these:

```
empty :: HashMap k v
size :: HashMap k v -> Int
lookup :: (Eq k, Hashable k) => k -> HashMap k v -> Maybe v
insert :: (Eq k, Hashable k) => k -> v -> HashMap k v -> HashMap k v
delete :: (Eq k, Hashable k) => k -> HashMap k v -> HashMap k v
foldl :: (a -> v -> a) -> a -> HashMap k v -> a
```

These were a good foundation for doing everything I wanted a HashMap to be able to reasonably do, but without getting too bogged down in details. I ended up doing things in the `ST`

monad eventually, so `foldl`

turned into `foldM`

and there were a lot more `ST s`

spread around. But the core idea for what I’d have to implement was set.

The next important thing to decide was what the HashMap would really be, on the inside. I wanted to remove the usual \(\mathrm{O}(\log n)\) hit for using pure data structures, and since traditional Robin Hood hashing uses open addressing (basically, just storing the elements of the hash table in a big array), this hit could have been even worse. Some \(\mathrm{O}(1)\) operations on mutable arrays are much worse on immutable arrays—for example, immutable single-cell updates can still take \(\mathrm{O}(n)\) to copy the entire array.

At around this point, I landed on Tweag’s post on immutability and unboxing arrays. There are a bunch of choices for array programming in Haskell. I tried `STUArray`

but found it didn’t quite have the interface I wanted—specifically, a super convenient way to read and write from positions at some integral index. There’s also no `MArray`

instance over arbitrary element types for `STUArray`

, like there is for `STArray`

. Since I didn’t care too much about extra pointer indirections, I went with `MutableArray`

, a boxed mutable array that was easy to work with in `ST`

. If performance were a bigger factor, or I had more information about my key and value types, I probably would have chosen differently.

Now on to more coding. The most basic and most important type for the HashMap is, unsurprisingly, the `Hash`

.

Following from our C++ reference implementation’s design, we’ll say that if a `Hash`

is ever zero, that means it’s not being used for any keys in the map. We’ll also reserve the most significant bit as a deletion marker, which allows some of our `Hash`

es to act as a “tombstone”. Deleted elements can be overwritten during insertion, but we don’t take the time at deletion to clear them out, versus just setting one bit. `Word32`

is an unsigned 32-bit number, but reserving one bit for deletion means we’re left with \(2^{31} = 2147483648\) possible slots in a completely full map.

As a convenience for setting the deletion bit, let’s also note which one it is ahead of time.

```
-- if a Hash is a Word32 as in [bit31..bit0],
-- its most significant bit is bit 31
mostSignificantBit :: Int
mostSignificantBit = 31
```

Because we’ll be working in `ST`

, the `HashMap`

type itself will really just be a little bit of bookkeeping over an `STRef`

to an actual data structure. Let’s call the inner one `HashMapData`

.

`s`

is `ST`

’s magic variable it uses to enforce proper scoping, and `k`

and `v`

of course refer to the key and value types of our HashMap.

Another choice we have is to store arrays of elems, keys, and hashes all separately, or do some kind of combining. I found it convenient to build a simple `Elem`

type, strict in both the key and the value:

So, we’ll store an array of `Hash`

es as well as an array of `Elem`

s. The HashMap should also know the current number of elements in it—its size—as well as its maximum capacity, and maximum load factor so it knows when to grow. Putting that all together, and littering the whole thing with strictness and unpacking annotations, we get a final data structure like this:

```
data HashMapData s k v = HashMapData
{ hashMapSize :: {-# UNPACK #-} !Int -- number of elements
, hashMapLoadFactor :: {-# UNPACK #-} !Double -- maximum load factor (between 0 and 1)
, hashMapCapacity :: {-# UNPACK #-} !Int -- available slots
, hashMapHashes :: {-# UNPACK #-} !(MutableArray s Hash)
, hashMapElems :: {-# UNPACK #-} !(MutableArray s (Elem k v))
}
```

In Haskell, it’s possible to work with unboxed types more directly, especially via the `MagicHash`

language extension. However, after messing with this for a while, I decided that merely unpacking data constructors was good enough here—I’m using a boxed array type anyways, after all.

That’s it for data structures! Now we’re on to the real meat of the project: implementing behavior. Starting with `size`

is probably easiest, since we have it stored in `HashMapData`

already, and just need to retrieve it.

```
size :: HashMap s k v -> ST s Int
size (HashMap hashMapDataRef) = do
hashMapSize <$> readSTRef hashMapDataRef
```

The only real trick here is remembering that `HashMap`

is really just a wrapper for an `STRef`

to a `HashMapData`

, and then unpacking things appropriately, before wrapping them back up in `ST`

. It’s easy enough to read the ref, then fmap our size-finding function.

My `foldM`

is taken straight out of `Data.HashTable.ST.Basic`

, because I just wanted to be able to build `toList`

as easily as possible, so it’s not that interesting. The opposite direction, `fromList`

, is also a fold, but it folds `insert`

s over a `[(k ,v)]`

. To build up a structure by folding, we need some “zero” or default structure. This leads to `empty`

, which creates a new HashMap with nothing inside it.

To build an empty HashMap, we need a zeroed-out array of `Hash`

es, some memory reserved for `Elem`

s, and a default `capacity`

and `loadFactor`

. The size of an `empty`

HashMap is `0`

, and `empty`

also builds our `STRef`

that will be passed around everywhere else we want access to `HashMapData`

.

```
empty :: ST s (HashMap s k v)
empty = do
hashArr <- newArray initialCapacity zeroBits
elemArr <- newArray initialCapacity undefined
liftM HashMap . newSTRef $ HashMapData 0 loadFactor initialCapacity hashArr elemArr
where
initialCapacity = 256
loadFactor = 0.9
```

Insertions are probably more interesting, but deletions are simpler, so we’ll work our way up. To delete, we find the index where some key maps to, and set the deletion bit (the most significant bit) on the hash there, remembering to decrement the HashMap’s size.

```
delete :: (Hashable k, Eq k) => k -> HashMap s k v -> ST s ()
delete key hm@(HashMap hashMapDataRef) = do
HashMapData{..} <- readSTRef hashMapDataRef
mbIndex <- lookupIndex key hm
whenJust mbIndex $ \index -> do
hashed <- readArray hashMapHashes (fromIntegral index)
let newHash = setBit hashed mostSignificantBit
writeArray hashMapHashes (fromIntegral index) newHash
writeSTRef hashMapDataRef HashMapData{..} { hashMapSize = hashMapSize - 1}
```

I found using `RecordWildCards`

everywhere along with reading the `STRef`

as the first step of every function led to a strangely “global variable”-feeling approach. Whenever I want to reach into some component of `HashMapData`

to read or modify it, it’s always just available. Programming against `ST`

with that always-available kind of thing feels super imperative. However, Haskell still manages to ensure the type safety of everything.

A lookup is a way to take the index a key maps to, if any, and retrieve the value of the element stored there. The trickiest part of looking things up in this style is probably just keeping track of where you are in regards to `ST`

, and where you are in regards to having a `Maybe`

.

```
lookup :: (Eq k, Hashable k) => k -> HashMap s k v -> ST s (Maybe v)
lookup key hm@(HashMap hashMapDataRef) = do
HashMapData{..} <- readSTRef hashMapDataRef
mbIndex <- lookupIndex key hm
case mbIndex of
Nothing -> pure Nothing
Just index -> Just . elemValue <$> readArray hashMapElems (fromIntegral index)
```

Both `delete`

and `lookup`

rely on a helper named `lookupIndex`

, which finds the position some key maps to. It takes the “desired position”—the default a hash maps to if there’s nothing already occupying a slot, and updates it according to whether there’s an unused slot there, or we’ve found the key we’re looking for, or if we need to keep looking.

```
lookupIndex :: (Eq k, Hashable k) => k -> HashMap s k v -> ST s (Maybe Hash)
lookupIndex key hm@(HashMap hashMapDataRef) = do
HashMapData{..} <- readSTRef hashMapDataRef
let hashed = hashKey key
position <- desiredPosition hashed hm
let loopStep dist pos = do
elemHash <- readArray hashMapHashes (fromIntegral pos)
arrAtPos <- readArray hashMapElems (fromIntegral pos)
probeDist <- probeDistance elemHash pos hm
if | elemHash == 0 -> pure Nothing
| dist > probeDist -> pure Nothing
| elemHash == hashed && elemKey arrAtPos == key -> pure $ Just pos
| otherwise -> loopStep (dist + 1) ((pos + 1) .&. (fromIntegral hashMapCapacity - 1))
loopStep 0 position
```

Finally, the Big Kahuna of any proper HashMap implementation: insertion. This is where the Robin Hood algorithm comes into effect, and it’s also where most potential bugs will come into effect as well. First, if the underlying array isn’t big enough to contain our new insertion, we need to grow the table. Once that’s done, we can do the actual insert, and if the insert tells us it added a new key (instead of merely overriding an existing one), we bump up the HashMap’s size.

```
insert :: (Eq k, Hashable k) => k -> v -> HashMap s k v -> ST s ()
insert !key !value (HashMap hashMapDataRef) = do
HashMapData{..} <- readSTRef hashMapDataRef
when (fromIntegral hashMapSize >= fromIntegral hashMapCapacity * hashMapLoadFactor) $ grow (HashMap hashMapDataRef)
HashMapData{..} <- readSTRef hashMapDataRef
!didAddNewKey <- insertWithoutGrowing key value (HashMap hashMapDataRef)
when (didAddNewKey == AddedNewKey) $
writeSTRef hashMapDataRef HashMapData{..} { hashMapSize = hashMapSize + 1 }
```

`DidAddNewKey`

just helps us avoid some boolean blindness when thinking about already-complicated-enough algorithmic things:

Whenever our HashMap starts to exceed a percentage of its capacity, it’s a good idea to increase the capacity. We check whether we should grow upon insertion, and if we should, double the size of the HashMap and reinsert all the old elements.

```
grow :: (Eq k, Hashable k) => HashMap s k v -> ST s ()
grow (HashMap hashMapDataRef) = do
HashMapData{..} <- readSTRef hashMapDataRef
let !oldCapacity = hashMapCapacity
!newCapacity = hashMapCapacity * 2
!oldElems = hashMapElems
!oldHashes = hashMapHashes
!hashArr <- newArray newCapacity (zeroBits :: Hash)
!elemArr <- newArray newCapacity undefined
writeSTRef hashMapDataRef $ HashMapData 0 0.9 newCapacity hashArr elemArr
forM_ [0..oldCapacity - 1] $ \i -> do
!hashed <- readArray oldHashes i
when (hashed /= 0 && not (isDeleted hashed)) $ do
Elem k v <- readArray oldElems i
void $ insert k v (HashMap hashMapDataRef)
```

At last, the big reveal. We hash our key, and run a loop that examines potential insertion locations. If a slot is unused, we can put the element there. If a slot is used by the same key that we’re trying to add something to, we overwrite the value at that key. If a slot is occupied by a tombstone, that’s as good as unused, and we add the new key there. Finally, if none of that is true, we do our actual Robin Hood-ing, and swap out “rich” and “poor” elements until we find a happy home for our newly-inserted element.

```
insertWithoutGrowing :: (Eq k, Hashable k) => k -> v -> HashMap s k v -> ST s DidAddNewKey
insertWithoutGrowing key value hm@(HashMap hashMapDataRef) = do
HashMapData{..} <- readSTRef hashMapDataRef
!position <- desiredPosition (hashKey key) hm
let nextPos p = (p + 1) .&. (fromIntegral hashMapCapacity - 1)
let
loopStep probeDist pos elem hash = do
!hashAtPos <- readArray hashMapHashes $ fromIntegral pos
elemAtPos <- readArray hashMapElems $ fromIntegral pos
probeDistOfElemAtPos <- probeDistance hashAtPos pos hm
if | hashAtPos == 0 -> do
-- put elem in empty slot
construct pos hash elem hm
pure AddedNewKey
| isDeleted hashAtPos -> do
-- put elem in slot occupied by deleted elem
construct pos hash elem hm
pure AddedNewKey
| elemKey elemAtPos == elemKey elem -> do
-- overwrite value with same key
construct pos hash elem hm
pure DidNotAddNewKey
| probeDistOfElemAtPos < probeDist -> do
-- this makes it Robin Hood. swap and try next position with swapped elem
writeArray hashMapHashes (fromIntegral pos) hash
writeArray hashMapElems (fromIntegral pos) elem
loopStep (probeDist + 1) (nextPos pos) elemAtPos hashAtPos
| otherwise -> do
-- try next position
loopStep (probeDist + 1) (nextPos pos) elem hash
loopStep 0 position (Elem key value) (hashKey key)
```

The full code (including some small tests that show how one might use such a HashMap) can be read at this github repo