Implementing a Robin Hood HashMap in Haskell

Back in college, it seemed like all I cared about was the inner workings of fundamental data structures, but I didn’t do any Haskell at all. Nowadays, I work with Haskell a lot, but rarely have a need to peer into the inner workings of basic data structures. So, I figured why not combine the two somehow?

Eventually, I landed on the idea of building a Robin Hood HashMap. Not only was it something new to me, but I wasn’t able to find many other implementations of Robin Hood hashing in Haskell. I also knew that pure HashMaps tend to take an \(\mathrm{O}(\log n)\) runtime hit, and I wanted to push that to \(\mathrm{O}(1)\), which eventually meant doing things in ST. However, I didn’t want to worry much about inlining, adding every possible strictness annotation, and so on, and I didn’t do any actual profiling at the end—this project was much more about learning how to write such a thing than about tuning it up for real usage. I also did fairly barebones testing, so there are likely some bugs lurking in the final product.

Of course, there’s plenty of background to dive into before implementing any well-known data structure. For me, this mostly came from reading the hashtables library, which promises “mutable hash tables in the ST monad”, as well as Robin Hood Hashing should be your default Hash Table implementation, which has a nicely readable reference implementation of the basic Robin-Hood-related algorithms in C++. The original 1986 paper by Celis is also worth a skim, though it spends much of its length on basic background material and on simulation/profiling of usage.

What is Robin Hood hashing?

Most of this post is about specific Haskell implementation details, but it’d be good to first explain what Robin Hood hashing is, and how it differs from other algorithms. I’ll assume you know what hash functions do, and understand vaguely how hash tables with open addressing work. If you already know all this, feel free to skip ahead.

When we want to insert an element into our HashMap, we need to “probe”. This just means looking for a slot we can put the element into.

When you hash something, you get some array index back. Let’s call that the “desired position”. “Probe length” means the distance an element sits from its desired position. If the HashMap were completely empty, then no matter where the hash tells the first element to go, it could go in its desired position, so it’d have a probe length of 0. Robin Hood hashing is about reducing this probe length, on average, by taking from “rich” elements (elements close to their desired positions) and giving to “poor” elements (those far away, with a high probe length).

As an example, let’s say I have a HashMap containing the mapping { a: 1, b: 2 }, with an underlying array of [a:1][b:2][_][_] (where [_] means an empty slot).

I want to insert {algorithm: 3}. My hash function is really dumb: it just reads the first character and maps a to the first slot, b to the second slot, etc. So the key algorithm hashes to the first slot, currently occupied by a:1.

We can’t insert into an occupied slot, so we compare probe lengths. a is distance 0 from its preferred slot, but so is algorithm at this point. So let’s move on to the slot containing b:2. In this scenario, b:2 is “rich”—it also has a probe distance of 0, since it’s in its desired position. algorithm is relatively poor, because if we put it in slot 2, it’ll be 1 slot away from its desired position. Because we want to take from the rich, and give to the poor, we swap elements, putting algorithm:3 in slot 2, and now probing for a place to put b:2. Our array currently looks like [a:1][algorithm:3][_][_] and we’re trying to insert b:2.

The next slot is open, so b:2 goes there. Our final underlying array is [a:1][algorithm:3][b:2][_]. Note that no element has a probe distance of more than 1. In traditional linear probing, we would have ended up with [a:1][b:2][algorithm:3][_], which means if we want to look up algorithm again, we have to go a relatively long way. Robin Hood hashing helps average out probe distances, so no element ends up too far from its desired position.

Interface Functions

Now on to the implementation! Starting from the top and moving down, the first code I wrote was the interface. I came up with a bunch of functions which were simple to build from a small set of basic parts, pretending that I already had those basic parts in hand. Many of the decisions about what to include came from reading through the interface of unordered-containers.

-- Derived functions
null :: HashMap k v -> Bool
null h = size h == 0

member :: (Eq k, Hashable k) => k -> HashMap k a -> Bool
member k h = isJust $ lookup k h

(!?) :: (Eq k, Hashable k) => HashMap k v -> k -> Maybe v
(!?) = flip lookup

findWithDefault :: (Eq k, Hashable k) => v -> k -> HashMap k v -> v
findWithDefault v k h = fromMaybe v $ lookup k h

singleton :: Hashable k => k -> v -> HashMap k v
singleton k v = insert k v empty

fromList :: (Eq k, Hashable k) => [(k, v)] -> HashMap k v
fromList = foldl (flip $ uncurry insert) empty

toList :: HashMap k v -> [(k, v)]
toList = foldl (flip (:)) []

After writing out these derived functions, the core set of functions they rely on ended up being these:

empty :: HashMap k v
size :: HashMap k v -> Int
lookup :: (Eq k, Hashable k) => k -> HashMap k v -> Maybe v
insert :: (Eq k, Hashable k) => k -> v -> HashMap k v -> HashMap k v
delete :: (Eq k, Hashable k) => k -> HashMap k v -> HashMap k v
foldl :: (a -> (k, v) -> a) -> a -> HashMap k v -> a

These were a good foundation for doing everything I wanted a HashMap to be able to reasonably do, but without getting too bogged down in details. I ended up doing things in the ST monad eventually, so foldl turned into foldM and there were a lot more ST s spread around. But the core idea for what I’d have to implement was set.
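
To preview where this all ends up, here’s a rough sketch of how the final ST-based interface gets used. The names match the rest of the post, but treat the exact signatures as approximate:

-- build a map, add a couple of entries, delete one, and look one up;
-- runST keeps all the mutation safely contained
example :: Maybe Int
example = runST $ do
  hm <- empty
  insert "a" 1 hm
  insert "b" 2 hm
  delete "a" hm
  lookup "b" hm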

Types and Internal Data Structures

The next important thing to decide was what the HashMap would really be, on the inside. I wanted to remove the usual \(\mathrm{O}(\log n)\) hit for using pure data structures, and since traditional Robin Hood hashing uses open addressing (basically, just storing the elements of the hash table in a big array), this hit could have been even worse. Some \(\mathrm{O}(1)\) operations on mutable arrays are much worse on immutable arrays—for example, immutable single-cell updates can still take \(\mathrm{O}(n)\) to copy the entire array.

At around this point, I landed on Tweag’s post on immutability and unboxing arrays. There are a bunch of choices for array programming in Haskell. I tried STUArray but found it didn’t quite have the interface I wanted—specifically, a super convenient way to read from and write to positions at some integral index. There’s also no MArray instance over arbitrary element types for STUArray, like there is for STArray. Since I didn’t care too much about extra pointer indirections, I went with MutableArray, a boxed mutable array that was easy to work with in ST. If performance were a bigger factor, or I had more information about my key and value types, I probably would have chosen differently.

Now on to more coding. The most basic and most important type for the HashMap is, unsurprisingly, the Hash.

type Hash = Word32

Following from our C++ reference implementation’s design, we’ll say that if a Hash is ever zero, that means it’s not being used for any keys in the map. We’ll also reserve the most significant bit as a deletion marker, which allows some of our Hashes to act as a “tombstone”. Deleted elements can be overwritten during insertion, but at deletion time we only set that one bit rather than taking the time to clear the slot out. Word32 is an unsigned 32-bit number, but reserving one bit for deletion means we’re left with \(2^{31} = 2147483648\) possible slots in a completely full map.

As a convenience for setting the deletion bit, let’s also note which one it is ahead of time.

-- if a Hash is a Word32 as in [bit31..bit0],
-- its most significant bit is bit 31
mostSignificantBit :: Int
mostSignificantBit = 31
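
These conventions get encoded in a pair of helpers that the rest of the code uses but the post never shows. A minimal sketch, consistent with the C++ reference implementation’s hash_key and is_deleted:

-- Hypothetical sketches of the hash-normalizing helpers assumed below.
-- hashKey clears the reserved deletion bit and avoids 0, since both have
-- special meanings; isDeleted tests the tombstone bit.
hashKey :: Hashable k => k -> Hash
hashKey k =
  let h = clearBit (fromIntegral (hash k)) mostSignificantBit
  in if h == 0 then 1 else h

isDeleted :: Hash -> Bool
isDeleted h = testBit h mostSignificantBit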

Because we’ll be working in ST, the HashMap type itself will really just be a little bit of bookkeeping over an STRef to an actual data structure. Let’s call the inner one HashMapData.

newtype HashMap s k v = HashMap (STRef s (HashMapData s k v))

s is ST’s magic type variable, used to enforce proper scoping, and k and v of course refer to the key and value types of our HashMap.

Another choice we have is whether to store the hashes, keys, and values in separate arrays, or to do some kind of combining. I found it convenient to build a simple Elem type, strict in both the key and the value:

data Elem k v = Elem
  { elemKey   :: !k
  , elemValue :: !v
  }

So, we’ll store an array of Hashes as well as an array of Elems. The HashMap should also know the current number of elements in it—its size—as well as its maximum capacity, and maximum load factor so it knows when to grow. Putting that all together, and littering the whole thing with strictness and unpacking annotations, we get a final data structure like this:

data HashMapData s k v = HashMapData
  { hashMapSize       :: {-# UNPACK #-} !Int    -- number of elements
  , hashMapLoadFactor :: {-# UNPACK #-} !Double -- maximum load factor (between 0 and 1)
  , hashMapCapacity   :: {-# UNPACK #-} !Int    -- available slots
  , hashMapHashes     :: {-# UNPACK #-} !(MutableArray s Hash)
  , hashMapElems      :: {-# UNPACK #-} !(MutableArray s (Elem k v))
  }

In Haskell, it’s possible to work with unboxed types more directly, especially via the MagicHash language extension. However, after messing with this for a while, I decided that merely unpacking data constructors was good enough here—I’m using a boxed array type anyway, after all.
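
Before the data declarations turn into running code, it’s worth pinning down the plumbing the snippets assume. The post doesn’t show its module header, so the following is a best-guess reconstruction based on what the code uses:

{-# LANGUAGE BangPatterns    #-}
{-# LANGUAGE MultiWayIf      #-}
{-# LANGUAGE RecordWildCards #-}

-- reconstructed imports; MutableArray and friends come from the
-- primitive package, Hashable from hashable
import Control.Monad (forM_, liftM, void, when)
import Control.Monad.ST (ST, runST)
import Data.Bits (clearBit, setBit, testBit, zeroBits, (.&.))
import Data.Hashable (Hashable, hash)
import Data.Maybe (fromMaybe, isJust)
import Data.Primitive.Array (MutableArray, newArray, readArray, writeArray)
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef)
import Data.Word (Word32)
import Prelude hiding (lookup, null)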

Implementing Useful Functions

That’s it for data structures! Now we’re on to the real meat of the project: implementing behavior. Starting with size is probably easiest, since we have it stored in HashMapData already, and just need to retrieve it.

size :: HashMap s k v -> ST s Int
size (HashMap hashMapDataRef) = do
  hashMapSize <$> readSTRef hashMapDataRef

The only real trick here is remembering that HashMap is really just a wrapper for an STRef to a HashMapData, and then unpacking things appropriately, before wrapping them back up in ST. It’s easy enough to read the ref, then fmap our size-finding function.

My foldM is taken straight out of Data.HashTable.ST.Basic, because I just wanted to be able to build toList as easily as possible, so it’s not that interesting. The opposite direction, fromList, is also a fold, but it folds inserts over a [(k, v)]. To build up a structure by folding, we need some “zero” or default structure. This leads to empty, which creates a new HashMap with nothing inside it.
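
Concretely, here’s a sketch of what a foldM in that style amounts to: visit every slot in order, skip the empty and tombstoned ones, and fold the live key-value pairs into an accumulator. (This is my reconstruction, not the hashtables code.)

-- fold over every live element in the map, in slot order
foldM :: (a -> (k, v) -> ST s a) -> a -> HashMap s k v -> ST s a
foldM f seed (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let go !acc i
        | i >= hashMapCapacity = pure acc
        | otherwise = do
            hashed <- readArray hashMapHashes i
            -- a slot only counts if it's occupied and not a tombstone
            if hashed /= 0 && not (isDeleted hashed)
              then do
                Elem k v <- readArray hashMapElems i
                acc' <- f acc (k, v)
                go acc' (i + 1)
              else go acc (i + 1)
  go seed 0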

To build an empty HashMap, we need a zeroed-out array of Hashes, some memory reserved for Elems, and a default capacity and loadFactor. The size of an empty HashMap is 0, and empty also builds our STRef that will be passed around everywhere else we want access to HashMapData.

empty :: ST s (HashMap s k v)
empty = do
  hashArr <- newArray initialCapacity zeroBits
  elemArr <- newArray initialCapacity undefined
  liftM HashMap . newSTRef $ HashMapData 0 loadFactor initialCapacity hashArr elemArr
  where
    initialCapacity = 256
    loadFactor = 0.9

Insertions are probably more interesting, but deletions are simpler, so we’ll work our way up. To delete, we find the index where some key maps to, and set the deletion bit (the most significant bit) on the hash there, remembering to decrement the HashMap’s size.

delete :: (Hashable k, Eq k) => k -> HashMap s k v -> ST s ()
delete key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  mbIndex <- lookupIndex key hm
  whenJust mbIndex $ \index -> do
    hashed <- readArray hashMapHashes (fromIntegral index)
    let newHash = setBit hashed mostSignificantBit
    writeArray hashMapHashes (fromIntegral index) newHash
    writeSTRef hashMapDataRef HashMapData{..} { hashMapSize = hashMapSize - 1 }
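
One small aside: whenJust isn’t in base. It comes from the extra package’s Control.Monad.Extra, though it’s also a one-liner to define by hand:

-- run an action on the contents of a Just, or do nothing for Nothing
whenJust :: Applicative f => Maybe a -> (a -> f ()) -> f ()
whenJust mb f = maybe (pure ()) f mb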

I found using RecordWildCards everywhere along with reading the STRef as the first step of every function led to a strangely “global variable”-feeling approach. Whenever I want to reach into some component of HashMapData to read or modify it, it’s always just available. Programming against ST with that always-available kind of thing feels super imperative. However, Haskell still manages to ensure the type safety of everything.

A lookup is a way to take the index a key maps to, if any, and retrieve the value of the element stored there. The trickiest part of looking things up in this style is probably just keeping track of where you are with respect to ST, and where you are with respect to the Maybe.

lookup :: (Eq k, Hashable k) => k -> HashMap s k v -> ST s (Maybe v)
lookup key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  mbIndex <- lookupIndex key hm
  case mbIndex of
    Nothing -> pure Nothing
    Just index -> Just . elemValue <$> readArray hashMapElems (fromIntegral index)

Both delete and lookup rely on a helper named lookupIndex, which finds the position some key maps to. It starts from the “desired position” (the slot a hash maps to when nothing already occupies it) and walks forward depending on whether the current slot is unused, holds the key we’re looking for, or tells us we need to keep looking.

lookupIndex :: (Eq k, Hashable k) => k -> HashMap s k v -> ST s (Maybe Hash)
lookupIndex key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let hashed = hashKey key
  position <- desiredPosition hashed hm

  let loopStep dist pos = do
        elemHash <- readArray hashMapHashes (fromIntegral pos)
        arrAtPos <- readArray hashMapElems (fromIntegral pos)
        probeDist <- probeDistance elemHash pos hm

        if | elemHash == 0 -> pure Nothing
           | dist > probeDist -> pure Nothing
           | elemHash == hashed && elemKey arrAtPos == key -> pure $ Just pos
           | otherwise -> loopStep (dist + 1) ((pos + 1) .&. (fromIntegral hashMapCapacity - 1))

  loopStep 0 position
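
lookupIndex leans on two helpers the post doesn’t show: desiredPosition masks a hash down to its home slot, and probeDistance measures how far an element sits from that home. Here’s a plausible sketch of each; both rely on the capacity always being a power of two (it starts at 256 and doubles), which is why .&. (capacity - 1) works as a cheap modulo throughout this code:

-- Hypothetical sketches of the probing helpers. Both assume a
-- power-of-two capacity, so masking with (capacity - 1) acts as modulo.
desiredPosition :: Hash -> HashMap s k v -> ST s Hash
desiredPosition hashed (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  -- the mask also strips the deletion bit, so a tombstone reports the
  -- home slot of the key it used to hold
  pure $ hashed .&. (fromIntegral hashMapCapacity - 1)

probeDistance :: Hash -> Hash -> HashMap s k v -> ST s Hash
probeDistance elemHash pos hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  desired <- desiredPosition elemHash hm
  let capacity = fromIntegral hashMapCapacity
  -- distance from home, wrapping around the end of the array
  pure $ (pos + capacity - desired) .&. (capacity - 1)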

Finally, the Big Kahuna of any proper HashMap implementation: insertion. This is where the Robin Hood algorithm comes into effect, and it’s also where most potential bugs creep in. First, if the underlying array isn’t big enough to contain our new insertion, we need to grow the table. Once that’s done, we can do the actual insert, and if the insert tells us it added a new key (instead of merely overwriting an existing one), we bump up the HashMap’s size.

insert :: (Eq k, Hashable k) => k -> v -> HashMap s k v -> ST s ()
insert !key !value (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  when (fromIntegral hashMapSize >= fromIntegral hashMapCapacity * hashMapLoadFactor) $ grow (HashMap hashMapDataRef)
  HashMapData{..} <- readSTRef hashMapDataRef
  !didAddNewKey <- insertWithoutGrowing key value (HashMap hashMapDataRef)
  when (didAddNewKey == AddedNewKey) $
    writeSTRef hashMapDataRef HashMapData{..} { hashMapSize = hashMapSize + 1 }

DidAddNewKey just helps us avoid some boolean blindness when thinking about already-complicated-enough algorithmic things:

data DidAddNewKey = AddedNewKey | DidNotAddNewKey
  deriving Eq

Whenever our HashMap starts to exceed a percentage of its capacity, it’s a good idea to increase that capacity. We check whether we should grow upon insertion, and if we should, we double the capacity and reinsert all the old elements.

grow :: (Eq k, Hashable k) => HashMap s k v -> ST s ()
grow (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let !oldCapacity = hashMapCapacity
      !newCapacity = hashMapCapacity * 2
      !oldElems = hashMapElems
      !oldHashes = hashMapHashes
  !hashArr <- newArray newCapacity (zeroBits :: Hash)
  !elemArr <- newArray newCapacity undefined
  writeSTRef hashMapDataRef $ HashMapData 0 hashMapLoadFactor newCapacity hashArr elemArr
  forM_ [0..oldCapacity - 1] $ \i -> do
    !hashed <- readArray oldHashes i
    when (hashed /= 0 && not (isDeleted hashed)) $ do
      Elem k v <- readArray oldElems i
      void $ insert k v (HashMap hashMapDataRef)

At last, the big reveal. We hash our key, and run a loop that examines potential insertion locations. If a slot is unused, we can put the element there. If a slot is used by the same key that we’re trying to add something to, we overwrite the value at that key. If a slot is occupied by a tombstone, that’s as good as unused, and we add the new key there. Finally, if none of that is true, we do our actual Robin Hood-ing, and swap out “rich” and “poor” elements until we find a happy home for our newly-inserted element.

insertWithoutGrowing :: (Eq k, Hashable k) => k -> v -> HashMap s k v -> ST s DidAddNewKey
insertWithoutGrowing key value hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  !position <- desiredPosition (hashKey key) hm
  let nextPos p = (p + 1) .&. (fromIntegral hashMapCapacity - 1)

  let
    loopStep probeDist pos elem hash = do
      !hashAtPos <- readArray hashMapHashes $ fromIntegral pos
      elemAtPos <- readArray hashMapElems $ fromIntegral pos
      probeDistOfElemAtPos <- probeDistance hashAtPos pos hm

      if | hashAtPos == 0 -> do
             -- put elem in empty slot
             construct pos hash elem hm
             pure AddedNewKey

         | isDeleted hashAtPos -> do
             -- put elem in slot occupied by deleted elem
             construct pos hash elem hm
             pure AddedNewKey

         | elemKey elemAtPos == elemKey elem -> do
             -- overwrite value with same key
             construct pos hash elem hm
             pure DidNotAddNewKey

         | probeDistOfElemAtPos < probeDist -> do
             -- this makes it Robin Hood. swap, then keep probing with the
             -- displaced elem, which carries on from its own probe distance
             writeArray hashMapHashes (fromIntegral pos) hash
             writeArray hashMapElems (fromIntegral pos) elem
             loopStep (probeDistOfElemAtPos + 1) (nextPos pos) elemAtPos hashAtPos

         | otherwise -> do
             -- try next position
             loopStep (probeDist + 1) (nextPos pos) elem hash

  loopStep 0 position (Elem key value) (hashKey key)
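
That leaves construct, which the post never shows. Presumably it just writes the given hash and element into the given slot; a minimal hypothetical sketch:

-- store a hash and its element at a position, unconditionally
construct :: Hash -> Hash -> Elem k v -> HashMap s k v -> ST s ()
construct pos hashed el (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  writeArray hashMapHashes (fromIntegral pos) hashed
  writeArray hashMapElems (fromIntegral pos) el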

The full code (including some small tests that show how one might use such a HashMap) can be read at this github repo.