Back in college, it seemed like all I cared about was the inner workings of fundamental data structures, but I didn’t do any Haskell at all. Nowadays, I work with Haskell a lot, but rarely have a need to peer into the inner workings of basic data structures. So, I figured why not combine the two somehow?
Eventually, I landed on the idea of building a Robin Hood HashMap. Not only was it something new to me, but I wasn’t able to find many other implementations of Robin Hood hashing in Haskell. I also knew that pure HashMaps tend to take an \(\mathrm{O}(\log n)\) runtime hit, and I wanted to push that to \(\mathrm{O}(1)\), which eventually meant doing things in ST. However, I didn’t want to worry much about inlining, adding every possible strictness annotation, and so on, and I didn’t do any actual profiling at the end—this project was much more about learning how to write such a thing than tuning it up for real usage. I also did fairly barebones testing, so there are likely some bugs lurking in the final product.
Of course, there’s plenty of background to dive into before implementing any well-known data structure. For me, this mostly came from reading the hashtables library, which promises “mutable hash tables in the ST monad”, as well as “Robin Hood Hashing should be your default Hash Table implementation”, which has a nicely readable reference implementation of the basic Robin-Hood-related algorithms in C++. The original 1986 paper by Celis is also worth a skim, though it’s fairly heavy on basic background information and usage simulation/profiling.
Most of this post is about specific Haskell implementation details, but it’d be good to first explain what Robin Hood hashing is, and how it differs from other algorithms. I’ll assume you know what hash functions do, and understand vaguely how hash tables with open addressing work. If you already know all this, feel free to skip ahead.
When we want to insert an element into our HashMap, we need to “probe”. This just means looking for a slot we can put the element into.
When you hash something, you get some array index back. Let’s call that the “desired position”. “Probe length” means the distance an element is from its desired position. If the HashMap were completely empty, then no matter where the hash told the first element to go, it could go in its desired position, so it’d have a probe length of 0. Robin Hood hashing is about reducing this probe length, on average, by taking from “rich” elements (elements close to their desired positions) and giving to “poor” elements (those far away, with a high probe length).
As an example, let’s say I have a HashMap containing the mapping { a: 1, b: 2 }, with an underlying array of [a:1][b:2][_][_] (where [_] means an empty slot). I want to insert {algorithm: 3}. My hash function is really dumb: it just reads the first character and maps a to the first slot, b to the second slot, etc. So the key algorithm hashes to the first slot, currently occupied by a:1.
We can’t insert into an occupied slot, so we compare probe lengths. a is distance 0 from its preferred slot, but so is algorithm at this point. So let’s move on to the slot containing b:2. In this scenario, b:2 is “rich”—it also has a probe distance of 0, since it’s in its desired position. algorithm is relatively poor, because if we put it in slot 2, it’ll be 1 slot away from its desired position. Because we want to take from the rich and give to the poor, we swap elements, putting algorithm:3 in slot 2, and now probing for a place to put b:2. Our array currently looks like [a:1][algorithm:3][_][_] and we’re trying to insert b:2.
The next slot is open, so b:2 goes there. Our final underlying array is [a:1][algorithm:3][b:2][_]. Note that no element has a probe distance of more than 1. In traditional linear probing, we would have ended up with [a:1][b:2][algorithm:3][_], which means if we want to look up algorithm again, we have to go a relatively long way. Robin Hood hashing helps average out probe distances, so no element ends up too far from its desired position.
Now on to the implementation! Starting from the top and moving down, the first code I wrote was the interface. I came up with a bunch of functions which were simple to build from a small set of basic parts, pretending that I already had those basic parts in hand. Many of the decisions about what to include came from reading through the interface of unordered-containers.
-- Derived functions
null :: HashMap k v -> Bool
null h = size h == 0

member :: (Eq k, Hashable k) => k -> HashMap k a -> Bool
member k h = isJust $ lookup k h

(!?) :: (Eq k, Hashable k) => HashMap k v -> k -> Maybe v
(!?) = flip lookup

findWithDefault :: (Eq k, Hashable k) => v -> k -> HashMap k v -> v
findWithDefault v k h = fromMaybe v $ lookup k h

singleton :: (Eq k, Hashable k) => k -> v -> HashMap k v
singleton k v = insert k v empty

fromList :: (Eq k, Hashable k) => [(k, v)] -> HashMap k v
fromList = foldl (flip $ uncurry insert) empty -- the list foldl from the Prelude

toList :: HashMap k v -> [(k, v)]
toList = foldl (flip (:)) [] -- our own foldl, one of the core functions below
After writing out these interface functions, the core set ended up being these:
empty :: HashMap k v
size :: HashMap k v -> Int
lookup :: (Eq k, Hashable k) => k -> HashMap k v -> Maybe v
insert :: (Eq k, Hashable k) => k -> v -> HashMap k v -> HashMap k v
delete :: (Eq k, Hashable k) => k -> HashMap k v -> HashMap k v
foldl :: (a -> (k, v) -> a) -> a -> HashMap k v -> a
These were a good foundation for doing everything I wanted a HashMap to be able to reasonably do, but without getting too bogged down in details. I ended up doing things in the ST monad eventually, so foldl turned into foldM and there were a lot more ST s’s spread around. But the core idea for what I’d have to implement was set.
The next important thing to decide was what the HashMap would really be, on the inside. I wanted to remove the usual \(\mathrm{O}(\log n)\) hit for using pure data structures, and since traditional Robin Hood hashing uses open addressing (basically, just storing the elements of the hash table in a big array), this hit could have been even worse. Some \(\mathrm{O}(1)\) operations on mutable arrays are much worse on immutable arrays—for example, immutable single-cell updates can still take \(\mathrm{O}(n)\) to copy the entire array.
At around this point, I landed on Tweag’s post on immutability and unboxing arrays. There are a bunch of choices for array programming in Haskell. I tried STUArray but found it didn’t quite have the interface I wanted—specifically, a super convenient way to read and write from positions at some integral index. There’s also no MArray instance over arbitrary element types for STUArray, like there is for STArray. Since I didn’t care too much about extra pointer indirections, I went with MutableArray, a boxed mutable array that was easy to work with in ST. If performance were a bigger factor, or I had more information about my key and value types, I probably would have chosen differently.
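The snippets in this post are consistent with the MutableArray from the primitive package’s Data.Primitive.Array. Assuming that’s the one in play, the basic interface feels like this (arrayDemo is just a hypothetical illustration, not part of the HashMap):

import Control.Monad.ST (runST)
import Data.Primitive.Array (newArray, readArray, writeArray)

-- allocate, write, read; runST keeps the mutation locally scoped
arrayDemo :: Int
arrayDemo = runST $ do
  arr <- newArray 4 (0 :: Int) -- four slots, all initialized to 0
  writeArray arr 2 42
  readArray arr 2              -- yields 42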
Now on to more coding. The most basic and most important type for the HashMap is, unsurprisingly, the Hash.
Following our C++ reference implementation’s design, we’ll say that if a Hash is ever zero, it’s not being used for any key in the map. We’ll also reserve the most significant bit as a deletion marker, which allows some of our Hashes to act as “tombstones”: deleted elements can be overwritten during insertion, but at deletion time we only set that one bit rather than taking the time to clear the slot out. A Word32 is an unsigned 32-bit number, but reserving one bit for deletion means we’re left with \(2^{31} = 2147483648\) possible slots in a completely full map.
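The Hash type itself isn’t spelled out here, but given the description above, it’s presumably just a synonym:

-- zero means "unused slot"; the most significant bit marks deletion
type Hash = Word32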
As a convenience for setting the deletion bit, let’s also note which one it is ahead of time.
-- if a Hash is a Word32 as in [bit31..bit0],
-- its most significant bit is bit 31
mostSignificantBit :: Int
mostSignificantBit = 31
Because we’ll be working in ST, the HashMap type itself will really just be a little bit of bookkeeping over an STRef to an actual data structure. Let’s call the inner one HashMapData.
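Judging by how it gets pattern-matched later, the wrapper itself is presumably nothing more than a newtype around that STRef:

newtype HashMap s k v = HashMap (STRef s (HashMapData s k v))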
Here, s is ST’s magic variable it uses to enforce proper scoping, and k and v of course refer to the key and value types of our HashMap.
Another choice we have is whether to store elems, keys, and hashes all in separate arrays, or to do some kind of combining. I found it convenient to build a simple Elem type, strict in both the key and the value:
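-- reconstructed from the accessors used later; strict in both fields
data Elem k v = Elem
  { elemKey   :: !k
  , elemValue :: !v
  }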
So, we’ll store an array of Hashes as well as an array of Elems. The HashMap should also know the current number of elements in it—its size—as well as its maximum capacity and maximum load factor, so it knows when to grow. Putting that all together, and littering the whole thing with strictness and unpacking annotations, we get a final data structure like this:
data HashMapData s k v = HashMapData
  { hashMapSize       :: {-# UNPACK #-} !Int    -- number of elements
  , hashMapLoadFactor :: {-# UNPACK #-} !Double -- maximum load factor (between 0 and 1)
  , hashMapCapacity   :: {-# UNPACK #-} !Int    -- available slots
  , hashMapHashes     :: {-# UNPACK #-} !(MutableArray s Hash)
  , hashMapElems      :: {-# UNPACK #-} !(MutableArray s (Elem k v))
  }
In Haskell, it’s possible to work with unboxed types more directly, especially via the MagicHash language extension. However, after messing with this for a while, I decided that merely unpacking data constructors was good enough here—I’m using a boxed array type anyways, after all.
That’s it for data structures! Now we’re on to the real meat of the project: implementing behavior. Starting with size is probably easiest, since we have it stored in HashMapData already, and just need to retrieve it.
size :: HashMap s k v -> ST s Int
size (HashMap hashMapDataRef) =
  hashMapSize <$> readSTRef hashMapDataRef
The only real trick here is remembering that HashMap is really just a wrapper for an STRef to a HashMapData, and then unpacking things appropriately, before wrapping them back up in ST. It’s easy enough to read the ref, then fmap our size-finding function.
My foldM is taken straight out of Data.HashTable.ST.Basic, because I just wanted to be able to build toList as easily as possible, so it’s not that interesting.
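For a rough idea of its shape, here’s a hedged sketch (not the repo’s exact code): it walks every slot and folds over the live elements, skipping unused and tombstoned ones. The isDeleted helper is sketched a bit further down.

foldM :: (a -> (k, v) -> ST s a) -> a -> HashMap s k v -> ST s a
foldM f seed (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let go !acc !i
        | i >= hashMapCapacity = pure acc
        | otherwise = do
            hashed <- readArray hashMapHashes i
            if hashed /= 0 && not (isDeleted hashed)
              then do
                Elem k v <- readArray hashMapElems i
                acc' <- f acc (k, v)
                go acc' (i + 1)
              else go acc (i + 1)
  go seed 0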
The opposite direction, fromList, is also a fold, but it folds inserts over a [(k, v)]. To build up a structure by folding, we need some “zero” or default structure. This leads to empty, which creates a new HashMap with nothing inside it.
To build an empty HashMap, we need a zeroed-out array of Hashes, some memory reserved for Elems, and a default capacity and loadFactor. The size of an empty HashMap is 0, and empty also builds our STRef that will be passed around everywhere else we want access to HashMapData.
empty :: ST s (HashMap s k v)
empty = do
  hashArr <- newArray initialCapacity zeroBits
  -- 'undefined' here is never forced: a zero hash marks a slot as
  -- unused, so we never read an element that wasn't written first
  elemArr <- newArray initialCapacity undefined
  liftM HashMap . newSTRef $ HashMapData 0 loadFactor initialCapacity hashArr elemArr
  where
    initialCapacity = 256
    loadFactor = 0.9
Insertions are probably more interesting, but deletions are simpler, so we’ll work our way up. To delete, we find the index where some key maps to, and set the deletion bit (the most significant bit) on the hash there, remembering to decrement the HashMap’s size.
delete :: (Hashable k, Eq k) => k -> HashMap s k v -> ST s ()
delete key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  mbIndex <- lookupIndex key hm
  whenJust mbIndex $ \index -> do
    hashed <- readArray hashMapHashes (fromIntegral index)
    let newHash = setBit hashed mostSignificantBit
    writeArray hashMapHashes (fromIntegral index) newHash
    writeSTRef hashMapDataRef HashMapData{..} { hashMapSize = hashMapSize - 1 }
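Neither whenJust nor isDeleted (which shows up in grow below) appears in the post; they presumably look something like this:

-- run an action only when the Maybe has a value
whenJust :: Applicative f => Maybe a -> (a -> f ()) -> f ()
whenJust mb f = maybe (pure ()) f mb

-- a hash is a tombstone if its most significant bit is set
isDeleted :: Hash -> Bool
isDeleted h = testBit h mostSignificantBit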
I found that using RecordWildCards everywhere, along with reading the STRef as the first step of every function, led to a strangely “global variable”-feeling approach. Whenever I want to reach into some component of HashMapData to read or modify it, it’s always just available. Programming against ST with that always-available kind of thing feels super imperative. However, Haskell still manages to ensure the type safety of everything.
A lookup is a way to take the index a key maps to, if any, and retrieve the value of the element stored there. The trickiest part of looking things up in this style is probably just keeping track of where you are with regard to ST, and where you are with regard to having a Maybe.
lookup :: (Eq k, Hashable k) => k -> HashMap s k v -> ST s (Maybe v)
lookup key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  mbIndex <- lookupIndex key hm
  case mbIndex of
    Nothing -> pure Nothing
    Just index -> Just . elemValue <$> readArray hashMapElems (fromIntegral index)
Both delete and lookup rely on a helper named lookupIndex, which finds the position some key maps to. It takes the “desired position” (the default slot a hash maps to if nothing is already occupying it) and updates it according to whether there’s an unused slot there, we’ve found the key we’re looking for, or we need to keep looking.
lookupIndex :: (Eq k, Hashable k) => k -> HashMap s k v -> ST s (Maybe Hash)
lookupIndex key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let hashed = hashKey key
  position <- desiredPosition hashed hm
  let loopStep dist pos = do
        elemHash <- readArray hashMapHashes (fromIntegral pos)
        arrAtPos <- readArray hashMapElems (fromIntegral pos)
        probeDist <- probeDistance elemHash pos hm
        if | elemHash == 0 -> pure Nothing    -- truly empty slot: the key can't be here
           | dist > probeDist -> pure Nothing -- Robin Hood invariant: we'd have displaced this element already
           | elemHash == hashed && elemKey arrAtPos == key -> pure $ Just pos
           | otherwise -> loopStep (dist + 1) ((pos + 1) .&. (fromIntegral hashMapCapacity - 1))
  loopStep 0 position
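lookupIndex also relies on hashKey, desiredPosition, and probeDistance, none of which appear in the post. Here are sketches of plausible definitions, assuming (as the masking in the loop suggests) that the capacity is always a power of two, and following the C++ reference implementation’s handling of the reserved hash bits; the repo’s actual code may differ:

-- mask off the deletion bit and avoid the reserved value 0
hashKey :: Hashable k => k -> Hash
hashKey k =
  let !h = fromIntegral (hash k) .&. complement (bit mostSignificantBit)
  in if h == 0 then 1 else h -- 0 is reserved to mean "unused slot"

-- where an element wants to live: its hash, masked into the table
desiredPosition :: Hash -> HashMap s k v -> ST s Hash
desiredPosition hashed (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  pure $ hashed .&. (fromIntegral hashMapCapacity - 1)

-- how far the element at slot 'pos' has wandered from its desired
-- position, accounting for wraparound
probeDistance :: Hash -> Hash -> HashMap s k v -> ST s Hash
probeDistance elemHash pos hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  desired <- desiredPosition (clearBit elemHash mostSignificantBit) hm
  pure $ (pos + fromIntegral hashMapCapacity - desired)
           .&. (fromIntegral hashMapCapacity - 1)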
Finally, the Big Kahuna of any proper HashMap implementation: insertion. This is where the Robin Hood algorithm comes into play, and it’s also where most potential bugs creep in. First, if the underlying array isn’t big enough to hold our new insertion, we need to grow the table. Once that’s done, we can do the actual insert, and if the insert tells us it added a new key (instead of merely overwriting an existing one), we bump up the HashMap’s size.
insert :: (Eq k, Hashable k) => k -> v -> HashMap s k v -> ST s ()
insert !key !value (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  when (fromIntegral hashMapSize >= fromIntegral hashMapCapacity * hashMapLoadFactor) $
    grow (HashMap hashMapDataRef)
  HashMapData{..} <- readSTRef hashMapDataRef
  !didAddNewKey <- insertWithoutGrowing key value (HashMap hashMapDataRef)
  when (didAddNewKey == AddedNewKey) $
    writeSTRef hashMapDataRef HashMapData{..} { hashMapSize = hashMapSize + 1 }
DidAddNewKey just helps us avoid some boolean blindness when thinking about already-complicated-enough algorithmic things:
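-- presumably defined along these lines; Eq is needed for the
-- (== AddedNewKey) check in insert
data DidAddNewKey = AddedNewKey | DidNotAddNewKey
  deriving (Eq)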
Whenever our HashMap starts to exceed a set fraction of its capacity, it’s a good idea to increase the capacity. We check whether we should grow upon insertion, and if we should, double the capacity of the HashMap and reinsert all the old elements.
grow :: (Eq k, Hashable k) => HashMap s k v -> ST s ()
grow (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let !oldCapacity = hashMapCapacity
      !newCapacity = hashMapCapacity * 2
      !oldElems = hashMapElems
      !oldHashes = hashMapHashes
  !hashArr <- newArray newCapacity (zeroBits :: Hash)
  !elemArr <- newArray newCapacity undefined
  -- reuse the configured load factor rather than hard-coding it
  writeSTRef hashMapDataRef $ HashMapData 0 hashMapLoadFactor newCapacity hashArr elemArr
  forM_ [0..oldCapacity - 1] $ \i -> do
    !hashed <- readArray oldHashes i
    when (hashed /= 0 && not (isDeleted hashed)) $ do
      Elem k v <- readArray oldElems i
      void $ insert k v (HashMap hashMapDataRef)
At last, the big reveal. We hash our key, and run a loop that examines potential insertion locations. If a slot is unused, we can put the element there. If a slot is used by the same key that we’re trying to add something to, we overwrite the value at that key. If a slot is occupied by a tombstone, that’s as good as unused, and we add the new key there. Finally, if none of that is true, we do our actual Robin Hood-ing, and swap out “rich” and “poor” elements until we find a happy home for our newly-inserted element.
insertWithoutGrowing :: (Eq k, Hashable k) => k -> v -> HashMap s k v -> ST s DidAddNewKey
insertWithoutGrowing key value hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  !position <- desiredPosition (hashKey key) hm
  let nextPos p = (p + 1) .&. (fromIntegral hashMapCapacity - 1)
  let loopStep probeDist pos elem hash = do
        !hashAtPos <- readArray hashMapHashes $ fromIntegral pos
        elemAtPos <- readArray hashMapElems $ fromIntegral pos
        probeDistOfElemAtPos <- probeDistance hashAtPos pos hm
        if | hashAtPos == 0 -> do
               -- put elem in empty slot
               construct pos hash elem hm
               pure AddedNewKey
           | isDeleted hashAtPos -> do
               -- put elem in slot occupied by deleted elem
               construct pos hash elem hm
               pure AddedNewKey
           | elemKey elemAtPos == elemKey elem -> do
               -- overwrite value with same key
               construct pos hash elem hm
               pure DidNotAddNewKey
           | probeDistOfElemAtPos < probeDist -> do
               -- this makes it Robin Hood: swap, and keep probing with the swapped-out elem
               writeArray hashMapHashes (fromIntegral pos) hash
               writeArray hashMapElems (fromIntegral pos) elem
               loopStep (probeDist + 1) (nextPos pos) elemAtPos hashAtPos
           | otherwise -> do
               -- try next position
               loopStep (probeDist + 1) (nextPos pos) elem hash
  loopStep 0 position (Elem key value) (hashKey key)
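The last unshown helper is construct, which just writes a hash and an element into a slot. A plausible definition, though again the repo’s may differ:

construct :: Hash -> Hash -> Elem k v -> HashMap s k v -> ST s ()
construct pos hashed el (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  writeArray hashMapHashes (fromIntegral pos) hashed
  writeArray hashMapElems (fromIntegral pos) el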
The full code (including some small tests that show how one might use such a HashMap) can be read at this github repo.