Back in college, it seemed like all I cared about was the inner workings of fundamental data structures, but I didn’t do any Haskell at all. Nowadays, I work with Haskell a lot, but rarely have a need to peer into the inner workings of basic data structures. So, I figured why not combine the two somehow?
Eventually, I landed on the idea of building a Robin Hood HashMap. Not only was it something new to me, but I wasn’t able to find many other implementations of Robin Hood hashing in Haskell. I also knew that pure HashMaps tend to take an \(\mathrm{O}(\log n)\) runtime hit, and I wanted to push that to \(\mathrm{O}(1)\), which eventually meant doing things in `ST`. However, I didn’t want to worry much about inlining, adding every possible strictness annotation, and so on, and I didn’t do any actual profiling at the end; this project was much more about learning how to write such a thing than tuning it up for real usage. I also did fairly barebones testing, so there are likely some bugs lurking in the final product.
Of course, there’s plenty of background to dive into before implementing any well-known data structure. For me, this mostly came from reading the hashtables library, which promises “mutable hash tables in the ST monad”, as well as Robin Hood Hashing should be your default Hash Table implementation, which has a nicely readable reference implementation of the basic Robin-Hood-related algorithms in C++. The original 1986 paper by Celis is also worth a skim, but it’s fairly hefty on basic background information and usage simulation/profiling.
Most of this post is about specific Haskell implementation details, but it’d be good to first explain what Robin Hood hashing is, and how it differs from other algorithms. I’ll assume you know what hash functions do, and understand vaguely how hash tables with open addressing work. If you already know all this, feel free to skip ahead.
When we want to insert an element into our HashMap, we need to “probe”. This just means looking for a slot we can put the element into.
When you hash something, you get some array index back. Let’s call that the “desired position”. “Probe length” means the distance that an element is from its desired position. If the HashMap were completely empty, no matter where the hash tells the first element to go, it can go in its desired position, so it’d have a probe length of 0. Robin Hood hashing is about reducing this probe length, on average, by taking from “rich” elements (elements close to their desired positions) and giving to “poor” elements (those far away, with a high probe length).
As an example, let’s say I have a HashMap containing the mapping `{ a: 1, b: 2 }`, with an underlying array of `[a:1][b:2][_][_]` (where `[_]` means an empty slot). I want to insert `{algorithm: 3}`. My hash function is really dumb: it just reads the first character and maps `a` to the first slot, `b` to the second slot, etc. So the key `algorithm` hashes to the first slot, currently occupied by `a:1`.
We can’t insert into an occupied slot, so we compare probe lengths. `a` is distance 0 from its preferred slot, but so is `algorithm` at this point. So let’s move on to the slot containing `b:2`. In this scenario, `b:2` is “rich”: it also has a probe distance of 0, since it’s in its desired position. `algorithm` is relatively poor, because if we put it in slot 2, it’ll be 1 slot away from its desired position. Because we want to take from the rich and give to the poor, we swap elements, putting `algorithm:3` in slot 2, and now probing for a place to put `b:2`. Our array currently looks like `[a:1][algorithm:3][_][_]` and we’re trying to insert `b:2`.
The next slot is open, so `b:2` goes there. Our final underlying array is `[a:1][algorithm:3][b:2][_]`. Note that no element has a probe distance of more than 1. In traditional linear probing, we would have ended up with `[a:1][b:2][algorithm:3][_]`, which means that looking up `algorithm` again requires walking a relatively long way. Robin Hood hashing helps average out probe distances, so no element ends up too far from its desired position.
Now on to the implementation! Starting from the top and moving down, the first code I wrote was the interface. I came up with a bunch of functions which were simple to build from a small set of basic parts, pretending that I already had those basic parts in hand. Many of the decisions about what to include came from reading through the interface of unordered-containers.
```haskell
-- Derived functions
null :: HashMap k v -> Bool
null h = size h == 0

member :: (Eq k, Hashable k) => k -> HashMap k a -> Bool
member k h = isJust $ lookup k h

(!?) :: (Eq k, Hashable k) => HashMap k v -> k -> Maybe v
(!?) = flip lookup

findWithDefault :: (Eq k, Hashable k) => v -> k -> HashMap k v -> v
findWithDefault v k h = fromMaybe v $ lookup k h

singleton :: Hashable k => k -> v -> HashMap k v
singleton k v = insert k v empty

fromList :: (Eq k, Hashable k) => [(k, v)] -> HashMap k v
fromList = foldl (flip $ uncurry insert) empty

toList :: HashMap k v -> [(k, v)]
toList = foldl (flip (:)) []
```
After writing out these interface functions, the core set ended up being these:
```haskell
empty :: HashMap k v
size :: HashMap k v -> Int
lookup :: (Eq k, Hashable k) => k -> HashMap k v -> Maybe v
insert :: (Eq k, Hashable k) => k -> v -> HashMap k v -> HashMap k v
delete :: (Eq k, Hashable k) => k -> HashMap k v -> HashMap k v
foldl :: (a -> v -> a) -> a -> HashMap k v -> a
```
These were a good foundation for doing everything I wanted a HashMap to reasonably be able to do, without getting too bogged down in details. I ended up doing things in the `ST` monad eventually, so `foldl` turned into `foldM` and there were a lot more `ST s`s spread around. But the core idea for what I’d have to implement was set.
The next important thing to decide was what the HashMap would really be, on the inside. I wanted to remove the usual \(\mathrm{O}(\log n)\) hit for using pure data structures, and since traditional Robin Hood hashing uses open addressing (basically, just storing the elements of the hash table in a big array), this hit could have been even worse. Some \(\mathrm{O}(1)\) operations on mutable arrays are much worse on immutable arrays—for example, immutable single-cell updates can still take \(\mathrm{O}(n)\) to copy the entire array.
At around this point, I landed on Tweag’s post on immutability and unboxing arrays. There are a bunch of choices for array programming in Haskell. I tried `STUArray` but found it didn’t quite have the interface I wanted: specifically, a super convenient way to read and write positions at some integral index. There’s also no `MArray` instance over arbitrary element types for `STUArray`, like there is for `STArray`. Since I didn’t care too much about extra pointer indirections, I went with `MutableArray`, a boxed mutable array that’s easy to work with in `ST`. If performance were a bigger factor, or I had more information about my key and value types, I probably would have chosen differently.
Now on to more coding. The most basic and most important type for the HashMap is, unsurprisingly, the `Hash`.

```haskell
type Hash = Word32
```
Following our C++ reference implementation’s design, we’ll say that if a `Hash` is ever zero, it’s not being used for any key in the map. We’ll also reserve the most significant bit as a deletion marker, which allows some of our `Hash`es to act as “tombstones”. Deleted elements can be overwritten during insertion, but we don’t take the time at deletion to clear them out; we just set one bit. `Word32` is an unsigned 32-bit number, but reserving one bit for deletion means we’re left with \(2^{31} = 2147483648\) possible slots in a completely full map.
As a convenience for setting the deletion bit, let’s also note which one it is ahead of time.
```haskell
-- if a Hash is a Word32 as in [bit31..bit0],
-- its most significant bit is bit 31
mostSignificantBit :: Int
mostSignificantBit = 31
```
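A couple of helpers will show up later (`hashKey` in the lookup and insert code, `isDeleted` in `grow` and `insertWithoutGrowing`) that this post never defines. Here’s a plausible sketch of them, assuming `hash` from `Data.Hashable`; clamping to a nonzero value with the top bit cleared is my guess at how to honor the “zero means empty” and tombstone conventions:

```haskell
import Data.Bits (clearBit, testBit)
import Data.Hashable (Hashable, hash)

-- Map a key to a Hash, clearing the reserved deletion bit and
-- avoiding the reserved "empty" value 0 (details here are assumptions).
hashKey :: Hashable k => k -> Hash
hashKey k =
  let h = clearBit (fromIntegral (hash k)) mostSignificantBit
  in if h == 0 then 1 else h

-- A slot holds a tombstone when its hash has the deletion bit set.
isDeleted :: Hash -> Bool
isDeleted h = testBit h mostSignificantBit
```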
Because we’ll be working in `ST`, the `HashMap` type itself will really just be a little bit of bookkeeping over an `STRef` to the actual data structure. Let’s call the inner one `HashMapData`.

```haskell
newtype HashMap s k v = HashMap (STRef s (HashMapData s k v))
```
`s` is the magic variable `ST` uses to enforce proper scoping, and `k` and `v` of course refer to the key and value types of our HashMap.
Another choice we have is whether to store arrays of elems, keys, and hashes all separately, or to do some kind of combining. I found it convenient to build a simple `Elem` type, strict in both the key and the value:

```haskell
data Elem k v = Elem
  { elemKey :: !k
  , elemValue :: !v
  }
```
So, we’ll store an array of `Hash`es as well as an array of `Elem`s. The HashMap should also know the current number of elements in it (its size), its maximum capacity, and its maximum load factor, so it knows when to grow. Putting that all together, and littering the whole thing with strictness and unpacking annotations, we get a final data structure like this:

```haskell
data HashMapData s k v = HashMapData
  { hashMapSize :: {-# UNPACK #-} !Int -- number of elements
  , hashMapLoadFactor :: {-# UNPACK #-} !Double -- maximum load factor (between 0 and 1)
  , hashMapCapacity :: {-# UNPACK #-} !Int -- available slots
  , hashMapHashes :: {-# UNPACK #-} !(MutableArray s Hash)
  , hashMapElems :: {-# UNPACK #-} !(MutableArray s (Elem k v))
  }
```
In Haskell, it’s possible to work with unboxed types more directly, especially via the `MagicHash` language extension. However, after messing with this for a while, I decided that merely unpacking data constructors was good enough here; I’m using a boxed array type anyways, after all.
That’s it for data structures! Now we’re on to the real meat of the project: implementing behavior. Starting with `size` is probably easiest, since we have it stored in `HashMapData` already and just need to retrieve it.

```haskell
size :: HashMap s k v -> ST s Int
size (HashMap hashMapDataRef) = do
  hashMapSize <$> readSTRef hashMapDataRef
```
The only real trick here is remembering that `HashMap` is really just a wrapper for an `STRef` to a `HashMapData`, and then unpacking things appropriately before wrapping them back up in `ST`. It’s easy enough to read the ref, then fmap our size-finding function.
My `foldM` is taken straight out of `Data.HashTable.ST.Basic`, because I just wanted to be able to build `toList` as easily as possible, so it’s not that interesting. The opposite direction, `fromList`, is also a fold, but one that folds `insert`s over a `[(k, v)]`. To build up a structure by folding, we need some “zero” or default structure. This leads to `empty`, which creates a new HashMap with nothing inside it.
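For a rough idea of its shape, a `foldM` over this structure might walk every slot and fold over just the live elements. This is a sketch, not the post’s actual code (which, again, follows `Data.HashTable.ST.Basic`), and folding over key/value pairs rather than bare values is an assumption on my part so that `toList` falls out naturally:

```haskell
-- A sketch only: walk each slot, skip empty (hash == 0) and
-- tombstoned entries, and fold over the live (key, value) pairs.
foldM :: (a -> (k, v) -> ST s a) -> a -> HashMap s k v -> ST s a
foldM f seed (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let go !acc !i
        | i >= hashMapCapacity = pure acc
        | otherwise = do
            h <- readArray hashMapHashes i
            if h /= 0 && not (isDeleted h)
              then do
                Elem k v <- readArray hashMapElems i
                acc' <- f acc (k, v)
                go acc' (i + 1)
              else go acc (i + 1)
  go seed 0
```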
To build an empty HashMap, we need a zeroed-out array of `Hash`es, some memory reserved for `Elem`s, a default `capacity`, and a `loadFactor`. The size of an empty HashMap is `0`, and `empty` also builds the `STRef` that will be passed around everywhere else we want access to `HashMapData`.

```haskell
empty :: ST s (HashMap s k v)
empty = do
  hashArr <- newArray initialCapacity zeroBits
  elemArr <- newArray initialCapacity undefined
  liftM HashMap . newSTRef $ HashMapData 0 loadFactor initialCapacity hashArr elemArr
  where
    initialCapacity = 256
    loadFactor = 0.9
```
Insertions are probably more interesting, but deletions are simpler, so we’ll work our way up. To delete, we find the index where some key maps to, and set the deletion bit (the most significant bit) on the hash there, remembering to decrement the HashMap’s size.
```haskell
delete :: (Hashable k, Eq k) => k -> HashMap s k v -> ST s ()
delete key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  mbIndex <- lookupIndex key hm
  whenJust mbIndex $ \index -> do
    hashed <- readArray hashMapHashes (fromIntegral index)
    let newHash = setBit hashed mostSignificantBit
    writeArray hashMapHashes (fromIntegral index) newHash
    writeSTRef hashMapDataRef HashMapData{..} { hashMapSize = hashMapSize - 1 }
```
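The `whenJust` here isn’t defined in the post; it behaves like the helper of the same name from the `extra` package, and is a one-liner if you’d rather not pull in a dependency:

```haskell
-- Run the action on the contents of a Just; do nothing for Nothing.
whenJust :: Applicative f => Maybe a -> (a -> f ()) -> f ()
whenJust mb f = maybe (pure ()) f mb
```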
I found that using `RecordWildCards` everywhere, along with reading the `STRef` as the first step of every function, led to a strangely “global variable”-feeling approach. Whenever I want to reach into some component of `HashMapData` to read or modify it, it’s always just available. Programming against `ST` with that always-available kind of thing feels super imperative. However, Haskell still manages to ensure the type safety of everything.
A lookup takes the index a key maps to, if any, and retrieves the value of the element stored there. The trickiest part of looking things up in this style is probably just keeping track of where you are in regards to `ST`, and where you are in regards to having a `Maybe`.
```haskell
lookup :: (Eq k, Hashable k) => k -> HashMap s k v -> ST s (Maybe v)
lookup key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  mbIndex <- lookupIndex key hm
  case mbIndex of
    Nothing -> pure Nothing
    Just index -> Just . elemValue <$> readArray hashMapElems (fromIntegral index)
```
Both `delete` and `lookup` rely on a helper named `lookupIndex`, which finds the position a key maps to. It starts from the key’s “desired position” (the slot its hash points at when nothing is in the way) and walks forward, at each step checking whether the slot is unused, whether it holds the key we’re looking for, or whether we need to keep looking.
```haskell
lookupIndex :: (Eq k, Hashable k) => k -> HashMap s k v -> ST s (Maybe Hash)
lookupIndex key hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let hashed = hashKey key
  position <- desiredPosition hashed hm

  let loopStep dist pos = do
        elemHash <- readArray hashMapHashes (fromIntegral pos)
        arrAtPos <- readArray hashMapElems (fromIntegral pos)
        probeDist <- probeDistance elemHash pos hm

        if | elemHash == 0 -> pure Nothing
           | dist > probeDist -> pure Nothing
           | elemHash == hashed && elemKey arrAtPos == key -> pure $ Just pos
           | otherwise -> loopStep (dist + 1) ((pos + 1) .&. (fromIntegral hashMapCapacity - 1))

  loopStep 0 position
```
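`lookupIndex` leans on two more helpers the post doesn’t show: `desiredPosition` and `probeDistance`. Here’s a sketch of how they might look, assuming (as the bit-masking in `loopStep` implies) that the capacity is always a power of two:

```haskell
-- Sketch: the slot a hash wants to live in, i.e. hash mod capacity,
-- where the mod is done by masking since capacity is a power of two.
desiredPosition :: Hash -> HashMap s k v -> ST s Hash
desiredPosition hashed (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  pure $ hashed .&. (fromIntegral hashMapCapacity - 1)

-- Sketch: how far the element at 'pos' sits from its desired
-- position, accounting for wraparound at the end of the array.
probeDistance :: Hash -> Hash -> HashMap s k v -> ST s Hash
probeDistance elemHash pos hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  desired <- desiredPosition elemHash hm
  pure $ (pos + fromIntegral hashMapCapacity - desired)
           .&. (fromIntegral hashMapCapacity - 1)
```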
Finally, the Big Kahuna of any proper HashMap implementation: insertion. This is where the Robin Hood algorithm comes into play, and it’s also where most potential bugs will creep in. First, if the underlying array isn’t big enough to hold our new insertion, we need to grow the table. Once that’s done, we can do the actual insert, and if the insert tells us it added a new key (instead of merely overwriting an existing one), we bump up the HashMap’s size.
```haskell
insert :: (Eq k, Hashable k) => k -> v -> HashMap s k v -> ST s ()
insert !key !value (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  when (fromIntegral hashMapSize >= fromIntegral hashMapCapacity * hashMapLoadFactor) $
    grow (HashMap hashMapDataRef)
  HashMapData{..} <- readSTRef hashMapDataRef
  !didAddNewKey <- insertWithoutGrowing key value (HashMap hashMapDataRef)
  when (didAddNewKey == AddedNewKey) $
    writeSTRef hashMapDataRef HashMapData{..} { hashMapSize = hashMapSize + 1 }
```
`DidAddNewKey` just helps us avoid some boolean blindness when thinking about already-complicated-enough algorithmic things:

```haskell
data DidAddNewKey = AddedNewKey | DidNotAddNewKey
  deriving Eq
```
Whenever our HashMap starts to exceed a percentage of its capacity, it’s a good idea to increase the capacity. We check whether we should grow upon insertion, and if we should, double the size of the HashMap and reinsert all the old elements.
```haskell
grow :: (Eq k, Hashable k) => HashMap s k v -> ST s ()
grow (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  let !oldCapacity = hashMapCapacity
      !newCapacity = hashMapCapacity * 2
      !oldElems = hashMapElems
      !oldHashes = hashMapHashes
  !hashArr <- newArray newCapacity (zeroBits :: Hash)
  !elemArr <- newArray newCapacity undefined
  writeSTRef hashMapDataRef $ HashMapData 0 0.9 newCapacity hashArr elemArr
  forM_ [0..oldCapacity - 1] $ \i -> do
    !hashed <- readArray oldHashes i
    when (hashed /= 0 && not (isDeleted hashed)) $ do
      Elem k v <- readArray oldElems i
      void $ insert k v (HashMap hashMapDataRef)
```
At last, the big reveal. We hash our key, and run a loop that examines potential insertion locations. If a slot is unused, we can put the element there. If a slot is used by the same key that we’re trying to add something to, we overwrite the value at that key. If a slot is occupied by a tombstone, that’s as good as unused, and we add the new key there. Finally, if none of that is true, we do our actual Robin Hood-ing, and swap out “rich” and “poor” elements until we find a happy home for our newly-inserted element.
```haskell
insertWithoutGrowing :: (Eq k, Hashable k) => k -> v -> HashMap s k v -> ST s DidAddNewKey
insertWithoutGrowing key value hm@(HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  !position <- desiredPosition (hashKey key) hm
  let nextPos p = (p + 1) .&. (fromIntegral hashMapCapacity - 1)

  let
    loopStep probeDist pos elem hash = do
      !hashAtPos <- readArray hashMapHashes $ fromIntegral pos
      elemAtPos <- readArray hashMapElems $ fromIntegral pos
      probeDistOfElemAtPos <- probeDistance hashAtPos pos hm

      if | hashAtPos == 0 -> do
             -- put elem in empty slot
             construct pos hash elem hm
             pure AddedNewKey
         | isDeleted hashAtPos -> do
             -- put elem in slot occupied by deleted elem
             construct pos hash elem hm
             pure AddedNewKey
         | elemKey elemAtPos == elemKey elem -> do
             -- overwrite value with same key
             construct pos hash elem hm
             pure DidNotAddNewKey
         | probeDistOfElemAtPos < probeDist -> do
             -- this makes it Robin Hood. swap and try next position with swapped elem
             writeArray hashMapHashes (fromIntegral pos) hash
             writeArray hashMapElems (fromIntegral pos) elem
             loopStep (probeDist + 1) (nextPos pos) elemAtPos hashAtPos
         | otherwise -> do
             -- try next position
             loopStep (probeDist + 1) (nextPos pos) elem hash

  loopStep 0 position (Elem key value) (hashKey key)
```
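One last name the post leaves undefined: `construct`, used by the three “just put it here” branches above. A minimal sketch consistent with the call sites simply writes the hash and the element into the given slot:

```haskell
-- Sketch: fill the slot at 'pos' with the given hash and element.
construct :: Hash -> Hash -> Elem k v -> HashMap s k v -> ST s ()
construct pos hash elem (HashMap hashMapDataRef) = do
  HashMapData{..} <- readSTRef hashMapDataRef
  writeArray hashMapHashes (fromIntegral pos) hash
  writeArray hashMapElems (fromIntegral pos) elem
```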
The full code (including some small tests that show how one might use such a HashMap) can be read at this GitHub repo.
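To close things out, here’s a tiny usage sketch of my own (not taken from the repo’s tests), with all the mutation safely scoped inside `runST`:

```haskell
import Control.Monad.ST (runST)

-- Build a map, mutate it, and read a value back out, all in ST.
example :: Maybe Int
example = runST $ do
  hm <- empty
  insert "one" (1 :: Int) hm
  insert "two" 2 hm
  delete "one" hm
  lookup "two" hm  -- Just 2
```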