Like many languages, Python is open-source and the source is reasonably readable. We’ll take advantage of those facts and look into CPython’s implementation of dictionaries in dictobject.c. You could of course read the whole file yourself (it was fun!), but if you’d prefer just a few of the more interesting tidbits, stick around. If you have an afternoon and are into that sort of thing, it’s a pretty interesting read, though be warned that it’s 4,753 lines of C. A related file, dictnotes.txt, goes over many of the key use cases and other factors taken into account when choosing this specific implementation.
If you read only one piece of dictobject.c, read the initial comments. They describe the layout of the object, the states entries can be in, the (new) compactness and orderedness properties, and why the minimum size was chosen.
The layout is given by the following table. Pretty straightforward: just some introductory header material like the size of the dictionary and other properties it currently has, followed by dk_indices, a hashtable mapping to an index in dk_entries, which is where we store pointers to our actual objects.
As described on the mailing list, CPython dictionaries have been compact and ordered since Python 3.6. (In 3.6 this was an implementation detail rather than part of the spec; insertion order has been guaranteed by the language itself since Python 3.7.) These properties come from separating the dictionary entries (each of which has a hash, a pointer to the key, and a pointer to the value) from the indices hashtable, which, as its name implies, just indexes into the actual entries. This means that when doing things like for k, v in my_dict.items(), you’re really iterating over the entries, which stay in the order they were inserted. The indices, being a hashtable, are susceptible to jumping around (e.g. on a resize), so they don’t provide that property themselves. This also lets the indices be sparse, and saves memory overall since we no longer have empty entries taking up loads of room. The example given is this:
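As a rough stand-in for that layout, here is a toy Python model of the indices/entries split. It is my own sketch, not CPython's code: the class name ToyCompactDict is made up, the table never resizes, deletion is not supported, and it uses plain linear probing to stay short (CPython's real probe sequence is different).

```python
class ToyCompactDict:
    """Toy model of the indices/entries split (fixed size, no resize, no deletion)."""

    def __init__(self, size=8):
        self.indices = [None] * size  # sparse: slot -> position in self.entries
        self.entries = []             # dense: (hash, key, value) in insertion order

    def _find_slot(self, key, key_hash):
        # Linear probing is an assumption made for brevity, not CPython's strategy.
        mask = len(self.indices) - 1
        slot = key_hash & mask
        while True:
            pos = self.indices[slot]
            if pos is None or self.entries[pos][1] == key:
                return slot
            slot = (slot + 1) & mask

    def __setitem__(self, key, value):
        key_hash = hash(key)
        slot = self._find_slot(key, key_hash)
        pos = self.indices[slot]
        if pos is None:
            # Always keep one slot unused so that probing terminates.
            if len(self.entries) >= len(self.indices) - 1:
                raise RuntimeError("toy table is full (no resize implemented)")
            self.indices[slot] = len(self.entries)
            self.entries.append((key_hash, key, value))
        else:
            self.entries[pos] = (key_hash, key, value)

    def __getitem__(self, key):
        pos = self.indices[self._find_slot(key, hash(key))]
        if pos is None:
            raise KeyError(key)
        return self.entries[pos][2]

    def items(self):
        # Iteration walks the dense entries list, so insertion order falls out for free.
        return [(key, value) for _, key, value in self.entries]


d = ToyCompactDict()
for key, value in [(10, "a"), (3, "b"), (18, "c")]:
    d[key] = value

print(d.indices)  # [None, None, 0, 1, 2, None, None, None]
print(d.items())  # [(10, 'a'), (3, 'b'), (18, 'c')]
```

The sparse indices list only holds small integers, while the three-field entries are packed densely, which is where the memory saving comes from.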
Another interesting bit is the discussion on what the initial dictionary size should be. This might not seem all that important: so what if we cost everyone a few bytes? You have to remember that in this case “everyone” is literally “everyone who uses Python”, so it is worth taking at least a little time to consider. (If you don’t buy that, maybe “striving for the platonic ideal dictionary implementation is a good in and of itself” is more your speed.) The Python developers decided on 8, and I’ll let them explain why:
/* PyDict_MINSIZE is the starting size for any new dict.
 * 8 allows dicts with no more than 5 active entries; experiments suggested
 * this suffices for the majority of dicts (consisting mostly of usually-small
 * dicts created to pass keyword arguments).
 * Making this 8, rather than 4 reduces the number of resizes for most
 * dictionaries, without any significant extra memory use.
 */
#define PyDict_MINSIZE 8
It was a little bit surprising to me that a typical dictionary has 5 elements or fewer, but I buy the argument that kwargs dicts make up a large share of all the dictionaries that get created.
The reason we’re limited to 5 active entries in a dict of size 8 is explained later in the file by the load factor. With a load factor of 2/3, the maximum number of entries before a resize is int(2/3 * 8), or 5. Likewise, with a size of 4 we get int(2/3 * 4), or only 2 active entries before a resize is needed.
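The arithmetic can be checked with a few lines of Python. The helper name usable_entries is made up for this sketch, and it uses integer division instead of int(2/3 * n) to sidestep floating point; both give the same answers for these sizes.

```python
def usable_entries(table_size):
    """Entries allowed before a resize, assuming a 2/3 load factor."""
    return (2 * table_size) // 3


for size in (4, 8, 16, 32):
    print(size, usable_entries(size))
# 4 2
# 8 5
# 16 10
# 32 21
```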
Try running this code:
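A minimal snippet along these lines, assuming the point is to look at the hash values CPython assigns to small non-negative integers:

```python
# On CPython, small non-negative integers hash to themselves.
small_int_hashes = [hash(i) for i in range(8)]
print(small_int_hashes)  # [0, 1, 2, 3, 4, 5, 6, 7]
```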
Looks pretty regular, right? It makes some sense though: a very fast hash function for small integers is “return the number you were given”. However, if the rest of the dictionary were implemented naively with respect to this fact, we could see huge performance drops and all kinds of nasty issues due to big sequences of collisions. According to dictobject.c:150, this “makes a good collision resolution strategy crucial”.
The recurrence for table indices is given by:
\[ j = (5j + 1) \bmod 2^i \]
This has a few important properties. First, it’s much less likely than the recurrence \( j = (j + 1) \bmod 2^i \) to show up in the wild. This means we don’t run into issues when inserting things like sequences of integers into our table. Second, it still covers every possibility. Per dictobject.c:126, “To ensure the lookup algorithm terminates, there must be at least one Unused slot (NULL key) in the table.” Imagine a recurrence which skipped over a certain slot every time, and that was the only empty slot available (or imagine no slots are available). In either case, the algorithm keeps looking but never finds the open slot. It’s stuck on a quest for what it cannot have.
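The "covers every possibility" claim is easy to check by brute force. The sketch below iterates the recurrence for a table of size 2^i and records the order in which slots are visited; the function name visit_order is made up for illustration.

```python
def visit_order(start, i):
    """Slots visited by j = (5*j + 1) mod 2**i, starting from `start`."""
    size = 2 ** i
    j = start
    order = []
    for _ in range(size):
        order.append(j)
        j = (5 * j + 1) % size
    return order


print(visit_order(0, 3))  # [0, 1, 6, 7, 4, 5, 2, 3]

# Every slot is visited exactly once per cycle, whatever the starting point.
for i in range(1, 11):
    for start in range(2 ** i):
        assert sorted(visit_order(start, i)) == list(range(2 ** i))
```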
There is also a perturbation factor applied at each step by perturb >>= PERTURB_SHIFT, which takes into account the higher bits of the hash code. This allows for greater dependence on the specific hash, instead of just relying on the index \( j \) we originally started searching from.
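Here is a sketch of how the perturbation might combine with the recurrence, assuming PERTURB_SHIFT is 5 and that the hash is treated as an unsigned 64-bit value (Python integers are unbounded, so the mask imitates that). The generator name probe_sequence is made up for this example.

```python
from itertools import islice

PERTURB_SHIFT = 5  # assumed value of the constant


def probe_sequence(hash_value, table_size):
    """Yield slot indices to try, in order, for a table whose size is a power of two."""
    mask = table_size - 1
    perturb = hash_value & (2 ** 64 - 1)  # imitate an unsigned 64-bit hash
    i = perturb & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask


# 32 and 64 share the same low three bits, so both start at slot 0,
# but their higher bits send them to different slots on the second probe.
print(list(islice(probe_sequence(32, 8), 2)))  # [0, 2]
print(list(islice(probe_sequence(64, 8), 2)))  # [0, 3]
```

Once perturb has been shifted down to zero, the sequence falls back to the plain recurrence, so every slot is still reached eventually.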
Finally, a note on avoiding memory overhead via an open addressing scheme from that file: “Open addressing is preferred over chaining since the link overhead for chaining would be substantial (100% with typical malloc overhead).”
If you want to read some code, and not just comments about it, I recommend the lookup function lookdict at dictobject.c:747. It uses the recurrence mentioned above to, well, look up where a key maps to in the dictionary.
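For a feel of what such a lookup does, here is a heavily simplified Python sketch. It is not a transcription of lookdict: the function toy_lookup and its list-based arguments are hypothetical, deleted-entry markers are ignored, and errors raised during key comparison are not handled specially.

```python
PERTURB_SHIFT = 5  # assumed value of the constant


def toy_lookup(indices, entries, key):
    """Return the value stored for key, or raise KeyError.

    indices[slot] is None for an unused slot, else a position in entries;
    entries holds (hash, key, value) tuples.
    """
    mask = len(indices) - 1
    key_hash = hash(key)
    perturb = key_hash & (2 ** 64 - 1)
    slot = perturb & mask
    while True:
        pos = indices[slot]
        if pos is None:
            raise KeyError(key)  # hit an unused slot: the key is not present
        entry_hash, entry_key, value = entries[pos]
        # Cheap checks first: identity, then hash equality, then full comparison.
        if entry_key is key or (entry_hash == key_hash and entry_key == key):
            return value
        perturb >>= PERTURB_SHIFT
        slot = (slot * 5 + perturb + 1) & mask


# Hand-built table of size 8: 10 lands in slot 2; 18 collides there and moves to slot 3.
entries = [(hash(10), 10, "a"), (hash(18), 18, "c")]
indices = [None, None, 0, 1, None, None, None, None]

print(toy_lookup(indices, entries, 18))  # c
```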
Because deleted entries are only “marked” as deleted rather than removed immediately, a resize triggered by the number of entries growing can actually shrink the amount of storage needed. Another fun piece of comment:
I never knew before reading this file that Python dicts used to only have string keys (or that exceptions weren’t possible, I suppose). This means there’s a kind of encoding failure here: NULL means more than one thing, and though the usual meaning is “key not found”, it can be returned even when the key could have been found!
/* Note that, for historical reasons, PyDict_GetItem() suppresses all errors
 * that may occur (originally dicts supported only string keys, and exceptions
 * weren't possible). So, while the original intent was that a NULL return
 * meant the key wasn't present, in reality it can mean that, or that an error
 * (suppressed) occurred while computing the key's hash, or that some error
 * (suppressed) occurred when comparing keys in the dict's internal probe
 * sequence. A nasty example of the latter is when a Python-coded comparison
 * function hits a stack-depth error, which can cause this to return NULL
 * even if the key is present.
 */
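The ambiguity is easier to see with a Python analogy. This is not the C API itself: get_item_suppressing_errors and BadKey are hypothetical names, and None stands in for NULL. The helper swallows every exception, so "key absent" and "hashing the key blew up" become indistinguishable to the caller.

```python
class BadKey:
    """A key whose hash computation always fails."""

    def __hash__(self):
        raise TypeError("cannot hash this")


def get_item_suppressing_errors(d, key):
    """Return d[key], or None on *any* failure (mimicking the suppressed errors)."""
    try:
        return d[key]
    except Exception:
        return None


d = {"present": 1}
missing = get_item_suppressing_errors(d, "absent")  # key really is absent
failed = get_item_suppressing_errors(d, BadKey())   # hashing raised TypeError

print(missing, failed)  # None None
```

From the outside, the two results look identical, which is the encoding failure described above. The C API also provides PyDict_GetItemWithError, which leaves an exception set on failure, so callers that care can tell the cases apart.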
Besides the language here (who says “durnit”?), it’s almost comforting to see a place in the code where everything is falling apart so badly that we just goto a place where we can safely try again.
If I had to pick out one single moral to this rambling story, it’d probably be something like “be careful, and know what’s going on”. This comment speaks to that better than any I found. Also, that emoticon adds so much fun!