Skip to content Skip to sidebar Skip to footer

Why Adding Multiple 'nan' In Python Dictionary Giving Multiple Entries?

Example problem: import numpy as np dc = dict() dc[np.float('nan')] = 100 dc[np.float('nan')] = 200 It is creating multiple entries for nan like dc.keys() will produce {nan: 100,

Solution 1:

The short answer to your question (of why adding NaN keys to a Python dict create multiple entries), is because floating-point NaN values are unordered, i.e. a NaN value is not equal to, greater than, or less than anything, including itself. This behavior is defined in the IEEE 754 standard for floating point arithmetic. The explanation why is that so is given by an IEEE 754 committee member in this answer.


For a longer, Python-specific, answer, let's first have a look at how item insertion and key comparison work in CPython dictionaries.

When you say d[key] = val, PyDict_SetItem() for dictionary d is called, which in turn calls (internal) insertdict(), which will either update the existing dictionary item, or insert a new item (maybe resizing the hash table consequentially).

The first step on insert is to lookup the key in the hash table of dictionary keys. A general-purpose lookup function that gets called in your case (of non-string keys) is lookdict().

lookdict will use key's hash value to locate the key, iterating over a list of possible keys with identical hash value, comparing first by address, then by calling keys' equivalence operator(s) (see excellent comments in Objects/dictobject.c for more details on hash collision resolution in Python's implementation of open addressing).

Since every float('nan')has the same hash value, yet each one is a different object (with a different "identity", i.e. memory address), and they're not equal by their float-value:

>>>a, b = float('nan'), float('nan')>>>hash(a), hash(b)
(0, 0)
>>>id(a), id(b)
(94753433907296, 94753433907272)
>>>a == b
False

when you say:

d = dict()
d[float('nan')] = 1
d[float('nan')] = 2

lookdict will search for the second NaN by looking at its hash (0), then trying to resolve hash collision by iterating over keys with the same hash and comparing the keys by identity/address (they are different), then by invoking (the expensive) PyObject_RichCompareBool/do_richcompare, which in turn calls float_richcompare which compares floats just as C does:

/* Comparison is pretty much a nightmare.  When comparing float to float,
 * we do it as straightforwardly (andlong-windedly) as conceivable, so
 * that, e.g., Python x == y delivers the same result as the platform
 * C x == y when x and/or y is a NaN.

which behaves according to IEEE 754 standard (from GNU C library docs):

20.5.2 Infinity and NaN

[...]

The basic operations and math functions all accept infinity and NaN and produce sensible output. Infinities propagate through calculations as one would expect: for example, 2 + ∞ = ∞, 4/∞ = 0, atan (∞) = π/2. NaN, on the other hand, infects any calculation that involves it. Unless the calculation would produce the same result no matter what real value replaced NaN, the result is NaN.

In comparison operations, positive infinity is larger than all values except itself and NaN, and negative infinity is smaller than all values except itself and NaN. NaN is unordered: it is not equal to, greater than, or less than anything, including itself. x == x is false if the value of x is NaN. You can use this to test whether a value is NaN or not, but the recommended way to test for NaN is with the isnan function (see Floating Point Classes). In addition, <, >, <=, and >= will raise an exception when applied to NaNs.

and which will return false for NaN == NaN.

That's why Python decides the second NaN object is worthy of a new dictionary entry. It may have the same hash, but its address and equivalence test say it is different from all the other NaN objects.

However, note that if you always use the same NaN object (with the same address) since the address is tested before float equivalence, you'll get the expected behavior:

>>>nan = float('nan')>>>d = dict()>>>d[nan] = 1>>>d[nan] = 2>>>d
{nan: 2}

Solution 2:

For historical reasons explained here, np.float('nan') == np.float('nan') is False. The rule is just that you can't have two dictionary keys which are equal to each other - so you can have two keys equal to np.float('nan').

Of course, this behavior is counterintuitive and surprising - so you should avoid using np.float('nan') as a key.

Solution 3:

As was mentioned in a comment to you, nan is never "equal" to another nan, do your dict is writing a new key for it. This is the behavior for nan values in most languages, not just python.

I would suggest not using it as a key at all, or at least explain the purpose of it so we can find better ways to achieve that purpose without falling into pitfalls like this.

In your case you could test this bahavior for yourself:

a=list(dc.keys())
print(a[0]==a[1]) # will output False

The output for the above code (False) means that for the system, the are in fact different keys that do not clash

Post a Comment for "Why Adding Multiple 'nan' In Python Dictionary Giving Multiple Entries?"