1 minute read

Caution when you apply any or all on a one-column Boolean DataFrame, or you should never apply these 2 functions to such a DataFrame because it works differently from your intention.

The logics of any and all are:

def all(iterable):
    for element in iterable:
        if not element:
            return False
    return True
def any(iterable):
    for element in iterable:
        if element:
            return True
    return False

The trickiest part with a DataFrame is that when being iterated, its elements are actually its column names, with types!

A column name can be a string (in most cases), an integer (acceptable) or even a Boolean (What?). Therefore when applying any() or all() to a DataFrame, you’re actually iterating its column names, which are then evaluated in the if element: clause, and at the same time NONE of the values underneath is accessed at all.

>>> foo = pd.DataFrame(data=[1,2])
>>> foo
   0
0  1
1  2
>>> for element in foo:
...     print(element)
... 
0
>>> foo = pd.DataFrame(data=[1,2], columns=[True])
>>> foo
   True
0     1
1     2
>>> for element in foo:
...     print(element)
... 
True

And this exactly explains the seemingly contradicting results in the following code:

>>> one_c_df = pd.DataFrame(data=[False, False], index=["a", "b"], columns=["a_colname_not_evaluated_to_False"])
>>> one_c_df
   a_colname_not_evaluated_to_False
a                             False
b                             False
>>> any(one_c_df)  # I cannot believe this at first sight!
True
>>> all(one_c_df)
True
>>> series = pd.Series(data=[False, False], index=["a", "b"], name="does_not_matter")
>>> series
a    False
b    False
Name: does_not_matter, dtype: bool
>>> any(series)
False
>>> all(series)
False

Categories:

Updated:

Comments