Subclassing A Pandas DataFrame, Updates?

To inherit, or not to inherit? What is the latest on the subclassing issue for Pandas? (Most of the other threads are 3-4 years old). I am hoping to do something like ... import pa

Solution 1:

This is how I've done it. I've followed advice found:

The example below only shows how to construct a subclass of pandas.DataFrame. If you follow the advice in my first link, you may also want to subclass pandas.Series, so that one-dimensional slices of your pandas.DataFrame subclass keep a custom type as well.
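For the Series side of this, here is a minimal sketch of how the two subclasses can point at each other. It uses the `_constructor_sliced` and `_constructor_expanddim` hooks that pandas documents for subclassing. The class names `SomeDataSeries` and `SomeDataFrame` are made up for illustration and are separate from the `SomeData` class defined below.

```python
import pandas as pd


class SomeDataSeries(pd.Series):
    # Hypothetical one-dimensional companion to the DataFrame subclass.
    @property
    def _constructor(self):
        return SomeDataSeries

    @property
    def _constructor_expanddim(self):
        # Used when a Series is expanded back into a frame, e.g. `to_frame()`.
        return SomeDataFrame


class SomeDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SomeDataFrame

    @property
    def _constructor_sliced(self):
        # Used when a single column or row is pulled out of the frame.
        return SomeDataSeries


frame = SomeDataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
column = frame["A"]           # one-dimensional slice
subset = frame[["A", "B"]]    # two-dimensional slice

print(type(column).__name__)
print(type(subset).__name__)
```

With this in place, taking a single column should give you back the Series subclass rather than a plain pandas.Series.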

Defining SomeData

import pandas as pd
import numpy as np

class SomeData(pd.DataFrame):
    # This class variable tells Pandas the name of the attributes
    # that are to be ported over to derivative DataFrames.  There
    # is a method named `__finalize__` that grabs these attributes
    # and assigns them to newly created `SomeData`
    _metadata = ['my_attr']

    @property
    def _constructor(self):
        """This is the key to letting Pandas know how to keep
        derivative `SomeData` the same type as yours.  It should
        be enough to return the class itself.  However, in
        some cases, `__finalize__` is not called and `my_attr` is
        not carried over.  We can fix that by returning a callable
        that makes sure to call `__finalize__` every time."""
        def _c(*args, **kwargs):
            return SomeData(*args, **kwargs).__finalize__(self)
        return _c

    def __init__(self, *args, **kwargs):
        # grab the keyword argument that is supposed to be my_attr
        self.my_attr = kwargs.pop('my_attr', None)
        super().__init__(*args, **kwargs)

    def my_method(self, other):
        return self * np.sign(self - other)

Demonstration

mydata = SomeData(dict(A=[1, 2, 3], B=[4, 5, 6]), my_attr='an attr')

print(mydata, type(mydata), mydata.my_attr, sep='\n' * 2)

   A  B
0  1  4
1  2  5
2  3  6

<class '__main__.SomeData'>

an attr

newdata = mydata.mul(2)

print(newdata, type(newdata), newdata.my_attr, sep='\n' * 2)

   A   B
0  2   8
1  4  10
2  6  12

<class '__main__.SomeData'>

an attr

newerdata = mydata.my_method(newdata)

print(newerdata, type(newerdata), newerdata.my_attr, sep='\n' * 2)

   A  B
0 -1 -4
1 -2 -5
2 -3 -6

<class '__main__.SomeData'>

an attr

Gotchas

This approach breaks the method pd.DataFrame.equals:

newerdata.equals(newdata)  # Should be `False`
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-304-866170ab179e> in <module>()
----> 1 newerdata.equals(newdata)

~/anaconda3/envs/3.6.ml/lib/python3.6/site-packages/pandas/core/generic.py in equals(self, other)
   1034         the same location are considered equal.
   1035         """
-> 1036         if not isinstance(other, self._constructor):
   1037             return False
   1038         return self._data.equals(other._data)

TypeError: isinstance() arg 2 must be a type or tuple of types

What happens is that this method passes the _constructor attribute as the second argument to isinstance, so it expects to find a class (an object of type type) there. Instead, it finds the callable that I placed there to fix the __finalize__ issue I came across.
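The failure is easy to reproduce without pandas at all. This is just a small sketch, where `make_frame` is a hypothetical stand-in for the `_c` closure returned by `_constructor`:

```python
def make_frame(*args, **kwargs):
    # Stand-in for the closure returned by `_constructor`:
    # it is callable, but it is a function, not a class.
    return dict(*args, **kwargs)


try:
    isinstance({}, make_frame)   # a function is not a valid second argument
    raised = False
except TypeError:
    raised = True

print(raised)                    # the function triggers a TypeError
print(isinstance({}, dict))      # a real class works as expected
```

Any code path in pandas that treats `_constructor` as a class rather than simply calling it would run into the same problem.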

Workaround

Override the equals method by adding the following to your class definition.

    def equals(self, other):
        try:
            pd.testing.assert_frame_equal(self, other)
            return True
        except AssertionError:
            return False

newerdata.equals(newdata)  # Should be `False`

False
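One thing to keep in mind is that pd.testing.assert_frame_equal is not a drop-in copy of equals. As far as I know, with its default settings it compares floats approximately (within a tolerance), while equals compares them exactly. The sketch below uses plain DataFrames and a hypothetical helper `frames_equal` that mirrors the override above, so you can see the difference:

```python
import pandas as pd


def frames_equal(left, right):
    # Same logic as the `equals` override, written as a free function.
    try:
        pd.testing.assert_frame_equal(left, right)
        return True
    except AssertionError:
        return False


a = pd.DataFrame({"A": [1, 2, 3]})
b = pd.DataFrame({"A": [1, 2, 3]})
c = pd.DataFrame({"A": [1, 2, 4]})

same = frames_equal(a, b)        # identical frames
different = frames_equal(a, c)   # one value differs

# Floats that differ by a tiny amount: approximately equal, not exactly equal.
e = pd.DataFrame({"A": [1.0]})
f = pd.DataFrame({"A": [1.0 + 1e-9]})
approx = frames_equal(e, f)
exact = e.equals(f)

print(same, different, approx, exact)
```

If you need the strict behaviour, passing check_exact=True to assert_frame_equal inside the override is one option.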
