Subclassing A Pandas DataFrame, Updates?
To inherit, or not to inherit? What is the latest on the subclassing issue for Pandas? (Most of the other threads are 3-4 years old). I am hoping to do something like ... import pa
Solution 1:
This is how I've done it. I've followed advice found:
The example below only shows how to construct new subclasses of pandas.DataFrame. If you follow the advice in my first link, you may consider subclassing pandas.Series as well, to account for taking single-dimensional slices of your pandas.DataFrame subclass.
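As a rough sketch of the pandas.Series suggestion above: pandas exposes the `_constructor_sliced` and `_constructor_expanddim` hooks, which let a DataFrame subclass and a Series subclass refer to each other. The class names `SomeFrame` and `SomeSeries` are hypothetical and used only for illustration. Exact behavior may vary between pandas versions.

```python
import pandas as pd


class SomeSeries(pd.Series):
    """Hypothetical Series subclass paired with SomeFrame."""

    @property
    def _constructor(self):
        # Operations that return a Series keep the subclass type.
        return SomeSeries

    @property
    def _constructor_expanddim(self):
        # Going from 1-D to 2-D (e.g. to_frame) yields SomeFrame.
        return SomeFrame


class SomeFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass paired with SomeSeries."""

    @property
    def _constructor(self):
        # Operations that return a DataFrame keep the subclass type.
        return SomeFrame

    @property
    def _constructor_sliced(self):
        # Taking a single column or row yields SomeSeries.
        return SomeSeries


df = SomeFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
column = df["A"]            # one-dimensional slice
subset = df[["A", "B"]]     # two-dimensional selection
roundtrip = column.to_frame()

print(type(column).__name__, type(subset).__name__, type(roundtrip).__name__)
```

Without `_constructor_sliced`, selecting a single column falls back to a plain pandas.Series, so any methods defined on the subclass are lost at that point.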
Defining SomeData
import pandas as pd
import numpy as np

class SomeData(pd.DataFrame):
    # This class variable tells pandas the names of the attributes
    # that are to be ported over to derivative DataFrames. There
    # is a method named `__finalize__` that grabs these attributes
    # and assigns them to newly created `SomeData` objects.
    _metadata = ['my_attr']

    @property
    def _constructor(self):
        """This is the key to letting pandas know how to keep
        derivative `SomeData` the same type as yours. It should
        be enough to return the class itself. However, in
        some cases, `__finalize__` is not called and `my_attr` is
        not carried over. We can fix that by constructing a callable
        that makes sure to call `__finalize__` every time."""
        def _c(*args, **kwargs):
            return SomeData(*args, **kwargs).__finalize__(self)
        return _c

    def __init__(self, *args, **kwargs):
        # grab the keyword argument that is supposed to be my_attr
        self.my_attr = kwargs.pop('my_attr', None)
        super().__init__(*args, **kwargs)

    def my_method(self, other):
        return self * np.sign(self - other)
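To isolate what `_metadata` and `__finalize__` do on their own, here is a minimal sketch that calls `__finalize__` by hand. The class `Tagged` and the attribute `tag` are hypothetical names. The sketch assumes `__finalize__` copies every attribute listed in `_metadata` from the source object, which is how the comments above describe it.

```python
import pandas as pd


class Tagged(pd.DataFrame):
    """Hypothetical subclass carrying one extra attribute."""

    # Attribute names that __finalize__ should copy across.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        return Tagged


source = Tagged({"A": [1, 2]})
source.tag = "hello"  # stored as a plain attribute, not as a column

fresh = Tagged({"A": [3, 4]})
before = getattr(fresh, "tag", None)  # nothing has been copied yet

# __finalize__ copies the _metadata attributes from `source` and returns `fresh`.
fresh = fresh.__finalize__(source)

print(before, fresh.tag)
```

The custom `_constructor` in `SomeData` simply makes sure this copying step runs every time pandas builds a derivative object.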
Demonstration
mydata = SomeData(dict(A=[1, 2, 3], B=[4, 5, 6]), my_attr='an attr')
print(mydata, type(mydata), mydata.my_attr, sep='\n' * 2)
   A  B
0  1  4
1  2  5
2  3  6
<class '__main__.SomeData'>
an attr
newdata = mydata.mul(2)
print(newdata, type(newdata), newdata.my_attr, sep='\n' * 2)
   A   B
0  2   8
1  4  10
2  6  12
<class '__main__.SomeData'>
an attr
newerdata = mydata.my_method(newdata)
print(newerdata, type(newerdata), newerdata.my_attr, sep='\n' * 2)
   A  B
0 -1 -4
1 -2 -5
2 -3 -6
<class '__main__.SomeData'>
an attr
Gotchas
This breaks the pd.DataFrame.equals method:
newerdata.equals(newdata) # Should be `False`
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-304-866170ab179e> in <module>()
----> 1 newerdata.equals(newdata)

~/anaconda3/envs/3.6.ml/lib/python3.6/site-packages/pandas/core/generic.py in equals(self, other)
   1034         the same location are considered equal.
   1035         """
-> 1036         if not isinstance(other, self._constructor):
   1037             return False
   1038         return self._data.equals(other._data)

TypeError: isinstance() arg 2 must be a type or tuple of types
What happens is that this method expects the _constructor attribute to hold a class (an object of type type). Instead, it finds the callable that I placed there in order to fix the __finalize__ issue I came across.
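The failure can be reproduced without pandas. In this minimal sketch, `Frame` and `make_frame` are hypothetical stand-ins for the subclass and for the callable returned by `_constructor`. `isinstance` accepts a class as its second argument but raises `TypeError` when given a plain function.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame subclass."""


def make_frame(*args, **kwargs):
    # Plays the role of the `_c` callable returned by `_constructor`.
    return Frame(*args, **kwargs)


obj = Frame()

# A class is a valid second argument.
class_check = isinstance(obj, Frame)

# A function is not, even though calling it returns Frame instances.
try:
    isinstance(obj, make_frame)
    callable_check_failed = False
except TypeError as exc:
    callable_check_failed = True
    print("TypeError:", exc)
```

So any pandas code path that passes `self._constructor` to `isinstance` will fail in the same way for this subclass.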
Workaround
Override the equals method with the following in your class definition.
def equals(self, other):
    try:
        pd.testing.assert_frame_equal(self, other)
        return True
    except AssertionError:
        return False
newerdata.equals(newdata) # Should be `False`
False
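One design point to be aware of: as far as I know, `pd.testing.assert_frame_equal` compares floating-point values with a tolerance by default, whereas the stock `DataFrame.equals` compares them exactly. The sketch below uses plain DataFrames and a hypothetical helper `frames_equal` that mirrors the override above. Passing `check_exact=True` is one way to get the stricter comparison back.

```python
import pandas as pd


def frames_equal(left, right, **kwargs):
    """Hypothetical helper mirroring the `equals` override above.

    Extra keyword arguments are forwarded to assert_frame_equal.
    """
    try:
        pd.testing.assert_frame_equal(left, right, **kwargs)
    except AssertionError:
        return False
    return True


a = pd.DataFrame({"x": [1.0, 2.0]})
b = pd.DataFrame({"x": [1.0, 2.0 + 1e-9]})  # differs by a tiny amount

strict = a.equals(b)                           # built-in exact comparison
tolerant = frames_equal(a, b)                  # default tolerance applies
exact = frames_equal(a, b, check_exact=True)   # tolerance switched off

print(strict, tolerant, exact)
```

If the subclass needs the same semantics as the original `equals`, adding `check_exact=True` to the call inside the override is probably the closest match.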