Dataframe Re-indexing Object Unnecessarily Preserved In Memory
As a continuation of this question, I've implemented two functions that do the same thing: one uses re-indexing and the other does not. The functions differ in the 3rd line: def upd
Solution 1:
Here is my debug code. When you do indexing, the Index object creates a _tuples cache and an engine map, and I think the memory is being held by these two cached objects. If I add the lines marked with ****, the memory increase is very small, about 6 MB on my PC:
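import gc
import os

import numpy as np
import pandas as pd
import psutil

print(pd.__version__)


def get_memory():
    """Return the resident set size (RSS) of this process, in bytes."""
    p = psutil.Process(os.getpid())
    # Older psutil releases spelled this p.get_memory_info().
    return p.memory_info().rss


def get_object_ids():
    """Return the ids of all objects currently tracked by the garbage collector."""
    return set(id(obj) for obj in gc.get_objects())
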
m1 = get_memory()
n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a","b","c"])
ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))
m2 = get_memory()
objs1 = get_object_ids()
z = []
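for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
# The two calls below use private pandas attributes from the version this was
# written against; they may not exist in other releases.
df.index._tuples = None  # **** drop the cached tuples
df.index._cleanup()      # **** clear the cached engine map
del df2
gc.collect()             # ****
m3 = get_memory()
# Memory used to build df, then memory gained by the reindex loop, in MB.
print("%.1f %.1f" % ((m2 - m1) / 1e6, (m3 - m2) / 1e6))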
from collections import Counter

# Count, by type name, the objects created since the objs1 snapshot.
counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print(counter)
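
The caching behaviour described above can be illustrated without pandas. The following is only a minimal sketch, assuming Python 3 (for the standard-library tracemalloc module): ToyIndex, get_loc, clear_cache and traced_bytes are hypothetical names made up for this example, not pandas internals. It shows the general pattern of a lookup map that is built on first use, stays alive as long as the index does, and is released only when the cache is cleared explicitly.

```python
import gc
import tracemalloc


class ToyIndex:
    """Hypothetical stand-in for an index that caches a lookup map."""

    def __init__(self, labels):
        self.labels = list(labels)
        self._lookup = None  # built lazily on the first lookup

    def get_loc(self, label):
        # The first lookup builds a label -> position dict and keeps it.
        if self._lookup is None:
            self._lookup = {lab: pos for pos, lab in enumerate(self.labels)}
        return self._lookup[label]

    def clear_cache(self):
        # Drop the cached map so that its memory can be reclaimed.
        self._lookup = None


def traced_bytes():
    """Bytes currently allocated by Python objects, as seen by tracemalloc."""
    gc.collect()
    current, _peak = tracemalloc.get_traced_memory()
    return current


tracemalloc.start()
index = ToyIndex((i, j) for i in range(300) for j in range(300))
baseline = traced_bytes()

position = index.get_loc((12, 34))  # triggers the cache build
with_cache = traced_bytes()

index.clear_cache()
after_clear = traced_bytes()
tracemalloc.stop()

print("cache cost: %.1f MB" % ((with_cache - baseline) / 1e6))
print("left after clearing: %.1f MB" % ((after_clear - baseline) / 1e6))
```

Note that tracemalloc only counts allocations made through Python's allocator, while the psutil measurement in the answer reports the RSS of the whole process, so the two numbers are not directly comparable.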