
Dataframe Re-indexing Object Unnecessarily Preserved In Memory

In continuation from this question, I've implemented 2 functions that do the same thing: one uses re-indexing and the other does not. The functions differ in the 3rd line: def upd

Solution 1:

Here is my debug code. When you do indexing, the Index object creates a _tuples cache and an engine map, and I think the memory is being held by these two cache objects. If I add the lines marked with ****, the memory increase is very small, about 6 MB on my PC:

import pandas as pd
print(pd.__version__)
import numpy as np
import psutil
import os
import gc

def get_memory():
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.memory_info().rss  # resident set size in bytes (older psutil releases named this get_memory_info())

def get_object_ids():
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a","b","c"])

ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
df.index._tuples = None    # **** drop the cached tuples (private attribute)
df.index._cleanup()        # **** clear the index engine's mapping (private method)
del df2
gc.collect()               # ****
m3 = get_memory()

print((m2-m1)/1e6, (m3-m2)/1e6)  # memory growth in MB: building df, then the reindex loop

from collections import Counter

counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print(counter)
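
The object-counting step at the end of the script can be pulled into a small reusable helper. What follows is only a minimal sketch using the standard library, not part of the original debug code: `new_objects_by_type`, `Node`, `kept` and `fill` are hypothetical names chosen for illustration, and `Node` is just a stand-in for whatever object a cache might keep alive. Only objects tracked by the garbage collector are counted, so plain ints and strings will not show up.

```python
import gc
from collections import Counter


def new_objects_by_type(action):
    """Run action() and count gc-tracked objects that exist afterwards
    but did not exist before, grouped by type name."""
    gc.collect()
    before = set(map(id, gc.get_objects()))  # ids of everything tracked so far
    result = action()                        # keep the result alive while counting
    gc.collect()
    counter = Counter()
    for obj in gc.get_objects():
        if id(obj) not in before:
            counter[type(obj).__name__] += 1
    return result, counter


class Node(object):
    """Hypothetical stand-in for an object held alive by a cache."""


kept = []  # plays the role of a cache that is never cleared


def fill():
    # Every Node appended here stays reachable through `kept`
    for _ in range(100):
        kept.append(Node())


_, leaked = new_objects_by_type(fill)
print(leaked["Node"])
```

Wrapping the reindex loop from the script in a function and passing it to such a helper, once with and once without the lines marked ****, would be one way to compare which object types survive in each case. The counter also picks up a few bookkeeping objects created by the helper itself, so look at the types you care about rather than the total.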
