
Remove Multiple Items From A Numpy.ndarray Without Numpy.delete

I am using a large numpy.ndarray (11,000 x 3180) to develop an active learning algorithm (text mining). In this algorithm, I have to delete 16 samples (row vectors) each iteration in m…

Solution 1:

It may help to understand exactly what np.delete does. In your case

newset = np.delete(dataset, ListifoIndex, axis=0)  # corrected

in essence it does:

keep = np.ones(dataset.shape[0], dtype=bool)  # one True per row (1st dim)
keep[ListifoIndex] = False                    # mark the rows to drop
newset = dataset[keep, :]                     # copy only the kept rows

In other words, it constructs a boolean index of the rows it wants to keep.

If I run

dataset = np.delete(dataset, ListifoIndex, axis=0)

repeatedly in an interactive shell, there isn't any accumulation of intermediate arrays. While delete runs there is, temporarily, the keep array plus a new copy of dataset, but once the result is assigned back, the old copy is freed.
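
A minimal sketch of that pattern, with array sizes borrowed from the question (the 16-row batches here are chosen at random purely for illustration):

import numpy as np

dataset = np.random.rand(11000, 3180)
for _ in range(10):
    # pick 16 rows to remove this iteration (stand-in for the real selection)
    idx = np.random.choice(len(dataset), 16, replace=False)
    dataset = np.delete(dataset, idx, axis=0)  # rebinding frees the old array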

Are you sure it's the delete that's growing memory use, rather than the growing training set?

As for speed, you might improve it by maintaining a boolean 'mask' of all deleted rows, rather than actually deleting anything. But depending on how ListifoIndex overlaps with previous deletions, updating that mask might be more trouble than it's worth, and it's also likely to be more error-prone. A sketch of the idea follows.
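
A minimal sketch of that mask-based approach, reusing the names from above; note that ListifoIndex must then hold indices into the original dataset, which is exactly the bookkeeping that can go wrong:

import numpy as np

alive = np.ones(dataset.shape[0], dtype=bool)  # True = row still in play

# each iteration: mark the chosen rows dead instead of deleting them
alive[ListifoIndex] = False

# materialize the surviving rows only when they are actually needed
newset = dataset[alive]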

Solution 2:

I know this is old, but I ran into the same problem and wanted to share the fix here. You are sort of correct when you say that numpy.delete keeps a copy of the database, but it isn't numpy, it's Python itself.

Say you randomly choose a row from the database to be part of the training set. Instead of taking the row's data, Python takes a reference to the row, which keeps the whole database alive for when you next want to use that row. So when you delete the row from the old database, you create a new database from which to choose the next row, while the old one is kept around because a row in the training set still references it. 100 iterations later you end up with 100 copies of the database, each one row shorter than the last, but containing the same data.
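
You can see this reference-keeping with numpy directly; a small sketch (array name and sizes made up for illustration):

import numpy as np

database = np.random.rand(1000, 10)
row = database[0]            # basic indexing returns a view, not a copy
print(row.base is database)  # True: the view keeps the whole array alive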

The solution I found: instead of appending the row itself to the training set, make a copy with copy.deepcopy and append that. This way Python doesn't need to keep the old database around for reference purposes.

Bad -

database = [0, 1, 2, 3, 4, 5, 6]
Train = []
for i in range(len(database)):
    Train.append(database[i])  # appends a reference to the element, not a copy

Good -

import copy

for i in range(len(database)):
    copy_of_thing = copy.deepcopy(database[i])  # fully independent copy
    Train.append(copy_of_thing)
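
Since the question is about a numpy array rather than a list, the ndarray equivalent of this trick is the .copy() method; a hedged sketch with made-up names and sizes:

import numpy as np

dataset = np.random.rand(11000, 3180)
Train = []
for i in range(100):
    # dataset[i] alone would be a view that pins the whole array in memory;
    # .copy() detaches the row, like copy.deepcopy above
    Train.append(dataset[i].copy())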

Solution 3:

If the order doesn't matter, you can swap the rows you want to delete to the end of the array:

import numpy as np

n = 1000
a = np.random.rand(n, 8)
a[:, 0] = np.arange(n)  # tag each row with its original index
del_index = np.array([10, 100, 200, 500, 800, 995, 997, 999])

# rows to delete that do not already sit in the tail region
del_index2 = del_index[del_index < len(a) - len(del_index)]

# tail positions, excluding those that are themselves marked for deletion
copy_index = np.arange(len(a) - len(del_index), len(a))
copy_index2 = np.setdiff1d(copy_index, del_index)

# fancy indexing returns copies, so this swap works in one line
a[copy_index2], a[del_index2] = a[del_index2], a[copy_index2]

and then you can use a slice to create a new view:

a2 = a[:-len(del_index)]
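
A quick sanity check (a sketch relying on the index tags stored in column 0 of the example above): the surviving tags should be exactly everything except del_index.

surviving = np.sort(a2[:, 0]).astype(int)
expected = np.setdiff1d(np.arange(n), del_index)
print(np.array_equal(surviving, expected))  # True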

If you want to keep the order, you can use a for loop and slice copies:

import numpy as np

n = 1000
a = np.random.rand(n, 8)
a[:, 0] = np.arange(n)

del_index = np.array([100, 10, 200, 500, 800, 995, 997, 999])
del_index.sort()
a2 = np.delete(a, del_index, axis=0)  # reference result to check against

# shift the segment between each pair of deletion indices left by i+1 rows;
# note this relies on the last deletion index being the last row, otherwise
# the tail after del_index[-1] would need the same shift
for i, (start, end) in enumerate(zip(del_index[:-1], del_index[1:])):
    a[start-i:end-1-i] = a[start+1:end]

print(np.all(a[:-8] == a2))
