Subsampling An Unbalanced Dataset In Tensorflow
Tensorflow beginner here. This is my first project and I am working with pre-defined estimators. I have an extremely unbalanced dataset where positive outcomes represent roughly 0.
Solution 1:
You will probably get better results by oversampling your under-represented class rather than throwing away data in your over-represented class. This way you keep the variance in the over-represented class. You might as well use the data you have.
The easiest way to achieve this is probably to create two Datasets, one for each class. Then you can use Dataset.interleave to sample equally from both datasets.
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave
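As a rough sketch of the two-dataset idea, the snippet below splits a toy dataset by label and alternates strictly between the classes. Note that it swaps in tf.data.Dataset.choose_from_datasets (available from TF 2.7) instead of Dataset.interleave, because picking between two already-built datasets is more direct that way. The toy features and labels and all variable names are assumptions made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 1000 examples, 10 of them positive (1%).
features = np.arange(1000, dtype=np.float32).reshape(-1, 1)
labels = (np.arange(1000) % 100 == 0).astype(np.int64)

ds = tf.data.Dataset.from_tensor_slices((features, labels))

# One dataset per class. The rare class is repeated so it never runs out,
# which is what makes this oversampling rather than subsampling.
pos_ds = ds.filter(lambda x, y: tf.equal(y, 1)).repeat()
neg_ds = ds.filter(lambda x, y: tf.equal(y, 0))

# Choice indices 0, 1, 0, 1, ... pick pos_ds and neg_ds in turn.
choice_ds = tf.data.Dataset.range(2).repeat()
balanced_ds = tf.data.Dataset.choose_from_datasets(
    [pos_ds, neg_ds], choice_ds, stop_on_empty_dataset=True)

# Inspect the class balance of the first 200 elements.
sample_labels = [int(y) for _, y in balanced_ds.take(200).as_numpy_iterator()]
positive_fraction = sum(sample_labels) / len(sample_labels)
```

Because the positives are repeated, each one is seen many times per pass over the negatives; shuffling and batching would normally be applied to balanced_ds afterwards.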
Solution 2:
Oversampling can be easily achieved with the following code:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.7, 0.3])
TensorFlow has a good guide on dealing with unbalanced data; you can find more ideas here: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#oversampling
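For context, here is a minimal self-contained sketch of weighted sampling from two per-class datasets. It assumes TF 2.7 or later, where the call is also available as tf.data.Dataset.sample_from_datasets and the tf.data.experimental name is deprecated. The stand-in pos_ds and neg_ds (labels only), the 50/50 weights, and the seed are illustrative assumptions, not values from the answer.

```python
import tensorflow as tf

# Hypothetical stand-ins for the per-class datasets: each yields only its label.
# Both are repeated so neither is exhausted while sampling.
pos_ds = tf.data.Dataset.from_tensors(1).repeat()
neg_ds = tf.data.Dataset.from_tensors(0).repeat()

# Each element is drawn from pos_ds or neg_ds at random according to the weights.
resampled_ds = tf.data.Dataset.sample_from_datasets(
    [pos_ds, neg_ds], weights=[0.5, 0.5], seed=0)

# Check the empirical class balance over 2000 draws.
drawn = list(resampled_ds.take(2000).as_numpy_iterator())
positive_fraction = sum(int(v) for v in drawn) / len(drawn)
```

The weights set the expected share of each class in the output stream, so they can be tuned toward whatever balance works best for the model.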