
Distributing Python Module - Spark Vs Process Pools

I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots

Solution 1:

This is a perfectly reasonable use case for Spark, since you can distribute the text-extraction work across multiple executors by placing the files on distributed storage. That lets you scale out your compute to process the files and write the results back out efficiently with PySpark, and you can even reuse your existing Python text-extraction code:

# binaryFiles returns (filename, content) pairs; tuple-unpacking lambdas are Python 2 only
input_rdd = sc.binaryFiles("/path/to/files")
processed = input_rdd.map(lambda pair: (pair[0], myModule.extract(pair[1])))
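
To make that snippet complete you also need a SparkSession and a step that writes the extracted text back out. Here is a minimal end-to-end sketch, assuming myModule.extract() takes the raw PDF bytes and returns a string, and that the input and output paths are placeholders for your distributed storage locations:

from pyspark.sql import SparkSession

import myModule  # your existing extraction module, assumed importable on every executor

spark = SparkSession.builder.appName("pdf-text-extraction").getOrCreate()
sc = spark.sparkContext

# (path, bytes) pairs, one element per PDF on distributed storage
pdfs = sc.binaryFiles("hdfs:///path/to/files")

# Extraction runs in parallel across the executors, one partition at a time
extracted = pdfs.map(lambda pair: (pair[0], myModule.extract(pair[1])))

# Persist the (path, text) results; each partition becomes one output part-file
extracted.saveAsTextFile("hdfs:///path/to/output")

spark.stop()

In a real job you might convert the pairs to a DataFrame and write Parquet instead, but the overall structure stays the same.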

As your data volume grows or you need more throughput, you can simply add more nodes to the cluster.
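
As a usage example, a job like the sketch above would normally be launched with spark-submit. The script name and resource numbers below are illustrative placeholders for a YARN cluster, and --py-files is one way to ship your extraction module to the executors:

spark-submit \
    --master yarn \
    --num-executors 20 \
    --executor-cores 4 \
    --py-files myModule.py \
    extract_job.py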
