
Distributing Python Module - Spark Vs Process Pools

I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots

Solution 1:

This is a perfectly reasonable use case for Spark, since you can distribute the text-extraction work across multiple executors by placing the files on distributed storage. That lets you scale out your compute to process the files and write the results back out efficiently with PySpark, and you can even reuse your existing Python text-extraction code:

# binaryFiles returns (filename, content) pairs; tuple-unpacking lambdas are Python 2 only
input_rdd = sc.binaryFiles("/path/to/files")
processed = input_rdd.map(lambda pair: (pair[0], myModule.extract(pair[1])))
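
To make that snippet complete you also need a SparkSession and a step that writes the extracted text back out. Here is a minimal end-to-end sketch, assuming myModule.extract() takes the raw PDF bytes and returns a string, and that the input and output paths are placeholders for your distributed storage locations:

from pyspark.sql import SparkSession

import myModule  # your existing extraction module, assumed importable on every executor

spark = SparkSession.builder.appName("pdf-text-extraction").getOrCreate()
sc = spark.sparkContext

# (path, bytes) pairs, one element per PDF on distributed storage
pdfs = sc.binaryFiles("hdfs:///path/to/files")

# Extraction runs in parallel across the executors, one partition at a time
extracted = pdfs.map(lambda pair: (pair[0], myModule.extract(pair[1])))

# Persist the (path, text) results; each partition becomes one output part-file
extracted.saveAsTextFile("hdfs:///path/to/output")

spark.stop()

In a real job you might convert the pairs to a DataFrame and write Parquet instead, but the overall structure stays the same.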

As your data volume grows or you need more throughput, you can simply add more nodes to the cluster.
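
As a usage example, a job like the sketch above would normally be launched with spark-submit. The script name and resource numbers below are illustrative placeholders for a YARN cluster, and --py-files is one way to ship your extraction module to the executors:

spark-submit \
    --master yarn \
    --num-executors 20 \
    --executor-cores 4 \
    --py-files myModule.py \
    extract_job.py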
