
Concurrent SAX Processing of Large, Simple XML Files?

I have a couple of gigantic XML files (10GB-40GB) that have a very simple structure: just a single root node containing multiple row nodes. I'm trying to parse them using SAX in Python, and I'd like to split the work across multiple threads or processes to speed things up.

Solution 1:

You can't easily split the SAX parsing into multiple threads, and you don't need to: if you just run the parse without any other processing, it should run in 20 minutes or so. Focus on the processing you do to the data in your ContentHandler.
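To see what the raw parse costs on its own, a minimal sketch like the following can help: a SAX handler that does nothing but count row elements, so any extra time in the real run is attributable to the per-row processing. The element name "row" and the file name "huge.xml" are assumptions about the data.

```python
import xml.sax

class RowCounter(xml.sax.ContentHandler):
    """Bare-bones handler: counts <row> elements and does no other work."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "row":
            self.count += 1

handler = RowCounter()
xml.sax.parse("huge.xml", handler)  # placeholder file name
print(handler.count)
```

Timing this against the full pipeline shows how much of the runtime is parsing versus processing.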

Solution 2:

My suggestion is to read the whole XML file into an intermediate format first and do the extra processing afterwards. SAX should be fast enough to read 40GB of XML in no more than an hour.

Depending on the data, you could use a SQLite database or an HDF5 file for intermediate storage.
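Here is a sketch of the SQLite variant of that idea: the handler streams each row's text into a table as it is parsed, so nothing is held in memory, and the heavier processing can run as queries afterwards. The element name "row", the single-column schema, and the file names are assumptions about the data.

```python
import sqlite3
import xml.sax

class RowToSqlite(xml.sax.ContentHandler):
    """Streams the text content of each <row> element into a SQLite table."""
    def __init__(self, conn):
        super().__init__()
        self.cur = conn.cursor()
        self.cur.execute("CREATE TABLE IF NOT EXISTS rows (text TEXT)")
        self.in_row = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "row":
            self.in_row = True
            self.chunks = []

    def characters(self, content):
        # SAX may deliver the text of one element in several pieces,
        # so collect chunks and join them at the closing tag.
        if self.in_row:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "row":
            self.cur.execute("INSERT INTO rows VALUES (?)",
                             ("".join(self.chunks),))
            self.in_row = False

conn = sqlite3.connect("rows.db")
xml.sax.parse("huge.xml", RowToSqlite(conn))
conn.commit()
conn.close()
```

For a 40GB file you would likely want to commit in batches rather than once at the end, but the structure stays the same.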

By the way, Python threads won't give you true parallelism for CPU-bound work because of the Global Interpreter Lock (GIL). You need the multiprocessing module to split the work across separate processes.
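A sketch of that multiprocessing approach, under the same assumptions about the data: the main process runs the SAX parse and pushes each completed row onto a queue, while worker processes do the CPU-heavy part. process_row() is a hypothetical stand-in for whatever per-row work is actually needed.

```python
import multiprocessing as mp
import xml.sax

def process_row(text):
    return len(text)  # placeholder for the real per-row computation

def worker(queue):
    while True:
        text = queue.get()
        if text is None:  # sentinel: no more rows
            break
        process_row(text)

class RowDispatcher(xml.sax.ContentHandler):
    """Parses in this process; hands finished <row> texts to workers."""
    def __init__(self, queue):
        super().__init__()
        self.queue = queue
        self.in_row = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "row":
            self.in_row = True
            self.chunks = []

    def characters(self, content):
        if self.in_row:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "row":
            self.queue.put("".join(self.chunks))
            self.in_row = False

if __name__ == "__main__":
    # Bound the queue so the parser can't race far ahead of the workers.
    queue = mp.Queue(maxsize=1000)
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    xml.sax.parse("huge.xml", RowDispatcher(queue))
    for _ in workers:
        queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
```

Note that this only pays off if the per-row processing is genuinely expensive; the parse itself remains single-process either way.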
