
Difference Between Python Urllib.urlretrieve() And Wget

I am trying to retrieve a 500 MB file using Python, and I have a script which uses urllib.urlretrieve(). There seems to be some network problem between me and the download site, as this…

Solution 1:

The answer is quite simple. Python's urllib and urllib2 are nowhere near as mature and robust as they could be. Even better than wget, in my experience, is cURL. I've written code that downloads gigabytes of files over HTTP, with file sizes ranging from 50 KB to over 2 GB. To my knowledge, cURL is the most reliable piece of software on the planet right now for this task; I don't think Python, wget, or even most web browsers can match it in terms of correctness and robustness of implementation. On a modern enough Python, urllib2 used in exactly the right way can be made pretty reliable, but I still run a curl subprocess, and that is absolutely rock solid.

Another way to state this is that cURL does one thing only and it does it better than any other software because it has had much more development and refinement. Python's urllib2 is serviceable and convenient and works well enough for small to average workloads, but cURL is way ahead in terms of reliability.

Also, cURL has numerous options to tune its reliability behavior, including retry counts, timeout values, etc.
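
As a rough illustration of the "curl subprocess" approach described above, here is a minimal sketch. The URL, output filename, and the particular retry/timeout values are assumptions for the example, not something from the original answer; the flags themselves (-f, -L, -C -, --retry, --connect-timeout, --max-time, -o) are standard curl options.

    import subprocess

    def fetch_with_curl(url, output_path):
        """Download url to output_path by shelling out to the curl binary.

        Flags used:
          -f                   fail on HTTP 4xx/5xx instead of saving the error page
          -L                   follow redirects
          -C -                 resume a partial download if the output file exists
          --retry 5            retry transient failures up to five times
          --connect-timeout 30 give up if the TCP connection takes longer than 30 s
          --max-time 3600      abort the whole transfer after an hour
        """
        cmd = [
            "curl", "-f", "-L", "-C", "-",
            "--retry", "5",
            "--connect-timeout", "30",
            "--max-time", "3600",
            "-o", output_path,
            url,
        ]
        # Raises CalledProcessError if curl exits non-zero, so failures are visible.
        subprocess.check_call(cmd)

    fetch_with_curl("http://example.com/really_big_file.html", "local_outputfile.html")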


Solution 2:

If you are using something like:

page = urllib.urlopen('http://example.com/really_big_file.html').read()

you are building a 500 MB string in memory, which may well tax your machine, slow it down, and cause the connection to time out. If so, you should be using:

(filename, headers) = urllib.urlretrieve('http://...', 'local_outputfile.html')

which streams the download straight to a local file and won't tax the interpreter.

It is worth noting that urllib.urlretrieve() is built on the same machinery as urllib.urlopen(), which is now deprecated.
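
If you want to stay within the standard library while avoiding the in-memory string entirely, a minimal sketch along these lines streams the body to disk in fixed-size chunks. The URL, filename, chunk size, and timeout are placeholders, and the example uses urllib2 (Python 2); on Python 3 the equivalent lives in urllib.request.

    import shutil
    import urllib2  # on Python 3: from urllib import request as urllib2

    def download_to_file(url, output_path, chunk_size=64 * 1024, timeout=30):
        """Copy the response body to disk chunk by chunk, so the whole
        file is never held in memory as a single string."""
        response = urllib2.urlopen(url, timeout=timeout)
        try:
            with open(output_path, "wb") as out:
                shutil.copyfileobj(response, out, chunk_size)
        finally:
            response.close()

    download_to_file("http://example.com/really_big_file.html", "local_outputfile.html")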

