
For Loop Web Scraping A Website Brings Up Timeouterror, Newconnectionerror And A Requests.exceptions.connectionerror

Apologies, I am a beginner at Python and web scraping. I am web scraping wugniu.com to extract readings for characters that I input. I made a list of 10273 characters to format int

Solution 1:

That's completely normal, expected behavior, and it's essentially a chicken-and-egg issue.

Imagine opening the Firefox browser, loading google.com, then closing the browser and repeating that cycle over and over.

That pattern counts as a DoS attack, and most modern servers will block your requests and flag your IP, since it genuinely hurts their bandwidth.

The logical and right approach is to reuse the same session instead of creating a new connection for every request; that way your traffic won't be flagged as a TCP SYN flood. Check the legal TCP flags.
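A minimal sketch of the idea. The fetch_chars helper and its signature are my own illustration, not part of the original answer; it assumes base_url contains a {} placeholder for the character:

```python
def fetch_chars(chars, base_url, session=None):
    """Fetch one page per character through a single shared session.

    Illustrative helper (not from the original answer); base_url must
    contain a {} placeholder for the character.
    """
    owns_session = session is None
    if owns_session:
        import requests  # third-party; only needed when no session is injected
        session = requests.Session()  # one connection pool reused for every request
    try:
        return [session.get(base_url.format(c)) for c in chars]
    finally:
        if owns_session:
            session.close()
```

Because the session keeps the underlying TCP connection alive between requests, the server sees one well-behaved client rather than thousands of fresh connection attempts.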

Separately, you really should use a context manager instead of having to remember to close your file handles yourself.

Example:

output = open("out.txt", "a", encoding="utf-8")
output.close()

can be handled with a with statement instead:

with open("out.txt", "a", encoding="utf-8") as output:
    # do your writes here

Once execution leaves the with block, your file is closed automatically, even if an error occurs.

Also, consider using the newer str.format (or an f-string) instead of old %-style formatting:

url = "https://wugniu.com/search?char=%s&table=wenzhou" % char

can be written as:

"https://wugniu.com/search?char={}&table=wenzhou".format(char)

I've deliberately kept the code simple rather than production-grade, so the concept is easy to follow.

Pay attention to how I pick the desired elements and how I write them to the file. A speed comparison between lxml and html.parser can be found here.

import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings()


def main(url, chars):
    with open('result.txt', 'w', newline='', encoding='utf-8') as f, requests.Session() as req:
        req.verify = False  # disable TLS verification once for the whole session
        for char in chars:
            print(f"Extracting {char}")
            r = req.get(url.format(char))
            soup = BeautifulSoup(r.text, 'lxml')
            # audio tags whose id starts with "0-" carry the readings
            target = [x['id'][2:] for x in soup.select('audio[id^="0-"]')]
            print(target)
            f.write(f'{char}\n{target}\n')


if __name__ == "__main__":
    chars = ['核']
    main('https://wugniu.com/search?char={}&table=wenzhou', chars)

Also, following the Python DRY principle, you can set req.verify = False once on the session instead of passing verify=False to every single request.

Next step: take a look at threading or async programming to improve your run time. In real-world projects we don't use a plain sequential for loop (it's really slow) when we can send a bunch of URLs concurrently and wait for the responses.
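A rough threading sketch with concurrent.futures. The fetch and fetch_all helpers and the session_factory parameter are my own illustration, not from the original answer; note that requests.Session is not officially documented as thread-safe, although sharing one for plain GETs is a common pattern (the cautious alternative is one session per worker thread):

```python
from concurrent.futures import ThreadPoolExecutor

URL = "https://wugniu.com/search?char={}&table=wenzhou"

def fetch(session, char):
    # Returns (char, page_text) so results can be matched back to inputs
    return char, session.get(URL.format(char)).text

def fetch_all(chars, session_factory=None, workers=10):
    """Fetch all characters concurrently using a pool of worker threads."""
    if session_factory is None:
        import requests  # third-party; used only when no factory is injected
        session_factory = requests.Session
    with session_factory() as session, ThreadPoolExecutor(workers) as pool:
        # pool.map preserves input order; dict() pairs up (char, text) tuples
        return dict(pool.map(lambda c: fetch(session, c), chars))
```

With 10 workers, 10273 characters would be fetched roughly ten at a time instead of strictly one after another.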
