Web Scraping Using Python
I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python.
Solution 1:
http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:
import urllib2
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author@example.com>')
opener = urllib2.build_opener()
content = opener.open(r).read()
Refusing requests without Accept headers is incorrect; RFC 2616 clearly states:
"If no Accept header field is present, then it is assumed that the client accepts all media types."
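The snippet above targets Python 2, where urllib2 lives. On Python 3 the same idea can be sketched with urllib.request. The helper names build_request and fetch, the contact parameter, and the 10-second timeout are illustrative assumptions, not part of the original answer, and whether the site still answers this way has not been checked here.

```python
import urllib.request


def build_request(url, contact="author@example.com"):
    """Build a Request that carries explicit Accept and User-Agent headers.

    `contact` is a hypothetical parameter used only to fill in the
    User-Agent string from the original answer.
    """
    req = urllib.request.Request(url)
    # Some servers reject requests that lack an Accept header,
    # so send one that allows any media type.
    req.add_header("Accept", "*/*")
    req.add_header("User-Agent", "My scraping program <%s>" % contact)
    return req


def fetch(url):
    """Open the URL with the headers above and return the raw body as bytes."""
    # The timeout value is an arbitrary choice for this sketch.
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read()


# Usage (performs a real network request):
#   content = fetch("http://www.nseindia.com/")
```

Keeping the header setup in its own function makes it possible to inspect the request before anything is sent, which helps when working out which header a server is actually objecting to.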