
Python - Issue Scraping With BeautifulSoup

I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib as a personal project. I've run into a problem while trying to scrape all the links to the 50 jo

Solution 1:

Disclaimer: I did some asking of my own for a part of this answer.

from bs4 import BeautifulSoup
import requests
import json

# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')

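# find() returns the first matching <script> tag as a single bs4.element.Tag
s = soup.find('script', type='application/ld+json')
# itemListElement is a list of dicts, one per job listing, each with a 'url' key
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]

print(len(urls))  # printed 50 when this answer was written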

Process:

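  1. Use soup.find rather than soup.find_all. It returns the first matching <script> element as a single bs4.element.Tag whose text is JSON, rather than a list of tags.
  2. json.loads(s.text) returns a nested dict. Its itemListElement key holds a list of dicts, one per job, each with a url key; the list comprehension collects those url values into a list.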
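
To make the second step concrete, here is a minimal sketch of the JSON handling on its own, using only the standard library. The payload string is a hypothetical stand-in for s.text: its URLs are made up and the real page's JSON-LD may carry more fields. The helper name extract_urls is also just an illustration, not part of the original answer.

```python
import json

# Hypothetical stand-in for s.text: a trimmed JSON-LD ItemList with made-up URLs.
payload = """
{
  "@context": "http://schema.org",
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "url": "https://example.com/jobs/1"},
    {"@type": "ListItem", "position": 2, "url": "https://example.com/jobs/2"},
    {"@type": "ListItem", "position": 3}
  ]
}
"""


def extract_urls(json_ld_text):
    """Return the 'url' value of every list item that has one."""
    data = json.loads(json_ld_text)
    # itemListElement is a list of dicts; fall back to an empty list if absent.
    items = data.get("itemListElement", [])
    # Skip entries without a 'url' key instead of raising KeyError.
    return [item["url"] for item in items if "url" in item]


urls = extract_urls(payload)
print(urls)
```

Unlike the one-line comprehension in the answer, this version tolerates a missing itemListElement key or an entry without a url, which may help if the page's markup changes.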
