Get Links From Requests Or Beautifulsoup That Are Hidden Inside The Page
Solution 1:
You can do this without the heavy guns of Selenium.
You can fake the Ajax request and get all the links to the bulk downloads of all files for all folders.
Here's how to do it:
import re

import requests

with requests.Session() as connection:
    # Visit the landing page first to initialise the session
    _ = connection.get("https://ec.europa.eu")
    initial_response = connection.get(
        "https://ec.europa.eu/eurostat/web/main/data/database",
    )
    # The navigation tree is fetched via an Ajax call; pull its URL out of the page
    ajax_url = re.search(
        r"sendAjaxRequest\('(.*?)',",
        initial_response.text,
    ).group(1)
    main_response = connection.get(ajax_url).text
    # Collect the bulk-download links embedded in the Ajax response
    links = re.findall(r"href: '(https:.*?)',", main_response)
    for link in links:
        r = connection.get(link)
        file_name = r.headers["Content-Disposition"].split("=")[-1]
        print(f"Downloading: {file_name}")
        with open(file_name, "wb") as f:
            f.write(r.content)
This grabs all the .zip files that contain all the files per folder you see on the page:
NavTree_cei_en.zip
NavTree_es_en.zip
NavTree_sdg_en.zip
NavTree_shorties_en.zip
NavTree_t2020_en.zip
NavTree_tepsr_en.zip
NavTree_tips_en.zip
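If you also want to unpack those archives after downloading them, here's a minimal sketch using Python's standard zipfile module (extracting each archive into a folder named after it is just my assumption of what you'd want):

import zipfile
from pathlib import Path

# Unpack every downloaded NavTree_*.zip into its own folder
for archive in Path(".").glob("NavTree_*.zip"):
    target = Path(archive.stem)  # e.g. "NavTree_cei_en"
    target.mkdir(exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    print(f"Extracted {archive.name} -> {target}/")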
EDIT:
Alright, you were right that not all the files were there, but you didn't mention that there's an entire section on Eurostat's page devoted to bulk downloads.
If you read their guide and check out some of the URLs, you'll most likely end up here.
Then you can go to [data] and, boom, there are all the files and their links.
You can order by ALL and scrape that.
You should end up with over 6700 files available for download.
Here's how to do it:
import time

import requests
from bs4 import BeautifulSoup


def wait_a_bit(wait_for: float = 1.5):
    time.sleep(wait_for)


with requests.Session() as connection:
    endpoint = connection.get(
        "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/"
        "BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
    )
    # Every downloadable file is an <a rel="external"> link in the listing
    soup = BeautifulSoup(endpoint.text, "lxml").find_all("a", {"rel": "external"})
    links = [a["href"] for a in soup]
    print(f"Found {len(links)} files.")
    for link in links[:10]:  # getting only the first 10 files
        r = connection.get(link)
        file_name = (
            r.headers["Content-Disposition"]
            .split("=")[-1]
            .replace('"', "")
        )
        print(f"Downloading: {file_name}")
        with open(file_name, "wb") as f:
            f.write(r.content)
        wait_a_bit()
NOTE: I'm using a limiter here (fetching only the first 10 files), but if you really want to download ALL the files then change this:
for link in links[:10]:  # getting only the first 10 files
to this:
for link in links:
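Also, if you really do fetch all 6700+ files, you might not want to hold each response fully in memory. Here's a minimal sketch of a streaming version of the download loop, reusing connection, links and wait_a_bit from the snippet above (the 64 KB chunk size is just an example):

# Drop-in replacement for the download loop above; reuses `connection`,
# `links` and `wait_a_bit` from the previous snippet.
for link in links:
    with connection.get(link, stream=True) as r:
        file_name = (
            r.headers["Content-Disposition"].split("=")[-1].replace('"', "")
        )
        print(f"Downloading: {file_name}")
        with open(file_name, "wb") as f:
            # Write the body in chunks instead of loading it all at once
            for chunk in r.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
    wait_a_bit()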
Solution 2:
As mentioned by @rdas, the website you are trying to scrape renders the links after the page is loaded; it is probably using a modern JS front-end framework.
To extract the links from this website you can use Selenium: wait for the website to render, then extract all the links and parse them to get the ones you want. If all you want is to download the data, you can also just open the links and download the files (also with Selenium).
The code would look something like this:
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

# Open the website
driver.get("https://ec.europa.eu/eurostat/web/main/data/database")

# Wait for the page to load
time.sleep(5)

elems = driver.find_elements_by_xpath("//a[@href]")
extracted_links = []
# Iterate over elements with hrefs
for elem in elems:
    # Condition to only get the links you want
    if elem.get_attribute("href").endswith(".tsv.gz"):
        extracted_links.append(elem.get_attribute("href"))

print(extracted_links)
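One possible improvement: instead of a fixed time.sleep(5), you could let Selenium wait until the link elements actually show up. A minimal sketch using Selenium's built-in WebDriverWait, reusing the driver from above (the 30-second timeout is just an example):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://ec.europa.eu/eurostat/web/main/data/database")

# Block until at least one <a href> element is present, up to 30 seconds
WebDriverWait(driver, 30).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@href]"))
)
elems = driver.find_elements_by_xpath("//a[@href]")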
I hope this helped.