
Get Links From Requests Or Beautifulsoup That Are Hidden Inside The Page

A more general question, since I do not think I have complete enough code to post. https://ec.europa.eu/eurostat/web/main/data/database This is the webpage I am interested in downloading data from.

Solution 1:

You can do this without the heavy guns of selenium.

You can fake the Ajax request and get all the links to bulk downloads of all files for all folders.

Here's how to do it:

import re

import requests

with requests.Session() as connection:
    _ = connection.get("https://ec.europa.eu")
    initial_response = connection.get(
        "https://ec.europa.eu/eurostat/web/main/data/database",
    )
    ajax_url = re.search(
        r"sendAjaxRequest\('(.*?)',",
        initial_response.text,
    ).group(1)
    main_response = connection.get(ajax_url).text
    links = re.findall(r"href: '(https:.*?)',", main_response)
    for link in links:
        r = connection.get(link)
        file_name = r.headers["Content-Disposition"].split("=")[-1]
        print(f"Downloading: {file_name}")
        with open(file_name, "wb") as f:
            f.write(r.content)

This grabs all the .zip files that contain all the files per folder you see on the page.

NavTree_cei_en.zip
NavTree_es_en.zip
NavTree_sdg_en.zip
NavTree_shorties_en.zip
NavTree_t2020_en.zip
NavTree_tepsr_en.zip
NavTree_tips_en.zip

EDIT:

Alright, you were right that not all files were there, but you didn't mention that there's an entire section of Eurostat's site devoted to bulk downloads.

If you read their guide and check out some of the URLs, you'll most likely end up at Eurostat's bulk download listing.

Then you can go to the data directory and, boom, there are all the files and their links.

You can set the listing to show ALL entries and scrape that.

You should end up with over 6700 files available for download.

Here's how to do it:

import time

import requests
from bs4 import BeautifulSoup


def wait_a_bit(wait_for: float = 1.5):
    time.sleep(wait_for)


with requests.Session() as connection:
    endpoint = connection.get(
        "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/""BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
    )
    soup = BeautifulSoup(endpoint.text, "lxml").find_all("a", {"rel": "external"})
    links = [a["href"] for a in soup]
    print(f"Found {len(links)} files.")

    for link in links[:10]:  # getting only first 10 files
        r = connection.get(link)
        file_name = (
            r.headers["Content-Disposition"]
            .split("=")[-1]
            .replace('"', "")
        )
        print(f"Downloading: {file_name}")
        with open(file_name, "wb") as f:
            f.write(r.content)
        wait_a_bit()

NOTE: I'm using a limiter here (fetching only the first 10 files), but if you really want to download ALL the files then change this:

for link in links[:10]:  # getting only first 10 files

to this:

for link in links:
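If you do download everything, some of those archives can be fairly large, so holding each full response in memory via r.content may hurt. Here is a sketch of the same loop streamed to disk in chunks instead; it goes inside the same with requests.Session() as connection: block and assumes the links list and wait_a_bit helper from the snippet above:

    for link in links:
        # stream=True defers fetching the body until we iterate over it
        with connection.get(link, stream=True) as r:
            file_name = (
                r.headers["Content-Disposition"]
                .split("=")[-1]
                .replace('"', "")
            )
            print(f"Downloading: {file_name}")
            with open(file_name, "wb") as f:
                # Write the response to disk in 64 KiB chunks instead of
                # holding the whole file in memory
                for chunk in r.iter_content(chunk_size=64 * 1024):
                    f.write(chunk)
        wait_a_bit()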

Solution 2:

As mentioned by @rdas, the website you are trying to scrape renders the links after the page is loaded; it is probably using a modern JS front-end framework.

To extract the links from this website you can use Selenium: wait for the page to render, then extract all the links and filter them to get the ones you want. If all you want is to download the data, you can also just open those links and download the files (also with Selenium).

The code would look something like this:

from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

import time

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

# Open the website
driver.get("https://ec.europa.eu/eurostat/web/main/data/database")

# Wait for the page to load
time.sleep(5)

elems = driver.find_elements_by_xpath("//a[@href]")

extracted_links = []

# Iterate over elements with hrefs
for elem in elems:
    # Condition to only get the links you want
    if elem.get_attribute('href').endswith('.tsv.gz'):
        extracted_links.append(elem.get_attribute('href'))

print(extracted_links)
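The fixed time.sleep(5) works but is a bit fragile: it may wait longer than necessary, or not long enough on a slow connection. An explicit wait is usually more robust. Here is a minimal sketch using Selenium's WebDriverWait with the same driver as above; the 15-second timeout and the generic //a[@href] locator are just placeholder assumptions you may want to tighten to the navigation tree:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds until link elements are present in the DOM,
# instead of sleeping for a fixed amount of time
WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@href]"))
)

elems = driver.find_elements_by_xpath("//a[@href]")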
    

I hope this helped.
