
How To Isolate Titles From These Image Urls?

I have a list of image urls contained in 'images'. I am trying to isolate the title from these image urls so that I can display, on the html, the image (using the whole url) and th

Solution 1:

cene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son-_Google_Art_Project.jpg

You have this string and want to strip off everything around the title. Let's say it is stored in x:

y = x.split("px-")[1]           # str has no lsplit(); split() works from the left
z = y.rsplit("_Google_Art")[0]  # split y (not x) so the first cut is kept

The first line splits the string into a list of 2 elements (assuming "px-" occurs only once): the stuff before "px-" and the stuff after. We just grab the stuff after, since you wanted to remove the stuff before. If "px-" isn't always in the string, the [1] lookup fails and we need to find something else to split on. The second line splits on something towards the end and grabs the stuff before it.
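If "px-" might be missing, one way to guard against that is str.partition, which returns an empty separator instead of raising. This is only a sketch: the helper name strip_affixes and the sample strings are made up for illustration and are not from the question.

```python
def strip_affixes(name, prefix_marker="px-", suffix_marker="_Google_Art"):
    """Drop everything up to prefix_marker and everything from suffix_marker on.

    Either marker may be absent; the string is then left alone on that side.
    (strip_affixes is a hypothetical helper, not part of the original answer.)
    """
    # partition returns (before, sep, after); sep is "" when the marker is missing
    _, sep, after = name.partition(prefix_marker)
    if sep:
        name = after
    # rpartition searches from the right, so the last suffix marker is used
    before, sep, _ = name.rpartition(suffix_marker)
    if sep:
        name = before
    return name


sample = "220px-Some_Painting_-_Google_Art_Project.jpg"  # made-up example string
print(strip_affixes(sample))
print(strip_affixes("no_markers_here.jpg"))
```

Because neither call raises, a string without the markers simply passes through unchanged.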

Edit: Addressing the comment on how to split in that loop. I think you are referring to this: titles=[image[149:199].strip() for image in images]

List comps are great but sometimes it's easier to just write it out. Haven't tested this but here's the idea:

titles = []
for image in images:
    title = image[149:199].strip()
    cleaned_left = title.split("px-")[1]                   # str has no lsplit()
    cleaned_title = cleaned_left.rsplit("_Google_Art")[0]  # continue from cleaned_left
    titles.append(cleaned_title)
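The fixed slice [149:199] only works when every URL has the same length. A possible alternative is to let the standard library pull out the last path segment first. The sketch below assumes the URLs end in a thumbnail filename such as "220px-Name.jpg"; the helper name filename_from_url and the example.org URLs are invented for illustration.

```python
from posixpath import basename
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the decoded last path segment of a URL (hypothetical helper)."""
    path = urlparse(url.strip()).path   # strip() removes a trailing newline
    return unquote(basename(path))


# Made-up URLs, shaped like thumbnail links, for illustration only
images = [
    "https://example.org/thumb/a/a1/First_Work.jpg/220px-First_Work.jpg\n",
    "https://example.org/thumb/b/b2/Second%20Work.jpg/120px-Second%20Work.jpg\n",
]

titles = []
for image in images:
    name = filename_from_url(image)     # e.g. "220px-First_Work.jpg"
    name = name.split("px-", 1)[-1]     # drop the size prefix if present
    name = name.rsplit(".", 1)[0]       # drop the file extension
    titles.append(name.replace("_", " "))

print(titles)
```

Using [-1] after the split means a filename without "px-" is kept whole rather than raising an IndexError.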

Solution 2:

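import re                          # regular expressions used to match strings 
from bs4 import BeautifulSoup      # web scraping library
from flask import Flask, render_template  # needed for @app.route and render_template
from urllib.request import urlopen # open a url connection 
from urllib.parse import unquote   # decode special url characters

app = Flask(__name__)              # skip this line if app is already defined elsewhere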

@app.route('/')
@app.route('/home')
def home():
    images=imagescrape()
    # Iterate over all sources and extract the title from the URL
    titles=(titleextract(src) for src in images)
    
    # zip combines two lists into one.
    # It goes through all elements and takes one element from the first
    # and one element from the second list, combines them into a tuple 
    # and adds them to a sequence / generator.
    images_titles = zip(images, titles)
    # The keyword must match the name the template loops over (images_titles)
    return render_template('home.html', images_titles=images_titles)

def imagescrape():
    result_images=[]
    #html = urlopen('https://en.wikipedia.org/wiki/Prince_Harry,_Duke_of_Sussex')
    html = urlopen('https://en.wikipedia.org/wiki/Rembrandt')
    bs = BeautifulSoup(html, 'html.parser')
    # Escape the dot so it matches a literal ".jpg" rather than any character + "jpg"
    images = bs.find_all('img', {'src':re.compile(r'\.jpg')})
    for image in images:
        result_images.append("https:"+image['src']+'\n') #concatenation!
    return result_images

def titleextract(url):
    # Extract the part of the string between the last two "/" characters
    # Decode special URL characters and cut off the suffix
    # Replace all "_" with spaces
    return unquote(url[58:url.rindex("/", 58)-4]).replace('_', ' ')

And in home.html:

{% for image, title in images_titles %}
    <div class="card" style="width: 18rem;">
      <img src="{{image}}" class="card-img-top" alt="...">
      <div class="card-body">
        <h5 class="card-title">{{title}}</h5>
        <p class="card-text">Some quick example text to build on the card title and make up the bulk of the card's content.</p>
        <a href="#" class="btn btn-primary">Go somewhere</a>
      </div>
    </div>
{% endfor %}
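One detail about the zip call in home(): in Python 3, zip returns a one-shot iterator, so the template can loop over images_titles only once. The sketch below uses placeholder values (not real URLs) to show that, and how wrapping the result in list() would make it reusable if the template ever needs a second loop.

```python
images = ["a.jpg", "b.jpg"]                 # placeholder values
titles = (name.upper() for name in images)  # a generator, like titles in home()

pairs = zip(images, titles)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left the second time

print(first_pass)
print(second_pass)

# Materialising once gives a list that can be looped over repeatedly
reusable = list(zip(images, [name.upper() for name in images]))
print(reusable == first_pass)
```

For the single loop in the template above, passing the zip object directly is enough.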
