Automatically Extracting Feed Links (Atom, RSS, etc.) From Webpages
Solution 1:
There's feedfinder:
>>> import feedfinder
>>>
>>> feedfinder.feed('scripting.com')
'http://scripting.com/rss.xml'
>>>
>>> feedfinder.feeds('delong.typepad.com/sdj/')
['http://delong.typepad.com/sdj/atom.xml',
 'http://delong.typepad.com/sdj/index.rdf',
 'http://delong.typepad.com/sdj/rss.xml']
>>>
Solution 2:
I second waffle paradox in recommending Beautiful Soup for parsing the HTML and then collecting the <link> tags, which is where the feeds are referenced. The code I usually use:
from BeautifulSoup import BeautifulSoup as parser

def detect_feeds_in_HTML(input_stream):
    """ examines an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of feed URLs
    :rtype: ``list(str)``
    """
    # check if really an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # get the textual data (the HTML) from the input stream
    html = parser(input_stream.read())
    # find all link tags whose "rel" attribute is "alternate"
    feed_urls = html.findAll("link", rel="alternate")
    # extract the URL of each link
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # keep it only if a non-empty URL is there
        if url:
            result.append(url)
    return result
Solution 3:
I don't know of any existing library, but Atom or RSS feeds are usually indicated with a <link> tag in the <head> section, like this:
<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed">
A straightforward way would be to download the page, parse it with an HTML parser such as lxml.html, and read the href attribute of the relevant <link> tags.
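The <link> lookup described above could be sketched roughly as follows. To keep the example free of third-party packages, it uses the standard library's html.parser rather than lxml.html, so it is a stand-in for that approach and not lxml code. The names FeedLinkParser, find_feed_links and FEED_TYPES are made up for this illustration, and the set of accepted MIME types is an assumption. It matches rel="alternate", the link type HTML defines for alternate representations such as feeds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds here; this set is an assumption, extend as needed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects (url, type) pairs from <link rel="alternate"> tags."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. a bare attribute), so normalise.
        attributes = {name: (value or "") for name, value in attrs}
        rel_values = attributes.get("rel", "").lower().split()
        feed_type = attributes.get("type", "").strip().lower()
        href = attributes.get("href", "").strip()
        if "alternate" in rel_values and feed_type in FEED_TYPES and href:
            # Resolve relative hrefs against the URL of the page.
            self.feeds.append((urljoin(self.base_url, href), feed_type))


def find_feed_links(html_text, base_url=""):
    """Return a list of (absolute_url, mime_type) for feeds linked in the HTML."""
    parser = FeedLinkParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.feeds
```

Passing the page's own URL as base_url matters in practice, because many sites reference their feed with a relative href such as /feed.xml.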
Solution 4:
Depending on how well-formed the information in these feeds is (e.g., are all the links of the form http://.../? Do you know whether they will all be in href or link tags? Will all the links in the feeds point to other feeds?), I'd recommend anything from a simple regex to a full parsing module to extract links from the feeds.
As far as parsing modules go, I can only recommend Beautiful Soup. Even the best parser will only get you so far, though, especially in the case I mentioned above: if you can't guarantee that all the links in the data point to other feeds, you have to do some additional crawling and probing on your own.
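One possible form of the "probing" mentioned above is to fetch each candidate link and check whether the response body is actually a feed. The sketch below covers only that check, on text that has already been downloaded. It assumes a feed can be recognised by its XML root element (rss for RSS 2.0, feed for Atom, RDF for RSS 1.0). The names looks_like_feed and FEED_ROOTS are invented for this example, and a malformed but otherwise valid feed would be rejected by this test.

```python
import xml.etree.ElementTree as ET

# Root element names of the common feed formats (assumed list):
# RSS 2.0 -> rss, Atom -> feed, RSS 1.0 -> rdf:RDF.
FEED_ROOTS = {"rss", "feed", "RDF"}


def looks_like_feed(document_text):
    """Return True if the text parses as XML with a feed-like root element."""
    try:
        root = ET.fromstring(document_text)
    except ET.ParseError:
        # Not well-formed XML (typical for ordinary HTML pages).
        return False
    # Drop a namespace prefix such as "{http://www.w3.org/2005/Atom}feed".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in FEED_ROOTS
```

Checking the document itself is more reliable than trusting the URL's shape or file extension, since feed URLs follow no fixed naming pattern.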