Web Scraping with Selenium in Python (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Tue, 02 Nov 2021

Web Scraping with Selenium in Python

This is part 1 of my selenium exploration.

At the New Mexico GNU & Linux User Group, currently meeting virtually on Jitsi, someone expressed interest in scraping websites. Since I do quite a bit of scraping, I offered to give a tutorial on scraping with the Python module BeautifulSoup.

"What about selenium?" he asked. Sorry, I said, I've never needed selenium enough to figure it out.

But then a week later, I found I did have a need.

Specifically, I wanted to be able to use my RSS fetcher, FeedMe, for the New York Times. I had subscribed several months back, after verifying that they had an RSS feed; but then found that I couldn't actually read any stories linked from the RSS, because of their Byzantine collection of JavaScript that does the password checking. Just setting a few cookies doesn't help. The NY Times also has an API you can sign up for -- but weirdly, none of the API calls include a way to fetch actual stories.

All I knew about selenium was that it was a way of running a full browser under programmatic control. But that was exactly what I'd need for fetching stories from the NYT. Maybe I needed to look into this selenium thing after all.

Basic Selenium

The most basic use of selenium is very easy. In the Python 3 console:

>>> from selenium import webdriver
>>> driver = webdriver.Firefox()    # A browser window pops up
>>> driver.get("http://shallowsky.com/blog/")
>>> fullhtml = driver.page_source

(I'm using firefox here, but there's also a chromium driver that works very similarly.)
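
When you're eventually done with a session, you can close that popped-up window from the same console:

>>> driver.quit()    # closes the browser window and ends the session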

The full HTML source of the page is in driver.page_source. Or you can look for specific elements. For instance, suppose I want to loop over all the <h2 class="story"> tags and print their innerHTML:

for story in driver.find_elements_by_class_name("story"):
    print(story.get_attribute('innerHTML'))
... except, wait, that won't work, because there's a <div class="story"> that contains <h2 class="story">. That selector will match both the divs and the h2s, and the innerHTML of a div is the title plus the whole teaser that follows it. So you could do something like:

for story in driver.find_elements_by_class_name("story"):
    if story.tag_name == "h2":
        print(story.get_attribute('innerHTML'))

If you need to combine a tag and a class in the same query, one way is an XPath:

for story in driver.find_elements_by_xpath("//h2[@class='story']"):
    print(story.get_attribute('innerHTML'))
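
Alternatively, a CSS selector can combine the tag and the class in a single query, using the same find_elements API as the rest of these examples:

# "h2.story" matches only h2 tags with class="story", skipping the divs
for story in driver.find_elements_by_css_selector("h2.story"):
    print(story.get_attribute('innerHTML'))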

Read more about element selection in the selenium-python documentation. It's not as flexible or easy to use as BeautifulSoup, but if you get frustrated, you always have the option of grabbing the full HTML string from driver.page_source and feeding it to BeautifulSoup.
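
For example, here's a minimal sketch of that handoff (assuming you have the bs4 package installed; I'm using the stdlib html.parser here, but any parser bs4 supports will do):

from bs4 import BeautifulSoup

# Hand selenium's rendered HTML to BeautifulSoup and use its nicer selectors.
soup = BeautifulSoup(driver.page_source, "html.parser")
for story in soup.find_all("h2", class_="story"):
    print(story.decode_contents())    # bs4's equivalent of innerHTML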

Firefox Profiles

The simple selenium example above created a new Firefox profile of its own, which is perfectly reasonable: that way it doesn't interfere with any Firefox that might already be running.

But of course, to do something like fetch NY Times stories as a subscriber, I need a browser profile where I'm logged in and have all the appropriate cookies set.

I ran firefox -p to bring up the profile manager, and created a new profile called selenium. Firefox likes to add random strings to profile directories, so the directory it used was ~/.mozilla/firefox/random-string.selenium.

I ran firefox with that profile, went to nytimes.com and logged in to my subscriber account, then exited firefox.
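
Incidentally, since that random string differs from machine to machine, you can find the directory programmatically instead of hardcoding it (my own shortcut, not anything selenium requires):

import glob, os

# Firefox names the directory random-string.selenium, so match on the suffix.
foxprofiledir = glob.glob(os.path.expanduser("~/.mozilla/firefox/*.selenium"))[0]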

Now to access that profile from selenium. Here's what nearly all pages tell you to do, and it works:

>>> import os
>>> foxprofiledir = os.path.expanduser("~/.mozilla/firefox/random-string.selenium")

>>> driver = webdriver.Firefox(firefox_profile=foxprofiledir)
<stdin>:1: DeprecationWarning: firefox_profile has been deprecated, please pass in a Service object
>>> 

I'm running selenium 4.0.0~a1, the version in Ubuntu hirsute's python3-selenium package. Presumably all the web tutorials are written for selenium 3 -- as is the online documentation. I haven't been able to find either examples or documentation of this mysterious "service object" that selenium 4 apparently wants. Fortunately, despite the deprecation warning, this method seems to work fine, and I'm able to driver.get() NYT pages. Perhaps selenium's documentation will eventually catch up with its code and I'll be able to get rid of the warning message.
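
For the record, newer selenium 4 releases appear to want something along these lines -- a sketch based on the selenium 4 API as I understand it, which I haven't been able to verify against the 4.0.0~a1 alpha I'm running:

>>> from selenium.webdriver.firefox.options import Options
>>> from selenium.webdriver.firefox.service import Service
>>> options = Options()
>>> options.profile = foxprofiledir    # attach the profile through Options
>>> driver = webdriver.Firefox(service=Service(), options=options)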

Running Headless

In all these tests, selenium popped up a browser window and I could see each page load. That's great for testing, but the point of the exercise is to automate the page fetching, and you really don't want a visible browser window popping up for that. Fortunately it's easy to suppress the browser window by running headless:

>>> from selenium.webdriver.firefox.options import Options
>>> 
>>> options = Options()
>>> options.headless = True
>>> 
>>> driver = webdriver.Firefox(firefox_profile=foxprofiledir, options=options)

With that, I could make a FeedMe helper (once I patched FeedMe to accept helper modules, something I'd never needed before) that checks the NY Times RSS feed, loops over new stories, and uses the selenium firefox profile to fetch the stories.
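
The heart of that helper is just a fetch loop. Here's roughly the shape of it -- a simplified sketch, where the feed URL is only an example and save_story() is a hypothetical stand-in for FeedMe's real output plumbing:

import feedparser    # third-party module, assumed installed

# Parse the RSS feed, then fetch each linked story through the logged-in profile.
feed = feedparser.parse("https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml")
for entry in feed.entries:
    driver.get(entry.link)
    save_story(entry.title, driver.page_source)    # hypothetical helper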

This, alas, is only part of the problem (the easy part). It worked okay as long as I was running the script interactively on my own machine. But the point of this was to run the selenium script automatically each day, creating files that FeedMe could pick up as part of my daily feeds -- even if I'm not at home, even if my laptop isn't running.

I needed to run selenium from a server that's always running. That turned out to be quite a bit more difficult, since servers generally don't have X, GUI libraries, or firefox installed. I'll address those issues in a separate article.

