In most circumstances, web scraping is done by downloading a web page using your programming language and a library. However, in some cases you need to simulate a web browser. This could be because the web site is poorly programmed or because it makes extensive use of JavaScript.
You can do this by using a headless browser. On morph.io, Google Chrome is pre-installed and ready for you to use.
The Google Chrome binary is installed at /usr/bin/google-chrome on every morph.io container as part of the build process.
You can use Google Chrome directly by running it with google-chrome --headless --disable-gpu, or you can control it via WebDriver with ChromeDriver, which is installed on morph.io at /usr/local/bin/chromedriver.
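As a minimal sketch of the "run Chrome directly" option, the Ruby snippet below shells out to the headless binary and captures the rendered page. It assumes Chrome's --dump-dom flag, which is not mentioned above, to print the DOM after JavaScript has run. The helper names chrome_dump_dom_command and dump_dom are hypothetical and chosen only for illustration.

```ruby
require "open3"

# Path to the Chrome binary on morph.io, as stated in the text above.
CHROME_BINARY = "/usr/bin/google-chrome"

# Build the argument list for a headless run that prints the rendered DOM.
# Assumption: --dump-dom is added here; the text above only names
# --headless and --disable-gpu.
def chrome_dump_dom_command(url, binary: CHROME_BINARY)
  [binary, "--headless", "--disable-gpu", "--dump-dom", url]
end

# Run Chrome without a shell and return the serialized DOM as a string.
# Raises if Chrome exits with a non-zero status.
def dump_dom(url, binary: CHROME_BINARY)
  stdout, stderr, status = Open3.capture3(*chrome_dump_dom_command(url, binary: binary))
  raise "Chrome failed (#{status.exitstatus}): #{stderr}" unless status.success?
  stdout
end
```

In a scraper you might then call html = dump_dom("https://morph.io/") and hand the resulting string to whichever HTML parser you already use. Passing the arguments as a list rather than a single string avoids shell quoting problems with URLs that contain characters such as & or ?.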
With Capybara and Selenium WebDriver you can control Chrome from Ruby. To install them, add them to your scraper's Gemfile:
gem 'capybara'
gem 'selenium-webdriver'
Then, in your scraper, start a Capybara session using the selenium_chrome_headless driver:
require "capybara"
require "selenium-webdriver"

capybara = Capybara::Session.new(:selenium_chrome_headless)

# Start scraping
capybara.visit("https://morph.io/")
puts capybara.find("#banner h2").text
Sometimes there's nothing like seeing a real-life example. If you have a scraper you would like to add to this list, please let us know.