Documentation


Scraping JavaScript-heavy sites

In most cases, web scraping means downloading a web page using your programming language and a library. In some cases, however, you need to simulate a web browser, either because the website is poorly programmed or because it relies heavily on JavaScript.

You can do this by using a headless browser. On morph.io, Google Chrome is pre-installed and ready for you to use.

Using Chrome Headless

The Google Chrome binary is installed at /usr/bin/google-chrome on every morph.io container as part of the build process. You can run Chrome directly with google-chrome --headless --disable-gpu, or you can control it through WebDriver using ChromeDriver, which is installed on morph.io at /usr/local/bin/chromedriver.
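As a rough illustration of the first option, the sketch below runs Chrome directly from Ruby and captures the rendered page. It assumes Chrome's --dump-dom flag, which prints the page's DOM to standard output after scripts have run. The helper names chrome_dump_dom_command and fetch_rendered_html are hypothetical and chosen only for this example.

```ruby
require "open3"

# Path given above for morph.io containers.
CHROME_BINARY = "/usr/bin/google-chrome"

# Build the argument list for a one-off headless run that prints the
# rendered DOM of `url` to standard output.
# (--dump-dom is an assumption of this sketch; it is not mentioned above.)
def chrome_dump_dom_command(url, binary: CHROME_BINARY)
  [binary, "--headless", "--disable-gpu", "--dump-dom", url]
end

# Run Chrome and return the rendered HTML as a string.
# Raises if Chrome exits with a non-zero status.
def fetch_rendered_html(url, binary: CHROME_BINARY)
  # Passing the command as separate arguments avoids shell interpolation.
  html, status = Open3.capture2(*chrome_dump_dom_command(url, binary: binary))
  raise "Chrome exited with status #{status.exitstatus}" unless status.success?
  html
end
```

In a scraper you might call fetch_rendered_html("https://morph.io/") and pass the result to an HTML parser. This approach suits pages that only need to be loaded once; for clicking, typing, or waiting on specific elements, the WebDriver route is likely a better fit.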

Usage

You can control Chrome with Capybara and Selenium WebDriver. To install them, add both gems to your scraper's Gemfile:

gem 'capybara'
gem 'selenium-webdriver'

Then, in your scraper, start a Capybara session using the selenium_chrome_headless driver:

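require "capybara"
require "selenium-webdriver"

# :selenium_chrome_headless is one of the drivers Capybara registers by default
capybara = Capybara::Session.new(:selenium_chrome_headless)
# Start scraping: load the page, then print the text of the banner heading
capybara.visit("https://morph.io/")
puts capybara.find("#banner h2").text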
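On JavaScript-heavy pages the elements you want may only appear some time after the page loads. The sketch below shows one possible way to collect the text of several elements while giving the page time to render. The helper name visible_texts is hypothetical, and session is assumed to be a Capybara session like the one created above. It relies on Capybara's minimum: and wait: options for all.

```ruby
# Visit `url` and return the text of every element matching `selector`.
# `session` is assumed to behave like the Capybara::Session created above.
def visible_texts(session, url, selector, wait: 10)
  session.visit(url)
  # minimum: 1 asks Capybara to keep retrying until at least one match
  # appears or `wait` seconds have passed; Capybara raises an error if
  # the expectation is still not met after that time.
  session.all(selector, minimum: 1, wait: wait).map(&:text)
end
```

With the session from the example above, a call might look like visible_texts(capybara, "https://morph.io/", "#banner h2"). The 10 second default is an arbitrary choice for this sketch; tune it to how slowly the target site renders.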

Examples

Sometimes there's nothing like seeing a real-life example. If you have a scraper you would like to add to this list, please let us know.

openaustralia/example_ruby_chrome_headless_scraper
Example scraper showing how to use headless Chrome from a Ruby scraper
