You can use any libraries you want with your scraper. This makes things incredibly flexible and powerful. It's also straightforward to set up.
You have a favorite scraping library? No problem. You need a library to convert some obscure file format into something sensible? No problem.
All you need to do is specify the libraries that you want to use in your scraper repository. Each language does this slightly differently, using the native tools for that language.
If you're already familiar with using Heroku, this will be even simpler for you, as morph.io's system for libraries is built on top of Buildpacks, the same technology that drives the installation of libraries on Heroku.
To have morph.io install specific gems for your scraper, add a `Gemfile` to your repository. For instance, to install the `mechanize` and `sqlite3` gems:

```ruby
source 'https://rubygems.org'

gem "mechanize"
gem "sqlite3"
```
Then run `bundle update`. This works out which specific version of each gem will be installed and writes the result to `Gemfile.lock`.
Make sure that you add both `Gemfile` and `Gemfile.lock` to your repository.
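Once those gems are installed, your scraper can require them like any other library. Here's a minimal sketch of what that might look like (the URL, table and file names are placeholders, not anything morph.io requires):

```ruby
require 'mechanize'
require 'sqlite3'

# Fetch a page with Mechanize (example.com is a placeholder URL)
agent = Mechanize.new
page = agent.get('https://example.com')

# Store the page title in a local SQLite database
db = SQLite3::Database.new('data.sqlite')
db.execute('CREATE TABLE IF NOT EXISTS data (title TEXT)')
db.execute('INSERT INTO data (title) VALUES (?)', [page.title])
```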
PHP in morph.io uses Composer for managing dependencies and runtime. You create a file `composer.json` in the root of your scraper repository which says what libraries and extensions you want installed.
For example, to install the XSL PHP extension, your `composer.json` could look like this:

```json
{
    "require": {
        "ext-xsl": "*"
    }
}
```
Then run Composer's install command. Depending on whether you have Composer installed locally or globally on your personal machine, that's either `php composer.phar install` or `composer install`.
As well as installing the libraries locally, this also creates a `composer.lock` file, which should be added into git alongside the `composer.json` file. The next time the scraper runs on morph.io it will build an environment from this.
For more on the specifics of what can go in `composer.json`, see the Composer documentation.
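Once the extension is installed, your scraper can use PHP's built-in XSL classes directly. Here's a minimal sketch (the stylesheet and document file names are placeholders):

```php
<?php
// Transform an XML document with an XSL stylesheet
// (style.xsl and books.xml are placeholder file names)
$xsl = new DOMDocument();
$xsl->load('style.xsl');

$xml = new DOMDocument();
$xml->load('books.xml');

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);
echo $processor->transformToXML($xml);
```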
For Python, morph.io installs libraries using `pip` from a `requirements.txt` file in the root of your scraper repository. The format of `requirements.txt` is straightforward.
For example, to install specific versions of the Pygments and SQLAlchemy libraries, `requirements.txt` could look like this:

```
Pygments==1.4
SQLAlchemy==0.6.6
```
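Your scraper can then import those libraries as usual. Here's a minimal sketch (it assumes a recent SQLAlchemy for the `text()` and `begin()` calls, and the table and file names are placeholders):

```python
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
from sqlalchemy import create_engine, text

# Syntax-highlight a snippet of code as HTML
html = highlight('print("hello")', PythonLexer(), HtmlFormatter())

# Store the result in a local SQLite database (data.sqlite is a placeholder name)
engine = create_engine('sqlite:///data.sqlite')
with engine.begin() as conn:
    conn.execute(text('CREATE TABLE IF NOT EXISTS data (html TEXT)'))
    conn.execute(text('INSERT INTO data (html) VALUES (:html)'), {'html': html})
```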
For Perl, to choose which libraries to install you will need a file `cpanfile` in the root of your scraper directory. It can install anything from CPAN and has a very straightforward syntax.
For instance, to install specific versions of HTTP::Message and XML::Parser, your `cpanfile` should look like:

```perl
requires "HTTP::Message", "6.06";
requires "XML::Parser", "2.41";
```
You don't have to specify the versions to install, but it's recommended; otherwise different runs of the scraper could potentially use different versions of the libraries.
Check `cpanfile` into git alongside your scraper, and the next time it's run on morph.io it will install the libraries.
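Those modules are then available to your scraper in the usual way. Here's a minimal sketch (the URL and XML string are placeholders):

```perl
use strict;
use warnings;
use HTTP::Request;
use XML::Parser;

# HTTP::Message provides the HTTP::Request class
# (example.com is a placeholder URL)
my $request = HTTP::Request->new(GET => 'https://example.com');
print $request->as_string;

# Parse an XML string into a tree structure
my $parser = XML::Parser->new(Style => 'Tree');
my $tree = $parser->parse('<root><item>hello</item></root>');
```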
To have morph.io install packages from npm for your scraper, edit the `package.json` file in the root of your repository. You can edit this file by hand, or with `npm install`. For instance, to install the `express` and `sqlite3` packages:

```json
{
  "name": "myscraper",
  "description": "a scraper that runs on morph.io",
  "version": "1.0.0",
  "dependencies": {
    "express": "^4.13.3",
    "sqlite3": "latest"
  }
}
```
Make sure that you do not add the `node_modules` directory to git. You should add this directory to your `.gitignore` file (a single `node_modules/` line is enough).
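With the packages installed, requiring them from your scraper works as normal. Here's a minimal sketch using the `sqlite3` package (the table and file names are placeholders):

```javascript
const sqlite3 = require('sqlite3');

// Open (or create) a local SQLite database
const db = new sqlite3.Database('data.sqlite');

// serialize() runs the statements in order
db.serialize(() => {
  db.run('CREATE TABLE IF NOT EXISTS data (name TEXT)');
  db.run('INSERT INTO data (name) VALUES (?)', ['morph.io']);
});

db.close();
```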