You can use any libraries you want with your scraper. This makes things incredibly flexible and powerful, and it's also straightforward to set up.
You have a favorite scraping library? No problem. You need a library to convert some obscure file format into something sensible? No problem.
All you need to do is specify the libraries that you want to use in your scraper repository. Each language does this slightly differently, using the native tools for that language.
If you're already familiar with using Heroku, this will be even simpler for you, as morph.io's system for libraries is built on top of Buildpacks, the same technology that drives the installation of libraries on Heroku.
To have morph.io install specific gems for your scraper, add a Gemfile to your repository. For instance, to install the mechanize and sqlite3 gems:
source 'https://rubygems.org'
gem "mechanize"
gem "sqlite3"
Then run bundle update. This works out which specific version of each gem will be installed and writes the result to Gemfile.lock.
Make sure that you add both Gemfile and Gemfile.lock to your repository.
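If you want to check that the gems are working, a minimal scraper using them might look like the sketch below. The URL and table layout here are placeholders, and data.sqlite is the conventional file morph.io reads scraper results from.
require 'mechanize'
require 'sqlite3'

# Fetch a page with mechanize (the URL is a placeholder).
agent = Mechanize.new
page = agent.get('https://example.com')

# Save the page title into data.sqlite, the database morph.io reads results from.
db = SQLite3::Database.new('data.sqlite')
db.execute('CREATE TABLE IF NOT EXISTS data (title TEXT)')
db.execute('INSERT INTO data (title) VALUES (?)', [page.title])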
PHP in morph.io uses Composer for managing dependencies and the runtime. You create a file composer.json in the root of your scraper repository which says what libraries and extensions you want installed.
For example, to install the XSL PHP extension your composer.json could look like this:
{
"require": {
"ext-xsl": "*"
}
}
Then run composer install. Depending on whether you have Composer locally or globally installed on your personal machine, you can run either
php composer.phar install
or
composer install
As well as installing the libraries locally, this will also create a composer.lock file, which should be added into git alongside the composer.json file.
The next time the scraper runs on morph it will build an environment from this.
For more on the specifics of what can go in composer.json see the
Composer documentation.
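Once the extension is installed, your scraper can use it as normal. Here is a minimal sketch of an XSL transform; the stylesheet and input file names are placeholders:
<?php
// Load a stylesheet and a document (both file names are placeholders).
$xsl = new DOMDocument();
$xsl->load('stylesheet.xsl');
$doc = new DOMDocument();
$doc->load('input.xml');

// XSLTProcessor is provided by the ext-xsl extension requested in composer.json.
$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($doc);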
For Python, morph.io installs libraries using pip from a requirements.txt file in the root of your scraper repository. The format for requirements.txt is straightforward.
For example, to install specific versions of the Pygments and SQLAlchemy libraries, requirements.txt could look like this:
Pygments==1.4
SQLAlchemy==0.6.6
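As a quick check that the libraries installed correctly, a scraper could then use them like this sketch; the highlighted snippet and database path are illustrative only:
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
from sqlalchemy import create_engine

# Render a snippet of code as HTML with Pygments (the snippet is a placeholder).
html = highlight('print 1 + 1', PythonLexer(), HtmlFormatter())

# Open the SQLite database that morph.io reads scraper results from.
engine = create_engine('sqlite:///data.sqlite')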
For Perl, to choose which libraries to install you will need a file cpanfile in the root of your scraper directory. It can install anything from CPAN and has a very straightforward syntax.
For instance, to install specific versions of HTTP::Message and XML::Parser your cpanfile should look like:
requires "HTTP::Message", "6.06";
requires "XML::Parser", "2.41";
You don't have to specify the versions to install, but it's recommended; otherwise different runs of the scraper could use different versions of the libraries.
Check cpanfile into git alongside your scraper and the next time it's run on morph it will install
the libraries.
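To illustrate, a sketch of a scraper using both modules might look like this; the URL and XML string are placeholders:
use strict;
use warnings;
use HTTP::Request;   # part of the HTTP::Message distribution
use XML::Parser;

# Build a request object (the URL is a placeholder).
my $request = HTTP::Request->new(GET => 'https://example.com/data.xml');

# Parse an XML string into a tree structure.
my $parser = XML::Parser->new(Style => 'Tree');
my $tree = $parser->parse('<root><item>hello</item></root>');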
To have morph.io install packages from npm for your scraper, edit the package.json file in the root of your repository. You can edit this file by hand or with npm install. For instance, to install the express and sqlite3 packages:
{
"name": "myscraper",
"description": "a scraper that runs on morph.io",
"version": "1.0.0",
"dependencies": {
"express": "^4.13.3",
"sqlite3": "latest"
}
}
Make sure that you do not add the node_modules directory to git. You should add this directory to your .gitignore file.
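Once the packages are installed they can be required as usual. As a quick check, here is a minimal sketch using the sqlite3 package from the example above; the table layout is a placeholder, and data.sqlite is the conventional file morph.io reads scraper results from.
var sqlite3 = require('sqlite3');

// Open the SQLite database that morph.io reads scraper results from.
var db = new sqlite3.Database('data.sqlite');

// Create a table and insert a row (the layout is a placeholder).
db.serialize(function () {
  db.run('CREATE TABLE IF NOT EXISTS data (title TEXT)');
  db.run('INSERT INTO data (title) VALUES (?)', 'hello');
});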