5 things to consider before doing web site data extraction

Dec 12, 2018

Web scraping is not an illegal activity, but that does not mean you can scrape any site you want. Some sites explicitly prohibit automated data extraction, either via their robots.txt file or their Terms of Service page.

Disclaimer: Nothing here is legal advice. Legality cannot be generalised, as the laws are different in each country.

Google built its business on continuously scraping and indexing other people’s content. Do you think they are doing something illegal or unethical? No, they are not. They add considerable value to the extracted data.

There are some general things to consider before doing web data extraction:

Robots.txt

This is probably the first thing to check before scraping a website. Robots.txt is used to communicate with web crawlers and other web robots: it tells them which areas of the website should not be processed or scanned. The file is located at the root of the website hierarchy (e.g. https://www.google.com/robots.txt).
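Because the file always sits at the root of the site, its address can be derived from any page URL. A minimal sketch using Python's standard library; the helper name `robots_url` and the sample page URL are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.google.com/search?q=web+scraping"))
# https://www.google.com/robots.txt
```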

User-agent: *
Disallow:

If you find this in the robots.txt file of a website you’re trying to crawl, you’re in luck. This means all pages on the site are open to be crawled by bots.

User-agent: *
Disallow: /

This tells every bot to stay out of the entire site, which is a pretty clear signal to avoid scraping it.

You should always respect and follow all the rules listed in robots.txt.
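The two robots.txt examples above can also be checked programmatically. The sketch below uses `urllib.robotparser` from Python's standard library; the helper `is_allowed`, the user agent name `my-crawler` and the example.com URL are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two rule sets shown above, as plain strings.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

print(is_allowed(ALLOW_ALL, "my-crawler", "https://example.com/page"))  # True
print(is_allowed(BLOCK_ALL, "my-crawler", "https://example.com/page"))  # False
```

Against a live site, the same parser can load the file itself with `set_url()` followed by `read()` instead of `parse()`.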

Extracting the top answers from Quora can be valuable to media companies looking for fresh content for news. But scraping data from Quora with a bot is against Quora’s terms of service.
The good news is that you can consider crawling Stack Overflow instead, a popular Q&A site that allows scraping, or use its API.

Terms of Service

If you are considering web scraping, you should also check the website’s “Terms of Use” or “Terms of Service”.

I think that a website’s robots.txt and “Terms of Use” should be coordinated with and complement one another because ultimately robots that crawl multiple sites probably don’t analyze “Terms of Use”. But a polite crawler definitely reads and obeys the robots.txt file rules before fetching a web page.

Many websites have clauses in their “Terms of Service” that limit how you can use the data found on the site. If you violate the “Terms of Use”, legal action can be initiated against you for breach of contract.

However, if a Terms of Use provision does not limit access by bots, spiders, etc., crawling is generally allowed.

If a website clearly states that web scraping is not allowed, you must respect that.

Amazon, like many other websites, provides an easy way to access its data through an official API (the Product Advertising API). However, if you cannot fetch enough data through the provided API, you can try to scrape the website; Amazon allows crawling of its pages.
In contrast to Amazon, data extraction from JustDial is prohibited by its Terms of Use.

GDPR compliance

To comply with the EU General Data Protection Regulation (GDPR), you should evaluate your web scraping project first.

If you don’t scrape personal data, then GDPR does not apply. In this case you can just skip this section and move to the next step.

Examples of non-personal data:

Company registration number;
Generic company email address (a shared mailbox rather than a named person’s);
Anonymised data.

However, under GDPR it is now illegal to scrape an EU (EEA) resident’s personal data without that person’s explicit consent.

Examples of personal data:

Name and surname;
Home address;
Email address that identifies an individual;
Identification card number;
Location data (for example the location data function on a mobile phone);
Bank details;
Internet Protocol (IP) address;
Employment info;
Social Security Number;
Medical information;
Video/audio recording.

GDPR is a regulation specific to European Union/European Economic Area countries, so it may not apply if you extract the personal information of residents of other countries (for example the USA, Australia, Canada, etc.).

Unless you have clear, explicit consent and a legitimate reason to scrape the personal data of EU citizens, you should avoid scraping it.
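One cautious way to act on this is to strip obvious personal identifiers from scraped text before storing it. The sketch below is only an illustration: the function name, the placeholder strings and the two regular expressions are assumptions, they cover just email addresses and IPv4 addresses from the list above, and they will miss most other kinds of personal data.

```python
import re

# Simplified patterns, assumed for illustration; real-world data needs more care.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_personal_data(text):
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL_PATTERN.sub("[email removed]", text)
    text = IPV4_PATTERN.sub("[ip removed]", text)
    return text


print(redact_personal_data("Contact jane.doe@example.com from 192.168.0.1"))
# Contact [email removed] from [ip removed]
```

Redaction like this reduces what you keep, but it does not by itself make a project GDPR-compliant.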

Copyright infringement

What do you want to do with the extracted data? If it is intended for your own personal use, it is generally legal, as it falls under the fair use doctrine.

Fair use permits limited use of copyrighted material without having to first acquire permission from the copyright holder.

Technically, there is no difference between an automated script accessing a website and a human viewing it in a browser.

The complications start with reproducing copyrighted content.

Facts themselves are not protected by copyright. A narrative work that includes or explains facts can be protected by copyright (e.g. an encyclopedia is copyrightable).

But rephrasing or reorganizing the data gets you around that.

Denial of Service

Big, popular websites are built to handle high traffic. Smaller ones may not be so robust and may not be able to handle many requests per second; too many can degrade the site’s performance and shut out other users. Malicious hackers use this tactic in what’s known as a “Denial of Service” attack.

Why does this happen? Well, automated data scrapers “read” a website’s pages much quicker than a human could. Since not every site makes clear how robust its servers are, it can be tricky to know how many requests would overload one.

Whether you are a hacker or just a researcher, causing a Denial of Service on a site can result in legal action against you.

Here are some ways to make sure your crawler doesn’t hit a website too hard:

Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive.
Increase scraping intervals to avoid server overload.
Set your scraper to run during off-peak hours for the site.
Smaller companies use smaller servers, so don’t scrape them as aggressively as, say, a giant corporation’s web site.
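The points above can be sketched in Python with the standard library. Everything named here is an assumption for illustration: the helper names, the 5-second fallback delay, the 01:00 to 06:00 off-peak window and the `fetch` callable that stands in for whatever download function your scraper uses.

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 5.0  # assumed fallback when no Crawl-delay is given


def polite_delay(robots_txt, user_agent, default=DEFAULT_DELAY_SECONDS):
    """Return the Crawl-delay from robots.txt text, or the fallback value."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def is_off_peak(hour, start_hour=1, end_hour=6):
    """Return True if hour (0-23, in the site's local time) is in the quiet window."""
    return start_hour <= hour < end_hour


def crawl(urls, delay, fetch, sleep=time.sleep):
    """Fetch urls one at a time, pausing for delay seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results


print(polite_delay("User-agent: *\nCrawl-delay: 10\n", "my-crawler"))  # 10.0
print(polite_delay("User-agent: *\nDisallow:\n", "my-crawler"))        # 5.0
```

For a small site, erring on the side of a longer delay than the fallback shown here is the safer choice.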

When in doubt, ask!

And finally, if it’s not clear from a website, contact the webmaster and ask if and what you’re allowed to harvest.

Wrap Up

So web scraping is absolutely legal if done right. Furthermore, scraping can provide many benefits to all involved.

There are plenty of great use cases for web scraping:

Retailers use web scraping to monitor their competitors’ prices and collect product reviews for analysis.
Lawyers look up past judgement reports to reference in their cases.
Recruiters collect people’s profiles.
Media companies follow trending topics and look for fresh content for publications.

Please share your personal experiences with scraping ethics or legality in the comments section!