Why capture web pages?

At the end of 2012, we started this process to answer a simple question: how many e-commerce websites exist in Brazil? Talking to other companies and experts in the market, we discovered that while many companies talked about the web and claimed impressive market share, very little was actually known. This led us to expand our scope in order to find out more and answer more questions. Where are the sites hosted? What payment services do they use? What is the web's favorite development language in each country?

Our crawler was structured to answer these and many other questions. If you are curious to know the answers, click here to register and get started. Search for any site you want, or simply take a look at our reports. It's the web, like you've never seen it before.

Why should I let you access my website?

Beyond simple generosity of spirit, when you let us access your site, you ensure that information about it will be compiled into our statistics. With a quick registration, you can look at all of our data and even compare your website to the web in general.

How does your crawling process work?

Our crawling process is based on a growing list of website addresses. Every time we find a link to a new site, we add it to our list to be visited at a future date. While we're visiting your site, we try to be as polite as possible: we strive to visit each page only once, we enforce a minimum wait time between successive visits, and we do our best to avoid downloading any files that aren't actual HTML pages.

We also do our best to comply with all the rules and conventions of good behavior for crawling, including the robots.txt protocol.
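For readers who like to see things concretely, here is a minimal sketch of the kind of loop described above. It is not our production code: the delay value, the seed URLs, and the process() helper are placeholder assumptions. It only illustrates the same ideas, a growing frontier of addresses, a single visit per page, a pause between requests, and a robots.txt check before fetching.

import time
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests  # third-party HTTP client, used here only for illustration

CRAWL_DELAY = 2.0  # placeholder: minimum seconds between successive requests
USER_AGENT = "Mozilla/5.0 (compatible; BDCbot/1.0; +http://ecommerce.bigdatacorp.com.br/faq.aspx)"

def process(response):
    # Placeholder: a real crawler would parse the HTML here, extract links,
    # and push newly discovered sites back into the frontier.
    pass

def polite_crawl(seed_urls):
    frontier = deque(seed_urls)   # growing list of addresses still to visit
    visited = set()               # ensures each page is requested only once
    robots_cache = {}             # one parsed robots.txt per host

    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        # Respect robots.txt before touching anything else on the host.
        host = urlparse(url).scheme + "://" + urlparse(url).netloc
        if host not in robots_cache:
            rp = RobotFileParser(host + "/robots.txt")
            try:
                rp.read()
            except OSError:
                rp = None  # robots.txt unreachable; real logic is more careful
            robots_cache[host] = rp
        rp = robots_cache[host]
        if rp is not None and not rp.can_fetch(USER_AGENT, url):
            continue

        response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        process(response)

        time.sleep(CRAWL_DELAY)   # minimum wait time between successive visits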

Won't your spider damage the performance of my website?

As we mentioned above, we do our very best to avoid harming any website we visit. If the Google crawler doesn't hurt your site, ours shouldn't. With that said, over time we've seen some recurring misunderstandings crop up:

- "Invalid requests": The first thing we do when trying to load a webpage is to call the "HEAD" method, to try and load headers without loading the content, so we can skip over things we're not interested. Some web servers don't respond to this method, and this may be appear in your logs as an invalid HTTP request.

- Accessing pages after I added your bot to robots.txt: We'll always try to load at least the home page of your website, even if you've added our bot to your robots.txt file. The reason for this is that we need this first access to figure out any domain redirections that may be in place.
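As an illustration of the HEAD call described in the first item above, here is a rough sketch. The URL handling and the content-type check are simplified assumptions, not our exact implementation: we ask for the headers first, and only download the body when the response looks like an HTML page.

import requests

USER_AGENT = "Mozilla/5.0 (compatible; BDCbot/1.0; +http://ecommerce.bigdatacorp.com.br/faq.aspx)"

def fetch_if_html(url):
    # Step 1: HEAD request, headers only, so large non-HTML files are never downloaded.
    head = requests.head(url, headers={"User-Agent": USER_AGENT},
                         allow_redirects=True, timeout=10)
    content_type = head.headers.get("Content-Type", "")

    # Some servers don't implement HEAD and answer with an error status;
    # that is the "invalid request" entry some administrators notice in their logs.
    if head.status_code < 400 and "text/html" not in content_type:
        return None  # not an HTML page, skip the download entirely

    # Step 2: GET only when the content looks like (or might be) HTML.
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)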

How do I know when your bot accessed my site?

Like any well-behaved crawler, our bot has a name: a 'user-agent' identification string that allows you to identify our accesses in your logs and filter them out if you want to.

Our crawler's personal user-agent string is: 'Mozilla/5.0 (compatible; BDCbot/1.0; +http://ecommerce.bigdatacorp.com.br/faq.aspx)'
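If you want to see our visits, one simple approach is to filter your access-log lines by that user-agent string. The sketch below assumes a plain-text log such as the common Apache/Nginx combined format, and the file name "access.log" is just a placeholder for your server's actual log path.

# Minimal sketch: print every access-log line produced by our crawler.
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "BDCbot" in line:
            print(line.rstrip())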

How do I stop your bot from visiting/crawling my site?

The best way to stop our crawler from accessing your site is through a "robots.txt" file. If you already have one of these on your site, all you have to do is include the lines to block our crawler (which can be found below). If not, you can create a file named "robots.txt" in your site's root directory and include the lines to block us.

To stop our crawler from accessing your site, just include the following lines on your robots.txt file:
User-agent: BDCbot
Disallow: /
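If you want to double-check that the rule works before deploying it, Python's standard urllib.robotparser module can evaluate it. The example URLs below are just placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: BDCbot",
    "Disallow: /",
])
print(rp.can_fetch("BDCbot", "http://example.com/any/page"))    # False: our crawler is blocked
print(rp.can_fetch("SomeOtherBot", "http://example.com/page"))  # True: other crawlers are unaffected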

Another way you can stop our crawler from visiting your website is by getting in touch with us. Just ask us, and we'll add your site to our special "do not visit" list, effectively blocking our crawler. If you want to do this, just send an email to support-bigweb@bigdatacorp.com.br.

Is there anything else I should know?

If you would like to know more about the robots.txt protocol, to learn about other commands you can use and how to block other crawlers, just go to the robotstxt.org website.

If you want to know more about how our large-scale process works and how we set it up, you can read the case study we wrote with AWS here (article in Portuguese), or you can watch one of our talks on YouTube.

The BigWeb Team.