Why capture web pages?
At the end of 2012, we started this process to answer a simple question: how many e-commerce websites
existed in Brazil? Talking to other companies and experts in the market, we discovered that
while many companies talked about the web and claimed impressive market share, very little was actually
known. This led us to expand our scope in order to find out more and answer more questions.
Where are the sites hosted? What payment services do they use? What is the web's favorite development language
in each country?
Our crawler was structured to answer these and many other questions. If you are curious to know the answers,
click here to register and get started. Search for any site you want,
or simply take a look at our reports. It's the web, like you've never seen it before.
Why should I let you access my website?
Beyond simple generosity of spirit, when you let us access your site, you ensure that information about it will
be compiled into our statistics, and with a quick registration you can
look at all of our data, even comparing your website to the web in general.
How does your crawling process work?
Our crawling process is based on a growing list of website addresses. Every time we find a link to a new site, we add
it to our list to be visited at a future date. While we're visiting your site, we try to be as polite as possible: we'll
strive to visit each page only once, we have a minimum wait time between successive visits, and we do our best to avoid
downloading any files that aren't actual HTML pages.
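A minimal sketch of that frontier loop (illustrative only; the seed URL, delay value, and names are assumptions, and page fetching and link extraction are omitted for brevity):

```python
import time

seen = set()                          # pages already visited
frontier = ["http://example.com/"]    # growing list of addresses to visit
CRAWL_DELAY = 1.0                     # assumed minimum wait between visits

def crawl(max_pages=10):
    """Drain the frontier politely: visit each address at most once,
    waiting CRAWL_DELAY seconds between successive visits."""
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in seen:               # strive to visit each page only once
            continue
        seen.add(url)
        # ... fetch the page here, and append any newly discovered
        # site addresses to the frontier for a future visit ...
        visited.append(url)
        time.sleep(CRAWL_DELAY)       # minimum wait between successive visits
    return visited
```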
We also do our best to comply with all the rules and conventions of good crawling behavior, including the robots.txt exclusion protocol.
Won't your spider damage the performance of my website?
As we mentioned above, we do our very best to avoid harming any website we visit. If the Google crawler doesn't hurt your
site, ours shouldn't. With that said, over time we've seen some recurring misunderstandings crop up:
- "Invalid requests": The first thing we do when trying to load a webpage is to call the "HEAD" method, to try
and load headers without loading the content, so we can skip over things we're not interested. Some web servers don't
respond to this method, and this may be appear in your logs as an invalid HTTP request.
- "Accessing pages after I added your bot to robots.txt": We'll always try to load at least the home page of your
website, even if you've added our bot to your robots.txt file. The reason for this is that we need this first access to
figure out any domain redirections that may be in place.
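The HEAD check described in the first point above can be sketched in Python (a rough illustration, not our actual crawler; the user-agent value here is a placeholder):

```python
import urllib.request

def is_html(content_type):
    """Decide from a Content-Type header value whether a response
    is an actual HTML page."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")

def head_content_type(url, timeout=10):
    """Issue a HEAD request to fetch only the headers, without
    downloading the response body."""
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": "ExampleBot/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Content-Type", "")
```

A server that rejects the HEAD method makes this first request fail, which is what some logs record as an invalid request.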
How do I know when your bot accessed my site?
Like any well-behaved crawler, our bot has a name: a 'user-agent' identification string that allows you to identify our
accesses in your logs and filter them out if you want to.
Our crawler's personal user-agent string is: 'Mozilla/5.0 (compatible; BDCbot/1.0; +http://ecommerce.bigdatacorp.com.br/faq.aspx)'
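For instance, a quick way to separate our hits from the rest of an access log is to look for the bot's name in the user-agent field (a sketch; the log line below is invented for illustration):

```python
BOT_TOKEN = "BDCbot"

def is_bot_hit(log_line):
    """Return True when an access-log line carries the crawler's
    user-agent token."""
    return BOT_TOKEN in log_line

# An invented combined-format log line for illustration:
sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" '
          '200 512 "-" "Mozilla/5.0 (compatible; BDCbot/1.0; '
          '+http://ecommerce.bigdatacorp.com.br/faq.aspx)"')
```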
How do I stop your bot from visiting/crawling my site?
The best way to stop our crawler from accessing your site is through a "robots.txt" file. If you already have one of these on
your site, all you have to do is include the lines to block our crawler (which can be found below). If not, you can create a
file named "robots.txt" on your site's root directory and include the lines to block us.
To stop our crawler from accessing your site, just include the following lines in your robots.txt file:
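A standard exclusion block, reconstructed here from the 'BDCbot' name in the user-agent string above, following the usual robots.txt convention:

```
User-agent: BDCbot
Disallow: /
```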
Another way you can stop our crawler from visiting your website is by getting in touch with us. Just ask us, and we'll add
your site to our special "do not visit" list, effectively blocking our crawler. If you want to do this, just send us an email.
Is there anything else I should know?
If you would like to know more about the robots.txt protocol, to learn about other commands you can use and how to block other
crawlers, just go to the robotstxt.org website.
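As a quick illustration of the protocol, Python's standard library ships a robots.txt parser (the rules and bot name below are invented):

```python
import urllib.robotparser

# Parse an invented robots.txt body directly; in practice you would
# point set_url() at http://yoursite/robots.txt and call read().
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: ExampleBot",
    "Disallow: /private/",
])

# Query whether a given user-agent may fetch a given path.
print(rp.can_fetch("ExampleBot", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("ExampleBot", "http://example.com/public/page.html"))   # True
```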
If you want to know more about how our large-scale process works and how we set it up, you can read the case study we wrote with AWS
here (article in Portuguese),
or you can watch one of our talks on YouTube.