> I managed to crawl [..] more than 300k movies from IMDB in just a few hours
I suppose IMDB already has a pretty good architecture to handle that load, but please, if you're crawling from a single site, be careful. I host a similar database myself, and the CPU/load graphs of my server can tell me exactly when someone has a crawler active again. That's not fun if your goal is to keep a site responsive while keeping the hosting at low cost.
Very true indeed. I was also randomly rotating user agents (Mozilla, Safari, Chrome, IE). I thought this would make it harder to tell whether there was a lot of traffic from the same network or someone was just intensively crawling the site.
For me, it was more a proof of how efficient and fast a crawler can be.
Also, responses from IMDB were very fast, under 0.4 seconds, so not much time was lost there.
If /robots.txt does not mention a Crawl-delay, one page per 3 seconds is often a safe value. Of course, this depends heavily on the site. In any case, if you have a specific need, always contact the people responsible for the site. I occasionally run custom queries against the database on request, for example.
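For anyone who wants to wire that in, here is a minimal sketch of the idea, not a definitive implementation. It uses Python's standard `urllib.robotparser` to read a Crawl-delay and falls back to 3 seconds when none is given. The function names, the `fetch` callback, and the `"*"` user agent are my own placeholders, not anything from the original crawler.

```python
import time
from urllib import robotparser

DEFAULT_DELAY = 3.0  # seconds; the fallback suggested above, not a universal rule


def crawl_delay_from_robots(robots_txt, user_agent="*", default=DEFAULT_DELAY):
    """Return the Crawl-delay for user_agent, or default if none is declared."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


def polite_fetch(urls, fetch, delay, sleep=time.sleep):
    """Yield fetch(url) for each URL, pausing delay seconds between requests.

    fetch is a hypothetical callback supplied by the caller (for example a
    wrapper around urllib.request.urlopen); sleep is injectable for testing.
    """
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        yield fetch(url)
```

In practice you would download the site's /robots.txt once, pass its text to `crawl_delay_from_robots`, and hand the result to `polite_fetch` as `delay`. This only spaces requests out; it does not check Disallow rules, which you would still want to honor via `RobotFileParser.can_fetch`.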