
> I managed to crawl [..] more than 300k movies from IMDB in just a few hours

I suppose IMDB already has a pretty good architecture to handle that load, but please, if you're crawling from a single site, be careful. I host a similar database myself, and the CPU/load graphs of my server can tell me exactly when someone has a crawler active again. That's not fun if your goal is to keep a site responsive while keeping the hosting at low cost.



Very true indeed. I was also randomly changing user-agents (Mozilla, Safari, Chrome, IE). I thought this would make it harder to tell whether there was a lot of traffic from the same network or someone was just intensively crawling the site.

For me, it was more a proof of how efficient and fast a crawler can be. Also, IMDB responded very quickly, in less than 0.4 seconds, so not much time was lost there.


Gray hat question out of curiosity and possible experience: did you also use proxies or perhaps even Tor?


So how polite does one need to be? One hit per x seconds?


If the /robots.txt does not mention a Crawl-delay, one page per 3 seconds is often a safe value. Of course, this depends heavily on the site. In any case, if you have a specific need, always contact the people responsible for the site. I occasionally run custom queries against the database on request, for example.
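
For anyone who wants to wire that rule into a crawler, here is a rough sketch in Python using the standard library's urllib.robotparser. It reads a Crawl-delay from robots.txt text if one is present and otherwise falls back to 3 seconds. The names DEFAULT_DELAY, resolve_delay, and Throttle are made up for illustration, and the 3-second fallback is only the rule of thumb from this comment, not a standard.

```python
import time
import urllib.robotparser

DEFAULT_DELAY = 3.0  # seconds; assumed fallback when robots.txt sets no Crawl-delay


def resolve_delay(robots_txt: str, user_agent: str = "*") -> float:
    """Return the Crawl-delay declared for user_agent, or DEFAULT_DELAY."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    return float(delay) if delay is not None else DEFAULT_DELAY


class Throttle:
    """Blocks so that consecutive wait() calls are at least `delay` seconds apart."""

    def __init__(self, delay: float, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self) -> None:
        if self.last is not None:
            remaining = self.delay - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

In a crawl loop you would fetch the site's /robots.txt once, build Throttle(resolve_delay(text)), and call wait() before every request. This only spaces requests out; it does not replace asking the site owner when you need a lot of data.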



