Apart from the self-congratulation, this is an interesting confirmation that the performance of your website impacts search engine results.
As for Google sending multiple requests: from the way the article is written, it sounds as though Google sends the requests all at once and then waits for the answers to come back one by one. You can cure this by switching keep-alive off on the server side.
Typically in your HTTP configuration you would add a line like this (example for Apache):
KeepAlive Off
You could even do this just for the Googlebot:
BrowserMatch "Googlebot" nokeepalive
That way you can 'fix' the Googlebot issue without affecting the normal users of the site. (Note that BrowserMatch is case-sensitive; Google's user-agent string spells it "Googlebot".)
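Putting the two together, the relevant fragment of an Apache 2.x configuration might look like this. The directives are real Apache directives; the surrounding values are just a sketch, and `mod_setenvif` must be loaded for `BrowserMatch` to work:

```apache
# Keep keep-alive on for ordinary visitors...
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5

# ...but disable it for Googlebot specifically (requires mod_setenvif).
BrowserMatch "Googlebot" nokeepalive
```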
Even better would be to convince Googlebot not to pipeline while keeping keep-alive on. Perhaps by sending it some kind of IIS/4 Server header?
Keep-alive and pipelining go hand in hand - a trick I abused for many years in order to serve up streaming video using JPEGs.
The funny thing is that it works both ways: if you switch keep-alive on and start dumping answers into the pipe pre-emptively (because, for instance, you know you're talking to your own little piece of JavaScript on the other side, so you can predict the next request), then you can save yourself the round-trip delays that you would have if you stuck to the regular request/answer, request/answer pattern.
Keep-alive on pretty much implies that pipelining is ok.
For many years this was the 'secret sauce' my company lived off; that nobody clued in to it amazes me to this day, since it seemed a pretty obvious thing to do.
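That pre-emptive-push trick can be sketched in a few lines of Python. A `socketpair` stands in for a real keep-alive connection, and the frame names are hypothetical - this is a toy illustration of the idea, not the original implementation:

```python
import socket
import threading

def push_server(conn):
    # Read the single request the client actually sends...
    conn.recv(4096)
    # ...then answer it AND pre-emptively push the next answer we
    # already know our own client-side script would ask for next,
    # saving a full round trip.
    for body in (b"frame-1", b"frame-2"):
        header = (b"HTTP/1.1 200 OK\r\n"
                  b"Content-Type: text/plain\r\n"
                  b"Content-Length: %d\r\n"
                  b"Connection: keep-alive\r\n\r\n" % len(body))
        conn.sendall(header + body)
    conn.close()

client, server = socket.socketpair()
t = threading.Thread(target=push_server, args=(server,))
t.start()

# One request goes out; two answers come back on the same connection.
client.sendall(b"GET /frame HTTP/1.1\r\nHost: example\r\n\r\n")
data = b""
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    data += chunk
t.join()
```

The socket layer does not object: the client simply finds a second complete response waiting in its buffer without having asked for it yet.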
That's a very interesting hack, but I hope no one decides to deploy it today. You really can't predict what the next request will be - the browser can reuse the connection for some other request on a whim.
And I'm sure there's some proxy around that will panic if it gets a response before getting a request.
> Keep-alive on pretty much implies that pipelining is ok.
Well, sure, in theory. But since nearly every browser keeps it off (edit: doesn't pipeline requests), obviously a ton of servers are broken. Even Flickr was broken for many years with pipelining on (image downloads would abort randomly). Chrome is planning to eventually enable it with a bunch of heuristics; hopefully that will improve the situation.
As a rule I have it 'off' because I have seen more bugs related to keep-alive than benefits from it, but in some special cases the speedup can be dramatic, so you should always at least test to see what it does for you.
For instance, a gallery page with lots of small thumbnails could benefit from keep-alive being on.
Is pipelining a well-defined behaviour now? I thought it was still pretty browser dependent.
Obviously you need keep-alive for pipelining to work at all - but are you saying you could actually dump responses down a keep-alive connection without corresponding requests and popular modern browsers will just deal with it?
I've used it since (don't laugh) IE 3 came out because that was the only way to get decent performance out of it.
Pipelining is actually harder to guard against than to implement, because as soon as keep-alive is on the client can simply start sending multiple requests down the wire.
The other side will presumably respond to the first request by scanning the input up to and including the end of the request, then send back the result via the same connection. As soon as it has done that it will look for more input.
Both sides can 'cheat', in other words: you don't actually need to look at the input requests when you're a server, and you don't actually need to wait for that first response to arrive before sending the next if you're a client.
The socket layer doesn't care one bit - to it, it is all data - and the application layer would presumably have a hard time telling when certain bytes were sent unless it explicitly polls the other side of the connection while generating another answer or reading a new request.
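The client-side half of that 'cheat' - firing off all requests before reading any answer - can be sketched in Python. A toy server loop over a `socketpair` stands in for a real one, and the paths are made up for illustration:

```python
import socket
import threading

def serve(conn):
    # Toy HTTP/1.1 server loop: scan the input up to the end of one
    # request, answer it on the same connection, then look for more.
    buf = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        buf += chunk
        while b"\r\n\r\n" in buf:       # a complete request is buffered
            head, buf = buf.split(b"\r\n\r\n", 1)
            path = head.split()[1]      # request line: GET <path> HTTP/1.1
            body = b"you asked for " + path
            conn.sendall(b"HTTP/1.1 200 OK\r\n"
                         b"Content-Length: %d\r\n\r\n" % len(body) + body)
    conn.close()

client, server = socket.socketpair()
t = threading.Thread(target=serve, args=(server,))
t.start()

# The client 'cheats': all three requests go down the wire back to back,
# before any answer has come back.
for i in (1, 2, 3):
    client.sendall(b"GET /page%d HTTP/1.1\r\nHost: example\r\n\r\n" % i)
client.shutdown(socket.SHUT_WR)

data = b""
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    data += chunk
t.join()
```

The server never notices anything unusual: it just keeps finding complete requests in its buffer and answers them in order, which is exactly what pipelining is.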
So, typically the sequence on a keep-alive connection looks like this:

client: request 1
server: answer 1
client: request 2
server: answer 2

...with a full round trip between each request/answer pair, all over a single connection.
So as soon as you switch on keep-alive you give the other side the opportunity to start pipe-lining requests or answers.
The problem with the way the Googlebot works here is that it seems to log all the pipelined but as-yet-unanswered requests as 'failed', whereas it should only log the request that was next up for answering as failed.
The obvious solutions are to either switch keep-alive 'off' for Google or speed things up to the point where it simply works.
I'm all for Google being a force pushing for adoption - pipelining is good. If Google requiring it means people pay more attention to making their sites work with it, great.
(I haven't read the spec - does a browser with pipelining on require that the responses are returned in order?)
While the article was interesting the first line "We did it. We solved one of the unsolved big SEO (Search Engine Optimization) mysteries of the modern time." had me reaching for the close button. Is it just me or is there a LOT of spin from SEO types?
As the author of the article in question, SEO and geek, I must say: I deliberately over-spun the wording of the article because - as a matter of fact - the mystery was so mysterious that nobody in the so-called SEO "community" ever noticed that there was something rotten going on in their beloved Google webmaster reports.
And (I'm not sure why - it was a Friday afternoon when I wrote that thing) I thought it would be funny to take the typical over-the-top-linkbait kind of writing style and ... do something nobody has ever done before ... put some real information in it.
Someone needs to point out that this isn't the TCP/IP level, it's just HTTP.
I know it is pedantic, but technical discussions need precision in their agreed terminology. HTTP and TCP/IP are at different stack levels, and everything discussed here is HTTP.