I initially retweeted the news, and shortly afterwards explained that it was a trick.
It would have been a low blow, really. I don't think it's even possible to remove such a large set of results from all of Google's datacenters in a short time frame.
What really happened
Someone made up this query:
http://www.google.com/search?q=оrаcІе
Initially the result page was empty (Your search - ... - did not match any documents). Then people began tweeting and sharing the query, and Google started showing those tweets and pages as the only results:
So how did they do it?
At first I thought someone used a capital i (I) to substitute the L of Oracle, but Google is smart and would perform a case-insensitive search in this case:
http://www.google.com/search?q=oracIe
Nevertheless, the difference between capital i and lowercase L is not so visible in Google's font.
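To see why the capital-i trick would not be enough, here is a minimal Python sketch. It only illustrates a plain case-insensitive comparison; it is not how Google's matching actually works, and the variable names are mine.

```python
# Case folding maps capital I to "i", never to "l",
# so the look-alike word does not fold to the real one.
fake = "oracIe"   # capital i in place of the lowercase L
real = "oracle"

folded = fake.casefold()
print(folded)           # oracie
print(folded == real)   # False
```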
But, if you try to paste the link or save the page and go over it with hexedit, you'll notice this:
http://www.google.com/search?q=%D0%BEr%D0%B0c%D0%86%D0%B5
This is a clear sign that someone has inserted non-ASCII characters into the query.
The character table for Unicode/UTF-8 says that we have, in sequence:
CYRILLIC SMALL LETTER O
LATIN SMALL LETTER R
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER C
CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
CYRILLIC SMALL LETTER IE
This combination of characters is very unlikely to be found in actual documents. In fact, at first it did not produce results. Furthermore, in Google's font of choice, Arial, the difference between these letters and their Latin counterparts (if there is any) is again not clear to the naked eye. It makes sense to reuse glyphs that are actually the same in ordinary printed text.
And finally, the forgery replaces the majority of the Latin letters, because replacing only one or two would lead to a "Did you mean: oracle" notice.
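If you don't have a hex editor at hand, the character names listed above can be checked with a few lines of Python. This is just a sketch using the standard library; the variable names are mine, and testing whether a name starts with LATIN is a crude shortcut, not a real script lookup.

```python
import unicodedata
from urllib.parse import unquote

encoded = "%D0%BEr%D0%B0c%D0%86%D0%B5"   # query string from the URL above
query = unquote(encoded)                  # percent-decoding, UTF-8 by default

# Look up the official Unicode name of every character in the query.
names = [unicodedata.name(ch) for ch in query]
for ch, name in zip(query, names):
    print(f"U+{ord(ch):04X}  {name}")

# Count the letters whose name does not start with LATIN.
non_latin = sum(1 for name in names if not name.startswith("LATIN"))
print(f"{non_latin} of {len(query)} letters are not Latin")
```

Four of the six letters turn out to be Cyrillic, which matches the observation that most of the word had to be replaced.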
Mystery solved
So UTF-8 struck again, and some of us were fooled by an ingenious, well-forged Google query. Technically this is called a homograph attack.
The potential of UTF-8 as a dangerous means of fooling users is great - imagine if non-latin URLs become a reality. Fortunately, ICANN and the major browsers have been working on a solution, but we as web developers should be aware of the problem too.
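As a rough idea of what being aware of the problem could look like in code, here is a hypothetical mixed-script check in Python. The function names `scripts_used` and `looks_mixed` are my own, and taking the first word of each character's Unicode name is only a heuristic stand-in for the real Unicode script property, which the standard library does not expose.

```python
import unicodedata

def scripts_used(text):
    """Return the first word of the Unicode name of every letter in text.

    Heuristic only: the first word ("LATIN", "CYRILLIC", ...) is used
    as an approximation of the character's script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed(text):
    """Flag words that combine letters from more than one script."""
    return len(scripts_used(text)) > 1

forged = "\u043er\u0430c\u0406\u0435"   # the forged query from this post
print(scripts_used(forged), looks_mixed(forged))
print(scripts_used("oracle"), looks_mixed("oracle"))
```

A check like this would flag the forged query while leaving the plain word alone, but it would also flag legitimate text that mixes scripts, so treat it as a warning sign rather than a verdict.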
Thanks for explaining what happened in the code. What interests me is why there were only accusations pointed at Google, and no one saying Oracle removed themselves.
Well-(and promptly!)-done. Not to be a stickler, but did you mean "Mystery" (not "Mistery") solved? Considering the subject of the post, you probably want to go after every single character with a fine-tooth comb...
Thanks for helping to clear this up.
If you feel like a laugh (or a facepalm, as the case may be), you might want to see Gene continue to defend the position that Google really did change their search results. This man just doesn't know when to stop digging.
http://www.ipwatchdog.com/2010/08/13/google-briefly-punishes-oracle-by-removal-from-google-search/
@Jim:
Probably because saying that Oracle *requested* themselves removed from the index is far more outrageous than the alternative.
@Garrett: Gene Quinn is arrogant and dishonest - what you could expect from a patent attorney. He probably created the case to make some publicity for himself. But when you type the URLs of the results included in his screenshot, you get to pages (like http://dvlprs.com/link/2483939) that LINK TO THE FORGED query, with the exact Cyrillic characters shown here. This is exactly why only those pages were found; if you try now, you'll also find this page.
@Jim, Matt: "Oracle" is a valid English word. Oracle removing itself from the index would not also remove the Wikipedia pages about Greek oracles, along with all the web pages containing the word "oracle".
@Mister Snitch: thank you for finding the typo, I always confuse that and "holiday".
Since Gene Quinn tried to discredit me and the other debunkers, here's a follow-up:
http://giorgiosironi.blogspot.com/2010/08/public-response-to-gene-quinn-on-google.html
I also think you probably meant that UTF-8 "strikes" or "struck" again, rather than "striked". Sorry :-/
Thanks. As you may know I'm not a native English speaker, and I had little time for getting this post out. I spent more of it checking the UTF-8 character table than proofreading. :)
Good catch, Giorgio! Gene obviously never learned when to pack his kit and leave via the back door!
"imagine if non-latin URLs become a reality"
What do you mean? They already have. See http://www.bbc.co.uk/news/10100108
I did some research and noticed they are now available, but the support is still not very good. There was a case study a while ago about faking paypal.com with Cyrillic characters, which would still be discovered by browsers (you would see something like www.--xn---.com).
I would like to exchange links with your site giorgiosironi.blogspot.com
Is this possible?
It's considered a bad practice by us web developers. Not interested, thanks.
Good evening
Can I link to this post please?