Invisible to the eye: Google never removed Oracle from its index

Saturday, August 14, 2010

Some folks have been reporting strange behavior by Google after Oracle filed its lawsuit against Google over Android: it supposedly removed the oracle.com pages, and all the pages that talk about Oracle, from its search index, even the Wikipedia page on the Delphic oracle.
I initially retweeted the news, and shortly after explained that it was a trick.
It would have been a cheap shot, really. I don't think it's even possible to remove such a large set of results from all of Google's datacenters in a short time frame.

What really happened
Someone made up this query:
http://www.google.com/search?q=оrаcІе
Initially the result page was empty (Your search - ... - did not match any documents). Then people began tweeting and sharing the query, and Google started showing the pages that contained it as the only results.

So how did they do it?
At first I thought someone had used a capital i (I) in place of the lowercase L in Oracle, but Google is smart and would perform a case-insensitive search in this case:
http://www.google.com/search?q=oracIe

Still, the difference between a capital i and a lowercase L is barely visible in Google's font.
But if you paste the link somewhere, or save the page and go over it with hexedit, you'll notice this:
http://www.google.com/search?q=%D0%BEr%D0%B0c%D0%86%D0%B5
This is a clear sign that someone has inserted non-ASCII characters into the query.
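If you don't have a hex editor at hand, the same check can be done in a few lines of Python. This is only a minimal sketch using the standard library; the variable names are mine, and it assumes the query is UTF-8 encoded (which is also the default assumption of `parse_qs`).

```python
from urllib.parse import urlsplit, parse_qs

# The percent-encoded link quoted above.
url = "http://www.google.com/search?q=%D0%BEr%D0%B0c%D0%86%D0%B5"

# parse_qs percent-decodes the query string, reading the bytes as UTF-8 by default.
query = parse_qs(urlsplit(url).query)["q"][0]

# Show each character's code point, its UTF-8 bytes, and whether it is plain ASCII.
for ch in query:
    raw = ch.encode("utf-8")
    kind = "ASCII" if ch.isascii() else "non-ASCII"
    print(f"U+{ord(ch):04X}  bytes={raw.hex(' ').upper()}  {kind}")

# Count the characters that cannot be typed as plain ASCII letters.
non_ascii_count = sum(1 for ch in query if not ch.isascii())
print(non_ascii_count, "of", len(query), "characters are non-ASCII")
```

Every two-byte sequence starting with D0 in the output is one of the replaced letters; only the r and the c are single ASCII bytes.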
The character table for Unicode/UTF-8 says that we have, in sequence:

CYRILLIC SMALL LETTER O
LATIN SMALL LETTER R
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER C
CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
CYRILLIC SMALL LETTER IE
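The list above can be reproduced without looking at a character table by hand. A small sketch, assuming Python's standard `unicodedata` module; the `query` string is written with escapes so that the Cyrillic letters cannot be mistaken for Latin ones in the source.

```python
import unicodedata

# The forged query: Cyrillic o, Latin r, Cyrillic a, Latin c,
# Cyrillic capital Byelorussian-Ukrainian I, Cyrillic ie.
query = "\u043er\u0430c\u0406\u0435"

# unicodedata.name returns the official Unicode name of each character.
names = [unicodedata.name(ch) for ch in query]

for ch, name in zip(query, names):
    print(f"U+{ord(ch):04X}  {name}")
```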

This combination of characters is very unlikely to be found in actual documents. In fact, at first it produced no results. Furthermore, in Google's font of choice, Arial, the difference between these letters and their Latin counterparts (if there is any) is again not visible to the naked eye. It makes sense for a font to reuse glyphs for letters that look the same in ordinary printed text.
And finally, the forgery replaces the majority of the Latin letters (four out of six), because replacing only one or two would lead to a "Did you mean: oracle" notice.

Mystery solved
So UTF-8 struck again, and some of us were fooled by an ingenious, well-forged Google query. Technically this is called a homograph attack.
The potential of UTF-8 as a dangerous means of fooling users is great: imagine if non-Latin URLs become a reality. Fortunately, ICANN and the major browsers have been working on a solution, but we as web developers should be aware of the problem too.
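As a starting point for that awareness, here is a rough sketch of how a web application might flag a suspicious string such as the forged query. It is not how Google or the browsers do it: the function names are hypothetical, and since the Python standard library does not expose the Unicode script property, the first word of each character's Unicode name is used as an approximation of its script.

```python
import unicodedata

def scripts_used(text):
    """Return the approximate set of scripts used by the letters in text.

    Heuristic (an assumption of this sketch): the first word of the
    Unicode character name, e.g. LATIN or CYRILLIC, stands in for the script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_like_homograph(text):
    """Flag a single word that mixes letters from more than one script."""
    return len(scripts_used(text)) > 1

forged = "\u043er\u0430c\u0406\u0435"   # the query from this post
genuine = "oracle"                      # six plain Latin letters

print(scripts_used(forged), looks_like_homograph(forged))
print(scripts_used(genuine), looks_like_homograph(genuine))
```

A check like this produces false positives on legitimately mixed text (brand names, multilingual phrases), so it is better suited to raising a warning than to rejecting input outright.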

14 comments:

Jim Harvie, August 14, 2010 11:21 PM
Thanks for explaining what happened in the code. What interests me is why there were only accusations pointed at Google, and no one saying Oracle removed themselves.
Jeff Faria, August 15, 2010 12:06 AM
Well-(and promptly!)-done. Not to be a stickler, but did you mean "Mystery" (not "Mistery") solved? Considering the subject of the post, you probably want to go after every single character with a fine tooth comb...
Garrett, August 15, 2010 12:30 AM
Thanks for helping to clear this up.

If you feel like a laugh (or a facepalm, as the case may be), you might want to see Gene continue to defend the position that Google really did change their search results. This man just doesn't know when to stop digging.

http://www.ipwatchdog.com/2010/08/13/google-briefly-punishes-oracle-by-removal-from-google-search/
Matt, August 15, 2010 5:57 AM
@Jim:

Probably because saying that Oracle *requested* themselves removed from the index is far more outrageous than the alternative.
Giorgio, August 15, 2010 11:58 AM
@Garrett: Gene Quinn is arrogant and dishonest - what you would expect from a patent attorney. He probably created the case to make some publicity for himself. But when you type the URLs of the results included in his screenshot, you get to pages (like http://dvlprs.com/link/2483939) that LINK TO THE FORGED query, with the exact Cyrillic characters shown here. This is exactly why only those pages were found; if you try now, you'll also find this page.
@Jim, Matt: Oracle is a valid English word. So Oracle removing itself would not also remove the Wikipedia pages about Greek oracles, along with all the web pages containing the word "oracle".
@Mister Snitch: thank you for finding the typo, I always confuse that and "holiday".
Giorgio, August 15, 2010 1:30 PM
Since Gene Quinn tried to discredit me and the other debunkers, here's a follow-up:
http://giorgiosironi.blogspot.com/2010/08/public-response-to-gene-quinn-on-google.html
Carey Tews, August 15, 2010 5:19 PM
I also think you probably meant that UTF-8 "strikes" or "struck" again, rather than "striked". Sorry :-/
Giorgio, August 15, 2010 8:46 PM
Thanks. As you may know I'm not a native English speaker, and I had little time for getting this post out. I spent more of it checking the UTF-8 character table than proofreading. :)
Doc, August 15, 2010 9:27 PM
Good catch, Giorgio! Gene obviously never learned when to pack his kit and leave via the back door!
TomW, August 16, 2010 6:29 PM
"imagine if non-latin URLs become a reality"
What do you mean? They already have. See http://www.bbc.co.uk/news/10100108
Giorgio, August 16, 2010 8:29 PM
I did some research and noticed they are now available, but the support is still not very good. There was a case study a while ago about faking paypal.com with Cyrillic characters, which would still be discovered by browsers (you would see something like www.--xn---.com).
Anonymous, October 03, 2010 10:29 PM
I would like to exchange links with your site giorgiosironi.blogspot.com
Is this possible?
Giorgio, October 04, 2010 11:40 AM
It's considered a bad practice by us web developers. Not interested, thanks.
Anonymous, December 07, 2010 8:56 PM
Good evening

Can I link to this post please?