RBL rule 350000 and Google (MSN) crawlers

andy928
New Forum User
Posts: 4
Joined: Tue Jul 10, 2012 7:05 pm
Location: Australia

RBL rule 350000 and Google (MSN) crawlers

Post by andy928

We are having problems with "access denied 403" responses to both the Google and MSN crawlers. The problem started around August 9-10. ModSecurity logs rule 350000 with a successful RBL lookup against spamhaus.org. Checking these IPs with Spamhaus shows similar host infections, for example:
IP Address 66.249.73.156 is listed in the CBL. It appears to be infected with a spam sending trojan or proxy.
It was last detected at 2012-08-21 08:00 GMT (+/- 30 minutes), approximately 6 days, 15 hours, 29 minutes ago.
This IP is infected with, or is NATting for a machine infected with Win32/Zbot (Microsoft).
I am not sure how to resolve this. Google is dropping the site from its index because of the "access denied" responses. Should we add these IPs to a whitelist? It may not be that simple, because Google uses a lot of IPs for its robots.

I tried adding Googlebot IP ranges like these to the conf files:
SecRule REMOTE_ADDR "@ipMatch 64.249.66.0/19"phase:1,nolog,allow,ctl:ruleEngine=Off
SecRule REMOTE_ADDR "@ipMatch 64.233.160.0/19"phase:1,nolog,allow,ctl:ruleEngine=Off
SecRule REMOTE_ADDR "@ipMatch 66.102.0.0/32"phase:1,nolog,allow,ctl:ruleEngine=Off
SecRule REMOTE_ADDR "@ipMatch 72.14.192.0/18"phase:1,nolog,allow,ctl:ruleEngine=Off
SecRule REMOTE_ADDR "@ipMatch 74.125.0.0/16"phase:1,nolog,allow,ctl:ruleEngine=Off
SecRule REMOTE_ADDR "@ipMatch 209.85.128.0/17"phase:1,nolog,allow,ctl:ruleEngine=Off
SecRule REMOTE_ADDR "@ipMatch 216.239.32.0/19"phase:1,nolog,allow,ctl:ruleEngine=Off
but I am not sure this is the best solution. I also tried adding the domain name ".googlebot.com" to the whitelist file, but it doesn't work.

I would appreciate any suggestions.

Thanks
mikeshinn
Atomicorp Staff - Site Admin
Posts: 4152
Joined: Thu Feb 07, 2008 7:49 pm
Location: Chantilly, VA

Re: RBL rule 350000 and Google (MSN) crawlers

Post by mikeshinn

Just add the IPs to the whitelist. In the realtime rules you can use CIDRs, or in ASL you can just enable search engine detection, which will automatically ignore search engines.
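
If you are writing the exception yourself in the realtime rules, the general shape would be something like this rough sketch (untested, so adjust it to your own setup; the CIDR is only an example Google crawl range, the id is a placeholder, and ctl:ruleRemoveById=350000 skips just the RBL rule instead of switching the engine off for those clients):

# Sketch only: skip the RBL check (rule 350000) for a trusted crawler range.
# Replace the CIDR and the id (100001 is just a placeholder) with your own values.
SecRule REMOTE_ADDR "@ipMatch 66.249.64.0/19" "phase:1,id:100001,nolog,pass,ctl:ruleRemoveById=350000"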
andy928
New Forum User
Posts: 4
Joined: Tue Jul 10, 2012 7:05 pm
Location: Australia

Re: RBL rule 350000 and Google (MSN) crawlers

Post by andy928

mikeshinn wrote: Just add the IPs to the whitelist. In the realtime rules you can use CIDRs, or in ASL you can just enable search engine detection, which will automatically ignore search engines.

Thank you. Can I use host names in the whitelist, like .googlebot.com, to avoid possible IP changes? I was thinking about something like:
SecRule REMOTE_HOST "!@endsWith .googlebot.com"
but in that case I will have to enable HostNameLookups, which is not a good idea.

What about this solution:
SecFilterSelective HTTP_USER_AGENT Google nolog,allow
SecFilterSelective HTTP_USER_AGENT Googlebot nolog,allow
SecFilterSelective HTTP_USER_AGENT GoogleBot nolog,allow
SecFilterSelective HTTP_USER_AGENT googlebot nolog,allow
SecFilterSelective HTTP_USER_AGENT Googlebot-Image nolog,allow
Would it be better than specifying IP ranges?
mikeshinn
Atomicorp Staff - Site Admin
Posts: 4152
Joined: Thu Feb 07, 2008 7:49 pm
Location: Chantilly, VA

Re: RBL rule 350000 and Google (MSN) crawlers

Post by mikeshinn

andy928 wrote: Thank you. Can I use host names in the whitelist, like .googlebot.com, to avoid possible IP changes? I was thinking about something like:
Only if Apache is doing reverse lookups. If it's not set up to do that, then it will not do anything.
andy928 wrote: but in that case I will have to enable HostNameLookups, which is not a good idea.
Depending on how quickly your DNS server replies, this can slow down page response times dramatically, and it will happen for every request. If your DNS server is really fast, then you can try it. You'll also want to read up on reverse DNS, because the forward records are trivial to fake, so you need reverse lookups enabled as well.
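
If your DNS is fast enough and you want to experiment with it anyway, the general idea would look something like the sketch below (untested, so treat it as an outline only; the id is a placeholder, and Apache's "Double" setting is what does the forward confirmation, i.e. a reverse lookup followed by a forward lookup that has to map back to the client IP):

# Sketch only: have Apache resolve and forward-confirm the client hostname.
HostnameLookups Double

# Then skip just the RBL rule for confirmed googlebot.com hosts.
# The id (100002) is a placeholder, pick one from your own local range.
SecRule REMOTE_HOST "@endsWith .googlebot.com" "phase:1,id:100002,nolog,pass,ctl:ruleRemoveById=350000"

Every request still pays for the DNS lookups though, which is why whitelisting the published IP ranges is usually the simpler option.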
andy928 wrote: What about this solution:
SecFilterSelective HTTP_USER_AGENT Google nolog,allow
SecFilterSelective HTTP_USER_AGENT Googlebot nolog,allow
SecFilterSelective HTTP_USER_AGENT GoogleBot nolog,allow
SecFilterSelective HTTP_USER_AGENT googlebot nolog,allow
SecFilterSelective HTTP_USER_AGENT Googlebot-Image nolog,allow
Would it be better than specifying IP ranges?
No, that header is trivial to fake and should never be trusted. It's set by the client, and the client can set it to anything they want. Attackers, spammers, and people in general use fake search engine user agents all the time, for all sorts of reasons, including trying to get past security controls. You can expect to get hacked pretty badly if you trust that header, so I would not recommend rules like that. That header cannot be trusted.