
How to block a user-agent that has a space in its name?

@Vandalay111

Posted in: #Botattack #Htaccess #Nginx #UserAgent #WebCrawlers

I got a hit from a crawler with a user-agent called DV CRAWLER, which is obviously a spam bot. I tried to block it in both .htaccess and the nginx configuration, as I'm running nginx as a reverse proxy in front of Apache.

Here is the code I used for .htaccess

RewriteCond %{HTTP_USER_AGENT} ^.*(Baiduspider|DV CRAWLER).*$ [NC]
RewriteRule .* - [F,L]


It seems that the space in the user-agent name has broken the code. I discovered that it only works with user agents that have no spaces in their names. It is the same scenario with nginx: it doesn't accept a space in the user-agent name and returns an error.

Nginx code:

if ($http_user_agent ~ (Baiduspider|DV CRAWLER) ) {
return 403;
}


So, what is the alternative for this? I don't want these spam bots to crawl my website. Any answer would be greatly appreciated.


3 Comments


@Ogunnowo487

The space is a delimiter (i.e. a special character) in .htaccess, so it must be backslash-escaped if you want to match a literal space in the regex, e.g. DV\ CRAWLER. (Otherwise you are likely to get a less-than-helpful 500 Internal Server Error.)

Or you can use the shorthand character class \s, which matches any whitespace character (space, tab or newline), so not technically just a space.
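
As a minimal sketch of the two options described above, the rule could look something like this. It assumes the rules live in a .htaccess file with mod_rewrite available, and that the bot's User-Agent header really contains the text DV CRAWLER with a single space; adjust the pattern if the actual header differs.

```apache
RewriteEngine On

# Option 1: backslash-escape the space so Apache does not treat it
# as the delimiter between directive arguments.
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|DV\ CRAWLER) [NC]
RewriteRule ^ - [F]

# Option 2 (use instead of option 1, not in addition to it):
# \s matches any whitespace character, not only a space.
# RewriteCond %{HTTP_USER_AGENT} (Baiduspider|DV\sCRAWLER) [NC]
# RewriteRule ^ - [F]
```

The leading ^.* and trailing .*$ from the original rule are left out here because the pattern is not anchored and already matches anywhere in the header. The L flag is also omitted because F implies it.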


@Heady270

When in doubt, add parentheses and escaping to regular expressions. Try this first:

(Baiduspider|(DV CRAWLER))


I think that your problem is that it is evaluating as "Baiduspider or DV followed by CRAWLER" when you don't have the parentheses. If that doesn't work, then try escaping the space:

(Baiduspider|(DV\sCRAWLER))


Where \s is any whitespace character.


@Sherry384

Your regex code in general is wrong.

Try instead something like this:

RewriteCond %{HTTP_USER_AGENT} (.*Baiduspider.*|.*DV.*CRAWLER.*) [NC]


Each alternative between the parentheses (), separated by the pipe character |, is matched against the string, and .* is a wildcard that matches anything. Optionally you can use \s or \s+ for spaces, but .* works too and may be better. Not knowing what the full DV CRAWLER string looks like, I made a guess (SWAG). You may need to adjust this.

For example, the string "a line of red cars driving down the street" could be matched simply using .*red.*cars.*. There are slicker regular expressions for this, but this simple method can safely be repeated over and over.
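
Because a RewriteCond pattern is an unanchored regular expression, the outer wildcards in the rule above are arguably optional. The following is only a sketch of a shorter equivalent. It assumes RewriteEngine On is already set earlier in the same .htaccess file and that the real User-Agent contains DV somewhere before CRAWLER.

```apache
# Matches any User-Agent containing "Baiduspider", or "DV" followed
# later by "CRAWLER" with anything (or nothing) in between.
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|DV.*CRAWLER) [NC]
RewriteRule ^ - [F]

# Stricter variant (use instead of the condition above): require one
# or more whitespace characters between the two words.
# RewriteCond %{HTTP_USER_AGENT} (Baiduspider|DV\s+CRAWLER) [NC]
```

The looser DV.*CRAWLER form also matches unrelated agents that happen to contain both words in that order, so the stricter variant may be the safer choice once the exact header is known.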
