How to Block “Bad Bots” on Your Apps and Web Servers (Without Killing the Good Traffic)

Bots are everywhere. Some are essential (e.g., Googlebot or Bingbot index your website for search engines), but others slow your site down, consume bandwidth, scrape content to train AI models, or probe for vulnerabilities. If you manage a website or an API, you need to separate the wheat from the chaff: allow the legitimate traffic and block abuse before it reaches your server.

Here's a practical guide, with examples and tools, for sites that need real measures without breaking SEO or usability.


1) Identify Suspicious Traffic

Before blocking, measure:

  • Server logs (Apache/Nginx): request spikes to the same URL, patterns like /wp-login.php or /xmlrpc.php, empty or fake User-Agents, nighttime bursts.
  • Analytics (GA4/Matomo): anomalous bounce rates, 0-sec sessions, countries where you have little or no audience.
  • Latency and bandwidth: watch for clients that consume a disproportionate share of bandwidth or CPU while contributing little value; that imbalance is usually bot noise.

Typical Signs:

  • Spikes around login endpoints, search, RSS, JSON feeds, and /wp-json/wp/v2/users (user enumeration).
  • Massive downloads of images or PDFs.
  • User-Agents imitating Google (“Googlebot/2.1”) with IPs that don’t resolve to Google.
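To surface these patterns quickly, here is a minimal sketch in Python that tallies the noisiest IPs and user agents from an access log in the common "combined" format (the log path and the top-10 cutoff are assumptions; adapt them to your setup):

# count_clients.py: rough tally of the top IPs and user agents in an access log
import collections
import re
import sys

# Combined log format: IP - user [time] "request" status size "referer" "user-agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

ips = collections.Counter()
agents = collections.Counter()

with open(sys.argv[1] if len(sys.argv) > 1 else "/var/log/nginx/access.log") as f:
    for line in f:
        m = LINE.match(line)
        if m:
            ips[m.group(1)] += 1
            agents[m.group(2) or "(empty)"] += 1

print("Top IPs:", ips.most_common(10))
print("Top user agents:", agents.most_common(10))

Run it as python3 count_clients.py /path/to/access.log and compare the heaviest clients against what you would expect from real users.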

2) Robots.txt: Useful for the Good, Irrelevant for the Bad

robots.txt does not block malicious bots; it guides compliant ones. Still, configure it to reduce unnecessary crawling:

User-agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/
Disallow: /search
Allow: /wp-admin/admin-ajax.php

Sitemap: https://your-domain.com/sitemap.xml

Important: don’t put sensitive paths in robots.txt unless protected by other means (authentication, firewall). A malicious bot will use it as a guide.


3) IP/User-Agent Blocking via .htaccess (Apache) or Nginx: Targeted, Not the Silver Bullet

To block IPs or specific User-Agents in Apache:

# .htaccess

<RequireAll>
  Require all granted
  Require not ip 203.0.113.0/24
</RequireAll>


# Block by user-agent
BrowserMatchNoCase "curl|python-requests|scrapy|wget" badbot
<RequireAll>
  Require all granted
  Require not env badbot
</RequireAll>

In Nginx:

# http context
map $http_user_agent $badbot {
  default 0;
  ~*(curl|python-requests|scrapy|wget) 1;
}

server {
  if ($badbot) { return 403; }   # reject known scraper user-agents
  deny 203.0.113.0/24;           # drop an abusive IP range
}

Limitations: maintaining manual lists is costly; scrapers change IPs and User-Agents. Use this for surgical cases, not as the sole strategy.
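One way to keep those surgical blocks manageable is to collect them in a single include file; a minimal Nginx sketch, assuming a hypothetical /etc/nginx/blocklist.conf that you edit by hand (or generate from a script) and then reload Nginx:

# /etc/nginx/blocklist.conf (hypothetical path), one entry per abusive range
deny 203.0.113.0/24;
deny 198.51.100.42;

# In the relevant server block:
server {
  include /etc/nginx/blocklist.conf;
}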


4) Web Application Firewall (WAF): Filter Before Your App

A WAF applies known rules to block malicious patterns (SQLi, XSS, RFI, brute force). Two approaches:

4.1 Managed WAF (ModSecurity + OWASP CRS)

  • ModSecurity (engine) + OWASP ModSecurity Core Rule Set (rules) is a proven classic.
  • You can adjust the Paranoia Level (PL1→PL4) and Anomaly Threshold to balance security and false positives.
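For reference, both knobs live in crs-setup.conf; a minimal sketch using the CRS 3.x variable names (the names differ slightly in CRS 4.x, so check the crs-setup.conf.example shipped with your version):

# crs-setup.conf (CRS 3.x): start permissive, tighten gradually
SecAction \
  "id:900000,phase:1,pass,t:none,nolog,\
  setvar:tx.paranoia_level=1"

SecAction \
  "id:900110,phase:1,pass,t:none,nolog,\
  setvar:tx.inbound_anomaly_score_threshold=5,\
  setvar:tx.outbound_anomaly_score_threshold=4"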

Example (ModSecurity) rule for empty User-Agent:

SecRule REQUEST_HEADERS:User-Agent "^$" \
"id:1000101,phase:1,deny,status:403,msg:'Empty UA blocked'"

4.2 Cloud WAFs (Cloudflare, etc.)

  • Advantage: traffic is filtered before it reaches your server (saving CPU/IO and bandwidth).
  • Rules based on regex, IP reputation, country, ASN, User-Agent, presence of JavaScript, etc.

Example (Cloudflare Firewall Rule):
(http.user_agent contains "python-requests") or (ip.geoip.country in {"CN" "RU"} and http.request.uri.path eq "/wp-login.php") → Block


5) Rate Limiting & Human Verification without Sacrificing UX

  • Rate Limit: restrict requests per IP or API key at critical endpoints (login, search, APIs); see the Nginx sketch below.
  • Challenge: use JavaScript challenges or light Proof of Work (PoW) to increase bot costs without excessive friction.
  • Accessible CAPTCHA: if friction is unavoidable, avoid intrusive visual challenges; offer privacy-respecting options with accessible alternatives (such as audio) that comply with WCAG/EAA.

Tip: apply challenges only in high-risk cases (new IP, suspicious User-Agent, no prior cookies), not universally.
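For the rate-limit point above, a minimal Nginx sketch; the zone name, the 10 requests/minute rate, and the login endpoint are assumptions to tune against your real traffic:

# http context: track clients by IP, allow ~10 requests per minute each
limit_req_zone $binary_remote_addr zone=login:10m rate=10r/m;

# server context: apply the limit only to the sensitive endpoint
location = /wp-login.php {
  limit_req zone=login burst=5 nodelay;   # absorb a small burst, reject the rest
  limit_req_status 429;                   # answer 429 instead of the default 503
}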


6) Treat Good Bots Well: Verification & Allowlist

  • Verify Google and Bing IPs with reverse DNS and forward-confirmed reverse DNS.
  • Maintain an allowlist for M2M integrations (monitoring, uptime, payments).
  • Serve up-to-date sitemaps and avoid blocking essential CSS/JS.

Verifying Googlebot (Summary):

  1. Perform reverse DNS on the IP → should return a Google domain (.googlebot.com/.google.com).
  2. Perform forward DNS on that hostname → should resolve back to the same IP.
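A minimal sketch of that two-step check in Python; it does no caching or lookup rate limiting, so treat it as illustrative rather than production-ready:

# verify_googlebot.py: forward-confirmed reverse DNS (FCrDNS) check
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip: str) -> bool:
    try:
        # Step 1: reverse DNS, the IP should map to a Google hostname
        host, _aliases, _addrs = socket.gethostbyaddr(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: forward DNS, the hostname must resolve back to the same IP
        _name, _aliases, forward_ips = socket.gethostbyname_ex(host)
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

if __name__ == "__main__":
    # 66.249.66.1 sits in Google's published crawler range at the time of writing
    print(is_verified_googlebot("66.249.66.1"))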

7) Protect Expensive APIs & Endpoints

  • Tokens and Keys with scopes and short expiration.
  • M2M rate limiting per consumer (API key, OAuth client); see the sketch below.
  • HMAC signatures or mutual TLS where feasible.
  • Strict CORS and minimal methods/origins allowed.
  • Payload limits and strong input validation (avoid resource exhaustion).
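As a sketch of the second and fifth points, an Nginx fragment that rate-limits per API key and caps payload size; the X-API-Key header name and the backend upstream are assumptions, and requests that omit the header get an empty key that this zone does not limit, so handle them separately:

# http context: track consumers by API key, ~60 requests per minute each
limit_req_zone $http_x_api_key zone=apikey:10m rate=60r/m;

# server context
location /api/ {
  limit_req zone=apikey burst=20 nodelay;
  client_max_body_size 256k;        # payload cap against resource exhaustion
  proxy_pass http://backend;        # your API upstream
}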

8) How to Do It with a Managed WAF (Example: RunCloud + ModSecurity)

If you manage your sites through a panel that bundles ModSecurity + OWASP CRS (e.g., RunCloud):

  1. In the Dashboard, go to Firewall.
  2. Adjust the Paranoia Level (start at PL1 and increase gradually) and Anomaly Threshold (lower means stricter).
  3. Add custom rules to allow/block by IP, country, User-Agent, or cookie value.
  4. Enable notifications and review block logs (rule ID, endpoint).
  5. Monitor false positives (e.g., internal searches, form plugins) and create exceptions for specific routes or parameters (see the sketch below).

When the WAF blocks a request, the visitor sees a 403 Forbidden page. Your goal is to balance security and user experience.
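As an illustration of the kind of exception mentioned in step 5, a one-line ModSecurity sketch that stops a single CRS rule (942100, a frequent false-positive source on search and form parameters) from inspecting one parameter; the parameter name "s" is an assumption:

# Place after the CRS rules are included, so rule 942100 already exists
SecRuleUpdateTargetById 942100 "!ARGS:s"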


9) Continuous Monitoring & Fine-tuning

  • KPI: blocked requests, false positive rate, bandwidth consumption, CPU/IO.
  • Alerts: new patterns (unknown routes, credential stuffing, aggressive scraping).
  • Monthly review: update rules, review allowlist, focus on sensitive endpoints.
  • Testing: execute smoke tests after rule changes to ensure critical workflows aren’t broken.

10) “Copy & Paste” Examples

Blocking /wp-login.php access by country (Nginx + GeoIP):

# Requires the Nginx GeoIP module with geoip_country configured in the http context
map $geoip_country_code $block_country {
  default 1;          # block everything else
  GB 0; US 0; ES 0;   # allowed countries
}
location = /wp-login.php {
  if ($block_country) { return 403; }
  try_files $uri =404;
  include fastcgi_params;
  fastcgi_pass php-fpm;
}

Apache: deny python-requests and curl:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "(python-requests|curl)" [NC]
RewriteRule .* - [F]

ModSecurity: block xmlrpc.php:

SecRule REQUEST_URI "@streq /xmlrpc.php" \
"id:1000201,phase:1,t:none,deny,status:403,msg:'xmlrpc blocked'"

Quick Checklist

  • Logs reviewed (suspicious UA/IP/endpoints).
  • robots.txt up-to-date (no sensitive paths).
  • Rate limiting on login/search/APIs.
  • Active WAF (ModSecurity+OWASP CRS or cloud WAF) with custom rules.
  • Allowlist for good bots and M2M services.
  • Temporary challenges (JS/PoW/CAPTCHA) for high-risk cases.
  • API protections (tokens, scopes, CORS, HMAC/mTLS).
  • Monitoring & alerts; periodic review of false positives.

Frequently Asked Questions

Can I block all bots at once?
Not recommended: you’d lose valuable SEO and useful services. The key is to differentiate (WAF + search engine verification + allowlist) and limit abusive traffic with rate limiting and high-risk challenges.

Does robots.txt stop bad bots?
No. It’s a courtesy protocol for legitimate crawlers. Malicious bots ignore it. Use it to guide good bots and combine with WAF/rate limiters for the rest.

Are IP blocks enough?
They’re temporary: scrapers rotate IPs/ASNs. Better to combine IP/User-Agent/geography + rate limiting + reputation + adaptive challenges.

How to avoid hurting SEO?
Keep an allowlist of verified good bots (Google, Bing), publish sitemaps, and don’t block essential CSS/JS needed for rendering. Verify bots’ real IPs with reverse DNS before allowing.


Conclusion

Blocking “bad bots” isn’t a single magic rule: it’s layers. Start by monitoring what’s happening, apply WAF + rate limiting, use challenges only when necessary, and protect APIs and costly endpoints. With continuous monitoring and periodic adjustments, you can cut down the noise without sacrificing SEO or legitimate user experience.
