Bot Detection¶

Implementations used for bot detection.

searx.botdetection.get_network(real_ip: IPv4Address | IPv6Address, cfg: config.Config) → IPv4Network | IPv6Network[source]¶

Returns the (client) network that real_ip is part of.

The ipv4_prefix and ipv6_prefix define the number of leading bits in an address that are compared to determine whether or not an address is part of a (client) network.

[botdetection]

ipv4_prefix = 32
ipv6_prefix = 48
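As a rough illustration of how the two prefix settings map an address to a (client) network, the following sketch uses only Python's standard ipaddress module. The helper name client_network and its keyword defaults are assumptions made for this example; it is not the actual implementation of get_network.

```python
import ipaddress


def client_network(real_ip, ipv4_prefix=32, ipv6_prefix=48):
    """Return the network that contains real_ip (hypothetical helper).

    The prefix length is chosen by IP version, mirroring the
    ipv4_prefix / ipv6_prefix settings shown above.
    """
    prefix = ipv4_prefix if real_ip.version == 4 else ipv6_prefix
    # strict=False masks the host bits instead of raising ValueError
    return ipaddress.ip_network(f"{real_ip}/{prefix}", strict=False)


v4 = client_network(ipaddress.ip_address("192.0.2.7"))
v6 = client_network(ipaddress.ip_address("2001:db8:1:2::1"))
print(v4, v6)
```

With ipv4_prefix = 32 every IPv4 address forms its own network, while ipv6_prefix = 48 groups all IPv6 addresses that share the same leading 48 bits.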

searx.botdetection.too_many_requests(network: IPv4Network | IPv6Network, log_msg: str) → Response | None[source]¶: Returns an HTTP 429 response object and writes an ERROR message to the ‘botdetection’ logger. This function is used by the filter methods to return the default Too Many Requests response.

class searx.botdetection.ProxyFix(wsgi_app: WSGIApplication)[source]¶

A middleware like Werkzeug's ProxyFix class, where the x_for argument is replaced by a method that determines the number of trusted proxies via the botdetection.trusted_proxies setting.

The remote IP (flask.Request.remote_addr) of the request is taken from (first match):

X-Forwarded-For: If the header is set, the first untrusted IP that comes before the IPs that are still part of the botdetection.trusted_proxies is used.
X-Real-IP: If X-Forwarded-For is not set, X-Real-IP is used (botdetection.trusted_proxies is ignored).

If neither header is set, the REMOTE_ADDR from the WSGI layer is used. If (for whatever reason) no IP can be determined, an error message is displayed and 100:: is used instead (RFC 6666).
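The lookup order described above can be sketched as follows. This is a simplified illustration, not the middleware itself: the function name resolve_client_ip, the plain dict used for headers, and the handling of an X-Forwarded-For header that contains only trusted entries (falling back to REMOTE_ADDR) are all assumptions of this example.

```python
import ipaddress


def resolve_client_ip(headers, remote_addr, trusted_proxies):
    """Pick the client IP from proxy headers (hypothetical helper)."""
    trusted = [ipaddress.ip_network(n, strict=False) for n in trusted_proxies]
    candidate = None

    forwarded = headers.get("X-Forwarded-For")
    if forwarded:
        # Walk from the right: skip trusted proxies, stop at the first
        # address that is not covered by trusted_proxies.
        for part in reversed([p.strip() for p in forwarded.split(",")]):
            try:
                ip = ipaddress.ip_address(part)
            except ValueError:
                continue  # ignore malformed entries
            if not any(ip in net for net in trusted):
                candidate = str(ip)
                break
    elif headers.get("X-Real-IP"):
        # trusted_proxies is not consulted for X-Real-IP
        candidate = headers["X-Real-IP"].strip()

    if candidate is None:
        candidate = remote_addr
    # 100:: is taken from the IPv6 discard prefix (RFC 6666)
    return candidate or "100::"


print(resolve_client_ip(
    {"X-Forwarded-For": "203.0.113.9, 10.0.0.2, 10.0.0.1"},
    "10.0.0.1",
    ["10.0.0.0/8"],
))
```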

IP lists ¶

Method `ip_lists`¶

The ip_lists method implements a block list and a pass list.

[botdetection.ip_lists]

pass_ip = [
  '167.235.158.251', # IPv4 of check.searx.space
  '192.168.0.0/16',  # IPv4 private network
  'fe80::/10',       # IPv6 linklocal
]

block_ip = [
  '93.184.216.34',   # IPv4 of example.org
  '257.1.1.1',       # invalid IP --> will be ignored, logged in ERROR class
]

searx.botdetection.ip_lists.SEARXNG_ORG = ['167.235.158.251', '2a01:04f8:1c1c:8fc2::/64']¶: Passlist of IPs from the SearXNG organization, e.g. check.searx.space.

searx.botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) → Tuple[bool, str][source]¶: Checks whether real_ip matches one of the members (addresses or subnets) of the botdetection.ip_lists.pass_ip list.

searx.botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) → Tuple[bool, str][source]¶: Checks whether real_ip matches one of the members (addresses or subnets) of the botdetection.ip_lists.block_ip list.
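A membership test of this kind can be sketched with the standard ipaddress module. The helper name ip_in_list and the wording of the returned messages are assumptions of this example; only the (bool, str) return shape and the "invalid entries are ignored and logged as errors" behaviour are taken from the description and the configuration comments above.

```python
import ipaddress
import logging

logger = logging.getLogger("botdetection")


def ip_in_list(real_ip, entries):
    """Return (matched, message) for real_ip against a list of IPs/subnets."""
    for entry in entries:
        try:
            # a plain address such as '93.184.216.34' becomes a /32 network
            net = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            logger.error("invalid IP %s in list, entry is ignored", entry)
            continue
        if real_ip in net:
            return True, f"IP matches {net.compressed}"
    return False, "IP is not in the list"


block_ip = ["93.184.216.34", "257.1.1.1"]
print(ip_in_list(ipaddress.ip_address("93.184.216.34"), block_ip))
```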

Rate limit ¶

Method `ip_limit`¶

The ip_limit method counts requests from an IP in sliding windows. If there are too many requests in a sliding window, the request is evaluated as a bot request. This method requires a valkey DB and needs an HTTP X-Forwarded-For header. For privacy, only the hash value of an IP is stored in the valkey DB, and for a maximum of 10 minutes.
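To make the idea of a sliding window concrete, here is a minimal in-memory sketch. It does not use a valkey DB; the class SlidingWindowCounter, the helper hashed_key, and the choice of SHA-256 with a secret string are assumptions made purely for illustration.

```python
import hashlib
from collections import defaultdict, deque


def hashed_key(ip, secret="hypothetical-secret"):
    """Only a hash of the IP is used as key, never the IP itself."""
    return hashlib.sha256((secret + ip).encode("utf-8")).hexdigest()


class SlidingWindowCounter:
    """Counts hits per key within the last `window` seconds (in memory)."""

    def __init__(self, window):
        self.window = window
        self.hits = defaultdict(deque)

    def incr(self, key, now):
        queue = self.hits[key]
        queue.append(now)
        # drop timestamps that have fallen out of the window
        while queue and queue[0] <= now - self.window:
            queue.popleft()
        return len(queue)


burst = SlidingWindowCounter(window=20)
key = hashed_key("192.0.2.1")
counts = [burst.incr(key, now=t) for t in range(16)]
print(counts[-1] > 15)  # the 16th request within 20 seconds exceeds a limit of 15
```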

The link_token method can be used to investigate whether a request is suspicious. To activate the link_token method in the ip_limit method add the following configuration:

[botdetection.ip_limit]
link_token = true

If the link_token method is activated and a request is suspicious, the request limits are reduced:

BURST_MAX -> BURST_MAX_SUSPICIOUS
LONG_MAX -> LONG_MAX_SUSPICIOUS

To intercept bots that get their IPs from a range of IPs, there is a SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored for a longer time. IPs stored in this sliding window have a maximum of SUSPICIOUS_IP_MAX accesses before they are blocked. As soon as the IP makes a request that is not suspicious, the sliding window for this IP is dropped.
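The switch between normal and reduced limits can be sketched as a small decision function. The constants are the values documented for searx.botdetection.ip_limit; the function exceeds_limit and its arguments are hypothetical and only illustrate the comparison, not how the counters are obtained.

```python
BURST_MAX = 15
BURST_MAX_SUSPICIOUS = 2
LONG_MAX = 150
LONG_MAX_SUSPICIOUS = 10


def exceeds_limit(burst_count, long_count, suspicious):
    """Return True if either sliding-window counter is over its limit."""
    burst_max = BURST_MAX_SUSPICIOUS if suspicious else BURST_MAX
    long_max = LONG_MAX_SUSPICIOUS if suspicious else LONG_MAX
    return burst_count > burst_max or long_count > long_max


# Three requests in the burst window are fine for a normal client,
# but too many once the client has been rated as suspicious.
print(exceeds_limit(3, 3, suspicious=False), exceeds_limit(3, 3, suspicious=True))
```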

searx.botdetection.ip_limit.BURST_WINDOW = 20¶: Time (sec) before sliding window for burst requests expires.

searx.botdetection.ip_limit.BURST_MAX = 15¶: Maximum requests from one IP in the BURST_WINDOW

searx.botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2¶: Maximum of suspicious requests from one IP in the BURST_WINDOW

searx.botdetection.ip_limit.LONG_WINDOW = 600¶: Time (sec) before the longer sliding window expires.

searx.botdetection.ip_limit.LONG_MAX = 150¶: Maximum requests from one IP in the LONG_WINDOW

searx.botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10¶: Maximum suspicious requests from one IP in the LONG_WINDOW

searx.botdetection.ip_limit.API_WINDOW = 3600¶: Time (sec) before sliding window for API requests (format != html) expires.

searx.botdetection.ip_limit.API_MAX = 4¶: Maximum requests from one IP in the API_WINDOW

searx.botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000¶: Time (sec) before sliding window for one suspicious IP expires.

searx.botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3¶: Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.

Method `link_token`¶

The link_token method evaluates a request as suspicious if the URL /client<token>.css is not requested by the client. Because the URL contains a random component (the token), a bot cannot send a ping by requesting a static URL.

Note

This method requires a valkey DB and needs an HTTP X-Forwarded-For header.

To use this method, a Flask URL route needs to be added:

@app.route('/client<token>.css', methods=['GET', 'POST'])
def client_token(token=None):
    link_token.ping(request, token)
    return Response('', mimetype='text/css')

The Flask HTML template also needs a stylesheet link (the value of link_token comes from get_token):

<link rel="stylesheet"
      href="{{ url_for('client_token', token=link_token) }}"
      type="text/css" >

searx.botdetection.link_token.TOKEN_LIVE_TIME = 600¶: Lifetime (sec) of limiter’s CSS token.

searx.botdetection.link_token.PING_LIVE_TIME = 3600¶: Lifetime (sec) of the ping-key from a client (request)

searx.botdetection.link_token.PING_KEY = 'SearXNG_limiter.ping'¶: Prefix of all ping-keys generated by get_ping_key

searx.botdetection.link_token.TOKEN_KEY = 'SearXNG_limiter.token'¶: Key for which the current token is stored in the DB

searx.botdetection.link_token.is_suspicious(network: IPv4Network | IPv6Network, request: Request, renew: bool = False)[source]¶: Checks whether a valid ping exists for this (client) network; if not, the request is rated as suspicious. If a valid ping exists and the argument renew is True, the expire time of this ping is reset to PING_LIVE_TIME.

searx.botdetection.link_token.ping(request: Request, token: str)[source]¶: This function is called by a request to URL /client<token>.css. If token is valid a PING_KEY for the client is stored in the DB. The expire time of this ping-key is PING_LIVE_TIME.

searx.botdetection.link_token.get_ping_key(network: IPv4Network | IPv6Network, request: Request) → str[source]¶: Generates a hashed key that corresponds (more or less) to a web browser session in a network.

searx.botdetection.link_token.get_token() → str[source]¶

Returns the current token. If there is no currently active token, a new token is generated randomly and stored in the Valkey DB. Without a database connection, the string “12345678” is returned.

See also: TOKEN_LIVE_TIME, TOKEN_KEY

Probe HTTP headers ¶

Method `http_accept`¶

The http_accept method evaluates a request as the request of a bot if the Accept header does not contain text/html.

Method `http_accept_encoding`¶

The http_accept_encoding method evaluates a request as the request of a bot if the Accept-Encoding header contains neither gzip nor deflate (that is, if both values are missing).

Method `http_accept_language`¶

The http_accept_language method evaluates a request as the request of a bot if the Accept-Language header is unset.

Method `http_connection`¶

The http_connection method evaluates a request as the request of a bot if the Connection header is set to close.
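The four header probes described so far can be summarised in one small function. This is a sketch, not the code of the individual methods: the function probe_headers, the plain dict used for headers, and the parsing of Accept-Encoding are assumptions, and for Accept-Encoding only the gzip/deflate condition is checked.

```python
def probe_headers(headers):
    """Return the names of the probes that rate the request as a bot."""
    failed = []

    if "text/html" not in headers.get("Accept", ""):
        failed.append("http_accept")

    # strip optional quality values such as 'gzip;q=1.0'
    encodings = [
        item.split(";")[0].strip()
        for item in headers.get("Accept-Encoding", "").split(",")
    ]
    if "gzip" not in encodings and "deflate" not in encodings:
        failed.append("http_accept_encoding")

    if not headers.get("Accept-Language", "").strip():
        failed.append("http_accept_language")

    if headers.get("Connection", "").strip().lower() == "close":
        failed.append("http_connection")

    return failed


browser_like = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
}
print(probe_headers(browser_like))
```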

Method `http_user_agent`¶

The http_user_agent method evaluates a request as the request of a bot if the User-Agent header is unset or matches the regular expression USER_AGENT.

searx.botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|HeadlessChrome|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'¶: Regular expression that matches to User-Agent from known bots
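A check against such a pattern might look like the sketch below. The pattern is a shortened subset of USER_AGENT chosen for readability, the function name is_bot_user_agent is hypothetical, and matching from the start of the string with re.match is an assumption of this example.

```python
import re

# Shortened subset of the USER_AGENT pattern, for illustration only
USER_AGENT_SUBSET = (
    r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|python-requests"
    r"|Go-http-client|HeadlessChrome|.*PetalBot.*)"
)
_regexp = re.compile(USER_AGENT_SUBSET)


def is_bot_user_agent(user_agent):
    """Rate an unset User-Agent or one matching the pattern as a bot."""
    if not user_agent:
        return True
    return _regexp.match(user_agent) is not None


print(is_bot_user_agent("curl/8.5.0"))
print(is_bot_user_agent("Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"))
```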

Method `http_sec_fetch`¶

The http_sec_fetch method protects resources from web attacks with Fetch Metadata. A request is filtered out in case of:

HTTP header Sec-Fetch-Mode is invalid
HTTP header Sec-Fetch-Dest is invalid

searx.botdetection.http_sec_fetch.is_browser_supported(user_agent: str) → bool[source]¶

Check if the browser supports Sec-Fetch headers.

https://caniuse.com/mdn-http_headers_sec-fetch-dest
https://caniuse.com/mdn-http_headers_sec-fetch-mode
https://caniuse.com/mdn-http_headers_sec-fetch-site

Supported browsers:

Chrome >= 80
Firefox >= 90
Safari >= 16.4
Edge (mirrors Chrome)
Opera (mirrors Chrome)
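The version thresholds listed above could be checked roughly as follows. The regular expressions used to pull version numbers out of the User-Agent string, and the function name supports_sec_fetch, are assumptions of this sketch; real User-Agent strings vary and the actual is_browser_supported may parse them differently.

```python
import re


def supports_sec_fetch(user_agent):
    """Guess from the User-Agent whether Sec-Fetch-* headers are expected."""
    ua = user_agent.lower()

    # Edge and Opera carry a Chrome/<version> token, so this covers them too
    match = re.search(r"(?:chrome|chromium)/(\d+)", ua)
    if match:
        return int(match.group(1)) >= 80

    match = re.search(r"firefox/(\d+)", ua)
    if match:
        return int(match.group(1)) >= 90

    if "safari" in ua:
        match = re.search(r"version/(\d+)\.(\d+)", ua)
        if match:
            return (int(match.group(1)), int(match.group(2))) >= (16, 4)

    return False


print(supports_sec_fetch("Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"))
```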

Config ¶

Configuration class Config with deep-update, schema validation and deprecated names.

The Config class implements a configuration that is based on structured dictionaries. The configuration schema is defined in a dictionary structure and the configuration data is given in a dictionary structure.

class searx.botdetection.config.Config(cfg_schema: dict[str, Any], deprecated: dict[str, str])[source]¶

Base class used for configuration

validate(cfg: dict[str, Any])[source]¶: Validates the dictionary cfg against Config.SCHEMA. The validation itself is done by validate.

update(upd_cfg: dict[str, Any])[source]¶: Update this configuration by upd_cfg.

default(name: str)[source]¶: Returns default value of field name in self.cfg_schema.

get(name: str, default: ~typing.Any = <UNSET>, replace: bool = True) → Any[source]¶

Returns the value to which name points in the configuration.

If there is no such name in the config and the default is UNSET, a KeyError is raised.
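The deep-update and lookup behaviour can be illustrated on plain dictionaries. This sketch is not the Config class: the functions deep_update and get_value, the UNSET sentinel defined here, and the assumption that names are dot-separated paths into the nested dictionaries are all choices made for this example.

```python
UNSET = object()  # sentinel meaning "no default given"


def deep_update(base, upd):
    """Merge upd into base recursively instead of replacing whole sub-dicts."""
    for key, value in upd.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def get_value(cfg, name, default=UNSET):
    """Look up a dotted name; raise KeyError if missing and no default is set."""
    node = cfg
    for part in name.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is UNSET:
                raise KeyError(name)
            return default
        node = node[part]
    return node


cfg = {"botdetection": {"ipv4_prefix": 32, "ip_limit": {"link_token": False}}}
deep_update(cfg, {"botdetection": {"ip_limit": {"link_token": True}}})
print(get_value(cfg, "botdetection.ip_limit.link_token"))
```

Note that the update only touches link_token; the sibling value ipv4_prefix is left in place, which is the point of a deep update.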

set(name: str, val: Any)[source]¶

Set the value to which name points in the configuration.

If there is no such name in the config, a KeyError is raised.

path(name: str, default: ~typing.Any = <UNSET>)[source]¶: Get a pathlib.Path object from a config string.

pyobj(name: str, default: ~typing.Any = <UNSET>)[source]¶: Get the Python object referred to by the fully qualified name (FQN) in the config string.

final exception searx.botdetection.config.SchemaIssue(level: Literal['warn', 'invalid'], msg: str)[source]¶: Exception to store and/or raise a message from a schema issue.