Bot Detection

Implementations used for bot detection.

searx.botdetection.get_network(real_ip: IPv4Address | IPv6Address, cfg: config.Config) → IPv4Network | IPv6Network

Returns the (client) network that real_ip is part of.

The ipv4_prefix and ipv6_prefix define the number of leading bits in an address that are compared to determine whether or not an address is part of a (client) network.

[botdetection]

ipv4_prefix = 32
ipv6_prefix = 48
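
The prefix logic can be illustrated with the standard ipaddress module. This is a minimal sketch, not the module's own code: the function name client_network and its keyword defaults are assumptions that simply mirror the configuration values shown above.

```python
from ipaddress import ip_address, ip_network


def client_network(real_ip, ipv4_prefix=32, ipv6_prefix=48):
    """Return the network containing real_ip, cut at the configured prefix.

    Hypothetical helper for illustration; the defaults mirror the
    [botdetection] example values.
    """
    prefix = ipv4_prefix if real_ip.version == 4 else ipv6_prefix
    # strict=False masks the host bits instead of raising ValueError
    return ip_network(f"{real_ip}/{prefix}", strict=False)


print(client_network(ip_address("192.0.2.77")))             # 192.0.2.77/32
print(client_network(ip_address("2001:db8:aaaa:bbbb::1")))  # 2001:db8:aaaa::/48
```

With ipv4_prefix = 32 every IPv4 address is its own network, while ipv6_prefix = 48 groups all addresses of a /48 allocation into one client network.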
searx.botdetection.too_many_requests(network: IPv4Network | IPv6Network, log_msg: str) → Response | None

Returns an HTTP 429 response object and writes an ERROR message to the ‘botdetection’ logger. This function is used in part by the filter methods to return the default Too Many Requests response.

class searx.botdetection.ProxyFix(wsgi_app: WSGIApplication)

A middleware like Werkzeug's ProxyFix class, where the x_for argument is replaced by a method that determines the number of trusted proxies via the botdetection.trusted_proxies setting.

The remote IP (flask.Request.remote_addr) of the request is taken from (first match):

  • X-Forwarded-For: If the header is set, the first untrusted IP that comes before the IPs that are still part of the botdetection.trusted_proxies is used.

  • X-Real-IP: If X-Forwarded-For is not set, X-Real-IP is used (botdetection.trusted_proxies is ignored).

If neither header is set, the REMOTE_ADDR from the WSGI layer is used. If (for whatever reason) no IP can be determined, an error message is logged and 100:: is used instead (RFC 6666).
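
The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the middleware's code: resolve_remote_addr, its parameters and the plain dict of headers are hypothetical, and falling back to REMOTE_ADDR when every forwarded hop is trusted is a choice made only for this sketch.

```python
from ipaddress import ip_address, ip_network

DISCARD_ADDR = "100::"  # IPv6 discard prefix from RFC 6666


def resolve_remote_addr(headers, remote_addr, trusted_proxies):
    """Pick the client IP from proxy headers (hypothetical helper)."""
    trusted = [ip_network(net, strict=False) for net in trusted_proxies]
    forwarded = headers.get("X-Forwarded-For")
    if forwarded:
        # Walk from the hop nearest to the server back towards the client
        # and stop at the first address that is not a trusted proxy.
        for hop in reversed(forwarded.split(",")):
            try:
                addr = ip_address(hop.strip())
            except ValueError:
                continue  # skip malformed entries
            if not any(addr in net for net in trusted):
                return str(addr)
    elif headers.get("X-Real-IP"):
        # trusted_proxies is ignored for X-Real-IP
        return headers["X-Real-IP"].strip()
    # Neither header helped: use the WSGI REMOTE_ADDR or the discard address.
    return remote_addr or DISCARD_ADDR


headers = {"X-Forwarded-For": "203.0.113.9, 198.51.100.7, 10.0.0.2"}
print(resolve_remote_addr(headers, "10.0.0.2", ["10.0.0.0/8"]))  # 198.51.100.7
```

In the example, 10.0.0.2 is a trusted proxy, so 198.51.100.7 is the first untrusted address; the left-most entry 203.0.113.9 is not used because a client can set it freely.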

IP lists

Method ip_lists

The ip_lists method implements a block list and a pass list.

[botdetection.ip_lists]

pass_ip = [
  '167.235.158.251', # IPv4 of check.searx.space
  '192.168.0.0/16',  # IPv4 private network
  'fe80::/10',       # IPv6 linklocal
]

block_ip = [
  '93.184.216.34',   # IPv4 of example.org
  '257.1.1.1',       # invalid IP --> will be ignored, logged in ERROR class
]
searx.botdetection.ip_lists.SEARXNG_ORG = ['167.235.158.251', '2a01:04f8:1c1c:8fc2::/64']

Passlist of IPs from the SearXNG organization, e.g. check.searx.space.

searx.botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) → Tuple[bool, str]

Checks whether real_ip matches one of the members (single addresses or subnets) of the botdetection.ip_lists.pass_ip list.

searx.botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) → Tuple[bool, str]

Checks whether real_ip matches one of the members (single addresses or subnets) of the botdetection.ip_lists.block_ip list.
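
Both checks boil down to testing an address against a list of addresses and networks. The sketch below shows one way to do that with the standard library; match_ip_list and its message strings are hypothetical, and only the (bool, str) return shape and the "invalid entries are ignored and logged as ERROR" behaviour are taken from the text above.

```python
import logging
from ipaddress import ip_address, ip_network

logger = logging.getLogger("botdetection")


def match_ip_list(real_ip, entries):
    """Return (True, reason) if real_ip is covered by one of the entries."""
    for entry in entries:
        try:
            # A bare address such as '93.184.216.34' becomes a /32 network.
            net = ip_network(entry, strict=False)
        except ValueError:
            logger.error("invalid IP %s in list, entry is ignored", entry)
            continue
        if real_ip in net:
            return True, f"IP matches {entry}"
    return False, "IP is not a member of the list"


pass_list = ["167.235.158.251", "192.168.0.0/16", "fe80::/10"]
print(match_ip_list(ip_address("192.168.1.20"), pass_list))
```

Membership tests between an IPv4 address and an IPv6 network (or the reverse) are simply false, so mixed lists like the one in the configuration example need no special handling.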

Rate limit

Method ip_limit

The ip_limit method counts requests from an IP in sliding windows. If there are too many requests in a sliding window, the request is evaluated as a bot request. This method requires a Valkey DB and needs an HTTP X-Forwarded-For header. To protect privacy, only the hash value of an IP is stored in the Valkey DB, and for a maximum of 10 minutes.
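
The idea of counting hashed IPs in a sliding window can be sketched without a database. The class below is an in-memory stand-in for the Valkey DB and is not the module's implementation: SlidingWindowCounter, the secret value and the SHA-256 keying are all assumptions chosen for illustration. The window of 20 seconds and the limit of 15 requests mirror the documented BURST_WINDOW and BURST_MAX values.

```python
import hashlib
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count requests per hashed IP inside a time window (illustrative)."""

    def __init__(self, window, max_requests, secret=b"hypothetical-secret"):
        self.window = window
        self.max_requests = max_requests
        self.secret = secret
        self.hits = defaultdict(deque)  # hashed IP -> request timestamps

    def _key(self, ip):
        # Only the hash of the IP is kept, never the address itself.
        return hashlib.sha256(self.secret + ip.encode()).hexdigest()

    def too_many(self, ip, now=None):
        now = time.monotonic() if now is None else now
        hits = self.hits[self._key(ip)]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.max_requests


burst = SlidingWindowCounter(window=20, max_requests=15)
results = [burst.too_many("192.0.2.1", now=i * 0.1) for i in range(16)]
print(results[-1])  # True: the 16th request inside 20 seconds exceeds the limit
```

A database-backed variant would typically use an expiring key per window instead of a deque, which also gives the bounded storage time mentioned above.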

The link_token method can be used to investigate whether a request is suspicious. To activate the link_token method in the ip_limit method add the following configuration:

[botdetection.ip_limit]
link_token = true

If the link_token method is activated and a request is suspicious, the request rates are reduced to BURST_MAX_SUSPICIOUS and LONG_MAX_SUSPICIOUS.

To intercept bots that get their IPs from a range of IPs, there is a SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored for a longer time. IPs stored in this sliding window have a maximum of SUSPICIOUS_IP_MAX accesses before they are blocked. As soon as the IP makes a request that is not suspicious, the sliding window for this IP is dropped.

searx.botdetection.ip_limit.BURST_WINDOW = 20

Time (sec) before sliding window for burst requests expires.

searx.botdetection.ip_limit.BURST_MAX = 15

Maximum requests from one IP in the BURST_WINDOW

searx.botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2

Maximum of suspicious requests from one IP in the BURST_WINDOW

searx.botdetection.ip_limit.LONG_WINDOW = 600

Time (sec) before the longer sliding window expires.

searx.botdetection.ip_limit.LONG_MAX = 150

Maximum requests from one IP in the LONG_WINDOW

searx.botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10

Maximum suspicious requests from one IP in the LONG_WINDOW

searx.botdetection.ip_limit.API_WINDOW = 3600

Time (sec) before sliding window for API requests (format != html) expires.

searx.botdetection.ip_limit.API_MAX = 4

Maximum requests from one IP in the API_WINDOW

searx.botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000

Time (sec) before sliding window for one suspicious IP expires.

searx.botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3

Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.
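
How the constants interact can be summarized in a small decision function. This is a sketch under assumptions: exceeded_window is a hypothetical helper, the hit counts are passed in by the caller rather than read from the DB, and the order of the checks is chosen for readability only.

```python
# Values copied from the constants documented above.
BURST_MAX, BURST_MAX_SUSPICIOUS = 15, 2
LONG_MAX, LONG_MAX_SUSPICIOUS = 150, 10
SUSPICIOUS_IP_MAX = 3


def exceeded_window(burst_hits, long_hits, suspicious=False, suspicious_ip_hits=0):
    """Return the name of the first window whose limit is exceeded, else None."""
    if suspicious:
        # Suspicious requests get the reduced rates and are also tracked
        # in the long-lived SUSPICIOUS_IP_WINDOW.
        if suspicious_ip_hits > SUSPICIOUS_IP_MAX:
            return "SUSPICIOUS_IP_WINDOW"
        burst_max, long_max = BURST_MAX_SUSPICIOUS, LONG_MAX_SUSPICIOUS
    else:
        burst_max, long_max = BURST_MAX, LONG_MAX
    if burst_hits > burst_max:
        return "BURST_WINDOW"
    if long_hits > long_max:
        return "LONG_WINDOW"
    return None


print(exceeded_window(3, 3))                   # None
print(exceeded_window(3, 3, suspicious=True))  # BURST_WINDOW
```

The same three requests in 20 seconds pass for a normal client but trip the burst limit once the request is rated as suspicious.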

Lifetime (sec) of limiter’s CSS token.

Lifetime (sec) of the ping-key from a client (request)

Prefix of all ping-keys generated by get_ping_key

Key for which the current token is stored in the DB

Checks whether a valid ping exists for this (client) network; if not, this request is rated as suspicious. If a valid ping exists and the argument renew is True, the expire time of this ping is reset to PING_LIVE_TIME.

This function is called by a request to URL /client<token>.css. If token is valid a PING_KEY for the client is stored in the DB. The expire time of this ping-key is PING_LIVE_TIME.

Generates a hashed key that corresponds (more or less) to a web browser session in a network.

Returns the current token. If there is no currently active token, a new token is generated randomly and stored in the Valkey DB. Without a database connection, the string “12345678” is returned.

Probe HTTP headers

Method http_accept

The http_accept method evaluates a request as the request of a bot if the Accept header:

  • does not contain text/html

Method http_accept_encoding

The http_accept_encoding method evaluates a request as the request of a bot if the Accept-Encoding header:

  • contains neither gzip nor deflate (both values are missing)

Method http_accept_language

The http_accept_language method evaluates a request as the request of a bot if the Accept-Language header is unset.

Method http_connection

The http_connection method evaluates a request as the request of a bot if the Connection header is set to close.
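
The four header probes above can be combined into one illustrative check. This is a sketch, not the module's code: probe_headers is a hypothetical helper that takes a plain dict with canonically spelled header names, whereas a real request object would offer case-insensitive lookup.

```python
def probe_headers(headers):
    """Return the reason a request looks like a bot request, or None."""
    if "text/html" not in headers.get("Accept", ""):
        return "Accept does not contain text/html"

    encodings = [e.strip() for e in headers.get("Accept-Encoding", "").split(",")]
    if "gzip" not in encodings and "deflate" not in encodings:
        return "Accept-Encoding contains neither gzip nor deflate"

    if not headers.get("Accept-Language", "").strip():
        return "Accept-Language is unset"

    if headers.get("Connection", "").strip().lower() == "close":
        return "Connection is set to close"

    return None


browser = {
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
}
print(probe_headers(browser))            # None
print(probe_headers({"Accept": "*/*"}))  # Accept does not contain text/html
```

Each probe is cheap and stateless, which is why such checks are usually run before any rate limiting that needs a database lookup.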

Method http_user_agent

The http_user_agent method evaluates a request as the request of a bot if the User-Agent header is unset or matches the regular expression USER_AGENT.

searx.botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|HeadlessChrome|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'

Regular expression that matches the User-Agent of known bots.
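
A shortened version of the pattern is enough to show how such a check behaves. The sketch below is illustrative only: USER_AGENT_SKETCH contains just a few alternatives from the full expression, is_bot_user_agent is a hypothetical helper, and the use of re.search (match anywhere in the string) is an assumption of this sketch.

```python
import re

# A small subset of the alternatives in USER_AGENT, for illustration.
USER_AGENT_SKETCH = re.compile(
    r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|python-requests|Googlebot|HeadlessChrome)"
)


def is_bot_user_agent(user_agent):
    """True if the header is unset/empty or matches the bot pattern."""
    if not user_agent:
        return True
    return USER_AGENT_SKETCH.search(user_agent) is not None


print(is_bot_user_agent("curl/8.5.0"))  # True
print(is_bot_user_agent(
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
))  # False
```

Since the User-Agent header is fully controlled by the client, this probe only filters out tools that announce themselves honestly.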

Method http_sec_fetch

The http_sec_fetch method protects resources from web attacks with Fetch Metadata. A request is filtered out in case of:

searx.botdetection.http_sec_fetch.is_browser_supported(user_agent: str) → bool

Check if the browser supports Sec-Fetch headers.

  • https://caniuse.com/mdn-http_headers_sec-fetch-dest

  • https://caniuse.com/mdn-http_headers_sec-fetch-mode

  • https://caniuse.com/mdn-http_headers_sec-fetch-site

Supported browsers:

  • Chrome >= 80

  • Firefox >= 90

  • Safari >= 16.4

  • Edge (mirrors Chrome)

  • Opera (mirrors Chrome)
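
The version thresholds above can be checked with a few regular expressions. This is a minimal sketch, assuming common desktop User-Agent formats; supports_sec_fetch is a hypothetical name and the parsing rules are not taken from the module itself.

```python
import re


def supports_sec_fetch(user_agent):
    """Rough check of the documented minimum browser versions."""
    # Chrome, Edge and Opera are Chromium based and carry a Chrome/<major> token.
    m = re.search(r"Chrome/(\d+)", user_agent)
    if m:
        return int(m.group(1)) >= 80
    m = re.search(r"Firefox/(\d+)", user_agent)
    if m:
        return int(m.group(1)) >= 90
    # Safari reports its release in the Version/<major>.<minor> token.
    m = re.search(r"Version/(\d+)\.(\d+).*Safari/", user_agent)
    if m:
        return (int(m.group(1)), int(m.group(2))) >= (16, 4)
    return False


safari = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
          "(KHTML, like Gecko) Version/16.4 Safari/605.1.15")
print(supports_sec_fetch(safari))  # True
```

The Chrome test runs first because Chromium User-Agent strings also contain a Safari/ token, which would otherwise be misread as Safari.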

Config

Configuration class Config with deep-update, schema validation and deprecated names.

The Config class implements a configuration that is based on structured dictionaries. The configuration schema is defined in a dictionary structure and the configuration data is given in a dictionary structure.

class searx.botdetection.config.Config(cfg_schema: dict[str, Any], deprecated: dict[str, str])

Base class used for configuration

validate(cfg: dict[str, Any])

Validates the dictionary cfg against Config.SCHEMA. Validation is done by validate.

update(upd_cfg: dict)

Update this configuration by upd_cfg.

default(name: str)

Returns the default value of the field name in self.cfg_schema.

get(name: str, default: Any = <UNSET>, replace: bool = True) → Any

Returns the value to which name points in the configuration.

If there is no such name in the config and the default is UNSET, a KeyError is raised.

set(name: str, val)

Set the value to which name points in the configuration.

If there is no such name in the config, a KeyError is raised.

path(name: str, default=<UNSET>)

Get a pathlib.Path object from a config string.

pyobj(name, default=<UNSET>)

Get the Python object referred to by the fully qualified name (FQN) in the config string.
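
The combination of deep-update and name-based lookup with an UNSET sentinel can be sketched on plain dictionaries. The code below does not use the Config class: deep_update, get_by_name and the dotted-name convention are assumptions for illustration, modelled on setting names such as botdetection.ip_limit.link_token used elsewhere in this document.

```python
UNSET = object()  # sentinel: distinguishes "no default given" from None


def deep_update(base, upd):
    """Merge upd into base, descending into nested dictionaries."""
    for key, val in upd.items():
        if isinstance(val, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], val)
        else:
            base[key] = val


def get_by_name(cfg, name, default=UNSET):
    """Resolve a dotted name; raise KeyError if missing and no default given."""
    node = cfg
    for part in name.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is UNSET:
                raise KeyError(name)
            return default
        node = node[part]
    return node


cfg = {"botdetection": {"ipv4_prefix": 32, "ip_limit": {"link_token": False}}}
deep_update(cfg, {"botdetection": {"ip_limit": {"link_token": True}}})
print(get_by_name(cfg, "botdetection.ip_limit.link_token"))  # True
print(get_by_name(cfg, "botdetection.ipv4_prefix"))          # 32
```

The deep update leaves ipv4_prefix untouched while overriding only link_token, which is the difference to a plain dict.update on the top-level key.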

exception searx.botdetection.config.SchemaIssue(level: Literal['warn', 'invalid'], msg: str)

Exception to store and/or raise a message from a schema issue.