Bot Detection
Implementations used for bot detection.
- searx.botdetection.get_network(real_ip: IPv4Address | IPv6Address, cfg: config.Config) -> IPv4Network | IPv6Network
Returns the (client) network that real_ip is part of.
The ipv4_prefix and ipv6_prefix settings define the number of leading bits in an address that are compared to determine whether an address is part of a (client) network.
[botdetection]
ipv4_prefix = 32
ipv6_prefix = 48
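The prefix comparison described above can be sketched with the standard ipaddress module. This is a minimal illustration, not the SearXNG implementation: the function name client_network and its keyword arguments are hypothetical, and the defaults simply mirror the ipv4_prefix and ipv6_prefix values shown in the configuration.

```python
from ipaddress import ip_address, ip_network


def client_network(real_ip, ipv4_prefix=32, ipv6_prefix=48):
    """Return the network that contains real_ip (hypothetical helper).

    The prefix length is chosen by address family, as ipv4_prefix and
    ipv6_prefix do in the [botdetection] settings.
    """
    prefix = ipv4_prefix if real_ip.version == 4 else ipv6_prefix
    # strict=False masks the host bits instead of raising ValueError.
    return ip_network(f"{real_ip}/{prefix}", strict=False)


# With a /32 prefix every IPv4 address is its own network.
v4_net = client_network(ip_address("192.0.2.7"))
# With a /48 prefix all addresses of one IPv6 site share a network.
v6_net = client_network(ip_address("2001:db8:1:2::1"))
```

With these defaults, rate limits apply per single IPv4 address but per /48 block for IPv6, where one client typically controls many addresses.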
- searx.botdetection.too_many_requests(network: IPv4Network | IPv6Network, log_msg: str) -> Response | None
Returns an HTTP 429 response object and writes an ERROR message to the 'botdetection' logger. The filter methods use this function to return the default Too Many Requests response.
- class searx.botdetection.ProxyFix(wsgi_app: WSGIApplication)
A middleware like Werkzeug's ProxyFix class, where the x_for argument is replaced by a method that determines the number of trusted proxies via the botdetection.trusted_proxies setting.
The remote IP (flask.Request.remote_addr) of the request is taken from (first match):
- X-Forwarded-For: If the header is set, the first untrusted IP that comes before the IPs that are still part of botdetection.trusted_proxies is used.
- X-Real-IP: If X-Forwarded-For is not set, X-Real-IP is used (botdetection.trusted_proxies is ignored).
If none of the headers is set, the REMOTE_ADDR from the WSGI layer is used. If (for whatever reason) no IP can be determined, an error message is displayed and 100:: is used instead (RFC 6666).
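The X-Forwarded-For rule above amounts to walking the header from right to left and skipping trusted proxies. The following is only a sketch of that idea under stated assumptions: the function name and its arguments are hypothetical, and the real middleware may handle malformed headers differently.

```python
from ipaddress import ip_address, ip_network


def client_ip_from_forwarded_for(header_value, trusted_proxies):
    """Return the first untrusted IP, scanning right to left (hypothetical).

    header_value is the raw X-Forwarded-For value; trusted_proxies is a
    list of addresses or networks in the style of
    botdetection.trusted_proxies.
    """
    trusted = [ip_network(net, strict=False) for net in trusted_proxies]
    hops = [hop.strip() for hop in header_value.split(",") if hop.strip()]
    # The rightmost entries were appended by the proxies closest to us.
    for hop in reversed(hops):
        try:
            addr = ip_address(hop)
        except ValueError:
            return None  # unparsable entry: let the caller fall back
        if not any(addr.version == net.version and addr in net for net in trusted):
            return str(addr)
    return None  # every hop was a trusted proxy


TRUSTED = ["127.0.0.0/8", "10.0.0.0/8"]
remote = client_ip_from_forwarded_for("203.0.113.5, 10.0.0.2, 127.0.0.1", TRUSTED)
```

Scanning from the right means that entries a client prepends to the header itself are never reached, because the scan stops at the first address that is not a trusted proxy.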
IP lists
Method ip_lists
The ip_lists method implements a block-list and a pass-list.
[botdetection.ip_lists]
pass_ip = [
'167.235.158.251', # IPv4 of check.searx.space
'192.168.0.0/16', # IPv4 private network
'fe80::/10', # IPv6 linklocal
]
block_ip = [
'93.184.216.34', # IPv4 of example.org
'257.1.1.1', # invalid IP --> will be ignored, logged in ERROR class
]
- searx.botdetection.ip_lists.SEARXNG_ORG = ['167.235.158.251', '2a01:04f8:1c1c:8fc2::/64']
Passlist of IPs from the SearXNG organization, e.g. check.searx.space.
- searx.botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) -> Tuple[bool, str]
Checks whether the IP is in one of the members (addresses or subnets) of the botdetection.ip_lists.pass_ip list.
- searx.botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) -> Tuple[bool, str]
Checks whether the IP is in one of the members (addresses or subnets) of the botdetection.ip_lists.block_ip list.
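A membership test of this kind can be sketched with the ipaddress module. The helper below is hypothetical (name, message texts and return values are assumptions); it only mirrors the documented Tuple[bool, str] shape and the rule that invalid entries such as '257.1.1.1' are ignored.

```python
from ipaddress import ip_address, ip_network


def ip_in_list(real_ip, entries):
    """Return (matched, message) for real_ip against a list of IPs/subnets."""
    for entry in entries:
        try:
            net = ip_network(entry, strict=False)
        except ValueError:
            # Invalid entry: skipped here (the docs say it is logged as ERROR).
            continue
        if real_ip.version == net.version and real_ip in net:
            return True, f"IP matches {net.compressed}"
    return False, "IP is not in the list"


pass_list = ["167.235.158.251", "192.168.0.0/16", "fe80::/10"]
matched, msg = ip_in_list(ip_address("192.168.1.20"), pass_list)
```

A single address such as '167.235.158.251' is parsed as a /32 network, so addresses and subnets go through the same check.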
Rate limit
Method ip_limit
The ip_limit method counts requests from an IP in sliding windows. If there are too many requests in a sliding window, the request is evaluated as a bot request. This method requires a valkey DB and needs an HTTP X-Forwarded-For header. To protect privacy, only the hash value of an IP is stored in the valkey DB, and for a maximum of 10 minutes.
The link_token method can be used to investigate whether a request is suspicious. To activate the link_token method in the ip_limit method, add the following configuration:
[botdetection.ip_limit]
link_token = true
If the link_token method is activated and a request is suspicious, the request rates are reduced to BURST_MAX_SUSPICIOUS and LONG_MAX_SUSPICIOUS.
To intercept bots that get their IPs from a range of IPs, there is a SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored for a longer time. IPs stored in this sliding window have a maximum of SUSPICIOUS_IP_MAX accesses before they are blocked. As soon as the IP makes a request that is not suspicious, the sliding window for this IP is dropped.
- searx.botdetection.ip_limit.BURST_WINDOW = 20
Time (sec) before the sliding window for burst requests expires.
- searx.botdetection.ip_limit.BURST_MAX = 15
Maximum requests from one IP in the BURST_WINDOW.
- searx.botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2
Maximum suspicious requests from one IP in the BURST_WINDOW.
- searx.botdetection.ip_limit.LONG_WINDOW = 600
Time (sec) before the longer sliding window expires.
- searx.botdetection.ip_limit.LONG_MAX = 150
Maximum requests from one IP in the LONG_WINDOW.
- searx.botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10
Maximum suspicious requests from one IP in the LONG_WINDOW.
- searx.botdetection.ip_limit.API_WINDOW = 3600
Time (sec) before the sliding window for API requests (format != html) expires.
- searx.botdetection.ip_limit.API_MAX = 4
Maximum requests from one IP in the API_WINDOW.
- searx.botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000
Time (sec) before the sliding window for one suspicious IP expires.
- searx.botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3
Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.
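To make the window/maximum pairs above concrete, here is a minimal in-memory sketch of a sliding-window counter. It is an illustration only: the class name is hypothetical, it keeps timestamps in process memory rather than in a valkey DB, and the real counting scheme may differ.

```python
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count hits per key inside a time window (hypothetical, in-memory)."""

    def __init__(self, window, max_requests):
        self.window = window              # seconds, e.g. BURST_WINDOW = 20
        self.max_requests = max_requests  # e.g. BURST_MAX = 15
        self._hits = defaultdict(deque)

    def hit(self, key, now=None):
        """Record one request; return True if the limit is exceeded."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.max_requests


# Values taken from BURST_WINDOW and BURST_MAX above.
burst = SlidingWindowCounter(window=20, max_requests=15)
```

The key would be a hash of the client network rather than the IP itself, in line with the privacy note for ip_limit.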
Method link_token
The link_token method evaluates a request as suspicious if the URL /client<token>.css is not requested by the client. Because the URL contains a random component (the token), a bot cannot send a ping by requesting a static URL.
Note
This method requires a valkey DB and needs an HTTP X-Forwarded-For header.
To use this method, a Flask URL route needs to be added:
@app.route('/client<token>.css', methods=['GET', 'POST'])
def client_token(token=None):
    link_token.ping(request, token)
    return Response('', mimetype='text/css')
The Flask HTML template also needs a stylesheet link (the value of link_token comes from get_token):
<link rel="stylesheet"
href="{{ url_for('client_token', token=link_token) }}"
type="text/css" >
- searx.botdetection.link_token.TOKEN_LIVE_TIME = 600
Lifetime (sec) of the limiter's CSS token.
- searx.botdetection.link_token.PING_LIVE_TIME = 3600
Lifetime (sec) of the ping-key from a client (request).
- searx.botdetection.link_token.PING_KEY = 'SearXNG_limiter.ping'
Prefix of all ping-keys generated by get_ping_key.
- searx.botdetection.link_token.TOKEN_KEY = 'SearXNG_limiter.token'
Key under which the current token is stored in the DB.
- searx.botdetection.link_token.is_suspicious(network: IPv4Network | IPv6Network, request: Request, renew: bool = False)
Checks whether a valid ping exists for this (client) network; if not, the request is rated as suspicious. If a valid ping exists and the argument renew is True, the expire time of this ping is reset to PING_LIVE_TIME.
- searx.botdetection.link_token.ping(request: Request, token: str)
This function is called by a request to the URL /client<token>.css. If token is valid, a PING_KEY for the client is stored in the DB. The expire time of this ping-key is PING_LIVE_TIME.
- searx.botdetection.link_token.get_ping_key(network: IPv4Network | IPv6Network, request: Request) -> str
Generates a hashed key that corresponds (more or less) to a web browser session in a network.
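One way to derive such a session-like key is to hash the network together with a few request headers. The sketch below is an assumption throughout: the function name, the choice of User-Agent and Accept-Language as inputs, the secret argument and the bracketed key layout are all hypothetical; only the 'SearXNG_limiter.ping' prefix comes from PING_KEY above.

```python
import hashlib
from ipaddress import ip_network

PING_KEY = "SearXNG_limiter.ping"  # prefix documented above


def sketch_ping_key(network, user_agent, accept_language, secret):
    """Build a hashed key for (network, browser) pairs (hypothetical)."""
    # Assumption: these two headers approximate "one browser session".
    material = f"{secret}|{network}|{accept_language}|{user_agent}"
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    # Only the digest is kept, so the network itself is not stored.
    return f"{PING_KEY}[{digest}]"


example_key = sketch_ping_key(
    ip_network("192.0.2.0/24"), "Mozilla/5.0", "en-US", "server-secret"
)
```

Two clients in the same network with different browsers get different keys, which is what "more or less a browser session" suggests.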
Probe HTTP headers
Method http_accept
The http_accept method evaluates a request as the request of a bot if the Accept header does not contain text/html.
Method http_accept_encoding
The http_accept_encoding method evaluates a request as the request of a bot if the Accept-Encoding header contains neither gzip nor deflate (that is, both values are missing).
Method http_accept_language
The http_accept_language method evaluates a request as the request of a bot if the Accept-Language header is unset.
Method http_connection
The http_connection method evaluates a request as the request of a bot if the Connection header is set to close.
Method http_user_agent
The http_user_agent method evaluates a request as the request of a bot if the User-Agent header is unset or matches the regular expression USER_AGENT.
- searx.botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|HeadlessChrome|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'¶
Regular expression that matches the User-Agent of known bots.
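The User-Agent check can be sketched with the re module. The pattern below is a shortened excerpt of USER_AGENT, chosen for readability, and the helper name is hypothetical. Matching from the start of the string is an assumption; the leading `.*` in the `.*PetalBot.*` alternative suggests it, but the text does not say so.

```python
import re

# Shortened excerpt of the USER_AGENT pattern documented above.
USER_AGENT_EXCERPT = (
    r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|python-requests"
    r"|Go-http-client|Googlebot|bingbot|.*PetalBot.*)"
)
_regexp = re.compile(USER_AGENT_EXCERPT)


def looks_like_bot(user_agent):
    """Return True if the header is unset or matches the pattern."""
    if not user_agent:
        return True  # unset User-Agent counts as a bot
    # Assumption: anchored at the start of the header value.
    return _regexp.match(user_agent) is not None


firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"
```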
Method http_sec_fetch
The http_sec_fetch method protects resources from web attacks with Fetch Metadata. A request is filtered out if:
- the HTTP header Sec-Fetch-Mode is invalid
- the HTTP header Sec-Fetch-Dest is invalid
- searx.botdetection.http_sec_fetch.is_browser_supported(user_agent: str) -> bool
Check if the browser supports Sec-Fetch headers.
https://caniuse.com/mdn-http_headers_sec-fetch-dest
https://caniuse.com/mdn-http_headers_sec-fetch-mode
https://caniuse.com/mdn-http_headers_sec-fetch-site
Supported browsers:
- Chrome >= 80
- Firefox >= 90
- Safari >= 16.4
- Edge (mirrors Chrome)
- Opera (mirrors Chrome)
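The version thresholds listed above could be checked roughly as follows. This is a sketch under assumptions: the function name and the regular expressions are hypothetical, User-Agent parsing is simplified, and Edge and Opera are covered only because their User-Agent strings usually carry a Chrome/ token.

```python
import re


def supports_sec_fetch(user_agent):
    """Guess from the User-Agent whether Sec-Fetch headers are sent."""
    # Chrome, plus Edge and Opera, which report a Chrome/ version as well.
    match = re.search(r"(?:Chrome|Chromium)/(\d+)", user_agent)
    if match:
        return int(match.group(1)) >= 80
    match = re.search(r"Firefox/(\d+)", user_agent)
    if match:
        return int(match.group(1)) >= 90
    # Safari reports its release in the Version/ token.
    if "Safari/" in user_agent:
        match = re.search(r"Version/(\d+)(?:\.(\d+))?", user_agent)
        if match:
            version = (int(match.group(1)), int(match.group(2) or 0))
            return version >= (16, 4)
    return False  # unknown browsers are treated as unsupported


chrome_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
```

Such a check keeps old browsers that never send Sec-Fetch headers from being filtered out as bots.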
Config
Configuration class Config with deep-update, schema validation and deprecated names.
The Config class implements a configuration that is based on structured dictionaries. The configuration schema is defined in a dictionary structure and the configuration data is given in a dictionary structure.
- class searx.botdetection.config.Config(cfg_schema: dict[str, Any], deprecated: dict[str, str])
Base class used for configuration.
- validate(cfg: dict[str, Any])
Validation of the dictionary cfg against Config.SCHEMA. Validation is done by validate.
- get(name: str, default: Any = <UNSET>, replace: bool = True) -> Any
Returns the value to which name points in the configuration.
If there is no such name in the config and the default is UNSET, a KeyError is raised.
- set(name: str, val)
Sets the value to which name points in the configuration.
If there is no such name in the config, a KeyError is raised.
- path(name: str, default=<UNSET>)
Gets a pathlib.Path object from a config string.
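The get and set semantics described above can be illustrated with a small stand-in class. This is not the Config class itself: DottedConfig, its UNSET sentinel and the dotted-name lookup are assumptions made for illustration, based on setting names such as botdetection.ipv4_prefix used elsewhere in this document.

```python
UNSET = object()  # sentinel meaning "no default given"


class DottedConfig:
    """Nested-dict config addressed by dotted names (hypothetical)."""

    def __init__(self, data):
        self.data = data

    def _parent(self, name):
        """Return the dict holding the last name component, and that component."""
        *parents, leaf = name.split(".")
        node = self.data
        for part in parents:
            node = node[part]  # raises KeyError for unknown sections
        return node, leaf

    def get(self, name, default=UNSET):
        try:
            node, leaf = self._parent(name)
            return node[leaf]
        except KeyError:
            if default is UNSET:
                raise
            return default

    def set(self, name, val):
        node, leaf = self._parent(name)
        if leaf not in node:
            raise KeyError(name)  # only existing names can be set
        node[leaf] = val


cfg = DottedConfig({"botdetection": {"ipv4_prefix": 32, "ipv6_prefix": 48}})
cfg.set("botdetection.ipv4_prefix", 24)
```

Refusing to set unknown names means that a typo in a setting name raises an error instead of silently creating a new entry.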