mirror of
https://github.com/searxng/searxng.git
synced 2025-10-19 04:50:44 -04:00
[mod] OpenAlex engine: revision of the engine (Paper result)

Revision of the engine / use of the result type Paper as well as other
typifications.

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>

parent 599d9488c5
commit 0691e50e13
docs/dev/engines/online/openalex.rst
@@ -1,100 +1,8 @@

The documentation page is reduced to a stub; its prose moves into the module
docstring of searx/engines/openalex.py (see the next file). New file content:

.. _openalex engine:

========
OpenAlex
========

.. automodule:: searx.engines.openalex
   :members:
searx/engines/openalex.py
@@ -1,14 +1,103 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# pylint: disable=missing-module-docstring
-#
-# Engine is documented in: docs/dev/engines/online/openalex.rst
+"""The OpenAlex engine integrates the `OpenAlex`_ Works API to return scientific
+paper results using the :ref:`result_types.paper` class. It is an "online" JSON
+engine that uses the official public API and does not require an API key.
+
+.. _OpenAlex: https://openalex.org
+.. _OpenAlex API overview: https://docs.openalex.org/how-to-use-the-api/api-overview
+
+Key features
+------------
+
+- Uses the official Works endpoint (JSON)
+- Paging support via ``page`` and ``per-page``
+- Relevance sorting (``sort=relevance_score:desc``)
+- Language filter support (maps SearXNG language to ``filter=language:<iso2>``)
+- Maps fields commonly used in scholarly results: title, authors, abstract
+  (reconstructed from inverted index), journal/venue, publisher, DOI, tags
+  (concepts), PDF/HTML links, pages, volume, issue, published date, and a short
+  citations comment
+- Supports OpenAlex "polite pool" by adding a ``mailto`` parameter
+
+Configuration
+=============
+
+Minimal example for :origin:`settings.yml <searx/settings.yml>`:
+
+.. code:: yaml
+
+  - name: openalex
+    engine: openalex
+    shortcut: oa
+    categories: science, scientific publications
+    timeout: 5.0
+    # Recommended by OpenAlex: join the polite pool with an email address
+    mailto: "[email protected]"
+
+Notes
+-----
+
+- The ``mailto`` key is optional but recommended by OpenAlex for better service.
+- Language is inherited from the user's UI language; when it is not ``all``, the
+  engine adds ``filter=language:<iso2>`` (e.g. ``language:fr``). If OpenAlex has
+  few results for that language, you may see fewer items.
+- Results typically include a main link. When the primary landing page from
+  OpenAlex is a DOI resolver, the engine will use that stable link. When an open
+  access link is available, it is exposed via the ``PDF`` and/or ``HTML`` links
+  in the result footer.
+
+What is returned
+================
+
+Each result uses the :ref:`result_types.paper` class and may include:
+
+- ``title`` and ``content`` (abstract; reconstructed from the inverted index)
+- ``authors`` (display names)
+- ``journal`` (host venue display name) and ``publisher``
+- ``doi`` (normalized to the plain DOI, without the ``https://doi.org/`` prefix)
+- ``tags`` (OpenAlex concepts display names)
+- ``pdf_url`` (Open access PDF if available) and ``html_url`` (landing page)
+- ``publishedDate`` (parsed from ``publication_date``)
+- ``pages``, ``volume``, ``number`` (issue)
+- ``type`` and a brief ``comments`` string with citation count
+
+Rate limits & polite pool
+=========================
+
+OpenAlex offers a free public API with generous daily limits. For extra courtesy
+and improved service quality, include a contact email in each request via
+``mailto``. You can set it directly in the engine configuration as shown above.
+See: `OpenAlex API overview`_.
+
+Troubleshooting
+===============
+
+- Few or no results in a non-English UI language:
+  Ensure the selected language has sufficient coverage at OpenAlex, or set the
+  UI language to English and retry.
+- Preference changes fail while testing locally:
+  Make sure your ``server.secret_key`` and ``server.base_url`` are set in your
+  instance settings so signed cookies work; see :ref:`settings server`.
+
+Implementation
+===============
+
+"""

 import typing as t

 from datetime import datetime
 from urllib.parse import urlencode
 from searx.result_types import EngineResults

 if t.TYPE_CHECKING:
     from searx.extended_types import SXNG_Response
+    from searx.search.processors import OnlineParams

 # about
 about = {
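The docstring above says the abstract is "reconstructed from inverted index": OpenAlex does not ship plain-text abstracts, it ships an `abstract_inverted_index` mapping each word to the positions where it occurs. A minimal standalone sketch of that reconstruction (illustrative only; the engine's `_reconstruct_abstract` is not fully visible in this diff):

```python
# Illustrative reconstruction of an abstract from an OpenAlex-style
# inverted index: word -> list of positions where the word occurs.
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    positions: dict[int, str] = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    # Re-emit the words in positional order to recover the plain text.
    return " ".join(word for _, word in sorted(positions.items()))


# Invented sample data shaped like `abstract_inverted_index`:
sample = {"the": [0, 4], "cat": [1], "sat": [2], "on": [3], "mat": [5]}
print(reconstruct_abstract(sample))  # -> "the cat sat on the mat"
```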
@@ -31,7 +120,7 @@ search_url = "https://api.openalex.org/works"
 mailto = ""


-def request(query: str, params: dict[str, t.Any]) -> None:
+def request(query: str, params: "OnlineParams") -> None:
     # Build OpenAlex query using search parameter and paging
     args = {
         "search": query,
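The `request` hunk builds the final URL with `urlencode`. A hedged sketch of the resulting query string, using the parameter names the docstring documents (`search`, `page`, `per-page`, `sort`, `mailto`); the concrete values are invented, and the engine's exact argument assembly is only partially visible in this diff:

```python
from urllib.parse import urlencode

search_url = "https://api.openalex.org/works"

# Parameter names follow the OpenAlex Works API as described in the module
# docstring; the values below are made up for illustration.
args = {
    "search": "graph neural networks",
    "page": 2,
    "per-page": 10,
    "sort": "relevance_score:desc",
    "mailto": "contact@example.org",  # placeholder address, not from the diff
}
url = f"{search_url}?{urlencode(args)}"
print(url)
```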
@@ -60,7 +149,7 @@ def request(query: str, params: dict[str, t.Any]) -> None:
     params["url"] = f"{search_url}?{urlencode(args)}"


-def response(resp: SXNG_Response) -> EngineResults:
+def response(resp: "SXNG_Response") -> EngineResults:
     data = resp.json()
     res = EngineResults()
@@ -71,12 +160,11 @@ def response(resp: SXNG_Response) -> EngineResults:
         authors = _extract_authors(item)
         journal, publisher, pages, volume, number, published_date = _extract_biblio(item)
         doi = _doi_to_plain(item.get("doi"))
-        tags = _extract_tags(item) or None
+        tags = _extract_tags(item)
         comments = _extract_comments(item)

         res.add(
-            res.types.LegacyResult(
-                template="paper.html",
+            res.types.Paper(
                 url=url,
                 title=title,
                 content=content,
@@ -99,7 +187,7 @@ def response(resp: SXNG_Response) -> EngineResults:
     return res


-def _stringify_pages(biblio: dict[str, t.Any]) -> str | None:
+def _stringify_pages(biblio: dict[str, t.Any]) -> str:
     first_page = biblio.get("first_page")
     last_page = biblio.get("last_page")
     if first_page and last_page:
@@ -108,7 +196,7 @@ def _stringify_pages(biblio: dict[str, t.Any]) -> str | None:
         return str(first_page)
     if last_page:
         return str(last_page)
-    return None
+    return ""


 def _parse_date(value: str | None) -> datetime | None:
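Taken together, the two `_stringify_pages` hunks give the behavior sketched below. Note the caveat: the branch that joins both pages lies outside the visible hunks, so the `"first-last"` separator is an assumption, not confirmed by this diff:

```python
from typing import Any


def stringify_pages(biblio: dict[str, Any]) -> str:
    # Mirrors the branches visible in the diff; the "first-last" join format
    # is assumed, since that line is not shown in the hunks.
    first_page = biblio.get("first_page")
    last_page = biblio.get("last_page")
    if first_page and last_page:
        return f"{first_page}-{last_page}"
    if first_page:
        return str(first_page)
    if last_page:
        return str(last_page)
    return ""


print(stringify_pages({"first_page": "12", "last_page": "19"}))
print(stringify_pages({"first_page": "7"}))
```

After this commit the function returns `""` rather than `None` when no page information exists, matching the all-`str` fields of the new `Paper` result type.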
@@ -123,9 +211,9 @@ def _parse_date(value: str | None) -> datetime | None:
     return None


-def _doi_to_plain(doi_value: str | None) -> str | None:
+def _doi_to_plain(doi_value: str | None) -> str:
     if not doi_value:
-        return None
+        return ""
     # OpenAlex `doi` field is commonly a full URL like https://doi.org/10.1234/abcd
     return doi_value.removeprefix("https://doi.org/")
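The `_doi_to_plain` change keeps the `removeprefix` normalization but returns an empty string instead of `None` for missing DOIs. A standalone usage sketch of the same logic:

```python
from __future__ import annotations


def doi_to_plain(doi_value: str | None) -> str:
    # Same logic as the engine's _doi_to_plain after this commit: a missing
    # DOI becomes "", a resolver URL is reduced to the bare DOI.
    if not doi_value:
        return ""
    return doi_value.removeprefix("https://doi.org/")


print(doi_to_plain("https://doi.org/10.1234/abcd"))  # -> 10.1234/abcd
print(repr(doi_to_plain(None)))                      # -> ''
```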
@@ -151,14 +239,17 @@ def _reconstruct_abstract(
     return text if text != "" else None


-def _extract_links(item: dict[str, t.Any]) -> tuple[str, str | None, str | None]:
-    primary_location = item.get("primary_location", {})
-    landing_page_url: str | None = primary_location.get("landing_page_url")
-    work_url: str = item.get("id", "")
-    url: str = landing_page_url or work_url
-    open_access = item.get("open_access", {})
-    pdf_url: str | None = primary_location.get("pdf_url") or open_access.get("oa_url")
-    html_url: str | None = landing_page_url
+def _extract_links(item: dict[str, t.Any]) -> tuple[str, str, str]:
+    primary_location: dict[str, str] = item.get("primary_location", {})
+    open_access: dict[str, str] = item.get("open_access", {})
+
+    landing_page_url: str = primary_location.get("landing_page_url") or ""
+    work_url: str = item.get("id", "")
+
+    url: str = landing_page_url or work_url
+    html_url: str = landing_page_url
+    pdf_url: str = primary_location.get("pdf_url") or open_access.get("oa_url") or ""
+
     return url, html_url, pdf_url
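The reworked `_extract_links` resolves all three links with `or`-fallback chains: the main `url` prefers the landing page (often a DOI resolver) and falls back to the OpenAlex work id, and the PDF link prefers `primary_location.pdf_url` over `open_access.oa_url`. A standalone sketch of the same chain, exercised on an invented record shaped like an OpenAlex Work:

```python
from typing import Any


def extract_links(item: dict[str, Any]) -> tuple[str, str, str]:
    # Same fallback chain as the engine's _extract_links after this commit.
    primary_location: dict[str, str] = item.get("primary_location", {})
    open_access: dict[str, str] = item.get("open_access", {})
    landing_page_url = primary_location.get("landing_page_url") or ""
    work_url = item.get("id", "")
    url = landing_page_url or work_url
    html_url = landing_page_url
    pdf_url = primary_location.get("pdf_url") or open_access.get("oa_url") or ""
    return url, html_url, pdf_url


item = {  # invented sample record, not taken from the diff
    "id": "https://openalex.org/W1234567890",
    "primary_location": {"landing_page_url": "https://doi.org/10.1234/abcd"},
    "open_access": {"oa_url": "https://example.org/paper.pdf"},
}
print(extract_links(item))
```

With all fields empty, everything degrades to empty strings except `url`, which still falls back to the work id, so every result keeps a main link.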
@@ -185,20 +276,21 @@ def _extract_tags(item: dict[str, t.Any]) -> list[str]:


 def _extract_biblio(
     item: dict[str, t.Any],
-) -> tuple[str | None, str | None, str | None, str | None, str | None, datetime | None]:
-    host_venue = item.get("host_venue", {})
-    biblio = item.get("biblio", {})
-    journal: str | None = host_venue.get("display_name")
-    publisher: str | None = host_venue.get("publisher")
-    pages = _stringify_pages(biblio)
-    volume = biblio.get("volume")
-    number = biblio.get("issue")
+) -> tuple[str, str, str, str, str, datetime | None]:
+    host_venue: dict[str, str] = item.get("host_venue", {})
+    biblio: dict[str, str] = item.get("biblio", {})
+
+    journal: str = host_venue.get("display_name", "")
+    publisher: str = host_venue.get("publisher", "")
+    pages: str = _stringify_pages(biblio)
+    volume = biblio.get("volume", "")
+    number = biblio.get("issue", "")
     published_date = _parse_date(item.get("publication_date"))
     return journal, publisher, pages, volume, number, published_date


-def _extract_comments(item: dict[str, t.Any]) -> str | None:
+def _extract_comments(item: dict[str, t.Any]) -> str:
     cited_by_count = item.get("cited_by_count")
     if isinstance(cited_by_count, int):
         return f"{cited_by_count} citations"
-    return None
+    return ""