paperless-ngx/src/documents/sanity_checker.py
Trenton H 8e67828bd7
Feature: Redesign the task system (#12584)
* feat(tasks): replace PaperlessTask model with structured redesign

Drop the old string-based PaperlessTask table and recreate it with
Status/TaskType/TriggerSource enums, JSONField result storage, and
duration tracking fields. Update all call sites to use the new API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): rewrite signal handlers to track all task types

Replace the old consume_file-only handler with a full rewrite that tracks
6 task types (consume_file, train_classifier, sanity_check, index_optimize,
llm_index, mail_fetch) with proper trigger source detection, input data
extraction, legacy result string parsing, duration/wait time recording,
and structured error capture on failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): add traceback and revoked state coverage to signal tests

* refactor(tasks): remove manual PaperlessTask creation and scheduled/auto params

All task records are now created exclusively via Celery signals (Task 2).
Removed PaperlessTask creation/update from train_classifier, sanity_check,
llmindex_index, and check_sanity. Removed scheduled= and auto= parameters
from all 7 call sites. Updated apply_async callers to use trigger_source
headers instead. Exceptions now propagate naturally from task functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): auto-inject trigger_source=scheduled header for all beat tasks

Inject `headers: {"trigger_source": "scheduled"}` into every Celery beat
schedule entry so signal handlers can identify scheduler-originated tasks
without per-task instrumentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): update serializer, filter, and viewset with v9 backwards compat

- Replace TasksViewSerializer/RunTaskViewSerializer with TaskSerializerV10
  (new field names), TaskSerializerV9 (v9 compat), TaskSummarySerializer,
  and RunTaskSerializer
- Add AcknowledgeTasksViewSerializer unchanged (kept existing validation)
- Expand PaperlessTaskFilterSet with MultipleChoiceFilter for task_type,
  trigger_source, status; add is_complete, date_created_after/before filters
- Replace TasksViewSet.get_serializer_class() to branch on request.version
- Add get_queryset() v9 compat for task_name/type query params
- Add acknowledge_all, summary, active actions to TasksViewSet
- Rewrite run action to use apply_async with trigger_source header
- Add timedelta import to views.py; add MultipleChoiceFilter/DateTimeFilter
  to filters.py imports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): add read_only_fields to TaskSerializerV9, enforce admin via permission_classes on run action

* test(tasks): rewrite API task tests for redesigned model and v9 compat

Replaces the old Django TestCase-based tests with pytest-style classes using
PaperlessTaskFactory. Covers v10 field names, v9 backwards-compat field
mapping, filtering, ordering, acknowledge, acknowledge_all, summary, active,
and run endpoints. Also adds PaperlessTaskFactory to factories.py and fixes
a redundant source= kwarg in TaskSerializerV10.related_document_ids.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): fix two spec gaps in task API test suite

Move test_list_is_owner_aware to TestGetTasksV10 (it tests GET /api/tasks/,
not acknowledge). Add test_related_document_ids_includes_duplicate_of to
cover the duplicate_of path in the related_document_ids property.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): address code quality review findings

Remove trivial field-existence tests per project conventions. Fix
potentially flaky ordering test to use explicit date_created values.
Add is_complete=false filter test, v9 type filter input direction test,
and tighten TestActive second test to target REVOKED specifically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): update TaskAdmin for redesigned model

Add date_created, duration_seconds to list_display; add trigger_source
to list_filter; add input_data, duration_seconds, wait_time_seconds to
readonly_fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): update Angular types and service for task redesign

Replace PaperlessTaskName/PaperlessTaskType/PaperlessTaskStatus enums
with new PaperlessTaskType, PaperlessTaskTriggerSource, PaperlessTaskStatus
enums. Update PaperlessTask interface to new field names (task_type,
trigger_source, input_data, result_message, related_document_ids).
Update TasksService to filter by task_type instead of task_name.
Update tasks component and system-status-dialog to use new field names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(tasks): remove django-celery-results

PaperlessTask now tracks all task results via Celery signals. The
django-celery-results DB backend was write-only -- nothing reads
from it. Drop the package and add a migration to clean up the
orphaned tables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: fix remaining tests broken by task system redesign

Update all tests that created PaperlessTask objects with old field names
to use PaperlessTaskFactory and new field names (task_type, trigger_source,
status, result_message). Use apply_async instead of delay where mocked.
Drop TestCheckSanityTaskRecording — tests PaperlessTask creation that was
intentionally removed from check_sanity().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): improve test_api_tasks.py structure and add api marker

- Move admin_client, v9_client, user_client fixtures to conftest.py so
  they can be reused by other API tests; all three now build on the
  rest_api_client fixture instead of creating APIClient() directly
- Move regular_user fixture to conftest.py (was already done, now also
  used by the new client fixtures)
- Add docstrings to every test method describing the behaviour under test
- Move timedelta/timezone imports to module level
- Register 'api' pytest marker in pyproject.toml and apply pytestmark to
  the entire file so all 40 tests are selectable via -m api

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(tasks): simplify task tracking code after redesign

- Extract COMPLETE_STATUSES as a class constant on PaperlessTask,
  eliminating the repeated status tuple across models.py, views.py (3×),
  and filters.py
- Extract _CELERY_STATE_TO_STATUS as a module-level constant instead of
  rebuilding the dict on every task_postrun
- Extract _V9_TYPE_TO_TRIGGER_SOURCE and _RUNNABLE_TASKS as class
  constants on TasksViewSet instead of rebuilding on every request
- Extract _TRIGGER_SOURCE_TO_V9_TYPE as a class constant on
  TaskSerializerV9 instead of rebuilding per serialized object
- Extract _get_consume_args helper to deduplicate identical arg
  extraction logic in _extract_input_data, _determine_trigger_source,
  and _extract_owner_id
- Move inline imports (re, traceback) and Avg to module level
- Fix _DOCUMENT_SOURCE_TO_TRIGGER type annotation key type to
  DocumentSource instead of Any
- Remove redundant truthiness checks in SystemStatusView branches
  already guarded by an is-None check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(tasks): add docstrings and rename _parse_legacy_result

- Add docstrings to _extract_input_data, _determine_trigger_source,
  _extract_owner_id explaining what each helper does and why
- Rename _parse_legacy_result -> _parse_consume_result: the function
  parses current consume_file string outputs (consumer.py returns
  "New document id N created" and "It is a duplicate of X (#N)"),
  not legacy data; the old name was misleading

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(tasks): extend and harden the task system redesign

- TaskType: add EMPTY_TRASH, CHECK_WORKFLOWS, CLEANUP_SHARE_LINKS;
  remove INDEX_REBUILD (no backing task — beat schedule uses index_optimize)
- TRACKED_TASKS: wire up all nine task types including the three new ones
  and llmindex_index / process_mail_accounts
- Add task_revoked_handler so cancelled/expired tasks are marked REVOKED
- Fix double-write: task_postrun_handler no longer overwrites result_data
  when status is already FAILURE (task_failure_handler owns that write)
- v9 serialiser: map EMAIL_CONSUME and FOLDER_CONSUME to AUTO_TASK
- views: scope task list to owner for regular users, admins see all;
  validate ?days= query param and return 400 on bad input
- tests: add test_list_admin_sees_all_tasks; rename/fix
  test_parses_duplicate_string (duplicates produce SUCCESS, not FAILURE);
  use PaperlessTaskFactory in modified tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): fix MAIL_FETCH null input_data and postrun double-query

- _extract_input_data: return {} instead of {"account_ids": None} when
  process_mail_accounts is called without an explicit account list (the
  normal beat-scheduled path); add test to cover this path
- task_postrun_handler: replace filter().first() + filter().update() with
  get() + save(update_fields=[...]) — single fetch, single write,
  consistent with task_prerun_handler

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): add queryset stub to satisfy drf-spectacular schema generation

TasksViewSet.get_queryset() accesses request.user, which drf-spectacular
cannot provide during static schema generation.  Adding a class-level
queryset = PaperlessTask.objects.none() gives spectacular a model to
introspect without invoking get_queryset(), eliminating both warnings
and the test_valid_schema failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): fill coverage gaps in task system

- test_task_signals: add TestTaskRevokedHandler (marks REVOKED, ignores
  None request, ignores unknown id); switch existing direct
  PaperlessTask.objects.create calls to PaperlessTaskFactory; import
  pytest_mock and use MockerFixture typing on mocker params
- test_api_tasks: add test_rejects_invalid_days_param to TestSummary
- tasks.service.spec: add dismissAllTasks test (POST acknowledge_all +
  reload)
- models: add pragma: no cover to __str__, is_complete, and
  related_document_ids (trivial delegates, covered indirectly)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Well, that was a bad push.

* Fixes v9 API compatability with testing coverage

* fix(tasks): restore INDEX_OPTIMIZE enum and remove no-op run button

INDEX_OPTIMIZE was dropped from the TaskType enum but still referenced
in _RUNNABLE_TASKS (views.py) and the frontend system-status-dialog,
causing an AttributeError at import time. Restore the enum value in the
model and migration so the serializer accepts it, but remove it from
_RUNNABLE_TASKS since index_optimize is a Tantivy no-op. Remove the
frontend "Run Task" button for index optimization accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(tasks): v9 type filter now matches all equivalent trigger sources

The v9 ?type= query param mapped each value to a single TriggerSource,
but the serializer maps multiple sources to the same v9 type value.
A task serialized as "auto_task" would not appear when filtering by
?type=auto_task if its trigger_source was email_consume or
folder_consume. Same issue for "manual_task" missing web_ui and
api_upload sources. Changed to trigger_source__in with the full set
of sources for each v9 type value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(tasks): give task_failure_handler full ownership of FAILURE path

task_postrun_handler now early-returns for FAILURE states instead of
redundantly writing status and date_done. task_failure_handler now
computes duration_seconds and wait_time_seconds so failed tasks get
complete timing data. This eliminates a wasted .get() + .save() round
trip on every failed task and gives each handler a clean, non-overlapping
responsibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(tasks): resolve trigger_source header via TriggerSource enum lookup

Replace two hardcoded string comparisons ("scheduled", "system") with a
single TriggerSource(header_source) lookup so the enum values are the
single source of truth. Any valid TriggerSource DB value passed in the
header is accepted; invalid values fall through to the document-source /
MANUAL logic. Update tests to pass enum values in headers rather than raw
strings, and add a test for the invalid-header fallback path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): use TriggerSource enum values at all apply_async call sites

Replace raw strings ("system", "manual") with PaperlessTask.TriggerSource
enum values in the three callers that can import models. The settings
file remains a raw string (models cannot be imported at settings load
time) with a comment pointing to the enum value it must match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(tasks): parametrize repetitive test cases in task test files

test_api_tasks.py:
- Collapse six trigger_source->v9-type tests into one parametrized test,
  adding the previously untested API_UPLOAD case
- Collapse three task_name mapping tests (two remaps + pass-through)
  into one parametrized test
- Collapse two acknowledge_all status tests into one parametrized test
- Collapse two run-endpoint 400 tests into one parametrized test
- Update run/ assertions to use TriggerSource enum values

test_task_signals.py:
- Collapse three trigger_source header tests into one parametrized test
- Collapse two DocumentSource->TriggerSource mapping tests into one
  parametrized test
- Collapse two prerun ignore-invalid-id tests into one parametrized test

All parametrize cases use pytest.param with descriptive ids.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Handle JSON serialization for datetime and Path.  Further restrist the v9 permissions as Copilot suggests

* That should fix the generated schema/browser

* Use XSerializer for the schema

* A few more basic cases I see no value in covering

* Drops the migration related stuff too.  Just in case we want it again or it confuses people

* fix: annotate tasks_summary_retrieve as array of TaskSummarySerializer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: annotate tasks_active_retrieve as array of TaskSerializerV10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Restore task running to superuser only

* Removes the acknowledge/dismiss all stuff

* Aligns v10 and v9 task permissions with each other

* Short blurb just to warn users about the tasks being cleared

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 09:28:41 -07:00

316 lines
10 KiB
Python

"""
Sanity checker for the Paperless-ngx document archive.
Verifies that all documents have valid files, correct checksums,
and consistent metadata. Reports orphaned files in the media directory.
Progress display is the caller's responsibility -- pass an ``iter_wrapper``
to wrap the document queryset (e.g., with a progress bar). The default
is an identity function that adds no overhead.
"""
import logging
from collections import defaultdict
from collections.abc import Iterator
from pathlib import Path
from typing import TYPE_CHECKING
from typing import Final
from typing import TypedDict
from django.conf import settings
from documents.models import Document
from documents.utils import IterWrapper
from documents.utils import compute_checksum
from documents.utils import identity
from paperless.config import GeneralConfig
logger = logging.getLogger("paperless.sanity_checker")
class MessageEntry(TypedDict):
"""A single sanity check message with its severity level."""
level: int
message: str
class SanityCheckMessages:
"""Collects sanity check messages grouped by document primary key.
Messages are categorized as error, warning, or info. ``None`` is used
as the key for messages not associated with a specific document
(e.g., orphaned files).
"""
def __init__(self) -> None:
self._messages: dict[int | None, list[MessageEntry]] = defaultdict(list)
self.has_error: bool = False
self.has_warning: bool = False
self.has_info: bool = False
self.document_count: int = 0
self.document_error_count: int = 0
self.document_warning_count: int = 0
self.document_info_count: int = 0
self.global_warning_count: int = 0
# -- Recording ----------------------------------------------------------
def error(self, doc_pk: int | None, message: str) -> None:
self._messages[doc_pk].append({"level": logging.ERROR, "message": message})
self.has_error = True
if doc_pk is not None:
self.document_count += 1
self.document_error_count += 1
def warning(self, doc_pk: int | None, message: str) -> None:
self._messages[doc_pk].append({"level": logging.WARNING, "message": message})
self.has_warning = True
if doc_pk is not None:
self.document_count += 1
self.document_warning_count += 1
else:
# This is the only type of global message we do right now
self.global_warning_count += 1
def info(self, doc_pk: int | None, message: str) -> None:
self._messages[doc_pk].append({"level": logging.INFO, "message": message})
self.has_info = True
if doc_pk is not None:
self.document_count += 1
self.document_info_count += 1
# -- Iteration / query --------------------------------------------------
def document_pks(self) -> list[int | None]:
"""Return all document PKs (including None for global messages)."""
return list(self._messages.keys())
def iter_messages(self) -> Iterator[tuple[int | None, list[MessageEntry]]]:
"""Iterate over (doc_pk, messages) pairs."""
yield from self._messages.items()
def __getitem__(self, item: int | None) -> list[MessageEntry]:
return self._messages[item]
# -- Summarize Helpers --------------------------------------------------
@property
def has_global_issues(self) -> bool:
return None in self._messages
@property
def total_issue_count(self) -> int:
"""Total number of error and warning messages across all documents and global."""
return (
self.document_error_count
+ self.document_warning_count
+ self.global_warning_count
)
# -- Logging output (used by Celery task path) --------------------------
def log_messages(self) -> None:
"""Write all messages to the ``paperless.sanity_checker`` logger.
This is the output path for headless / Celery execution.
Management commands use Rich rendering instead.
"""
if len(self._messages) == 0:
logger.info("Sanity checker detected no issues.")
return
doc_pks = [pk for pk in self._messages if pk is not None]
titles: dict[int, str] = {}
if doc_pks:
titles = dict(
Document.global_objects.filter(pk__in=doc_pks)
.only("pk", "title")
.values_list("pk", "title"),
)
for doc_pk, entries in self._messages.items():
if doc_pk is not None:
title = titles.get(doc_pk, "Unknown")
logger.info(
"Detected following issue(s) with document #%s, titled %s",
doc_pk,
title,
)
for msg in entries:
logger.log(msg["level"], msg["message"])
class SanityCheckFailedException(Exception):
pass
# ---------------------------------------------------------------------------
# Internal helpers
# ---------------------------------------------------------------------------
def _build_present_files() -> set[Path]:
"""Collect all files in MEDIA_ROOT, excluding directories and ignorable files."""
present_files = {
x.resolve()
for x in Path(settings.MEDIA_ROOT).glob("**/*")
if not x.is_dir() and x.name not in settings.IGNORABLE_FILES
}
lockfile = Path(settings.MEDIA_LOCK).resolve()
present_files.discard(lockfile)
general_config = GeneralConfig()
app_logo = general_config.app_logo or settings.APP_LOGO
if app_logo:
logo_file = Path(settings.MEDIA_ROOT / Path(app_logo.lstrip("/"))).resolve()
present_files.discard(logo_file)
return present_files
def _check_thumbnail(
doc: Document,
messages: SanityCheckMessages,
present_files: set[Path],
) -> None:
"""Verify the thumbnail exists and is readable."""
# doc.thumbnail_path already returns a resolved Path; no need to re-resolve.
thumbnail_path: Final[Path] = doc.thumbnail_path
if not thumbnail_path.is_file():
messages.error(doc.pk, "Thumbnail of document does not exist.")
return
present_files.discard(thumbnail_path)
try:
_ = thumbnail_path.read_bytes()
except OSError as e:
messages.error(doc.pk, f"Cannot read thumbnail file of document: {e}")
def _check_original(
doc: Document,
messages: SanityCheckMessages,
present_files: set[Path],
) -> None:
"""Verify the original file exists, is readable, and has matching checksum."""
# doc.source_path already returns a resolved Path; no need to re-resolve.
source_path: Final[Path] = doc.source_path
if not source_path.is_file():
messages.error(doc.pk, "Original of document does not exist.")
return
present_files.discard(source_path)
try:
checksum = compute_checksum(source_path)
except OSError as e:
messages.error(doc.pk, f"Cannot read original file of document: {e}")
else:
if checksum != doc.checksum:
messages.error(
doc.pk,
f"Checksum mismatch. Stored: {doc.checksum}, actual: {checksum}.",
)
def _check_archive(
doc: Document,
messages: SanityCheckMessages,
present_files: set[Path],
) -> None:
"""Verify archive file consistency: checksum/filename pairing and file integrity."""
if doc.archive_checksum is not None and doc.archive_filename is None:
messages.error(
doc.pk,
"Document has an archive file checksum, but no archive filename.",
)
elif doc.archive_checksum is None and doc.archive_filename is not None:
messages.error(
doc.pk,
"Document has an archive file, but its checksum is missing.",
)
elif doc.has_archive_version:
if TYPE_CHECKING:
assert isinstance(doc.archive_path, Path)
# doc.archive_path already returns a resolved Path; no need to re-resolve.
archive_path: Final[Path] = doc.archive_path # type: ignore[assignment]
if not archive_path.is_file():
messages.error(doc.pk, "Archived version of document does not exist.")
return
present_files.discard(archive_path)
try:
checksum = compute_checksum(archive_path)
except OSError as e:
messages.error(
doc.pk,
f"Cannot read archive file of document: {e}",
)
else:
if checksum != doc.archive_checksum:
messages.error(
doc.pk,
"Checksum mismatch of archived document. "
f"Stored: {doc.archive_checksum}, actual: {checksum}.",
)
def _check_content(doc: Document, messages: SanityCheckMessages) -> None:
"""Flag documents with no OCR content."""
if not doc.content:
messages.info(doc.pk, "Document contains no OCR data")
def _check_document(
doc: Document,
messages: SanityCheckMessages,
present_files: set[Path],
) -> None:
"""Run all checks for a single document."""
_check_thumbnail(doc, messages, present_files)
_check_original(doc, messages, present_files)
_check_archive(doc, messages, present_files)
_check_content(doc, messages)
# ---------------------------------------------------------------------------
# Public entry point
# ---------------------------------------------------------------------------
def check_sanity(
*,
iter_wrapper: IterWrapper[Document] = identity,
) -> SanityCheckMessages:
"""Run a full sanity check on the document archive.
Args:
iter_wrapper: A callable that wraps the document iterable, e.g.,
for progress bar display. Defaults to identity (no wrapping).
Returns:
A SanityCheckMessages instance containing all detected issues.
"""
messages = SanityCheckMessages()
present_files = _build_present_files()
documents = Document.global_objects.only(
"pk",
"filename",
"mime_type",
"checksum",
"archive_checksum",
"archive_filename",
"content",
).iterator(chunk_size=500)
for doc in iter_wrapper(documents):
_check_document(doc, messages, present_files)
for extra_file in present_files:
messages.warning(None, f"Orphaned file in media dir: {extra_file}")
return messages