v3 Ideas List
Trenton H edited this page 2025-07-08 09:23:32 -07:00

Breaking Changes

Removing GPG / Encryption

  • Encrypting documents has been unsupported since 0.9, many years ago
  • Provides no benefit
  • Remnants still linger in the code base here and there

Settings Updates

  • Remove all but Django settings from the environment
  • Separate OCR settings from other settings (call them site settings?)
  • Create multiple levels of OCR settings:
    • A default system configuration, controlled by staff/superusers
    • A user specific settings set
    • The final settings used for OCR are then the combined set: user settings are applied first, then the default system settings fill in anything not set
    • Other parsers (text, etc.) also define these levels, but their settings are entirely separate
  • Allow workflows/matching to set certain settings:
    • Document filename matches regex, disable archive generation and disable de-skew
  • When a document starts consumption, its settings travel through the pipeline with it, i.e. they are set once and not read (from the DB) again
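
The layering described above could be sketched roughly as follows. This is only an illustration: `OcrSettings`, its fields, and `resolve` are hypothetical names, and the assumption that a workflow override takes priority over user settings is not stated in the list above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class OcrSettings:
    # Hypothetical fields. None means "not set at this level".
    language: Optional[str] = None
    deskew: Optional[bool] = None
    create_archive: Optional[bool] = None


def resolve(*levels: OcrSettings) -> OcrSettings:
    """Combine levels, highest priority first; the first non-None value wins."""
    merged = {}
    for f in fields(OcrSettings):
        values = (getattr(level, f.name) for level in levels)
        merged[f.name] = next((v for v in values if v is not None), None)
    return OcrSettings(**merged)


system_defaults = OcrSettings(language="eng", deskew=True, create_archive=True)
user_settings = OcrSettings(language="deu")
# Assumption: a matching workflow may override both other levels.
workflow_override = OcrSettings(deskew=False, create_archive=False)

# Resolved once when consumption starts, then passed along with the document.
effective = resolve(workflow_override, user_settings, system_defaults)
```

Because the result is a frozen value, later pipeline steps can use it without touching the database again.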

Regex Everywhere

  • Remove usages of fnmatch in favor of regex. There was a PR that implemented some sort of multiple matching, a problem regex could have solved directly
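
As a small illustration of the difference, using only the standard library: matching several file types with globs needs several patterns, while a single regex can cover them. The patterns and file names here are made up for the example; `fnmatch.translate` is shown as one possible way to migrate existing glob patterns.

```python
import fnmatch
import re

# With globs, "multiple matching" means looping over several patterns.
globs = ["*.pdf", "*.tiff"]


def matches_glob(name: str) -> bool:
    return any(fnmatch.fnmatchcase(name, g) for g in globs)


# One regex expresses the same idea, plus the "tif" spelling and any case.
pattern = re.compile(r".*\.(pdf|tiff?)$", re.IGNORECASE)


def matches_regex(name: str) -> bool:
    return pattern.match(name) is not None


# Existing glob patterns could be converted to equivalent regexes.
migrated = re.compile(fnmatch.translate("*.pdf"))
```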

Re-design parsing/consumption chain

  • Use chains/pipelines to actually break the consumption into multiple tasks
  • Results from one task move on to the next
  • An initial task takes the file, waits for it to be unmodified, then determines the next task to start.
  • Or alternatively, the initial task builds a pipeline and starts that.
  • The pipeline handles deciding whether the file can be consumed, rather than deciding when a new file is first seen (see plugin ideas)
  • Make each step along the way a well-defined status update, sent over websocket, but also allow configuring something like apprise/ntfy
  • TODO: If something fails along the chain, the DB shouldn't be updated. Maybe 1 task, multiple steps, wrapped in a transaction?
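
One way to picture the "one task, multiple steps" variant from the TODO is the sketch below. Everything in it is hypothetical (`Context`, the step functions, `notify`, `commit`); the single `commit` at the end stands in for a database transaction, which in Django would more likely be `transaction.atomic`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Context:
    path: str
    data: dict = field(default_factory=dict)


Step = Callable[[Context], Context]


def detect_mime(ctx: Context) -> Context:
    ctx.data["mime"] = "application/pdf"  # placeholder result
    return ctx


def parse(ctx: Context) -> Context:
    ctx.data["content"] = f"text of {ctx.path}"  # placeholder result
    return ctx


def run_pipeline(
    ctx: Context,
    steps: List[Step],
    notify: Callable[[str, str], None],
    commit: Callable[[Context], None],
) -> bool:
    """Run steps in order; persist results only if every step succeeds."""
    for step in steps:
        notify(step.__name__, "started")
        try:
            ctx = step(ctx)  # the result of one step feeds the next
        except Exception:
            notify(step.__name__, "failed")
            return False  # nothing was committed
        notify(step.__name__, "finished")
    commit(ctx)
    return True
```

The `notify` callback is where a websocket message or an apprise/ntfy notification could be sent for each step.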

Actual Plugins

  • Design a system to allow plugins, while splitting apart the current code into plugins
  • I can see the following being plugins:
    • Parsers (obviously; this includes things like AI/cloud OCR to get the content, or even talking to a remote API)
    • Archive generation (for example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
    • Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
    • Date parsing (handling non-latin dates, for example)
    • Machine learning (provides an interface which returns the proposed tags, type, etc)
  • Ideally, plugins should be registered when installed, declaring what mime types they support, with some sort of conflict resolution
  • With the settings updates above, a workflow could also be used to set the parser based on matching certain values
  • Provide "paperless", a core set of functionality, including models and common functionality (thumbnail generation for common types, current versions of ML, date parsing, etc)
  • Provide the existing parsers, re-configured to match the new format
  • Rework the other parts to conform to the plugin API spec
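
Registration by mime type with conflict resolution might look something like this. It is a sketch under assumptions: `ParserPlugin`, `PluginRegistry`, the numeric `priority`, and the `preferred` argument (standing in for a parser chosen by a workflow) are all invented for illustration, and the plugin names are examples only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ParserPlugin:
    name: str
    mime_types: Tuple[str, ...]
    priority: int = 0  # assumption: higher wins when several plugins match


class PluginRegistry:
    def __init__(self) -> None:
        self._by_mime = defaultdict(list)

    def register(self, plugin: ParserPlugin) -> None:
        for mime in plugin.mime_types:
            self._by_mime[mime].append(plugin)

    def resolve(self, mime: str, preferred: Optional[str] = None) -> ParserPlugin:
        candidates = self._by_mime.get(mime, [])
        if not candidates:
            raise LookupError(f"no plugin registered for {mime}")
        # An explicit choice (e.g. set by a workflow) beats priority.
        for plugin in candidates:
            if plugin.name == preferred:
                return plugin
        return max(candidates, key=lambda p: p.priority)


registry = PluginRegistry()
registry.register(ParserPlugin("tesseract", ("application/pdf", "image/png")))
registry.register(ParserPlugin("cloud-ocr", ("application/pdf",), priority=10))
```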

Simpler consumer

Transition to Alpine container

  • Smaller image size
  • Faster update cadence

Ditch celery for Huey

  • Celery is big and bulky, with support for memcached, sqs, etc, which we don't need
  • Huey also has nice Django integrations, like for database connections, which we kind of hacked into Celery
  • Would need to use its signals to implement task tracking, but the Django Celery integration is pretty "meh" anyway

Improved Tasks

  • Show scheduled tasks with next execution
  • Simple task status
  • Include more task types
  • Include ability to trigger scheduled tasks "now"

External Services

External OCR

  • External OCR services, using an API, could provide more recent tesseract and ghostscript versions, potentially fixing issues faster than Debian updates (thinking Alpine based image)
  • The service would be streamed the document and eventually return the content and an optional archive file
  • OCR is time consuming, so the service might need celery/huey/a task queue of its own? And a database?
  • fastapi could easily set this up, if there is no need for a database.
  • Could use Redis/Valkey streams to manage state and show progress without a database

External Machine Learning

  • Again, define an API that the service provides so it could be swapped out
  • Given the content, the service suggests the tags, correspondents, etc
  • Being external allows it to be hosted on a machine with more resources
  • Needs a task queue for scheduled training?

Separate OCR from Archive

  • Extracting the content of an image or PDF document should be separated from generating an archive file
  • Just too many interactions between them, leading to odd combinations

Django Ninja

  • Really like the OpenAPI spec it generates
  • async support for databases
  • Strongly typed and validated with Pydantic

Blockers

  • Would need to implement Token based authentication
    • Could track, with some resolution, when a token was last used. Might be nice to display and allow removing old tokens which haven't been used
    • Could implement expiration too
  • Async pagination isn't working quite yet
  • No idea about allauth/oidc integration
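
For the token ideas above, tracking last use "with some resolution" could mean only recording a new timestamp when the stored one is older than some interval, so that not every request causes a write. A minimal sketch, assuming a one-hour resolution and a hypothetical `ApiToken` shape:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

LAST_USED_RESOLUTION = timedelta(hours=1)  # assumed value


@dataclass
class ApiToken:
    key: str
    last_used: Optional[datetime] = None
    expires_at: Optional[datetime] = None


def touch(token: ApiToken, now: datetime) -> bool:
    """Update last_used at most once per resolution window.

    Returns True when the timestamp changed, i.e. when a write is needed.
    """
    if token.last_used is None or now - token.last_used >= LAST_USED_RESOLUTION:
        token.last_used = now
        return True
    return False


def is_expired(token: ApiToken, now: datetime) -> bool:
    return token.expires_at is not None and now >= token.expires_at


def stale_tokens(tokens: List[ApiToken], now: datetime, max_idle: timedelta) -> List[ApiToken]:
    """Tokens never used, or not used within max_idle: candidates for removal."""
    return [t for t in tokens if t.last_used is None or now - t.last_used > max_idle]
```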

Vector Embeddings

  • This would require either a new database or everyone to use the same database (à la Immich with pgvecto.rs)
  • Would enable semantic search, document similarity search
    • Maybe replacing Whoosh entirely?
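
To make the idea concrete: document similarity search over embeddings boils down to comparing vectors, commonly by cosine similarity. The sketch below uses tiny hand-written vectors and a brute-force scan purely for illustration; real embeddings would come from a model, and a vector-capable database would do the search.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query: List[float], embeddings: Dict[int, List[float]], top_k: int = 2
) -> List[Tuple[int, float]]:
    """Return the top_k (document id, score) pairs, best match first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up 3-dimensional "embeddings" keyed by document id.
embeddings = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 0.0, 1.0],
}
```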

New Sanity Checker

  • Sanity checker messages are attached to a document
  • Can be dismissed (but still viewed)
  • Visible in the UI somehow

Testing Updates

  • Fully transition to pytest-django, pytest style and fixtures, with appropriate scopes to limit the amount of work
  • Actually define and use factory-boy factories for our models (including relations)

Use File Storage API

  • See reference
  • Can also switch to an in-memory storage for testing, which might be a little faster
  • Makes it easier to store files elsewhere