paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2025-07-09 03:04:12 -04:00

Table of Contents

Breaking Changes
- Removing GPG / Encryption
- Settings Updates
- Regex Everywhere
- Re-design parsing/consumption chain
- Actual Plugins
- Simpler consumer
- Transition to Alpine container
- Ditch celery for Huey
Improved Tasks
External Services
- External OCR
- External Machine Learning
Separate OCR from Archive
Django Ninja
- Blockers
Vector Embeddings
New Sanity Checker
Testing Updates
Use File Storage API

Breaking Changes

Removing GPG / Encryption

Encrypting documents has been unsupported since 0.9, many years ago
It provides no benefit
Remnants still linger in the code base here and there

Settings Updates

Remove all but Django settings from the environment
Separate OCR settings from other settings (call them site settings?)
Create multiple levels of OCR settings:
- A default system configuration, controlled by staff/superusers
- A user specific settings set
- The final settings used for OCR are then the combined set: user settings first, then the default system settings
- Other parsers (text, etc.) also define these levels, but entirely separately from OCR
Allow workflows/matching to set certain settings:
- Example: if the document filename matches a regex, disable archive generation and de-skew
When a document starts consumption, its settings go through the pipeline with it, i.e. they are set once and not read (from the DB) again
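
A minimal sketch of how the levels could be combined into a single read-only snapshot. The setting names, the `resolve_settings` helper, and the rule that workflow overrides win over user settings are all assumptions made for illustration, not existing paperless-ngx code.

```python
from types import MappingProxyType
from typing import Any, Mapping, Optional

# Hypothetical setting names, for illustration only.
SYSTEM_DEFAULTS = {"language": "eng", "deskew": True, "create_archive": True}


def resolve_settings(
    system: Mapping[str, Any],
    user: Mapping[str, Any],
    workflow_overrides: Optional[Mapping[str, Any]] = None,
) -> Mapping[str, Any]:
    """Combine the levels; later levels win: system < user < workflow."""
    merged = {**system, **user, **(workflow_overrides or {})}
    # Read-only view: the snapshot travels with the document through the
    # pipeline and is not re-read from the database mid-consumption.
    return MappingProxyType(merged)


user_settings = {"language": "deu"}
# e.g. set by a workflow whose filename regex matched
overrides = {"create_archive": False, "deskew": False}
snapshot = resolve_settings(SYSTEM_DEFAULTS, user_settings, overrides)
```

Because the result is a `MappingProxyType`, a later step that tries to change a value raises `TypeError`, which keeps the "set once" rule honest.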

Regex Everywhere

Remove usages of fnmatch in favor of regex. There was a PR that implemented some sort of multiple matching, which a regex could have solved
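
As a rough illustration of the migration path: `fnmatch.translate` from the standard library converts an existing glob pattern into an equivalent regex, and a single regex can cover several alternatives at once. The patterns and the `matches` helper below are made-up examples.

```python
import fnmatch
import re

# Existing glob-style patterns can be migrated mechanically:
# fnmatch.translate() returns the equivalent regular expression.
glob_as_regex = re.compile(fnmatch.translate("invoice_*.pdf"))

# One regex can express several alternatives that would otherwise
# need multiple fnmatch patterns.
multi = re.compile(r"^(invoice|receipt)_\d{4}\.(pdf|png)$", re.IGNORECASE)


def matches(pattern: re.Pattern, filename: str) -> bool:
    """Return True when the pattern matches from the start of the filename."""
    return pattern.match(filename) is not None
```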

Re-design parsing/consumption chain

Use chains/pipelines to actually break the consumption into multiple tasks
Results from one task move on to the next
An initial task takes the file, waits for it to be unmodified, then determines the next task to start.
Or alternatively, the initial task builds a pipeline and starts that.
The initial task handles deciding whether the file can be consumed, rather than this being decided when a new file is first seen (see plugin ideas)
Make each step along the way a well-defined status update, sent over websocket, but also configure something like apprise/ntfy
TODO: If something fails along the chain, the DB shouldn't be updated. Maybe 1 task, multiple steps, wrapped in a transaction?
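
One possible shape for the "1 task, multiple steps" idea, sketched in plain Python. The `Pipeline` class, the step names, and the `notify` callback are hypothetical. In a real Django implementation the all-or-nothing behaviour would more likely come from wrapping the steps in `transaction.atomic()` than from copying a dict.

```python
from dataclasses import dataclass
from typing import Any, Callable

Step = Callable[[dict[str, Any]], dict[str, Any]]


@dataclass
class Pipeline:
    steps: list[tuple[str, Step]]
    notify: Callable[[str, str], None]  # called with (step name, status)

    def run(self, context: dict[str, Any]) -> dict[str, Any]:
        # Work on a copy so a failure part-way leaves the caller's state
        # untouched, mimicking "nothing is committed unless all steps pass".
        working = dict(context)
        for name, step in self.steps:
            self.notify(name, "started")
            try:
                working = step(working)  # result of one step feeds the next
            except Exception:
                self.notify(name, "failed")
                raise
            self.notify(name, "done")
        return working


def detect_type(ctx: dict[str, Any]) -> dict[str, Any]:
    return {**ctx, "mime_type": "application/pdf"}


def extract_text(ctx: dict[str, Any]) -> dict[str, Any]:
    return {**ctx, "content": f"text of {ctx['path']}"}


events: list[tuple[str, str]] = []
pipeline = Pipeline(
    steps=[("detect_type", detect_type), ("extract_text", extract_text)],
    notify=lambda step, status: events.append((step, status)),
)
result = pipeline.run({"path": "scan.pdf"})
```

Here `notify` only appends to a list; the same hook is where a websocket message or an apprise/ntfy notification could be sent.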

Actual Plugins

Design a system to allow plugins, while splitting apart the current code into plugins
I can see the following being plugins:
- Parsers (obviously. Includes things like AI/cloud OCR to get the content or even could talk to a remote API)
- Archive generation (example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
- Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
- Date parsing (handling non-latin dates, for example)
- Machine learning (provides an interface which returns the proposed tags, type, etc)
Ideally, plugins should be registered when installed, declaring what mime types they support, with some sort of conflict resolution
With the settings updates above, a workflow could also be used to set the parser based on matching certain values
Provide "paperless", a core set of functionality, including models and common functionality (thumbnail generation for common types, current versions of ML, date parsing, etc)
Provide the existing parsers, re-configured to match the new format
Rework the other parts to conform to the plugin API spec
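
A small sketch of what registration with mime types and conflict resolution might look like. `ParserPlugin`, `PluginRegistry`, the plugin names, and the "explicit preference first, then highest priority" rule are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ParserPlugin:
    name: str
    mime_types: frozenset[str]
    priority: int  # higher wins when two plugins claim the same type
    parse: Callable[[bytes], str]


class PluginRegistry:
    def __init__(self) -> None:
        self._plugins: list[ParserPlugin] = []

    def register(self, plugin: ParserPlugin) -> None:
        self._plugins.append(plugin)

    def for_mime_type(
        self, mime_type: str, preferred: Optional[str] = None
    ) -> ParserPlugin:
        candidates = [p for p in self._plugins if mime_type in p.mime_types]
        if not candidates:
            raise LookupError(f"no plugin handles {mime_type}")
        # An explicit choice (e.g. set by a workflow) beats priority.
        for plugin in candidates:
            if plugin.name == preferred:
                return plugin
        return max(candidates, key=lambda p: p.priority)


registry = PluginRegistry()
registry.register(
    ParserPlugin("text", frozenset({"text/plain"}), 0, lambda data: data.decode())
)
registry.register(
    ParserPlugin("local-ocr", frozenset({"application/pdf"}), 0, lambda data: "ocr text")
)
registry.register(
    ParserPlugin("cloud-ocr", frozenset({"application/pdf"}), 10, lambda data: "cloud text")
)
```

The `preferred` argument is where a workflow-selected parser from the settings updates could plug in.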

Simpler consumer

Use something like watchfiles for a simpler loop with only itself as a dependency
See some ideas in https://github.com/paperless-ngx/paperless-ngx/tree/feature-simpler-consume-loop
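
A rough sketch of such a loop, assuming the `watchfiles` package is installed. `watch` and `Change` come from watchfiles; the ignore list and both helper functions are hypothetical and not taken from the linked branch.

```python
from pathlib import Path
from typing import Iterator

# Hypothetical ignore list for partially written files.
IGNORED_SUFFIXES = {".tmp", ".part", ".crdownload"}


def is_candidate(path: Path) -> bool:
    """Skip hidden files and common partial-download suffixes."""
    return (
        not path.name.startswith(".")
        and path.suffix.lower() not in IGNORED_SUFFIXES
    )


def watch_consume_dir(directory: Path) -> Iterator[Path]:
    """Yield files that were added or modified in the consume directory."""
    # Imported lazily so the helper above stays usable without the package.
    from watchfiles import Change, watch

    # watch() blocks and yields a set of (Change, path) tuples per batch.
    for changes in watch(directory):
        for change, changed_path in changes:
            path = Path(changed_path)
            if change in (Change.added, Change.modified) and is_candidate(path):
                yield path
```

Each yielded path would then be handed to the initial consumption task, which still has to wait for the file to stop changing.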

Transition to Alpine container

Smaller image size
Faster update cadence

Ditch celery for Huey

Celery is big and bulky, with support for memcached, SQS, etc., which we don't need
Huey also has nice Django integrations, like for database connections, which we kind of hacked into Celery
Would need to use its signals to implement task tracking, but the Django Celery integration is pretty "meh"

Improved Tasks

Show scheduled tasks with next execution
Simple task status
Include more task types
Include ability to trigger scheduled tasks "now"

External Services

External OCR

External OCR services, using an API, could provide more recent tesseract and ghostscript versions, potentially fixing issues faster than Debian updates (thinking Alpine based image)
The service would be streamed the document and eventually return the content and an optional archive file
OCR is time consuming, so it might need Celery/Huey/a task queue there? And a database?
FastAPI could easily set this up, if there is no need for a database.
Could use Redis/Valkey streams to manage state and show progress without a database
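
To make the contract concrete, here is a sketch of the interface such a service might expose, with an in-memory stand-in. Every name here (`OcrResult`, `OcrService`, `InMemoryOcrService`, the event strings) is hypothetical; the per-job event list only imitates what an append-only Redis/Valkey stream would hold.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class OcrResult:
    content: str
    archive_pdf: Optional[bytes] = None  # the archive file is optional


class OcrService(Protocol):
    def submit(self, document: Iterable[bytes]) -> str: ...
    def progress(self, job_id: str) -> list[str]: ...
    def result(self, job_id: str) -> OcrResult: ...


class InMemoryOcrService:
    """Stand-in that keeps job state in dicts instead of a database."""

    def __init__(self) -> None:
        self._events: dict[str, list[str]] = {}
        self._results: dict[str, OcrResult] = {}

    def submit(self, document: Iterable[bytes]) -> str:
        job_id = f"job-{len(self._events) + 1}"
        events = self._events.setdefault(job_id, [])
        events.append("received")
        data = b"".join(document)  # consume the streamed chunks
        # A real service would run OCR here; decoding stands in for it.
        self._results[job_id] = OcrResult(content=data.decode("utf-8"))
        events.append("ocr-finished")
        return job_id

    def progress(self, job_id: str) -> list[str]:
        return list(self._events[job_id])

    def result(self, job_id: str) -> OcrResult:
        return self._results[job_id]


service: OcrService = InMemoryOcrService()
job = service.submit(iter([b"hello ", b"world"]))
```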

External Machine Learning

Again, define an API that the service provides so it could be swapped out
Given the content, it suggests the tags, correspondents, etc.
Being external allows it to be hosted on a machine with more resources
Needs a task queue for scheduled training?
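
A sketch of a swappable suggestion API. `Suggestions`, `SuggestionService`, and the keyword-based stand-in are hypothetical; a real service would call a trained model instead of matching keywords.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Suggestions:
    tags: tuple[str, ...] = ()
    correspondent: Optional[str] = None
    document_type: Optional[str] = None


class SuggestionService(Protocol):
    def suggest(self, content: str) -> Suggestions: ...


class KeywordSuggestionService:
    """Toy implementation: maps keywords found in the content to tags."""

    def __init__(self, keyword_to_tag: dict[str, str]) -> None:
        self._keyword_to_tag = keyword_to_tag

    def suggest(self, content: str) -> Suggestions:
        lowered = content.lower()
        tags = {
            tag
            for keyword, tag in self._keyword_to_tag.items()
            if keyword in lowered
        }
        return Suggestions(tags=tuple(sorted(tags)))


ml_service: SuggestionService = KeywordSuggestionService(
    {"invoice": "finance", "premium": "insurance"}
)
suggestions = ml_service.suggest("Invoice for the yearly premium")
```

Any implementation with a matching `suggest` method satisfies the protocol, so a local model and a remote one could be swapped without touching callers.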

Separate OCR from Archive

Extracting the content of an image or PDF document should be separated from generating the archive file
There are just too many interactions between them, leading to odd combinations

Django Ninja

Really like the OpenAPI spec it generates
async support for databases
Strongly typed and validated with Pydantic

Blockers

Would need to implement Token based authentication
- Could track, with some resolution, when a token was last used. Might be nice to display and allow removing old tokens which haven't been used
- Could implement expiration too
Async pagination isn't working quite yet
No idea about allauth/oidc integration
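
The token ideas above (coarse last-used tracking plus optional expiration) could look roughly like this. `ApiToken`, `authenticate`, and the one-hour resolution are assumptions for illustration and are not tied to Django Ninja's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed granularity: avoid a DB write on every single request.
LAST_USED_RESOLUTION = timedelta(hours=1)


@dataclass
class ApiToken:
    key: str
    expires_at: Optional[datetime] = None
    last_used: Optional[datetime] = None

    def is_expired(self, now: datetime) -> bool:
        return self.expires_at is not None and now >= self.expires_at

    def touch(self, now: datetime) -> bool:
        """Record usage; return True only when a write is actually needed."""
        if self.last_used is None or now - self.last_used >= LAST_USED_RESOLUTION:
            self.last_used = now
            return True
        return False


def authenticate(token: ApiToken, now: datetime) -> bool:
    """Reject expired tokens; otherwise record the (coarse) last-used time."""
    if token.is_expired(now):
        return False
    token.touch(now)
    return True


start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
token = ApiToken(key="abc", expires_at=start + timedelta(days=30))
```

Tokens whose `last_used` is old or still `None` are the ones a UI could offer for removal.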

Vector Embeddings

This would require either a new database or everyone to use the same database (à la Immich with pgvecto.rs)
Would enable semantic search, document similarity search
- Maybe replacing Whoosh entirely?
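
For intuition, document similarity over embeddings boils down to ranking by cosine similarity. The sketch below does it in plain Python on made-up 3-dimensional vectors; a vector database would do the same with an index and much larger vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query: list[float], embeddings: dict[int, list[float]], top_k: int = 3
) -> list[int]:
    """Return the ids of the top_k documents closest to the query vector."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Made-up embeddings keyed by document id.
embeddings = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
}
```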

New Sanity Checker

Sanity checker messages are attached to a document
Can be dismissed (but still viewed)
Visible in the UI somehow
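
A possible data shape for this, sketched with dataclasses rather than Django models. `SanityMessage`, `SanityLog`, and the field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SanityMessage:
    document_id: int
    level: str  # e.g. "warning" or "error"
    text: str
    dismissed: bool = False


@dataclass
class SanityLog:
    messages: list[SanityMessage] = field(default_factory=list)

    def add(self, document_id: int, level: str, text: str) -> SanityMessage:
        message = SanityMessage(document_id, level, text)
        self.messages.append(message)
        return message

    def for_document(
        self, document_id: int, include_dismissed: bool = False
    ) -> list[SanityMessage]:
        # Dismissed messages are hidden by default but never deleted,
        # so they can still be viewed on request.
        return [
            m
            for m in self.messages
            if m.document_id == document_id
            and (include_dismissed or not m.dismissed)
        ]


log = SanityLog()
missing = log.add(42, "error", "Original file is missing")
log.add(42, "warning", "Thumbnail checksum mismatch")
missing.dismissed = True
```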

Testing Updates

Fully transition to pytest-django, pytest style and fixtures, with appropriate scopes to limit the amount of work
Actually define and use factory-boy factories for our models (including relations)

Use File Storage API

See reference
Could also switch to an in-memory storage for testing, so it's a little faster maybe?
Makes storing files elsewhere easier
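
Assuming this refers to Django's file storage API, a test-only settings fragment could swap the default backend for the in-memory one. `STORAGES` and `InMemoryStorage` exist in Django 4.2 and later; placing this in a separate test settings module is an assumption.

```python
# Test-only settings override (hypothetical test settings module).
STORAGES = {
    # Keep uploaded/consumed files in memory instead of on disk.
    "default": {
        "BACKEND": "django.core.files.storage.InMemoryStorage",
    },
    # Django's default staticfiles backend, unchanged.
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}
```

This only helps for code paths that go through the storage API; anything that opens paths directly on disk would still need real files.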
