mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-07-09 03:04:12 -04:00
v3 Ideas List
Trenton H edited this page 2025-07-08 09:23:32 -07:00
Table of Contents
- Table of Contents
- Breaking Changes
- Removing GPG / Encryption
- Settings Updates
- Regex Everywhere
- Re-design parsing/consumption chain
- Actual Plugins
- Simpler consumer
- Transition to Alpine container
- Ditch celery for Huey
- Improved Tasks
- External Services
- Separate OCR from Archive
- Django Ninja
- Vector Embeddings
- New Sanity Checker
- Testing Updates
- Use File Storage API
Breaking Changes
Removing GPG / Encryption
- Encrypting documents has been unsupported since 0.9, many years ago
- Provides no benefit
- Still lingers in the code base here and there
Settings Updates
- Remove all but Django settings from the environment
- Separate OCR settings from other settings (call them site settings?)
- Create multiple levels of OCR settings:
- A default system configuration, controlled by staff/superusers
- A user-specific set of settings
- The final settings used for OCR are then the combined set: user settings first, then default system settings
- Other parsers (text, etc.) also define these levels, but entirely separately
- Allow workflows/matching to set certain settings:
- Document filename matches regex, disable archive generation and disable de-skew
- When a document starts consumption, settings go through the pipeline with it, i.e. they are set once and not read (from the DB) again
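The layered settings described above could be resolved once, at the start of consumption, along these lines. This is a minimal sketch only: `OcrSettings`, its fields, and `merge` are hypothetical names, and placing workflow overrides ahead of user settings is an assumption that the list does not state.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical OCR settings; None means 'not set at this level'."""

    language: Optional[str] = None
    deskew: Optional[bool] = None
    generate_archive: Optional[bool] = None


def merge(*layers: OcrSettings) -> OcrSettings:
    """Combine layers in order of precedence; the first non-None value wins."""
    resolved = {}
    for f in fields(OcrSettings):
        for layer in layers:
            value = getattr(layer, f.name)
            if value is not None:
                resolved[f.name] = value
                break
    return OcrSettings(**resolved)


# System defaults (staff/superusers), one user's settings, and overrides
# set by a workflow whose filename regex matched (assumed highest precedence).
system_defaults = OcrSettings(language="eng", deskew=True, generate_archive=True)
user_settings = OcrSettings(language="deu")
workflow_overrides = OcrSettings(deskew=False, generate_archive=False)

# Resolved once; the frozen result travels with the document through the pipeline.
final = merge(workflow_overrides, user_settings, system_defaults)
```

Because the result is frozen, later steps cannot change it and have no reason to read the database again.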
Regex Everywhere
- Remove usages of fnmatch in favor of regex. There was a PR that implemented some sort of multiple matching, where regex could have solved it
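To illustrate the point about multiple matching: a glob covers one shape of filename, so matching several shapes needs several patterns, while one regex with alternation covers them all. The patterns and function names below are illustrative, not taken from the code base. `fnmatch.translate` is shown as one possible way to carry existing globs over.

```python
import fnmatch
import re

# Glob approach: one pattern per shape, combined by hand.
globs = ["*.pdf", "*.png"]


def matches_any_glob(name: str) -> bool:
    return any(fnmatch.fnmatchcase(name, g) for g in globs)


# Regex approach: a single pattern with alternation, here also case-insensitive.
pattern = re.compile(r".*\.(pdf|png)", re.IGNORECASE)


def matches_regex(name: str) -> bool:
    return pattern.fullmatch(name) is not None


# Existing globs can be converted mechanically during a migration.
migrated = re.compile(fnmatch.translate("*.pdf"))
```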
Re-design parsing/consumption chain
- Use chains/pipelines to actually break the consumption into multiple tasks
- Results from one task move on to the next
- An initial task takes the file, waits for it to be unmodified, then determines the next task to start.
- Or alternatively, the initial task builds a pipeline and starts that.
- Handles deciding if the file can be consumed, rather than when a new file is seen (see plugin ideas)
- Make each step along the way a well-defined status update, sent over websocket, but also configurable to go to something like apprise/ntfy
- TODO: If something fails along the chain, the DB shouldn't be updated. Maybe 1 task, multiple steps, wrapped in a transaction?
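One possible shape for the "1 task, multiple steps" option from the TODO is sketched below. Each step reports a status through a callback, and the commit happens only if every step succeeds. All names (`Context`, `run_pipeline`, the example steps) are hypothetical. In the real project the commit would presumably sit inside a database transaction, and the callback would forward to the websocket or a notification service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Context:
    """State handed from one step to the next."""

    path: str
    data: Dict[str, object] = field(default_factory=dict)


Step = Callable[[Context], None]
Notifier = Callable[[str, str], None]  # (step name, status)


def run_pipeline(
    ctx: Context,
    steps: List[Step],
    notify: Notifier,
    commit: Callable[[Context], None],
) -> bool:
    """Run the steps in order; commit only when all of them succeed."""
    for step in steps:
        notify(step.__name__, "started")
        try:
            step(ctx)
        except Exception as exc:
            notify(step.__name__, f"failed: {exc}")
            return False  # nothing has been committed
        notify(step.__name__, "finished")
    commit(ctx)
    return True


def detect_mime_type(ctx: Context) -> None:
    is_pdf = ctx.path.endswith(".pdf")
    ctx.data["mime_type"] = "application/pdf" if is_pdf else "text/plain"


def extract_text(ctx: Context) -> None:
    if ctx.data["mime_type"] != "application/pdf":
        raise ValueError("unsupported type")
    ctx.data["content"] = "placeholder text"


events: List[tuple] = []
saved: List[Context] = []
ok = run_pipeline(
    Context("invoice.pdf"),
    [detect_mime_type, extract_text],
    lambda step, status: events.append((step, status)),
    saved.append,
)
```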
Actual Plugins
- Design a system to allow plugins, while splitting apart the current code into plugins
- I can see the following being plugins:
- Parsers (obviously. Includes things like AI/cloud OCR to get the content or even could talk to a remote API)
- Archive generation (for example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
- Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
- Date parsing (handling non-latin dates, for example)
- Machine learning (provides an interface which returns the proposed tags, type, etc)
- Ideally, plugins should be registered when installed, declaring what mime types they support, with some sort of conflict resolution
- With the settings updates above, a workflow could also be used to set the parser based on matching certain values
- Provide "paperless", a core set of functionality, including models and common functionality (thumbnail generation for common types, current versions of ML, date parsing, etc)
- Provide the existing parsers, re-configured to match the new format
- Rework the other parts to conform to the plugin API spec
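As a rough sketch of registration with conflict resolution: plugins declare the mime types they support, and when several claim the same type, an explicit preference (for example, one set by a workflow) wins, otherwise the highest priority does. The `ParserPlugin` and `PluginRegistry` names and the priority scheme are assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class ParserPlugin:
    """Hypothetical plugin descriptor."""

    name: str
    mime_types: FrozenSet[str]
    priority: int = 0


class PluginRegistry:
    def __init__(self) -> None:
        self._by_mime: Dict[str, List[ParserPlugin]] = {}

    def register(self, plugin: ParserPlugin) -> None:
        """Record the plugin under every mime type it declares."""
        for mime in plugin.mime_types:
            self._by_mime.setdefault(mime, []).append(plugin)

    def resolve(self, mime_type: str, preferred: Optional[str] = None) -> ParserPlugin:
        """Pick a plugin: explicit preference first, then highest priority."""
        candidates = self._by_mime.get(mime_type, [])
        if not candidates:
            raise LookupError(f"no plugin registered for {mime_type}")
        if preferred is not None:
            for plugin in candidates:
                if plugin.name == preferred:
                    return plugin
        return max(candidates, key=lambda plugin: plugin.priority)


registry = PluginRegistry()
registry.register(ParserPlugin("core-ocr", frozenset({"application/pdf", "image/png"})))
registry.register(ParserPlugin("cloud-ocr", frozenset({"application/pdf"}), priority=10))
```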
Simpler consumer
- Use something like watchfiles for a simpler loop with only itself as a dependency
- See some ideas in https://github.com/paperless-ngx/paperless-ngx/tree/feature-simpler-consume-loop
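A loop built on watchfiles might look roughly like the following. This is only a sketch and is not taken from the linked branch. `is_candidate`, `consume_loop`, and the ignored suffixes are hypothetical. The watchfiles import is kept inside the function so the filtering helper can be used and tested without the dependency installed.

```python
from pathlib import Path
from typing import Callable

# Hypothetical list of suffixes that indicate a partial or temporary file.
IGNORED_SUFFIXES = {".tmp", ".part", ".crdownload"}


def is_candidate(path: str) -> bool:
    """Skip hidden files and files that look like unfinished transfers."""
    p = Path(path)
    return not p.name.startswith(".") and p.suffix.lower() not in IGNORED_SUFFIXES


def consume_loop(directory: str, handle: Callable[[str], None]) -> None:
    """Block forever, passing each added or modified candidate to handle()."""
    from watchfiles import Change, watch  # third-party dependency

    for changes in watch(directory):
        for change, path in changes:
            if change != Change.deleted and is_candidate(path):
                handle(path)
```

Here `handle` would be whatever queues the initial consumption task; waiting for the file to stop changing is left to that task, as described in the consumption chain section.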
Transition to Alpine container
- Smaller image size
- Faster update cadence
Ditch celery for Huey
- Celery is big and bulky, with support for memcached, sqs, etc, which we don't need
- Huey also has nice Django integrations, like for database connections, which we kind of hacked into Celery
- Would need to use its signals to implement task tracking, but the Django Celery integration is pretty "meh"
Improved Tasks
- Show scheduled tasks with next execution
- Simple task status
- Include more task types
- Include ability to trigger scheduled tasks "now"
External Services
External OCR
- External OCR services, using an API, could provide more recent tesseract and ghostscript versions, potentially fixing issues faster than Debian updates (thinking Alpine based image)
- The service would be streamed the document and would eventually return the content and an optional archive file
- Is time consuming, so might need celery/huey/task queue there? And a database?
- fastapi could easily set this up, if there is no need for a database.
- Could use Redis/Valkey streams to manage state and show progress without a database
External Machine Learning
- Again, define an API that the service provides so it could be swapped out
- Provided the content, suggests the tags, correspondents, etc
- External allows it to be hosted on a larger resourced machine
- Needs a task queue for scheduled training?
Separate OCR from Archive
- Extracting the content of an image or PDF document should be separated from the generation of an archive file
- Just too many interactions between them, leading to odd combinations
Django Ninja
- Really like the OpenAPI spec it generates
- async support for databases
- Strongly typed and validated with Pydantic
Blockers
- Would need to implement token-based authentication
- Could track, with some resolution, when a token was last used. Might be nice to display and allow removing old tokens which haven't been used
- Could implement expiration too
- Async pagination isn't working quite yet
- No idea about allauth/oidc integration
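The token ideas above (last-used tracking "with some resolution", removing unused tokens, expiration) can be sketched independently of any framework. Everything here is hypothetical, including the one-hour resolution. The point of the resolution is that the stored timestamp changes at most once per window, so most authenticated requests would not need a database write.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Assumed resolution: store last_used at most once per hour.
LAST_USED_RESOLUTION = timedelta(hours=1)


@dataclass
class ApiToken:
    key: str
    expires_at: Optional[datetime] = None
    last_used: Optional[datetime] = None

    def is_expired(self, now: datetime) -> bool:
        return self.expires_at is not None and now >= self.expires_at

    def touch(self, now: datetime) -> bool:
        """Record a use; return True only if the stored timestamp changed."""
        if self.last_used is None or now - self.last_used >= LAST_USED_RESOLUTION:
            self.last_used = now
            return True
        return False


def stale_tokens(tokens: List[ApiToken], now: datetime, unused_for: timedelta) -> List[ApiToken]:
    """Tokens never used, or not used within the given period."""
    return [t for t in tokens if t.last_used is None or now - t.last_used >= unused_for]


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
token = ApiToken("abc", expires_at=now + timedelta(days=30))
first_write = token.touch(now)                          # stored
second_write = token.touch(now + timedelta(minutes=5))  # within resolution, skipped
```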
Vector Embeddings
- This would require either a new database or everyone to use the same database (à la Immich with pgvecto.rs)
- Would enable semantic search, document similarity search
- Maybe replacing Whoosh entirely?
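For reference, document similarity search over embeddings comes down to ranking stored vectors by their similarity to a query vector. The pure-Python sketch below uses cosine similarity over a tiny in-memory index with made-up three-dimensional vectors. A real deployment would use model-generated embeddings and a vector-capable database, neither of which is shown here.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query: List[float], index: Dict[int, List[float]], top_k: int = 3
) -> List[Tuple[int, float]]:
    """Return (document id, score) pairs, best match first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Made-up embeddings keyed by document id.
index = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 0.0, 1.0],
}
results = most_similar([1.0, 0.0, 0.0], index, top_k=2)
```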
New Sanity Checker
- Sanity checker messages are attached to a document
- Can be dismissed (but still viewed)
- Visible in the UI somehow
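A data shape that satisfies these three points might look like the sketch below: each message belongs to a document, and dismissing one hides it from the default view without deleting it. The class and field names are hypothetical and do not describe an existing model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SanityMessage:
    document_id: int
    level: str  # e.g. "warning" or "error"
    text: str
    dismissed: bool = False


@dataclass
class SanityReport:
    messages: List[SanityMessage] = field(default_factory=list)

    def add(self, document_id: int, level: str, text: str) -> SanityMessage:
        message = SanityMessage(document_id, level, text)
        self.messages.append(message)
        return message

    def for_document(self, document_id: int, include_dismissed: bool = False) -> List[SanityMessage]:
        """Messages for one document; dismissed ones only on request."""
        return [
            m
            for m in self.messages
            if m.document_id == document_id and (include_dismissed or not m.dismissed)
        ]


report = SanityReport()
missing = report.add(42, "error", "archive file is missing")
report.add(42, "warning", "thumbnail checksum mismatch")
missing.dismissed = True  # hidden by default, still stored
```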
Testing Updates
- Fully transition to pytest-django, pytest style and fixtures, with appropriate scopes to limit the amount of work
- Actually define and use factory-boy factories for our models (including relations)
Use File Storage API
- See reference
- Also can change to an in-memory storage for testing, so it's a little faster maybe?
- Allows storing files elsewhere easier