.. mirror of https://github.com/paperless-ngx/paperless-ngx.git, synced 2025-10-26 00:02:35 -04:00

***************
Advanced topics
***************

Paperless offers a couple of features that automate certain tasks and make your
life easier.

Guesswork
#########


Any document you put into the consumption directory will be consumed, but if
you name the file right, it'll automatically set some values in the database
for you.  This is the logic the consumer follows:

1. Try to find the date, correspondent, title, and tags in the file name
   following the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.
   Note that the format of the date is **rigidly defined** as
   ``YYYYMMDDHHMMSSZ`` or ``YYYYMMDDZ``.  The ``Z`` refers to "Zulu time",
   AKA "UTC".  The tags are optional, so the format
   ``Date - Correspondent - Title.pdf`` works as well.
2. If that doesn't work, we skip the date and try this pattern:
   ``Correspondent - Title - tag,tag,tag.pdf``.
3. If that doesn't work, we try to find the correspondent and title in the file
   name following the pattern: ``Correspondent - Title.pdf``.
4. If that doesn't work, just assume that the name of the file is the title.

So given the above, the following examples would work as you'd expect:

* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png``

These, however, wouldn't work:

* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg``

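The four steps above can be sketched in a few lines of Python. This is only an
illustration of the documented rules, not the consumer's actual code: the
function name ``guess_attributes`` and the approach of splitting the name on
the " - " separator are assumptions made for this example.

.. code:: python

    import re
    from pathlib import PurePath

    # The date is rigidly defined as YYYYMMDDHHMMSSZ or YYYYMMDDZ.
    DATE_RE = re.compile(r"^(\d{14}|\d{8})Z$")


    def guess_attributes(filename):
        """Hypothetical helper: derive metadata from a file name."""
        stem = PurePath(filename).stem  # drop the extension
        parts = stem.split(" - ")
        # Step 4 is the fallback: the whole name is the title.
        result = {"date": None, "correspondent": None, "title": stem, "tags": []}
        has_date = bool(DATE_RE.match(parts[0]))
        if has_date and len(parts) in (3, 4):
            # Step 1: Date - Correspondent - Title [- tag,tag,tag]
            result.update(date=parts[0], correspondent=parts[1], title=parts[2])
            if len(parts) == 4:
                result["tags"] = parts[3].split(",")
        elif not has_date and len(parts) == 3:
            # Step 2: Correspondent - Title - tag,tag,tag
            result.update(correspondent=parts[0], title=parts[1],
                          tags=parts[2].split(","))
        elif not has_date and len(parts) == 2:
            # Step 3: Correspondent - Title
            result.update(correspondent=parts[0], title=parts[1])
        return result

With this sketch, ``Another Company- Letter of Reference.jpg`` falls through to
the last step because the separator is missing a space, so the whole name
becomes the title.
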
Do I have to be so strict about naming?
=======================================

Rather than using the strict document naming rules, you can set the option
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
that is accepted by dateparser_. Doing so causes ``paperless`` to prefer a
date found in the title, parsed in the configured order, over a date pulled
from the document's text. The strict file name format described above is then
not required.

.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings

Transforming filenames for parsing
==================================

Some devices can't produce filenames that can be parsed by the default
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
``paperless.conf``, you can add transformations that are applied to the
filename before it's parsed.

The option contains a list of dictionaries of regular expressions (key:
``pattern``) and replacements (key: ``repl``) in JSON format, which are
applied in order by passing them to ``re.subn``. Transformation stops
after the first match, so at most one transformation is applied. The general
syntax is

.. code:: python

   [{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]

The example below is for a Brother ADS-2400N, a scanner that allows
assigning different names to different hardware buttons (useful for handling
multiple entities in one instance), but insists on adding ``_<count>``
to the filename.

.. code:: python

   # Brother profile configuration, support "Name_Date_Count" (the default
   # setting) and "Name_Count" (use "Name" as tag and "Count" as title).
   PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]

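To see what such a configuration does to a file name, the "first match wins"
rule can be reproduced with a short Python sketch. The helper name
``apply_transforms`` and the sample file names are made up for this example.
Paperless itself may call ``re.subn`` with different flags.

.. code:: python

    import json
    import re

    # The Brother example from above, parsed from its JSON form.
    BROTHER_TRANSFORMS = json.loads(
        r'[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", '
        r'"repl":"\\2\\3Z - \\4 - \\1."}, '
        r'{"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]'
    )


    def apply_transforms(filename, transforms):
        """Hypothetical helper: apply the first matching transformation only."""
        for transform in transforms:
            new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
            if count > 0:
                return new_name  # stop after the first match
        return filename  # nothing matched, keep the name as it is


    print(apply_transforms("invoices_20200101_120000_0001.pdf", BROTHER_TRANSFORMS))
    # prints: 20200101120000Z - 0001 - invoices.pdf

A name that matches none of the patterns, such as ``scan.pdf``, is passed to
the parser unchanged.
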
Matching tags, correspondents and document types
################################################

After the consumer has tried to figure out what it could from the file name,
it starts looking at the content of the document itself.  It will compare the
matching algorithms defined by every tag and correspondent already set in your
database to see if they apply to the text in that document.  In other words,
if you defined a tag called ``Home Utility`` that had a ``match`` property of
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
automatically tag your newly-consumed document with your ``Home Utility`` tag
so long as the text ``bc hydro`` appears in the body of the document somewhere.

The matching logic is quite powerful and supports searching the text of your
document with different algorithms, so some experimentation may be necessary
to get things right.

In order to have a tag, correspondent or type assigned automatically to newly
consumed documents, assign a match and matching algorithm using the web
interface. These settings define when to assign correspondents, tags and types
to documents.

The following algorithms are available:

* **Any:** Looks for any occurrence of any word provided in match in the PDF.
  If you define the match as ``Bank1 Bank2``, it will match documents containing
  either of these terms.
* **All:** Requires that every word provided appears in the PDF, albeit not in the
  order provided.
* **Literal:** Matches only if the match appears exactly as provided in the PDF.
* **Regular expression:** Parses the match as a regular expression and tries to
  find a match within the document.
* **Fuzzy match:** Not documented here yet. Consult the source code for the
  exact behaviour.
* **Auto:** Tries to automatically match new documents. This does not require you
  to set a match. See the notes below.

When using the "any" or "all" matching algorithms, you can search for terms
that consist of multiple words by enclosing them in double quotes. For example,
a match text of ``"Bank of America" BofA`` with the "any" algorithm will match
documents that contain either "Bank of America" or "BofA", but will not match
documents containing "Bank of South America".

Then just save your tag/correspondent and run another document through the
consumer.  Once complete, you should see the newly-created document,
automatically tagged with the appropriate data.

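The "any", "all", "literal" and "regular expression" algorithms described
above can be approximated with the following Python sketch. It is a
simplified illustration: the function names are hypothetical, and the
case-insensitive substring comparison is an assumption that may not match what
Paperless really does (for instance around word boundaries).

.. code:: python

    import re


    def split_terms(match):
        """Split a match string into terms, keeping "quoted phrases" together."""
        found = re.findall(r'"([^"]+)"|(\S+)', match)
        return [quoted or bare for quoted, bare in found]


    def matches(algorithm, match, text):
        """Hypothetical helper: does ``match`` apply to the document ``text``?"""
        haystack = text.lower()  # assumption: comparison ignores case
        terms = [term.lower() for term in split_terms(match)]
        if algorithm == "any":
            return any(term in haystack for term in terms)
        if algorithm == "all":
            return all(term in haystack for term in terms)
        if algorithm == "literal":
            return match.lower() in haystack
        if algorithm == "regex":
            return re.search(match, text, re.IGNORECASE) is not None
        raise ValueError(f"unsupported algorithm: {algorithm}")

For example, ``matches("any", '"Bank of America" BofA', text)`` is true for a
text mentioning "BofA", but false for a text that only mentions "Bank of South
America".
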
Automatic matching
==================

Paperless-ng comes with a new matching algorithm called *Auto*. This matching
algorithm tries to assign tags, correspondents and document types to your
documents based on how you have assigned these on existing documents. It
uses a neural network under the hood.

If, for example, all your bank statements of your account 123 at the Bank of
America are tagged with the tag "bofa_123" and the matching algorithm of this
tag is set to *Auto*, this neural network will examine your documents and
automatically learn when to assign this tag.

There are a couple of caveats you need to keep in mind when using this feature:

* Changes to your documents are not immediately reflected by the matching
  algorithm. The neural network needs to be *trained* on your documents after
  changes. Paperless periodically (default: once each hour) checks for changes
  and does this automatically for you.
* The Auto matching algorithm only takes documents into account which are NOT
  placed in your inbox (i.e., documents that have no inbox tags assigned to
  them). This ensures that the neural network only learns from documents which
  you have correctly tagged before.
* The matching algorithm can only work if there is a correlation between the
  tag, correspondent or document type and the document itself. Your bank
  statements usually contain your bank account number and the name of the bank,
  so this works reasonably well. However, tags such as "TODO" cannot be
  automatically assigned.
* The matching algorithm needs a reasonable number of documents to identify when
  to assign tags, correspondents, and types. If one out of a thousand documents
  has the correspondent "Very obscure web shop I bought something five years
  ago", it will probably not assign this correspondent automatically if you buy
  something from them again. The more documents, the better.

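The inbox rule from the list above can be pictured as a simple filter over
your documents. The sketch below is purely illustrative: the function name and
the dictionary layout of a document are assumptions, and it says nothing about
how the training itself is implemented.

.. code:: python

    def training_documents(documents, inbox_tags):
        """Hypothetical helper: keep only documents without any inbox tag."""
        inbox = set(inbox_tags)
        return [doc for doc in documents if inbox.isdisjoint(doc["tags"])]


    docs = [
        {"id": 1, "tags": ["inbox", "bofa_123"]},  # still in the inbox: ignored
        {"id": 2, "tags": ["bofa_123"]},           # reviewed: used for training
    ]
    print([doc["id"] for doc in training_documents(docs, ["inbox"])])
    # prints: [2]
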
Hooking into the consumption process
####################################

Sometimes you may want to do something arbitrary whenever a document is
consumed.  Rather than try to predict what you may want to do, Paperless lets
you execute scripts of your own choosing just before or after a document is
consumed using a couple of simple hooks.

Just write a script, put it somewhere that Paperless can read and execute, and
then put the path to that script in ``paperless.conf`` with the variable name
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
``PAPERLESS_POST_CONSUME_SCRIPT``.

.. TODO HYPEREF TO CONFIG

.. important::

    These scripts are executed in a **blocking** process, which means that if
    a script takes a long time to run, it can significantly slow down your
    document consumption flow.  If you want things to run asynchronously,
    you'll have to fork the process in your script and exit.

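One way to avoid blocking the consumer is to let the hook script start the
slow work as a separate process and return right away. The following Python
sketch shows the idea using only the standard library. The helper name
``run_detached`` is hypothetical, and the command you launch is up to you.

.. code:: python

    import subprocess
    import sys


    def run_detached(command):
        """Start ``command`` in its own session and return immediately."""
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True,  # detach from the hook's session (POSIX)
        )


    # Example: launch a stand-in for a slow job without waiting for it.
    job = run_detached([sys.executable, "-c", "import time; time.sleep(1)"])

Because the hook does not wait for the child process, it cannot report the
job's result back to Paperless. Any error handling has to happen inside the
background job itself.
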
Pre-consumption script
======================

Executed after the consumer sees a new document in the consumption folder, but
before any processing of the document is performed. This script receives exactly
one argument:

* Document file name

A simple but common example is a script like this:

``/usr/local/bin/ocr-pdf``

.. code:: bash

    #!/usr/bin/env bash
    # Quote the argument so that paths containing spaces survive.
    pdf2pdfocr.py -i "${1}"

``/etc/paperless.conf``

.. code:: bash

    ...
    PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
    ...

This will pass the path to the document about to be consumed to
``/usr/local/bin/ocr-pdf``, which will in turn call `pdf2pdfocr.py`_ on your
document. That tool overwrites the file with an OCR'd version and exits, at
which point the consumption process begins with the newly modified file.

.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr

Post-consumption script
=======================

Executed after the consumer has successfully processed a document and has moved it
into paperless. It receives the following arguments:

* Document id
* Generated file name
* Source path
* Thumbnail path
* Download URL
* Thumbnail URL
* Correspondent
* Tags

The script can be in any language you like, but for a simple shell script
example, you can take a look at ``post-consumption-example.sh`` in the
``scripts`` directory in this project.

The post consumption script cannot cancel the consumption process.

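If you prefer Python over shell, a post-consumption script could start by
giving the eight positional arguments names. This is a minimal sketch: the
class and function names are made up for illustration, and the tags are kept
as the raw string received because their exact format is not described here.

.. code:: python

    from typing import NamedTuple


    class ConsumedDocument(NamedTuple):
        """The arguments in the order listed above."""
        document_id: str
        file_name: str
        source_path: str
        thumbnail_path: str
        download_url: str
        thumbnail_url: str
        correspondent: str
        tags: str  # raw value as passed in; format not assumed


    def parse_arguments(argv):
        """Hypothetical helper: map the script arguments to named fields."""
        if len(argv) != 8:
            raise ValueError(f"expected 8 arguments, got {len(argv)}")
        return ConsumedDocument(*argv)

In a real script you would call ``parse_arguments(sys.argv[1:])`` and then act
on the fields, for example by logging ``document_id`` and ``file_name``.
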
.. _advanced-file_name_handling:

File name handling
##################

By default, paperless stores your documents in the media directory and renames them
using the identifier which it has assigned to each document. You will end up getting
files like ``0000123.pdf`` in your media directory. This isn't necessarily a bad
thing, because you normally don't have to access these files manually. However, if
you wish to name your files differently, you can do that by adjusting the
``PAPERLESS_FILENAME_FORMAT`` settings variable.

This variable allows you to configure the filename (folders are allowed!) using
placeholders. For example, setting

.. code:: bash

    PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}

will create a directory structure as follows:

.. code::

    2019/
      my_bank/
        statement-january-0000001.pdf
        statement-february-0000002.pdf
    2020/
      my_bank/
        statement-january-0000003.pdf
      shoe_store/
        my_new_shoes-0000004.pdf

Paperless appends the unique identifier of each document to the filename. This
avoids filename clashes.

.. danger::

    Do not manually move or rename your files in the media folder. Paperless
    remembers the last filename a document was stored as. If you do rename a
    file, paperless will report your files as missing and won't be able to
    find them.

Paperless provides the following placeholders within filenames:

* ``{correspondent}``: The name of the correspondent, or "none".
* ``{title}``: The title of the document.
* ``{created}``: The full date and time the document was created.
* ``{created_year}``: Year created only.
* ``{created_month}``: Month created only (number 1-12).
* ``{created_day}``: Day created only (number 1-31).
* ``{added}``: The full date and time the document was added to paperless.
* ``{added_year}``: Year added only.
* ``{added_month}``: Month added only (number 1-12).
* ``{added_day}``: Day added only (number 1-31).
* ``{tags}``: Not documented here yet. Consult the source code for the exact
  behaviour.

Paperless will convert all values for the placeholders into values which are safe
for use in filenames.

.. hint::

    Paperless checks the filename of a document whenever it is saved. Therefore,
    you need to update the filenames of your documents and move them after altering
    this setting by invoking the :ref:`document renamer <utilities-renamer>`.

.. warning::

    Make absolutely sure you get the spelling of the placeholders right, or else
    paperless will use the default naming scheme instead.

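The way placeholders, the appended identifier and the fallback to the default
scheme fit together can be illustrated with a small Python sketch. It is built
on plain ``str.format`` and is not the actual implementation: the function
names, the dictionary layout of a document and the ``safe`` sanitiser are all
assumptions made for this example.

.. code:: python

    import re


    def safe(value):
        """Stand-in sanitiser: lower-case and replace unsafe characters."""
        return re.sub(r"[^\w-]+", "_", str(value).strip().lower())


    def generate_filename(filename_format, document):
        """Hypothetical helper: build the stored file name for a document."""
        default = f"{document['id']:07d}.pdf"  # e.g. 0000123.pdf
        if not filename_format:
            return default
        values = {key: safe(val) for key, val in document.items() if key != "id"}
        try:
            path = filename_format.format(**values)
        except (KeyError, IndexError, ValueError):
            # Misspelled or malformed placeholder: use the default scheme.
            return default
        # The unique identifier is appended to avoid filename clashes.
        return f"{path}-{document['id']:07d}.pdf"


    doc = {"id": 1, "correspondent": "My Bank", "title": "statement-january",
           "created_year": 2019}
    print(generate_filename("{created_year}/{correspondent}/{title}", doc))
    # prints: 2019/my_bank/statement-january-0000001.pdf

With a misspelled placeholder such as ``{titel}``, the sketch returns
``0000001.pdf``, mirroring the warning above.
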
.. caution::

    As of now, you can tell paperless to store your files anywhere outside
    the media directory by setting

    .. code::

        PAPERLESS_FILENAME_FORMAT=../../my/custom/location/{title}

    However, keep in mind that inside docker, if files get stored outside of the
    predefined volumes, they will be lost after a restart of paperless.

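
If you want to check whether a given format would leave the media directory,
you can resolve the resulting path yourself. The sketch below uses only
``os.path``. The function name and the ``/data/media`` location are
hypothetical, and paperless itself does not perform this check for you.

.. code:: python

    import os.path


    def escapes_media_dir(media_dir, relative_name):
        """Return True if the name resolves to a path outside ``media_dir``."""
        media_dir = os.path.abspath(media_dir)
        target = os.path.abspath(os.path.join(media_dir, relative_name))
        return os.path.commonpath([media_dir, target]) != media_dir


    print(escapes_media_dir("/data/media", "../../my/custom/location/title.pdf"))
    # prints: True
    print(escapes_media_dir("/data/media", "2019/my_bank/statement.pdf"))
    # prints: False
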