.. Mirror of https://github.com/paperless-ngx/paperless-ngx.git (synced 2025-10-31 10:37:12 -04:00)

.. _consumption:

Consumption
###########

Once you've got Paperless set up, you need to start feeding documents into it.
Currently, there are three options: the consumption directory, IMAP (email), and
HTTP POST.

.. _consumption-directory:

The Consumption Directory
=========================

The primary method of getting documents into your database is by putting them in
the consumption directory.  The ``document_consumer`` script runs in an infinite
loop looking for new additions to this directory.  When it finds them, it parses
them with OCR, indexes what it finds, encrypts the PDF (if
``PAPERLESS_PASSPHRASE`` is set), and stores it in the media directory.

Getting stuff into this directory is up to you.  If you're running Paperless
on your local computer, you might just want to drag and drop files there, but if
you're running this on a server and want your scanner to automatically push
files to this directory, you'll need to set up some sort of service to accept the
files from the scanner.  Typically, you're looking at an FTP server like
`Proftpd`_ or a file-sharing server like `Samba`_.

.. _Proftpd: http://www.proftpd.org/
.. _Samba: http://www.samba.org/

So where is this consumption directory?  It's wherever you define it.  Look for
the ``CONSUMPTION_DIR`` value in ``settings.py``.  Set that to somewhere
appropriate for your use and put some documents in there.  When you're ready,
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.

.. _consumption-directory-hook:

Hooking into the Consumption Process
------------------------------------

Sometimes you may want to do something arbitrary whenever a document is
consumed.  Rather than try to predict what you may want to do, Paperless lets
you execute scripts of your own choosing just before or after a document is
consumed using a couple of simple hooks.

Just write a script, put it somewhere that Paperless can read and execute, and
then put the path to that script in ``paperless.conf`` with the variable name
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
``PAPERLESS_POST_CONSUME_SCRIPT``.  The script will be executed before or
after the document is consumed, respectively.

.. important::

    These scripts are executed in a **blocking** process, which means that if
    a script takes a long time to run, it can significantly slow down your
    document consumption flow.  If you want things to run asynchronously,
    you'll have to fork the process in your script and exit.

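As a rough illustration of the note on blocking, a hook script could hand its
slow work to a detached child process and return right away.  The sketch below
is one way you might structure such a script, not something shipped with
Paperless; the worker path ``/usr/local/bin/slow-worker`` and the helper name
``spawn_detached`` are hypothetical.

.. code:: python

    #!/usr/bin/env python3
    import subprocess
    import sys


    def spawn_detached(command):
        """Start ``command`` in its own session and return without waiting."""
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True,  # detach from the caller's session (POSIX)
        )


    if __name__ == "__main__" and len(sys.argv) > 1:
        # Hypothetical worker; it receives the same arguments as this hook.
        spawn_detached(["/usr/local/bin/slow-worker"] + sys.argv[1:])
        # The hook ends here, so the consumer is not held up by the worker.

Because the hook itself exits immediately, the consumer only waits for the time
it takes to start the child process.
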
.. _consumption-directory-hook-variables:

What Can These Scripts Do?
..........................

It's your script, so you're only limited by your imagination and the laws of
physics.  However, the following values are passed to the scripts in order:


.. _consumption-director-hook-variables-pre:

Pre-consumption script
::::::::::::::::::::::

* Document file name

A simple but common example of this would be a script like this:

``/usr/local/bin/ocr-pdf``

.. code:: bash

    #!/usr/bin/env bash
    # Quote the argument so paths containing spaces survive intact.
    pdf2pdfocr.py -i "${1}"

``/etc/paperless.conf``

.. code:: bash

    ...
    PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
    ...

This will pass the path to the document about to be consumed to
``/usr/local/bin/ocr-pdf``, which will in turn call `pdf2pdfocr.py`_ on your
document.  That tool overwrites the file with an OCR'd version and exits, at
which point the consumption process begins with the newly modified file.

.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr

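The hook does not have to be a shell script.  As a sketch, the same
pre-consumption step could be written in Python, assuming only what is stated
here: the document file name arrives as the first positional argument and
``pdf2pdfocr.py`` accepts ``-i``.  The helper name ``build_ocr_command`` is
made up for this example.

.. code:: python

    #!/usr/bin/env python3
    import subprocess
    import sys


    def build_ocr_command(document_path):
        """Return the argument list that runs pdf2pdfocr.py on the document."""
        # Passing a list (rather than a shell string) keeps odd file names safe.
        return ["pdf2pdfocr.py", "-i", document_path]


    if __name__ == "__main__" and len(sys.argv) > 1:
        # sys.argv[1] is the document file name passed by the consumer.
        result = subprocess.run(build_ocr_command(sys.argv[1]))
        sys.exit(result.returncode)
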
.. _consumption-director-hook-variables-post:

Post-consumption script
:::::::::::::::::::::::

* Document id
* Generated file name
* Source path
* Thumbnail path
* Download URL
* Thumbnail URL
* Correspondent
* Tags

The script can be in any language you like, but for a simple shell script
example, you can take a look at ``post-consumption-example.sh`` in the
``scripts`` directory in this project.

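If you would rather write the post-consumption script in Python, a minimal
sketch could map the eight positional arguments to named fields.  The field
names below are illustrative choices, not names used by Paperless, and the
tags value is kept as the raw string it arrives in because its exact format is
not described here.

.. code:: python

    #!/usr/bin/env python3
    import sys
    from collections import namedtuple

    # Field names are illustrative; the order follows the documented list.
    ConsumedDocument = namedtuple(
        "ConsumedDocument",
        [
            "document_id",
            "file_name",
            "source_path",
            "thumbnail_path",
            "download_url",
            "thumbnail_url",
            "correspondent",
            "tags",
        ],
    )


    def parse_arguments(argv):
        """Map the positional arguments (without the script name) to fields."""
        expected = len(ConsumedDocument._fields)
        if len(argv) != expected:
            raise ValueError(
                "expected {} arguments, got {}".format(expected, len(argv))
            )
        return ConsumedDocument(*argv)


    if __name__ == "__main__" and len(sys.argv) > 1:
        document = parse_arguments(sys.argv[1:])
        print("Consumed document {} ({})".format(
            document.document_id, document.file_name))
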
.. _consumption-imap:

IMAP (Email)
============

Another handy way to get documents into your database is to email them to
yourself.  The typical use case would be to be out for lunch and want to send a
copy of the receipt back to your system at home.  Paperless can be taught to
pull emails down from an arbitrary account and dump them into the consumption
directory, where the process :ref:`above <consumption-directory>` will follow the
usual pattern of consuming the document.

Some things you need to know about this feature:

* It's disabled by default.  Setting the values below enables it.
* It's been tested in a limited environment, so it may not work for you (please
  submit a pull request if you can!)
* It's designed to **delete mail from the server once consumed**.  So don't go
  pointing this to your personal email account and wonder where all your stuff
  went.
* Currently, only one photo (attachment) per email will work.

So, with all that in mind, here's what you do to get it running:

1. Set up a new email account somewhere, or if you're feeling daring, create a
   folder in an existing email box and note the path to that folder.
2. In ``/etc/paperless.conf`` set all of the appropriate values in
   ``PATHS AND FOLDERS`` and ``SECURITY``.
   If you decided to use a subfolder of an existing account, then make sure you
   set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here.  You also have to set
   the ``PAPERLESS_EMAIL_SECRET`` to something you can remember, because you'll
   have to include that in every email you send.
3. Restart the :ref:`consumer <utilities-consumer>`.  The consumer will check
   the configured email account at startup and then every 10 minutes for
   something new, pulling down whatever it finds.
4. Send yourself an email!  Note that the subject is treated as the file name,
   so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
   get what you expect.  Also, you must include the aforementioned secret
   string in every email so the fetcher knows that it's safe to import.
   Note that Paperless only imports emails whose subject consists of safe
   characters: alphanumeric characters and ``-_ ,.'``.
5. After a few minutes, the consumer will poll your mailbox, pull down the
   message, and place the attachment in the consumption directory with the
   appropriate name.  A few minutes later, the consumer will import it like any
   other file.

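To make the "send yourself an email" step concrete, here is a hedged sketch
that builds such a message with Python's standard library.  It assumes the
secret belongs in the message body (the steps only say it must be included in
the email), and the SMTP host, port, and addresses are placeholders you would
replace with your own.

.. code:: python

    import re
    import smtplib
    from email.message import EmailMessage

    # Alphanumeric characters plus the extra characters named in the steps.
    SAFE_SUBJECT = re.compile(r"[A-Za-z0-9\-_ ,.']+")


    def build_message(sender, recipient, subject, secret, filename, payload):
        """Build an email carrying exactly one PDF attachment."""
        if not SAFE_SUBJECT.fullmatch(subject):
            raise ValueError("subject contains characters Paperless may reject")
        message = EmailMessage()
        message["From"] = sender
        message["To"] = recipient
        message["Subject"] = subject
        # Assumption: the secret is read from the body of the message.
        message.set_content(secret)
        message.add_attachment(
            payload, maintype="application", subtype="pdf", filename=filename
        )
        return message


    def send(message, host="smtp.example.com", port=587,
             username=None, password=None):
        """Send the message; host, port and credentials are placeholders."""
        with smtplib.SMTP(host, port) as smtp:
            smtp.starttls()
            if username:
                smtp.login(username, password)
            smtp.send_message(message)

A call such as ``build_message("me@example.com", "paperless@example.com",
"Amazon - Receipt - tax,2017", "my-secret", "receipt.pdf", data)`` followed by
``send(...)`` would then deliver one attachment per email, matching the
limitation noted in this section.
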
.. _consumption-http:

HTTP POST
=========

You can also submit a document via HTTP POST, so long as you do so after
authenticating.  To push your document to Paperless, send an HTTP POST to the
server with the following name/value pairs:

* ``correspondent``: The name of the document's correspondent.  Note that there
  are restrictions on what characters you can use here.  Specifically,
  alphanumeric characters, ``-``, ``,``, ``.``, and ``'`` are ok, everything
  else is out.  You also can't use the sequence ``" - "`` (space, dash, space).
* ``title``: The title of the document.  The rules for characters are the same
  here as for the correspondent.
* ``document``: The file you're uploading.

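If you want to check a correspondent or title before sending it, a small
client-side helper along these lines may save a round trip.  This is only a
sketch of the character rules listed for these fields, not the check the
server performs; it assumes plain spaces are allowed (the example values in
this section contain them), and the name ``is_valid_field`` is invented here.

.. code:: python

    import re

    # Assumption: spaces are allowed; otherwise only alphanumerics
    # and the characters - , . ' are accepted.
    ALLOWED = re.compile(r"[A-Za-z0-9\-,.' ]+")


    def is_valid_field(value):
        """Return True if ``value`` looks acceptable as a correspondent or title."""
        if " - " in value:
            # The space-dash-space sequence is explicitly ruled out.
            return False
        return ALLOWED.fullmatch(value) is not None
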
Specify ``enctype="multipart/form-data"``, and then POST your file with::

    Content-Disposition: form-data; name="document"; filename="whatever.pdf"

An example of this in HTML is a typical form:

.. code:: html

    <form method="post" enctype="multipart/form-data">
        <input type="text" name="correspondent" value="My Correspondent" />
        <input type="text" name="title" value="My Title" />
        <input type="file" name="document" />
        <input type="submit" name="go" value="Do the thing" />
    </form>

But a potentially more useful way to do this would be in Python.  Here we use
the requests library to handle basic authentication and to send the POST data
to the URL.

.. code:: python

    import os

    import requests
    from requests.auth import HTTPBasicAuth

    # You authenticate via BasicAuth or with a session id.
    # We use BasicAuth here
    username = "my-username"
    password = "my-super-secret-password"

    # Where you have Paperless installed and listening
    url = "http://localhost:8000/push"

    # Document metadata
    correspondent = "Test Correspondent"
    title = "Test Title"

    # The local file you want to push
    path = "/path/to/some/directory/my-document.pdf"

    # Keep the file open only for the duration of the upload
    with open(path, "rb") as f:
        response = requests.post(
            url=url,
            data={"title": title, "correspondent": correspondent},
            files={"document": (os.path.basename(path), f, "application/pdf")},
            auth=HTTPBasicAuth(username, password),
            allow_redirects=False,
        )

    if response.status_code == 202:
        # Everything worked out ok
        print("Upload successful")
    else:
        # If you don't get a 202, it's probably because your credentials
        # are wrong or something.  This will give you a rough idea of what
        # happened.
        print("We got HTTP status code: {}".format(response.status_code))
        for k, v in response.headers.items():
            print("{}: {}".format(k, v))