mirror of
				https://github.com/paperless-ngx/paperless-ngx.git
				synced 2025-10-26 16:22:35 -04:00 
			
		
		
		
	- rename PAPERLESS_TIKA to PAPERLESS_TIKA_ENABLED - all other env params now start with PAPERLESS_TIKA - convert_to_pdf as class instance method - smaller details Signed-off-by: Jo Vandeginste <Jo.Vandeginste@kuleuven.be>
		
			
				
	
	
		
			460 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
.. _configuration:

*************
Configuration
*************

Paperless provides a wide range of customizations.
Depending on how you run paperless, these settings have to be defined in different
places.

*   If you run paperless on docker, ``paperless.conf`` is not used. Rather, configure
    paperless by copying necessary options to ``docker-compose.env``.
*   If you are running paperless on anything else, paperless will search for the
    configuration file in these locations and use the first one it finds:

    .. code::

        /path/to/paperless/paperless.conf
        /etc/paperless.conf
        /usr/local/etc/paperless.conf

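The "first one it finds" rule for the configuration file can be pictured with a short Python sketch. This is only an illustration of the search order listed above, not paperless' actual loading code; the helper name ``find_config`` is hypothetical, and the first candidate is the same placeholder path used in the list.

```python
from pathlib import Path
from typing import Iterable, Optional

# Search order as documented; the first entry is a placeholder for the
# directory paperless is installed in.
CANDIDATES = [
    "/path/to/paperless/paperless.conf",
    "/etc/paperless.conf",
    "/usr/local/etc/paperless.conf",
]


def find_config(candidates: Iterable[str] = CANDIDATES) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None.

    Hypothetical helper: later candidates are ignored as soon as one matches.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None
```

A consequence of this rule is that a stale ``paperless.conf`` in an earlier location shadows any file in a later one.
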

Required services
#################

PAPERLESS_REDIS=<url>
    This is required for processing scheduled tasks such as email fetching, index
    optimization and for training the automatic document matcher.

    Defaults to redis://localhost:6379.

PAPERLESS_DBHOST=<hostname>
    By default, SQLite is used as the database backend. This can be changed here.
    Set PAPERLESS_DBHOST and PostgreSQL will be used instead of SQLite.

PAPERLESS_DBPORT=<port>
    Adjust port if necessary.

    Default is 5432.

PAPERLESS_DBNAME=<name>
    Database name in PostgreSQL.

    Defaults to "paperless".

PAPERLESS_DBUSER=<name>
    Database user in PostgreSQL.

    Defaults to "paperless".

PAPERLESS_DBPASS=<password>
    Database password for PostgreSQL.

    Defaults to "paperless".

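How the database options interact can be summarised in a few lines of Python. The sketch below only restates the rules and defaults documented above; ``database_settings`` and the keys of the returned dictionary are made up for illustration and are not the names used in paperless' own settings module.

```python
import os
from typing import Mapping


def database_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper that mirrors the documented defaults."""
    host = env.get("PAPERLESS_DBHOST")
    if not host:
        # No host configured: the default SQLite backend stays in use and
        # the remaining PAPERLESS_DB* options are irrelevant.
        return {"backend": "sqlite"}
    return {
        "backend": "postgresql",
        "host": host,
        "port": int(env.get("PAPERLESS_DBPORT", "5432")),
        "name": env.get("PAPERLESS_DBNAME", "paperless"),
        "user": env.get("PAPERLESS_DBUSER", "paperless"),
        "password": env.get("PAPERLESS_DBPASS", "paperless"),
    }
```

In other words, setting only ``PAPERLESS_DBHOST`` is enough to switch backends, as long as the PostgreSQL server accepts the default port, database name, user and password.
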

Paths and folders
#################

PAPERLESS_CONSUMPTION_DIR=<path>
    This is where your documents should go to be consumed. Make sure that it exists
    and that the user running the paperless service can read/write its contents
    before you start Paperless.

    Don't change this when using docker, as it only changes the path within the
    container. Change the local consumption directory in the docker-compose.yml
    file instead.

    Defaults to "../consume", relative to the "src" directory.

PAPERLESS_DATA_DIR=<path>
    This is where paperless stores all its data (search index, SQLite database,
    classification model, etc).

    Defaults to "../data", relative to the "src" directory.

PAPERLESS_MEDIA_ROOT=<path>
    This is where your documents and thumbnails are stored.

    You can set this and PAPERLESS_DATA_DIR to the same folder to have paperless
    store all its data within the same volume.

    Defaults to "../media", relative to the "src" directory.

PAPERLESS_STATICDIR=<path>
    Override the default STATIC_ROOT here. This is where all static files
    created using the "collectstatic" management command are stored.

    Unless you're doing something fancy, there is no need to override this.

    Defaults to "../static", relative to the "src" directory.

PAPERLESS_FILENAME_FORMAT=<format>
    Changes the filenames paperless uses to store documents in the media directory.
    See :ref:`advanced-file_name_handling` for details.

    Default is none, which disables this feature.

Hosting & Security
##################

PAPERLESS_SECRET_KEY=<key>
    Paperless uses this to make session tokens. If you expose paperless on the
    internet, you need to change this, since the default secret is well known.

    Use any sequence of characters. The more, the better. You don't need to
    remember this. Just face-roll your keyboard.

    Default is listed in the file ``src/paperless/settings.py``.

PAPERLESS_ALLOWED_HOSTS=<comma-separated-list>
    If you're planning on putting Paperless on the open internet, then you
    really should set this value to the domain name you're using. Failing to do
    so leaves you open to HTTP host header attacks:
    https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation

    Just remember that this is a comma-separated list, so "example.com" is fine,
    as is "example.com,www.example.com", but NOT " example.com" or "example.com,".

    Defaults to "*", which is all hosts.

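The warning about stray spaces and trailing commas is easiest to see by splitting a value on commas without any trimming. The Python sketch below assumes that the value is split exactly that way, which is an assumption about the implementation; ``split_hosts`` and ``problems`` are hypothetical helpers you could use to check a value before putting it into your configuration.

```python
from typing import List


def split_hosts(value: str) -> List[str]:
    """Split on commas only, without trimming (a deliberately naive sketch)."""
    return value.split(",")


def problems(value: str) -> List[str]:
    """Report entries that are empty or carry surrounding whitespace."""
    issues = []
    for entry in split_hosts(value):
        if entry == "":
            issues.append("empty entry (stray comma?)")
        elif entry != entry.strip():
            issues.append(f"whitespace around {entry!r}")
    return issues
```

With this check, ``problems("example.com,www.example.com")`` reports nothing, while ``" example.com"`` is flagged for its leading space and ``"example.com,"`` for the empty entry after the trailing comma.
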
PAPERLESS_CORS_ALLOWED_HOSTS=<comma-separated-list>
    You need to add your servers to the list of allowed hosts that can do CORS
    calls. Set this to your public domain name.

    Defaults to "http://localhost:8000".

PAPERLESS_FORCE_SCRIPT_NAME=<path>
    To host paperless under a subpath URL like example.com/paperless you set
    this value to /paperless. No trailing slash!

    .. note::

        This setting has not been verified to work in paperless-ng and
        probably does not.

    Defaults to none, which hosts paperless at "/".

PAPERLESS_STATIC_URL=<path>
    Override the STATIC_URL here. Unless you're hosting Paperless off a
    subpath like /paperless/, you probably don't need to change this.

    Defaults to "/static/".

PAPERLESS_AUTO_LOGIN_USERNAME=<username>
    Specify a username here so that paperless will automatically perform login
    with the selected user.

    .. danger::

        Do not use this when exposing paperless on the internet. There are no
        checks in place that would prevent you from doing this.

    Defaults to none, which disables this feature.


PAPERLESS_COOKIE_PREFIX=<str>
    Specify a prefix that is added to the cookies used by paperless to identify
    the currently logged in user. This is useful for when you're running two
    instances of paperless on the same host.

    After changing this, you will have to log in again.

    Defaults to ``""``, which does not alter the cookie names.

.. _configuration-ocr:

OCR settings
############

Paperless uses `OCRmyPDF <https://ocrmypdf.readthedocs.io/en/latest/>`_ for
performing OCR on documents and images. Paperless uses sensible defaults for
most settings, but all of them can be configured to your needs.


PAPERLESS_OCR_LANGUAGE=<lang>
    Customize the language that paperless will attempt to use when
    parsing documents.

    It should be a 3-letter language code consistent with ISO
    639: https://www.loc.gov/standards/iso639-2/php/code_list.php

    Set this to the language most of your documents are written in.

    This can be a combination of multiple languages such as ``deu+eng``,
    in which case tesseract will use whatever language matches best.
    Keep in mind that tesseract uses much more CPU time with multiple
    languages enabled.

    Defaults to "eng".

PAPERLESS_OCR_MODE=<mode>
    Tell paperless when and how to perform OCR on your documents. Four modes
    are available:

    *   ``skip``: Paperless skips all pages that already contain text and
        performs OCR only on pages where no text is present. This is the
        safest option.
    *   ``skip_noarchive``: In addition to skip, paperless won't create an
        archived version of your documents when it finds any text in them.
        This is useful if you don't want to have two almost-identical versions
        of your digital documents in the media folder. This is the fastest option.
    *   ``redo``: Paperless will OCR all pages of your documents and attempt to
        replace any existing text layers with new text. This will be useful for
        documents from scanners that already performed OCR with insufficient
        results. It will also perform OCR on purely digital documents.

        This option may fail on some documents that have features that cannot
        be removed, such as forms. In this case, the text from the document is
        used instead.
    *   ``force``: Paperless rasterizes your documents, converting any text
        into images, and puts the OCRed text on top. This works for all documents,
        however, the resulting document may be significantly larger and text
        won't appear as sharp when zoomed in.

    The default is ``skip``, which only performs OCR when necessary and always
    creates archived documents.

PAPERLESS_OCR_OUTPUT_TYPE=<type>
    Specify the type of PDF documents that paperless should produce.

    *   ``pdf``: Modify the PDF document as little as possible.
    *   ``pdfa``: Convert PDF documents into PDF/A-2b documents, which is a
        subset of the entire PDF specification and meant for storing
        documents long term.
    *   ``pdfa-1``, ``pdfa-2``, ``pdfa-3``: Specify the exact version of
        PDF/A you wish to use.

    If not specified, ``pdfa`` is used. Remember that paperless keeps
    the original input file as well as the archived version.


PAPERLESS_OCR_PAGES=<num>
    Tells paperless to use only the specified number of pages for OCR. Documents
    with fewer than the specified number of pages get OCR'ed completely.

    Specifying 1 here will only use the first page.

    When combined with ``PAPERLESS_OCR_MODE=redo`` or ``PAPERLESS_OCR_MODE=force``,
    paperless will not modify any text it finds on excluded pages and copies it
    verbatim.

    Defaults to 0, which disables this feature and always uses all pages.

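The effect of ``PAPERLESS_OCR_PAGES`` on a single document can be expressed as a small Python function. This is a sketch of the documented behaviour only, under the assumption that the limit counts pages from the start of the document (as the "first page" remark suggests); ``pages_to_ocr`` is a hypothetical name, not part of paperless.

```python
def pages_to_ocr(page_count: int, ocr_pages: int = 0) -> range:
    """Return the 1-based page numbers that would be handed to OCR.

    ocr_pages == 0 disables the limit, and documents shorter than the
    limit are processed completely.
    """
    if ocr_pages <= 0 or ocr_pages >= page_count:
        return range(1, page_count + 1)
    return range(1, ocr_pages + 1)
```

For a ten-page document, a limit of 1 selects only page 1, while the default of 0 selects pages 1 through 10.
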

PAPERLESS_OCR_IMAGE_DPI=<num>
    Paperless will OCR any images you put into the system and convert them
    into PDF documents. This is useful if your scanner produces images.
    In order to do so, paperless needs to know the DPI of the image.
    Most images from scanners will have this information embedded and
    paperless will detect and use that information. In case this fails, it
    uses this value as a fallback.

    Set this to the DPI your scanner produces images at.

    Default is none, which causes paperless to fail if no DPI information is
    present in an image.


PAPERLESS_OCR_USER_ARG=<json>
    OCRmyPDF offers many more options. Use this parameter to specify any
    additional arguments you wish to pass to OCRmyPDF. Since Paperless uses
    the API of OCRmyPDF, you have to specify these in a format that can be
    passed to the API. See `the API reference of OCRmyPDF <https://ocrmypdf.readthedocs.io/en/latest/api.html#reference>`_
    for valid parameters. All command line options are supported, but they
    use underscores instead of dashes.

    .. caution::

        Paperless has been tested to work with the OCR options provided
        above. There are many options that are incompatible with each other,
        so specifying invalid options may prevent paperless from consuming
        any documents.

    Specify arguments as a JSON dictionary. Note the lowercase booleans
    and double-quoted parameter names and strings. Example:

    .. code:: json

        {"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}

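Because the value of ``PAPERLESS_OCR_USER_ARG`` has to be valid JSON, it can be worth checking it with Python's standard ``json`` module before restarting paperless. The sketch below is not how paperless itself reads the setting; ``parse_user_args`` and ``cli_to_api_name`` are hypothetical helpers that illustrate the two rules stated above (JSON dictionary syntax, and underscores instead of dashes).

```python
import json


def parse_user_args(raw: str) -> dict:
    """Parse the value as JSON and insist on a dictionary at the top level."""
    parsed = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(parsed, dict):
        raise ValueError("PAPERLESS_OCR_USER_ARG must be a JSON dictionary")
    return parsed


def cli_to_api_name(option: str) -> str:
    """Turn a command line flag such as --unpaper-args into unpaper_args."""
    return option.lstrip("-").replace("-", "_")
```

A Python-style value such as ``{'deskew': True}`` is rejected by ``json.loads``, because JSON requires double quotes and the lowercase literal ``true``.
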
.. _configuration-tika:

Tika settings
#############

Paperless can make use of `Tika <https://tika.apache.org/>`_ and
`Gotenberg <https://thecodingmachine.github.io/gotenberg/>`_ for parsing and
converting "Office" documents (such as ".doc", ".xlsx" and ".odt"). If you
wish to use this, you must provide a Tika server and a Gotenberg server,
configure their endpoints, and enable the feature.

If you run paperless on docker, you can add those services to the docker-compose
file (see the examples provided).

PAPERLESS_TIKA_ENABLED=<bool>
    Enable (or disable) the Tika parser.

    Defaults to false.

PAPERLESS_TIKA_ENDPOINT=<url>
    Set the endpoint URL where Paperless can reach your Tika server.

    Defaults to "http://localhost:9998".

PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>
    Set the endpoint URL where Paperless can reach your Gotenberg server.

    Defaults to "http://localhost:3000".


Software tweaks
###############

PAPERLESS_TASK_WORKERS=<num>
    Paperless does multiple things in the background: Maintain the search index,
    maintain the automatic matching algorithm, check emails, consume documents,
    etc. This variable specifies how many things it will do in parallel.


PAPERLESS_THREADS_PER_WORKER=<num>
    Furthermore, paperless uses multiple threads when consuming documents to
    speed up OCR. This variable specifies how many pages paperless will process
    in parallel on a single document.

    .. caution::

        Ensure that the product

            PAPERLESS_TASK_WORKERS * PAPERLESS_THREADS_PER_WORKER

        does not exceed your CPU core count or else paperless will be extremely slow.
        If you want paperless to process many documents in parallel, choose a high
        worker count. If you want paperless to process very large documents faster,
        use a higher threads-per-worker count.

    The default is a balance between the two, according to your CPU core count,
    with a slight favor towards threads per worker, and using as many cores as
    possible.

    If you only specify PAPERLESS_TASK_WORKERS, paperless will adjust
    PAPERLESS_THREADS_PER_WORKER automatically.

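The rule from the caution above (workers times threads per worker should not exceed the core count) is simple to check before you commit to a pair of values. The following Python sketch is a hypothetical helper for that check; it does not reproduce the algorithm paperless uses to pick its defaults.

```python
import os
from typing import Optional


def oversubscribed(workers: int, threads_per_worker: int,
                   cores: Optional[int] = None) -> bool:
    """Return True if workers * threads_per_worker exceeds the core count."""
    if cores is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cores = os.cpu_count() or 1
    return workers * threads_per_worker > cores
```

On a four-core machine, 2 workers with 2 threads each fit exactly, while 3 workers with 2 threads each would oversubscribe the CPU.
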

PAPERLESS_TIME_ZONE=<timezone>
    Set the time zone here.
    See https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE
    for details on how to set it.

    Defaults to UTC.


PAPERLESS_CONSUMER_POLLING=<num>
    If paperless does not find documents added to your consume folder, it might
    not be able to automatically detect filesystem changes. In that case,
    specify a polling interval in seconds here, which will then cause paperless
    to periodically check your consumption directory for changes.

    Defaults to 0, which disables polling and uses filesystem notifications.


PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>
    When the consumer detects a duplicate document, it will not touch the
    original document. This default behavior can be changed here.

    Defaults to false.


PAPERLESS_CONSUMER_RECURSIVE=<bool>
    Enable recursive watching of the consumption directory. Paperless will
    then pick up files in subdirectories within your consumption
    directory as well.

    Defaults to false.


PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=<bool>
    Set the names of subdirectories as tags for consumed files.
    E.g. <CONSUMPTION_DIR>/foo/bar/file.pdf will add the tags "foo" and "bar" to
    the consumed file. Paperless will create any tags that don't exist yet.

    PAPERLESS_CONSUMER_RECURSIVE must be enabled for this to work.

    Defaults to false.

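The mapping from subdirectories to tags can be illustrated with ``pathlib``. The function below is a sketch of the rule described above, not the consumer's actual code; ``tags_from_path`` is a hypothetical name, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath
from typing import List


def tags_from_path(consumption_dir: str, file_path: str) -> List[str]:
    """Return the directory names between the consumption dir and the file."""
    relative = PurePosixPath(file_path).relative_to(consumption_dir)
    # Drop the last component, which is the file name itself.
    return list(relative.parts[:-1])
```

A file placed directly in the consumption directory yields no tags, while ``<CONSUMPTION_DIR>/foo/bar/file.pdf`` yields ``foo`` and ``bar``.
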

PAPERLESS_CONVERT_MEMORY_LIMIT=<num>
    On smaller systems, or even in the case of Very Large Documents, the consumer
    may explode, complaining about how it's "unable to extend pixel cache". In
    such cases, try setting this to a reasonably low value, like 32. The
    default is to use whatever is necessary to do everything without writing to
    disk, and units are in megabytes.

    For more information on how to use this value, you should search
    the web for "MAGICK_MEMORY_LIMIT".

    Defaults to 0, which disables the limit.

PAPERLESS_CONVERT_TMPDIR=<path>
    Similar to the memory limit, if you've got a small system and your OS mounts
    /tmp as tmpfs, you should set this to a path that's on a physical disk, like
    /home/your_user/tmp or something. ImageMagick will use this as scratch space
    when crunching through very large documents.

    For more information on how to use this value, you should search
    the web for "MAGICK_TMPDIR".

    Default is none, which disables the temporary directory.

PAPERLESS_OPTIMIZE_THUMBNAILS=<bool>
    Use optipng to optimize thumbnails. This usually reduces the size of
    thumbnails by about 20%, but uses considerable compute time during
    consumption.

    Defaults to true.

PAPERLESS_POST_CONSUME_SCRIPT=<filename>
    After a document is consumed, Paperless can trigger an arbitrary script if
    you like. This script will be passed a number of arguments for you to work
    with. For more information, take a look at :ref:`advanced-post_consume_script`.

    The default is blank, which means nothing will be executed.

PAPERLESS_FILENAME_DATE_ORDER=<format>
    Paperless will check the document text for document date information.
    Use this setting to enable checking the document filename for date
    information. The date order can be set to any option as specified in
    https://dateparser.readthedocs.io/en/latest/settings.html#date-order.
    The filename will be checked first, and if nothing is found, the document
    text will be checked as normal.

    Defaults to none, which disables this feature.

PAPERLESS_THUMBNAIL_FONT_NAME=<filename>
    Paperless creates thumbnails for plain text files by rendering the content
    of the file on an image and uses a predefined font for that. This
    font can be changed here.

    Note that this won't have any effect on already generated thumbnails.

    Defaults to ``/usr/share/fonts/liberation/LiberationSerif-Regular.ttf``.


Binaries
########

There are a few external software packages that Paperless expects to find on
your system when it starts up. Unless you've done something creative with
their installation, you probably won't need to edit any of these. However,
if you've installed these programs somewhere where simply typing the name of
the program doesn't automatically execute it (i.e. the program isn't in your
$PATH), then you'll need to specify the literal path for that program.

PAPERLESS_CONVERT_BINARY=<path>
    Defaults to "/usr/bin/convert".

PAPERLESS_GS_BINARY=<path>
    Defaults to "/usr/bin/gs".

PAPERLESS_OPTIPNG_BINARY=<path>
    Defaults to "/usr/bin/optipng".
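
To find out whether one of these settings needs to be overridden on your system, you can check whether the configured (or default) path points to an executable. The Python sketch below uses ``shutil.which`` from the standard library for that; ``resolve_binary`` and ``DEFAULTS`` are hypothetical names for this illustration, and the check is not something paperless is documented to perform itself.

```python
import os
import shutil
from typing import Mapping, Optional

# Defaults as documented above.
DEFAULTS = {
    "PAPERLESS_CONVERT_BINARY": "/usr/bin/convert",
    "PAPERLESS_GS_BINARY": "/usr/bin/gs",
    "PAPERLESS_OPTIPNG_BINARY": "/usr/bin/optipng",
}


def resolve_binary(setting: str,
                   env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a usable executable path for the setting, or None."""
    configured = env.get(setting, DEFAULTS[setting])
    # shutil.which accepts bare names (looked up on PATH) as well as full
    # paths, and returns None if nothing executable is found.
    return shutil.which(configured)
```

If ``resolve_binary("PAPERLESS_GS_BINARY")`` returns ``None``, Ghostscript is not at the default location and the setting should point to wherever it is installed.
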