mirror of
				https://github.com/paperless-ngx/paperless-ngx.git
				synced 2025-11-02 18:47:10 -05:00 
			
		
		
		
	Documented all of the guesswork Paperless does
This commit is contained in:
		
							parent
							
								
									aea9ea50e5
								
							
						
					
					
						commit
						54443fa808
					
				@ -3,6 +3,8 @@ Changelog
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
* 0.2.0
 | 
					* 0.2.0
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					  * `#89`_ Ported the auto-tagging code to correspondents as well.  Thanks to
 | 
				
			||||||
 | 
					    `Justin Snyman`_ for the pointers in the issue queue.
 | 
				
			||||||
  * Added support for guessing the date from the file name along with the
 | 
					  * Added support for guessing the date from the file name along with the
 | 
				
			||||||
    correspondent, title, and tags.  Thanks to `Tikitu de Jager`_ for his pull
 | 
					    correspondent, title, and tags.  Thanks to `Tikitu de Jager`_ for his pull
 | 
				
			||||||
    request that I took forever to merge and to `Pit`_ for his efforts on the
 | 
					    request that I took forever to merge and to `Pit`_ for his efforts on the
 | 
				
			||||||
@ -97,6 +99,7 @@ Changelog
 | 
				
			|||||||
.. _zedster: https://github.com/zedster
 | 
					.. _zedster: https://github.com/zedster
 | 
				
			||||||
.. _Martin Honermeyer: https://github.com/djmaze
 | 
					.. _Martin Honermeyer: https://github.com/djmaze
 | 
				
			||||||
.. _Tim White: https://github.com/timwhite
 | 
					.. _Tim White: https://github.com/timwhite
 | 
				
			||||||
 | 
					.. _Justin Snyman: https://github.com/stringlytyped
 | 
				
			||||||
 | 
					
 | 
				
			||||||
.. _#20: https://github.com/danielquinn/paperless/issues/20
 | 
					.. _#20: https://github.com/danielquinn/paperless/issues/20
 | 
				
			||||||
.. _#44: https://github.com/danielquinn/paperless/issues/44
 | 
					.. _#44: https://github.com/danielquinn/paperless/issues/44
 | 
				
			||||||
@ -110,4 +113,5 @@ Changelog
 | 
				
			|||||||
.. _#67: https://github.com/danielquinn/paperless/issues/67
 | 
					.. _#67: https://github.com/danielquinn/paperless/issues/67
 | 
				
			||||||
.. _#68: https://github.com/danielquinn/paperless/issues/68
 | 
					.. _#68: https://github.com/danielquinn/paperless/issues/68
 | 
				
			||||||
.. _#71: https://github.com/danielquinn/paperless/issues/71
 | 
					.. _#71: https://github.com/danielquinn/paperless/issues/71
 | 
				
			||||||
 | 
					.. _#89: https://github.com/danielquinn/paperless/issues/89
 | 
				
			||||||
.. _#94: https://github.com/danielquinn/paperless/issues/71
 | 
					.. _#94: https://github.com/danielquinn/paperless/issues/71
 | 
				
			||||||
 | 
				
			|||||||
@ -3,7 +3,7 @@
 | 
				
			|||||||
Consumption
 | 
					Consumption
 | 
				
			||||||
###########
 | 
					###########
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Once you've got *Paperless* setup, you need to start feeding documents into it.
 | 
					Once you've got Paperless set up, you need to start feeding documents into it.
 | 
				
			||||||
Currently, there are three options: the consumption directory, IMAP (email), and
 | 
					Currently, there are three options: the consumption directory, IMAP (email), and
 | 
				
			||||||
HTTP POST.
 | 
					HTTP POST.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@ -35,41 +35,6 @@ appropriate for your use and put some documents in there.  When you're ready,
 | 
				
			|||||||
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
 | 
					follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
.. _consumption-directory-naming:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
A Note on File Naming
 | 
					 | 
				
			||||||
---------------------
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
Any document you put into the consumption directory will be consumed, but if
 | 
					 | 
				
			||||||
you name the file right, it'll automatically set some values in the database
 | 
					 | 
				
			||||||
for you.  This is is the logic the consumer follows:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
1. Try to find the correspondent, title, and tags in the file name following
 | 
					 | 
				
			||||||
   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
 | 
					 | 
				
			||||||
   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
 | 
					 | 
				
			||||||
   ``YYYYMMDDZ``.  The ``Z`` is for "Zulu time" AKA "UTC".
 | 
					 | 
				
			||||||
2. If that doesn't work, we skip the date and try this pattern:
 | 
					 | 
				
			||||||
   the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
 | 
					 | 
				
			||||||
3. If that doesn't work, we try to find the correspondent and title in the file
 | 
					 | 
				
			||||||
   name following the pattern:  ``Correspondent - Title.pdf``.
 | 
					 | 
				
			||||||
4. If that doesn't work, just assume that the name of the file is the title.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
So given the above, the following examples would work as you'd expect:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 | 
					 | 
				
			||||||
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 | 
					 | 
				
			||||||
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 | 
					 | 
				
			||||||
* ``Another Company - Letter of Reference.jpg``
 | 
					 | 
				
			||||||
* ``Dad's Recipe for Pancakes.png``
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
These however wouldn't work:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 | 
					 | 
				
			||||||
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 | 
					 | 
				
			||||||
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 | 
					 | 
				
			||||||
* ``Another Company- Letter of Reference.jpg``
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
.. _consumption-imap:
 | 
					.. _consumption-imap:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
IMAP (Email)
 | 
					IMAP (Email)
 | 
				
			||||||
 | 
				
			|||||||
							
								
								
									
										85
									
								
								docs/guesswork.rst
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							@ -0,0 +1,85 @@
 | 
				
			|||||||
 | 
					.. _guesswork:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Guesswork
 | 
				
			||||||
 | 
					#########
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					During the consumption process, Paperless tries to guess some of the attributes
 | 
				
			||||||
 | 
					of the document it's looking at.  To do this it uses two approaches:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					.. _guesswork-naming:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					File Naming
 | 
				
			||||||
 | 
					===========
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Any document you put into the consumption directory will be consumed, but if
 | 
				
			||||||
 | 
					you name the file right, it'll automatically set some values in the database
 | 
				
			||||||
 | 
					for you.  This is the logic the consumer follows:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					1. Try to find the correspondent, title, and tags in the file name following
 | 
				
			||||||
 | 
					   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
 | 
				
			||||||
 | 
					   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
 | 
				
			||||||
 | 
					   ``YYYYMMDDZ``.  The ``Z`` refers to "Zulu time" AKA "UTC".
 | 
				
			||||||
 | 
					2. If that doesn't work, we skip the date and try this pattern:
 | 
				
			||||||
 | 
					   ``Correspondent - Title - tag,tag,tag.pdf``.
 | 
				
			||||||
 | 
					3. If that doesn't work, we try to find the correspondent and title in the file
 | 
				
			||||||
 | 
					   name following the pattern: ``Correspondent - Title.pdf``.
 | 
				
			||||||
 | 
					4. If that doesn't work, just assume that the name of the file is the title.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					So given the above, the following examples would work as you'd expect:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 | 
				
			||||||
 | 
					* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 | 
				
			||||||
 | 
					* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 | 
				
			||||||
 | 
					* ``Another Company - Letter of Reference.jpg``
 | 
				
			||||||
 | 
					* ``Dad's Recipe for Pancakes.png``
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					These however wouldn't work:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 | 
				
			||||||
 | 
					* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 | 
				
			||||||
 | 
					* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 | 
				
			||||||
 | 
					* ``Another Company- Letter of Reference.jpg``
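The four steps above can be sketched in Python as follows.  This is a minimal illustration, not the consumer's actual code: the function name ``guess_attributes``, the regular expressions, and the returned dictionary keys are all assumptions made for the example.  It also assumes that tags contain no spaces and that an invalid date falls through to the next step.

```python
import os
import re
from datetime import datetime, timezone

# Hypothetical patterns, tried in order; they mirror steps 1-3 of the list.
DATE = r"(?P<date>\d{14}Z|\d{8}Z)"
REST = r"(?P<correspondent>.+?) - (?P<title>.+)"
TAGS = r" - (?P<tags>[^ ]+)"
STEPS = [
    re.compile("^" + DATE + " - " + REST + TAGS + "$"),  # step 1
    re.compile("^" + REST + TAGS + "$"),                 # step 2
    re.compile("^" + REST + "$"),                        # step 3
]


def guess_attributes(filename):
    """Guess document attributes from a file name (illustrative only)."""
    stem, _extension = os.path.splitext(filename)
    # Step 4 is the default: the whole name (minus extension) is the title.
    guessed = {"created": None, "correspondent": None, "title": stem, "tags": []}
    for pattern in STEPS:
        match = pattern.match(stem)
        if match is None:
            continue
        found = match.groupdict()
        created = None
        if found.get("date"):
            # 14 digits plus "Z" is a full timestamp, 8 digits plus "Z" a date.
            fmt = "%Y%m%d%H%M%SZ" if len(found["date"]) == 15 else "%Y%m%dZ"
            try:
                created = datetime.strptime(found["date"], fmt)
            except ValueError:
                continue  # digits that are not a real date: try the next step
            created = created.replace(tzinfo=timezone.utc)
        guessed["created"] = created
        guessed["correspondent"] = found["correspondent"]
        guessed["title"] = found["title"]
        if found.get("tags"):
            guessed["tags"] = found["tags"].split(",")
        break
    return guessed
```

Under these assumptions, ``guess_attributes("Another Company - Letter of Reference.jpg")`` yields a correspondent and a title, while ``Another Company- Letter of Reference.jpg`` (no space before the hyphen) matches none of the patterns, so the whole name becomes the title.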
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					.. _guesswork-content:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Reading the Document Contents
 | 
				
			||||||
 | 
					=============================
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					After the consumer has tried to figure out what it could from the file name,
 | 
				
			||||||
 | 
					it starts looking at the content of the document itself.  It will compare the
 | 
				
			||||||
 | 
					matching algorithms defined by every tag and correspondent already set in your
 | 
				
			||||||
 | 
					database to see if they apply to the text in that document.  In other words,
 | 
				
			||||||
 | 
					if you defined a tag called ``Home Utility`` that had a ``match`` property of
 | 
				
			||||||
 | 
					``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
 | 
				
			||||||
 | 
					automatically tag your newly-consumed document with your ``Home Utility`` tag
 | 
				
			||||||
 | 
					so long as the text ``bc hydro`` appears in the body of the document somewhere.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					The matching logic is quite powerful, and supports searching the text of your
 | 
				
			||||||
 | 
					document with different algorithms, and as such, some experimentation may be
 | 
				
			||||||
 | 
					necessary to get things Just Right.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					.. _guesswork-content-howto:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					How Do I Set Up These Matching Algorithms?
 | 
				
			||||||
 | 
					------------------------------------------
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Setting up the algorithms is easily done through the admin interface.  When
 | 
				
			||||||
 | 
					you create a new correspondent or tag, there are optional fields for matching
 | 
				
			||||||
 | 
					text and matching algorithm.  From the help info there:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					.. note::
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    Which algorithm you want to use when matching text to the OCR'd PDF.  Here,
 | 
				
			||||||
 | 
					    "any" looks for any occurrence of any word provided in the PDF, while "all"
 | 
				
			||||||
 | 
					    requires that every word provided appear in the PDF, albeit not in the
 | 
				
			||||||
 | 
					    order provided.  A "literal" match means that the text you enter must
 | 
				
			||||||
 | 
					    appear in the PDF exactly as you've entered it, and "regular expression"
 | 
				
			||||||
 | 
					    uses a regex to match the PDF.  If you don't know what a regex is, you
 | 
				
			||||||
 | 
					    probably don't want this option.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Then just save your tag/correspondent and run another document through the
 | 
				
			||||||
 | 
					consumer.  Once complete, you should see the newly-created document,
 | 
				
			||||||
 | 
					automatically tagged with the appropriate data.
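The four algorithms described in the note above ("any", "all", "literal", and "regular expression") can be sketched like this.  It is a rough illustration rather than Paperless's own implementation: the function ``matches`` and its parameters are hypothetical, and both the case-insensitive comparison and the whole-word matching are assumptions made for the example.

```python
import re


def matches(match_text, algorithm, document_text):
    """Return True if match_text applies to document_text (illustrative only)."""
    text = document_text.lower()  # assumption: matching ignores case
    words = match_text.lower().split()

    def has_word(word):
        # Assumption: a word only counts when it appears as a whole word.
        return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

    if algorithm == "any":
        # Any one of the provided words is enough.
        return any(has_word(word) for word in words)
    if algorithm == "all":
        # Every provided word must appear, in any order.
        return all(has_word(word) for word in words)
    if algorithm == "literal":
        # The text must appear exactly as entered.
        return match_text.lower() in text
    if algorithm == "regex":
        # The match text is itself a regular expression.
        return re.search(match_text, document_text, re.IGNORECASE) is not None
    raise ValueError("unknown matching algorithm: " + algorithm)
```

With a document body of ``Your BC Hydro bill for March is attached.``, a ``match`` of ``bc hydro`` succeeds with the "literal" algorithm, whereas ``hydro bc`` only succeeds with "all" or "any", because the words appear in a different order.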
 | 
				
			||||||
@ -32,6 +32,7 @@ Contents
 | 
				
			|||||||
   consumption
 | 
					   consumption
 | 
				
			||||||
   api
 | 
					   api
 | 
				
			||||||
   utilities
 | 
					   utilities
 | 
				
			||||||
 | 
					   guesswork
 | 
				
			||||||
   migrating
 | 
					   migrating
 | 
				
			||||||
   troubleshooting 
 | 
					   troubleshooting
 | 
				
			||||||
   changelog
 | 
					   changelog
 | 
				
			||||||
 | 
				
			|||||||
@ -52,9 +52,12 @@ for PDF files to parse and index.  The process is pretty straightforward:
 | 
				
			|||||||
   wait 10 seconds and try again.
 | 
					   wait 10 seconds and try again.
 | 
				
			||||||
2. Parse the PDF with Tesseract
 | 
					2. Parse the PDF with Tesseract
 | 
				
			||||||
3. Create a new record in the database with the OCR'd text
 | 
					3. Create a new record in the database with the OCR'd text
 | 
				
			||||||
4. Encrypt the PDF and store it in the ``media`` directory under
 | 
					4. Attempt to automatically assign document attributes by doing some guesswork.
 | 
				
			||||||
 | 
					   Read up on the :ref:`guesswork documentation<guesswork>` for more
 | 
				
			||||||
 | 
					   information about this process.
 | 
				
			||||||
 | 
					5. Encrypt the PDF and store it in the ``media`` directory under
 | 
				
			||||||
   ``documents/pdf``.
 | 
					   ``documents/pdf``.
 | 
				
			||||||
5. Go to #1.
 | 
					6. Go to #1.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
.. _utilities-consumer-howto:
 | 
					.. _utilities-consumer-howto:
 | 
				
			||||||
 | 
				
			|||||||