AUR (en) - ocrmypdf

Package Details: ocrmypdf 16.10.1-1

Git Clone URL:	https://aur.archlinux.org/ocrmypdf.git (read-only)
Package Base:	ocrmypdf
Description:	A tool to add an OCR text layer to scanned PDF files, allowing them to be searched
Upstream URL:	https://github.com/ocrmypdf/OCRmyPDF
Licenses:	MPL2
Submitter:	dreuter
Maintainer:	fbrennan (pigmonkey)
Last Packager:	pigmonkey
Votes:	132
Popularity:	3.08
First Submitted:	2014-01-27 11:36 (UTC)
Last Updated:	2025-04-25 03:37 (UTC)

Pinned Comments

fbrennan commented on 2023-05-12 22:54 (UTC)

The flag was invalid and has been removed with no action taken as no new version was released. There's nothing to do for this package; no new release has been made. Rebuild, as @eclairevoyant has said.

Latest Comments

jorges commented on 2020-08-05 19:49 (UTC)

Thanks for the explanation! I just got rid of python-pdfminer.six from AUR and downgraded python-pdfminer to 20200517-1. OCRMyPDF works and all is well!

pigmonkey commented on 2020-07-29 17:39 (UTC)

It's a little convoluted, but here is what I think is happening:

The confusingly-named python-pdfminer from community that we use is in fact python-pdfminer.six. You can verify that by looking at its PKGBUILD. The AUR python-pdfminer.six is basically the same package, except it pulls from PyPI instead of GitHub and is on an outdated version (20200124 instead of community's 20200720).

OCRMyPDF claims to support 20200720, but that version of python-pdfminer{,.six} dropped PDFTextExtractionNotAllowed. This apparently was unintentional and has been reversed in 20200726. But as of now 20200726 has not been officially tagged.

So, we need to wait for upstream python-pdfminer.six to make 20200726 official, and then wait for the community maintainer to update the python-pdfminer package to 20200726. And then we need to wait for upstream OCRMyPDF to release a new version that notes support for 20200726. Then I can update this package and everything will be copacetic.

In the meantime, you can downgrade the community python-pdfminer package to the previous version, or run the much older version provided by the AUR python-pdfminer.six package.
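
For illustration only, here is a minimal Python sketch of how code can tolerate a name that is missing from one release of a dependency, as happened with PDFTextExtractionNotAllowed in 20200720. This is not what OCRMyPDF does; the helper import_or_fallback and the local stand-in exception class are hypothetical names made up for this example.

```python
import importlib


def import_or_fallback(module_name, attr, fallback):
    """Return module_name.attr if it can be imported, else the fallback.

    Hypothetical helper for illustration; not part of pdfminer or OCRMyPDF.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing (ModuleNotFoundError is a subclass).
        return fallback
    # The module exists but may no longer define the name.
    return getattr(module, attr, fallback)


class _LocalPDFTextExtractionNotAllowed(Exception):
    """Stand-in used only when the installed pdfminer lacks the real class."""


# Assumption: the class lives in pdfminer.pdfdocument, as the traceback shows.
PDFTextExtractionNotAllowed = import_or_fallback(
    "pdfminer.pdfdocument",
    "PDFTextExtractionNotAllowed",
    _LocalPDFTextExtractionNotAllowed,
)
```

With a shim like this, the import at the top of a module would not fail outright on a release that dropped the name, though any code catching the stand-in would simply never see it raised.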

jorges commented on 2020-07-29 11:18 (UTC) (edited on 2020-07-29 11:22 (UTC) by jorges)

I was getting the traceback shown below with python-pdfminer. I was able to solve the problem by removing that package and installing python-pdfminer.six. If other people can confirm this, maybe the package dependencies have to be changed?

$ ocrmypdf 
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==10.3.1', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/lib/python3.8/site-packages/ocrmypdf/__init__.py", line 21, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/__init__.py", line 19, in <module>
    from ocrmypdf.pdfinfo.info import Colorspace, Encoding, PdfInfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/info.py", line 37, in <module>
    from ocrmypdf.pdfinfo.layout import get_page_analysis, get_text_boxes
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/layout.py", line 29, in <module>
    from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfdocument' (/usr/lib/python3.8/site-packages/pdfminer/pdfdocument.py)
(ins)[jscandal@lhasa .aur_bb]$ python
Python 3.8.5 (default, Jul 27 2020, 08:42:51) 
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(ins)>>> from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfdocument' (/usr/lib/python3.8/site-packages/pdfminer/pdfdocument.py)

bsdice commented on 2020-07-23 12:03 (UTC) (edited on 2020-07-23 12:03 (UTC) by bsdice)

Anybody else getting tracebacks when using --threshold?

An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 195, in exec_page_sync
    ocr_image_out = create_ocr_image(ocr_image, page_context)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_pipeline.py", line 544, in create_ocr_image
    dpi = tuple(round(coord) for coord in im.info['dpi'])
KeyError: 'dpi'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 356, in run_pipeline
    exec_concurrent(context)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 267, in exec_concurrent
    exec_progress_pool(
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_concurrent.py", line 108, in exec_progress_pool
    result = results.next()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
KeyError: 'dpi'
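
For context, the failing line indexes im.info['dpi'] directly, which raises KeyError whenever the image metadata carries no DPI entry. A minimal sketch of a more defensive read is shown below, assuming a plain dict in place of the image's info mapping. The function name read_dpi and the 300 DPI default are hypothetical choices for this example, not upstream's actual fix.

```python
def read_dpi(info, default=(300, 300)):
    """Return the rounded (x, y) DPI from an image info mapping.

    Falls back to `default` when the mapping has no 'dpi' key instead of
    raising KeyError. The default value is an assumption for illustration.
    """
    dpi = info.get("dpi", default)
    return tuple(round(coord) for coord in dpi)


# An image that records its resolution, and one that does not.
with_dpi = read_dpi({"dpi": (299.9, 300.2)})
without_dpi = read_dpi({})
```

Whether a silent default is appropriate depends on the pipeline; a wrong DPI changes the physical page size, so raising a clearer error may be the better choice.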

marlemion commented on 2020-07-16 06:47 (UTC) (edited on 2020-07-16 12:30 (UTC) by marlemion)

Never mind the below. For some reason, some files were missing from my system.

Fully updated arch and updated ocrmypdf to the latest via AUR:

ocrmypdf
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==10.2.0', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/lib/python3.8/site-packages/ocrmypdf/__init__.py", line 21, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/__init__.py", line 19, in <module>
    from ocrmypdf.pdfinfo.info import Colorspace, Encoding, PdfInfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/info.py", line 37, in <module>
    from ocrmypdf.pdfinfo.layout import get_page_analysis, get_text_boxes
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/layout.py", line 24, in <module>
    import pdfminer.encodingdb
ModuleNotFoundError: No module named 'pdfminer.encodingdb'

Packages:

pakku -Ss pdfminer
community/python-pdfminer 20200517-1 [installed]
    Python PDF Parser
aur/pdfminer 20191125-1 [20 / 0.157511]
    python3 utils to extract, analyze text data of PDF files. Includes pdf2txt, dumppdf, and latin2ascii
aur/pdfminer-git r480.14fd0fd-1 [3 / 0.000000]
    python utils to extract& analyze text data of PDF files.
aur/pdfminer3k 1.3.1-1 [0 / 0.000000]
    A python3 port of pdfminer
aur/python-pdfminer.six 20200124-1 [6 / 0.013772]
    PDF parser and analyzer for Python

What is the problem?

xuanruiqi commented on 2020-07-03 02:28 (UTC)

Now that python-pillow in community has been updated to 7.2.0, the block on updating this package should no longer exist.

pigmonkey commented on 2020-06-14 18:58 (UTC)

I pinged the python-pillow packager. The package had simply fallen through the cracks and he will be updating it today, but 7.0 introduced some API breakage so the upgraded package will probably hang out in the testing repo for a bit.

fbrennan commented on 2020-06-13 01:03 (UTC)

It might make sense to put in an orphan request for python-pillow-git, then update that, then temporarily require it, @pigmonkey, given how long the community package has been out of date. Though it's of course up to you, as it might be too much work.

jbarlow commented on 2020-06-13 00:41 (UTC)

Upstream here. I noticed python-pillow in AUR is quite old so this could be a blocker for some time.

ocrmypdf does work with pillow 6.2.1, with all tests passing. You could override the requirement and permit the earlier pillow. (I'd rather not change this upstream, so that upstream reflects the configuration that is being tested.)

On another note, I strongly doubt that pillow-simd would yield any measurable change in performance so it would not be worth the effort to integrate this.

pigmonkey commented on 2020-06-12 21:45 (UTC)

This package is stuck on 9.8.2 until the community python-pillow package is upgraded to >=7.0.0.
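
As a rough illustration of the >=7.0.0 constraint discussed in this thread, the sketch below compares dotted version strings as integer tuples. The function names are hypothetical, and the comparison assumes purely numeric versions with the same number of components; real packaging tools handle many more cases.

```python
def parse_version(text):
    """Split a dotted numeric version such as '7.2.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Versions mentioned in the comments: 6.2.1 and 7.2.0.
old_ok = satisfies_minimum("6.2.1", "7.0.0")
new_ok = satisfies_minimum("7.2.0", "7.0.0")
```

Under this check, 6.2.1 fails the constraint and 7.2.0 passes it, which matches why the update was blocked until the community package moved to 7.2.0.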