Package Details: ocrmypdf 13.5.0-1

Git Clone URL: https://aur.archlinux.org/ocrmypdf.git (read-only)
Package Base: ocrmypdf
Description: A tool to add an OCR text layer to scanned PDF files, allowing them to be searched
Upstream URL: https://github.com/jbarlow83/OCRmyPDF
Licenses: MPL2
Submitter: dreuter
Maintainer: fbrennan (pigmonkey)
Last Packager: pigmonkey
Votes: 71
Popularity: 1.64
First Submitted: 2014-01-27 11:36 (UTC)
Last Updated: 2022-06-23 23:47 (UTC)

Latest Comments

zcc2xj commented on 2022-06-04 00:42 (UTC)

@marco.righi

By the way, when downgrade asks

add python-cryptography to IgnorePkg? [y/N]

make sure to answer y. Otherwise the next pacman -Syu will upgrade it to 37.0.0 again.
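
As far as I can tell, answering y there puts the package on the IgnorePkg line in /etc/pacman.conf. A minimal sketch for checking whether a package is already listed; the function name is_ignored is made up for illustration, and the parsing assumes the usual "IgnorePkg = pkg1 pkg2" layout:

```shell
# is_ignored PKG [CONF]: succeed if PKG appears on an uncommented IgnorePkg line.
# CONF defaults to /etc/pacman.conf; the function name is hypothetical.
is_ignored() {
  conf=${2:-/etc/pacman.conf}
  # Print everything after "IgnorePkg =" (the directive may be repeated),
  # split it into one word per line, and look for an exact match.
  sed -n 's/^[[:space:]]*IgnorePkg[[:space:]]*=//p' "$conf" |
    tr -s '[:space:]' '\n' |
    grep -Fxq -- "$1"
}

# Example: is_ignored python-cryptography && echo "held back"
```

Commented-out lines such as "#IgnorePkg = ..." are skipped because the pattern only allows whitespace before the directive.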

marco.righi commented on 2022-05-29 21:31 (UTC)

@zcc2xj it works! Thx a lot!

zcc2xj commented on 2022-05-28 02:19 (UTC) (edited on 2022-05-28 02:25 (UTC) by zcc2xj)

@marco.righi

try this.

sudo pacman -S downgrade

sudo DOWNGRADE_FROM_ALA=1 downgrade python-cryptography

Choose python-cryptography-36.0.0 by typing its number in the list (maybe 33).

fbrennan commented on 2022-05-24 22:16 (UTC)

...Did you make sure the file exists? That package is now up to 37.0.0-1 in [extra] and has been flagged OOD again so is likely to be updated again soon.

marco.righi commented on 2022-05-24 09:12 (UTC) (edited on 2022-05-24 09:13 (UTC) by marco.righi)

Can you please help me resolve this?

sudo pacman -U /var/cache/pacman/pkg/python-cryptography-36.0.0-1-x86_64.pkg.tar.zst

results in

error: '/var/cache/pacman/pkg/python-cryptography-36.0.0-1-x86_64.pkg.tar.zst': could not find or read package
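
That error usually means the file is not (or no longer) in the cache, so pacman -U has nothing to install. A small sketch for listing which versions of a package are actually cached before trying; cached_versions is a hypothetical helper, and the default path is pacman's usual cache directory:

```shell
# cached_versions PKG [DIR]: print the cached package files for PKG, one per line.
cached_versions() {
  pkg=$1
  dir=${2:-/var/cache/pacman/pkg}
  # Match "PKG-<digit>..." so that e.g. python-cryptography-vectors is not listed.
  for f in "$dir/$pkg"-[0-9]*.pkg.tar.*; do
    case $f in *.sig) continue ;; esac   # skip detached signatures
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Example: cached_versions python-cryptography
```

If nothing is printed, the old version has to come from somewhere else, for example the Arch Linux Archive via the downgrade tool.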

mb720 commented on 2022-05-16 18:17 (UTC)

After upgrading ocrmypdf to version 13.4.4, I needed to downgrade the cryptography package with sudo pacman -U /var/cache/pacman/pkg/python-cryptography-36.0.0-1-x86_64.pkg.tar.zst to get rid of the error

pkg_resources.DistributionNotFound: The 'cryptography~=36.0.0' distribution was not found and is required by pdfminer.six.
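
The ~=36.0.0 in that message is a "compatible release" requirement: it accepts 36.0.x but nothing from 36.1 upward, which is why 37.0.0 is rejected. A rough sketch of that rule in shell, assuming plain three-part numeric versions (the function name compatible_release is invented here; the real check is done by pkg_resources):

```shell
# compatible_release SPEC VERSION: succeed if VERSION satisfies "~=SPEC".
# Simplified: assumes both are plain X.Y.Z numbers (no rc/post/dev suffixes).
compatible_release() {
  spec_prefix=${1%.*}    # "36.0.0" -> "36.0"
  spec_patch=${1##*.}    # "36.0.0" -> "0"
  ver_prefix=${2%.*}
  ver_patch=${2##*.}
  # "~=X.Y.Z" means: same X.Y prefix, and patch level at least Z.
  [ "$ver_prefix" = "$spec_prefix" ] && [ "$ver_patch" -ge "$spec_patch" ]
}

compatible_release 36.0.0 36.0.0 && echo "36.0.0 satisfies ~=36.0.0"
compatible_release 36.0.0 37.0.0 || echo "37.0.0 does not satisfy ~=36.0.0"
```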

hooregi commented on 2022-05-12 01:12 (UTC)

Run:

sudo pacman -U /var/cache/pacman/pkg/python-pdfminer-20220319-1-any.pkg.tar.zst

To downgrade python-pdfminer to the 20220319 version.

android_aur commented on 2022-05-10 11:53 (UTC) (edited on 2022-05-10 11:56 (UTC) by android_aur)

@allexj

I also get this error: pkg_resources.DistributionNotFound: The 'pdfminer.six!=20200720,<=20220319,>=20191110' distribution was not found and is required by ocrmypdf

Downgrading python-pdfminer via sudo downgrade python-pdfminer to version 20220319-1 "fixed" the issue for me (until there is a real fix I guess)
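
The requirement string in that error can be read as three conditions on pdfminer.six's date-style version number. A quick sketch of the same check in shell, assuming the version is a plain YYYYMMDD integer (the function name pdfminer_ok is invented for this example):

```shell
# pdfminer_ok VERSION: succeed if VERSION satisfies
# "pdfminer.six!=20200720,<=20220319,>=20191110" (bounds copied from the error above).
pdfminer_ok() {
  [ "$1" -ge 20191110 ] && [ "$1" -le 20220319 ] && [ "$1" -ne 20200720 ]
}

for v in 20220319 20220506; do
  if pdfminer_ok "$v"; then echo "$v: accepted"; else echo "$v: rejected"; fi
done
```

This is why a downgrade to 20220319-1 works, while anything newer than the upper bound is refused.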

allexj commented on 2022-05-08 07:28 (UTC) (edited on 2022-05-08 07:33 (UTC) by allexj)

pkg_resources.DistributionNotFound: The 'pdfminer.six!=20200720,<=20220319,>=20191110' distribution was not found and is required by ocrmypdf
free(): invalid pointer
Aborted (core dumped)

This happens even if I add the line sed -i "s|20220319|20220506|g" setup.cfg before the setup.py line.

drik commented on 2022-05-08 00:32 (UTC)

The line is now: sed -i "s|20220319|20220506|g" setup.cfg

NickJolly commented on 2022-04-04 08:24 (UTC)

@frankspace: Yes, it did help. Thank you once again for sharing. It was too logical and clear not to work. I just retried, implementing the fix in the PKGBUILD and compiling manually, and it succeeded. There may have been a typo on my first try. Sorry for taking up your time, but at least it helped me start learning how to fix this kind of annoyance myself. Have a nice day and stay safe.

PS: the package is still out of date, though, so a workaround by the end user is still needed, unfortunately. At least it has not yet crashed on me under heavy usage.

frankspace commented on 2022-04-01 14:56 (UTC) (edited on 2022-04-01 14:59 (UTC) by frankspace)

@NickJolly: Sorry about that.

The purpose of the fix was to implement the upstream commit that fixed pdfminer compatibility: https://github.com/ocrmypdf/OCRmyPDF/commit/04996caac34a418cf233c0f3c8ac436b6f2b5920

I unfortunately don't have any idea how to do that with a python package by way of stuff like a git patch or whatever, but the only functional part of that commit is very simple: changing a version number in setup.cfg. Although sed's syntax occasionally ranges from opaque to outright insane, that's a pretty simple fix, because no special characters are involved and it's a unique number that occurs only once in a single file.

For context, here is my entire (as amended) package() section:

package() {
  cd "${srcdir}/${pkgname}-${pkgver}"
  # Until upstream pushes a new version, this is needed to work with the current pdfminer.
  sed -i "s|20211012|20220319|g" setup.cfg
  python setup.py install --root="$pkgdir/" --optimize=1
  install -Dm644 LICENSE "$pkgdir/usr/share/licenses/$pkgname/LICENSE.rst"
}

I just double-checked that it does compile for me, and work afterwards, in a clean chroot. I should point out that I only use AUR helpers to check for packages that need updating, I always compile stuff with makepkg. Also, I use Artix, but that really shouldn't make a difference.

Does that help?

EDIT: I see upstream is claiming their test suite fails here: https://github.com/ocrmypdf/OCRmyPDF/issues/937#issuecomment-1082721212 -- so it's possible this fix works for my (rather simple) use-cases but won't work for everyone. That, I wouldn't have a clue about.

NickJolly commented on 2022-04-01 14:20 (UTC)

Hi there @frankspace. The fix you kindly shared did not work for me. Would you mind elaborating on it? There must be something I am missing. Thank you

pigmonkey commented on 2022-03-30 03:07 (UTC)

This PKGBUILD tracks the upstream package from PyPI, so it will not update to 13.4.2 until upstream pushes the new release there.

https://github.com/ocrmypdf/OCRmyPDF/issues/937

frankspace commented on 2022-03-24 06:39 (UTC) (edited on 2022-03-24 06:39 (UTC) by frankspace)

The latest update to python-pdfminer breaks ocrmypdf. Until upstream puts out a new version, the fix is pretty simple: just add a line with sed -i "s|20211012|20220319|g" setup.cfg to the package() section before the line with setup.py.
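
To see what that substitution does without touching a real build, it can be tried on a scratch file first. The sample requirement line below is only a guess at what setup.cfg looks like (just the number 20211012 is taken from the fix above), and bump_pin is a made-up helper name:

```shell
# bump_pin FILE OLD NEW: replace OLD with NEW in FILE, but only if OLD
# occurs on exactly one line, so nothing unexpected gets rewritten.
bump_pin() {
  count=$(grep -c -- "$2" "$1" || true)
  if [ "$count" -ne 1 ]; then
    echo "expected 1 line containing $2, found $count" >&2
    return 1
  fi
  sed -i "s|$2|$3|g" "$1"
}

# Hypothetical sample line; the real setup.cfg may be laid out differently.
scratch=$(mktemp)
echo '    pdfminer.six >= 20191110, != 20200720, <= 20211012' > "$scratch"
bump_pin "$scratch" 20211012 20220319 && cat "$scratch"
rm -f "$scratch"
```

The count check also catches the case where upstream has already changed the number, in which case the sed line would silently do nothing.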

malacology commented on 2022-02-13 12:10 (UTC)

@allexj, you need to install python-setuptools to solve it. img2pdf already relies on this package, so I am a little worried about your dependencies.

allexj commented on 2022-02-12 10:36 (UTC)

$ ocrmypdf
/usr/lib/python3.10/site-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 2.0.5-build-libtorrent-rasterbar-src-libtorrent-rasterbar-2.0.5-bindings-python is an invalid version and will not be supported in a future release
  warnings.warn(
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==13.3.0', 'console_scripts', 'ocrmypdf')())
  File "/usr/lib/python3.10/site-packages/ocrmypdf/__main__.py", line 35, in run
    _parser, options, plugin_manager = get_parser_options_plugins(args=args)
  File "/usr/lib/python3.10/site-packages/ocrmypdf/_plugin_manager.py", line 116, in get_parser_options_plugins
    plugin_manager = get_plugin_manager(pre_options.plugins)
  File "/usr/lib/python3.10/site-packages/ocrmypdf/_plugin_manager.py", line 104, in get_plugin_manager
    pm = OcrmypdfPluginManager(
  File "/usr/lib/python3.10/site-packages/ocrmypdf/_plugin_manager.py", line 45, in __init__
    self.setup_plugins()
  File "/usr/lib/python3.10/site-packages/ocrmypdf/_plugin_manager.py", line 73, in setup_plugins
    module = importlib.import_module(name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3.10/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 11, in <module>
    from ocrmypdf._exec import ghostscript
  File "/usr/lib/python3.10/site-packages/ocrmypdf/_exec/ghostscript.py", line 21, in <module>
    from PIL import Image, UnidentifiedImageError
ImportError: cannot import name 'UnidentifiedImageError' from 'PIL' (/home/allexj/.local/lib/python3.10/site-packages/PIL/__init__.py)
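
Note the last path in that traceback: PIL is being loaded from ~/.local rather than from /usr/lib, which suggests a copy installed into the user's home directory is shadowing the system package. A small sketch for spotting this; shadowed_by_user_site is a hypothetical helper that only inspects a module path:

```shell
# shadowed_by_user_site PATH: succeed if PATH lies under
# ~/.local/lib/python*/site-packages (the per-user install location).
shadowed_by_user_site() {
  case $1 in
    "$HOME"/.local/lib/python*/site-packages/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Typical use (prints where Python actually finds the module):
#   path=$(python -c 'import PIL; print(PIL.__file__)')
#   shadowed_by_user_site "$path" && echo "user-site copy in use: $path"
```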

hirunatan commented on 2022-01-26 15:57 (UTC)

Perhaps it would be good to notify users after installation that they need to install one of the tesseract-data language packages to use it.

https://ocrmypdf.readthedocs.io/en/latest/installation.html#arch-linux-aur

marco.righi commented on 2022-01-11 15:58 (UTC) (edited on 2022-01-11 16:00 (UTC) by marco.righi)

@bsdiceRobert, thanks a lot for your suggestion. I wrote the following script, which should rebuild the packages one by one. It may rebuild some packages more than once, but it avoids errors that could stop the entire rebuild process.


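#!/bin/bash
# Rebuild, one at a time, every package that still owns files in an obsolete Python directory.
logfile=~/log/python3xRebuild.log
mkdir -p "$(dirname "$logfile")"
echo "START $(date)" | tee -a "$logfile"
# All /usr/lib/python3.* directories except the newest one (sorted by birth time).
PYDIRS=$(stat -c '%W %n' /usr/lib/python[3-9].* | sort -n | head -n -1 | awk '{ print $2 }')
if [ -n "$PYDIRS" ]; then
  for d in $PYDIRS; do
    #echo "Found obsolete python directory $d, packages requiring rebuild:"
    for p in $(pacman -Qoq "$d"); do
      echo "yay -S $p --rebuildtree --noconfirm" | tee -a "$logfile"
      # A failure here does not stop the loop; the next package is still attempted.
      yay -S "$p" --rebuildtree --noconfirm
    done
  done
fi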


bsdice commented on 2022-01-11 15:37 (UTC) (edited on 2022-01-11 15:37 (UTC) by bsdice)

FYI the script snippet will not rebuild anything by itself but only check for directories older than the most current /usr/lib/python3.* directory. If you have python3.10 + python3.9 + python3.8 it will look at only 3.9 and 3.8 and then list all packages referencing these obsolete directories. If you reinstall these packages they should be installed for the most recent 3.10 in this example and while doing so, get removed from 3.9 or 3.8. So if you run the snippet again, the number of packages shown will shrink. In theory you could add

yay --noconfirm --answerdiff None --answerupgrade None "$d" || exit 1

after the "pacman" command before the "done", but better do it manually.
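
To make the selection step concrete: stat -c '%W %n' prints each directory's birth time and name, sort -n orders them oldest first, and head -n -1 drops the last (newest) line. The sketch below runs the same pipeline on sample data instead of the real filesystem; the timestamps are invented and obsolete_dirs is just an illustrative name:

```shell
# obsolete_dirs: read "birth_time path" lines on stdin and print every path
# except the most recently created one (the snippet's pipeline, minus stat).
obsolete_dirs() {
  sort -n | head -n -1 | awk '{ print $2 }'
}

# Made-up birth times; 3.10 is the newest directory and is therefore dropped.
printf '%s\n' \
  '1639400000 /usr/lib/python3.10' \
  '1600000000 /usr/lib/python3.8' \
  '1620000000 /usr/lib/python3.9' | obsolete_dirs
```

Note that head -n -1 (a negative count) is a GNU coreutils feature, which is fine on Arch.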

bsdice commented on 2022-01-11 11:31 (UTC)

@marco.righi You can try this within a script:

PYDIRS=$(stat -c '%W %n' /usr/lib/python[3-9].* | sort -n | head -n -1 | awk '{ print $2 }')
if [ -n "$PYDIRS" ]; then
  for d in $PYDIRS; do
    echo "Found obsolete python directory $d, packages requiring rebuild:"
    pacman -Qoq "$d"
  done
fi

Then use yay, pikaur, or whatever to rebuild anything found.

marco.righi commented on 2022-01-11 09:16 (UTC)

Do you know a script to rebuild all AUR Python dependencies?

nottoday commented on 2021-12-24 14:30 (UTC)

@jbarlow python-pikepdf is on version 4.2.0-1. I've tried updating it to 4.2.0-2 (from the arch repo). But that still gives the same error.

jbarlow commented on 2021-12-24 00:06 (UTC)

@nottoday python-pikepdf is likely out of date.

nottoday commented on 2021-12-23 16:49 (UTC) (edited on 2021-12-24 13:35 (UTC) by nottoday)

I have had a problem since 13.0.0.

The following command

$ ocrmypdf ./test.scan.pdf ./test.pdf

gives the following error output.

    1 [tesseract] lots of diacritics - possibly poor OCR
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 385, in run_pipeline
    exec_concurrent(context, executor)
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 274, in exec_concurrent
    executor(
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 82, in __call__
    self._execute(
  File "/usr/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 136, in _execute
    task_finished(result, pbar)
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 264, in update_page
    ocrgraft.graft_page(
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_graft.py", line 142, in graft_page
    self._graft_text_layer(
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_graft.py", line 304, in _graft_text_layer
    base_page.contents_add(new_text_layer, prepend=True)
AttributeError: contents_add

I'm on Manjaro in case that makes a difference.

Thanks in advance.

malacology commented on 2021-12-14 00:59 (UTC)

Okay, thanks, it is solved.

pigmonkey commented on 2021-12-14 00:50 (UTC)

Python AUR packages need to be rebuilt after Python upgrades.

The version bump I just pushed for 13.1.1 will cause this package to get rebuilt, however you will need to manually rebuild any AUR Python dependencies which have not incremented their pkgrel for the new Python (python-coloredlogs, python-humanfriendly). There's nothing we can do about those from this package.
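
One way to narrow down which packages those are is to intersect the list of foreign (AUR) packages with the list of packages that still own files in the old Python directory. A rough sketch; needs_rebuild is a hypothetical helper that takes the two lists as files so it can be tried without pacman:

```shell
# needs_rebuild FOREIGN_LIST OWNERS_LIST: print package names present in both files.
needs_rebuild() {
  # -F fixed strings, -x whole-line match, -f read the patterns from FOREIGN_LIST.
  grep -Fxf "$1" "$2" | sort -u
}

# On a real system the inputs could come from:
#   pacman -Qqm > foreign.txt                      # packages not found in the sync repos
#   pacman -Qoq /usr/lib/python3.9 > owners.txt    # owners of the old directory
#   needs_rebuild foreign.txt owners.txt
```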

malacology commented on 2021-12-13 22:44 (UTC)

$ ocrmypdf
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==13.1.0', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/lib/python3.10/importlib/metadata/__init__.py", line 919, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/lib/python3.10/importlib/metadata/__init__.py", line 518, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for ocrmypdf

This happens after upgrading to Python 3.10.

https://github.com/ocrmypdf/OCRmyPDF/issues/872#issuecomment-992025153

jvn01 commented on 2021-10-02 13:53 (UTC)

It gives me the error "python-distlib-0.3.2-1-any.pkg.tar.zst failed to download".

bot198042362134 commented on 2021-09-23 07:48 (UTC)

There are two missing dependencies: tesseract-data-eng and python-sortedcontainers

To solve this issue simply do:

pacman -S tesseract-data-eng python-sortedcontainers

lightsaber commented on 2021-08-24 18:31 (UTC)

Got this traceback:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==12.3.2', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.9/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/lib/python3.9/site-packages/ocrmypdf/__init__.py", line 10, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/usr/lib/python3.9/site-packages/ocrmypdf/helpers.py", line 22, in <module>
    import pikepdf
  File "/usr/lib/python3.9/site-packages/pikepdf/__init__.py", line 19, in <module>
    from ._version import __version__
  File "/usr/lib/python3.9/site-packages/pikepdf/_version.py", line 7, in <module>
    from pkg_resources import DistributionNotFound
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3243, in <module>
    def _initialize_master_working_set():
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3226, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3255, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 568, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 886, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 772, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'coloredlogs>=14.0' distribution was not found and is required by ocrmypdf

pigmonkey commented on 2021-08-08 22:37 (UTC)

This looks like a problem with the pdfminer package.

The latest version of the Arch package removed the dependency on python-sortedcontainers. Upstream does not actually need sortedcontainers and has removed the dependency, but that change has not been tagged in a release yet. So the Arch python-pdfminer needs to either incorporate that unreleased patch, or re-add the python-sortedcontainers dependency in their PKGBUILD.

In the meantime, downgrading to python-pdfminer version 20201018-2 will fix the problem.

alkaid commented on 2021-08-08 21:30 (UTC)

Missing dependency: python-sortedcontainers

The original traceback from Python is:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==12.3.2', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.9/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/lib/python3.9/site-packages/ocrmypdf/__init__.py", line 10, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/usr/lib/python3.9/site-packages/ocrmypdf/helpers.py", line 22, in <module>
    import pikepdf
  File "/usr/lib/python3.9/site-packages/pikepdf/__init__.py", line 19, in <module>
    from ._version import __version__
  File "/usr/lib/python3.9/site-packages/pikepdf/_version.py", line 7, in <module>
    from pkg_resources import DistributionNotFound
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3243, in <module>
    def _initialize_master_working_set():
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3226, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3255, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 568, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 886, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 772, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'sortedcontainers' distribution was not found and is required by pdfminer.six

fbrennan commented on 2021-07-26 05:09 (UTC)

Thanks @Lucki…I think most users would have pip installed, so we missed that one. Or, it wasn't required until recently. Either way, 12.3.0 pkgrel 3 has it, and it will be a make dependency going forwards.

Lucki commented on 2021-07-25 01:35 (UTC)

Python complains about pip not being available: /usr/bin/python: No module named pip.

==> Starting package()...
WARNING: The wheel package is not available.
/usr/bin/python: No module named pip
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/setuptools/installer.py", line 75, in fetch_build_egg
    subprocess.check_call(cmd)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6umi8zik', '--quiet', 'setuptools_scm_git_archive']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/build/ocrmypdf/src/ocrmypdf-12.3.0/setup.py", line 11, in <module>
    setup(
  File "/usr/lib/python3.9/site-packages/setuptools/__init__.py", line 152, in setup
    _install_setup_requires(attrs)
  File "/usr/lib/python3.9/site-packages/setuptools/__init__.py", line 147, in _install_setup_requires
    dist.fetch_build_eggs(dist.setup_requires)
  File "/usr/lib/python3.9/site-packages/setuptools/dist.py", line 785, in fetch_build_eggs
    resolved_dists = pkg_resources.working_set.resolve(
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 766, in resolve
    dist = best[req.key] = env.best_match(
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1051, in best_match
    return self.obtain(req, installer)
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1063, in obtain
    return installer(requirement)
  File "/usr/lib/python3.9/site-packages/setuptools/dist.py", line 844, in fetch_build_egg
    return fetch_build_egg(self, req)
  File "/usr/lib/python3.9/site-packages/setuptools/installer.py", line 77, in fetch_build_egg
    raise DistutilsError(str(e)) from e
distutils.errors.DistutilsError: Command '['/usr/bin/python', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6umi8zik', '--quiet', 'setuptools_scm_git_archive']' returned non-zero exit status 1.
==> ERROR: A failure occurred in package().
    Aborting...
==> ERROR: Build failed, check /var/lib/aurbuild/x86_64/lucki/build

pigmonkey commented on 2021-05-29 18:02 (UTC)

I'm not getting that error. Perhaps you need to do a clean rebuild of python-coloredlogs for some reason.

Sproid commented on 2021-05-29 15:19 (UTC)

It is giving me this error:

$ ocrmypdf
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==12.0.3', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.9/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 855, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/lib/python3.9/site-packages/ocrmypdf/__init__.py", line 10, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/usr/lib/python3.9/site-packages/ocrmypdf/helpers.py", line 22, in <module>
    import pikepdf
  File "/usr/lib/python3.9/site-packages/pikepdf/__init__.py", line 19, in <module>
    from ._version import __version__
  File "/usr/lib/python3.9/site-packages/pikepdf/_version.py", line 7, in <module>
    from pkg_resources import DistributionNotFound
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3243, in <module>
    def _initialize_master_working_set():
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3226, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 3255, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 568, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 886, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.9/site-packages/pkg_resources/__init__.py", line 772, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'coloredlogs>=14.0' distribution was not found and is required by ocrmypdf

I do have "python-coloredlogs 15.0-1" installed.

mmberlin commented on 2021-04-29 17:36 (UTC) (edited on 2021-04-29 17:39 (UTC) by mmberlin)

missing (make) dependency: python-setuptools-scm-git-archive

distutils.errors.DistutilsError: Command '['/usr/bin/python', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmponytfjx7', '--quiet', 'setuptools_scm_git_archive']' returned non-zero exit status 1.

aslakstubsgaard commented on 2021-02-08 15:35 (UTC) (edited on 2021-02-08 15:36 (UTC) by aslakstubsgaard)

I'm using yay. I tried doing a clean build of python-humanfriendly first, then python-coloredlogs, and finally ocrmypdf, and it now works again. I hope this helps others with the same issue.

pigmonkey commented on 2021-02-04 21:15 (UTC)

I can't recreate that error. This package depends on python-coloredlogs, which in turn depends on python-humanfriendly. Whatever AUR helper you're using should pick all that up.

Perhaps you are running an old version of python-humanfriendly. It looks like that AUR package was updated to 9.1 on 2020-12-10.

aslakstubsgaard commented on 2021-02-04 16:51 (UTC) (edited on 2021-02-04 16:52 (UTC) by aslakstubsgaard)

I did a fresh build but am getting this error:

pkg_resources.DistributionNotFound: The 'humanfriendly>=9.1' distribution was not found and is required by coloredlogs

ginkel commented on 2020-10-26 10:56 (UTC)

ocrmypdf currently fails to work with the recently updated python-pdfminer package. Downgrading the package to python-pdfminer-20200726-1 works around the issue for now.

pkg_resources.DistributionNotFound: The 'pdfminer.six!=20200720,<=20200726,>=20191110' distribution was not found and is required by ocrmypdf

pigmonkey commented on 2020-10-19 12:42 (UTC)

I still use the package, so I'm happy to continue updating or to step back. No preference.

fbrennan commented on 2020-10-18 23:02 (UTC)

Hello all.

I'm back to using Arch, so I can take over again if pigmonkey no longer wants to maintain this package. :-)

But I think they've done a good job, so I can also just give them the package. I could also do nothing, but now that I'm back it may be confusing who is responsible for pushing updates.

Which would you prefer?

pigmonkey commented on 2020-10-14 22:36 (UTC)

tesseract-data-osd is included with the standard tesseract Arch package.

Looking at the "Required By" section of the tesseract-data-eng package, it does not appear that it is common for other Arch packages to list it as a dependency.

If this is confusing for users, I think it would be acceptable to add it as an optional dependency, so that there is an indication at the end of the install that another package might be needed. But it may be weird for non-English speakers if the package has an optional dependency on the English language pack, but not whatever data pack is needed for the user's native language. I don't really want a 106 item optdepends array for every possible language pack.

jbarlow commented on 2020-10-14 07:07 (UTC)

OCRmyPDF assumes English unless a language is specified with -l fra for example. So strictly speaking it works, but you have to issue the option every time. The test suite also assumes English is installed. I believe most package managers have added an explicit dependency on tesseract-data-eng or whatever it's called in the system, but some have not.

I did poll users whether to default to the system language based on locale, but surprisingly non-English users didn't like the idea.

OCRmyPDF does assume tesseract-data-osd is installed so that should be a dependency if Arch breaks that out as a separate package.
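
A quick pre-flight check along these lines is to see whether Tesseract reports the language before running OCR. The sketch below parses text in the shape that tesseract --list-langs prints (a header line followed by one language code per line); has_lang is a made-up name, and the file names in the usage comment are placeholders:

```shell
# has_lang CODE: read `tesseract --list-langs` style output on stdin and
# succeed if CODE is one of the listed languages (the header line is skipped).
has_lang() {
  tail -n +2 | grep -Fxq -- "$1"
}

# Typical use:
#   tesseract --list-langs 2>&1 | has_lang eng || echo "install tesseract-data-eng or pass -l"
#   ocrmypdf -l fra input.pdf output.pdf
```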

pigmonkey commented on 2020-10-13 16:51 (UTC)

Tesseract does require a data package to be installed, but it does not have to be English. If a language is not specified, Tesseract does assume English, hence the error.

I don't think it's appropriate to include tesseract-data-eng as a dependency since that might not be the user's language.

ioan commented on 2020-10-13 13:45 (UTC)

$ ocrmypdf test.pdf test2.pdf
Tesseract failed to report available languages. Output from Tesseract:

Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
List of available languages (1):
osd

Looks like it needs the eng data by default.

jorges commented on 2020-08-05 19:49 (UTC)

Thanks for the explanation! I just got rid of python-pdfminer.six from AUR and downgraded python-pdfminer to 20200517-1. OCRMyPDF works and all is well!

pigmonkey commented on 2020-07-29 17:39 (UTC)

It's a little convoluted, but here is what I think is happening:

The confusingly named python-pdfminer from community that we use is in fact python-pdfminer.six. You can verify that by looking at its PKGBUILD. The AUR python-pdfminer.six is basically the same package, except it pulls from PyPI instead of GitHub and is on an outdated version (20200124 instead of community's 20200720).

OCRMyPDF claims to support 20200720, but that version of python-pdfminer{,.six} dropped PDFTextExtractionNotAllowed. This apparently was unintentional and has been reversed in 20200726. But as of now 20200726 has not been officially tagged.

So, we need to wait for upstream python-pdfminer.six to make 20200726 official, and then wait for the community maintainer to update the python-pdfminer package to 20200726. And then we need to wait for upstream OCRMyPDF to release a new version that notes support for 20200726. Then I can update this package and everything will be copacetic.

In the meantime, you can downgrade the community python-pdfminer package to the previous version, or run the much older version provided by the AUR python-pdfminer.six package.

jorges commented on 2020-07-29 11:18 (UTC) (edited on 2020-07-29 11:22 (UTC) by jorges)

I was getting the traceback shown below with python-pdfminer. I was able to solve the problem by removing that package and installing python-pdfminer.six. If other people can confirm this, maybe the package dependencies have to be changed?

$ ocrmypdf 
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==10.3.1', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/lib/python3.8/site-packages/ocrmypdf/__init__.py", line 21, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/__init__.py", line 19, in <module>
    from ocrmypdf.pdfinfo.info import Colorspace, Encoding, PdfInfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/info.py", line 37, in <module>
    from ocrmypdf.pdfinfo.layout import get_page_analysis, get_text_boxes
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/layout.py", line 29, in <module>
    from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfdocument' (/usr/lib/python3.8/site-packages/pdfminer/pdfdocument.py)
(ins)[jscandal@lhasa .aur_bb]$ python
Python 3.8.5 (default, Jul 27 2020, 08:42:51) 
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(ins)>>> from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfdocument' (/usr/lib/python3.8/site-packages/pdfminer/pdfdocument.py)

bsdice commented on 2020-07-23 12:03 (UTC) (edited on 2020-07-23 12:03 (UTC) by bsdice)

Anybody else getting tracebacks when using --threshold?

An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 195, in exec_page_sync
    ocr_image_out = create_ocr_image(ocr_image, page_context)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_pipeline.py", line 544, in create_ocr_image
    dpi = tuple(round(coord) for coord in im.info['dpi'])
KeyError: 'dpi'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 356, in run_pipeline
    exec_concurrent(context)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 267, in exec_concurrent
    exec_progress_pool(
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_concurrent.py", line 108, in exec_progress_pool
    result = results.next()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
KeyError: 'dpi'
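The KeyError comes from reading im.info['dpi'] on an image whose metadata has no 'dpi' entry. A defensive lookup might look like the sketch below. The helper name and the 300 DPI fallback are assumptions for illustration only, not what upstream does.

```python
def dpi_from_info(info, fallback=(300, 300)):
    """Return a rounded (x, y) DPI tuple from a Pillow-style info dict.

    Uses `fallback` (an assumed value) when the dict has no 'dpi' key.
    """
    dpi = info.get("dpi", fallback)
    return tuple(round(coord) for coord in dpi)
```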

marlemion commented on 2020-07-16 06:47 (UTC) (edited on 2020-07-16 12:30 (UTC) by marlemion)

Never mind the below. For some reason, some files were missing from my system.

Fully updated arch and updated ocrmypdf to the latest via AUR:

ocrmypdf
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==10.2.0', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/lib/python3.8/site-packages/ocrmypdf/__init__.py", line 21, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/__init__.py", line 19, in <module>
    from ocrmypdf.pdfinfo.info import Colorspace, Encoding, PdfInfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/info.py", line 37, in <module>
    from ocrmypdf.pdfinfo.layout import get_page_analysis, get_text_boxes
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/layout.py", line 24, in <module>
    import pdfminer.encodingdb
ModuleNotFoundError: No module named 'pdfminer.encodingdb'

Packages:

pakku -Ss pdfminer
community/python-pdfminer 20200517-1 [installed]
    Python PDF Parser
aur/pdfminer 20191125-1 [20 / 0.157511]
    python3 utils to extract, analyze text data of PDF files. Includes pdf2txt, dumppdf, and latin2ascii
aur/pdfminer-git r480.14fd0fd-1 [3 / 0.000000]
    python utils to extract& analyze text data of PDF files.
aur/pdfminer3k 1.3.1-1 [0 / 0.000000]
    A python3 port of pdfminer
aur/python-pdfminer.six 20200124-1 [6 / 0.013772]
    PDF parser and analyzer for Python

What is the problem?

xuanruiqi commented on 2020-07-03 02:28 (UTC)

Now that python-pillow in community has been updated to 7.2.0, the block on updating this package should no longer exist.

pigmonkey commented on 2020-06-14 18:58 (UTC)

I pinged the python-pillow packager. The package had simply fallen through the cracks and he will be updating it today, but 7.0 introduced some API breakage so the upgraded package will probably hang out in the testing repo for a bit.

fbrennan commented on 2020-06-13 01:03 (UTC)

It might make sense to put in an orphan request for python-pillow-git, then update that, then temporarily require it, @pigmonkey, given how long the community package has been out of date. Though it's of course up to you, as it might be too much work.

jbarlow commented on 2020-06-13 00:41 (UTC)

Upstream here. I noticed python-pillow in AUR is quite old so this could be a blocker for some time.

ocrmypdf does work with pillow 6.2.1, with all tests passing. You could override the requirement and permit the earlier pillow. (I'd rather not change this upstream, so that upstream reflects the configuration that is being tested.)

On another note, I strongly doubt that pillow-simd would yield any measurable change in performance so it would not be worth the effort to integrate this.

pigmonkey commented on 2020-06-12 21:45 (UTC)

This package is stuck on 9.8.2 until the community python-pillow package is upgraded to >=7.0.0.

pigmonkey commented on 2020-05-28 22:17 (UTC)

Thanks for identifying the issue. It looks like v9.8.1 fixes this and is in the process of being pushed to pypi.

brianmercer commented on 2020-05-28 21:18 (UTC)

Temporary workaround is to roll back python-pdfminer to the prior version:

pacman -U /var/cache/pacman/pkg/python-pdfminer-20200402-1-any.pkg.tar.zst

and optionally add

IgnorePkg = python-pdfminer

to the /etc/pacman.conf file to keep it from upgrading for now.
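To double-check that the hold is really in place, you can list what pacman.conf currently ignores. This is a rough sketch that only understands plain "IgnorePkg = ..." lines; ignored_packages is a hypothetical helper, not a pacman tool.

```python
def ignored_packages(pacman_conf_text):
    """Collect package names from all uncommented IgnorePkg lines."""
    ignored = []
    for raw_line in pacman_conf_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#"):
            continue  # commented-out directive
        key, sep, value = line.partition("=")
        if sep and key.strip() == "IgnorePkg":
            # Several packages may be listed on one line, separated by spaces.
            ignored.extend(value.split())
    return ignored


if __name__ == "__main__":
    with open("/etc/pacman.conf") as handle:
        print(ignored_packages(handle.read()))
```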

chrisberkhout commented on 2020-05-28 21:05 (UTC)

Last line of the error message is

pkg_resources.DistributionNotFound: The 'pdfminer.six<=20200402,>=20181108' distribution was not found and is required by ocrmypdf

That's from the python-pdfminer package, which is in the dependencies, it's just that the current version is python-pdfminer-20200517-1 and ocrmypdf apparently needs an earlier version.

It seems this has happened before: https://github.com/jbarlow83/OCRmyPDF/issues/457

I added a new issue: https://github.com/jbarlow83/OCRmyPDF/issues/566
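To see why 20200517 is rejected, here is a simplified check of a version against that kind of requirement string. It is a sketch that handles only purely numeric versions and the operators <=, >=, ==, < and >; it is not a PEP 440 implementation and not how pkg_resources does it internally.

```python
import re

_OPS = {
    "<=": lambda a, b: a <= b,
    ">=": lambda a, b: a >= b,
    "==": lambda a, b: a == b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
}


def _as_tuple(version):
    """Turn '20200517' or '0.3.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Check a numeric version against a spec like '<=20200402,>=20181108'."""
    installed = _as_tuple(version)
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(<=|>=|==|<|>)\s*([\d.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, wanted = match.groups()
        if not _OPS[op](installed, _as_tuple(wanted)):
            return False
    return True


if __name__ == "__main__":
    # Installed version vs. the requirement quoted in the error message.
    print(satisfies("20200517", "<=20200402,>=20181108"))
```

The installed version fails the upper bound, which is why the only fixes are a downgrade or a new ocrmypdf release with a relaxed requirement.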

oriba commented on 2020-05-28 20:26 (UTC)

ocrmypdf, built from this package, does not work anymore. Some days ago it worked. (Sidenote: I also had issues with matplotlib, some ugly things may happen these days in the python field).

I got a quite long message, and one of the things mentioned was "pdfminer.six" together with ContextualVersionConflict.

Looking at the package-dependencies, pdfminer.six is not in there. So it should be added. Also certain versions seem to be needed. Let me know if you want the complete error message, then I may paste it somewhere.

rharish commented on 2020-05-01 12:51 (UTC)

Does there exist a way to avoid using the egg files? Or somehow removing the dependency checks altogether? Installing from the AUR should ensure that the package has its dependencies met, so I don't think that the checks are needed.

I already tried installing it through pip in a virtualenv, along with Pillow-SIMD, and it ignores the checks and directly works with Pillow-SIMD. So those checks can be skipped IMHO.

pigmonkey commented on 2020-04-29 17:55 (UTC)

I'm not sure how to go about supporting Pillow-SIMD in the package.

The PKGBUILD installs the package via setuptools, which results in an egg. The egg includes a requirement of Pillow>=6.2.0. You can see this at /usr/lib/python3.8/site-packages/ocrmypdf-9.7.2-py3.8.egg-info/requires.txt (or a similar path, depending on your version). That results in the error you're seeing.

I think this would be avoided if the package were installed via pip, but the wiki discourages that. And I think even then you'd end up just moving the problem to a different level: the python-reportlab package is also installed via setuptools so is going to have an egg that looks for that same Pillow package. You'd get the same error, but it would be thrown by reportlab rather than ocrmypdf.
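If you want to see what an installed package declares without hunting for its requires.txt, the standard library can read the same metadata (Python 3.8+). A small sketch; declared_requirements is a hypothetical helper name.

```python
from importlib import metadata


def declared_requirements(distribution):
    """Return the requirement strings recorded for an installed distribution.

    Returns an empty list if the distribution is not installed or declares none.
    """
    try:
        return list(metadata.requires(distribution) or [])
    except metadata.PackageNotFoundError:
        return []


if __name__ == "__main__":
    for requirement in declared_requirements("ocrmypdf"):
        print(requirement)
```

Running it for reportlab as well should show whether it carries its own Pillow requirement.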

rharish commented on 2020-04-29 05:41 (UTC) (edited on 2020-04-29 05:41 (UTC) by rharish)

This does not work with Pillow-SIMD. This is the issue that I created upstream. Here are the logs when I run ocrmypdf --help:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3259, in <module>
    def _initialize_master_working_set():
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3242, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3271, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 584, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 901, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 787, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'Pillow>=6.2.0' distribution was not found and is required by ocrmypdf

pigmonkey commented on 2020-04-22 17:58 (UTC)

Thanks. It looks like the confusingly named python-pdfminer package in community does indeed provide the needed python-pdfminer.six library rather than the abandoned python-pdfminer library.

That was the last AUR dependency, so maybe there's a chance of this getting adopted into community now.

petRUShka commented on 2020-04-22 10:04 (UTC) (edited on 2020-04-22 10:07 (UTC) by petRUShka)

Dependency aur/python-pdfminer.six should possibly be replaced with community/python-pdfminer.

fbrennan commented on 2020-04-10 23:20 (UTC)

No, the computer was broken in transit. I still have it, just due to the pandemic the parts to fix it are arriving slowly. And it's made for 240V, and I now live in a 120V country.

bsdice commented on 2020-04-10 22:41 (UTC)

@fbrennan Does that mean some Philippine police jockey can now upload trojaned PKGBUILDs in your name? If true we should summon the help of an AUR admin to delete your key.

fbrennan commented on 2020-04-10 22:29 (UTC)

You've been added @pigmonkey, thank you.

I had to flee the country I was living in, and my desktop computer got broken. It was my Arch install and had my AUR SSH key. I don't know when I'm going to be able to get back to AUR maintenance.

https://www.vice.com/en_us/article/y3mqzb/the-philippines-wants-to-arrest-8chan-founder-fredrick-brennan-its-basically-a-death-sentence

pigmonkey commented on 2020-04-10 21:40 (UTC)

I'd be happy to help maintain this package, if needed.

rany commented on 2020-04-10 09:00 (UTC)

@AlexParkhomenko also done!

AlexParkhomenko commented on 2020-04-10 08:49 (UTC) (edited on 2020-04-10 08:49 (UTC) by AlexParkhomenko)

conflicts=("ocrmypdf" "python-pdfminer")

rany commented on 2020-04-10 08:40 (UTC)

@pescepalla Done!

pescepalla commented on 2020-04-10 06:31 (UTC)

Please add conflicts=("ocrmypdf") to the PKGBUILD

rien333 commented on 2020-03-31 12:23 (UTC)

Now two versions out of date, see https://github.com/jbarlow83/OCRmyPDF/releases.

pigmonkey commented on 2020-02-22 01:26 (UTC)

Done. https://github.com/jbarlow83/OCRmyPDF/pull/494

jbarlow commented on 2020-02-22 00:19 (UTC)

Could someone please submit a pull request updating the OCRmyPDF documentation (ocrmypdf.readthedocs.io) with directions for installing this package?

jbarlow commented on 2020-02-10 09:57 (UTC)

@brianmercer ocrmypdf 9.6.0 fixes the pdfminer.six version

brianmercer commented on 2020-02-04 15:24 (UTC) (edited on 2020-02-04 15:29 (UTC) by brianmercer)

ocrmypdf won't work if you updated the aur package of pdfminer.six to version 20200124.

ocrmypdf hasn't been updated for the new version of pdfminer.six.

https://github.com/jbarlow83/OCRmyPDF/blob/fd991a2380f1803924b1b8192e42e67a80998dde/setup.py

'pdfminer.six >= 20181108, <= 20200104',

So we're waiting on an update from ocrmypdf, or if necessary you can install an older version of pdfminer.six.

grunix commented on 2020-02-04 12:17 (UTC)

Please bump the version to 9.5.0.

commented on 2020-01-31 11:48 (UTC)

Please use the stable source URL: https://files.pythonhosted.org/packages/source/${_name::1}/$_name/$_name-$pkgver.tar.gz
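For anyone unsure how that pattern expands: ${_name::1} is the first character of the package name. The same URL built in Python, as a sketch (pypi_source_url is a made-up helper name):

```python
def pypi_source_url(name, version):
    """Build the stable PyPI sdist URL: .../source/<first letter>/<name>/<name>-<version>.tar.gz"""
    return (
        "https://files.pythonhosted.org/packages/source/"
        f"{name[:1]}/{name}/{name}-{version}.tar.gz"
    )


if __name__ == "__main__":
    print(pypi_source_url("ocrmypdf", "9.1.1"))
```

Unlike the hashed paths, this form only changes with the version number, so a version bump needs no new URL lookup.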

blabred commented on 2020-01-30 16:35 (UTC)

I get this whenever I'm trying to ocr a document:

Traceback (most recent call last):
  File "/home/adros/anaconda3/bin/ocrmypdf", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/home/adros/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3252, in <module>
    @_call_aside
  File "/home/adros/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3236, in _call_aside
    f(*args, **kwargs)
  File "/home/adros/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3265, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/home/adros/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 584, in _build_master
    ws.require(__requires__)
  File "/home/adros/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 901, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/home/adros/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 787, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'pikepdf<2,>=1.8.1' distribution was not found and is required by ocrmypdf

brianmercer commented on 2020-01-13 00:01 (UTC)

Version 9.2.0 release notes state that qpdf is no longer required as a dependency.

I built it without qpdf and it seemed to compile and run fine.

sagittarius commented on 2019-12-09 14:30 (UTC) (edited on 2019-12-11 13:47 (UTC) by sagittarius)

Please update to 9.1.1. Just change the source link and the SHA256 and it works as is.

$ ocrmypdf --version

9.1.1

Edit PKGBUILD

pkgver=9.1.1

source=("https://files.pythonhosted.org/packages/af/7f/234e357557233d618c5b40d066389de6203c48a1697653285af541fff582/ocrmypdf-9.1.1.tar.gz")

sha256sums=('656dd9cec46b2c3a8a1a4b98e7bb00dd95adb98229d63d397b422679cfbbb88e')

If necessary, rebuild python-pdfminer.six
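Before pasting a checksum from a comment into sha256sums, it is worth computing it yourself from the downloaded tarball. A minimal sketch using only the standard library; sha256_of_file is a hypothetical helper and the file name in the usage part is just an example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python sha256_check.py ocrmypdf-9.1.1.tar.gz
    print(sha256_of_file(sys.argv[1]))
```

The printed value should match the one in the PKGBUILD, otherwise makepkg will refuse the source.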

brianmercer commented on 2019-12-02 14:58 (UTC)

Current version 9.0.5-1 is broken because of an update of python-pdfminer.six in the aur to version 20191110-1.

Error: "The 'pdfminer.six<=20191020,>=20181108' distribution was not found and is required by ocrmypdf"

Please update, thanks.

fbrennan commented on 2019-11-09 05:20 (UTC) (edited on 2019-11-09 05:20 (UTC) by fbrennan)

GitHub archives are unusable without hacks for Python AUR packages. That's because they don't include the .git directory, required by python-setuptools. The PyPI archive must be used.

jbarlow commented on 2019-11-09 05:12 (UTC)

@brianmercer CI is set up to build wheels and deploy to PyPI whenever a git tag is pushed, so tags should always be consistent with PyPI. But it's probably cleaner to install the .tar.gz from PyPI than Git - smaller download since it's not pulling the whole history.

brianmercer commented on 2019-11-08 18:23 (UTC)

I'm no expert on PKGBUILD.

Is there any downside to rewriting the PKGBUILD to use git tags instead of versions from pypi? It looks like @jbarlow is pretty diligent with the github tags. https://github.com/jbarlow83/OCRmyPDF/tags

And PKGBUILD supports git tags. https://wiki.archlinux.org/index.php/VCS_package_guidelines#The_pkgver.28.29_function

Does an ocrmypdf-git package need to be by commit or can it be by tag or release? Do the aur managers like yay (which I use) check devel versions by commit or can it check by tags?

pigmonkey commented on 2019-11-06 17:11 (UTC)

I'd be happy to help maintain the package if fbrennan is no longer interested.

rien333 commented on 2019-11-06 09:50 (UTC) (edited on 2019-11-06 09:50 (UTC) by rien333)

This is several versions out of date: https://github.com/jbarlow83/OCRmyPDF/releases. An update would be great, especially because there is a fix for a fatal ocrmypdf crash (see https://github.com/jbarlow83/OCRmyPDF/issues/448)

jbarlow commented on 2019-11-05 10:31 (UTC)

@brianmercer That is completely correct. ruffus, defusedxml, lxml should all be removed. -Upstream

brianmercer commented on 2019-11-05 02:47 (UTC)

It looks like version 9.0 removed the dependency on ruffus. And version 7.0 removed the dependency on defusedxml. And version 3.0 removed the dependency on lxml.

bsdice commented on 2019-10-25 13:23 (UTC)

@brianmercer et al. Package needs to be updated to 9.0.3 which fixes https://github.com/jbarlow83/OCRmyPDF/commit/17ac9d7a9a296ae3d50146fbefad5281e2851b0f

The backstory is that ghostscript tightened security after taviso took a stab at it security-wise back in summer of 2018. You can fix it yourself in the meantime, by:

1) Downloading raw PKGBUILD file into a temp directory

2) Edit these lines to say

pkgver=9.0.3

source=("https://files.pythonhosted.org/packages/6b/8c/d8a9132e050ac25ea5da63fabc1a1fc0246beee72701b372c35221a40237/ocrmypdf-9.0.3.tar.gz")

sha256sums=('3d9b92f6a01d0711e4156c6b36638d9d946d010e2925ec473ec7f666096cceeb')

3) makepkg -Ccfi
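Step 2 can also be scripted if you do this often. The sketch below rewrites the three lines in a PKGBUILD string. It assumes each of pkgver, source and sha256sums is a single-line assignment, as in the lines quoted above; bump_pkgbuild is a made-up helper, not an Arch tool.

```python
import re


def bump_pkgbuild(text, pkgver, source_url, sha256):
    """Replace the pkgver, source and sha256sums lines of a PKGBUILD string."""
    replacements = {
        "pkgver": f"pkgver={pkgver}",
        "source": f'source=("{source_url}")',
        "sha256sums": f"sha256sums=('{sha256}')",
    }
    for key, new_line in replacements.items():
        # Match a single-line assignment such as `pkgver=9.0.1`.
        text = re.sub(
            rf"^{key}=.*$",
            lambda _match, line=new_line: line,
            text,
            flags=re.MULTILINE,
        )
    return text
```

Read the PKGBUILD, pass its contents through this function, write it back, then run makepkg as in step 3.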

brianmercer commented on 2019-10-21 18:37 (UTC)

I started getting a set of errors with the 9.0.1 version. I edited the PKGBUILD to upgrade to version 9.0.3 of ocrmypdf and they went away.

These are the errors:

ERROR - GPL Ghostscript 9.50: Setting Overprint Mode to 1 not permitted in PDF/A-2, overprint mode not set

Error: /invalidfileaccess in --file--
Operand stack:
   --nostringval-- --nostringval-- (/usr/lib/python3.7/site-packages/ocrmypdf/data/sRGB.icc) (r)
Execution stack:
   %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1990 1 3 %oparray_pop 1989 1 3 %oparray_pop 1977 1 3 %oparray_pop 1833 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval--
Dictionary stack:
   --dict:737/1123(ro)(G)-- --dict:1/20(G)-- --dict:76/200(L)--
Current allocation mode is local
Last OS error: Permission denied
Current file position is 580
GPL Ghostscript 9.50: Unrecoverable error, exit code 1
ERROR - SubprocessOutputError: Ghostscript PDF/A rendering failed

Fifis commented on 2019-07-31 13:03 (UTC)

For the latest ocrmypdf 8.3.2, I had to update pikepdf to 1.5.0.post0. I had a bit of trouble with old package versions and had to overwrite them, e.g. sudo pacman -S --overwrite="*" python-ply python-pycparser img2pdf python-cffi python-defusedxml python-lxml python-reportlab, to get ocrmypdf 8.3.2 to work.

john-soda commented on 2019-02-10 23:36 (UTC)

@fbrennan I really don't know what the problem is, that it can't reach setuptools_scm_git_archive. I downloaded the package manually and edited the PKGBUILD that it points to my local downloaded version. Now it works! Thanks for your help.

fbrennan commented on 2019-02-09 05:38 (UTC)

@john-soda I can install the latest version just fine for me, it seems to me you have a DNS resolution problem for the pypi domain.

john-soda commented on 2019-02-02 11:04 (UTC)

When I try to install ocrmypdf, I always get this error:

Could not find suitable distribution for Requirement.parse('setuptools_scm_git_archive')

Here is the full log: https://pastebin.com/xsqzeqr0

How can I install the newest version?

fbrennan commented on 2019-01-17 08:18 (UTC)

My apologies to all stakeholders waiting on me. I came down with a serious illness. Rest assured this is not forgotten or abandoned. I will update it in due time. Thanks

jbarlow commented on 2019-01-12 08:49 (UTC) (edited on 2019-01-12 08:52 (UTC) by jbarlow)

@fbrennan

v8 makes pdfminer.six "technically optional". setup.py still lists it as required, but downstream maintainers at their option may delete pdfminer.six from setup.py in their scripts, at the cost of the --redo-ocr feature. I will support this arrangement until the packaging situation for pdfminer.six improves. (I am doing it this way because "pip install ocrmypdf" works fine with pdfminer.six.)

v8 also drops python-xmp-toolkit because of the difficulties some downstream consumers had with it.

Thanks again for maintaining ocrmypdf for the ArchLinux community.

-Upstream

fbrennan commented on 2018-11-27 06:19 (UTC)

I thought of that @bsdice but it breaks the AUR Rules of Submission. https://wiki.archlinux.org/index.php/Arch_User_Repository#Rules_of_submission

The more and more AUR dependencies that get added the more difficult this gets to maintain and the more people we need to rely on. Fortunately I maintain python-pikepdf, python-ruffus and python-xmp-toolkit, the major AUR deps of this package before the recent update. I don't think it's kosher for me to make a metapackage which would install every dependency and have a replaces/conflicts/provides either.

For now I recommend users ensure that python-sortedcontainers is installed before attempting to build ocrmypdf & deps. I'm sure the maintainer of python-pdfminer.six will add it to the manifest as soon as they can.

bsdice commented on 2018-11-27 05:40 (UTC)

@fbrennan: A workaround could have been to create a package called "python-pdfminer-six" and use the following statements:

  • replaces=('python-pdfminer.six')
  • conflicts=('python-pdfminer.six')
  • provides=('python-pdfminer.six')

Unfortunately python-pdfminer.six in AUR is missing a dependency. Workaround is to still use http://termbin.com/k46k.

Harvey commented on 2018-11-26 15:44 (UTC)

python-pdfminer.six has been updated to version 20181108-1 ;)

fbrennan commented on 2018-11-26 11:50 (UTC) (edited on 2018-11-26 11:52 (UTC) by fbrennan)

Unfortunately my friends, we've hit a snag. Someone else is using the name python-pdfminer.six :-(

https://aur.archlinux.org/pkgbase/python-pdfminer.six/#news

I put a working PKGBUILD there. But unfortunately I cannot upload my new ocrmypdf, which works fine, until this user makes a decision. That is because ocrmypdf requires a higher version than theirs.

https://github.com/jbarlow83/OCRmyPDF/blob/0f5c484b626632aa68259eda16ff2c1b87a42104/requirements/main.txt#L7

I sincerely apologize for the long wait. If you are good with makepkg and pacman, you can use these two PKGBUILDS:

If not, you will just have to wait for python-pdfminer.six to be updated, by either ishitatsuyuki or me if he orphans.

fbrennan commented on 2018-11-23 23:08 (UTC)

I hope to release the upgrade to 7.3.1 today (GMT+8).

I apologize for the wait after the orphan notification.

Harvey commented on 2018-11-22 11:58 (UTC)

Version 7.3.1 looks very promising, according to the release notes https://github.com/jbarlow83/OCRmyPDF/blob/master/docs/release_notes.rst Is there any chance for an update? I see there is a new dependency on pdfminer.six 20181108...

marlemion commented on 2018-11-06 09:18 (UTC) (edited on 2018-11-06 09:32 (UTC) by marlemion)

@bsdice: Thanks, but that did not help. Same error. On another machine, ocrmypdf is working. So it must be some issue on that machine...

Btw. ocrmypdf was working for ages on that machine, but I had to hold back leptonica for other reasons, so it was stuck to a certain version for some time....

Found the Problem: I had installed python2-jmespath-0.9.3-2. This package installs /usr/bin/jp2.py. For some reason, python looked at this jp2.py instead of /usr/lib/python3.x/site-packages/jp2.py. After removing python2-jmespath-0.9.3-2, it works. However, such a behaviour is irritating.
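A likely explanation, offered as an assumption rather than something confirmed in this thread: when Python runs a script, it puts the script's own directory first on sys.path, so for /usr/bin/ocrmypdf a stray /usr/bin/jp2.py takes precedence over the copy in site-packages. To see which file a module name resolves to, something like this sketch works (module_origin is a hypothetical helper):

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module name would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Run this from a script placed in the same directory as the failing entry point
    # to reproduce its import path.
    print(module_origin("jp2"))
```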

bsdice commented on 2018-11-06 09:16 (UTC)

@marlemion: Replace aur/img2pdf-git 0.2.1.r8.geedf73e-1 with normal img2pdf 0.3.1-1 and see what happens. pacman -Rd img2pdf-git ; pacman -S --asdeps img2pdf ; or something like that.

marlemion commented on 2018-11-06 08:57 (UTC)

I would like to update to the most recent version of ocrmypdf. Builds fine, but throws this error:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 11, in <module>
    load_entry_point('ocrmypdf==7.2.1', 'console_scripts', 'ocrmypdf')()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 484, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2725, in load_entry_point
    return ep.load()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2343, in load
    return self.resolve()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2349, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/lib/python3.7/site-packages/ocrmypdf/__main__.py", line 36, in <module>
    from ._pipeline import build_pipeline
  File "/usr/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 26, in <module>
    import img2pdf
  File "/usr/lib/python3.7/site-packages/img2pdf.py", line 28, in <module>
    from jp2 import parsejp2
ImportError: cannot import name 'parsejp2' from 'jp2' (/usr/bin/jp2.py)

img2pdf-git has been rebuilt. No effect.

fbrennan commented on 2018-10-02 03:52 (UTC)

I think lossy mode should still be selectable because it's only dangerous in certain situations and leads to really small files otherwise. It just shouldn't be default.

jbarlow commented on 2018-10-01 18:33 (UTC)

@bsdice: I'm aware of the JBIG2 6/8 issue. However, I never intended to enable lossy mode. I attribute the issue to the help text of jbig2enc being misleading. I had to inspect the jbig2enc source to confirm it would indeed select lossy encoding.

In any case it is an easy fix to switch to lossless JBIG2, which still gets better results than CCITT G4, so I will do that in the next release. I haven't decided if I will keep lossy mode.

Generally it is ideal to report upstream issues to upstream since users other than ArchLinux are affected. It so happens I subscribe to the AUR comments, but ocrmypdf is deployed in a lot of places I don't follow.

@fbrennan: I recommend just waiting till the next version.

fbrennan commented on 2018-10-01 10:42 (UTC)

Should the PKGBUILD be changed to reflect the possible danger of jbig2enc?

bsdice commented on 2018-09-29 22:03 (UTC)

Here is a cautionary note for people using this AUR for archival purposes:

The default of ocrmypdf is --optimize 1 ("do safe, lossless optimizations"). If you have jbig2enc installed, this means b/w documents will be re-encoded from CCITT G4 to JBIG2 in so-called "symbol mode", see https://github.com/jbarlow83/OCRmyPDF/blob/master/src/ocrmypdf/exec/jbig2enc.py#L42

Unfortunately it has been shown by D. Kriesel that JBIG2 is able to alter the contents of documents, e.g. by changing a "6" into an "8" due to their similarity at low resolution. In the aftermath German BSI (https://www.bsi.bund.de/DE/Publikationen/TechnischeRichtlinien/tr03138/index_htm.html), Swiss KOST (https://kost-ceco.ch/cms/index.php?id=312,569,0,0,1,0), and maybe others have issued statements forbidding JBIG2 altogether for archival purposes of legally relevant documents. Instead it is recommended to keep using lossless CCITT G4 compression.

Users of this package should therefore use this tool with "--optimize 0" (do not optimize) until further notice. Upstream should use jbig2 only at "--optimize 4" ("do dangerous aggressive lossy optimizations"), which does not exist at this point.

The drawback of G4 is of course larger file sizes, but I prefer that over having to doubt, for every document scanned, whether the numbers or letters are really what was printed in the original document.
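For anyone scripting their scans, the recommendation boils down to always passing --optimize 0. A small sketch of building that command line; archival_command is a made-up helper and the file names are placeholders.

```python
def archival_command(input_pdf, output_pdf):
    """Build an ocrmypdf argument list with optimization disabled (--optimize 0)."""
    return ["ocrmypdf", "--optimize", "0", input_pdf, output_pdf]


if __name__ == "__main__":
    # Only prints the command; hand the list to subprocess.run(..., check=True) to execute it.
    print(" ".join(archival_command("scan.pdf", "scan_ocr.pdf")))
```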

sagittarius commented on 2018-09-06 08:52 (UTC) (edited on 2018-09-06 08:54 (UTC) by sagittarius)

@fbrennan No worries. We're sure you're doing your best and I'm very glad of it. And the least I can do is to report some issues as a user. My very little contribution. So thank you fbrennan. BTW, problem solved: v7.04 works great ;-)

fbrennan commented on 2018-09-04 14:33 (UTC)

@sagittarius Sorry, I am doing my best. I am new at this. I updated python-pikepdf -- updating that package should solve your problem. I'll make sure that this never happens again, I forgot how strict it is about package versions.

@jbehmel Your question has already been answered. Github archives are unusable without hacks for AUR packages. That's because they don't include the .git directory, required by python-setuptools.

jbehmel commented on 2018-09-03 13:55 (UTC)

Hey, I was just wondering why you are not using this link: https://github.com/jbarlow83/OCRmyPDF/archive/v7.0.4.tar.gz

sagittarius commented on 2018-09-01 09:44 (UTC)

For v7.04,

$ ocrmypdf gives:

Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 578, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 895, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 786, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (pikepdf 0.3.1 (/usr/lib/python3.7/site-packages), Requirement.parse('pikepdf<0.4,>=0.3.2'), {'ocrmypdf'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3105, in <module>
    @_call_aside
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3089, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3118, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 580, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 593, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 781, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'pikepdf<0.4,>=0.3.2' distribution was not found and is required by ocrmypdf

There seems to be an issue with the requirement pikepdf<0.4,>=0.3.2, since only 0.3.1 is available.
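To see why the installed 0.3.1 is rejected by the requirement quoted above, here is a small sketch of the range check. The helper names are hypothetical and this is not how pkg_resources does it internally; it only handles plain dotted numeric versions.

```python
def parse_version(text):
    """Turn '0.3.1' into (0, 3, 1). Plain dotted numeric versions only."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, minimum, upper_bound):
    """True when minimum <= installed < upper_bound."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(upper_bound)


# The requirement from the traceback: pikepdf<0.4,>=0.3.2
print(satisfies("0.3.1", "0.3.2", "0.4"))  # installed version at the time
print(satisfies("0.3.2", "0.3.2", "0.4"))  # version after updating python-pikepdf
```

The first call prints False and the second True, which matches the fix of updating python-pikepdf rather than touching ocrmypdf itself.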

jbarlow commented on 2018-08-26 18:53 (UTC)

Most Linux distributions don't have jbig2enc packaged, so jbig2enc is technically optional to ease packaging. But since Arch Linux already has jbig2enc, it should be required; there's no reason not to use it when it's available.

Further details: https://ocrmypdf.readthedocs.io/en/latest/jbig2.html

bsdice commented on 2018-08-25 20:43 (UTC)

Popular package ;-) (also great software, using it a lot)

As per comment by jbarlow on 2018-08-13 22:33 "jbig2enc should be added"

Should jbig2enc be added as a dependency or optional dependency or not at all?

fbrennan commented on 2018-08-24 07:15 (UTC)

python-pytz is a dependency of python-xmp-toolkit which is a dependency of ocrmypdf.

Using an AUR helper, such as yay, might help with packages that have many AUR dependencies. Updating one AUR package in isolation will not work.

connaisseur commented on 2018-08-24 05:52 (UTC)

Just found out that an additional dependency has arisen: python-pytz

Could you please verify / check - and possibly update the PKGBUILD?

bsdice commented on 2018-08-21 15:37 (UTC)

@sleeping Yes, as commented by me at 2018-08-13 04:55

Compare https://github.com/jbarlow83/OCRmyPDF/blob/1e23ea5364f1f39850d72e7a73d233067993dd4a/setup.py#L257

sleeping commented on 2018-08-21 12:38 (UTC)

Missing dependency: python-reportlab

sagittarius commented on 2018-08-18 14:42 (UTC) (edited on 2018-08-18 14:45 (UTC) by sagittarius)

Just for information, I had to rebuild a few packages in this order to be able to launch ocrmypdf 7.0.3:

yaourt -S python-ruffus

yaourt -S python-pikepdf

yaourt -S python-xmp-toolkit

yaourt -S img2pdf-git

$ ocrmypdf --version

7.0.3

jbarlow commented on 2018-08-18 06:48 (UTC)

@fbrennan I was using setuptools_scm_git_archive a few years ago, but it was causing some problems (details of which I don't remember), so I removed it. Maybe I should try it again.

Either way, easiest thing to do is wait for PyPI. Normally it's only 20 minutes behind Github.

fbrennan commented on 2018-08-18 06:13 (UTC)

@jbarlow At least with the way we're building it right now, we can't build Github tarballs due to a known issue in setuptools (or in Github?). Apparently setuptools puts its metadata inside the Git repository, and because Github's tarballs don't include a git repository, the package won't build from Github.

Last night I found out that I could patch your setup.py (see https://pypi.org/project/setuptools_scm_git_archive/ for details) and force it to build anyway, but I thought by that point I must be doing something wrong and you never intended for downstream packagers to do something like that, so decided to wait for PyPI.
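To make the tarball problem described above concrete, here is a rough sketch of the check involved: whether a source tree carries any version metadata at all. It assumes, as far as I know, that setuptools_scm takes the version either from a Git checkout or from the PKG-INFO file shipped in a PyPI sdist. The function name version_source is made up for illustration and is not part of setuptools_scm.

```python
import tempfile
from pathlib import Path


def version_source(source_dir):
    """Guess where a version could come from in an unpacked source tree.

    Hypothetical helper: returns "git" for a checkout, "sdist" when a
    PKG-INFO file is present, and None for a bare archive with neither.
    """
    root = Path(source_dir)
    if (root / ".git").exists():
        return "git"
    if (root / "PKG-INFO").is_file():
        return "sdist"
    return None


with tempfile.TemporaryDirectory() as tmp:
    # An empty tree stands in for a Github archive: no .git, no PKG-INFO.
    print(version_source(tmp))
    # A PyPI sdist ships PKG-INFO, so a version can be recovered.
    (Path(tmp) / "PKG-INFO").write_text("Version: 7.0.3\n")
    print(version_source(tmp))
```

An archive that yields None here is the case where the build needs either a patch or the PyPI sdist.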

jbarlow commented on 2018-08-17 18:49 (UTC)

Aww, thanks everyone. :)

v7.0.3 has been on PyPI for a few days. Normally Github and PyPI are nearly in lockstep, but Travis was having network problems last weekend and failed to deploy v7.0.3 to PyPI (which it does for me). PyPI releases are just distributions of the tagged releases on Github. It's a little better to use PyPI's sdist since it is smaller than a Github checkout.

bsdice commented on 2018-08-17 16:44 (UTC)

Check https://pypi.org/project/ocrmypdf/#files

fbrennan commented on 2018-08-17 16:36 (UTC)

Thank you everyone. Maintaining the package is the least I could do because I use ocrmypdf a lot, and I found the developer extremely cordial and helpful when I had a problem with it while OCR'ing an Esperanto PDF.

Regarding version 7.0.3: someone flagged the package over this, but that version is not yet on PyPI. As far as I know, what's on Github is development, while what's on PyPI is stable. So I'm assuming 7.0.3 is beta since it's not yet on PyPI. As soon as it is on PyPI I will update the PKGBUILD.

If my understanding of this is wrong, feel free to enlighten me.

sagittarius commented on 2018-08-17 10:24 (UTC) (edited on 2018-08-17 14:33 (UTC) by sagittarius)

Thanks to the maintainers and jbarlow for this utility is clearly ULTIMATE (necessary, indispensable, decisive for manipulating PDF files).

I've used the git version of img2pdf, rebuilt some AUR packages (python-pikepdf, pybind11, pngquant...) and it works great :D

jbarlow commented on 2018-08-13 20:33 (UTC)

I'm the author of ocrmypdf and pikepdf - great to see the community here working away on the update. Several changes here are due to deprecated features being removed in Python 3.7.

As of pikepdf 0.3.1, just released today, pybind11.patch will be unnecessary.

A few comments on dependencies compared to https://pastebin.com/84Tb6K6S:

- jbig2enc should be added
- leptonica should be added explicitly (>= 1.76.0, implied by tesseract)
- qpdf should be added explicitly (>= 8.1.0, implied by pikepdf)

bsdice commented on 2018-08-13 02:55 (UTC) (edited on 2018-08-13 02:58 (UTC) by bsdice)

@fbrennan Thanks for adopting it! Glad I could help out the community.

Here are two fixes that escaped my attention:

(1) PKGBUILD of ocrmypdf is missing one depends=( ... 'python-reportlab>=3.3.0' ... )

(2) PKGBUILD of python-xmp-toolkit similarly is missing one depends=(... 'python-pytz')

Everything should be checked with namcap -i <PKGBUILD|final .xz> anyhow.

May I also suggest asking the pikepdf developer on Github why pybind11.patch is needed, and whether it is the correct fix?

fbrennan commented on 2018-08-13 02:22 (UTC) (edited on 2018-08-13 07:20 (UTC) by fbrennan)

Thank you for the guide @bsdice ...

I adopted the package and will push a revised package for 7.0.2. (Unfortunately, have to wait for the python-ruffus package to either be disowned or updated. Will update as soon as that's done.)

mutantmonkey commented on 2018-08-12 19:34 (UTC)

Unfortunately, I haven't had much time to maintain this package as of late. I'm orphaning it so that someone with more time can take over.

bsdice commented on 2018-08-07 23:42 (UTC)

Finally, another new package called "python-xmp-toolkit" is needed, PKGBUILD: https://pastebin.com/xcngPwUq

I have based the PKGBUILD on the git-package: https://aur.archlinux.org/packages/python-xmp-toolkit-git/

In the end, the software will work again:

$ ocrmypdf --version

7.0.2

bsdice commented on 2018-08-07 23:36 (UTC) (edited on 2018-08-07 23:45 (UTC) by bsdice)

Next, update package "python-ruffus" https://aur.archlinux.org/pkgbase/python-ruffus/ and install the updated package python-ruffus-2.7.0-1-any.pkg.tar.xz (you can imho skip the python2 package).

PKGBUILD diff: https://pastebin.com/z9Zs1wZ7

bsdice commented on 2018-08-07 23:32 (UTC)

Next, you need a new package called "python-pikepdf". PKGBUILD: https://pastebin.com/hYRVaiqT

Patch file "pybind11.patch": https://pastebin.com/7BqByXUa

This package does not yet exist in Arch, feel free to create it.

The patch will throw out install_requires and introduce pybind11 as a setup_requires item. Otherwise, this library somehow can't find pybind11 and will quit with a backtrace.

bsdice commented on 2018-08-07 23:27 (UTC)

Then, rebuild this package https://aur.archlinux.org/packages/pybind11 for Python 3.7. The fix got in just a couple of hours ago.

pacaur -S pybind11, or whatever AUR tool you prefer. If you use none, download the PKGBUILD and rebuild/install manually.

bsdice commented on 2018-08-07 23:24 (UTC)

Next, you will need the python package "img2pdf" as a dependency.

PKGBUILD: https://pastebin.com/QwptWeRZ

Create and install as usual with makepkg -Ccfi

bsdice commented on 2018-08-07 23:21 (UTC) (edited on 2018-08-07 23:22 (UTC) by bsdice)

Here is the diff for PKGBUILD: https://pastebin.com/84Tb6K6S

bsdice commented on 2018-08-07 23:18 (UTC)

This package is broken, as of today 2018/08/08. I am going to post here some information on how to build the latest 7.0.2. It will be somewhat difficult.

Here is the error message:

$ ocrmypdf
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 570, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 888, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 779, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (ruffus 2.7.0 (/usr/lib/python3.7/site-packages), Requirement.parse('ruffus==2.6.3'), {'ocrmypdf'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3095, in <module>
    @_call_aside
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3079, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 572, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 585, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 774, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'ruffus==2.6.3' distribution was not found and is required by ocrmypdf

sagittarius commented on 2018-08-05 09:31 (UTC)

It would be great to have v7.0.2

vfrico commented on 2018-05-04 16:25 (UTC) (edited on 2018-05-15 17:54 (UTC) by vfrico)

This PKGBUILD compiles today:

https://0bin.net/paste/WhJ6LfPKH3N0YDpJ#cxC8Le-REOl/guW9WQIxb9/+daxT0a5ufp9o9Z4fXLJ

Also available as a gist:

https://gist.github.com/vfrico/9709e013f01c9a8bc384e3ea85f66c3f

john-soda commented on 2018-04-11 16:43 (UTC)

For everybody: if you want to install the newest ocrmypdf, change the PKGBUILD file to this:

# Maintainer: mutantmonkey <aur@mutantmonkey.in>
# Contributor: Daniel Reuter <daniel.robin.reuter@googlemail.com>

pkgname=ocrmypdf
pkgver=6.1.3
pkgrel=1
pkgdesc="A tool to add an OCR text layer to scanned PDF files, allowing them to be searched"
url="https://github.com/jbarlow83/OCRmyPDF"
arch=('any')
license=('custom')
depends=('python>=3.5' 'python-cffi>=1.9.1' 'python-pillow>=4.0.0'
         'python-pypdf2>=1.26' 'python-reportlab>=3.3.0' 'python-ruffus>=2.6.3'
         'ghostscript>=9.15' 'qpdf>=7.0.0' 'tesseract>=3.04' 'unpaper>=6.1'
         'img2pdf>=0.2.3' 'python-setuptools_scm' 'python-defusedxml>=0.5.0'
         'python-pytest-runner>=3.5.0')
makedepends=('python-setuptools')
source=("https://pypi.python.org/packages/8c/6b/fbd6d134ffa0acd14ba0323d8e4acd739c27f6b1296c5983dfbe86fe821c/ocrmypdf-${pkgver}.tar.gz")
sha256sums=('9320a3913df54d94fce8db4b1ece32e557e313dc0f1a423ab4c533f49771e6c5')

package() {
  cd "${srcdir}/${pkgname}-${pkgver}"
  python setup.py install --root="$pkgdir/" --optimize=1
  install -Dm644 LICENSE $pkgdir/usr/share/licenses/$pkgname/LICENSE
}

john-soda commented on 2018-04-10 13:11 (UTC) (edited on 2018-04-10 13:19 (UTC) by john-soda)

I was able to install ocrmypdf 5.4.x. But when I try to update, or uninstall the old version and install the new one, I always get the following error message:

  File "/usr/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 667, in easy_install
    raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner')

When I install the newest ocrmypdf via git and pip,

pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git

it works.

john-soda commented on 2017-11-18 11:16 (UTC)

I needed an additional dependency, "python-setuptools-scm"; without it, the following error happened:

Download error on https://pypi.python.org/simple/setuptools-scm/: [Errno -2] Name or service not known -- Some packages may not be found!
Couldn't find index page for 'setuptools_scm' (maybe misspelled?)
Download error on https://pypi.python.org/simple/: [Errno -2] Name or service not known -- Some packages may not be found!
No local packages or working download links found for setuptools_scm
Traceback (most recent call last):
  File "setup.py", line 244, in <module>
    zip_safe=False)
  File "/usr/lib/python3.6/site-packages/setuptools/__init__.py", line 128, in setup
    _install_setup_requires(attrs)
  File "/usr/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 666, in easy_install
    raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('setuptools_scm')
==> ERROR: A failure occurred in package().
    Aborting...
==> ERROR: Makepkg was unable to build ocrmypdf.

jbarlow commented on 2017-11-09 08:15 (UTC)

@mutantmonkey I'm the author of ocrmypdf. Thanks for maintaining this for Arch Linux, and glad you find it useful. When you have a chance, please consider updating the dependencies to match setup.py. For v5.2 the requirements for several packages are higher. At the minimum versions listed it won't work anymore (Tesseract 3.04 is now required, for one thing). I also suggest qpdf >= 7.0.0 because older versions have known security holes in handling malicious/malformed PDFs.

mutantmonkey commented on 2017-08-12 18:20 (UTC)

The version of img2pdf in AUR is now up-to-date, so using img2pdf-git is no longer necessary. Both python-ruffus and python-pypdf2 are already listed as dependencies and available in the AUR. If you are having trouble, you should try rebuilding them because you may already have older versions on your system built against an earlier version of Python, which will not work.

rabarrett commented on 2017-08-12 17:20 (UTC)

I got it to run, but only after installing python-pypdf2. Should this be a dependency?

rabarrett commented on 2017-08-08 18:46 (UTC)

I also tried to clone the git repository for ocrmypdf and install it with makepkg (using the PKGBUILD from here). In doing so, I had to install 2 packages manually:

[code]
$ makepkg -i
==> WARNING: A package has already been built, installing existing package...
==> Installing package ocrmypdf with pacman -U...
loading packages...
resolving dependencies...
warning: cannot resolve "python-ruffus>=2.6.3", a dependency of "ocrmypdf"
warning: cannot resolve "img2pdf>=0.2.1", a dependency of "ocrmypdf"
:: The following package cannot be upgraded due to unresolvable dependencies:
      ocrmypdf
:: Do you want to skip the above package for this upgrade? [y/N]
[/code]

(I installed the two it warned about.) But I still got the same error:

[code]
$ ocrmypdf
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3049, in <module>
    @_call_aside
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3033, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3062, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 658, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 972, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 858, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'PyPDF2>=1.26' distribution was not found and is required by ocrmypdf
[/code]

rabarrett commented on 2017-08-08 18:29 (UTC)

It appeared to install correctly for me, but then when I just tried to run it to see the help info, it said there were files "not found." I followed sagittarius's suggestion to install img2pdf-git and it failed to build, complaining: 250 passed 4 xpassed 4 failed (all in cherry/test dir) 3 xfailed 4 skipped

sagittarius commented on 2017-07-26 09:46 (UTC)

With this latest version, I had to replace img2pdf by img2pdf-git to make it work.

mutantmonkey commented on 2017-07-24 02:39 (UTC)

I've updated the dependency for python-img2pdf to img2pdf, as python-img2pdf is a duplicate and img2pdf seems like a more logical name. However, the current version of img2pdf is out-of-date and older than what python-img2pdf provided, so if you run into any weird issues, it may be due to that.

sagittarius commented on 2017-01-23 19:11 (UTC) (edited on 2017-01-23 19:12 (UTC) by sagittarius)

For info, I had to recompile python-ruffus (aur) to make it work.

hason commented on 2016-10-21 08:27 (UTC) (edited on 2016-10-21 08:28 (UTC) by hason)

Please update to the new version 4.2.5.

mutantmonkey commented on 2016-02-23 05:02 (UTC)

I'm aware that this package is out of date, however after building the new package it does not appear to be working properly on a test PDF file. I'll leave it at the working version for now until I get this figured out.

martimcfly commented on 2016-01-08 12:44 (UTC)

Hey mutantmonkey! I got an error about this missing dependency the first time I started ocrmypdf. I don't think it's good to rely on transitive dependencies, since other packages and their dependencies could change. Marti

mutantmonkey commented on 2016-01-08 02:57 (UTC)

@martimcfly are you sure about that? I don't see it listed in requirements.txt and I don't see anything that would depend on it. python-reportlab does depend on it though, so it will be pulled in regardless.

martimcfly commented on 2016-01-04 00:15 (UTC)

python-pip is missing as dependency. could you please add it?

OlafLostViking commented on 2015-10-22 14:42 (UTC)

The new repository: https://github.com/jbarlow83/OCRmyPDF/releases

Falkenber9 commented on 2015-10-10 09:13 (UTC)

Hello, since the latest Python update the function "tostring()" has been removed; it can be replaced with "tobytes()". This affects the file /usr/lib/ocrmypdf/src/hocrTransform.py at line 89.
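If the same script has to run against both old and new library versions, a small fallback along these lines might help. This is only a sketch: to_bytes and LegacyBuffer are invented for illustration, and the actual object used at line 89 of hocrTransform.py is not shown in this thread.

```python
from array import array


def to_bytes(obj):
    """Serialize obj with tobytes() when present, else with the legacy tostring()."""
    method = getattr(obj, "tobytes", None)
    if method is None:
        method = obj.tostring
    return method()


class LegacyBuffer:
    """Hypothetical stand-in for an object from an older release that only has tostring()."""

    def __init__(self, payload):
        self._payload = payload

    def tostring(self):
        return bytes(self._payload)


print(to_bytes(array("B", [72, 105])))    # modern object with tobytes()
print(to_bytes(LegacyBuffer([72, 105])))  # legacy object with only tostring()
```

Both calls return the same bytes, so the calling code no longer cares which method name the installed version provides.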

sagittarius commented on 2015-08-12 10:51 (UTC)

Hello allspark. Could you please update to v3.0? https://github.com/fritz-hh/OCRmyPDF That would be great.

Falkenber9 commented on 2015-07-27 13:45 (UTC)

Note: Due to this bug ( https://savannah.gnu.org/bugs/?45619 ) in parallel-20150622, the script always fails. This is caused by an incorrect exit code returned by parallel. As a quick workaround, downgrade parallel to an older version.

allspark commented on 2015-02-08 13:20 (UTC)

thx danilo

dbrgn commented on 2014-10-01 05:36 (UTC)

Updated package: http://tmp.dbrgn.ch/ocrmypdf-2.1.0-1.src.tar.gz

dbrgn commented on 2014-09-23 12:59 (UTC)

Version 2.1-stable is out. An update would be great, as the current version doesn't work with the latest version of Tesseract anymore.

sagittarius commented on 2014-09-17 23:43 (UTC)

Since the upgrade of python2-reportlab, there have been several issues getting OCRmyPDF to work properly. I had to:

- downgrade to python2-reportlab v2.7 (PKGBUILD here https://projects.archlinux.org/svntogit/community.git/plain/trunk/PKGBUILD?h=packages/python-reportlab&id=5c04a255e9c0f352dee3282f2a308d375926ed30).
- replace in /usr/lib/ocrmypdf/src/ocrPage.sh: mv "$curHocr.html" "$curHocr" by mv "$curHocr.hocr" "$curHocr"
- overwrite /usr/lib/ocrmypdf/src/hocrTransform.py with the original file from GitHub: https://github.com/fritz-hh/OCRmyPDF/tree/v2.x/src

I also made a KDE service menu named OCRmyPDF.desktop in /usr/share/kde4/services/ServiceMenus/ with these contents:

[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Icon=application-postscript
TryExe=OCRmyPDF.sh
Actions=OCRmyPDFclean;OCRmyPDFnoclean

[Desktop Action OCRmyPDFclean]
Name=OCR -> PDF clean
Icon=application-postcript
Exec=OCRmyPDF.sh -l eng -d -c -i %f "`echo %f | perl -pe 's/\.[^.]+$//'`-ocr.pdf";kdialog --passivepopup "Done" 3; echo

[Desktop Action OCRmyPDFnoclean]
Name=OCR -> PDF noclean
Icon=application-postcript
Exec=OCRmyPDF.sh -l eng -d -c %f "`echo %f | perl -pe 's/\.[^.]+$//'`-ocr.pdf";kdialog --passivepopup "Done" 3; echo

dreuter commented on 2014-09-16 17:31 (UTC)

@Chais: Could you provide a minimal (not) working example? So just some (sample) files to ocr and the commandline options you used.

Chais commented on 2014-09-16 12:56 (UTC)

When trying to OCR a PDF I'm getting these errors: http://sprunge.us/dKNQ No idea what to make of it.

dreuter commented on 2014-03-09 17:44 (UTC)

I just fixed it. Thanks. I also wrote a pull request to change it in the code (but maybe it does not affect Ubuntu users, so it may not be merged soon).

p3t3r commented on 2014-03-05 17:39 (UTC)

Thanks for this package, it's a great tool for quick and decent OCR. While it worked at first, it unfortunately stopped working soon afterwards. But there's a fix, since it's just a Python function that has since been renamed: in /usr/lib/ocrmypdf/src/hocrTransform.py, replace _AsciiBase85Encode with asciiBase85Encode and everything is fine again.
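For anyone who would rather script that rename than edit the file by hand, a sketch could look like this. The helper rename_identifier and the sample line are made up for illustration; the real contents of hocrTransform.py are not shown in this thread.

```python
import re


def rename_identifier(source, old, new):
    """Replace whole-identifier occurrences of old with new in source text."""
    # The lookarounds keep names that merely contain `old` untouched.
    pattern = r"(?<![A-Za-z0-9_])" + re.escape(old) + r"(?![A-Za-z0-9_])"
    return re.sub(pattern, lambda match: new, source)


# Invented sample line, not taken from the real file.
line = "encoded = _AsciiBase85Encode(raw)"
print(rename_identifier(line, "_AsciiBase85Encode", "asciiBase85Encode"))
```

Applied to the file contents read via pathlib.Path.read_text(), the result can be written back with write_text() after making a backup copy.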