.\" Automatically generated by Pandoc 3.7.0.1
.\"
.TH "jbig2enc: Documentation" "" "" ""
.SH \f[CR]jbig2enc\f[R]: Documentation
Adam Langley \f[CR]<agl\(atimperialviolet.org>\f[R]
.SS What is JBIG2
JBIG2 is an image compression standard from the same people who brought
you the JPEG format.
It compresses 1bpp (black and white) images only.
These images can consist of \f[I]only\f[R] black and white; there are
no shades of gray \- that would be a grayscale image.
Any \(dqgray\(dq areas must, therefore, be simulated using black dots
in
a pattern called \c
.UR http://en.wikipedia.org/wiki/Halftone
halftoning
.UE \c
\&.
.PP
The JBIG2 standard has several major areas:
.IP \(bu 2
Generic region coding
.IP \(bu 2
Symbol encoding (and text regions)
.IP \(bu 2
Refinement
.IP \(bu 2
Halftoning
.PP
There are two major compression technologies which JBIG2 builds on: \c
.UR http://en.wikipedia.org/wiki/Arithmetic_coding
arithmetic encoding
.UE \c
\ and \c
.UR http://en.wikipedia.org/wiki/Huffman_coding
Huffman encoding
.UE \c
\&.
You can choose between them, and even use both in the same JBIG2 file,
though this is rare.
Arithmetic encoding is slower, but compresses better.
Huffman encoding was included in the standard because fax machines were
among the intended users of JBIG2, and they might not have the
processing power for arithmetic coding.
.PP
\f[CR]jbig2enc\f[R] \f[I]only\f[R] supports arithmetic encoding.
.SS Generic region coding
Generic region coding is used to compress bitmaps.
It is progressive and uses a context around the current pixel to be
decoded to estimate the probability that the pixel will be black.
If the probability is 50% it uses a single bit to encode that pixel.
If the probability is 99% then it takes less than a bit to encode a
black pixel, but more than a bit to encode a white one.
.PP
The context can only refer to pixels above and to the left of the
current pixel, because the decoder doesn\(aqt know the values of any of
the other pixels yet (pixels are decoded left\-to\-right,
top\-to\-bottom).
Based on the values of these pixels it estimates a probability, and
updates its estimate for that context based on the actual pixel
found.
All contexts start off with a 50% chance of being black.
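.PP
The coder specified by JBIG2 is the table\-driven MQ arithmetic coder,
and the standard\(aqs context templates use between 10 and 16
neighbouring pixels; the adaptation idea can nevertheless be sketched
with a simple counting estimator (a minimal illustration, not
\f[CR]jbig2enc\f[R]\(aqs actual implementation):
.IP
.EX
#include <array>
#include <cstdint>
#include <vector>

// Simplified context model: the neighbouring, already\-decoded pixels
// form a 10\-bit context; each context counts the black and white
// pixels it has seen so far.
struct ContextModel {
  std::vector<std::array<uint32_t, 2>> counts;  // [ctx][is_black]
  ContextModel() : counts(1 << 10, {1, 1}) {}   // all contexts start 50/50

  // Estimated probability that the next pixel in this context is black.
  double p_black(uint16_t ctx) const {
    return double(counts[ctx][1]) / (counts[ctx][0] + counts[ctx][1]);
  }

  // After coding the real pixel, adapt the estimate for this context.
  void update(uint16_t ctx, bool black) { counts[ctx][black ? 1 : 0]++; }
};
.EE
.PP
An ideal arithmetic coder then spends about \-log2(p) bits on each
pixel, which is how a 99%\-predictable pixel costs much less than a
bit.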
.PP
You can encode whole pages with this and you will end up with a perfect
reconstruction of the page.
However, we can do better...
.SS Symbol encoding
Most input images to JBIG2 encoders are scanned text.
These have many repeating symbols (letters).
The idea of symbol encoding is to encode what a letter \(lqa\(rq looks
like and, for all the \(lqa\(rqs on the page, just give their locations.
(This is lossy encoding.)
.PP
Unfortunately, all scanned images have noise in them: no two \(lqa\(rqs
will look quite the same, so we have to gather all the symbols on a
page into groups.
Hopefully each member of a given group will be the same letter,
otherwise we might place the wrong letter on the page!
These very surprising errors are called cootoots.
.PP
However, assuming that we group the symbols correctly, we can get great
compression this way.
Remember that the stricter the classifier, the more symbol groups
(classes) will be generated, leading to bigger files but also a lower
risk of cootoots (misclassification).
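.PP
\f[CR]jbig2enc\f[R] delegates this classification to Leptonica, which
applies several tests; the fraction\-of\-matching\-pixels comparison
below is only a minimal sketch of the core idea (the names
\f[CR]Bitmap\f[R] and \f[CR]same_class\f[R] are illustrative, not
\f[CR]jbig2enc\f[R]\(aqs API):
.IP
.EX
#include <cstddef>
#include <cstdint>
#include <vector>

// A 1bpp symbol bitmap, row\-major, one value per pixel for clarity.
struct Bitmap {
  size_t w, h;
  std::vector<uint8_t> px;  // 0 = white, 1 = black
};

// Treat two symbols as the same class when at least a given fraction
// of their pixels agree (cf. the \-t option described below).
bool same_class(const Bitmap &a, const Bitmap &b, double threshold) {
  if (a.w != b.w || a.h != b.h)
    return false;  // real classifiers tolerate small size differences
  size_t match = 0;
  for (size_t i = 0; i < a.px.size(); i++)
    if (a.px[i] == b.px[i]) match++;
  return double(match) / a.px.size() >= threshold;
}
.EE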
.PP
This is great, but we can do better...
.SS Symbol retention
Symbol retention is the process of compressing multi\-page documents by
extracting the symbols from all the pages at once and classifying them
all together.
Thus we only have to encode a single letter \(lqa\(rq for the whole
document (in an ideal world).
.PP
This is obviously slower, but generates smaller files (about half the
size on average, with a decent number of similar typeset pages).
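.PP
A minimal sketch of the flow, reusing the illustrative
\f[CR]Bitmap\f[R] and \f[CR]same_class\f[R] from the sketch above: one
dictionary is shared by every page, so each symbol either joins a class
founded on an earlier page or starts a new one.
.IP
.EX
#include <vector>

// One dictionary for the whole document, not one per page.
std::vector<Bitmap> dictionary;

// Return the class index for a symbol, growing the dictionary on a miss.
size_t classify(const Bitmap &sym, double threshold) {
  for (size_t i = 0; i < dictionary.size(); i++)
    if (same_class(dictionary[i], sym, threshold))
      return i;               // reuse a class from any earlier page
  dictionary.push_back(sym);  // first sighting: new class
  return dictionary.size() \- 1;
}
.EE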
.PP
One downside you should be aware of: if you are generating JBIG2
streams for inclusion in a linearised PDF file, the PDF reader has to
download all the symbols before it can display the first page.
There is a solution to this involving multiple dictionaries and symbol
importing, but that\(aqs not currently supported by \f[CR]jbig2enc\f[R].
.SS Refinement
Symbol encoding is lossy because of noise, which is classified away,
and also because the symbol classifier is imperfect.
Refinement allows us, when placing a symbol on the page, to encode the
difference between the actual symbol at that location, and what the
classifier told us was \(lqclose enough\(rq.
We can choose to do this for each symbol on the page, so we don\(aqt
have to refine when we are only a couple of pixels off.
If we refine whenever we see a wrong pixel, we have lossless encoding
using symbols.
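.PP
As a sketch of that per\-placement decision (again with the
illustrative \f[CR]Bitmap\f[R] from above): count the pixels where the
class exemplar differs from the symbol actually scanned at that
position, and refine only past a tolerance, as exposed by the
\f[CR]\-r\f[R] option below.
.IP
.EX
#include <cstddef>

// Number of pixels where the class exemplar differs from the symbol
// as it actually appears at this position on the page.
size_t wrong_pixels(const Bitmap &exemplar, const Bitmap &actual) {
  size_t n = 0;
  for (size_t i = 0; i < exemplar.px.size(); i++)
    if (exemplar.px[i] != actual.px[i]) n++;
  return n;
}

// Refine when the exemplar is more than tolerance pixels off; a
// tolerance of 0 refines every imperfect placement, i.e. lossless
// encoding using symbols.
bool should_refine(const Bitmap &exemplar, const Bitmap &actual,
                   size_t tolerance) {
  return wrong_pixels(exemplar, actual) > tolerance;
}
.EE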
.SS Halftoning
\f[CR]jbig2enc\f[R] doesn\(aqt support this at all \- so I will only
mention this quickly.
The JBIG2 standard supports the efficient encoding of halftoning by
building a dictionary of halftone blocks (like the dictionaries of
symbols which we build for text pages).
The lack of support for halftones in G4 (the old fax standard) was a
major weakness.
.SS Some numbers
My sample is a set of 90 scanned pages from the middle of a recent
book.
The scanned images are 300dpi grayscale and they are upsampled to
600dpi 1\-bpp for encoding.
.PP
Generic encoding each page: 3435177 bytes
.PP
Symbol encoding each page (default classifier settings): 1075185 bytes
.PP
Symbol encoding with refinement for more than 10 incorrect pixels:
3382605 bytes
.PP
Symbol encoding is thus about 3.2 times smaller than generic coding
here, while refinement at this tolerance gives back nearly all of that
gain.
.SS Command line options
\f[CR]jbig2enc\f[R] comes with a handy command line tool for encoding
images.
.IP \(bu 2
\f[CR]\-d | \-\-duplicate\-line\-removal\f[R]: When encoding generic
regions, each scanline can be tagged to indicate that it\(aqs the same
as the previous scanline, in which case encoding of that scanline is
skipped.
This drastically reduces the encoding time (by a factor of about 2 on
some images) although it doesn\(aqt typically save any bytes.
This is an option because some versions of \f[CR]jbig2dec\f[R] (an open
source decoding library) cannot handle it.
.IP \(bu 2
\f[CR]\-p | \-\-pdf\f[R]: The PDF spec includes support for JBIG2
(Syntax→Filters→JBIG2Decode in the PDF reference, versions 1.4 and
above).
However, PDF requires a slightly different format for JBIG2 streams: no
file/page headers or trailers and all pages are numbered 1.
In symbol mode, the output is a series of files: \f[CR]symboltable\f[R]
and \f[CR]page\-\f[R]\f[I]n\f[R] (numbered from 0).
.IP \(bu 2
\f[CR]\-s | \-\-symbol\-mode\f[R]: use symbol encoding.
Turn on for scanned text pages.
.IP \(bu 2
\f[CR]\-t <threshold>\f[R]: sets the fraction of pixels which have to
match in order for two symbols to be classed the same.
This isn\(aqt strictly true, as there are other tests as well, but
increasing this will generally increase the number of symbol classes.
.IP \(bu 2
\f[CR]\-w <weight>\f[R]: sets the weight factor (0.1\-0.9) that
corrects the match threshold for thick characters.
.IP \(bu 2
\f[CR]\-T <threshold>\f[R]: sets the black threshold (0\-255).
Any gray value darker than this is considered black.
Anything lighter is considered white.
.IP \(bu 2
\f[CR]\-r | \-\-refine <tolerance>\f[R]: (requires \f[CR]\-s\f[R]) turn
on refinement for symbols with more than \f[CR]tolerance\f[R] incorrect
pixels.
(10 is a good value for 300dpi, try 40 for 600dpi).
Note: this is known to crash Adobe products.
.IP \(bu 2
\f[CR]\-O <outfile>\f[R]: dump a PNG of the 1 bpp image before encoding.
Can be used to test loss.
.IP \(bu 2
\f[CR]\-2\f[R] or \f[CR]\-4\f[R]: upscale either two or four times
before converting to black and white.
.IP \(bu 2
\f[CR]\-S\f[R]: segment an image into text and non\-text regions.
This isn\(aqt perfect, but running non\-text regions through the symbol
compressor is terrible, so it\(aqs worth doing if your input has images
in it (like a magazine page).
You can also give the \f[CR]\-\-image\-output\f[R] option to set a
filename to which the removed (non\-text) parts are written (PNG
format).
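.PP
Putting the options together, a typical invocation for a scanned,
text\-only document destined for a PDF might look like the following
(the \f[CR]jbig2\f[R] binary name and output file names follow the
\f[CR]\-p\f[R] description above; embedding the streams in the PDF is
your PDF generator\(aqs job):
.IP
.EX
# symbol mode, PDF\-style streams, refine only gross errors (300dpi)
jbig2 \-s \-p \-r 10 page*.png
# produces: symboltable, page\-0, page\-1, ...
# each page\-n becomes a JBIG2Decode\-filtered image stream and
# symboltable becomes the shared JBIG2Globals stream
.EE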