The key to using JBIG2 for compressing documents that must be archived or that have strict retention requirements is that the encoding be reliable and lose no informational content. One effective way to measure informational loss in an automated environment is with an OCR (optical character recognition) program, which can verify that recognition rates are as high after JBIG2 compression as before.
Although JBIG2 compression can be used in lossless mode, it is far more effective when used in a perceptually lossless mode. Most of the file size in a lossless encoding is used to encode sensor noise and other digitization artifacts not germane to the file's informational content. Though lossy JBIG2 can result in much more compact file sizes, it is very important to ensure that the lossy encoding is, in fact, perceptually lossless as described in the previous section of this Primer. Character mismatches and degraded OCR recognition are not acceptable in a perceptually lossless JBIG2 encoding. In other words, a cheaper JBIG2 solution could be much more expensive in the long run.
The file size advantage of JBIG2 over generic TIFF is usually quite dramatic. For a typical scanned document at 300 dpi, the TIFF is roughly 75 KB-125 KB per image, while the JBIG2 is about 5x-10x smaller, in the range of 10 KB-15 KB per image.
The compressed JBIG2 file size can differ dramatically between JBIG2 encoders. For example, consider the original file sigice9_172.tif. The JBIG2 PDF file created by one vendor (CVISION PdfCompressor) is less than half the size of the one created by a competing vendor. Other file size comparisons between these two JBIG2 converters across several datasets are given below. As shown, the CVision-encoded files are generally 35%-40% smaller.
Figure 5. Across three very different sets of images, the CVision compressed files are significantly smaller than those of the competition.
Even within lossless compression there can be a big difference in file size. Here is a comparative sample of a typical patent file in which the size of a lossless CVision compressed file is less than a third of the size of a competing vendor's lossless file.
The Adobe PDF document format has, until the release of the PDF 1.4 specifications, supported the standard compression/decompression filters: LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode, and DCTDecode (JPEG-based). These decoding filters allow the PDF data streams to be compressed when the PDF file is written and then decoded by the Adobe PDF Reader. With the inclusion of the JBIG2Decode filter in the PDF specifications in 2001 (see the PDF Reference, 3rd Edition, Version 1.4), scanned documents can now be encoded using the new ITU-approved JBIG2 format and, at the same time, be fully PDF-compliant.
When a new format is introduced, like JBIG2, JPEG2000, or MPEG4, there is usually a considerable time delay until supporting viewers/players and other editing software is available to handle documents in this new format. One advantage to PDF wrapping a new format like JBIG2 is that readability is essentially guaranteed, assuming a more recent version of Adobe Reader (5.05 or higher) has been installed on the client machine. In fact, most companies currently using JBIG2 for their document compression are using PDF-wrapped JBIG2, not native JBIG2. For a slight increase in file size, a JBIG2 document can be wrapped in a PDF, thereby enabling the document to access a range of features supported in PDF that are not available in native JBIG2.
The JBIG2 format allows countless ways to encode any given image.
One of the first things a JBIG2 encoder must do is segment the image into its constituent symbols. The JBIG2 specs place no restrictions on how to do that. Therefore, every vendor creates their own proprietary algorithm to segment the image in the manner which they believe will lead to the greatest compression savings. Each of these symbols must then be encoded, as well as the information on how to position them.
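Though every vendor's segmentation algorithm is proprietary, the general idea can be sketched with a standard connected-component pass over the bitonal bitmap. This is a minimal illustration, not taken from any actual JBIG2 encoder; all names are invented:

```python
# Sketch: segment a bitonal image into symbols via 4-connected
# component labeling -- one plausible, illustrative approach; real
# JBIG2 encoders use their own proprietary segmentation algorithms.
from collections import deque

def segment_symbols(bitmap):
    """bitmap: list of rows of 0/1 ints. Returns a list of symbols,
    each a set of (row, col) black-pixel coordinates."""
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    symbols = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] == 1 and not seen[r][c]:
                # Flood-fill this connected component.
                comp, queue = set(), deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                symbols.append(comp)
    return symbols

# Two separate blobs on a tiny "page" -> two symbols.
page = [[1, 1, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0]]
print(len(segment_symbols(page)))  # → 2
```

Each symbol found this way would then be encoded, along with its position on the page.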
Each of these decisions presents a wide range of options to the encoder. For example, let's say an image contains 5,000 characters and the symbol stream contains 5,000 symbols, one for each character. As the JBIG2 specs don't say in what order the symbols should be encoded, the encoder must choose the optimal ordering of the symbols. Of all the possible orderings, some of them will result in the smallest file size, while some of them may be so bad that the JBIG2 file could be even bigger than the original TIFF file!
Because the encoder can decide on the ordering, various JBIG2 vendors can each develop their own algorithms to minimize this part of the file size cost. The JBIG2 specifications provide a framework which enables great compression savings when used properly. The best manner to take advantage of it is an area of ongoing research and so the difference in file size between different JBIG2 implementations can vary drastically.
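The effect of symbol ordering on file size can be shown with a toy cost model. This is purely illustrative, not how any real JBIG2 encoder codes its streams: suppose each symbol bitmap after the first is coded as its XOR difference from the previous one, so an ordering that places similar symbols next to each other costs fewer bits:

```python
# Toy cost model (illustrative only): the first symbol is coded in
# full; each subsequent symbol costs the number of pixels in which it
# differs from its predecessor in the chosen ordering.
def delta_cost(order, bitmaps):
    cost = sum(bitmaps[order[0]])
    for prev, cur in zip(order, order[1:]):
        cost += sum(a ^ b for a, b in zip(bitmaps[prev], bitmaps[cur]))
    return cost

# Three 6-pixel "symbols": 0 and 2 are near-identical, 1 is different.
bitmaps = [
    [1, 1, 1, 0, 0, 0],   # symbol 0
    [0, 0, 0, 1, 1, 1],   # symbol 1 (unlike the others)
    [1, 1, 1, 1, 0, 0],   # symbol 2 (similar to 0)
]
bad  = delta_cost([0, 1, 2], bitmaps)   # dissimilar neighbors
good = delta_cost([0, 2, 1], bitmaps)   # similar symbols adjacent
print(bad, good)  # → 14 9
```

Even in this tiny example, the ordering alone changes the modeled cost by a third; over thousands of symbols, the choice of ordering heuristic is one place where implementations diverge.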
The ordering of the symbols within a JBIG2 stream does not in any way affect image quality. While a naive ordering may not achieve the full compression savings available under the JBIG2 format, the resulting compressed file will still look fine. Ordering does, however, provide another way for a good JBIG2 encoder to differentiate itself from the pack, by offering a significantly better compression ratio.
Typical OCR conversion of documents leaves users stuck between two limited formats. First, there is the OCR'ed text file, which is easy to text-search and allows simple copying into a text editor such as Microsoft Word. Second, there is the original document, which contains formatting, graphics, and other information not usually found in a text file, and which an OCR program cannot reliably reproduce.
To view an OCR'ed text file in tandem with the original TIFF or JPEG can be quite tedious. Fortunately, PDF solves this problem by embedding a hidden text layer into an image PDF. This gives the user a fully searchable document which still contains all the visual information available in the original. Searching on a word can then take you to the exact location in the image where that word appears. An example of this image + text PDF is shown in the figure below, where the search is on the text string "determine".
Figure 6. A hidden text word is highlighted in Adobe Reader.
One possible drawback to this embedded OCR approach is that it increases the file size. In addition to storing the original document, you must also store an additional hidden text layer. JBIG2 solves this by greatly decreasing the size of the image layer, so that the compressed JBIG2 PDF including its text layer will generally be much smaller than the original file. You can have more information presented in a more useful manner, even as the file size has been greatly reduced. For example, the size of the original TIFF for US Patent 6122633 was 849,372 bytes, while the JBIG2-wrapped PDF with OCR (using CVision PdfCompressor) is less than 15% of that, at 120,648 bytes.
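As a quick arithmetic check of the 15% figure, using the byte counts quoted above:

```python
# The patent example: compressed size as a percentage of the original.
original_bytes = 849_372    # TIFF of US Patent 6122633
compressed_bytes = 120_648  # JBIG2-wrapped PDF including OCR text layer
print(round(compressed_bytes / original_bytes * 100, 1))  # → 14.2
```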
When you consider that even the most accurate OCR engines frequently produce errors, it is always a good idea to have easy access to the original document in case the text file seems inaccurate.
As an example of PDF web-optimization, also referred to as linearization, consider the PDF file Patentsl20.pdf. Although the PDF file shown is over 2,200 pages long, an HTML hyperlink ending in the PDF open parameter #page=2000 can instruct the viewer (e.g., Adobe Reader) to open directly to a given page. In this case, Reader opens the PDF file directly to page 2000. This random-access feature, which also includes file streaming (i.e., displaying the first page as soon as it downloads), is not available in pure JBIG2.
In order to fully appreciate the ways in which JBIG2 can enhance image quality, it is important to understand the distortions introduced by the scanning process. When used properly, JBIG2 can rectify many of these distortions and make the image appear closer to the original printed document.
The scanning process transforms a document from the continuous space of the printed page to the quantized space of a digital image. Strictly speaking, the level of precision of the characters on a printed page is limited by the molecules of the page. However, from the perspective of the human visual system, the page is perfectly smooth.
With a scanned document, the precision of an image is limited by the DPI. While at higher DPIs (such as 300 dpi and above) this will not cause any visible artifacts, it still presents a problem. Even a perfect sensor with a well-behaved monotonic response to the image will often divide a single font on the printed page into many different discrete shapes in the scanned image, i.e., fragmentation. The scanner can't identify each font and create the ideal digital representation of it. It is forced to digitize each character on its
own, and slight differences in how each character of a font is positioned in relation to the scanner grid will often result in different digital representations in the scanned image. Consider the variations in these original-to-scanned examples of the letter "a". Take an idealized "a", as on the left in the first two examples. Scanners will take that "a" and align the pixels according to the gridlines the scanner uses. The pixels of the scan are more "boxy" than smoothed. This is because a scanner represents each pixel with a grid box. Each box or pixel must be either white or black in a bitonal scan; there can be no "partial" decision. As a consequence, the scanned image loses precision in a process known as quantization. What is relevant for image compression is that the resulting bitmaps are highly sensitive to the precise placement of the scanning grid vis-a-vis the image. In the first two examples, you see how a slight change in the positioning of the scanner grid causes the same model to result in different bitmaps. The third example shows the resulting bitmaps side by side.
Figure 7. This shows a scanner's grid overlay on an idealized "a" and the resulting scan. Note the boxy appearance of the bitonal scan.
Figure 8. This is a second example of a scanner's grid overlay on the same letter "a", but with slightly different (smaller) font size, and its resulting bitonal scan.
Figure 9. Same letter, two scans. There are plenty of variations in the results, which makes accuracy and verifiability crucial elements in JBIG2 compression.
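The grid-alignment effect shown in the figures can be reproduced numerically. The following is a deliberately simplified one-dimensional model, with invented names and a 50% coverage threshold chosen purely for illustration: a pixel goes black when the ideal stroke covers at least half of it, and shifting the grid slightly changes the resulting bitmap:

```python
# Sketch (illustrative): bitonal quantization of the same ideal stroke
# at two slightly different grid alignments. A pixel turns black when
# the stroke covers at least half of that pixel's span.
def quantize(shape_start, shape_end, offset, n_pixels):
    bits = []
    for i in range(n_pixels):
        lo, hi = i + offset, i + 1 + offset   # pixel's span on the page
        covered = max(0.0, min(hi, shape_end) - max(lo, shape_start))
        bits.append(1 if covered >= 0.5 else 0)
    return bits

# The same 2.4-unit-wide stroke, scanned with the grid shifted by 0.45:
print(quantize(0.3, 2.7, 0.0, 4))   # → [1, 1, 1, 0]
print(quantize(0.3, 2.7, 0.45, 4))  # → [1, 1, 0, 0]
```

The identical stroke yields two different bitmaps, just as the identical printed "a" yields different scans in Figures 7-9.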
Furthermore, a scanner will not be precisely as sensitive to black throughout the image. This can cause the same font to appear slightly thicker in some parts of the page and slightly thinner in other parts of the page. To further compound the problem, many scanners tend to pick up visual noise that did not appear in the original document, some of which may become attached to characters.
This fragmentation of a font during digitization into many distinct characters has several drawbacks. The characters representing a given font and character code all appeared the same in the original document and there is no benefit in their looking different in the scanned image. Some of these characters may even appear awkward and ill-formed. From a compression perspective you have another problem. Every difference
between those characters needs to be encoded, resulting in much larger files. To help overcome these problems, JBIG2 allows pattern matching & substitution.
Pattern matching & substitution (PM&S) is perhaps the most powerful technique available within JBIG2. It enables the better JBIG2 implementations to achieve superior compression results even as they improve image quality. However, in the hands of a lesser JBIG2 implementation, PM&S can severely distort the image and lose information.
The premise behind PM&S is quite simple. If distinct characters on a scanned page are really different instances of the same font in the original document, you can improve image quality and drastically reduce file size by replacing each of those distinct characters with the same font in the compressed file.
Consider the figure below. The left box contains all the instances of a lowercase "h" in a standard document. The right box contains the single bitmap which replaces all of them in the compressed file. As you can see, the instance of the font used by the JBIG2 encoder looks much nicer than many of the characters in the scanned document.
Figure 10. All the instances of the letter "h" in the original document, even though they may vary slightly one from another, are replaced by the model on the right in the compressed document.
The cost of maintaining those 184 distinct instances of a lowercase "h" in the original document is very high. Each of them needs to be sent to the decoder, even though many of them are not particularly attractive and can even detract from the image quality. Merging them all into a single font allows a much smaller symbol dictionary, which can drastically reduce the file size.
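The basic PM&S bookkeeping can be sketched as follows. This is an illustrative toy, not any vendor's algorithm: each incoming symbol is compared against the dictionary built so far, and symbols within a pixel-difference threshold are mapped to an existing entry rather than stored again:

```python
# Sketch (illustrative): pattern matching & substitution. A symbol whose
# bitmap differs from an existing dictionary entry by fewer than
# `threshold` pixels reuses that entry instead of being stored.
def build_dictionary(symbols, threshold):
    dictionary, mapping = [], []
    for sym in symbols:
        for idx, rep in enumerate(dictionary):
            if sum(a ^ b for a, b in zip(sym, rep)) < threshold:
                mapping.append(idx)           # reuse an existing pattern
                break
        else:
            dictionary.append(sym)            # new pattern: store it
            mapping.append(len(dictionary) - 1)
    return dictionary, mapping

# Four scanned instances of two letters, each pair with a noisy pixel.
symbols = [
    [1, 1, 1, 0, 0, 0],  # "h" instance 1
    [1, 1, 1, 1, 0, 0],  # "h" instance 2 (one noisy pixel)
    [0, 0, 0, 1, 1, 1],  # "o" instance 1
    [0, 0, 1, 1, 1, 1],  # "o" instance 2 (one noisy pixel)
]
dictionary, mapping = build_dictionary(symbols, threshold=2)
print(len(dictionary), mapping)  # → 2 [0, 0, 1, 1]
```

Four scanned bitmaps collapse to a two-entry dictionary. The entire difficulty lies in the matching criterion: a threshold that is too loose will merge genuinely distinct letters, producing exactly the mismatches described next.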
Like all powerful tools, it is essential that PM&S be used correctly. Among the worst mistakes a JBIG2 encoder can make is a font substitution error, commonly known as a mismatch. If an encoder mistakenly includes a character in the wrong font, it will replace that character with the mistaken font in the compressed file. This creates a typo that will be seen in the compressed document. This misspelled word will confuse those who read the document and will cause an OCR engine that processes the compressed file to generate the wrong textual information. The only way to recover the lost information would be to recover it from the original document.
The ability to use PM&S presents many JBIG2 vendors with a dilemma. In order to stay competitive and get the best compression rates, they need to map as many characters as possible to the same font. A single mismatch, though, can potentially make the document worthless. Since the JBIG2 specs have nothing to say on which characters can be safely matched together and which can't, each JBIG2 vendor must develop their own proprietary algorithms. These algorithms involve sophisticated computer vision techniques. It is therefore not uncommon to find mismatches produced by many JBIG2 implementations, especially from the more recent entrants into the field.
These mismatches can severely degrade image quality. Here is a sample from a typical image file. The top half of the figure below shows what the original looked like after lossy compression by a typical JBIG2 vendor. By way of contrast, the same document compressed by a second vendor (CVision PdfCompressor), seen in the bottom half of the figure, is accurate.
Figure 11. The file on top corrupts textual information while the one on bottom is faithful to the original. The difference between the two is striking.
The best way to verify the accuracy of a JBIG2 implementation is to run it on a set of files and visually inspect the results. Wherever the JBIG2 compressed files differ from the original images, you would want the differences to be improvements in image quality; at a minimum, you should insist on no degradation. While there is no substitute for looking at the compressed images and seeing if they appear acceptable, it can be a time-consuming process. As a result, many people are interested in verification methods that can be automated.
For this reason, we recommend that an OCR engine be used to compare the quality of the image before compression and afterwards. The words in the text files produced by each image can be programmatically looked up in a dictionary to see if they are valid or not. This can produce an easily measurable score of how well each document did. A good JBIG2 implementation should produce a compressed file that does about as well, if not better, than the original image.
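Such a dictionary-lookup score can be computed in a few lines. This is a minimal sketch; the word list and the sample OCR transcripts below are invented for illustration:

```python
# Sketch (illustrative): score an OCR transcript by counting its words
# that appear in a reference dictionary, then compare the score of the
# original scan against the score of the compressed file.
def ocr_score(ocr_text, dictionary):
    return sum(1 for w in ocr_text.lower().split() if w in dictionary)

dictionary = {"the", "patent", "claims", "a", "method", "for", "encoding"}
# Hypothetical OCR output: the original scan misreads "claims".
before = ocr_score("The patent clairns a method for encoding", dictionary)
after  = ocr_score("The patent claims a method for encoding", dictionary)
print(before, after)  # → 6 7
```

A good JBIG2 implementation should produce a post-compression score at least as high as the pre-compression one, as in this toy example.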
For example, when measured by an OCR Validation tool, the original document shown in the previous section had a score of 469 (i.e. 469 words in the OCRed text file had a match in a standard dictionary) while the CVision compressed file had a score of 472. On the other hand, the badly mismatched document produced by another vendor had an OCR score of 449.
Figure 12. The OCR engine found more valid words in the CVision compressed image than it did in the original image. The compressed image created by the competition did much worse than the original.
No OCR engine is 100% accurate. They all miss an occasional word that is clear to a human reader. Since there is an element of chance in even the best OCR engines, you can't determine whether a JBIG2 implementation degrades quality by testing it against the original on just a few files. Over a large database, however, it can be a very good measure of image quality. Within a small margin of error, you want the JBIG2 files to have OCR recognition rates about the same as, or even better than, those of the original files.
This is a good indication that the compression preserves image quality for this type of document, i.e., no OCR-based information loss.
Figure 13. Across the entire 186-page document, the OCR engine found more valid words in the CVision compressed files than it did in the original, thereby indicating that image quality was preserved.
The halftoning feature in JBIG2 provides very effective compression for bitonal files that contain picture images or greyscale regions. These regions are represented bitonally using halftoning patterns for documents such as newspapers or tax returns.
Conceptually, the idea is to take regions where pixels are not meant to stand alone, but rather convey the intensity of an image region, and have JBIG2 encode the region's intensity rather than specific pixel values. Although the compression rates for such files are often dramatic, there can be artifacts if done incorrectly. For halftoning to be a "safe" JBIG2 procedure, the compression system must have a reliable segmentor for text extraction. Text symbols, either in text zones or picture regions, cannot be halftoned for this process to be categorized as perceptually lossless.
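The underlying idea can be sketched as follows. This is an illustrative toy, not the actual JBIG2 halftone region coding: each grey cell is replaced by a small bitonal pattern whose black-pixel count approximates the cell's intensity, so the encoder need only transmit an intensity index per cell plus one shared pattern table:

```python
# Sketch (illustrative): halftone a row of grey cells (0.0 = white,
# 1.0 = black) into 2x2 bitonal patterns whose black-pixel count
# approximates each cell's intensity.
PATTERNS = {  # number of black pixels -> a 2x2 pattern
    0: [[0, 0], [0, 0]],
    1: [[0, 0], [0, 1]],
    2: [[1, 0], [0, 1]],
    3: [[1, 1], [0, 1]],
    4: [[1, 1], [1, 1]],
}

def halftone(grey_cells):
    return [PATTERNS[round(g * 4)] for g in grey_cells]

cells = halftone([0.0, 0.5, 1.0])
print([sum(map(sum, c)) for c in cells])  # → [0, 2, 4]
```

Only the region's intensities and the small pattern table are coded, which is why the savings are dramatic; applied to a text region, however, this same operation would obliterate the characters.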
Examples of both proper and improper use of JBIG2 halftoning can be seen in this full-page PDF file. Most of the text information within the picture regions of the image is protected, but the reverse video words at the bottom of the Chanel advertisement are degraded.
Halftoning over text regions, or pictures containing textual elements, will definitely degrade image quality. An example of the proper use of halftoning, applied to picture regions, is shown in the image below. The size reduction in this case is from 1,909,258 bytes as a TIFF G4 down to 241,576 bytes as a halftoned JBIG2.
As can be seen in this case, the text regions are sometimes hard to detect and can include reverse video segments in picture regions. When they are not detected, there is significant information loss, as seen in the figure below. The top row is the original image; the bottom is the compressed version. The halftone on the bottom left looks fine. In contrast, note the degradation of the reverse type on the bottom right:
Figure 14. Sections of the original image are on top with their halftoned counterparts beneath them. The halftoning on the right degraded image quality.
The JBIG2 specifications caution against using halftoning since this operation can seriously degrade the image. Similarly, the specs caution against using lossy JBIG2, which can introduce mismatches that degrade document quality, readability, and recognition rates.
The best way to understand such cautions with respect to JBIG2 is that the quality of the JBIG2 encoder is crucial. After all, image thresholding is potentially much more degrading than either font learning or halftoning since the part of the image that needs to be retained, e.g., signature, may disappear entirely. Yet most corporations capture their documents to black and white because they trust that the thresholding function is reliable, and that essential information in the image document will be preserved. They probably also test this assumption before putting their document imaging system into production.
JBIG2 compression can dramatically decrease file size, making documents much easier to transmit over the Web or via email. But a crucial step in maintaining document integrity during the conversion process to JBIG2 (or JBIG2-coded PDF) involves ensuring that the JBIG2 compressor actually enhances document image quality rather than degrades it. As has been shown, this cannot be assumed for all JBIG2 implementations, and results vary widely depending on the JBIG2 coder used.
It is important to understand that effective JBIG2 compression, with file sizes 5x-10x smaller than TIFF G4 (or standard G4-based PDF), cannot be achieved with lossless JBIG2. Moreover, once a lossy JBIG2 encoder is utilized, it becomes essential for the IT manager or project leader responsible for document integrity to ensure that the JBIG2 converter supports perceptually lossless conversion.
The best way to test that a JBIG2 converter is perceptually lossless, i.e., non-degrading, is to visually inspect the image quality of the images before and after compression. As this can be time-consuming, it is recommended that instead of manual inspection you run an OCR verification test: measure recognition rates both pre-compression and post-compression to verify that the JBIG2 compression process introduces no loss in OCR recognition accuracy.