Book Scanner

Based on The Easy Book Scanner by David Landin.

PVC tubing:
20″ x 5 = 100
3″ x 4 = 12
15″ x 2 = 30
8″ x 4 = 32
12″ x 2 = 24
total 198″ / 12″ = 16′ 6″

First book scan

I needed to scan “Mechanical Singing-bird Tabatiéres” by Geoffrey T. Mayson before the parts for the DIY scanner arrived, so I used a crude setup. I glued together a 45° platform, 10″ x 8″, out of scrap wood. I placed the page to be scanned on the platform and put a stack of books under the other side of the open book so that it sat in a ‘V’, as it does in the book scanner. I placed a piece of glass over the page to be scanned.

I mounted a Canon Rebel T7i on a tripod, pointed it at the page so that it was roughly perpendicular to it, and used a remote shutter release. For lighting, I pointed two gooseneck fluorescent grow lights at the page. I set the Canon on automatic.

Then I photographed the pages: all the odd ones first, then all the even ones. A review of the photos showed that about a dozen were out of focus, so I rephotographed those while the setup was still in place. Later on, I discovered I had skipped a page, so I photographed it manually, then had to do extra processing to brighten it to match the other photos.

I transferred the photos to the computer, then renamed the ‘redo’ photos to replace the originals and keep the page order intact. I noticed that the brightness varied: the camera had changed its settings for pages that had lots of photos.

I thought brightness correction would work better on pages without the border of stuff around the page, so I trimmed the pages (and reoriented them at the same time for convenience). I determined the crop settings for one page, made a batch file for all the pages, then adjusted the settings by hand every ~10-20 pages, because the thickness of the book changed the location of the page in the image. Even pages were rotated by -90 instead of the 90 shown here:

convert book1/IMG_0034.JPG -crop 3900x2925+1050+650 -rotate 90 book1b/IMG_0034.jpg
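The batch file described above can be generated rather than typed. The following is a minimal sketch, assuming the photos are numbered consecutively as IMG_NNNN.JPG; the function names `crop_cmd` and `crop_batch` and the index range are hypothetical, not part of the original workflow. It only prints the commands, so the crop offsets can be edited per range before anything is run.

```shell
# Print one crop-and-rotate command for a photo.
# Arguments: image file name, crop geometry, rotation in degrees.
crop_cmd() {
    printf 'convert book1/%s -crop %s -rotate %s book1b/%s\n' \
        "$1" "$2" "$3" "${1%.JPG}.jpg"
}

# Print commands for a run of photos that share one crop setting.
# Arguments: first index, last index, crop geometry, rotation.
crop_batch() {
    i=$1
    while [ "$i" -le "$2" ]; do
        crop_cmd "$(printf 'IMG_%04d.JPG' "$i")" "$3" "$4"
        i=$((i + 1))
    done
}

# Example: three photos with the crop setting shown above.
crop_batch 34 36 3900x2925+1050+650 90
```

Calling `crop_batch` once per range of ~10-20 pages, each with its own geometry, and redirecting the output to a file gives a batch file of the kind described.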

About half the pages have images, and the brightness correction that fixed the text messed up the images. So I decided to identify the image regions in each picture, do brightness correction on the whole page, then paste the images from the original photo back over the page.
Here’s the approach using ImageMagick commands:

convert book1b/IMG_0046.jpg -fuzz 40%% -transparent white -channel RGB -evaluate set 100%% +channel -background Black -layers Flatten -alpha off book1c/IMG_0046.png

This gives an image with a black background and white text/images.

convert book1c/IMG_0046.png -morphology erode:3 square:3 -morphology dilate:10 square:3 -morphology erode:5 square:3  book1c/IMG_0046b.png

This removes small content patches (the text), and consolidates the images into blocks of white.

convert book1c/IMG_0046b.png -transparent black book1c/IMG_0046c.png

This makes the background transparent.

convert book1c/IMG_0046c.png -define connected-components:verbose=true -define connected-components:area-threshold=100000 -connected-components 4 /dev/null
Objects (id: bounding-box centroid area mean-color):
1: 2900x3700+0+0 1372.5,1921.6 6085713 graya(0,0)
11: 1530x1643+1093+1712 1896.7,2573.1 2303240 graya(255,1)
5: 1489x1342+221+352 971.0,1062.8 1807075 graya(255,1)
3: 978x1015+1922+0 2604.5,312.4 412883 graya(255,1)
0: 60x3571+0+0 18.4,1443.8 121089 graya(255,1)

This command gives bounding boxes for the blocks in the image. Delete any whose geometry contains ‘+0’: these are either the background or shiny corners of the page. In my photos, the shiny-corner blocks always touched the top or left edge, so their geometry always contained a ‘+0’ offset.
After skipping objects touching an edge:
11: 1530x1643+1093+1712 1896.7,2573.1 2303240 graya(255,1)
5: 1489x1342+221+352 971.0,1062.8 1807075 graya(255,1)
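The ‘+0’ rule can be applied automatically. This is a minimal sketch, assuming the verbose connected-components listing is piped in on standard input in the format shown above; the function name `skip_edge_objects` is hypothetical. Like the manual rule, it only detects boxes that touch the top or left edge.

```shell
# Keep only objects whose bounding box has non-zero x and y offsets.
# Reads the verbose connected-components listing on stdin.
skip_edge_objects() {
    awk '
        # Data lines look like "11: 1530x1643+1093+1712 ...".
        # The "Objects (id: ...)" header does not match and is dropped.
        $1 ~ /^[0-9]+:$/ {
            n = split($2, g, "+")   # g[1]=WxH, g[2]=x offset, g[3]=y offset
            if (n == 3 && g[2] != 0 && g[3] != 0) print
        }
    '
}

# Usage (assumes objects.txt holds one listing):
#   skip_edge_objects < objects.txt
```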

Then I wrote a Perl script to convert each bounding-box spec into the rectangle spec used by the ‘-draw’ command. The second corner is the offset plus the box size (1093+1530=2623, 1712+1643=3355):
11: 1530x1643+1093+1712 1896.7,2573.1 2303240 srgba(255,255,255,1)
-draw "rectangle 1093,1712 2623,3355"
block_prep perl script
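The conversion itself is small enough to sketch in awk. This is not the block_prep script, only an illustration of the same arithmetic under the assumption that each input line has the WxH+X+Y geometry in its second field; the function name `objects_to_draw` is hypothetical.

```shell
# Convert object lines ("id: WxH+X+Y ...") on stdin into -draw options.
# The rectangle runs from (X,Y) to (X+W, Y+H).
objects_to_draw() {
    awk '{
        split($2, g, "[x+]")   # g[1]=width g[2]=height g[3]=x g[4]=y
        printf "-draw \"rectangle %d,%d %d,%d\"\n", g[3], g[4], g[3] + g[1], g[4] + g[2]
    }'
}

# Usage (assumes blocks.txt holds the object lines kept after filtering):
#   objects_to_draw < blocks.txt
```

For the two blocks kept above, this yields the rectangles 1093,1712 2623,3355 and 221,352 1710,1694.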

convert -size 2900x3700 xc:none -stroke black -fill black -strokewidth 1 -draw "rectangle 1093,1712 2623,3355"  -draw "rectangle 221,352 1710,1694" book1c/IMG_0046m.png

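This creates a mask: a transparent canvas the size of the page with a filled black rectangle over each image region. The mask is used later to copy those regions from the original photo over the same spots in the brightened page.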

This command from “Fred’s ImageMagick scripts” does the image brightening (and does a despeckle):

./textcleaner -f 20 -o 5 book1b/IMG_0046.jpg book1c/IMG_0046e.png

These steps can be combined into three ImageMagick commands:
1) textcleaner
2) Find blocks, write to ‘objects.txt’. Use the block_prep script to convert the blocks to ‘-draw’ commands, and write out a combined ImageMagick command.

echo "book1b/IMG_0046.jpg" >> objects.txt && convert book1b/IMG_0046.jpg -fuzz 40%% -transparent white -channel RGB -evaluate set 100%% +channel -background Black -layers Flatten -alpha off -morphology erode:3 square:3 -morphology dilate:10 square:3 -morphology erode:5 square:3 -transparent black -define connected-components:verbose=true -define connected-components:area-threshold=100000 -connected-components 4 /dev/null >> objects.txt

Then I added a step to test the regions. In some images, the ‘-fuzz 40%’ overbrightened the image, and all the images merged together with the page edge, so I had to try different ‘-fuzz’ settings and look at the resulting image. This was needed for 49 of 127 pages with images.

convert ../book1b/IMG_0046.jpg -fuzz 50%% -transparent white -channel RGB -evaluate set 100%% +channel -background Black -layers Flatten -alpha off IMG_0046.jpg

Then I checked the bounding boxes by drawing a red outline over the regions and looking at the page. Some image boxes needed to be readjusted. In a few cases, the images were oddly shaped, and two overlapping boxes or even a ‘-draw’ polygon was needed.

convert -size 2900x3700 xc:none -stroke red -fill none -strokewidth 5 -draw 'rectangle 1093,1712 2623,3355' -draw 'rectangle 221,352 1710,1694'  png:- | composite -geometry +0+0 png:- book1b/IMG_0046.jpg book1c/IMG_0046.png

3) Run the command to combine the brightened page with the image masking.

convert -size 2900x3700 xc:none -stroke black -fill black -strokewidth 5 -draw 'rectangle 1093,1712 2623,3355' -draw 'rectangle 221,352 1710,1694'  png:- | composite -compose CopyOpacity png:- book1b/IMG_0046.jpg png:- | composite -geometry +0+0 png:- book1c/IMG_0046.png book1d/IMG_0046.png

This gave the pages with images in an output dir, ‘book1d’. Then I ran a batch script to copy over the simple textcleaner images for the remaining text-only pages:

[[ ! -e book1d/IMG_0034.png ]] && cp book1c/IMG_0034.png book1d
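The same test can be run over a whole directory in a loop instead of one line per page. This is a minimal sketch, assuming the text-only results and the merged pages share file names across the two directories, as in the command above; the function name `fill_text_pages` is hypothetical.

```shell
# Copy every PNG from the source directory that is missing in the
# destination directory; files already present are left untouched.
# Arguments: source directory, destination directory.
fill_text_pages() {
    src=$1
    dst=$2
    for f in "$src"/*.png; do
        [ -e "$f" ] || continue          # glob matched nothing
        name=${f##*/}                    # strip the directory part
        [ -e "$dst/$name" ] || cp "$f" "$dst/$name"
    done
}

# Usage:
#   fill_text_pages book1c book1d
```

Because existing files are skipped, the pages with pasted-in images in the output directory are never overwritten by their text-only versions.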

This gave me the odd and even pages in two separate series, so I renamed the images with their page numbers (image 0161 is where the switch to even pages happens), and did a check. This is how I discovered the missing page.

perl -ne 'BEGIN{$pg=1;$sw="161";}chomp;printf "cp book1d/$_ %03d.jpg\n",$pg;$pg+=2;if(/$sw/){$pg=2;}' < rename.bat 
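The post does not say how the check was done. One possible way, sketched here under the assumption that the renamed files are called 001.jpg, 002.jpg and so on, is to list every page number up to the last expected page that has no file; the function name `missing_pages` is hypothetical.

```shell
# Report page numbers from 1 to the last expected page that have no file.
# stdin: one file name per line, such as 001.jpg; $1: last expected page.
missing_pages() {
    awk -v last="$1" '
        { sub(/\.[^.]*$/, ""); seen[$0 + 0] = 1 }   # drop extension, record page
        END {
            for (p = 1; p <= last; p++)
                if (!(p in seen)) printf "%03d\n", p
        }
    '
}

# Usage (the page count 10 is a placeholder):
#   ls *.jpg | missing_pages 10
```

Note that a page skipped during photography does not necessarily leave a gap in a purely sequential numbering, so this only helps once the numbers are tied to the printed page numbers.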

Scan Tailor was used to do the final processing. It deskewed the pages, trimmed the margins, and converted the text portions of each page to B&W.

Tesseract was used to OCR the book and to combine the pages into a PDF. This PDF has two layers: the image layer and an invisible text layer that can be used for searching and for copying text out.

ls *.tif >output.txt && tesseract -l eng -c textonly_pdf=1 output.txt text pdf

The final book file was renamed to “Mechanical Singing-bird Tabatiéres by Geoffrey T. Mayson” (PDF and text versions).