{"id":1919,"date":"2021-08-01T18:40:17","date_gmt":"2021-08-01T23:40:17","guid":{"rendered":"http:\/\/jimlund.org\/blog\/?page_id=1919"},"modified":"2021-10-27T15:59:39","modified_gmt":"2021-10-27T20:59:39","slug":"book-scanner","status":"publish","type":"page","link":"http:\/\/jimlund.org\/blog\/?page_id=1919","title":{"rendered":"Book Scanner"},"content":{"rendered":"\n<p>Based on <a href=\"https:\/\/www.youtube.com\/watch?v=ufiWeIKkxmc\">The Easy Book Scanner<\/a> by David Landin, <a href=\"https:\/\/forum.diybookscanner.org\/viewtopic.php?f=14&amp;t=2914\">post<\/a><br><br>PVC tubing:<br> 20&#8243; x 5 = 100<br> 3&#8243;  x 4 =  12<br> 15&#8243; x 2 =  30<br> 8&#8243;  x 4 =  32<br> 12&#8243; x 2 =  24<br>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br>total 198&#8243; \/ 12&#8243; = 16&#8242; 6&#8243;<br><strong><br>First book scan<\/strong><br>I needed to scan &#8220;Mechanical Singing-bird Tabati\u00e9res&#8221; by Geoffrey T. Mayson before the parts for the DIY scanner arrived, so I used a crude setup.  I glued a 45\u00b0 platform out of scrap wood, 10&#8243; x 8&#8243;.  I placed the page to be scanned on this side of the platform, and put a stack of books under the other side of the open book so it sat in a &#8216;V&#8217;, like it does in the book scanner.  I placed a piece of glass over the page to be scanned.<br><br>I used a Canon Rebel T7i on a tripod, positioned roughly perpendicular to the page, with a remote shutter release.  For lighting, I pointed two gooseneck fluorescent grow lights at the page.  I set the Canon on automatic.<br><br>Then I photographed the pages, all odd, then all even.  Review of the photos showed that about a dozen were out of focus, so I re-photographed them while the setup was still in place.  
Later on, I discovered I had skipped a page, so I photographed it manually, then had to do extra processing to brighten it to match the other photos.<br><br>The photos were transferred to the computer, then the &#8216;redo&#8217; pages were renamed to replace the originals to keep the page order intact.  I noticed that the brightness varied: the camera had changed the photo settings for pages that had lots of photos.<br><br>I thought brightness correction would work better on pages without the border of stuff around the page, so I trimmed the pages (and reoriented them at the same time for convenience).  I determined the crop settings for one page, made a batch file for all the pages, then adjusted the settings by hand every ~10-20 pages as the thickness of the book changed the location of the page in the image.  The odd pages were rotated by 90 (shown here); the even pages had a rotation of -90:<br><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert book1\/IMG_0034.JPG -crop 3900x2925+1050+650 -rotate 90 book1b\/IMG_0034.jpg<\/pre>\n\n\n\n<p>About half the pages have images, and the brightness correction that fixed the text messed up the images, so I decided to identify image regions in the pictures, do brightness correction on the page, then paste the images from the original photo back over the page.<br>Here&#8217;s the approach using ImageMagick commands:<br><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert book1b\/IMG_0046.jpg -fuzz 40%% -transparent white -channel RGB -evaluate set 100%% +channel -background Black -layers Flatten -alpha off book1c\/IMG_0046.png<\/pre>\n\n\n\n<p>This gives an image with a black background and white text\/images.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert book1c\/IMG_0046.png -morphology erode:3 square:3 -morphology dilate:10 square:3 -morphology erode:5 square:3  book1c\/IMG_0046b.png<\/pre>\n\n\n\n<p>This removes small content patches (the text), and consolidates the images into blocks of white.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert 
book1c\/IMG_0046b.png -transparent black book1c\/IMG_0046c.png<\/pre>\n\n\n\n<p>This makes the background transparent.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert book1c\/IMG_0046c.png -define connected-components:verbose=true -define connected-components:area-threshold=100000 -connected-components 4 \/dev\/null<br> Objects (id: bounding-box centroid area mean-color):<br>   1: 2900x3700+0+0 1372.5,1921.6 6085713 graya(0,0)<br>   11: 1530x1643+1093+1712 1896.7,2573.1 2303240 graya(255,1)<br>   5: 1489x1342+221+352 971.0,1062.8 1807075 graya(255,1)<br>   3: 978x1015+1922+0 2604.5,312.4 412883 graya(255,1)<br>   0: 60x3571+0+0 18.4,1443.8 121089 graya(255,1)<\/pre>\n\n\n\n<p>This command gives bounding boxes for the blocks in the image.  Delete any that contain &#8216;+0&#8217;; these are either the background or shiny corners of the page.  In my photos, the corner blocks always had a &#8216;+0&#8217; offset, so this rule did not miss any of them.<br>After skipping objects touching an edge:<br>   11: 1530&#215;1643+1093+1712 1896.7,2573.1 2303240 graya(255,1)<br>   5: 1489&#215;1342+221+352 971.0,1062.8 1807075 graya(255,1)<br><br>Then I wrote a perl script to convert the bounding box spec into rectangle specs used by the &#8216;-draw&#8217; command.<br>11: <strong>1530&#215;1643+1093+1712<\/strong> 1896.7,2573.1 2303240 srgba(255,255,255,1)<br>to<br>-draw &#8220;rectangle 1093,1712 2623,3355&#8221;<br><a href=\".\/pics\/perl\/block_prep\">block_prep perl script<\/a><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert -size 2900x3700 xc:none -stroke black -fill black -strokewidth 1 <strong>-draw \"rectangle 1093,1712 2623,3355\"<\/strong>  -draw \"rectangle 221,352 1710,1694\" book1c\/IMG_0046m.png<br><\/pre>\n\n\n\n<p>This draws the bounding boxes as filled black rectangles on a transparent canvas.  The result is a mask used to copy the image regions from the original photo over those spots in the brightened image.<br><br>This command from &#8220;Fred&#8217;s ImageMagick scripts&#8221; does the image brightening (and does a 
despeckle):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">.\/textcleaner -f 20 -o 5 book1b\/IMG_0046.jpg book1c\/IMG_0046e.png<br><\/pre>\n\n\n\n<p>These steps can be combined into three commands:<br><strong>1)<\/strong> textcleaner<br><strong>2)<\/strong> Find blocks, write to &#8216;objects.txt&#8217;.  Use the block_prep script to convert the blocks to &#8216;-draw&#8217; commands, and write out a combined ImageMagick command.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">echo \"book1b\/IMG_0046.jpg\" &gt;&gt; objects.txt &amp;&amp; convert book1b\/IMG_0046.jpg -fuzz 40%% -transparent white -channel RGB -evaluate set 100%% +channel -background Black -layers Flatten -alpha off -morphology erode:3 square:3 -morphology dilate:10 square:3 -morphology erode:5 square:3 -transparent black -define connected-components:verbose=true -define connected-components:area-threshold=100000 -connected-components 4 \/dev\/null &gt;&gt; objects.txt<\/pre>\n\n\n\n<p>Then I added a step to test the regions.  In some images, the &#8216;-fuzz 40%&#8217; overbrightened the image, and all the images merged together with the page edge, so I had to try different &#8216;-fuzz&#8217; settings and look at the resulting image.  This was needed for 49 of 127 pages with images.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert ..\/book1b\/IMG_0046.jpg <strong>-fuzz 50%%<\/strong> -transparent white -channel RGB -evaluate set 100%% +channel -background Black -layers Flatten -alpha off IMG_0046.jpg<\/pre>\n\n\n\n<p>Then I checked the bounding boxes by drawing a red outline over the regions, and looking at the page.  Some image boxes needed to be readjusted.  
In a few cases, the images were oddly shaped and two overlapping boxes or even a &#8216;-draw&#8217; polygon was needed.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert -size 2900x3700 xc:none -stroke <strong>red<\/strong> -fill none -strokewidth 5 -draw 'rectangle 1093,1712 2623,3355' -draw 'rectangle 221,352 1710,1694'  png:- | composite -geometry +0+0 png:- book1b\/IMG_0046.jpg book1c\/IMG_0046.png<\/pre>\n\n\n\n<p><strong>3)<\/strong> Run the command to combine the brightened page with the image masking.<br><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert -size 2900x3700 xc:none -stroke black -fill black -strokewidth 5 -draw 'rectangle 1093,1712 2623,3355' -draw 'rectangle 221,352 1710,1694'  png:- | composite -compose CopyOpacity png:- book1b\/IMG_0046.jpg png:- | composite -geometry +0+0 png:- book1c\/IMG_0046.png book1d\/IMG_0046.png<\/pre>\n\n\n\n<p>This gave the pages with images in an output dir, &#8216;book1d&#8217;.  Then I ran a batch script to copy over the simple textcleaner images for the remaining text-only pages:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">[[ ! -e book1d\/IMG_0034.png ]] &amp;&amp; cp book1c\/IMG_0034.png book1d<\/pre>\n\n\n\n<p>This gave me the odd and even pages in two separate series, so I renamed the images with their page numbers (image 0161 is where the switch to even pages happens), and did a check.  This is how I discovered the missing page.<br><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">perl -ne 'BEGIN{$pg=1;$sw=\"161\";}chomp;printf \"cp book1d\/$_ %03d.jpg\\n\",$pg;$pg+=2;if(\/$sw\/){$pg=2;}' &lt; rename.bat <\/pre>\n\n\n\n<p><a href=\"https:\/\/scantailor.org\/\">Scan Tailor<\/a> was used to do the final processing.  It deskewed the pages, trimmed the margins, and converted the text portions of each page to B&amp;W.<br><br>tesseract was used to OCR the book.  It also combined the pages into a PDF.  
This PDF has two layers: the image layer, and an invisible text layer that can be used for searching and copying text out.  Final processing thread, <a href=\"https:\/\/diybookscanner.org\/forum\/viewtopic.php?t=3543\">link<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">ls *.tif &gt;output.txt &amp;&amp; tesseract -l eng -c textonly_pdf=1 output.txt text pdf<\/pre>\n\n\n\n<p>The final book file was renamed to &#8220;Mechanical Singing-bird Tabati\u00e9res by Geoffrey T. Mayson&#8221;, <a href=\".\/pics\/scans\/Mechanical Singing-bird Tabati\u00e9res by Geoffrey T. Mayson.pdf\">pdf,<\/a> <a href=\".\/pics\/scans\/Mechanical Singing-bird Tabati\u00e9res by Geoffrey T. Mayson.txt\">text<\/a>.<br><br><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Second Book: A Beginners Guide to Kiln-Formed Glass by Brenda Griffith<\/strong><\/h4>\n\n\n\n<p>Roughly crop and rotate images, crop_rotate.bat.  Check and adjust every few pages.  Commands like:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">convert IMG_0078.JPG -crop 3950x3150+1350+400 -rotate -90 tmp1\/IMG_0078.JPG<\/pre>\n\n\n\n<p>Use Scan Tailor to process pages.  Use the Deskew and Select Content functions.  Set margins to zero, match sizes, and set Output to Color\/Grayscale w\/ White margins, Equalize illumination.  
Dewarping caused crashes, so turn it off.<br><br>Rename pages:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">perl -e 'BEGIN{@a=`ls tmp1\/out\/`;$p=2}foreach $f (@a){next if $f=~\/cache\/;chomp $f;($fn)=$f=~\/_0+(\\d+)\/;$pp=sprintf(\"%03d\",$p);`cp tmp1\/out\/$f tmp1\/out2\/$pp.tiff`;$p+=2;if ($p==130){$p=1} }'<br><\/pre>\n\n\n\n<p>Shrink images:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">opj_compress -r 200,100,75 -i .\/tmp1\/out2\/001.tiff -o tmp1\/out3\/001.jp<\/pre>\n\n\n\n<p>Combine images to an image-only pdf:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">img2pdf tmp1\/out3\/* &gt; images.pdf<\/pre>\n\n\n\n<p>Combine text and image layer pdfs:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">pdftk images.pdf multibackground text.pdf output A_Beginners_Guide_to_Kiln-Formed_Glass__by_Brenda_Griffith.pdf<\/pre>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Based on The Easy Book Scanner by David Landin, post PVC tubing: 20&#8243; x 5 = 100 3&#8243; x 4 = 12 15&#8243; x 2 = 30 8&#8243; x 4 = 32 12&#8243; x 2 = 24&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-total 198&#8243; \/ 12&#8243; = 16&#8242; 6&#8243;First book scanI needed to scan &#8220;Mechanical Singing-bird Tabati\u00e9res&#8221; by Geoffrey T. 
Mayson [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1919","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/jimlund.org\/blog\/index.php?rest_route=\/wp\/v2\/pages\/1919","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/jimlund.org\/blog\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/jimlund.org\/blog\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/jimlund.org\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/jimlund.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1919"}],"version-history":[{"count":7,"href":"http:\/\/jimlund.org\/blog\/index.php?rest_route=\/wp\/v2\/pages\/1919\/revisions"}],"predecessor-version":[{"id":1976,"href":"http:\/\/jimlund.org\/blog\/index.php?rest_route=\/wp\/v2\/pages\/1919\/revisions\/1976"}],"wp:attachment":[{"href":"http:\/\/jimlund.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1919"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}