/media/bill/HOWELL_BASE/System_maintenance/OCR/0_OCR notes.txt

use tesseract (Hewlett Packard originally?)
DON'T use gocr !!!!
nothing works for handwriting
One really has to de-bind all pages, photo each page on a very flat surface, right rotation (zero)


08********08
10Feb2021

https://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
11 years ago I blogged about… Linux OCR Software Comparison
abbyyocr cuneiform gocr ocrad tesseract
abbyyocr easily best in his great tests, but tesseract is fe

https://www.dedoimedo.com/computers/linux-ocr.html
Optical Character Recognition (OCR) software for Linux
Updated: March 12, 2011
tutorial for tesseract
>> I had already installed tesseract!!?!? Sheesh - whole section below


***************
~/bin $ echo $PATH
/home/bill/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/bill/bin:/home/bill/Qnial

$ cd "/media/bill/Lexar"
$ convert "Neil - bio grade 7-9 Balmoral.bmp" "Neil - bio grade 7-9 Balmoral.pnm"
$ tesseract "Neil - bio grade 7-9 Balmoral.pnm" "Neil - bio grade 7-9 Balmoral" -l eng


Imagemagick - Image manipulation programs -- binaries
ImageMagick is a software suite to create, edit, and compose bitmap images. It can read, convert and write images in a variety of formats (over 100) including DPX, EXR, GIF, JPEG, JPEG-2000, PDF, PhotoCD, PNG, Postscript, SVG, and TIFF. Use ImageMagick to translate, flip, mirror, rotate, scale, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses and Bézier curves.
All manipulations can be achieved through shell commands as well as through an X11 graphical interface (display).
This package includes links to quantum depth specific binaries and manual pages.

##
$ tesseract --tessdata-dir "/usr/share/tesseract-ocr/tessdata/" "./Neil - bio grade 7-9 Balmoral.bmp" "./outbase" -l eng -psm 0
$ tesseract temp.bmp temp_out6 -l eng -psm 6
$ tesseract temp.bmp temp_out3 -l eng -psm 0 -> doesn't work!!!???
$ tesseract temp.bmp temp_out2 -l eng
$ tesseract temp.bmp temp_out

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
--tessdata-dir ./ ./testing/bilingual.jpg ./testing/bilingual-enghin -l eng+hin

https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution
https://www.linux.com/learn/how-scan-and-ocr-pro-open-source-tools

$ bash "/home/bill/OOPM builds/0_environment.sh"
cd {/usr}
find -nowarn -iname "tesserac?*" | grep >"/home/bill/OOPM builds old/finds/find tesseract in usr 170424 16h23m.txt" 2>&1 --invert-match "Permission denied"
>> Results (manually copy-paste pertinent parts) :
./share/man/man1/tesseract.1.gz
./share/doc/tesseract-ocr-osd
./share/doc/tesseract-ocr-equ
./share/doc/tesseract-ocr-eng
./share/doc/tesseract-ocr
./share/tesseract-ocr
./bin/tesseract

$ cd "/media/bill/Lexar"
$ gocr "Neil - bio grade 7-9 Balmoral.bmp" "out_gocr_frm_bmp" -l eng


Gocr - Command line ocr
gocr is a multi-platform OCR (Optical Character Recognition) program.
It can read pnm, pbm, pgm, ppm, some pcx and tga image files.
Currently the program should be able to handle well scans that have their text in one column and do not have tables. Font sizes of 20 to 60 pixels are supported.
If you want to write your own OCR, libgocr is provided in a separate package. Documentation and graphical wrapper are provided in separated packages, too.

###
NAME
    gocr - command line text recognition tool

SYNOPSIS
    gocr [OPTION] [-i] pnm-file

DESCRIPTION
    gocr is an optical character recognition program that can be used from the command line. It takes input in PNM, PGM, PBM, PPM, or PCX format, and writes recognized text to stdout. If the pnm file is a single dash, PNM data is read from stdin. If gzip, bzip2 and netpbm-progs are installed and your system supports popen(3) also pnm.gz, pnm.bz2, png, jpg, jpeg, tiff, gif, bmp, ps (only single pages) and eps are supported as input files (not as input stream), where pnm can be replaced by one of ppm, pgm and pbm.

OPTIONS
    -h
        show usage information
    -i file
        read input from file (or stdin if file is a single dash)
    -l level
        set grey level to level (0<160<=255, default: 0 for autodetect), darker pixels belong to characters, brighter pixels are interpreted as background of the input image
    -d size
        set dust size in pixels (clusters smaller than this are removed), 0 means no clusters are removed, the default is -1 for auto detection
    -o file
        send output to file instead of stdout
    -e file
        send errors to file instead of stderr or to stdout if file is a dash
    -x file
        progress output to file (file can be a file name, a fifo name or a file descriptor 1...255), this is useful for GUI developers to show the OCR progress, the file descriptor argument is only available, if compiled with __USE_POSIX defined

EXAMPLES
    gocr -v 33 text1.pbm
        output verbose information, out30.png is created to see details of recognition process
    gocr -v 7 -c _YV text1.pbm
        verbose output for unknown chars and chars Y and V
    djpeg -pnm -gray text.jpg | gocr -
        convert a jpeg file to pnm format and input via pipe


24Apr2017 14:33
Tesseract-ocr - Command line ocr tool (LMDE2 install)
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google and is probably one of the most accurate open source OCR engines available. It can read a wide variety of image formats and convert them to text in over 40 languages.

#
NAME
    tesseract - command-line OCR engine

SYNOPSIS
    tesseract imagename outbase|stdout [-l lang] [-psm N] [-c configvar=value] [configfile ...]

DESCRIPTION
    tesseract(1) is a commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed at Google since then.

OPTIONS
    imagename
        The name of the input image.
        Most image file formats (anything readable by Leptonica) are supported.
    outbase
        The basename of the output file (to which the appropriate extension will be appended). By default the output will be named outbase.txt. When stdout is used as outbase, output will be sent to stdout.
    -l lang
        The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
    -psm N
        Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
             0 = Orientation and script detection (OSD) only.
             1 = Automatic page segmentation with OSD.
             2 = Automatic page segmentation, but no OSD, or OCR.
             3 = Fully automatic page segmentation, but no OSD. (Default)
             4 = Assume a single column of text of variable sizes.
             5 = Assume a single uniform block of vertically aligned text.
             6 = Assume a single uniform block of text.
             7 = Treat the image as a single text line.
             8 = Treat the image as a single word.
             9 = Treat the image as a single word in a circle.
            10 = Treat the image as a single character.
    -c configvar=value
        Sets a configuration variable. Multiple options can be set by using -c multiple times, once for each option.
    -v
        Returns the current version of the tesseract(1) executable.
    configfile
        The name of a config to use. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value.
        Interesting config files include:
        · hocr - Output in hOCR format instead of as a text file.

##
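
Rough sketch (not something I ran at the time) of wrapping the tesseract call from the notes above so the outbase is derived from the image name and the -psm mode can be varied per page. The helper names outbase_for and ocr_page are made up for illustration. The -psm spelling follows the man page pasted above; newer tesseract releases may want --psm instead.

```shell
#!/bin/sh
# Sketch only: outbase_for and ocr_page are hypothetical helper names.

# Strip the last extension: "page 1.bmp" -> "page 1".
# A name without any dot is returned unchanged.
# tesseract appends ".txt" to this outbase on its own.
outbase_for() {
    printf '%s\n' "${1%.*}"
}

# OCR one image. $1 = image file, $2 = page segmentation mode (default 3).
# Assumes the English language data (-l eng) is installed.
ocr_page() {
    tesseract "$1" "$(outbase_for "$1")" -l eng -psm "${2:-3}"
}
```

After sourcing the functions, ocr_page "Neil - bio grade 7-9 Balmoral.bmp" 6 should leave its text in "Neil - bio grade 7-9 Balmoral.txt", going by the outbase rule in the man page.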