Discussion:
command line scanned pdf to text
Tom Fowle
2015-11-02 05:24:47 UTC
Am I the last to find this?
The command-line OCR program tesseract
won't directly support .pdf, but
pdftocairo
produces .jpg, among other formats, which tesseract will read.

It may not do well with columns, but the results are not too bad.

Is there anything better?

Thanks
tom Fowle
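
The two steps described above can be sketched as a small shell function. This is only a sketch, not the exact commands Tom ran: the function name pdf_ocr, the 300 dpi resolution, and the "page" output prefix are assumptions chosen for illustration.

```shell
# pdf_ocr (hypothetical helper): rasterize each page of a scanned PDF
# with pdftocairo, then run tesseract on every page image.
# Usage: pdf_ocr input.pdf output-directory
pdf_ocr() {
    local pdf="$1" outdir="$2" img
    mkdir -p "$outdir" || return 1
    # -jpeg selects JPEG output; -r sets the resolution in dpi
    # (300 is an assumed value, not one taken from the thread).
    # pdftocairo appends a page number to the prefix: page-1.jpg, page-2.jpg, ...
    pdftocairo -jpeg -r 300 "$pdf" "$outdir/page" || return 1
    for img in "$outdir"/page-*.jpg; do
        [ -f "$img" ] || continue
        # tesseract adds .txt to the output base, giving page-1.txt and so on.
        tesseract "$img" "${img%.jpg}" || return 1
    done
}
```

Calling pdf_ocr scan.pdf ocr-out would leave one .txt file per page in ocr-out, next to the page images.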
John G Heim
2015-11-02 20:13:04 UTC
I've been scanning in the D&D 5th Edition player's handbook. I tried
every open source OCR program I could find and tesseract was easily the
best. On pages that are just prose, it is probably about 99% accurate.
Even on pages that are 2 columns of prose, it does really well if
you tell it to look for that. Somebody sent me a pdf of the same book
done with a professional OCR program for Windows. The results are
approximately equal. Tesseract may lack the bells & whistles of
commercial products but for accuracy, it's pretty good.
_______________________________________________
Speakup mailing list
http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup
--
John Heim, ***@math.wisc.edu, 608-263-4189, skype:john.g.heim,
sip:***@sip.linphone.org
Cheryl Homiak
2015-11-02 20:53:45 UTC
Would you mind enlarging on this if you can and have time? What kind of file did you use and what did you put in your command line? I am asking this because I have tried to use tesseract a couple of times with tiff files and have gotten mostly gibberish, so obviously I am doing something wrong. I am running Debian testing if that makes a difference.

Thanks.
--
Cheryl

May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)
John G Heim
2015-11-02 22:06:09 UTC
Huh, it strikes me as strange that tesseract didn't work for you. I used
tesseract last week to read a page in a pdf document that was stored as
an image. I used pdftohtml to extract the image and then tesseract to
convert it to text. I also pretty routinely use tesseract to read screen
capture images. It's not very accurate there but it's usually good
enough to make sense of.

Just "tesseract <infile> <outfile>" should work. The infile can be the
string "stdin" in which case it reads from standard input. The outfile
can be "stdout" in which case it writes the text to stdout. Right off
hand, I do not have the command line I use to scan the D&D book. It's on
a computer at home that is turned off at the moment. But I can post the
whole thing tonight. Here are some lines from a backup version of the
script:

scanimage --format=tiff --mode Lineart --resolution 600 > /tmp/page.tiff
tesseract /tmp/page.tiff stdout
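
Because the output name can be "stdout" and the input name can be "stdin", tesseract fits into a pipeline. A minimal sketch of both forms follows; the helper names ocr_page and ocr_stream are made up for illustration, and discarding stderr is an assumption about where tesseract prints its status messages.

```shell
# ocr_page (hypothetical helper): OCR one image file and print the text.
ocr_page() {
    # Status messages are assumed to arrive on stderr and are discarded.
    tesseract "$1" stdout 2>/dev/null
}

# ocr_stream (hypothetical helper): OCR image data arriving on standard input.
ocr_stream() {
    tesseract stdin stdout 2>/dev/null
}

# Examples (not run here):
#   ocr_page /tmp/page.tiff | wc -w
#   ocr_stream < /tmp/page.tiff | less
```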
Cheryl Homiak
2015-11-02 22:39:38 UTC
Thanks much. No, the way to get into a turned-off computer far away hasn't been invented yet, unless you can turn it on by remote control somehow - :-)
I suspect the error was mine so I won't give up on it yet.

Thanks.
--
Cheryl

John G Heim
2015-11-03 14:30:31 UTC
Here is the complete script. Sorry I forgot to post it last night. I
turned the machine on as I left this morning and sshed into it from
work. Theresome junk in here you may or may not be interested in. You
can pass the script 2 parameters. #1 is the page number.It uses this
number to make the output text file name. Page 99 would be named
p099.txt. If you don't pass it a page number, it looks for files
matching the same pattern and takes the next highest number. So if there
already is a p099.txt, it would create a p100.txt. The second parameter
is the tesseract psm flag. The tesseract man page explains these. The
default is 3.

After it's done with the scan and ocr, it concatenates all the pages
into one big file. It also beeps when it finishes a page: once after an
odd-numbered page and twice after an even-numbered one. The double beep
reminds me to turn the page. Otherwise I sometimes forget if I've already
done both sides.


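#!/bin/bash

# Arguments: $1 = page number (optional), $2 = tesseract psm value (optional).
IDX=$1
if [ -n "$IDX" ]; then
    TEXT=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
else
    # No page number given: use the first number whose file does not exist yet.
    for IDX in {1..999}; do
        TEXT=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
        test ! -f "${TEXT}" && break
    done
fi
test -n "$VERBOSE" && echo "Working on page $IDX, ${TEXT} ... "

# Page segmentation mode; 3 is the default.
PSM="$2"
test -z "$PSM" && PSM=3

# Scan one page to a temporary TIFF.
RESOLUTION=600
SCAN=/tmp/page.tif
scanimage --format=tiff --mode Lineart --resolution "$RESOLUTION" > "$SCAN"

# OCR the scan; tesseract writes ${PAGE}.txt. Its stdout is discarded, but
# because 2>&1 comes first, its stderr is sent to the script's stdout.
# (Tesseract 4 and later spell the option --psm.)
PAGE=/tmp/page
tesseract -psm "$PSM" "$SCAN" "$PAGE" 2>&1 >/dev/null
# Variant: pipe the text through a local cleanup filter named cleantext:
#   cat "${PAGE}.txt" | cleantext >> "$TEXT"
cat "${PAGE}.txt" >> "$TEXT"
# One beep after an odd-numbered page, two after an even-numbered one.
/usr/bin/beep -r $((2 - IDX % 2))
test -n "$VERBOSE" && file "${TEXT}"

# Rebuild the combined file from all pages scanned so far.
OUTFILE="/home/john/phb5/PHB5.txt"
echo "" > "$OUTFILE"
for IDX in {1..999}; do
    TEXTFILE=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
    if [ -f "$TEXTFILE" ]; then
        echo "Page $IDX" >> "$OUTFILE"
        cat "$TEXTFILE" >> "$OUTFILE"
        echo -e "\f" >> "$OUTFILE"
    fi
done
# EOF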
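
The file-naming rule in the script can be tried on its own, without a scanner attached. The sketch below isolates it in a function. The name next_page_file and the directory argument (in place of the hard-coded /home/john/phb5) are assumptions made for illustration, not part of John's script.

```shell
# next_page_file (hypothetical helper): print the first pNNN.txt name in
# the given directory that does not exist yet, checking p001 to p999.
next_page_file() {
    local dir="$1" idx=1 name
    while [ "$idx" -le 999 ]; do
        # %03d pads the page number with zeros: 7 becomes 007.
        name=$(printf '%s/p%03d.txt' "$dir" "$idx")
        if [ ! -f "$name" ]; then
            echo "$name"
            return 0
        fi
        idx=$((idx + 1))
    done
    return 1   # every name from p001.txt to p999.txt is taken
}
```

Like the loop in the script, this picks the first unused number, so a gap left by a deleted page file is filled before any higher number is used.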

Cheryl Homiak
2015-11-03 16:19:24 UTC
Thanks. I did try another file and it worked in both cuneiform and tesseract, so I think the two files I tried were an anomaly or it was a rotation issue. I haven't compared to see which package did the best job, but it doesn't hurt to have both of them.
--
Cheryl
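
One way to check for a rotation problem like the one suspected above is to OCR the same image at each quarter turn and compare the results. This is a rough sketch, assuming ImageMagick's convert is installed; the helper name ocr_rotations and the output naming are made up for illustration.

```shell
# ocr_rotations (hypothetical helper): OCR an image at 0, 90, 180 and 270
# degrees so the four text files can be compared.
# Usage: ocr_rotations input-image output-base
ocr_rotations() {
    local img="$1" outbase="$2" deg
    for deg in 0 90 180 270; do
        # ImageMagick's convert writes a rotated copy of the image.
        convert "$img" -rotate "$deg" "$outbase-$deg.png" || return 1
        # tesseract adds .txt, giving outbase-0.txt, outbase-90.txt, ...
        tesseract "$outbase-$deg.png" "$outbase-$deg" || return 1
    done
}
```

If only one of the four text files contains readable words, the original scan was most likely rotated by that amount.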

John G Heim
2015-11-03 16:23:50 UTC
I'd be interested in hearing your opinion of cuneiform. I have gotten
much better results with tesseract but it could be that I didn't fiddle
with cuneiform enough. My experiments showed that tesseract gave better
results "out of the box" as it were. So then I spent a lot of time
tweaking my tesseract parameters. But maybe if I'd spent that amount of
time on cuneiform, I'd have gotten even better results.
Jude DaShiell
2015-11-04 15:01:11 UTC
What data pack for tesseract has the English language in it? I'm being
prompted to download a data pack and I figure best get what language I
understand rather than the whole data set since both memory and disk
space over here are not unlimited.
John G Heim
2015-11-04 15:11:22 UTC
On Ubuntu it's tesseract-ocr-eng.
Cheryl Homiak
2015-11-04 15:49:03 UTC
On Debian, it is tesseract-ocr-eng and it may or may not be installed with the main package; I don't remember having to do it separately but I have it.
--
Cheryl

Jude DaShiell
2015-11-04 16:32:16 UTC
Thanks, it's tesseract-data-eng on Arch Linux and it's a separate package
install. Got it too.
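
The English data package names reported for Debian and Arch Linux can be kept in one small lookup. This is only a sketch: the function name and the distribution keys are made up, and the package names are the ones posted in this thread, which may differ on other releases.

```shell
# tess_eng_package (hypothetical helper): print the English data package
# name reported in this thread for a given distribution.
tess_eng_package() {
    case "$1" in
        debian) echo "tesseract-ocr-eng" ;;    # reported by Cheryl
        arch)   echo "tesseract-data-eng" ;;   # reported by Jude
        *)      echo "unknown distribution: $1" >&2; return 1 ;;
    esac
}

# After installing, tesseract can list the languages it actually finds:
#   tesseract --list-langs
```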
Tom Fowle
2015-11-05 02:36:17 UTC
Just installed tesseract as a Debian package and the en pack came with it
automatically.
Tom Fowle
Jude DaShiell
2015-11-04 16:21:48 UTC
Thanks, I'll search Arch Linux and see if that shows up.
Tom Fowle
2015-11-03 04:15:32 UTC
Cheryl,
I arbitrarily chose to convert the pdf to jpeg as tesseract doesn't do
pdf.

Then I just ran
tesseract filename.jpg outfile
which produces
outfile.txt

Sorry, I haven't tried .tif and I couldn't find a list of supported file types.

tom fowle
Cheryl Homiak
2015-11-03 04:28:57 UTC
I am sure tiff is supported. It is really strange. I get what look like words, and what I get is the same every time I do a scan of the same image, but they are nonsense. I even tried adding the designation for English, thinking somehow it wasn't using English, but got the same results. I know the image file is okay because it comes out fine using ABBYY FineReader Express on my Mac.
--
Cheryl

May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)
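A few checks that might narrow down the gibberish, as a rough sketch rather than anything posted in the thread. `check_tesseract` and `scan.tif` are made-up names; `-v`, `--list-langs` and `-l eng` are standard tesseract options, though what they print varies by version.

```shell
# Hypothetical helper: basic sanity checks before blaming the image.
check_tesseract() {
    image=$1                              # e.g. scan.tif (placeholder name)
    # Version banner; it also lists the image libraries (libtiff, libjpeg, ...)
    # that the underlying Leptonica library was built with.
    tesseract -v
    # Installed language data; "eng" should appear in this list.
    tesseract --list-langs
    # OCR the image with English selected explicitly; writes out.txt.
    tesseract "$image" out -l eng
}

# Usage: check_tesseract scan.tif
```

If libtiff is missing from the version banner, or "eng" is missing from the language list, that would point at the installation rather than the scan.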
Post by Tom Fowle
Cheryl,
I arbitrarily chose to convert the PDF to JPEG, as tesseract doesn't read
PDF.
Then I just ran
tesseract filename.jpg outfile
which produces
outfile.txt
Sorry, I haven't tried .tif, and I couldn't find a list of supported file types.
tom fowle
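The conversion and OCR steps above could be wrapped in a small shell function. This is only a sketch: `pdf_to_text`, `book.pdf` and the `page` prefix are hypothetical names, and 300 dpi is an assumed resolution, not one mentioned in the thread.

```shell
# Hypothetical helper: rasterize every page of a scanned PDF, then OCR each page.
pdf_to_text() {
    pdf=$1       # scanned input, e.g. book.pdf
    prefix=$2    # base name for the page images and text files
    # -jpeg selects JPEG output; -r 300 renders at 300 dpi (assumed value).
    # pdftocairo names the images prefix-1.jpg, prefix-2.jpg, ...
    pdftocairo -jpeg -r 300 "$pdf" "$prefix"
    for page in "$prefix"-*.jpg; do
        [ -e "$page" ] || continue        # skip if no page images were produced
        # tesseract appends .txt to the output base: prefix-1.jpg -> prefix-1.txt
        tesseract "$page" "${page%.jpg}"
    done
}

# Usage: pdf_to_text book.pdf page
```

Each page ends up in its own text file, which can be joined afterwards with `cat` if one file is preferred.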
Willem van der Walt
2015-11-03 05:14:13 UTC
Permalink
Cuneiform is, IMHO, a better OCR engine than tesseract.
It is available as a package under Ubuntu.
Regards, Willem
Cheryl Homiak
2015-11-03 07:25:31 UTC
Permalink
I've tried both cuneiform and tesseract, with the same results. I wonder if it's a rotation problem.
--
Cheryl

May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)
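One way to test the rotation theory is to OCR the image at each of the four right-angle orientations and see which output reads as English. A sketch, assuming ImageMagick's `convert` is installed; `try_rotations` and the `rot-` file names are invented for illustration.

```shell
# Hypothetical helper: OCR an image at 0, 90, 180 and 270 degrees.
try_rotations() {
    image=$1                               # e.g. scan.tif (placeholder name)
    for angle in 0 90 180 270; do
        rotated="rot-$angle.png"
        # ImageMagick: write a copy rotated clockwise by $angle degrees.
        convert "$image" -rotate "$angle" "$rotated"
        # Writes rot-$angle.txt; compare the four text files afterwards.
        tesseract "$rotated" "rot-$angle" -l eng
    done
}

# Usage: try_rotations scan.tif
```

If one of the four text files is readable and the others are not, the scan was simply stored sideways or upside down.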