Thanks Google Hindi OCR Guidelines


Upload: ankur-saxena

Post on 13-Apr-2018


Page 1: Thanks Google Hindi Ocr Guidelines

7/27/2019 Thanks Google Hindi Ocr Guidelines

http://slidepdf.com/reader/full/thanks-google-hindi-ocr-guidelines 1/15

Installation guidelines for Hindi/Indic languages OCR (Windows)

Thanks to Google, UBUNTU and all open source resources for Internet-based Gayatri and Yagya, which helped us to develop Hindi/Indic languages OCR (Optical Character Recognition) for our mega Unicode conversion project of Vedic Literature.

This will help us to propagate and implement OUR WILL/Our Solemn Pledge for everyone to have a life like our P. Gurusatta.

Vandaniya Mataji: "Son! Never separate me and Guruji." Then she said, "Son, in the times to come the world will look for the solutions to its problems in my songs and in Guruji's discourses." Indeed, how could Shiva and Shakti ever be separated? (... Yugm ki Jhalak-Jhanki, p. 73)

Overview:

1. Scan the document (300 DPI for better output) as an image or PDF file.
2. For post-processing of scanned pages, save/export the PDF as images into one folder.
3. Use Scan Tailor for post-processing of the scanned pages.
4. Make a PDF file from the images by creating a PDF: Files > Create PDF from multiple files > Add files. Check and correct the page order of the document.
5. Use gImageReader / VietOCR for OCR. Save the file in UTF-8 format.
6. Check spelling using a spell checker.
7. Convert the font using a font converter. Print for manual proofreading.
8. Manually check the document for logical errors.

NOTE: The Tesseract hin.traineddata file was found to work well for Chanakya-like fonts.
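Step 2 of the overview (exporting a PDF as images) can also be done from the command line with Ghostscript, which is part of the required installation. The following is only a minimal sketch, assuming Python is available and that the Ghostscript console executable is named gswin32c (the usual name in 32-bit Windows builds; on Linux it is usually gs). The names scan.pdf and pages are placeholders and are not part of the original guidelines.

```python
from pathlib import Path


def ghostscript_export_command(pdf_path, out_dir, dpi=300, gs_exe="gswin32c"):
    """Build a Ghostscript command that renders each PDF page to a PNG image.

    gs_exe is an assumption: adjust it to the executable name on your system.
    The output folder must already exist before the command is run.
    """
    out_pattern = Path(out_dir) / "page-%03d.png"  # page-001.png, page-002.png, ...
    return [
        gs_exe,
        "-dNOPAUSE",        # do not pause after each page
        "-dBATCH",          # exit when the input file is finished
        "-sDEVICE=png16m",  # 24-bit colour PNG output
        f"-r{dpi}",         # resolution in DPI (300 as recommended in step 1)
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


if __name__ == "__main__":
    cmd = ghostscript_export_command("scan.pdf", "pages")
    print(" ".join(cmd))
    # To actually run it (requires Ghostscript on the PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

The function only builds the command, so it can be inspected before anything is executed. The resulting images can then be opened in Scan Tailor as described in step 3.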

Required Installation instructions: (We should be connected to the Internet throughout the installation process.)

1. gs905w32.exe - GPL Ghostscript http://sourceforge.net/projects/ghostscript/
2. jre-6u34-windows-i586.exe - Java Runtime 6.0+ http://www.oracle.com/technetwork/java/javase/downloads/jre6-downloads-1637595.html
3. vcredist_x86.exe - MS VC++ Redistributable Setup. http://www.microsoft.com/en-in/download/details.aspx?id=5555


4. Scan Tailor - An interactive post-processing tool for scanned pages. http://sourceforge.net/projects/scantailor/
5. tesseract-ocr-setup-3.02.02.exe http://tesseract-ocr.googlecode.com/files/tesseract-ocr-setup-3.02.02.exe
   a. Make sure the Internet connection is ON.
   b. Choose Components: Download & Install Hindi Language Data; Download & Install Math / Equation Detect.
   c. Installation completes successfully.
6. Restart immediately; this is important.
7. Copy-paste the hin.traineddata file into the C:\Program Files\Tesseract-OCR\tessdata folder, if it was not downloaded there, from http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.hin.tar.gz
8. Try more Indic language traineddata files from http://code.google.com/p/parichit/downloads/list and paste them into the above folder. Thanks to Indu and RKVS Raman http://code.google.com/p/parichit/ for their Parichit (परिचित) project. Accuracy is low.
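The copy in step 7 can be scripted so that an existing language file is never overwritten. This is a minimal sketch, assuming Python is available and that hin.traineddata has already been extracted from the downloaded archive. The function name install_traineddata is hypothetical, and writing under C:\Program Files may need administrator rights.

```python
import shutil
from pathlib import Path


def install_traineddata(src_file, tessdata_dir):
    """Copy a .traineddata file into tessdata_dir unless it is already there.

    Returns True if the file was copied, False if it already existed.
    """
    src = Path(src_file)
    dest_dir = Path(tessdata_dir)
    dest = dest_dir / src.name
    if dest.exists():
        return False  # keep the file the installer already downloaded
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return True


if __name__ == "__main__":
    # Example paths only; adjust them to where the file was extracted.
    copied = install_traineddata(
        "hin.traineddata", r"C:\Program Files\Tesseract-OCR\tessdata"
    )
    print("copied" if copied else "already present")
```

The same call works for the additional Indic language files mentioned in step 8.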


Scan Tailor - An interactive post-processing tool for scanned pages. http://sourceforge.net/projects/scantailor/

1. Download and install.
2. Put all scanned images / images exported from the PDF file into one folder.
3. Start Scan Tailor. Open a new project. Select the folder.
4. Select all files / required files. Click "Fix DPI even if ...". Click OK.


5. Fix the orientation here if needed. Split pages if needed (once manually, then automatically/manually). Deskew all pages automatically by clicking the arrow button. (This makes the pages vertical for better scanning.)


6. Select content. To select the content of all pages automatically, click the arrow button.
7. Check the selection on every page manually and correct it if needed.
8. Margins are optional. Try them if needed.
9. Output. On clicking this tab, a single page will be output in the out folder of your chosen folder.

Important: Change the resolution to DPI = 300 with 2%-5% thicker (or 600 with 2%-10% thicker) for all pages for better OCR.


For automatic output of all pages, click the arrow button.


10. The output is now ready for OCR in the out folder of the chosen folder.
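Once the out folder is filled, the pages can also be recognized in a batch with the Tesseract command line instead of a GUI. The sketch below only builds the commands. It assumes Python is available, that Scan Tailor wrote its pages as .tif files (check your out folder), and that the tesseract executable is on the PATH. The function name tesseract_commands is hypothetical.

```python
from pathlib import Path


def tesseract_commands(out_dir, lang="hin", exe="tesseract"):
    """Return one Tesseract command per TIFF page found in out_dir, in name order.

    Each command has the form: tesseract <image> <output base> -l <language>.
    Tesseract appends ".txt" to the output base and writes UTF-8 text.
    """
    commands = []
    for image in sorted(Path(out_dir).glob("*.tif")):
        output_base = image.with_suffix("")  # e.g. out/page1.tif -> out/page1
        commands.append([exe, str(image), str(output_base), "-l", lang])
    return commands


if __name__ == "__main__":
    for cmd in tesseract_commands("out"):
        print(" ".join(cmd))
        # To actually run it (requires Tesseract with hin.traineddata installed):
        # import subprocess; subprocess.run(cmd, check=True)
```

Sorting by file name keeps the text files in page order, which matters when they are later joined into one document.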

GUI software Installation: a simple version. http://code.google.com/p/tesseract-ocr/wiki/3rdParty

1. gimagereader_0.9-1_win32.exe - G Image Reader. http://sourceforge.net/projects/gimagereader/


Using G Image Reader:

1. Open it through the program shortcut.
2. Configure it for Hindi language data.
3. Choose the Languages tab.
4. Add a custom Tesseract language for Hindi. Click the Add button.
   Prefix: hin, Name: Hindi or हिन्दी, ISO Code: hi_IN
   Prefix: guj, Name: Gujarati or ગુજરાતી, ISO Code: gu_IN
   Click OK.


5. Click Apply.
6. Restart immediately; this is important.
7. Enjoy using G Image Reader for image and PDF files.
8. Every time we open G Image Reader, we have to configure > apply.
9. We have to set the Hindi language option every time.


10. Open any image/PDF file.
11. Select the area to be scanned.
12. Click Recognize Selection.


13. Wait for the OCR to finish.
14. Save the file in Unicode at your desired destination.
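A quick way to confirm that the saved file really is Unicode Hindi text is to read it strictly as UTF-8 and measure how much of it falls in the Devanagari block. This is a minimal sketch, assuming Python is available. The function names are hypothetical and the ratio is only a rough sanity check, not a measure of OCR accuracy.

```python
def read_utf8(path):
    """Read a file strictly as UTF-8; raises UnicodeDecodeError if it is not.

    "utf-8-sig" also accepts files that start with a byte order mark.
    """
    with open(path, "r", encoding="utf-8-sig") as handle:
        return handle.read()


def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


if __name__ == "__main__":
    sample = "हिन्दी OCR"
    print(round(devanagari_ratio(sample), 2))
```

A value close to zero for a Hindi page suggests that the wrong language was selected or that the file was saved in a legacy font encoding rather than Unicode.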

GUI software Installation: an advanced version. http://code.google.com/p/tesseract-ocr/wiki/3rdParty

1. VietOCR. VietOCR is a Java GUI frontend for the Tesseract OCR engine, providing character recognition support for common image formats and multi-page images. http://vietocr.sourceforge.net/ http://sourceforge.net/projects/vietocr/
2. Download and extract the folder.
3. To download hin.traineddata, go to Settings > Download Language data > Select Language and download.
4. Start with the ocr.bat file.
5. Set the OCR language to Hindi.
6. Start OCR'ing any image file.
7. PDF files are supported, but that requires more technical support. Please go through the readme.html file. We were unable to open PDF files.

More links for your help.

1. Use the Hindi spell checker http://www.awgp.in/spellchecker/ or http://www.bhashagiri.com/.
2. Use Hindi Lekhak, http://www.awgp.in/hindilekhak/ or https://dl.dropbox.com/u/53672006/Hindi-Lekhak-Release-20.01.2012.zip, for font conversion/typing, and Pramukh Type Pad (currently it supports 20 Indian languages - http://www.vishalon.net/PramukhIME/PramukhTypePad.aspx) for typing.
3. Bhasha India's TBIL Converter 3.0 http://bhashaindia.com/Downloads/Pages/home.aspx , TBIL Converter 3.0 with Microsoft .NET Framework 3.5 Service pack 1 http://www.microsoft.com/en-in/download/details.aspx?id=22 http://download.microsoft.com/download/0/6/1/061F001C-8752-4600-A198-53214C69B51F/dotnetfx35setup.exe
   For OFFLINE use (231MB), Microsoft .NET Framework 3.5 Service pack 1 (Full Package) http://www.microsoft.com/en-us/download/details.aspx?id=25150 http://download.microsoft.com/download/2/0/E/20E90413-712F-438C-988E-FDAA79A8AC3D/dotnetfx35.exe
4. Upcoming integrated module: All-in-One (OCR + spell checker + font conversion within one pack). Visit http://www.bhashagiri.com/ .
5. Irfan Viewer http://www.irfanview.com/ for image file conversion in batch.
6. linux-intelligent-ocr-solution from Nalin & Sathyan. http://code.google.com/p/linux-intelligent-ocr-solution/downloads/list .


Examples of Hindi OCR


[Sample OCR output, page 1: Hindi text of the discourse "आपत्तिकाल का अध्यात्म" (The Spirituality of Times of Crisis), opening with "देवियो, भाइयो!" (Sisters and brothers!). The passage contrasts ordinary times, in which people are born, grow up, marry, raise children, grow old and die, with the extraordinary times of crisis that also arise in human life.]


[Sample OCR output, page 2: Hindi text of a discourse opening with "देवियो, भाइयो!" (Sisters and brothers!). The passage says that devatas (deities) are so named because they give, and that the one thing they have to give is devatva: goodness of qualities, actions and nature.]


Installation guidelines for Hindi OCR (UBUNTU 12.04)

1. Connect to the Internet.
2. Open Synaptic Manager.
3. Search for tesseract-ocr. Select tesseract-ocr and the tesseract-ocr-hin language file.
4. Install.
5. Open Google and search for the gImageReader download http://sourceforge.net/projects/gimagereader/ .
6. Download the .deb file from Sourceforge.
7. Install it.
8. Open gImageReader.
9. Configure the Hindi language.
10. File > Configure > Languages > Add > Prefix=hin, Name=Hindi, ISO code=hi_IN
11. OK and Apply.
12. Select the required language: Hindi > hi
13. Open the file.
14. Scan the selected area, the full page or all pages.
15. Scan Tailor - An interactive post-processing tool for scanned pages. http://sourceforge.net/projects/scantailor/
16. Use Synaptic Manager in UBUNTU to install the Scan Tailor software.
17. Thanks to Nalin & Sathyan Ji for their project linux-intelligent-ocr-solution.
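After installing the packages, it can help to confirm that Tesseract actually sees the Hindi language file before configuring a GUI. The sketch below assumes Python is available and relies on the --list-langs option of the tesseract command. The header check is a heuristic based on the usual "List of available languages" first line, and both function names are hypothetical.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def installed_languages(exe="tesseract"):
    """Run Tesseract and return its installed language codes (empty set if missing)."""
    if shutil.which(exe) is None:
        return set()
    result = subprocess.run(
        [exe, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions print the list on stderr
        universal_newlines=True,
    )
    return parse_list_langs(result.stdout)


if __name__ == "__main__":
    langs = installed_languages()
    print("Hindi available" if "hin" in langs else "hin not found")
```

If hin is missing, the tesseract-ocr-hin package from step 3 was probably not installed.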

Linux-intelligent-ocr-solution-1.6

Lios is free and open source software for converting print into text using either a scanner or a camera. It can also produce text out of scanned images from other sources such as PDF, Image or Folder containing Images. The program is given total accessibility for the visually impaired. Lios is written in Python, and we release it under the GPL3 license. Lios will work with Debian-based operating systems. There are a great many possibilities for this program, and feedback is the key to it. Expecting your feedback at Nalin.x.Linux@Gmail.com or at 09446012215.

With a super GUI and working great for Hindi. More modifications are going on. Your help is needed to improve it. They are working selflessly for Visually Impaired Persons.