PDF data loading without MR using Pig

Upload: rajesh-kumar-mandal

Post on 11-Feb-2017

Unstructured data (PDF) conversion and loading into HDFS

Objective : We received PDF data from a client. We have to convert the PDF data to plain-text (txt) format and load it into HDFS so that we can generate reports based on the client's data.

Data Sample :

Step 1 :

Copy the PDF file into any folder on the Linux box.

File name : InputData.pdf

Run the commands below in the Linux environment from that folder :

hadoop@hadoop:~/Testing$ pdftotext -layout -nopgbrk InputData.pdf
hadoop@hadoop:~/Testing$ cat InputData.txt | tr -s " " > Input1.txt
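To make the effect of tr -s " " concrete, here is a minimal sketch on a single made-up row. The row contents are hypothetical and only mimic the seven-column shape used later in the Pig schema; the sed step is an optional extra, not part of the original commands. It may help if pdftotext -layout indents rows, because a leading space would otherwise become an empty first field once the file is split on spaces.

```shell
#!/bin/sh
# Hypothetical row, padded the way `pdftotext -layout` might pad a table row.
row='   101    Rajesh     Equity   Pune      5000    7500   12'

# tr -s ' ' squeezes every run of spaces down to one space.
# Note that a single leading space survives this step.
squeezed=$(printf '%s\n' "$row" | tr -s ' ')

# Optional extra step (an assumption, not in the original commands):
# strip one leading and one trailing space so the first field is not empty.
cleaned=$(printf '%s\n' "$squeezed" | sed -e 's/^ //' -e 's/ $//')

printf '[%s]\n' "$squeezed"
printf '[%s]\n' "$cleaned"
```

Values that themselves contain a space (for example a two-word name) would still be split into separate fields, so it is worth checking a few rows by eye before loading.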

Step 2 : Copy Input1.txt into HDFS

hadoop@hadoop:~/Testing$ hadoop fs -copyFromLocal Input1.txt /rajesh

Step 3 : To view the file in HDFS
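The original command for this step is not shown in the transcript. One plausible way to check the upload, assuming the hadoop CLI is on the PATH and the file landed at /rajesh/Input1.txt as in Step 2, is sketched below. The function name view_hdfs_file is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the parent directory in HDFS and print the
# first lines of the uploaded file.
view_hdfs_file() {
    path="$1"
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop CLI not found on PATH" >&2
        return 1
    fi
    # Confirm that the file exists in its HDFS directory.
    hadoop fs -ls "$(dirname "$path")"
    # Print only the first 10 lines to keep the output short.
    hadoop fs -cat "$path" | head -n 10
}
```

It could be called as view_hdfs_file /rajesh/Input1.txt from the same shell session.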

Step 4 : Run Pig in HDFS mode

grunt> A = LOAD '/rajesh/Input1.txt' USING PigStorage(' ') AS (Sid:int, Sname:chararray, Ttrading:chararray, Sloc:chararray, OBal:int, CBal:int, Frate:int);
2016-05-23 18:38:22,096 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> disHM = DISTINCT A;
grunt> orHM = ORDER disHM BY Sid;
grunt> STORE orHM INTO '/rajesh/pigoutput' USING PigStorage(',');
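As a rough local cross-check of what the Pig statements above should produce, the same three operations can be approximated with standard Unix tools on a small sample. This is only a sketch and is not part of the original workflow. The sample rows are invented, and sort/tr merely stand in for DISTINCT, ORDER ... BY Sid, and PigStorage(',').

```shell
#!/bin/sh
# Hypothetical sample rows in the same space-delimited shape as Input1.txt.
sample='103 Amit Equity Delhi 2000 2500 10
101 Rajesh Equity Pune 5000 7500 12
103 Amit Equity Delhi 2000 2500 10
102 Neha Bond Mumbai 1000 900 8'

# sort -u            ~ DISTINCT (drops the duplicated 103 row)
# sort -t ' ' -k1,1n ~ ORDER ... BY Sid (numeric sort on the first field)
# tr ' ' ','         ~ PigStorage(',') at STORE time
result=$(printf '%s\n' "$sample" | sort -u | sort -t ' ' -k1,1n | tr ' ' ',')

printf '%s\n' "$result"
```

Running the same pipeline on the first few lines of Input1.txt and comparing against the Pig output is a quick way to spot delimiter or schema problems.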

To view the output generated by Pig :
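The command used here is not included in the transcript. A hedged sketch follows, assuming the hadoop CLI is available and the output directory is /rajesh/pigoutput as in the STORE statement. Pig writes its results as one or more part files inside that directory, so the sketch matches them with a glob rather than guessing an exact file name. The function name view_pig_output is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list a Pig output directory in HDFS and print
# the contents of its part files.
view_pig_output() {
    out_dir="$1"
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop CLI not found on PATH" >&2
        return 1
    fi
    # Show the files written by the STORE statement.
    hadoop fs -ls "$out_dir"
    # The glob is quoted so that hadoop, not the local shell, expands it.
    hadoop fs -cat "$out_dir/part-*"
}
```

It could be called as view_pig_output /rajesh/pigoutput once the STORE statement has finished.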