CHAPTER 4
ANALYSIS AND DESIGN
4.1 Analysis
4.1.1 Collecting Data
Data can be collected either by web scraping or from text entered by the user. If
the user chooses web scraping, only a limited set of source links can be retrieved.
Web scraping obtains the theme, source link, title, and content of the news.
The links that can be accessed in this project include tribunnews.com,
kompas.com, detik.com, saracennews.com, ishoax.blogspot.com, and suaranasional.com.
Example:

Training Data

Table 4.1: Example of Training Data

No. 1
Link: http://m.tribunnews.com/regional/2017/11/07/puluhan-sapi-ternak-mati-mendadak-di-desa-hayup-kabupaten-tabalong
Source: Tribunnews
Title: Puluhan Sapi Ternak Mati Mendadak di Desa Hayup Kabupaten Tabalong
Content: Peternak di Desa Hayup, Kecamatan Haruai, Tabalong, cemas. Sapi Bali peliharaan mereka mati mendadak.
Status: Real

No. 2
Link: https://www.saracennews.com/news/2017-07-20-seorang-janda-yakin-anak-sapi-ini-jelmaan-mendiang-suaminya
Source: Saracen
Title: Seorang Janda Yakin Anak Sapi Ini Jelmaan Mendiang Suaminya
Content: Media sosial di Kamboja saat ini dihebohkan oleh berita seorang janda dari sebuah desa yang meyakini seekor anak sapi berusia lima bulan mirip mendiang suaminya.
Status: Hoax
Data Set

Table 4.2: Example of Data Set

No. 1
Link: http://internasional.kompas.com/read/2017/06/03/09244141/diyakini.titisan.dewa.anak.sapi.berwajah.manusia.hebohkan.warga
Source: Kompas
Title: Diyakini Titisan Dewa, Anak Sapi Berwajah Manusia Hebohkan Warga
Content: Sebuah rekaman video yang luar biasa memperlihatkan seekor anak sapi yang baru lahir tetapi memiliki wajah mirip wajah manusia.
Status: ?
Pre-processing converts unstructured textual data, such as the raw string of a
news article, into structured textual data, which is easier to classify.
Pre-processing consists of the following steps:
4.1.1.1 Case Folding
Case folding is a process that converts all characters to lowercase and
removes punctuation.
Example

Before (Title): Diyakini Titisan Dewa, Anak Sapi Berwajah Manusia Hebohkan Warga
Before (Content): Sebuah rekaman video yang luar biasa memperlihatkan seekor anak sapi yang baru lahir tetapi memiliki wajah mirip wajah manusia.
After (Title): diyakini titisan dewa, anak sapi berwajah manusia hebohkan warga
After (Content): sebuah rekaman video yang luar biasa memperlihatkan seekor anak sapi yang baru lahir tetapi memiliki wajah mirip wajah manusia.
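The case-folding step can be sketched as follows. The project itself is implemented in Java; this is an illustrative Python sketch, and the function name `case_fold` is an assumption, not taken from the project. Note that it removes punctuation as the definition states, whereas the worked example above keeps the comma in the title.

```python
import string

def case_fold(text: str) -> str:
    """Lowercase the text and strip all punctuation characters."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

title = "Diyakini Titisan Dewa, Anak Sapi Berwajah Manusia Hebohkan Warga"
print(case_fold(title))
# diyakini titisan dewa anak sapi berwajah manusia hebohkan warga
```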
4.1.1.2 Tokenizing
Tokenizing is a process that splits a sentence into words based on whitespace.
Example

Table 4.3: Tokenizing Title

Word (Title): diyakini, titisan, dewa, anak, sapi, berwajah, manusia, hebohkan, warga

Table 4.4: Tokenizing Content

Word (Content): sebuah, rekaman, video, yang, luar, biasa, memperlihatkan, seekor, anak, sapi, yang, baru, lahir, tetapi, memiliki, wajah, mirip, wajah, manusia
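Whitespace tokenization as described above can be sketched in Python (an illustration, not the project's Java implementation):

```python
def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens on whitespace."""
    return sentence.split()

content = ("sebuah rekaman video yang luar biasa memperlihatkan seekor "
           "anak sapi yang baru lahir tetapi memiliki wajah mirip wajah manusia")
tokens = tokenize(content)
print(len(tokens))  # 19, the number of words in Table 4.4
```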
4.1.1.3 Stemming
Stemming is a process to obtain the root of each word. In this project, the
Nazief & Adriani algorithm is used for stemming.
Table 4.5: Stemming Title

Word (Title): yakin, titisan, dewa, anak, sapi, wajah, manusia, heboh, warga

Table 4.6: Stemming Content

Word (Content): sebuah, rekaman, video, yang, luar, biasa, lihat, seekor, anak, sapi, yang, baru, lahir, tetapi, milik, wajah, mirip, wajah, manusia
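The full Nazief & Adriani algorithm relies on a root-word dictionary, ordered affix-removal rules, and recoding; the toy sketch below only strips one suffix and one prefix to illustrate the idea. It is not the project's actual stemmer, and the affix lists are simplified assumptions:

```python
# Simplified Indonesian affix lists (illustrative subset only).
PREFIXES = ("memper", "meng", "mem", "men", "me", "di", "ber", "ter", "ke", "pe")
SUFFIXES = ("kan", "an", "i")

def naive_stem(word: str) -> str:
    """Toy affix stripper: remove one suffix, then one prefix.
    A real Nazief & Adriani stemmer also checks a dictionary of root
    words and applies recoding rules after prefix removal."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            word = word[: -len(suf)]
            break
    for pre in PREFIXES:
        if word.startswith(pre) and len(word) - len(pre) >= 3:
            word = word[len(pre):]
            break
    return word

print(naive_stem("diyakini"))        # yakin
print(naive_stem("memperlihatkan"))  # lihat
print(naive_stem("hebohkan"))        # heboh
```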
4.1.1.4 Filtering Stop Words
Filtering stop words removes all stop words (common function words such as
"yang", "sebuah", and "tetapi") in order to make classification more accurate.

Table 4.7: Stop Word Removal (Title)

Word (Title): yakin, titisan, dewa, anak, sapi, wajah, manusia, heboh, warga

Table 4.8: Stop Word Removal (Content)

Word (Content): rekaman, video, lihat, anak, sapi, lahir, milik, wajah, manusia, mirip
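Stop-word filtering can be sketched in Python. The stop-word list below is a small illustrative set inferred from the words dropped between Table 4.6 and Table 4.8; the project's real list is presumably larger:

```python
# Assumed mini stop-word list for illustration only.
STOP_WORDS = {"sebuah", "yang", "luar", "biasa", "seekor", "baru", "tetapi"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop stop words, keeping the remaining tokens in order."""
    return [t for t in tokens if t not in STOP_WORDS]

stemmed = ["sebuah", "rekaman", "video", "yang", "luar", "biasa", "lihat",
           "seekor", "anak", "sapi", "yang", "baru", "lahir", "tetapi",
           "milik", "wajah", "mirip", "wajah", "manusia"]
filtered = remove_stop_words(stemmed)
print(filtered)
```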
4.1.2 Document Frequency Thresholding
Document Frequency Thresholding is a process that counts, for each word, the
number of documents in which that word appears.
4.1.3 Term Frequency – Inverse Document Frequency (TF-IDF)
TF-IDF is a term-weighting scheme that assigns a value to each word; the
frequency of a word affects this value [3].

TF-IDF formula:

W_dt = tf_dt × idf_t = tf_dt × log(N / df_t)    (1)

Where:
W_dt = weight of term t in document d
tf_dt = frequency of term t in document d
N = total number of documents
df_t = number of documents that contain term t
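Formula (1) can be sketched in Python; the values match the title calculation in Table 4.9, which uses the base-10 logarithm:

```python
import math

def tf_idf(tf: int, n_docs: int, df: int) -> float:
    """W_dt = tf_dt * log10(N / df_t), as in formula (1)."""
    return tf * math.log10(n_docs / df)

# Two title documents: "sapi" appears in both, "ternak" only in D1.
print(round(tf_idf(1, 2, 2), 3))  # sapi:   0.0
print(round(tf_idf(1, 2, 1), 3))  # ternak: 0.301
```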
Example

Table 4.9: Text Mining Calculation (Title)

| Term     | TF D1 | TF D2 | DF | D/DF | log(D/DF) | W D1  | W D2  |
| sapi     | 1     | 1     | 2  | 2/2  | 0         | 0     | 0     |
| ternak   | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| mati     | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| dadak    | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| hayup    | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| tabalong | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| janda    | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| yakin    | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| anak     | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| jelmaan  | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| mendiang | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| suami    | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
TF-IDF Content

Table 4.10: Text Mining Calculation (Content)

| Term       | TF D1 | TF D2 | DF | D/DF | log(D/DF) | W D1   | W D2   |
| peternak   | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| hayup      | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| haruai     | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| tabalong   | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| cemas      | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| sapi       | 1     | 1     | 2  | 2/2  | 0         | 0      | 0      |
| bali       | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| peliharaan | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| mereka     | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| mati       | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| dadak      | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| media      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| sosial     | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| kamboja    | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| heboh      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| berita     | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| janda      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| yakin      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| anak       | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| mirip      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| mendiang   | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| suami      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
4.1.4 Processing Data using Multinomial Naive Bayes
Multinomial Naive Bayes counts the frequency of each word that appears in a
document, for example document d in class c:

P(c | terms of document d) = P(c) × P(t1|c) × P(t2|c) × ... × P(tn|c)    (2)

Where:
P(c | terms of document d) = probability that document d belongs to class c
P(c) = prior probability of class c
tn = the n-th word of document d
P(tn|c) = probability of word n given class c

Prior probability of class c:

P(c) = Nc / N    (3)

Where:
P(c) = prior probability of class c
Nc = number of documents of class c
N = total number of documents

P(tn|c) = (W_ct + 1) / ( Σ_{t'∈V} W_ct' + B' )    (4)

Where:
W_ct = W_dt from formula (1), restricted to documents of class c
Σ_{t'∈V} W_ct' = sum of the weights of all terms in class c
B' = number of unique words (vocabulary size)
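Formulas (3) and (4) can be sketched in Python; the values reproduce the "yakin" and "titisan" rows of Table 4.11, where the sum of weights for the Hoax class is 7 and the vocabulary size B' is 12:

```python
def prior(n_class: int, n_total: int) -> float:
    """P(c) = Nc / N, formula (3)."""
    return n_class / n_total

def likelihood(w_ct: float, sum_w_class: float, vocab_size: int) -> float:
    """P(t|c) = (W_ct + 1) / (sum of class weights + B'), formula (4),
    with add-one (Laplace) smoothing in the numerator."""
    return (w_ct + 1) / (sum_w_class + vocab_size)

print(prior(1, 2))                          # 0.5
print(round(likelihood(0.3010, 7, 12), 3))  # yakin | Hoax: 0.068
print(round(likelihood(0.0, 7, 12), 3))     # titisan | Hoax: ~0.053 (Table 4.11 truncates to 0.052)
```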
Example:

Table 4.11: Likelihood Calculation (Title)

| Term    | Wct Hoax | Wct Real | ΣW Hoax | ΣW Real | All W | Likelihood Hoax | Likelihood Real |
| yakin   | 0.3010   | 0        | 7       | 6       | 12    | 0.068           | 0.055           |
| titisan | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| dewa    | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| anak    | 0.3010   | 0        | 7       | 6       | 12    | 0.068           | 0.055           |
| sapi    | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| wajah   | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| manusia | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| heboh   | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| warga   | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |

Table 4.12: Likelihood Calculation (Content)

| Term    | Wct Hoax | Wct Real | ΣW Hoax | ΣW Real | All W | Likelihood Hoax | Likelihood Real |
| rekaman | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| video   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| lihat   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| anak    | 0.3010   | 0        | 11      | 10      | 22    | 0.0394          | 0.0312          |
| sapi    | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| lahir   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| milik   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| wajah   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| manusia | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| mirip   | 0.3010   | 0        | 11      | 10      | 22    | 0.0394          | 0.0312          |
Prior probability:
P_prior(Hoax) = 1/2
P_prior(Real) = 1/2

Likelihood of the Title given Hoax: 4.75 × 10^-12
Likelihood of the Title given Real: 4.60 × 10^-12
Likelihood of the Content given Hoax: 1.10 × 10^-15
Likelihood of the Content given Real: 8.74 × 10^-16

P(Hoax) = P_prior × P_likelihood = 1/2 × 4.75 × 10^-12 × 1.10 × 10^-15 = 2.61 × 10^-27
P(Real) = P_prior × P_likelihood = 1/2 × 4.60 × 10^-12 × 8.74 × 10^-16 = 2.01 × 10^-27

Result:
The news is categorized as hoax, because its hoax probability is greater than
its real probability.
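The final comparison can be checked with a short Python sketch using the worked numbers above:

```python
# Posterior comparison for the data-set article (prior = 1/2 for each class).
p_hoax = 0.5 * 4.75e-12 * 1.10e-15  # prior * title likelihood * content likelihood, Hoax
p_real = 0.5 * 4.60e-12 * 8.74e-16  # prior * title likelihood * content likelihood, Real
label = "Hoax" if p_hoax > p_real else "Real"
print(f"{p_hoax:.2e} vs {p_real:.2e} -> {label}")  # 2.61e-27 vs 2.01e-27 -> Hoax
```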
4.1.5 Evaluation Multinomial Naive Bayes Algorithm
On this project, evaluation of Multinomial Naive Bayes Algorithm use confusion
matrix. Confusion matrix is a way to evaluate performance of Naive Bayes
Algorithm. On this testing, writer use confusion matrix and some formula such as
Table 4.13: Confusion Matrix
True_Real True_Hoax
Classify_as_Real TP(True Positive) FN(False
Negative)
Classify_as_Hoax FP(False Positive) TN(True
Negative)
Recall= TP
Precision= TP
TP+FN TP+FP
FN +FP
100 %
Error Rate=
x
TP+FN +FP+TN
Accuracy =( TP )( P )+( TN )( N )
P+N
P+N
P N
Note:
1. True Positive (TP) is the number of real news items correctly classified as real.
2. False Positive (FP) is the number of real news items wrongly classified as hoax.
3. False Negative (FN) is the number of hoax news items wrongly classified as real.
4. True Negative (TN) is the number of hoax news items correctly classified as hoax.
5. Recall is the ratio of real news correctly predicted as real to the total of news predicted as real.
6. Precision is the ratio of real news correctly predicted as real to the total of actual real news.
7. Error Rate is the percentage of data in the data set that Naive Bayes classifies wrongly.
8. Accuracy is the percentage of data in the data set that Naive Bayes classifies correctly.
9. P is the number of real news items in the testing data.
10. N is the number of hoax news items in the testing data.
For example:
There are 76 training data and 8 data set items.

Table 4.14: Example Data to Calculate Confusion Matrix

| No | Reality | Classified as |
| 1  | Hoax    | Real          |
| 2  | Hoax    | Hoax          |
| 3  | Hoax    | Hoax          |
| 4  | Hoax    | Hoax          |
| 5  | Real    | Hoax          |
| 6  | Real    | Hoax          |
| 7  | Real    | Real          |
| 8  | Real    | Real          |

Then the confusion matrix is:

Table 4.15: Example of Confusion Matrix

|                     | True_Real | True_Hoax |
| Classified_as_Real  | 2 (TP)    | 1 (FN)    |
| Classified_as_Hoax  | 2 (FP)    | 3 (TN)    |
Recall = 2 / (2 + 1) = 2/3
Precision = 2 / (2 + 2) = 2/4
Error Rate = (2 + 1) / (2 + 2 + 1 + 3) × 100% = 37.5%
Accuracy = (2/4) × (4/(4+4)) + (3/4) × (4/(4+4)) = 0.625 = 62.5%
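The worked example can be verified with a Python sketch using this chapter's definitions, where P = TP + FP (actual real news) and N = FN + TN (actual hoax news):

```python
def metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Evaluation metrics as defined in this chapter (positive = real news)."""
    p = tp + fp  # actual real news in the matrix
    n = fn + tn  # actual hoax news in the matrix
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "error_rate": 100.0 * (fn + fp) / (tp + fn + fp + tn),
        "accuracy": (tp / p) * (p / (p + n)) + (tn / n) * (n / (p + n)),
    }

m = metrics(tp=2, fn=1, fp=2, tn=3)
print(m["error_rate"], m["accuracy"])  # 37.5 0.625
```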
4.1.6 Add the Result as Knowledge
The result of each classification is stored as knowledge for subsequent
classifications.
4.2 Design
4.2.1. Use Case Diagram
[Use case diagram: Crawl Data, User Input, Pre-Process, Text Mining, Naive Bayes Classification, Show Result, Save Result; with Training Data and Data Set]
Based on this diagram, the program takes input from the user (a link or text).
The data then goes through pre-processing and text mining to find the word
patterns of hoax and real news. After the text-mining analysis, the data is
classified with Naive Bayes classification by comparing the probabilities of
hoax and real; the result depends on which probability is larger. Lastly, the
data set is stored as training data.
4.2.2. Flow Chart

Illustration 4.1: Flow Chart Program (A)
Illustration 4.2: Flow Chart Program (B)
Illustration 4.3: Flow Chart Program (C)
Illustration 4.4: Flow Chart Program (D)
Illustration 4.5: Flow Chart Program (E)
4.2.3. UML Class Diagram

The Java classes in this project are related as shown in the following diagram.

Illustration 4.6: Relation of Class Diagram

Description:

Illustration 4.7: Diagram Class of News Interface

This class is an interface that contains the default properties of a news item.
Illustration 4.8: Diagram Class of News

This class crawls news from the internet and saves it in the database.
Illustration 4.9: Class Diagram of AddNews Trainer

This class adds trainer data to the database.
Illustration 4.10: Class Diagram of Stemming

This class processes data to stem words. In this project, the stemming
process uses the Nazief & Adriani algorithm. The result is a stemmed word
that is then used in the text-mining process.
Illustration 4.11: Class Diagram of TextMining

The TextMining and TextMiningTrainer classes calculate TF-IDF values based
on the data set and the training data.

Illustration 4.12: Class Diagram of TextMiningTrainer
The NaiveBayesNews class implements the Naive Bayes algorithm over all themes
of news and reports whether a news item is categorized as hoax or real.

Illustration 4.13: Class Diagram of NaiveBayesNews
The NaiveBayesThemeNews class implements the Naive Bayes algorithm on a
specific theme of news and reports whether a news item is categorized as hoax
or real.

Illustration 4.14: Class Diagram of NaiveBayesThemeNews
4.2.4 Database Schema

Illustration 4.15: Database Schema

The database stores data from the web crawling, case folding, stemming, and
stop-word removal processes.
4.2.5 Graphical User Interface Design

First, the user must choose a menu in this program.

Illustration 4.16: GUI Hoax Detection (1)

Illustration 4.17: GUI Menu

Message: to input test data (news without a link)
News: to input test data (news with a link)
Trainer Message: to input training data (news without a link) as knowledge
Trainer News: to input training data (news with a link) as knowledge

The additional word feature helps the user find a string that they believe is
a hoax word in a message or news item.
Message Detection GUI: detects whether news without a link is hoax or real.

Illustration 4.18: GUI Message Detection

News Detection GUI: detects, from an input link, whether the news is a hoax or not.

Illustration 4.19: GUI News Detection

Add Training Data (News): adds knowledge to this project.

Illustration 4.20: GUI Add Message to Training Data

Illustration 4.21: GUI Add News to Training Data