CHAPTER 4
ANALYSIS AND DESIGN
4.1 Analysis
4.1.1 Collecting Data
Data can be collected either by web scraping or from text entered by the user. If
the user chooses web scraping, only a limited set of source links can be retrieved.
Web scraping obtains the theme, source link, title, and content of the news.
The links that can be accessed in this project include tribunnews.com,
kompas.com, detik.com, saracennews.com, ishoax.blogspot.com, and suaranasional.com.
Example:

Training Data

Table 4.1: Example of Training Data

No. 1
Link: http://m.tribunnews.com/regional/2017/11/07/puluhan-sapi-ternak-mati-mendadak-di-desa-hayup-kabupaten-tabalong
Source: Tribunnews
Title: Puluhan Sapi Ternak Mati Mendadak di Desa Hayup Kabupaten Tabalong
Content: Peternak di Desa Hayup, Kecamatan Haruai, Tabalong, cemas. Sapi Bali peliharaan mereka mati mendadak.
Status: Real

No. 2
Link: https://www.saracennews.com/news/2017-07-20-seorang-janda-yakin-anak-sapi-ini-jelmaan-mendiang-suaminya
Source: Saracen
Title: Seorang Janda Yakin Anak Sapi Ini Jelmaan Mendiang Suaminya
Content: Media sosial di Kamboja saat ini dihebohkan oleh berita seorang janda dari sebuah desa yang meyakini seekor anak sapi berusia lima bulan mirip mendiang suaminya.
Status: Hoax
Data Set

Table 4.2: Example of Data Set

No. 1
Link: http://internasional.kompas.com/read/2017/06/03/09244141/diyakini.titisan.dewa.anak.sapi.berwajah.manusia.hebohkan.warga
Source: Kompas
Title: Diyakini Titisan Dewa, Anak Sapi Berwajah Manusia Hebohkan Warga
Content: Sebuah rekaman video yang luar biasa memperlihatkan seekor anak sapi yang baru lahir tetapi memiliki wajah mirip wajah manusia.
Status: ?
Pre-processing converts unstructured textual data, such as the raw string of a
news article, into structured textual data, which is easier to classify.
Pre-processing consists of the following steps:
4.1.1.1 Case Folding
Case folding is a process that converts all characters to lowercase and
removes punctuation.
Example

Before (Title): Diyakini Titisan Dewa, Anak Sapi Berwajah Manusia Hebohkan Warga
Before (Content): Sebuah rekaman video yang luar biasa memperlihatkan seekor anak sapi yang baru lahir tetapi memiliki wajah mirip wajah manusia.
After (Title): diyakini titisan dewa, anak sapi berwajah manusia hebohkan warga
After (Content): sebuah rekaman video yang luar biasa memperlihatkan seekor anak sapi yang baru lahir tetapi memiliki wajah mirip wajah manusia.
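The case-folding step can be sketched as follows. The project itself is implemented in Java; this is an illustrative Python sketch, and the function name `case_fold` is an assumption, not taken from the project. Note that it removes punctuation as the definition states, whereas the worked example above keeps the comma in the title.

```python
import string

def case_fold(text: str) -> str:
    """Lowercase the text and strip all punctuation characters."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

title = "Diyakini Titisan Dewa, Anak Sapi Berwajah Manusia Hebohkan Warga"
print(case_fold(title))
# diyakini titisan dewa anak sapi berwajah manusia hebohkan warga
```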
4.1.1.2 Tokenizing
Tokenizing is a process that splits a sentence into words based on whitespace.
Example

Table 4.3: Tokenizing Title

Word (Title): diyakini, titisan, dewa, anak, sapi, berwajah, manusia, hebohkan, warga

Table 4.4: Tokenizing Content

Word (Content): sebuah, rekaman, video, yang, luar, biasa, memperlihatkan, seekor, anak, sapi, yang, baru, lahir, tetapi, memiliki, wajah, mirip, wajah, manusia
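Whitespace tokenization as described above can be sketched in Python (an illustration, not the project's Java implementation):

```python
def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens on whitespace."""
    return sentence.split()

content = ("sebuah rekaman video yang luar biasa memperlihatkan seekor "
           "anak sapi yang baru lahir tetapi memiliki wajah mirip wajah manusia")
tokens = tokenize(content)
print(len(tokens))  # 19, the number of words in Table 4.4
```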
4.1.1.3 Stemming
Stemming is a process to obtain the root of each word. In this project, the
Nazief & Adriani algorithm is used for stemming.
Table 4.5: Stemming Title

Word (Title): yakin, titisan, dewa, anak, sapi, wajah, manusia, heboh, warga

Table 4.6: Stemming Content

Word (Content): sebuah, rekaman, video, yang, luar, biasa, lihat, seekor, anak, sapi, yang, baru, lahir, tetapi, milik, wajah, mirip, wajah, manusia
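The full Nazief & Adriani algorithm relies on a root-word dictionary, ordered affix-removal rules, and recoding; the toy sketch below only strips one suffix and one prefix to illustrate the idea. It is not the project's actual stemmer, and the affix lists are simplified assumptions:

```python
# Simplified Indonesian affix lists (illustrative subset only).
PREFIXES = ("memper", "meng", "mem", "men", "me", "di", "ber", "ter", "ke", "pe")
SUFFIXES = ("kan", "an", "i")

def naive_stem(word: str) -> str:
    """Toy affix stripper: remove one suffix, then one prefix.
    A real Nazief & Adriani stemmer also checks a dictionary of root
    words and applies recoding rules after prefix removal."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            word = word[: -len(suf)]
            break
    for pre in PREFIXES:
        if word.startswith(pre) and len(word) - len(pre) >= 3:
            word = word[len(pre):]
            break
    return word

print(naive_stem("diyakini"))        # yakin
print(naive_stem("memperlihatkan"))  # lihat
print(naive_stem("hebohkan"))        # heboh
```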
4.1.1.4 Filtering Stop Words
Filtering stop words removes all stop words (common function words such as
"yang", "sebuah", and "tetapi") in order to make classification more accurate.

Table 4.7: Stop Word Removal (Title)

Word (Title): yakin, titisan, dewa, anak, sapi, wajah, manusia, heboh, warga

Table 4.8: Stop Word Removal (Content)

Word (Content): rekaman, video, lihat, anak, sapi, lahir, milik, wajah, manusia, mirip
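Stop-word filtering can be sketched in Python. The stop-word list below is a small illustrative set inferred from the words dropped between Table 4.6 and Table 4.8; the project's real list is presumably larger:

```python
# Assumed mini stop-word list for illustration only.
STOP_WORDS = {"sebuah", "yang", "luar", "biasa", "seekor", "baru", "tetapi"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop stop words, keeping the remaining tokens in order."""
    return [t for t in tokens if t not in STOP_WORDS]

stemmed = ["sebuah", "rekaman", "video", "yang", "luar", "biasa", "lihat",
           "seekor", "anak", "sapi", "yang", "baru", "lahir", "tetapi",
           "milik", "wajah", "mirip", "wajah", "manusia"]
filtered = remove_stop_words(stemmed)
print(filtered)
```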
4.1.2 Document Frequency Thresholding
Document Frequency Thresholding is a process that counts, for each word, the
number of documents in which that word appears.
4.1.3 Term Frequency – Inverse Document Frequency (TF-IDF)
TF-IDF is a term-weighting scheme that assigns a value to each word; the
frequency of a word affects this value [3].

TF-IDF formula:

W_dt = tf_dt × idf_t = tf_dt × log(N / df_t)    (1)

Where:
W_dt = weight of term t in document d
tf_dt = frequency of term t in document d
N = total number of documents
df_t = number of documents that contain term t
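Formula (1) can be sketched in Python; the values match the title calculation in Table 4.9, which uses the base-10 logarithm:

```python
import math

def tf_idf(tf: int, n_docs: int, df: int) -> float:
    """W_dt = tf_dt * log10(N / df_t), as in formula (1)."""
    return tf * math.log10(n_docs / df)

# Two title documents: "sapi" appears in both, "ternak" only in D1.
print(round(tf_idf(1, 2, 2), 3))  # sapi:   0.0
print(round(tf_idf(1, 2, 1), 3))  # ternak: 0.301
```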
Example

Table 4.9: Text Mining Calculation (Title)

| Term     | TF D1 | TF D2 | DF | D/DF | log(D/DF) | W D1  | W D2  |
| sapi     | 1     | 1     | 2  | 2/2  | 0         | 0     | 0     |
| ternak   | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| mati     | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| dadak    | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| hayup    | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| tabalong | 1     | 0     | 1  | 2/1  | 0.301     | 0.301 | 0     |
| janda    | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| yakin    | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| anak     | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| jelmaan  | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| mendiang | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
| suami    | 0     | 1     | 1  | 2/1  | 0.301     | 0     | 0.301 |
TF-IDF Content

Table 4.10: Text Mining Calculation (Content)

| Term       | TF D1 | TF D2 | DF | D/DF | log(D/DF) | W D1   | W D2   |
| peternak   | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| hayup      | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| haruai     | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| tabalong   | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| cemas      | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| sapi       | 1     | 1     | 2  | 2/2  | 0         | 0      | 0      |
| bali       | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| peliharaan | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| mereka     | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| mati       | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| dadak      | 1     | 0     | 1  | 2/1  | 0.3010    | 0.3010 | 0      |
| media      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| sosial     | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| kamboja    | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| heboh      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| berita     | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| janda      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| yakin      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| anak       | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| mirip      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| mendiang   | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
| suami      | 0     | 1     | 1  | 2/1  | 0.3010    | 0      | 0.3010 |
4.1.4 Processing Data using Multinomial Naive Bayes
Multinomial Naive Bayes counts the frequency of each word that appears in a
document, for example document d in class c:

P(c | terms of document d) = P(c) × P(t1|c) × P(t2|c) × ... × P(tn|c)    (2)

Where:
P(c | terms of document d) = probability that document d belongs to class c
P(c) = prior probability of class c
tn = the n-th word of document d
P(tn|c) = probability of word n given class c

Prior probability of class c:

P(c) = Nc / N    (3)

Where:
P(c) = prior probability of class c
Nc = number of documents of class c
N = total number of documents

P(tn|c) = (W_ct + 1) / ( Σ_{t'∈V} W_ct' + B' )    (4)

Where:
W_ct = W_dt from formula (1), restricted to documents of class c
Σ_{t'∈V} W_ct' = sum of the weights of all terms in class c
B' = number of unique words (vocabulary size)
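Formulas (3) and (4) can be sketched in Python; the values reproduce the "yakin" and "titisan" rows of Table 4.11, where the sum of weights for the Hoax class is 7 and the vocabulary size B' is 12:

```python
def prior(n_class: int, n_total: int) -> float:
    """P(c) = Nc / N, formula (3)."""
    return n_class / n_total

def likelihood(w_ct: float, sum_w_class: float, vocab_size: int) -> float:
    """P(t|c) = (W_ct + 1) / (sum of class weights + B'), formula (4),
    with add-one (Laplace) smoothing in the numerator."""
    return (w_ct + 1) / (sum_w_class + vocab_size)

print(prior(1, 2))                          # 0.5
print(round(likelihood(0.3010, 7, 12), 3))  # yakin | Hoax: 0.068
print(round(likelihood(0.0, 7, 12), 3))     # titisan | Hoax: ~0.053 (Table 4.11 truncates to 0.052)
```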
Example:

Table 4.11: Likelihood Calculation (Title)

| Term    | Wct Hoax | Wct Real | ΣW Hoax | ΣW Real | All W | Likelihood Hoax | Likelihood Real |
| yakin   | 0.3010   | 0        | 7       | 6       | 12    | 0.068           | 0.055           |
| titisan | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| dewa    | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| anak    | 0.3010   | 0        | 7       | 6       | 12    | 0.068           | 0.055           |
| sapi    | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| wajah   | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| manusia | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| heboh   | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |
| warga   | 0        | 0        | 7       | 6       | 12    | 0.052           | 0.055           |

Table 4.12: Likelihood Calculation (Content)

| Term    | Wct Hoax | Wct Real | ΣW Hoax | ΣW Real | All W | Likelihood Hoax | Likelihood Real |
| rekaman | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| video   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| lihat   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| anak    | 0.3010   | 0        | 11      | 10      | 22    | 0.0394          | 0.0312          |
| sapi    | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| lahir   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| milik   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| wajah   | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| manusia | 0        | 0        | 11      | 10      | 22    | 0.0303          | 0.0312          |
| mirip   | 0.3010   | 0        | 11      | 10      | 22    | 0.0394          | 0.0312          |
Prior probability:
P_prior(Hoax) = 1/2
P_prior(Real) = 1/2

Likelihood of the Title given Hoax: 4.75 × 10^-12
Likelihood of the Title given Real: 4.60 × 10^-12
Likelihood of the Content given Hoax: 1.10 × 10^-15
Likelihood of the Content given Real: 8.74 × 10^-16

P(Hoax) = P_prior × P_likelihood = 1/2 × 4.75 × 10^-12 × 1.10 × 10^-15 = 2.61 × 10^-27
P(Real) = P_prior × P_likelihood = 1/2 × 4.60 × 10^-12 × 8.74 × 10^-16 = 2.01 × 10^-27

Result:
The news is categorized as hoax, because its hoax probability is greater than
its real probability.
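The final comparison can be checked with a short Python sketch using the worked numbers above:

```python
# Posterior comparison for the data-set article (prior = 1/2 for each class).
p_hoax = 0.5 * 4.75e-12 * 1.10e-15  # prior * title likelihood * content likelihood, Hoax
p_real = 0.5 * 4.60e-12 * 8.74e-16  # prior * title likelihood * content likelihood, Real
label = "Hoax" if p_hoax > p_real else "Real"
print(f"{p_hoax:.2e} vs {p_real:.2e} -> {label}")  # 2.61e-27 vs 2.01e-27 -> Hoax
```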
4.1.5 Evaluation Multinomial Naive Bayes Algorithm
On this project, evaluation of Multinomial Naive Bayes Algorithm use confusion
matrix. Confusion matrix is a way to evaluate performance of Naive Bayes
Algorithm. On this testing, writer use confusion matrix and some formula such as
Table 4.13: Confusion Matrix
True_Real True_Hoax
Classify_as_Real TP(True Positive) FN(False
Negative)
Classify_as_Hoax FP(False Positive) TN(True
Negative)
Recall= TP
Precision= TP
TP+FN TP+FP
FN +FP
100 %
Error Rate=
x
TP+FN +FP+TN
Accuracy =( TP )( P )+( TN )( N )
P+N
P+N
P N
Note:
1. True Positive (TP) is the number of real news items correctly classified as real.
2. False Positive (FP) is the number of real news items wrongly classified as hoax.
3. False Negative (FN) is the number of hoax news items wrongly classified as real.
4. True Negative (TN) is the number of hoax news items correctly classified as hoax.
5. Recall is the ratio of real news correctly predicted as real to the total of news predicted as real.
6. Precision is the ratio of real news correctly predicted as real to the total of actual real news.
7. Error Rate is the percentage of data in the data set that Naive Bayes classifies wrongly.
8. Accuracy is the percentage of data in the data set that Naive Bayes classifies correctly.
9. P is the number of real news items in the testing data.
10. N is the number of hoax news items in the testing data.
For example:
There are 76 training data and 8 data set items.

Table 4.14: Example Data to Calculate Confusion Matrix

| No | Reality | Classified as |
| 1  | Hoax    | Real          |
| 2  | Hoax    | Hoax          |
| 3  | Hoax    | Hoax          |
| 4  | Hoax    | Hoax          |
| 5  | Real    | Hoax          |
| 6  | Real    | Hoax          |
| 7  | Real    | Real          |
| 8  | Real    | Real          |

Then the confusion matrix is:

Table 4.15: Example of Confusion Matrix

|                     | True_Real | True_Hoax |
| Classified_as_Real  | 2 (TP)    | 1 (FN)    |
| Classified_as_Hoax  | 2 (FP)    | 3 (TN)    |
Recall = 2 / (2 + 1) = 2/3
Precision = 2 / (2 + 2) = 2/4
Error Rate = (2 + 1) / (2 + 2 + 1 + 3) × 100% = 37.5%
Accuracy = (2/4) × (4/(4+4)) + (3/4) × (4/(4+4)) = 0.625 = 62.5%
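The worked example can be verified with a Python sketch using this chapter's definitions, where P = TP + FP (actual real news) and N = FN + TN (actual hoax news):

```python
def metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Evaluation metrics as defined in this chapter (positive = real news)."""
    p = tp + fp  # actual real news in the matrix
    n = fn + tn  # actual hoax news in the matrix
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "error_rate": 100.0 * (fn + fp) / (tp + fn + fp + tn),
        "accuracy": (tp / p) * (p / (p + n)) + (tn / n) * (n / (p + n)),
    }

m = metrics(tp=2, fn=1, fp=2, tn=3)
print(m["error_rate"], m["accuracy"])  # 37.5 0.625
```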
4.1.6 Add the Result as Knowledge
The result of each classification is stored as knowledge for subsequent
classifications.
4.2 Design
4.2.1. Use Case Diagram
[Use case diagram: Crawl Data, User Input, Pre-Process, Text Mining, Naive Bayes Classification, Show Result, Save Result; with Training Data and Data Set]
Based on this diagram, the program takes input from the user (a link or text).
The data then goes through pre-processing and text mining to find the word
patterns of hoax and real news. After the text-mining analysis, the data is
classified with Naive Bayes classification by comparing the probabilities of
hoax and real; the result depends on which probability is larger. Lastly, the
data set is stored as training data.
4.2.2. Flow Chart

Illustration 4.1: Flow Chart Program (A)
Illustration 4.2: Flow Chart Program (B)
Illustration 4.3: Flow Chart Program (C)
Illustration 4.4: Flow Chart Program (D)
Illustration 4.5: Flow Chart Program (E)
4.2.3. UML Class Diagram

The Java classes in this project are related as shown in the following diagram.

Illustration 4.6: Relation of Class Diagram

Description:

Illustration 4.7: Diagram Class of News Interface

This class is an interface that contains the default properties of a news item.
Illustration 4.8: Diagram Class of News

This class crawls news from the internet and saves it in the database.
Illustration 4.9: Class Diagram of AddNews Trainer

This class adds trainer data to the database.
Illustration 4.10: Class Diagram of Stemming

This class processes data to stem words. In this project, the stemming
process uses the Nazief & Adriani algorithm. The result is a stemmed word
that is then used in the text-mining process.
Illustration 4.11: Class Diagram of TextMining

The TextMining and TextMiningTrainer classes calculate TF-IDF values based
on the data set and the training data.

Illustration 4.12: Class Diagram of TextMiningTrainer
The NaiveBayesNews class implements the Naive Bayes algorithm over all themes
of news and reports whether a news item is categorized as hoax or real.

Illustration 4.13: Class Diagram of NaiveBayesNews
The NaiveBayesThemeNews class implements the Naive Bayes algorithm on a
specific theme of news and reports whether a news item is categorized as hoax
or real.

Illustration 4.14: Class Diagram of NaiveBayesThemeNews
4.2.4 Database Schema

Illustration 4.15: Database Schema

The database stores data from the web crawling, case folding, stemming, and
stop-word removal processes.
4.2.5 Graphical User Interface Design

First, the user must choose a menu in this program.

Illustration 4.16: GUI Hoax Detection (1)

Illustration 4.17: GUI Menu

Message: to input test data (news without a link)
News: to input test data (news with a link)
Trainer Message: to input training data (news without a link) as knowledge
Trainer News: to input training data (news with a link) as knowledge

The additional word feature helps the user find a string that they believe is
a hoax word in a message or news item.
Message Detection GUI: detects whether news without a link is hoax or real.

Illustration 4.18: GUI Message Detection

News Detection GUI: detects, from an input link, whether the news is a hoax or not.

Illustration 4.19: GUI News Detection

Add Training Data (News): adds knowledge to this project.

Illustration 4.20: GUI Add Message to Training Data

Illustration 4.21: GUI Add News to Training Data