plagiarism detection for multithreaded software based on thread-aware software birthmarks zhenzhou...
TRANSCRIPT
Plagiarism Detection for Multithreaded Software
Based on Thread-Aware Software Birthmarks
Plagiarism Detection for Multithreaded Software
Based on Thread-Aware Software Birthmarks
Zhenzhou Tian
MOE Key Lab for Intelligent Networks and Network Security
Xi’an Jiaotong University, China
23/4/22
1
2
Outline
1. Introduction
2. Thread-Aware Birthmark Methods
3. Evaluation
4. Unsolved Problems & Future Work
3
Introduction Software plagiarism has been a serious threat to the healthy
development of software industry• Violate licenses for commercial interests or unwittingly
• Weak code protection awareness• Powerful automated code obfuscation tools• Distributed in binary form
4
Introduction A series of methods are proposed for plagiarism detection
Software Watermarking• Insert extra data• “a sufficiently determined attacker will eventually be able
to defeat any watermark” Static and Dynamic Software Birthmarks
• Dynamic birthmarks are more resilient to semantic-preserving code obfusctions
5
Introduction A series of methods are proposed for plagiarism detection
Software Watermarking Static and Dynamic Software Birthmarks Increasingly popular trend towards multithreaded
programming brings new challenge to existing dynamic birthmark methods
Existing dynamic birthmark remain optimized for sequential programs
Neglect the effect of thread scheduling Two executions of a single program under same input can be
very different, rendering the existing methods ineffective
!!
!
k
n
nkn
k
6
Introduction#include <stdio.h>#include <unistd.h>#include <pthread.h>#include<stdlib.h>#define N 8 pthread_t mThread[N];void *run(void *data){ int tid; tid =(int) data; printf("hello world from thread %d\n",tid); return NULL; }int main(int argc, char *argv[]){ int rc, i; int count; printf("input a number please: \n"); scanf("%d",&i); for(i;i<N; i++){ rc = pthread_create(&mThread[i], NULL, run, (void *) i); if (rc) printf("create thread failed. error code = %d\n", rc);} for(i=0;i<N; i++) pthread_join(mThread[i], NULL); printf("main thread finished\n"); return 0; }
DKISB SCSSB
Cosine 0.838 0.452
Jaccard 0.551 0.369
Dice 0.678 0.51
Containment 0.735 0.477
DKISB: dynamic key instruction sequence birthmark
SCSSB: system call short sequence birthmark
7
Introduction Contributions:
Two thread-aware dynamic birthmarks TW-DKISB and TW-SCSSB are proposed to detect software plagiarism• Operates directly on binary executables• Not limited to specific operating systems and languages• Resilient to various automated obfuscation techniques
29 different
obfuscation
techniques in
SandMark
Arr
ayFold
erArr
aySplit
ter
Blo
ckM
arke
rB
ludgeo
nSig
nat
ure
sB
oole
anSplit
ter
Bra
nch
Inve
rter
Duplic
ateR
egis
ters
Dyn
amic
Inlin
erFal
seR
efac
tor
Fie
ldAss
ignm
ent
Inlin
erIn
teger
Arr
aySplit
ter
Mer
geL
oca
lInte
ger
sM
ethodM
erger
Obje
ctify
Opaq
ueB
ranch
Inse
rtio
nO
verload
Nam
esPar
amAlia
sPro
mote
Prim
itiv
eReg
iste
rsPro
mote
Prim
itiv
eTyp
esPublic
izeF
ield
sR
andom
Dea
dC
ode
Ren
ameR
egis
ters
Reo
rder
Inst
ruct
ions
Reo
rder
Par
amet
ers
Sim
ple
Opqau
ePre
dic
ates
Split
Cla
sses
Sta
ticM
ethodsB
odie
sVar
iable
Rea
ssig
ner
0.0
0.2
0.4
0.6
0.8
1.0
Sim
ilarity
8
Introduction Contributions:
A prototype is implemented using the Pin instrumentation framework, and extensive experiments are conducted.
A suite of benchmarks is compiled for researchers to conduct experiments and present their findings
http://labs.xjtudlc.com/labs/benchmark.html
Pin
Libdft
Taint Analyzer
Key Instruction Recognizer&Recorder
DKISExtractor
Pin API
Libdft API
Program
Input
Key Instruction Sequence
9
Outline
1. Introduction
2. Thread-Aware Birthmark Methods
3. Evaluation
4. Unsolved Problems & Future Work
10
A set of characteristics extracted from a program that reflects intrinsic properties of the program, and which can be used to identify the program uniquely.
Two types: Static and Dynamic software birthmarks Dynamic birthmark defined by Myles
Software Birthmark
11
Thread-Aware Dynamic Software Birthmark
• Predetermining a thread schedule is very difficult• Try to shield their influence on executions instead of
enforcing thread schedule
12
Thread-Aware Dynamic Software Birthmarks Main Idea: Split then Aggregate Execution order in each thread is relatively stable. Projecting the trace on thread-ids to obtain sub-traces to extract
Slice birthmarks Aggregating all slice birthmarks.
Different traces of a program under the same input
open, read, write, read, read, write, close
open, read, read, write, write, read, close
t1 t1 t1 t1 t2 t2 t1
t1 t2 t1 t2 t1 t1 t1
open, read, write, read, closet1
read, writet2
Same slices
13
Slice Birthmark & Program Birthmark
t1
t2
<(open, read),1>, <(read,write),1>, <(write, read),1>, <(read,close),1>
<(read, write),1>
K-Gram
open, read, write, read, read, write, close
open, read, read, write, write, read, close
t1 t1 t1 t1 t2 t2 t1
t1 t2 t1 t2 t1 t1 t1
open, read, write, read, closet1
read, writet2
Slice Birthmarks
SCSSBSA
<(open, read),1>, <(read,write),2>, <(write, read),1>, <(read,close),1>
SAM
SCSSBSS
<t1, SliceBirth(t1)>, <t2, SliceBirth(t2)>SSM
14
Dynamic Analysis Module
Key Instruction Sequence
Pre-Processor
Birthmark Similarity Calculator
Plaintiff Binary
Defendant Binary
Input
k
Plagiarism Decider
Detection Result
Bir
thm
ark(
P)
Bir
thm
ark(
D)
S1,S2,…,Sn are similarity values under different inputs
Birthmark Generator
ε
System Call Sequence
Thread-Aware Birthmark based Plagiarism Detection
5 main modules:① DAM: monitoring and
recording
② PP: constitute valid traces
③ BG: extract thread-aware birthmarks
④ BSC: calculate similarity scores
⑤ PD: determine detection result
15
Dynamic Analysis Module
Key Instruction Sequence
Pre-Processor
Birthmark Similarity Calculator
Plaintiff Binary
Defendant Binary
Input
k
Plagiarism Decider
Detection Result
Bir
thm
ark(
P)
Bir
thm
ark(
D)
S1,S2,…,Sn are similarity values under different inputs
Birthmark Generator
ε
System Call Sequence
Thread-Aware Birthmark based Plagiarism Detection
5 main modules:① DAM: monitoring and
recording
② PP: constitute valid traces
③ BG: extract thread-aware birthmarks
④ BSC: calculate similarity scores
⑤ PD: determine detection result
16
Dynamic Analysis Module
Monitoring the execution of a program using Pin DKISExtractor: performs dynamic taint analysis to identify
and record key instructions SysTracer: record each execution of system calls
PinLibdft
Taint Analyzer
Key Instruction Recognizer&Recorder
DKISExtractor
Pin API
Libdft API
Program
Input
Key Instruction Sequence
SysTracer System Call Sequence
17
Dynamic Analysis Module
Key Instruction Sequence
Pre-Processor
Birthmark Similarity Calculator
Plaintiff Binary
Defendant Binary
Input
k
Plagiarism Decider
Detection Result
Bir
thm
ark(
P)
Bir
thm
ark(
D)
S1,S2,…,Sn are similarity values under different inputs
Birthmark Generator
ε
System Call Sequence
Thread-Aware Birthmark based Plagiarism Detection
5 main modules:① DAM: monitoring and
recording
② PP: constitute valid traces
③ BG: extract thread-aware birthmarks
④ BSC: calculate similarity scores
⑤ PD: determine detection result
18
Dynamic Analysis Module
Key Instruction Sequence
Pre-Processor
Birthmark Similarity Calculator
Plaintiff Binary
Defendant Binary
Input
k
Plagiarism Decider
Detection Result
Bir
thm
ark(
P)
Bir
thm
ark(
D)
S1,S2,…,Sn are similarity values under different inputs
Birthmark Generator
ε
System Call Sequence
Thread-Aware Birthmark based Plagiarism Detection
5 main modules:① DAM: monitoring and
recording
② PP: constitute valid traces
③ BG: extract thread-aware birthmarks
④ BSC: calculate similarity scores
⑤ PD: determine detection result
19
Pre-Processor & Birthmark Generator
Pre-Processor: filter out noises and extract valid traces
Birthmark Generator: generate TW-DKISBs and TW-SCSSBs utilizing SA model and SS model implemented
2#4#__NR_write#(0x6, 0xb2fcd0ef, 0x1, 0x90307d0, 0x1, 0xb2fcd064)#0x12#162#__NR_nanosleep#(0xb2fcd340, 0x0, 0x1bbc, 0xb1367010, 0x1bbc, 0xb2fcd248)#0x24#102#__NR_socketcall#(0x9, 0xb13570c0, 0xb41bfff4, 0x0, 0xb05004bc, 0xb13570a8)#0x144#3#__NR_read#(0x5, 0xb1357202, 0xa, 0x9030748, 0xa, 0xb1357164)#0x14#168#__NR_poll#(0xb0500498, 0x2, 0xffffffff, 0x0, 0xb5dfbff4, 0xb1357160)#0xffffffff
0#95BED5B7#add edi, dword ptr [ecx]#092688e0#/lib/i386-linux-gnu/libncursesw.so.52#93D5F263#add esi, dword ptr [edx+ecx*4]#fffeb53d#/usr/lib/i386-linux-gnu/libmad.so.00#95BED5BF#add ecx, dword ptr [edx]#09251af0#/lib/i386-linux-gnu/libncursesw.so.52#93D5EC60#sub eax, dword ptr [ebp+0x44]#fff9e0d8#/usr/lib/i386-linux-gnu/libmad.so.02#93D5EC63#imul dword ptr [ecx+ebx*1-0x4234]#bfa6c7d0#/usr/lib/i386-linux-gnu/libmad.so.0
Key Instruction Instances
System Call Instances
20
Dynamic Analysis Module
Key Instruction Sequence
Pre-Processor
Birthmark Similarity Calculator
Plaintiff Binary
Defendant Binary
Input
k
Plagiarism Decider
Detection Result
Bir
thm
ark(
P)
Bir
thm
ark(
D)
S1,S2,…,Sn are similarity values under different inputs
Birthmark Generator
ε
System Call Sequence
Thread-Aware Birthmark based Plagiarism Detection
5 main modules:① DAM: monitoring and
recording
② PP: constitute valid traces
③ BG: extract thread-aware birthmarks
④ BSC: calculate similarity scores
⑤ PD: determine detection result
21
Dynamic Analysis Module
Key Instruction Sequence
Pre-Processor
Birthmark Similarity Calculator
Plaintiff Binary
Defendant Binary
Input
k
Plagiarism Decider
Detection Result
Bir
thm
ark(
P)
Bir
thm
ark(
D)
S1,S2,…,Sn are similarity values under different inputs
Birthmark Generator
ε
System Call Sequence
Thread-Aware Birthmark based Plagiarism Detection
5 main modules:① DAM: monitoring and
recording
② PP: constitute valid traces
③ BG: extract thread-aware birthmarks
④ BSC: calculate similarity scores
⑤ PD: determine detection result
22
Similarity Calculator & Plagiarism Decider Similarity Calculator
' ' ' ' ' '1 1 2 2 1 1 2 2
For two SA model generated birthmarks:
, , , , , , and , , , , , ,n n m mA k v k v k v B k v k v k v
, ,
2, ,
Jaccard
Ex Dice Containment
A BA BEx Cosine A B Ex A B
A BA B
A B A BA B Ex A B
A B A
,
,
,
min A B
max A B
, ,cSim A B sim A B
, , ,c Ex Cosine Ex Jaccard Ex Dice Ex Containment
Four Similarity Metrics
23
Similarity Calculator & Plagiarism Decider Similarity Calculator
1 1 2 2
' ' ' ' ' '1 1 2 2
For two SS model generated birthmarks:
= , , , , , , and
= , , , , , ,
m m
n n
A t Birth t t Birth t t Birth t
B t Birth t t Birth t t Birth t
t1
t2
t3
.
.
.
.
.tm
t1'
t2'
t3'.....
tn'
A B
'
' '
, ,
'
1 1
,,
i jc i j i jt t MaxMatch A B
m n
i ji j
sim t t count t count tSim A B
count t count t
i icount t keySet Birth t
' 'j jcount t keySet Birth t
Bipartite matching
24
Similarity Calculator & Plagiarism Decider
Similarity Calculator Decision Maker
25
Outline
1. Introduction
2. Thread-Aware Birthmark Methods
3. Evaluation
4. Unsolved Problems & Future Work
26
Evaluation A high quality birthmark manifests in that the ratio of
false classifications should be rather low for a given ɛ Two properties to check
27
Evaluating Resilience Property Resilience to different compilers and optimization levels
Similairty scores between binaries of pigz
Statistical differences for 20 versions of pigz
28
Evaluating Resilience Property Resilience to special obfuscation tools
Cosine similarity between ConGzip and its 29 Sandmark obfuscated versions
Arr
ayFold
erArr
aySplit
ter
Blo
ckM
arke
rB
ludgeo
nSig
nat
ure
sB
oole
anSplit
ter
Bra
nch
Inve
rter
Duplic
ateR
egis
ters
Dyn
amic
Inlin
erFal
seR
efac
tor
Fie
ldAss
ignm
ent
Inlin
erIn
teger
Arr
aySplit
ter
Mer
geL
oca
lInte
ger
sM
ethodM
erger
Obje
ctify
Opaq
ueB
ranch
Inse
rtio
nO
verload
Nam
esPar
amAlia
sPro
mote
Prim
itiv
eReg
iste
rsPro
mote
Prim
itiv
eTyp
esPublic
izeF
ield
sR
andom
Dea
dC
ode
Ren
ameR
egis
ters
Reo
rder
Inst
ruct
ions
Reo
rder
Par
amet
ers
Sim
ple
Opqau
ePre
dic
ates
Split
Cla
sses
Sta
ticM
ethodsB
odie
sVar
iable
Rea
ssig
ner
0.0
0.2
0.4
0.6
0.8
1.0
Sim
ilarit
y
29
Evaluating Resilience Property Resilience to special obfuscation tools Allatori, DashO, Jshrink, ProGuard and RetroGround
Resilience to Allatori-Series obfuscation tools
30
Evaluating Credibility Property Similarity between independently implemented programs 6 compression software: Lbzip, lrzip, pbzip2, pigz, plzip and rar 5 audio players: Cmus, mocp, mp3blaster, mplayer and sox 10 web browsers: arora, chromium, dillo, dooble, epiphany, firefox,
konqueror, luakit, midori and seaMonkey
Credibility evaluation of TW-SCSSBs using 10 web browsers
31
Comparing with Traditional Birthmarks Performance Evaluation Metric
By varying ɛ from 0-0.5, an F-Measure curve can be drawn AUC: area under the F-Measure curve
EP JP EI JIPrecision
JP JI
EP JP EI JIRecall
EP EI
2 Precision RecallF Measure
Precision Recall
1 ,
, ,A B
A B A B
P P are classified as copies
Sim P P P P are classified as independent
otherwise inconclusive
Detection Criteria
32
Comparing with Traditional Birthmarks
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Threshold ()
F-M
easu
re
TW-SCSSBSA
TW-SCSSBSS
SCSSB
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Threshold ()
F-M
easu
re
TW-SCSSBSA
TW-SCSSBSS
SCSSB
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Threshold ()
F-M
easu
re
TW-SCSSBSA
TW-SCSSBSS
SCSSB
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Threshold ()
F-M
easu
re
TW-SCSSBSA
TW-SCSSBSS
SCSSB
Ex-Containment Metric
Ex-Dice Metric Ex-Jaccard Metric
Ex-Cosine Metric
F-Measure curves for TW-SCSSBSA, TW-SCSSBSS, and SCSSB
33
Outline
1. Introduction
2. Thread-Aware Birthmark Methods
3. Evaluation
4. Unsolved Problems & Future Work
34
Unsolved Problems & Future Work Problems Partial and library plagiarism problems Tool is preliminary Impact of K is not evaluated Future Works Conduct experiments using other kinds tools, such as the
shelling tools (Upx, ASProtect etc.); and on real plagiarism cases Improve our method to support for partial plagiarism detection Evaluate the effect of K to detection ability Form a relatively mature tool
35
Q&A
36
Some Definitions