
  • Radio Fingerprinting Using Convolutional Neural Networks

    A Thesis Presented

    by

    Shamnaz Mohammed Riyaz

    to

    The Department of Electrical and Computer Engineering

    in partial fulfillment of the requirements

    for the degree of

    Master of Science

    in

    Electrical and Computer Engineering

    Northeastern University

    Boston, Massachusetts

    July 2018

  • To my family


  • Contents

    List of Figures v
    List of Tables vi
    Acknowledgments vii
    Abstract of the Thesis viii

    1 Introduction 1

    2 Related work 3
      2.0.1 Supervised learning 3
      2.0.2 Unsupervised learning 7

    3 Causes of hardware impairments 9
      3.1 Hardware impairments 9
        3.1.1 I/Q imbalance 11
        3.1.2 Phase noise 11
        3.1.3 Carrier frequency and phase offset 12
        3.1.4 Harmonic distortions 13
        3.1.5 Power amplifier distortions 13
      3.2 Data Collection 14
        3.2.1 Protocols of operation 15
        3.2.2 Storage and processing 16

    4 Deep learning for RF fingerprinting 18
      4.1 Initial studies on ML techniques 18
        4.1.1 Support vector machines 19
        4.1.2 Logistic regression 19
      4.2 Convolutional neural networks 21
        4.2.1 CNN architecture 22

    5 Results and performance evaluation 29
      5.1 Network setup 29
      5.2 Evaluation 30
        5.2.1 CNN vs. conventional algorithms 32
        5.2.2 Receiver operating characteristics for radio fingerprinting 33
        5.2.3 Impact of distance on radio fingerprinting 35

    6 Conclusion 38
      6.1 Research challenges 38

    Bibliography 40

  • List of Figures

    2.1 RF fingerprinting classification. 4

    3.1 Typical transceiver chain with various sources of RF impairments. 10
    3.2 Amplitude imbalance. 11
    3.3 Phase imbalance. 11
    3.4 Phase noise. 12
    3.5 Phase offset. 12
    3.6 AM/AM distortion. 13
    3.7 AM/PM distortion. 13
    3.8 Data collection using SDR. 14
    3.9 Experimental setup demonstrating data capture. 15
    3.10 Discovery cluster partitioning. 16

    4.1 Device classification using Logistic Regression and Linear SVM for WiFi and LTE. 20
    4.2 CNN architecture for RF fingerprinting. 22
    4.3 Convolution operation: filters strided over input sequences. 23
    4.4 Rectified Linear Unit (ReLU) operation performed on feature maps. 24
    4.5 An illustration of max pooling operation. 25
    4.6 An illustration of sliding operation using a window of length 128. 28

    5.1 Software stack. 30
    5.2 The accuracy comparison of SVM, logistic regression and CNN for 2-5 devices. 32
    5.3 ROC curve fold 1. 33
    5.4 ROC curve fold 2. 34
    5.5 ROC curve fold 3. 34
    5.6 ROC curve fold 4. 35
    5.7 ROC curve fold 5. 35
    5.8 Computational load. 36
    5.9 The plot of accuracy obtained using CNN for 4 devices over different distances between transmitter and receiver. 36

  • List of Tables

    4.1 CNN architecture. 26

  • Acknowledgments

    Foremost, I would like to thank my advisor Prof. Kaushik Chowdhury for his constant guidance and encouragement in all my endeavors. His vision and ideas have always been a source of inspiration for me. I thoroughly enjoyed my learning experience in his course 'Mobile and Wireless Networking' and also in the research associated with his Genesys lab. He has been extremely supportive and patient throughout this research. I would also like to thank Prof. Stratis Ioannidis and Prof. Jennifer Dy for their positive feedback and continuous association since the inception of the project.

    I thank my husband Rameez for his support, inspiration, and confidence in me. I am grateful to my parents, Mohammed and Sajida, and my in-laws, Rasheed and Rabia, for being supportive and always motivating me to excel in everything I do. I would also like to thank my labmates in the Genesys lab, specifically Kunal and Mauro, for helping me with various experiments. Their company provided a positive energy in the workplace.

  • Abstract of the Thesis

    Radio Fingerprinting Using Convolutional Neural Networks

    by

    Shamnaz Mohammed Riyaz

    Master of Science in Electrical and Computer Engineering

    Northeastern University, July 2018

    Dr. Kaushik Chowdhury, Advisor

    In this thesis, we describe a method for uniquely identifying a specific radio among nominally similar devices using a combination of software defined radio (SDR) sensing capability and machine learning (ML) techniques. Our approach to radio fingerprinting applies ML over raw I/Q samples without specifically selecting features of interest. It distinguishes devices using only the transmitter hardware-induced signal modifications that serve as a unique signature for a particular device. No higher-level decoding, feature engineering, or protocol knowledge is needed, further mitigating the challenges of ID spoofing and coexistence of multiple protocols in a shared spectrum. Advances in SDR technology allow unprecedented control over the entire processing chain, permitting modification of each functional block as well as sampling of the changes in the input waveform. We first demonstrate RF impairments by modifying the operational blocks of a typical wireless communications processing chain in a simulation study. We then generate an over-the-air dataset compiled from an experimental testbed of SDRs such as the B210 and X310, and train an optimized deep convolutional neural network (CNN) architecture on this data, achieving good classification accuracy. We describe the parallel processing needs and the choice of several hyperparameters that enable efficient training of the CNN model. We then compare the performance quantitatively with alternate techniques such as support vector machines and logistic regression. Overall, our results show that we can achieve up to 90-99% experimental accuracy at transmitter-receiver distances varying between 2-50 feet over a noisy, multipath wireless channel.

  • Chapter 1

    Introduction

    Emerging applications in the context of smart cities, autonomous vehicles, the Internet of Things (IoT), and complex military missions, among others, require reconfigurability at both the system and the protocol level within their communications architectures. These advances rely on a critical enabling component, namely, the software defined radio (SDR), which allows cross-layer programmability of the transceiver hardware using high-level directives [1]. The promise of intelligent or so-called cognitive radios builds on the SDR concept: the radio is capable of gathering contextual information and adapting its own operation by changing the settings on the SDR based on what it perceives in its surroundings.

    In the last few decades, there has been incredible growth in the use of the internet and connected devices. However, the privacy and security of these billions of devices is a paramount concern in IoT networks. Any device that has network connectivity is vulnerable, and data gathered by IoT devices are susceptible to attacks such as ID spoofing by an intruder. Most IoT devices have limited computing power and memory capacity, which makes it difficult to use complex cryptographic algorithms that require more resources than the devices can provide; authentication and authorization are therefore often insufficient. Additionally, in many mission-critical scenarios, problems in authenticating devices, ID spoofing, and unauthorized transmissions are major concerns. Moreover, high-bandwidth applications are causing a spectrum crunch, leading network providers to explore innovative spectrum sharing regimes in the TV whitespace and the sub-6 GHz bands. In all of the above, identifying (i) the type of protocol in use and (ii) the specific radio transmitter (among many other nominally similar radios) becomes important. Our work on SDR-enabled radio fingerprinting tackles these two scenarios by learning characteristic features of the transmitters in a pre-deployment training phase, which are then exploited during actual network operation. We recognize that SDRs come in diverse form factors with varying on-board computational resources. Thus, for general purpose use, any device fingerprinting approach must be computationally simple once deployed in the field. For this reason, we propose machine learning (ML) techniques, specifically Deep Convolutional Neural Networks (CNNs), and experimentally demonstrate near-perfect radio identification performance in many practical scenarios.

    ML techniques have been remarkably successful in image and speech recognition; however, their utility for device-level fingerprinting by feature learning has yet to be conclusively demonstrated. True autonomous behavior of SDRs, not only in terms of detecting spectrum usage but also in terms of self-tuning a multitude of parameters and reacting to environmental stimuli, is now a distinct possibility. We collect over 20 × 10^6 RF I/Q samples over multiple transmission rounds for each transmitter-receiver pair composed of off-the-shelf Universal Software Radio Peripheral (USRP) SDRs. The approach of providing the raw time-series radio signal to the CNN, treating each complex sample as a pair of real-valued I/Q inputs, is motivated by work on modulation classification [2], where it has been found to be a promising technique for feature learning on large time-series data. Our technique of RF fingerprinting using the I/Q samples that carry embedded signatures characteristic of different active transmitter hardware is, to the best of our knowledge, a first in this field. My contributions in this project are:

    • Generation of large real time-series data composed of 802.11ac signals using SDRs
    • Simulation study of the causes of hardware impairments of the transmitters
    • Development of a CNN architecture composed of multiple convolutional and max-pooling layers, optimized for the task of radio fingerprinting
    • Partitioning of the collected samples into separate instances for data pre-processing
    • Implementation of CNN training in Keras running on top of TensorFlow on the Northeastern Discovery cluster environment
    • Evaluation of the performance of the CNN alongside support vector machines and logistic regression

    The thesis is organized as follows. We briefly survey and classify existing approaches in Chapter 2. In Chapter 3, we design a simulation model of a typical wireless communications processing chain in MATLAB, and then modify the ideal operational blocks to demonstrate the RF impairments that we wish to learn. This is followed by the generation of real data and its preprocessing for training the classifier. In Chapter 4, we architect and experimentally validate an optimized deep convolutional neural network for radio fingerprinting. Experimental results and a quantitative comparison of our approach with support vector machines and logistic regression are provided in Chapter 5. Finally, research challenges associated with our approach and conclusions are summarized in Chapter 6.
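The raw-I/Q representation described above — each complex sample split into two real-valued channels — can be sketched in a few lines. This is a hedged illustration, not the thesis code: the window length of 128 mirrors the sliding window of Fig. 4.6, and the function name is ours.

```python
import numpy as np

def to_two_channel(iq, window=128):
    """Slice a complex I/Q stream into (num_windows, 2, window) real arrays.

    Row 0 of each window holds the in-phase (real) part, row 1 the
    quadrature (imaginary) part -- the raw-sample format fed to the CNN.
    """
    iq = np.asarray(iq)
    n = len(iq) // window
    iq = iq[: n * window].reshape(n, window)
    return np.stack([iq.real, iq.imag], axis=1)

# 4096 synthetic complex samples -> 32 windows of shape (2, 128).
samples = np.exp(1j * 2 * np.pi * 0.01 * np.arange(4096))
X = to_two_channel(samples)
```

Each (2, 128) window becomes one training example; stacking windows from many capture rounds yields the kind of large labeled dataset described above.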


  • Chapter 2

    Related work

    There has been a significant amount of research on the application of deep neural networks to cognitive radio tasks in the wireless communications field, with the focus mainly on modulation classification, which has shown impressive results [3]. Our interest is in radio fingerprinting using deep learning architectures. The key idea behind radio fingerprinting is to extract unique patterns (or features) and use them as signatures to identify devices. A variety of features at the physical (PHY) layer, medium access control (MAC) layer, and upper layers have been utilized for radio fingerprinting in the literature [4]. Simple unique identifiers such as IP addresses, MAC addresses, mobile identification numbers (MIN), and international mobile station equipment identity (IMEI) numbers can easily be spoofed. Location-based features such as received signal strength (RSS) and channel state information (CSI) are susceptible to mobility and environmental changes. We are interested in studying those features that are inherent to a device's hardware, which are also unchanging and not easily replicated by malicious agents. We classify existing approaches in Fig. 2.1.

    2.0.1 Supervised learning

    This type of learning requires a large collection of labeled samples prior to network deployment for training the ML algorithm. It takes thousands of input samples from the devices, with labels corresponding to each device. The algorithm then learns the relationship between the samples and their associated labels, and applies that learned relationship to classify completely new, unlabeled samples that the machine hasn't seen before. We study three types of mechanisms, namely similarity based, classification based, and deep learning based.


    Figure 2.1: RF fingerprinting classification.


    2.0.1.1 Similarity-based

    Similarity measurements involve comparing the observed signature of a given device with the references present in a master database. In [5], a passive fingerprinting technique is proposed that identifies the wireless device driver running on an IEEE 802.11 compliant node by collecting traces of probe request frames from the devices. The authors used a binning approach on the time differences between probes as features. The bins are iterated over to compute similarity by summing the differences of the percentages and the mean differences scaled by percentage. They obtained an identification accuracy varying from 77% to 97% depending on the bin size. [6] describes a passive blackbox-based technique that uses transmission control protocol (TCP) or user datagram protocol (UDP) packet inter-arrival times (ITAs) from access points (APs) as signatures to identify AP types. APs exhibit different characteristics due to manufacturing effects, because of which each AP acts upon the packet ITA differently. In this case, an AP is treated as a blackbox, since there is no a priori information about its architecture. The authors collected multiple packet traces for each AP to compute the ITAs, and a unique pattern is then extracted using wavelet analysis on these ITAs. The time intervals are sampled using bin sizes between 1-10 µs; the optimal bin size is the one that maximizes the difference in the ITAs among different APs. Cross-correlation is used to compute the similarity between unknown signals and the signatures extracted from the wavelet analysis for pattern matching.
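The bin-and-compare idea behind [5] and [6] can be illustrated with a minimal sketch (the bin width, units, and normalized-correlation similarity below are illustrative stand-ins, not the exact parameters of either paper):

```python
import numpy as np

def iat_signature(timestamps, bin_width=1.0, num_bins=10):
    """Histogram of packet inter-arrival times (units arbitrary),
    normalized so the bins sum to 1 -- a per-device signature."""
    iats = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    edges = np.arange(num_bins + 1) * bin_width
    hist, _ = np.histogram(iats, bins=edges)
    total = hist.sum()
    return hist / total if total else hist.astype(float)

def similarity(sig_a, sig_b):
    """Normalized cross-correlation between signatures (1.0 = identical)."""
    den = float(np.linalg.norm(sig_a) * np.linalg.norm(sig_b))
    return float(np.dot(sig_a, sig_b)) / den if den else 0.0

# Two captures with the same 5-unit spacing yield identical signatures.
trace_a = np.arange(100) * 5.0
trace_b = trace_a + 0.5          # same spacing, shifted start time
sig_a, sig_b = iat_signature(trace_a), iat_signature(trace_b)
```

Two devices whose traffic exhibits the same inter-arrival spacing produce a similarity of 1.0; a device with different timing behavior scores lower.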

    2.0.1.2 Classification-based

    There are several studies on supervised learning that exploit RF features such as I/Q imbalance, phase imbalance, frequency error, and received signal strength, to name a few. These imperfections are transmitter-specific and manifest themselves as artifacts of the emitted signals. There are two types of algorithms:

    • Conventional
      This form of classification examines a match with pre-selected features using domain knowledge of the system, i.e., the dominant feature(s) must be known a priori. This requires expertise in the RF domain for feature engineering. [7] proposes classification by extracting the known preamble within a packet. The preamble signals are subjected to spectral analysis using the fast Fourier transform (FFT) to obtain the spectral components from the time-domain steady-state part of the signal. These log spectral energy features are fed as input to a k-nearest neighbors (k-NN) discriminatory classifier, which uses Euclidean distance to compute the


    distance. The training preambles are mapped into a multidimensional feature space, which is divided into sections depending on the class labels. A given preamble is categorized based on the most frequently occurring label among its k nearest training preambles. This approach provides promising results, with 97% accuracy in distinguishing between eight identical transmitters at 30 dB signal-to-noise ratio (SNR). PARADIS [8] fingerprints 802.11 devices based on the modulation-specific errors that the network interface card (NIC) introduces into a wireless frame. PARADIS demonstrated its effectiveness with an accuracy of 99% in distinguishing between more than 130 similar 802.11 NICs, and is also shown to be robust against alterations and noise in the wireless channel. In [9], a technique for physical device and device-type classification called GTID is proposed. This method exploits variations in clock skews as well as hardware compositions (such as processor, DMA controller, and memory) of the devices and applies artificial neural networks (ANNs) for classification. Unique device-specific signatures are created from the time-variant behavior of the traffic using statistical techniques. GTID performs classification across various device classes, such as iPhones and Google phones, supporting a variety of traffic types such as internet control message protocol (ICMP) and Skype, and achieves high accuracy and recall on identification. In general, as multiple different features are used, selecting the right set of features is a major challenge. Additionally, RF domain knowledge plays a significant role in extracting features, which by itself is a time-consuming task. This also causes scalability problems when a large number of devices are present, leading to increased computational complexity in training.
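A minimal sketch of the preamble-spectrum k-NN pipeline of [7] — log spectral energy features fed to a Euclidean-distance k-NN — on synthetic data (the FFT length, the two artificial "devices", and the spur-tone impairment are our illustrative assumptions, not the paper's setup):

```python
import numpy as np

def log_spectral_energy(preamble, nfft=64):
    """Log of per-bin spectral energy of a time-domain preamble
    (the 1e-3 floor keeps empty bins numerically stable)."""
    spectrum = np.fft.fft(np.asarray(preamble), nfft)
    return np.log(np.abs(spectrum) ** 2 + 1e-3)

def knn_predict(train_feats, train_labels, feat, k=3):
    """k-NN with Euclidean distance: majority label among the k nearest."""
    dists = np.linalg.norm(train_feats - feat, axis=1)
    nearest = np.asarray(train_labels)[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return int(values[np.argmax(counts)])

# Two synthetic "transmitters": same preamble tone plus a device-specific
# spurious tone (a stand-in for a hardware artifact).
rng = np.random.default_rng(0)
t = np.arange(64)

def emit(device):
    tone = np.sin(2 * np.pi * 8 / 64 * t)
    spur_freq = 16 if device == 0 else 22
    spur = 0.5 * np.sin(2 * np.pi * spur_freq / 64 * t)
    return tone + spur + 0.001 * rng.standard_normal(64)

train_X = np.array([log_spectral_energy(emit(d)) for d in [0, 1] * 10])
train_y = np.array([0, 1] * 10)
pred = knn_predict(train_X, train_y, log_spectral_energy(emit(0)))
```

The device-specific spur shows up in different FFT bins, so nearest neighbors in feature space share the emitting device's label.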

    • Deep learning
      Deep learning offers a powerful framework for supervised learning. It can learn functions of increasing complexity, leverages large datasets, and greatly increases the number of layers, in addition to the number of neurons within a layer. [2] and [10] apply deep learning at the physical layer, specifically focusing on modulation recognition using convolutional neural networks. This task involves identifying and differentiating broadcast radio, local and wide area data and voice radios, radar users, and other sources of radio interference in the surroundings, each of which has different behaviors and requirements. Modulation recognition is the task of classifying the modulation type of a received radio signal with the aim of determining the communication scheme. They classify 8 digital and 3 analog, in total 11 different modulation schemes used in wireless systems. These consist of BPSK, QPSK, 8PSK, 16QAM, 64QAM, BFSK, CPFSK, and PAM4 for digital modulations, and WB-FM, AM-SSB, and


    AM-DSB for analog modulations. Overall, 87.4% classification accuracy is obtained on the

    test dataset. However, this approach does not identify a device, as we do here, but only the

    modulation type used by the transmitter.

    2.0.2 Unsupervised learning

    Unsupervised learning is effective when there is no prior label information about devices. In [11], an infinite hidden Markov random field (iHMRF)-based online classification algorithm is proposed for wireless fingerprinting using unsupervised clustering techniques and batch updates. This approach can model both time-dependent features, such as received signal strength (RSS), time-of-arrival (TOA), and angle-of-arrival (AOA), using the Markov property, and time-independent features, such as I/Q offset, carrier frequency offset (CFO), and phase shift difference (PSD), using an embedded Gaussian mixture model (GMM). A combination of these features is used to identify the number of devices in a simulation testbed; however, this approach is yet to be demonstrated on a real set of devices. Transmitter characteristics are used in [12], where a non-parametric Bayesian approach (namely, an infinite Gaussian mixture model) classifies multiple devices in an unsupervised, passive manner. A multivariate Gaussian distribution with unknown parameters is used to model the feature space of every single device; similarly, an infinite Gaussian mixture model is used for multiple devices. The features chosen by this approach are invariant to the channel, resistant to mobility, unaffected by transmitter/receiver antenna gain, and independent of distance. Unlike supervised approaches, it does not need a database of legitimate devices. This approach specifically aims to detect identity spoofing by comparing the cluster labels with the device IDs: it identifies masquerading attacks when it encounters multiple devices that share the same device ID.
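The mixture-model idea of [12] can be sketched with a finite two-component Gaussian mixture fitted by EM (the infinite, non-parametric version additionally infers the number of components; the 2-D features below are synthetic stand-ins for quantities like CFO and I/Q offset, and the minimal spherical-covariance EM here is our simplification):

```python
import numpy as np

def fit_gmm2(X, iters=60):
    """Minimal EM for a 2-component spherical Gaussian mixture.
    Returns the fitted means and hard cluster labels."""
    n, d = X.shape
    mu = np.array([X.min(axis=0), X.max(axis=0)], dtype=float)  # spread init
    var = np.ones(2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities under each spherical Gaussian.
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2)      # (n, 2)
        logp = np.log(pi) - 0.5 * d2 / var - 0.5 * d * np.log(var)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and per-component variance.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r * d2).sum(axis=0) / (nk * d) + 1e-9
    return mu, r.argmax(axis=1)

# Synthetic features per transmission: [CFO in kHz, I/Q amplitude offset]
# from two hypothetical devices (values illustrative).
rng = np.random.default_rng(0)
dev_a = rng.normal([+2.0, 0.1], 0.2, size=(50, 2))
dev_b = rng.normal([-2.0, -0.1], 0.2, size=(50, 2))
means, labels = fit_gmm2(np.vstack([dev_a, dev_b]))
```

Spoofing detection then amounts to comparing cluster labels against claimed device IDs: two clusters sharing one ID suggests masquerading.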

    Our choice of algorithm is deep learning based, built on deep neural networks in which several hidden layers are present between the input and output nodes. These hidden layers extract features from the input data and perform much more complicated classification tasks over the learned features. Unlike conventional algorithms, this approach does not require feature engineering, thus reducing human intervention in identifying features. In recent years, deep learning has been found to be successful in object recognition, image classification, and powering vision in robots. Most voice-activated personal assistants, such as Alexa, Cortana, Google Assistant, and Siri, and other high-bandwidth applications such as YouTube and Netflix, are powered by artificial-intelligence (AI) engines that provide information and recommendations according to the user's interests. However, building such intelligent systems is not an easy task. Training these deep learning algorithms requires copious amounts of data, on the order of terabytes, for them to perform well. It also involves careful selection of hyperparameters and efficient tuning of these parameters to solve complex functions. The number of such parameters can reach millions, and hence careful consideration of the training platform and resources is necessary. Generally, multi-core high-performance GPUs are preferred to ensure efficient data processing. Even with hundreds of GPU machines, training complex functions on large amounts of data can take weeks, so it is necessary to balance the trade-off between training time and classification accuracy. Transmitter identification using deep learning architectures is still at a nascent stage. Our work focuses on the generation and processing of a large number of RF I/Q samples to train the classifiers and eventually identify the devices uniquely. The data collection procedure, data pre-processing, choice of parameters, and implementation details are explained in the successive chapters.


  • Chapter 3

    Causes of hardware impairments

    Radio fingerprinting is a mechanism through which wireless devices can be identified based on unique characteristics of their analog components. Even though there has been immense progress in electronic design, RF transmitters are inherently imperfect devices due to tolerances in the manufacturing of the analog electronics. These tolerances lead to differences in device-specific parameters such as channel doping and oxide thickness. Importantly, these imperfections are too small to compromise the specifications of communication standards [13]. Such imperfections are found specifically in the transmitter front end, in components such as frequency mixers, digital-to-analog converters, band-pass filters, and power amplifiers. The RF fingerprint of a transmitter cannot be easily cloned, and hence it provides an extra layer of security on top of cryptographic mechanisms. These fingerprints are unique to each device and cannot be replicated by any other device, since each device adds its own impairments to the transmitted signal.

    3.1 Hardware impairments

    The MATLAB Communications System Toolbox provides applications for the design and analysis of communication systems. Using this toolbox, we design a simulation model of a typical wireless communications processing chain and then modify the ideal operational blocks to introduce RF impairments typically seen in actual hardware implementations. This allows us to study the I/Q imbalance, phase noise, carrier frequency and phase offset, power amplifier nonlinearity, and harmonic distortions in isolation from each other.

    A block diagram of a transceiver pair is shown in Fig. 3.1, with the various sources of RF impairments highlighted. We first study the effect of the hardware-induced causes of I/Q deviation


    Figure 3.1: Typical transceiver chain with various sources of RF impairments. [Block diagram: (a) transmitter — digital baseband (DSP), DACs, anti-aliasing filters, LO with π/2 split, PA; impairments highlighted: I/Q imbalance, phase noise, nonlinear distortion. (b) receiver — LNA, LO, anti-aliasing filters, ADCs, digital baseband (DSP); impairments highlighted: carrier frequency offset, sampling frequency offset, harmonic distortion.]


    Figure 3.2: Amplitude imbalance. [Constellation plot: in-phase amplitude vs. quadrature amplitude; input symbols and reference points.]

    Figure 3.3: Phase imbalance. [Same axes and legend as Fig. 3.2.]

    from the ideal values.

    3.1.1 I/Q imbalance

    Quadrature mixers that convert baseband to RF and vice versa are often impaired by gain and phase mismatches between the parallel sections of the RF chain dealing with the in-phase (I) and quadrature (Q) signal paths. The analog gain is never exactly the same for each signal path, and the difference between their amplitudes causes amplitude imbalance, i.e., one modulator produces a larger signal than the other. In addition, the phase shift between the two local oscillator (LO) signals is never exactly 90°, which causes phase imbalance: the cosine and sine LO signals are not perfectly orthogonal. Figs. 3.2 and 3.3 illustrate the effects of amplitude imbalance and phase imbalance on a 16-QAM constellation. In practice, I/Q amplitude imbalance is expressed in the range [-5, 5] dB, whereas phase imbalance is in the range [-30, 30] degrees.
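A commonly used baseband model of these two effects — a gain error g on the Q branch and an LO phase error φ, giving I' = I − gQ sin φ and Q' = gQ cos φ — can be sketched as follows (the parameter values are illustrative, not the thesis's simulation settings):

```python
import numpy as np

def iq_imbalance(x, amp_db=0.0, phase_deg=0.0):
    """Apply a transmitter I/Q imbalance to complex baseband samples.

    Models a Q-branch gain error g (from amp_db) and an LO phase error phi:
    the RF signal I*cos(wt) - g*Q*sin(wt + phi) has baseband equivalent
    I' = I - g*Q*sin(phi),  Q' = g*Q*cos(phi).
    """
    g = 10.0 ** (amp_db / 20.0)          # dB -> linear gain
    phi = np.deg2rad(phase_deg)
    i, q = x.real, x.imag
    return (i - g * q * np.sin(phi)) + 1j * (g * q * np.cos(phi))

sym = np.array([1 + 1j])                 # corner symbol of a QAM grid
ideal = iq_imbalance(sym)                # no impairment: symbol unchanged
warped = iq_imbalance(sym, amp_db=3.0, phase_deg=10.0)
```

With zero imbalance the constellation is untouched; nonzero values stretch and skew it, as in Figs. 3.2 and 3.3.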

    3.1.2 Phase noise

    The up-conversion of a baseband signal to a carrier frequency f_c is performed at the transmitter by mixing the baseband signal with the carrier signal. Instead of generating a pure tone at frequency f_c, i.e., e^{j2π f_c t}, the generated tone is actually e^{j(2π f_c t + φ(t))}, where φ(t) is a random phase noise. The phase noise introduces a rotational jitter, as shown in Fig. 3.4. Phase noise is expressed in units of dBc/Hz, which represents the noise power relative to the carrier contained in a 1 Hz


    Figure 3.4: Phase noise. [Constellation plot: in-phase amplitude vs. quadrature amplitude; input symbols and reference points.]

    Figure 3.5: Phase offset. [Same axes and legend as Fig. 3.4.]

    bandwidth centered at a certain offset from the carrier. Typical values of the phase noise level are in the range [−100, −48] dBc/Hz, with frequency offsets in the range [20, 200] Hz.
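The rotational jitter can be reproduced with a simple sketch: multiply the samples by e^{jφ(t)}, with φ(t) modeled here as a Wiener (random-walk) phase — one common phase-noise model; the step size is an illustrative assumption:

```python
import numpy as np

def add_phase_noise(x, std_per_sample=0.01, seed=0):
    """Multiply samples by e^{j*phi[n]}, where phi[n] is a random walk
    (cumulative sum of small Gaussian phase increments)."""
    rng = np.random.default_rng(seed)
    phi = np.cumsum(rng.normal(0.0, std_per_sample, size=len(x)))
    return x * np.exp(1j * phi)

tone = np.exp(1j * 2 * np.pi * 0.1 * np.arange(1000))   # clean carrier
noisy = add_phase_noise(tone)
# Magnitude is preserved; only the phase jitters, rotating the constellation.
```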

    3.1.3 Carrier frequency and phase offset

    The accuracy of the crystal oscillators used to generate the carrier frequency is specified in parts per million (ppm). The difference between the transmitter and receiver carrier frequencies is referred to as the carrier frequency offset (CFO). Due to CFO, the received signal spectrum is shifted by a frequency offset:

    y(t) = x(t) e^{j2π(f_Tx − f_Rx)t} = x(t) e^{j2π Δ_CFO t}     (3.1)

    where Δ_CFO is the frequency shift introduced between the transmitter and receiver.
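Equation (3.1) translates directly into code; this minimal sketch applies a frequency offset to sampled baseband data (the sample rate and offset values are illustrative):

```python
import numpy as np

def apply_cfo(x, delta_cfo_hz, fs_hz):
    """Apply Eq. (3.1) to sampled data: y[n] = x[n] * e^{j*2*pi*dCFO*n/fs}."""
    n = np.arange(len(x))
    return x * np.exp(1j * 2 * np.pi * delta_cfo_hz * n / fs_hz)

fs = 1e6                                  # 1 MS/s sample rate
x = np.ones(1000, dtype=complex)          # constant baseband signal
y = apply_cfo(x, delta_cfo_hz=1000.0, fs_hz=fs)
# A 1 kHz offset appears as a phase ramp of 2*pi*1000/1e6 rad per sample.
```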

The phase shift difference is defined as the phase shift from one constellation point to a neighboring one. The uniqueness of the CFO and phase offset in each transceiver pair makes them excellent features for the classification of devices. Although orthogonal frequency division multiplexing (OFDM) uses

different modulation techniques and each technique produces a specific constellation, most of the constellations share some commonalities. For example, the phase shifts from one symbol to the next are created in a similar way in hardware and are transmitter dependent. Thus, for the sake of simplicity, we use quadrature phase shift keying (QPSK) as an example and consider features extracted from the QPSK constellation shown in Fig. 3.5. In QPSK, four symbols with different

Figure 3.6: AM/AM distortion.

Figure 3.7: AM/PM distortion.

phases are transmitted and each symbol is encoded with two bits. The phase difference between two consecutive symbols is ideally 90°. However, the transmitter amplifiers for the I and Q paths might differ, so the phase shift can vary. The constellation may deviate from its original position due to hardware variability, and different devices may have different constellations. Therefore, the phase shift can be considered a key feature.

    3.1.4 Harmonic distortions

    The harmonics in a transmitted signal are caused by nonlinearities in the transmitter-side

    amplifiers. These harmonics are unique to the transmitting device. Harmonic distortion is measured

    in terms of total harmonic distortion, which is a ratio of the sum of the powers of all harmonic

    components to the power of the fundamental frequency of the signal. This distortion is usually

    expressed in either percent or in dB relative to the fundamental component of the signal.
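Under the definition above, THD can be estimated from an FFT. The sketch below uses a synthetic signal with a hypothetical 10% second harmonic, purely for illustration:

```python
import numpy as np

def thd_percent(signal, fs, f0):
    """Total harmonic distortion: ratio of summed harmonic power
    (2*f0, 3*f0, ...) to the power at the fundamental f0, in percent."""
    spec = np.abs(np.fft.rfft(signal))
    bin_of = lambda f: int(round(f * len(signal) / fs))
    fund_power = spec[bin_of(f0)] ** 2
    harm_power = sum(spec[bin_of(k * f0)] ** 2
                     for k in range(2, 5) if k * f0 < fs / 2)
    return 100 * np.sqrt(harm_power / fund_power)

fs, f0, n = 1000.0, 50.0, 1000
t = np.arange(n) / fs
# Fundamental plus a second harmonic at 10% amplitude -> THD of about 10%
x = np.sin(2 * np.pi * f0 * t) + 0.1 * np.sin(2 * np.pi * 2 * f0 * t)
```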

    3.1.5 Power amplifier distortions

    Power amplifier (PA) non-linearities mainly appear when the amplifier is operated in its

    non-linear region, i.e., close to its maximum output power, where significant compression of the

    output signal occurs. The distortions of the power amplifiers (PA) are generally modeled using

    AM/AM (amplitude to amplitude) and AM/PM (amplitude to phase) curves. If we consider a complex


    baseband signal x(t) = a(t)ejφ(t) , the output of the PA can be written

    yPA(t) = AM(a(t))ej[φ(t)+PM(a(t))] (3.2)

    where AM(a(t)) is the AM/AM function describing the PA output amplitude as a function of the

    input signal amplitude, and PM(a(t)) is the AM/PM function describing the PA output phase as a

    function of the input signal amplitude.
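Equation (3.2) can be sketched with simple illustrative AM/AM and AM/PM curves. The tanh compression and quadratic phase term below are hypothetical stand-ins, not the cubic-polynomial model discussed next:

```python
import numpy as np

def pa_output(x, sat=1.0, pm_coeff=0.2):
    """y_PA(t) = AM(a(t)) * exp(j*(phi(t) + PM(a(t)))), per Eq. (3.2).
    AM: soft-limiting tanh compression; PM: amplitude-dependent rotation."""
    a = np.abs(x)                  # input amplitude a(t)
    phi = np.angle(x)              # input phase phi(t)
    am = sat * np.tanh(a / sat)    # AM/AM: gain compression near saturation
    pm = pm_coeff * a ** 2         # AM/PM: phase shift grows with a(t)
    return am * np.exp(1j * (phi + pm))

x = 0.9 * np.exp(1j * np.pi / 4)   # a corner-like symbol near saturation
y = pa_output(x)
# |y| < |x| (gain compression) and angle(y) > angle(x) (AM/PM rotation)
```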

The AM/AM conversion causes amplitude distortion, whereas the AM/PM conversion introduces a phase shift. As shown in Fig. 3.6, the corner points of the constellation have moved toward the origin due to amplifier gain compression. In Fig. 3.7, the constellation has rotated due to the AM/PM conversion. The nonlinearity of the amplifier is modeled using cubic polynomial and hyperbolic tangent methods, parameterized by the third-order input intercept point (IIP3). IIP3, expressed in dBm, is a scalar specifying the third-order intercept point.

    3.2 Data Collection

    Figure 3.8: Data collection using SDR.

Data collection is the first and foremost step in machine learning. The performance of our predictive model depends on the quality and quantity of the data gathered, which makes data collection a critical step. Our deep learning approach to RF fingerprinting requires ample data for the training to be effective. In our case, we generate raw I/Q samples, transmit them over the air, and collect them at the receiver. We collect millions of samples


from each of the devices and generate the corresponding class labels. For data collection at the receiver end, we use a fixed USRP B210. For the transmitters, we use four different devices, i.e., USRP B210s and X310s. Fig. 3.8 shows raw I/Q data collection using the SDRs.

    3.2.1 Protocols of operation

    We transmit different physical layer frames defined by the IEEE 802.11ac and LTE

    standards (as parameters defined in technical specification 36.141) on each transmitter SDR. These

frames are generated using the MATLAB WLAN System Toolbox and LTE System Toolbox, which provide standard-compliant functions for waveform generation. The generated data frames carry random payloads, since we do not intend to transmit any particular data stream. Once the waveforms are

    generated, these protocol frames are streamed to the selected SDR for transmission, considering

    separately the cases of over-the-air wireless propagation and through RF cable. The latter approach

    eliminates wireless channel effects and captures the signals as they are modified by the transmitter.

The receiving SDR samples the incoming signals at a 1.92 MS/s sampling rate, with a center frequency of 2.45 GHz for WiFi and 900 MHz for LTE. Ultimately, we study the performance of different

    learning algorithms, including linear support vector machine (SVM), logistic regression, and CNNs,

    using I/Q samples collected from an experimental setup of USRP SDRs.

    Figure 3.9: Experimental setup demonstrating data capture.


As shown in Fig. 3.9, the host computer, equipped with the MATLAB WLAN and LTE toolboxes, generates waveforms and transmits them through an X310 USRP. These waveforms are received by another USRP, a B210, which is connected over a high-speed link to a second computer that has all the required MATLAB packages to receive and store the raw I/Q samples. The workstations have typical configurations: a Core i7 processor, 8 GB RAM, and flash-based 512 GB storage. Data is collected using different B210/X310 USRPs at the transmitter end, while the receiver is kept fixed. The experiments are repeated over distances from 2 ft to 50 ft in 4 ft increments. Overall, we collect approximately 20 million samples for each of the five SDRs at each distance.

    3.2.2 Storage and processing

The samples are further analyzed offline on Northeastern's Discovery cluster, located at the Massachusetts Green High Performance Computing Center (MGHPCC). It provides high-end research computing resources such as centralized high performance computing (HPC) clusters, storage, visualization, and software. There are 30,352 compute cores shared across all users, and the cluster is accessed via ssh (secure shell). The partitioning of the Discovery cluster

into dedicated CPU and GPU nodes is shown in Fig. 3.10.

Figure 3.10: Discovery cluster partitioning.

The configuration details of the nodes we use are as follows. Each CPU node has 2x Intel Xeon E5-2680 v4 CPUs @ 2.40 GHz (28 physical / 56 logical cores) and 500 GB RAM, whereas each GPU node has 4x NVIDIA Tesla K80 boards, each with 4992 CUDA cores @ 560 MHz and 24 GB of GDDR5 memory. These GPU servers are on a 10 Gb/s TCP/IP backplane.

    The collected complex I/Q samples are partitioned into subsequences in the cluster environment


before being passed to the classifiers. For our experimental study, we set a fixed subsequence length of 128; additional details of data preprocessing are provided in Chapter 4.

    3.2.2.1 Signal metadata format

The Signal Metadata Format (SigMF) is a standard way to store signal data [14]. Deep learning works best when large amounts of data are available. Since deep learning in the RF domain is in a nascent stage, sharing these datasets is important in order to reproduce experimental results and to provide access to users who do not have the tools/equipment required to generate the datasets. SigMF is a method of sharing metadata descriptions of captured signal data, written in JSON. It stores signal data using two files:

• A JSON-format text file, which is made up of:

  • Core data namespaces: give general file information

    • global: includes information applicable to the whole recording, such as a description of the SigMF recording, the hardware used to make the recording, the sample rate, and the data file format

    • capture: provides parameters of the signal capture, such as the center frequency of the signal and the sample index at which the segment takes effect

    • annotations: includes signal data that is not part of captures and global, such as the number of samples that each segment applies to and the frequency of the lower/upper edge of the feature

  • Extension namespace: used to define fields that are not in the core namespace, i.e., capture details such as:

    • signal reference number: the sequential label for signals in a data file
    • the type of RF transmitter
    • the manufacturer of the transmitter
    • the source of the RF signal

• A binary file, where I/Q samples are stored as defined in the 'datatype' field in the metadata file, for example ci16: complex 16-bit integer data

We encourage storing all signal datasets in widely accepted formats such as SigMF as a standard practice.
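As an illustration, a minimal SigMF-style metadata record, following the global/captures/annotations layout described above, can be serialized as JSON. The field values below are hypothetical, patterned on our capture setup:

```python
import json

# Hypothetical metadata for one WiFi capture (illustrative values only)
meta = {
    "global": {
        "core:datatype": "ci16",          # complex 16-bit integer I/Q
        "core:sample_rate": 1920000,      # 1.92 MS/s
        "core:hw": "Ettus USRP B210 (receiver)",
        "core:description": "RF fingerprinting capture, WiFi frames",
    },
    "captures": [
        {"core:sample_start": 0, "core:frequency": 2450000000}
    ],
    "annotations": [
        {"core:sample_start": 0, "core:sample_count": 20000000}
    ],
}

text = json.dumps(meta, indent=2)   # contents of the metadata text file
restored = json.loads(text)         # round-trips losslessly
```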


  • Chapter 4

    Deep learning for RF fingerprinting

Assume a set of wireless devices placed in a room, where the task for one of the devices is to uniquely identify the rest. The identification is based purely on the inherent hardware characteristics of the devices, which can be used as their unique signatures. To enable the task of RF fingerprinting, we collect raw data from all the devices and build a model that can effectively perform the classification. Different learning algorithms, such as SVM, logistic regression, and CNNs, are used to fit the data. Based on the preliminary results, we find the CNN to be the best-performing model compared to the other, conventional ML algorithms. CNNs shine on complex problems such as image classification, natural language processing, and speech recognition, and from our analysis we can say that they are the most suitable choice for RF fingerprinting as well. RF fingerprinting using a CNN removes one of the major hurdles of conventional approaches: feature engineering. Deep learning offers algorithms that learn features, so we do not have to hand-select features of interest. The major challenge we faced with this approach was finding a model that fits our data nearly perfectly. The components of the CNN architecture, parameter selection, and hyperparameter tuning are presented in this chapter.

    4.1 Initial studies on ML techniques

As part of our preliminary experiments, we started with shallow (single-layer) supervised learning classifiers such as the linear support vector machine (SVM) and logistic regression [15]. Several features, such as amplitude, phase, and FFT values, along with the mean, standard deviation, normalized phase, and absolute normalized frequency components, are extracted from the I/Q samples to build a rich


  • CHAPTER 4. DEEP LEARNING FOR RF FINGERPRINTING

    set of features to train the classifiers. The frequency components of the samples are computed using

    the FFT function in MATLAB.

    4.1.1 Support vector machines

The SVM classifier is a supervised ML approach used for classification problems. It is based on finding the hyperplane that separates two classes. The best hyperplane is the one with the largest margin between the closest data points and the hyperplane. Selecting the best hyperplane is necessary to ensure robustness in the classification. For a dataset of points xj ∈ R^d

    and corresponding labels yj ∈ {−1, 1}, j = 1, . . . , N , the hyperplane is given by

    f(x) = x′β + b = 0 (4.1)

The optimal hyperplane is obtained by finding the β ∈ R^d and b ∈ R that minimize

‖β‖₂² + C ∑_{j=1}^{N} ζ_j (4.2)

    subject to the constraints that

yj f(xj) ≥ 1 − ζj (4.3)

for all data points (xj , yj), where ζj ≥ 0 are the slack variables; in our experiments C is set to 1. We use existing libraries/packages to implement support vector machines. To fit a separating hyperplane to the data, we use a linear SVC (support vector classifier). Python offers a huge set of libraries; for our training we use the scikit-learn library, which provides functions like LinearSVC (linear kernel) to perform classification. The precomputed features are fed into this classifier along with the label information to evaluate the performance of the SVM. We chose LinearSVC since it offers flexibility in choosing parameters. We also used the squared hinge loss function and an ℓ2 regularizer to prevent overfitting. These choices help the model scale to larger datasets and converge faster.
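A minimal scikit-learn sketch of this training step looks as follows. The features here are synthetic stand-ins, not our actual extracted features:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the extracted features: two device "classes"
# separated by a mean offset in a 10-dimensional feature space.
X = np.vstack([rng.normal(0.0, 1.0, (200, 10)),
               rng.normal(1.5, 1.0, (200, 10))])
y = np.array([0] * 200 + [1] * 200)

# Linear kernel, squared hinge loss, l2 penalty, C = 1 (as in our setup)
clf = LinearSVC(C=1.0, loss="squared_hinge", penalty="l2")
clf.fit(X, y)
train_acc = clf.score(X, y)   # fraction of correctly classified points
```

In practice, the feature matrix X would hold the amplitude, phase, and FFT-derived features described above, one row per I/Q subsequence.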

    4.1.2 Logistic regression

Logistic regression is another supervised learning algorithm, which transforms its output using the logistic sigmoid function. This function, the core of logistic regression, squashes any real value into a value between 0 and 1. Each returned probability value can then be mapped to two or more discrete classes. Logistic regression can be thought of as a single-neuron dense neural


network. In logistic regression, yj is again binary in {−1, +1} and

P(yj = +1) = σ(β′xj + b) = 1 / (1 + e^{−(β′xj + b)}) (4.4)

We use the scikit-learn library to train the model in Python. The algorithm learns the regression variables β and b by minimizing the cross-entropy loss between each label yj and the prediction ŷj. Overfitting is handled using an ℓ2 regularizer. New data points x are classified based on σ(β′x + b). Classification is performed on three different datasets: the first task is to classify devices that operate on WiFi, the second is to identify devices that use LTE, and the last combines data from devices that use both WiFi and LTE. For each of these cases, the data is divided into three parts: training, validation, and testing sets. Ultimately, the performance of each classifier is measured on the testing data, which is not seen by the trained model.
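The corresponding logistic-regression step, including the held-out test split, can be sketched in scikit-learn (again on synthetic stand-in features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (300, 10)),
               rng.normal(1.5, 1.0, (300, 10))])
y = np.array([0] * 300 + [1] * 300)

# Hold out unseen test data, as described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

# Cross-entropy loss with an l2 penalty (scikit-learn's default)
clf = LogisticRegression(penalty="l2").fit(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)     # accuracy on unseen data
proba = clf.predict_proba(X_te[:1])  # sigmoid-derived class probabilities
```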

Fig. 4.1 shows the accuracy obtained through cross-validation on the validation data using both SVM and logistic regression. Results were obtained for various combinations of devices over the air

for both WiFi and LTE, respectively. In Fig. 4.1, we also report the accuracy of identifying different protocols.

Figure 4.1: Device classification using Logistic Regression and Linear SVM for WiFi and LTE.

Being able to detect a protocol considerably reduces the number of

    feasible constellations supported by the protocol, which in turn influences the constellation type and


structure. One important thing to note is that SVM and logistic regression are both able to achieve high accuracies (≈ 90%) for the simpler task of protocol detection, compared to a device recognition accuracy of less than ≈ 60%.

    4.2 Convolutional neural networks

Convolutional neural networks (ConvNets or CNNs) are a category of neural networks that have been found to be very effective in areas such as image recognition and classification [16]. The success of CNNs in recognizing faces, objects, and speech, as well as empowering vision in robots, motivates our investigation of these networks for radio fingerprinting. Our first challenge was to understand what these neural networks are made of and how they can be used to achieve our task. An artificial neural network (ANN) is a model inspired by the neurons in the human brain.

The computation of an ANN is similar to that of the brain, with the neuron as the basic unit of computation. Each neuron receives input either from an external source or from other neurons. Each input to a neuron is associated with a weight, assigned based on that input's importance relative to the other inputs. The neuron applies a nonlinear function, called an activation function, to the weighted sum of its inputs and computes an output. Using an activation function is important because most data is nonlinear, and the activation function introduces nonlinearity into the neuron's output [17]. The three most important activation functions are:

• Sigmoid: takes the input and maps it to a value between 0 and 1

• ReLU: the rectified linear unit takes the input and replaces negative values with zero, i.e., it outputs the maximum of the input and zero

• tanh: takes the input and maps it to a value in the range [−1, 1]

A neural network is made up of an input layer, multiple interconnected neurons in the middle layers (called hidden layers), and an output layer. CNNs are similar to ordinary neural networks but are made up of multiple hidden layers and fully connected layers. Additionally, a CNN slides a filter across the input dimensions, with the filter's weights shared across all positions in that particular layer. This results in far fewer parameters than in regular fully connected neural networks.
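The three activation functions listed above are one-liners in NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # maps into (0, 1)

def relu(x):
    return np.maximum(x, 0.0)         # zeroes out negative values

def tanh(x):
    return np.tanh(x)                 # maps into (-1, 1)

z = np.array([-2.0, 0.0, 3.0])        # example pre-activation values
```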


    4.2.1 CNN architecture

    The proposed method consists of two stages, i.e., a training stage and an identification

    stage. In the former, the CNN is trained using raw IQ samples collected from each SDR transmitter

    to solve a multi-class classification problem. In the identification stage, raw I/Q samples of the

    unknown transmitter are fed to the trained neural network and the transmitter is identified based

    on observed value at the output layer. In this section, we first describe the CNN architecture and

    then present preprocessing of input data necessary to improve the performance. There exists several

    CNN architectures namely LeNet, ResNets, AlexNet, GoogleNet, VGGNet, ZFNet, DenseNet. Our

    CNN architecture is inspired in part by AlexNet [18], which shows remarkable performance in

    image recognition. As shown in the Fig. 4.2, our network has four layers, which consists of two

    convolutional layers and two fully connected or dense layers. Our goal is to first understand how the

    Figure 4.2: CNN architecture for RF fingerprinting.

layers are stacked and the functional operation of the layer components. The most difficult challenge in building a CNN is deciding how many layers to use, how many filters/kernels to use in each layer, and what the filter sizes and the values of padding and stride should be. None of these are standard, and


the complexity of the network depends on the type of data and its processing. A lot of effort was spent experimenting with different parameters and ultimately finding the combination of these hyperparameters that generalizes well on our data.

We describe the various CNN components and hyperparameters in detail in this chapter. The input to the CNN is a windowed sequence of raw I/Q samples of length 128. Each complex value is represented as two real values, so the dimension of our input data grows to 2 × 128. This is then fed to the first convolution layer.

    4.2.1.1 Convolution layer

    The convolution layer is the core building block of the CNN, whose primary purpose is

    to extract features from the input data. It consists of a set of spatial filters (also called kernels, or

    simply filters) that perform a convolution operation over input data. The operation of the convolution

    filter is shown with an example in Fig. 4.3 for intuitive understanding. A filter of size 2 × 2 is

    Figure 4.3: Convolution operation: filters strided over input sequences.

convolved with input data of size 4 × 4 by sliding across its dimensions. The convolution computes the element-wise multiplication between the input matrix and the filter matrix and then sums all the products to produce a single value in the output matrix. Such a convolution is performed over the entire input to produce a two-dimensional feature map (activation map). The next hyperparameter is the stride, which controls how the filter moves across the input data. In Fig. 4.3, we set the stride to 1, i.e., the filter convolves around the entire input matrix by shifting one value at a time. In general, the stride is the sliding interval of the filter and determines the dimension of the feature map. Our example produces a feature map of dimension 3 × 3 at the end of the


    convolution. In our architecture, each convolution layer consists of a set of such filters, which in turn

    operates independently to produce a set of two-dimensional feature maps.
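The 4 × 4 input / 2 × 2 filter / stride-1 example of Fig. 4.3 can be reproduced in a few lines. The input and filter values below are arbitrary examples; the sketch implements the "valid" convolution described above:

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Valid convolution of filter w over input x, as in a CNN layer:
    element-wise multiply each patch by the filter, then sum."""
    fh, fw = w.shape
    oh = (x.shape[0] - fh) // stride + 1
    ow = (x.shape[1] - fw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + fh,
                      j * stride:j * stride + fw]
            out[i, j] = np.sum(patch * w)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 input
w = np.array([[1.0, 0.0], [0.0, 1.0]])         # 2x2 filter
fmap = conv2d(x, w)                            # 3x3 feature map
```

A 4 × 4 input with a 2 × 2 filter and stride 1 indeed yields a 3 × 3 feature map, matching the example in the text.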

    4.2.1.2 ReLU activation

Convolution is a linear operation involving element-wise multiplications and additions. Therefore, to introduce nonlinearity into the system, ReLU (rectified linear unit) layers are used after each convolution layer. Their main function is to apply a fixed nonlinear transformation to each element of the feature map. There are many possible activation functions, such as sigmoid and tanh; we use the ReLU function, as CNNs with ReLU train faster than the alternatives and with greater computational efficiency. ReLU also reduces the vanishing gradient problem, in which network training slows down because the gradients shrink exponentially toward zero. Mathematically, it is expressed as:

    f(x) = max(0, x) (4.5)

    Figure 4.4: Rectified Linear Unit (ReLU) operation performed on feature maps.

    As shown in Fig. 4.4, ReLU outputs max(x, 0) for an input x, replacing all negative

    activations in the feature map by zero.

    4.2.1.3 Pooling layers

The convolution layer is generally followed by a pooling layer, whose functionality is to (a) introduce shift invariance and (b) reduce the dimensionality of the rectified feature maps of the preceding convolution layer, while retaining the most important information. We choose a pooling


layer with filters of size 2 × 2 and stride 2, which downsamples the feature maps by 2 along both dimensions. Among the different filter operations (such as average and sum), max pooling gives the best performance. As shown in Fig. 4.5, max pooling of size 2 × 2 with stride 2 selects the maximum element in each non-overlapping region (shown with different colors). We apply the pooling operation separately to each of the feature maps. Thus, it reduces the dimensionality of the feature maps,

    Figure 4.5: An illustration of max pooling operation.

which in turn reduces the number of parameters and computations in the network and controls overfitting. Additionally, it makes the network invariant to small transformations in the input data.
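Max pooling with a 2 × 2 window and stride 2 can be sketched as follows (the feature-map values are arbitrary example numbers):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Select the maximum element in each non-overlapping 2x2 region,
    halving the feature map along both dimensions."""
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2] \
        .reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1.0, 3.0, 2.0, 4.0],
                 [5.0, 6.0, 7.0, 8.0],
                 [3.0, 2.0, 1.0, 0.0],
                 [1.0, 2.0, 3.0, 4.0]])
pooled = max_pool_2x2(fmap)   # 2x2 output: maximum of each 2x2 region
```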

    4.2.1.4 Fully connected layers

A fully connected or dense layer is a traditional multilayer perceptron (MLP), where the neurons have full connections to all activations in the previous layer, as in regular neural networks. The output of the second pooling layer is provided as input to the fully connected layer. Its primary purpose is to perform the classification task on the high-level features extracted by the preceding convolution layers. At the output layer, a softmax activation function is used. The classifier with the softmax activation function outputs probabilities (e.g., [0.9, 0.09, 0.01] for three class labels), i.e., it ensures that the probabilities from the fully connected layer sum to 1. To sum up, the convolution and pooling layers act as feature extractors on the input data, while the fully connected (dense) layers perform the classification based on these features. The network architecture for our RF fingerprinting is shown in Table 4.1.
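The softmax step can be sketched as follows; by construction its outputs always sum to 1 (the logit values are an arbitrary three-class example):

```python
import numpy as np

def softmax(logits):
    """Convert the final dense layer's raw scores into class
    probabilities that sum to 1 (shifted for numerical stability)."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, -1.0, 0.5]))   # three-class example
predicted = int(np.argmax(probs))             # class with highest score
```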

    Next, we discuss the selection of hyperparameters of CNN to optimize the performance,

    followed by preprocessing of input data necessary for proper operation of CNN and finally shift-


Table 4.1: CNN architecture.

Layer       Output dimensions
Input       2 × 128
Conv1       50 × 128
Conv2       50 × 128
FC/ReLU     256
FC/ReLU     80
FC/Softmax  4

    invariance property of our classifier.

    4.2.1.5 Model selection

We start with a baseline architecture consisting of two convolution layers and two dense layers, then progressively vary the hyperparameters to analyze their effect on performance. The first parameter is the number of filters in the convolution layers. We observed that filter counts in the range 30–256 provide reasonably similar performance. However, since the number of computations increases with the number of filters, we use 50 filters in both convolution layers to balance performance and computational cost. Similarly, we set 1 × 7 and 2 × 7 as the filter sizes in the first and second convolution layers, respectively, since larger filter sizes do not offer significant performance improvement. Furthermore, increasing the number of convolution

layers from 2 to 4 shows no improvement in performance, which justifies continuing with two convolution layers. We then analyze the effect of the number of neurons in the first dense layer by varying it between 64 and 1024. Interestingly, we find that increasing the number of neurons beyond 256 does not improve performance. Therefore, we set 256 neurons in the first dense layer. Throughout this parameter selection, we observe that using a single fully connected layer, or increasing the number of neurons to as many as 1024, increases the model complexity and slows down training. Overfitting is one of the major problems during network training, in which the network weights become so well tuned to the training examples that the network fails to perform well on unseen data. We therefore take measures to alleviate overfitting. We use a dropout layer, whose main function is to drop a random set of activations in a specific layer by setting them to zero. This makes the network more robust and ensures that it does not fit the training data too closely. After finalizing the architecture and parameters of the CNN, we carefully


select the regularization parameters as follows: we use a dropout rate of 50% at the dense layers. In addition, we use an ℓ2 regularization parameter λ = 0.0001 to avoid overfitting.

    4.2.1.6 Preprocessing data

Our experimental studies on different representative classes of ML algorithms demonstrate significant performance improvement from choosing a deep CNN. However, to ensure scalable performance over a large number of devices, our CNN architecture needs to be modified. In addition, our input I/Q sequences, which represent a time trace of collected samples, need to be suitably partitioned and augmented beyond a stream of raw I/Q samples. Our classifiers operate on sequences of I/Q samples of a fixed length. In general, given a sequence of length L, we can create N = L/ℓ subsequences of length ℓ by partitioning the input stream, or L − ℓ subsequences by sliding a window of length ℓ over the larger sequence (or stream) of I/Q samples. Training classifiers over small subsequences leads to more training data points, which in turn yields a low variance but potentially high bias in the classification result. Conversely, large sequences may lead to high variance and low bias. We set the sequence length to 128. From a wireless communications viewpoint, the channel remains invariant over small durations of time. Hence, the ability to operate on smaller subsequences carved out of in-order received samples allows us to estimate the complex coefficients representing the wireless channel. We train our classifiers over the input I/Q sequences by treating the real and imaginary parts of each sample as two inputs, leading to a training vector of 2 × ℓ samples for a sequence of length ℓ.
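The two windowing strategies, and the 2 × ℓ real-valued training vectors, can be sketched as follows (with a unit sliding step, the exact count of windows is L − ℓ + 1; the random stream below is a synthetic stand-in for captured I/Q samples):

```python
import numpy as np

ELL = 128                       # subsequence length used in our study

def partitioned_windows(iq, ell=ELL):
    """Non-overlapping partition: L // ell subsequences of length ell."""
    n = len(iq) // ell
    return iq[:n * ell].reshape(n, ell)

def sliding_windows(iq, ell=ELL):
    """Sliding window with step 1: L - ell + 1 subsequences."""
    return np.lib.stride_tricks.sliding_window_view(iq, ell)

def to_training_vectors(windows):
    """Stack real and imaginary parts, giving a 2 x ell input per window."""
    return np.stack([windows.real, windows.imag], axis=1)

iq = (np.random.default_rng(0).normal(size=1000)
      + 1j * np.random.default_rng(1).normal(size=1000))
parts = partitioned_windows(iq)   # shape (7, 128)
slides = sliding_windows(iq)      # shape (873, 128)
X = to_training_vectors(slides)   # shape (873, 2, 128)
```

The sliding variant yields over a hundred times more training points here, which is the data-augmentation effect discussed in the next subsection.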

    4.2.1.7 Shift invariance

Another prominent characteristic of our CNN classifier, both with respect to our final goal of identifying the transmitting device and in terms of feature extraction, is shift invariance. In short, all effects like I/Q imbalance, phase noise, carrier frequency and phase offset, power amplifier nonlinearity, and harmonic distortion can occur at an arbitrary position in a given I/Q sequence. A classifier should be able to detect a device-specific impairment irrespective of whether it occurs at, e.g., the 1st or 15th position of an I/Q sequence. The convolved weights in each layer detect signals at arbitrary positions in the sequence, and a max-pooling layer passes the presence of a signal to a higher layer irrespective of where it occurs.

To enhance the shift-invariance property of our classifier during training, we train it over sliding windows of length ℓ as shown in Fig. 4.6, rather than partitioned windows: this further biases the


    trained classifiers to shift-invariant configurations. In our initial experiments, we verified the efficacy

Figure 4.6: An illustration of the sliding operation using a window of length 128.

of using a sliding window by comparing the performance of our CNN against data preprocessed with partitioned windows, and we observed improved performance with the sliding window. Finally, since deep learning performs well with large amounts of data, it was evident from our analysis that the sliding window is an efficient means of data augmentation.


  • Chapter 5

    Results and performance evaluation

    5.1 Network setup

The performance of the CNN architecture for RF fingerprinting is analyzed on the raw I/Q samples collected from the USRPs. We use MATLAB as the host-based software to interact with the USRP radios. Once the data is collected at the receiver end, the samples are first partitioned into subsequences on Northeastern's Discovery cluster. The software packages and their structure are shown in Fig. 5.1. The core software implementation is in Python, which is easier to read and write than many other programming languages and offers a wide variety of standard libraries and built-in functions. In addition, many third-party open source libraries offer high-end modules for a wide range of applications. The compute nodes in the Discovery cluster are equipped with CUDA, a parallel computing platform and programming model by NVIDIA for computing on graphics processing units (GPUs). We implement our CNN training and classifier in Keras, a model-level library that provides building blocks for deep learning [19]. In the backend, we use the TensorFlow library, a specialized, well-optimized tensor manipulation library for high-dimensional matrix operations. We install these packages in Anaconda, an open source Python distribution widely used for machine learning applications, which eases package and environment management and deployment. All of these packages run on an NVIDIA CUDA-enabled Tesla K80m GPU, which is our platform for training and evaluation.


    Figure 5.1: Software stack.

    5.2 Evaluation

Our CNN implementation has a network depth of 5 layers, with 50 filters in layers 1

and 2, 256 neurons in layer 3, 80 neurons in layer 4, and a final classifier with 4 neurons. Each

convolution layer is followed by a max-pooling layer with pool size 2. We calculate the total error at

the output neurons and propagate this error back through the network using backpropagation to

calculate gradients. Thus the essential task during training is finding the right set of weights that

fits the data and classifies devices correctly by reducing the error at the output layer. This is

done using optimizers, whose basic purpose is to update the weights using their gradients. In our network

we use Adam, an optimization method well suited for problems that are large in terms of

data and parameters. Here we must also consider another parameter, the learning rate, which

decides by how much the network weights are updated by their gradients. The learning rate must be

chosen carefully: if it is too high, the network learns faster but risks diverging and never

reaching a minimum. On the other hand, if the learning rate is too low,


then the network learns too slowly and may take days to converge. The optimizer works hand in

hand with the learning rate: it decides how to use the current weight gradients, along with previous

weight gradients, to determine each update. The Adam optimizer uses the gradients to find an adaptive

learning rate for each individual weight (parameter), unlike stochastic gradient descent, where a single

learning rate is set for all weight updates, i.e., the learning rate does not change during training. The

next parameter is the batch size, which defines the number of examples taken from the dataset

at once to perform an optimization step. These parameters are chosen by progressively varying

their values and analyzing the effect on performance. The following steps summarize the network

training process:

• The first step is the initialization of filters and weights. It is done using the Glorot uniform initializer, also called the Xavier uniform initializer, which draws samples from a uniform distribution within [-limit, limit], where

limit = sqrt(6 / (fan_in + fan_out)) (5.1)

and fan_in is the number of input units in the weight tensor and fan_out is the number of

output units in the weight tensor.

• Training data is passed as input to the network and goes through the forward-propagation step (convolution, ReLU, and pooling operations, along with the fully connected layers) to find the

output probabilities for each of the classes.

• The total error is calculated at the output layer using the categorical cross-entropy loss function, which internally uses the softmax function.

• Gradients of the error are calculated with respect to all the weights in the network, and the Adam optimizer updates all filter weights and parameters to minimize the output error.
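The training setup above can be sketched in Keras roughly as follows (the input shape of 128 I/Q samples with two channels, the use of 1-D convolutions, and the kernel size of 7 are illustrative assumptions; the text fixes only the filter and neuron counts, the pool size, the initializer, the loss, and the optimizer):

```python
from tensorflow.keras import layers, models, initializers

init = initializers.GlorotUniform()  # Xavier/Glorot uniform, as in Eq. (5.1)

model = models.Sequential([
    layers.Input(shape=(128, 2)),    # 128 I/Q samples, 2 channels (I and Q)
    layers.Conv1D(50, 7, activation="relu", kernel_initializer=init),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(50, 7, activation="relu", kernel_initializer=init),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(256, activation="relu", kernel_initializer=init),
    layers.Dense(80, activation="relu", kernel_initializer=init),
    layers.Dense(4, activation="softmax"),  # one output per device
])

# Adam adapts a per-parameter learning rate; categorical cross-entropy
# with the softmax output gives the total error at the output layer.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```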

Finally, we evaluate the performance of our CNN using the k-fold cross-validation

technique, with k set to 5. The training dataset is split into 5

folds, and models take turns training on all folds except one, which is held out. This is followed

by evaluating model performance on the held-out fold, and the process is repeated until each

fold has served as the hold-out set. Thus we can measure the trained model's

performance on unseen data and avoid overfitting, obtaining a less biased estimate

of the model's performance. We used the StratifiedKFold class from the scikit-learn Python


machine-learning library to split the training dataset into 5 folds. Our training set consists of

≈ 720K training examples, with ≈ 80K examples for validation; we use another 200K examples for testing the performance of the trained model. We also represent the class labels associated with the

devices as binary vectors, since classification works better when the categorical variables are mapped

into binary values; this ensures equal importance is given to all devices. Training our model took ≈ 23 min, and performance evaluation on the hold-out dataset of 200K examples took only ≈ 2 min. Several metrics exist to evaluate model performance. Accuracy, the

proportion of correct classifications among all classifications, is not a good measure on its own: if

the data is imbalanced, a model may predict that every instance belongs to the

majority class and still score highly (e.g., 99%). Hence we do not rely solely on accuracy but also use a better metric, the Area

Under the Curve (AUC), which is evaluated on the Receiver Operating Characteristic (ROC) curve,

comprising the true positive rate on the Y-axis and the false positive rate on the X-axis.
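The cross-validation and AUC evaluation described above can be sketched with scikit-learn as follows (the toy feature matrix and the logistic-regression stand-in for the CNN are illustrative assumptions; note the binarized labels, as in our setup):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))        # toy stand-in for I/Q-derived examples
y = rng.integers(0, 4, size=400)      # 4 device classes
X[np.arange(400), y] += 2.0           # make the classes separable

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)   # stand-in for the CNN
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])
    # One-vs-rest AUC on the held-out fold, with labels as binary vectors
    y_bin = label_binarize(y[test_idx], classes=[0, 1, 2, 3])
    aucs.append(roc_auc_score(y_bin, scores, average="macro"))

print(np.mean(aucs))  # mean AUC across the 5 folds
```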

    5.2.1 CNN vs. conventional algorithms

We first measure the performance on our WiFi dataset using SVM and logistic regression

for the classification of nominally similar devices.

Figure 5.2: Accuracy comparison of SVM, logistic regression, and CNN for 2–5 devices.

We extract several features, such as amplitude, phase, and FFT values, along with the mean, standard deviation, normalized phase, and absolute normalized

frequency components of the raw I/Q samples, building a rich feature set to train the classifiers.

We obtain the classification accuracy for identification among 2, 3, 4, and 5 devices. As seen in


Fig. 5.2, the accuracy with the SVM and logistic regression algorithms for 2 devices is ≈ 55%, and it decreases further as the number of devices increases; the performance deterioration

is clearly visible in the figure.

We then train our CNN classifier on the raw data to classify the same set of devices. With

our deep CNN, we achieve 98% accuracy for five devices, as opposed to less

than ≈ 33% for the shallow-learning SVM and logistic regression algorithms.

    5.2.2 Receiver operating characteristics for radio fingerprinting

We obtained the false positive rate and true positive rate to measure the AUC. Figs. 5.3, 5.4, 5.5, 5.6,

and 5.7 show the ROC curves for the classification of four similar WiFi devices, one figure per fold of

cross-validation. The CNN model works extremely well, with AUC ranging between

0.93 and 1; the AUC attained for each device is 0.964, 0.936, 1, and 0.994, respectively, as shown in

Fig. 5.3. This demonstrates that the CNN is an effective model for radio fingerprinting. Additionally,

training our CNN over a large dataset with Keras takes significantly less time than

any of the aforementioned algorithms. To demonstrate this, Fig. 5.8 shows the computational load for

training, scaled as a function of the number of training examples and the estimated time per epoch

on average. Clearly, performance with the GPU is faster than with the CPU.

Figure 5.3: ROC curves (true positive rate vs. false positive rate), fold 1: B210 #1 (area = 0.96402), B210 #2 (area = 0.93601), B210 #3 (area = 1.00000), X310 #1 (area = 0.99461).


Figure 5.4: ROC curves, fold 2: B210 #1 (area = 0.96194), B210 #2 (area = 0.93165), B210 #3 (area = 1.00000), X310 #1 (area = 0.99391).

Figure 5.5: ROC curves, fold 3: B210 #1 (area = 0.96378), B210 #2 (area = 0.93489), B210 #3 (area = 1.00000), X310 #1 (area = 0.99431).


Figure 5.6: ROC curves, fold 4: B210 #1 (area = 0.96502), B210 #2 (area = 0.93441), B210 #3 (area = 1.00000), X310 #1 (area = 0.99504).

Figure 5.7: ROC curves, fold 5: B210 #1 (area = 0.96418), B210 #2 (area = 0.93030), B210 #3 (area = 1.00000), X310 #1 (area = 0.99475).

    5.2.3 Impact of distance on radio fingerprinting

We run experiments collecting data over distances ranging from 2 to 50 ft, in steps

of 4 ft, to evaluate the impact of distance (and the possible multipath effects owing to reflections) on


    Figure 5.8: Computational load.

Figure 5.9: Accuracy (%) obtained using the CNN for 4 devices over different distances between transmitter and receiver, plotted together with the observed and analytical SNR (dB).

classification accuracy. Fig. 5.9 shows the accuracy for the classification of 4 devices

using the CNN, which achieves classification accuracy greater than 95% up to a distance of 34 ft. In addition,

the observed SNR and the analytical SNR (calculated using the free-space path loss model) are shown in the

same plot to elucidate the effect of received SNR on classification accuracy. It is evident that the

classification is robust against the SNR fluctuations caused by path loss and multipath fading


up to a distance of 34 ft.
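The analytical SNR curve can be sketched from the free-space path loss model; the transmit power, carrier frequency, and noise floor below are hypothetical values chosen only for illustration, not our experimental settings:

```python
import math

def fspl_db(distance_m, freq_hz):
    """Free-space path loss in dB: 20*log10(d) + 20*log10(f) + 20*log10(4*pi/c)."""
    c = 3e8  # speed of light, m/s
    return (20 * math.log10(distance_m) + 20 * math.log10(freq_hz)
            + 20 * math.log10(4 * math.pi / c))

FT_TO_M = 0.3048
tx_power_dbm = 10.0       # hypothetical transmit power
noise_floor_dbm = -90.0   # hypothetical receiver noise floor
freq_hz = 2.45e9          # a WiFi-band carrier (assumption)

# Analytical SNR falls off with log-distance, as in the Fig. 5.9 overlay.
for d_ft in range(2, 51, 4):
    snr_db = tx_power_dbm - fspl_db(d_ft * FT_TO_M, freq_hz) - noise_floor_dbm
    print(f"{d_ft:3d} ft: analytical SNR = {snr_db:5.1f} dB")
```

Note the characteristic 6 dB loss per doubling of distance in this model, which explains the gentle SNR decline over the 2–50 ft range.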

    37

  • Chapter 6

    Conclusion

With the increase in demand for high-data-rate applications and the advances in the

IoT space enabling millions of devices to stay interconnected, wireless security has become a

crucial functionality. In addition, the available spectrum is limited relative to the enormous

number of mobile devices it must support. Novel techniques for identifying devices, and

thereby detecting malicious activity and gaining spectrum awareness, are therefore of great importance. Existing device-

fingerprinting approaches require feature engineering and do not scale efficiently to large

datasets. We propose a radio-fingerprinting approach based on a deep-learning CNN architecture

trained on I/Q sequence examples. Our design learns features embedded in the signal

transformations of wireless transmitters and identifies specific devices. Furthermore, we have shown

that our CNN-based device identification outperforms alternative ML techniques, such

as SVM and logistic regression, for the identification of four nominally similar devices. Finally, we

experimentally validate the performance of our design on a dataset collected over a range of distances,

2 ft to 50 ft, and observe that detection accuracy decreases as the distance between transmitter and

receiver increases. We also show how computational resources such as Keras running with GPU

support speed up the training. Our future work involves increasing the robustness of the CNN

architecture so that it scales to the correct identification of thousands of similar radios.

    6.1 Research challenges

We now summarize the challenges associated with implementing CNNs for radio

fingerprinting. In our experiments, we set the partition length to 128 through a rectangular windowing

process. However, identifying the optimal length is a critical research objective and should depend

on the channel coherence time. Different CNN architectures may lead to significantly different

results, and finding an optimal architecture that enhances device classification is an open research

issue. A related challenge is striking the right balance between training time and classification

accuracy: increasing the depth of the CNN beyond a point may not help classification, and in fact

risks overfitting the training set, as we found in some of our early experiments. Our

work focuses on training the model with actual experimental data, while a large body of earlier

work attempts to solve a similar problem using synthetic data. There exists no standard dataset to

benchmark the performance of our classifier, and releasing all datasets in widely accepted formats

such as SigMF is essential for correct replication of experiments. Our classifier performs very well

on a limited set of devices; however, identifying a large number of devices (thousands), and at wider

distances of 100–200 ft, may require major changes to the architecture and new

optimal parameters. Additionally, the effects of wireless channel conditions on classification

accuracy are yet to be studied. It is important to note that our technique relies on the fact that devices

can be identified uniquely based on their hardware imperfections, which leaves wide scope for

determining the kinds of features that can be learned in the wireless domain.


  • Bibliography

    [1] J. Mitola, “Software radio architecture: a mathematical perspective,” IEEE Journal on Selected

    Areas in Communications, vol. 17, no. 4, pp. 514–538, Apr 1999.

    [2] T. J. O’Shea and J. Corgan, “Convolutional radio modulation recognition networks,” CoRR, vol.

    abs/1602.04105, 2016. [Online]. Available: http://arxiv.org/abs/1602.04105

    [3] N. E. West and T. O’Shea, “Deep architectures for modulation recognition,” in 2017 IEEE

    International Symposium on Dynamic Spectrum Access Networks (DySPAN), March 2017, pp.

    1–6.

    [4] Q. Xu, R. Zheng, W. Saad, and Z. Han, “Device fingerprinting in wireless networks: Challenges

    and opportunities,” IEEE Communications Surveys Tutorials, vol. 18, no. 1, pp. 94–104,

    Firstquarter 2016.

    [5] J. Franklin, D. McCoy, P. Tabriz, V. Neagoe, J. Van Randwyk, and D. Sicker, “Passive data link

    layer 802.11 wireless device driver fingerprinting,” in Proceedings of the 15th Conference on

    USENIX Security Symposium - Volume 15, ser. USENIX-SS’06. Berkeley, CA, USA: USENIX

    Association, 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1267336.1267348

    [6] K. Gao, C. Corbett, and R. Beyah, “A passive approach to wireless device fingerprinting,” in

    2010 IEEE/IFIP International Conference on Dependable Systems Networks (DSN), June 2010,

    pp. 383–392.

    [7] I. O. Kennedy, P. Scanlon, F. J. Mullany, M. M. Buddhikot, K. E. Nolan, and T. W. Rondeau,

    “Radio transmitter fingerprinting: A steady state frequency domain approach,” in 2008 IEEE

    68th Vehicular Technology Conference, Sept 2008, pp. 1–5.

    [8] V. Brik, S. Banerjee, M. Gruteser, and S. Oh, “Wireless device identification with radiometric

    signatures,” in Proceedings of the 14th ACM International Conference on Mobile Computing


    and Networking, ser. MobiCom ’08. New York, NY, USA: ACM, 2008, pp. 116–127.

    [Online]. Available: http://doi.acm.org/10.1145/1409944.1409959

    [9] S. V. Radhakrishnan, A. S. Uluagac, and R. Beyah, “Gtid: A technique for physical device and

    device type fingerprinting,” IEEE Transactions on Dependable and Secure Computing, vol. 12,

    no. 5, pp. 519–532, Sept 2015.

    [10] T. J. O’Shea and J. Hoydis, “An introduction to machine learning communications systems,”

    CoRR, vol. abs/1702.00832, 2017. [Online]. Available: http://arxiv.org/abs/1702.00832

    [11] F. Chen, Q. Yan, C. Shahriar, C. Lu, W. Lou, and T. C. Clancy, “On passive wireless device

    fingerprinting using infinite hidden markov random field,” submitted for publication.

    [12] N. T. Nguyen, G. Zheng, Z. Han, and R. Zheng, “Device fingerprinting to enhance wireless

    security using nonparametric bayesian method,” in 2011 Proceedings IEEE INFOCOM, April

    2011, pp. 1404–1412.

    [13] S. U. Rehman, K. Sowerby, and C. Coghill, “Analysis of receiver front end on the performance

    of rf fingerprinting,” in 2012 IEEE 23rd International Symposium on Personal, Indoor and

    Mobile Radio Communications - (PIMRC), Sept 2012, pp. 2494–2499.

[14] The signal metadata format specification. [Online]. Available: https://github.com/gnuradio/SigMF

    [15] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics).

    Springer, 2006.

    [16] Cs231n convolutional neural networks for visual recognition. [Online]. Available:

    http://cs231n.github.io/convolutional-networks/

    [17] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, ser. Springer

    Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.

    [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional

    neural networks,” in Proceedings of the 25th International Conference on Neural Information

    Processing Systems - Volume 1, ser. NIPS’12. USA: Curran Associates Inc., 2012, pp.

    1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257

[19] Keras: The Python deep learning library. [Online]. Available: https://keras.io/

