study of internet traffic to analyze and predict traffic
Post on 13-Feb-2017
SITAPT: STUDY OF INTERNET TRAFFIC TO ANALYZE AND PREDICT TRAFFIC
Amit Arora
aa1603@georgetown.edu
https://www.linkedin.com/in/amit-arora-539120a
WHAT DOES INTERNET TRAFFIC LOOK LIKE?
Word cloud of the mean percentage of packets contributed by various applications from 2008 to 2015
BACKGROUND
• Pervasive growth of the Internet.
• As Internet access becomes faster and applications move to the cloud, the profile of Internet traffic continues to change.
• Peer-to-peer traffic, video sharing and OTT services, coupled with almost ubiquitous access to high-speed Internet, pose new challenges:
  • Service providers: how to better utilize bandwidth?
  • OEMs: how to increase bits per second and packets per second through the equipment?
A key to understanding and solving these challenges is to:
• understand what constitutes Internet traffic
• predict what Internet traffic will look like in the coming years
• optimize networks and infrastructure to better utilize available resources.
This is what this project aims to address, i.e. understanding Internet traffic from various perspectives (application, protocol, packet size and others).
This understanding can then feed into network and infrastructure design.
SITAPT (Study of Internet Traffic to Analyze and Predict Traffic) is a data product built to address the above requirements.
TARGET AUDIENCE
• Network OEM
• Service Provider
SCOPE OF SITAPT
Visualization of Traffic Data
• Hundreds of applications (web, secure web, file transfer etc)
• Tens of protocols (TCP, UDP, ESP, GRE etc).
• Various packet sizes
Time Series Analysis for Traffic Prediction
• Multivariate timeseries analysis to predict traffic for various applications and protocols in the next 12 months.
• Identify trends in key and upcoming applications/protocols
Clustering to Explore Similarity
• Use machine learning algorithms to identify possible clusters of similarity between traffic patterns across multiple years.
Relationship between traffic types
• Identify and model relationships between key protocols
DATA SCIENCE PIPELINE
SITAPT itself is implemented entirely in Python (version 2.7.11), although it relies heavily on Python packages such as numpy that may be written in other programming languages for speed. The entire code for SITAPT is available in the SITAPT repo on GitHub.
• Obtain anonymized Internet traces from CAIDA. The traces are available in pcap format and contain IP and transport layer headers only.
• Read each packet trace and convert the information from each packet into a JSON document, which is then stored in MongoDB. Remove all data that is not IP.
• Determine the traffic mix by protocol, application, packet size and other criteria. Convert the results into a pandas DataFrame, which is again stored in MongoDB.
• Model traffic as a multivariate time series. Use time series analysis techniques to forecast traffic for various applications and protocols.
• Use unsupervised machine learning (clustering) to identify similarity in traffic across the dataset.
• Identify and model relationships between key protocols.
• Create visualizations for understanding the data: represent time series, describe trends, and explore clustering and regression.
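The "packet to JSON" step of the pipeline can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the SITAPT code: the function name, the document field names and the IPv4-only check are assumptions made for the example, and the sample header is hand-built rather than taken from a CAIDA trace.

```python
import json
import socket
import struct

def ipv4_header_to_doc(raw, timestamp):
    """Turn the fixed 20-byte IPv4 header of one packet into a JSON-friendly dict.

    Returns None for anything that is not IPv4, which mirrors the
    "remove all data that is not IP" step (IPv6 is ignored here, an assumption).
    """
    if len(raw) < 20:
        return None
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    if ver_ihl >> 4 != 4:
        return None
    return {
        "ts": timestamp,          # capture time of the packet
        "length": total_len,      # total packet length in bytes
        "ttl": ttl,
        "protocol": proto,        # IANA protocol number: 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hand-built sample header (hypothetical addresses), protocol 6 = TCP.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
doc = ipv4_header_to_doc(sample, 1205884800.0)
print(json.dumps(doc, sort_keys=True))
```

A dict of this shape is what a document database such as MongoDB would store, one document per packet.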
ARCHITECTURE
[Architecture diagram. Recoverable labels:]
• Data sources (anonymized Internet traces): Internet trace from CAIDA; Internet trace from a Service Provider.
• Ingestion: ingestion module (BeautifulSoup4, wget, gzip, pycapfile). Offline (non-realtime), authenticated ingestion.
• WORM / ETL: NoSQL (MongoDB), immutable data store. ETL = Extract only IP packets, Transform to JSON, Load into the data store.
• Computation & Modeling: models are AR(I)MA, PCA, KMeans and Linear Regression; packages are Statsmodels and SciKit Learn. Models are trained on the available data sets.
• Wall of interpretation: data from models made available in various formats.
• Application Layer.
A WORD ABOUT DATA USED IN SITAPT
CAIDA (Center for Applied Internet Data Analysis, http://www.caida.org/) maintains a large amount of data that can be used for analyzing and understanding Internet traffic. Anonymized Internet traces from 2008 to 2015 are available upon request from CAIDA (see http://www.caida.org/data/passive/passive_2015_dataset.xml); these traces form the dataset used by SITAPT.
All traces in this dataset are anonymized (by CAIDA itself) with the same key. In addition, the payload has been removed from all packets (again by CAIDA itself).
A WORD ABOUT DATA USED IN SITAPT (CONTD..)
How much data does SITAPT use?
CAIDA provides around 13,000 files (in compressed format) for the period from 2008 to 2015. The combined size of the uncompressed version of the files stored into a database would run into several tens of terabytes. The current version of SITAPT is not a *big data* product.
Clearly, analyzing this amount of data requires horizontal scaling which is outside the scope of the current project. To reduce the problem to a more manageable level, SITAPT works with one file for every month of every year from 2008 to 2015 (for 2014 and 2015 CAIDA provides one file per quarter).
In total, SITAPT analyses 73 packet trace files from 2008 to 2015. Each trace contains millions of packets. The size of the database that stores the JSON representation of the trace files is more than 1TB.
DATA TRANSFORMATION FOR COMPUTATION
Ingested data is stored in Mongo collections (one for each year). This data needs to be transformed into matrix form to make it amenable to computation.
Once created, the three collections (applications, protocols and packet size distribution) are also stored as CSV files, so that the modeling phase does not have to interact with the database at all and can read the data it needs directly from the CSV files.
DATA TRANSFORMATION FOR COMPUTATION (CONTD..)
These files represent time series data such that each row is a parameterized representation of traffic, expressed as a combination of applications, protocols or packet size intervals.
Date       sun-sr-iiop  transmit-port  ieee-mms-ssl  passgo    joaJewelSuite  ovhpas    sdo-tls   interwise  lm-instmgr
3/19/2008  4.36E-06     3.49E-05       0             8.72E-06  0              0.000798  0.000959  0.001051   0.000606
4/30/2008  9.00E-06     9.00E-06       9.00E-06      9.00E-06  9.00E-06       9.00E-06  9.00E-06  9.00E-06   9.00E-06
5/15/2008  0            0              0             0         0              3.96E-05  0.001187  0.004413   0
6/19/2008  0            0              0             0         0              0.000368  0.000403  0.000266   0
7/17/2008  0.000744     6.20E-05       0             0         0              0.003299  0.002828  1.24E-05   2.48E-05
Rows: time (one trace date per row). Columns: applications.
Read row-wise: composition of Internet traffic as the percentage of packets contributed by each application in a trace captured on a particular day. Read column-wise: percentage of packets for each application as a time series.
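The matrix form described above can be illustrated with a small standard-library sketch. It assumes per-trace packet counts keyed by application are already available; the function name, the ISO date keys and the counts themselves are hypothetical and are not taken from the SITAPT dataset or code.

```python
import csv
import io

def to_percentage_rows(counts_by_date):
    """Build a date x application matrix of packet percentages.

    counts_by_date: {date string: {application: packet count}}
    Returns (column names, rows); applications missing from a trace get 0.
    """
    apps = sorted({app for counts in counts_by_date.values() for app in counts})
    rows = []
    for date in sorted(counts_by_date):      # ISO dates sort chronologically
        counts = counts_by_date[date]
        total = sum(counts.values())
        shares = [100.0 * counts.get(app, 0) / total if total else 0.0
                  for app in apps]
        rows.append([date] + shares)
    return ["Date"] + apps, rows

# Hypothetical per-trace packet counts (not real CAIDA numbers).
counts_by_date = {
    "2008-03-19": {"http": 700, "https": 100, "dns": 200},
    "2008-04-30": {"http": 600, "https": 300, "ssh": 100},
}
columns, rows = to_percentage_rows(counts_by_date)

# Write the matrix as CSV so that a modeling step can read it without a database.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(rows)
print(buffer.getvalue())
```

Each row then sums to 100, and each column read top to bottom is the time series for one application.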
VISUALIZATIONS
Four types of visualizations are explored:
• Word cloud (for applications and protocols)
• Stacked chart (for applications, protocols and packet size distribution)
• Heat map (for packet size distribution)
• Parallel coordinates (for protocols)
VISUALIZATIONS: STACKED CHARTS
Observations:
• Almost exponential increase in HTTPS traffic
• Decrease in unclassified (unknown) traffic
• Logarithmic decay in applications that contribute less than 0.5% of packets individually
VISUALIZATIONS: HEAT MAP
Observations: Packet size distribution is almost entirely bimodal. Only the 1 to 100 bytes range and the 1400 to 1500 bytes range show packet percentages of any significance; these are the only two rows of the heat map with dark colors (which represent a significant packet percentage).
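A packet size distribution of the kind plotted in the heat map can be computed along these lines. This is a sketch only: the 100-byte bin width, the bin labels, the clipping at 1500 bytes and the sample sizes are assumptions for illustration, not values confirmed by the SITAPT code.

```python
from collections import Counter

def size_distribution(packet_sizes, bin_width=100, max_size=1500):
    """Return {bin label: percentage of packets} for fixed-width size bins.

    Sizes are clipped to the range 1..max_size so every packet lands in a bin.
    """
    counts = Counter()
    for size in packet_sizes:
        clipped = min(max(size, 1), max_size)
        counts[(clipped - 1) // bin_width] += 1
    total = len(packet_sizes)
    dist = {}
    for b in range(max_size // bin_width):
        label = "%d-%d" % (b * bin_width + 1, (b + 1) * bin_width)
        dist[label] = 100.0 * counts[b] / total if total else 0.0
    return dist

# Hypothetical packet sizes in bytes: mostly small packets and near-MTU packets.
sizes = [40, 52, 64, 1500, 1480, 1450, 576, 40]
dist = size_distribution(sizes)
for label, pct in dist.items():
    if pct > 0:
        print(label, pct)
```

Computing one such distribution per trace and stacking the results by date gives the rows-by-time grid that a heat map needs.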
VISUALIZATIONS: PARALLEL COORDINATES
Parallel coordinates chart of protocols contributing more than 0.01% of overall traffic. This chart clearly shows a negative correlation between TCP and UDP traffic.
TIME SERIES ANALYSIS
Each time series is studied and analyzed individually. The following operations are performed on each time series:
• Plot the first difference series to identify trends.
• Evaluate the ACF and PACF to identify dependencies of the series upon previous time samples.
• Perform seasonal decomposition to identify trend, seasonality and residuals.
• Fit ARMA and ARIMA models to the time series.
Modeling is done using the “statsmodels” package. The output of each of the above steps is available as part of the SITAPT analysis.
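Two of the per-series operations (first differencing and the ACF), plus the MAPE that the deck reports as its model selection criterion, can be written out in plain Python to show what they compute. SITAPT itself uses statsmodels for this; the functions and the sample series below are a dependency-free sketch with hypothetical values, and the ARMA/ARIMA fitting step is not reproduced here.

```python
def first_difference(series):
    """x[t] - x[t-1]; removes a linear trend from the series."""
    return [b - a for a, b in zip(series, series[1:])]

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (standard biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical monthly TCP packet percentages (not taken from the dataset).
tcp_share = [88.0, 89.0, 91.0, 90.0, 92.0, 93.0, 92.5, 94.0]
diffed = first_difference(tcp_share)
print("first difference:", diffed)
print("ACF of differenced series:", acf(diffed, 3))
print("MAPE of a naive 'previous value' forecast:",
      mape(tcp_share[1:], tcp_share[:-1]))
```

An ACF that drops off quickly after differencing suggests the differenced series is close to stationary, which is what an ARMA fit on it assumes.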
TIME SERIES MODELING FOR TCP PROTOCOL
TIME SERIES MODELING FOR MULTIPLE PROTOCOLS AND APPLICATIONS
Forecasted traffic mix
TRENDS IN SOME IMPORTANT APPLICATIONS
• Almost exponential increase in HTTPS traffic
• HTTP traffic is decreasing but still contributes a significant percentage
CLUSTERING
To explore whether any patterns are hidden in the Internet traffic data, a clustering technique is employed.
• Each protocol, application or packet size interval is treated as a feature and each trace is treated as an instance.
• Clustering is done in two steps:
  1. Dimensionality reduction via PCA (Principal Component Analysis). For applications, PCA reduces 5000+ dimensions to 10.
  2. Clustering via KMeans, with K = 4.
• PCA and KMeans are both done using the scikit-learn API.
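The second step can be illustrated with a small pure-Python version of k-means (Lloyd's algorithm). This is not the scikit-learn implementation used by SITAPT: the PCA step is omitted, the initialization simply takes the first k points (an assumption made to keep the example deterministic), k is 2 rather than 4, and the (TCP %, UDP %) points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(p) for p in points[:k]]   # naive deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each trace goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical traces described by (TCP %, UDP %); each trace is one instance.
traces = [(91.0, 8.2), (75.0, 24.0), (93.1, 6.1),
          (74.2, 25.1), (95.2, 3.8), (76.5, 22.9)]
labels, centroids = kmeans(traces, k=2)
print(labels)
print(centroids)
```

The returned label per trace plays the same role as the cluster (label) column in the generated CSV files.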
CLUSTERING (CONTD..)
Some clustering is present in the applications and protocols data, but not so much in the packet size distribution data (which may need a higher K).
CLUSTERING (CONTD..)
The following is an excerpt from the generated CSV file for protocols, showing the additional fields added, including the label field provided by the clustering algorithm.
Date        Year  Half  Quarter  Fortnight  DayOfTheWeek  cluster  TCP          ESP          UDP
5/17/2012   2012  1     2        2          Thursday      2        91.01197292  0.190882127  8.219679951
7/19/2012   2012  2     3        2          Thursday      2        91.75760765  0.084797929  7.588290657
9/20/2012   2012  2     3        2          Thursday      2        93.10399212  0.024899575  6.125210312
10/18/2012  2012  2     4        2          Thursday      2        90.43387514  0.492682745  8.564568542
11/15/2012  2012  2     4        1          Thursday      2        95.2341703   0.262422627  3.843731031
12/20/2012  2012  2     4        2          Thursday      2        91.12739062  0.035976644  8.60410183
3/21/2013   2013  1     1        2          Thursday      2        94.65616355  0.019147528  5.206672769
6/20/2013   2013  1     2        2          Thursday      2        90.5349465   0.061648601  9.195006053
9/18/2014   2014  2     3        2          Thursday      2        90.60646827  0.020378309  8.885648934
The table is filtered on cluster (label) 2, and it shows that traces with a higher than usual TCP traffic percentage (90 to 95%) are clustered together.
It is a matter of further analysis to figure out what event or phenomenon caused the Internet traffic at different times between 2008 and 2015 to be similar.
If this study were done on traffic from a closed network (such as a single ISP), it would be much easier to attribute this clustering to real-world events (such as an OS update for mobile phones).
LINEAR REGRESSION
Parallel coordinates showed a negative correlation between the percentage of TCP traffic and the percentage of UDP traffic. A scatter plot of TCP vs. UDP is created and a linear regression model is fitted to draw a straight line through it. The coefficients vector is [ -1.00805723] and the variance score is 0.96.
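The fit itself is ordinary least squares with one predictor, which can be written out directly. The sketch below is not the scikit-learn call used in SITAPT, and it runs on only the nine cluster-2 rows from the protocols CSV excerpt rather than all 73 traces, so its slope and score will not match the reported values exactly. For a least-squares line with an intercept, R² on the training data coincides with the explained variance score.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# TCP % and UDP % from the nine cluster-2 rows of the CSV excerpt (a subset only).
tcp = [91.01197292, 91.75760765, 93.10399212, 90.43387514, 95.2341703,
       91.12739062, 94.65616355, 90.5349465, 90.60646827]
udp = [8.219679951, 7.588290657, 6.125210312, 8.564568542, 3.843731031,
       8.60410183, 5.206672769, 9.195006053, 8.885648934]

slope, intercept, r2 = fit_line(tcp, udp)
print("slope:", slope)
print("intercept:", intercept)
print("R^2:", r2)
```

A slope close to -1 means that each percentage point gained by TCP is lost almost entirely by UDP, which is expected when the two protocols together account for nearly all packets.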
WHAT WORKED?
• The fact that all the packet traces are now available in a document database means that the data is in a consumable format. This opens up avenues for further analysis and for asking different types of questions of the data.
• The time series analysis revealed interesting trends in the data, such as an almost exponential increase in secure HTTP traffic, which was expected. At the same time there is no large decrease in non-secure HTTP traffic, which was somewhat unexpected.
• Visualization techniques (like parallel coordinates) and tools like Bokeh provide really good insight into the data.
WHAT DID NOT WORK?
• With the amount of data involved, this is clearly a Big Data project. Since that could not be completed in a short time, the alternative was to use one trace file for each month, which reduced the number of data points available for analysis (only 73 data points). This limited the predictive ability of the time series models: not all applications and protocols could be modeled within the 95% confidence interval and with a MAPE of < 5%.
• This data would provide many more insights if it corresponded to traffic from a closed network rather than the Internet, for example an ISP’s network limited to certain geographical areas, because the data would then have less variability and the clustering would be easier to explain.
• For the time series models, only the MAPE was considered when choosing between the AR(I)MA models. Other criteria, such as the Durbin-Watson statistic, the BIC and the HQIC, should have been explored but were not.
CONCLUSION
SITAPT provides valuable insights into network traffic composition and trends.
• In terms of applications, there is an exponential growth trend in HTTPS traffic, a trend that is visible even at a macro level (generic Internet packet trace).
• The time series analysis is able to provide predictions for applications and protocols.
• In terms of packet sizes, there is a bimodal distribution.
• Clustering reveals patterns in terms of both applications and protocols.