CS5764: Information Visualization Project Part-3
Team Members: Payel Bandyopadhyay, Anika Tabassum, Min Oh

The 5 W's (who, what, when, where, why) and How for malicious activity:

Out of 59 employees with anomalous email activity patterns, we identified two suspects by checking for unusual login and website activity. The two suspects are a statistician (user id: AXC0137, name: Axel Xerxes Chapman) and a production line worker (user id: CSF0929, name: Chaney Sean Fuentes). Both Axel and Chaney have regular login patterns over their entire time of employment, but we also found unusual activity. By integrating and analyzing the given data, we were able to identify detailed evidence of probable malicious activity and its possible intent (Table 1).

The user AXC0137 appears to be trying to start his/her own business. The user visited websites such as 1and1.com and domaintools.com, both of which relate to hosting one's own website. So the user is probably trying to host a website, possibly one presenting his/her business idea. The user also accessed klout.com and logmein.com, which are websites for sharing business ideas online. All of this evidence deviates from the user's normal activity. Hence, we claim that the user AXC0137 is probably trying to create his/her own business using knowledge from the organization. This may be why the user was logging in at odd hours despite being a statistician, and why he/she was sending emails to unauthorized recipients outside the organization. Given that the user is a statistician, involvement in a start-up is also plausible. Creating or planning a startup is not an offense, but if he/she is doing so by using or stealing the ideas of the organization he/she currently works for, then it is a criminal offense and he/she should be sent to jail.

The user CSF0929 appears to be trying to leak information. This is evident from the fact that the user, despite being a production line worker, accessed wikileaks.com and aweber.com. These two visits were outliers in his/her website activity: wikileaks.com is a website for leaking/hosting sensitive information, and aweber.com is an email marketing website. A production line worker accessing these two websites differs from his/her regular website access activity. Since WikiLeaks is known for publishing leaked sensitive information, the user may be trying to leak sensitive company information.
Table 1: Classification of the 5 Ws
Ws | User 1 | User 2
Who | AXC0137 | CSF0929
What | The user is transferring important analysis records of the company. | The user tried to leak information to outsiders, which is evident from the fact that he accessed Wikileaks.com, a website to host sensitive information.
When | Night-time logins, whereas the user's general office hours are in the daytime (Table 6) | 1. Unusual login activities (Table 5) 2. Unusual device activities (Table 5)
Where | 1. No device access records 2. Access to remote hosting sites (using PC9532); we suspect the user might have used them to transfer files 3. Accessed websites like klout.com, logmein.com | 1. Email 2. Used USB device (accessed PC4442) 3. Accessed websites like wikileaks.com, aweber.com
Why | 1. The user has both anomalous email and login activity patterns in our analysis 2. The user is probably trying to create his/her own business (startup) using the ideas of his/her current organization, which is evident from the fact that he/she accessed websites like 1and1.com and domaintools.com. The user is trying to find domains to host his/her business website. | 1. The user has anomalous website access, login and device activity patterns in our analysis 2. Might have financial constraints
How | We found records of access to remote hosting sites, so we assume that the transfers were done through these sites. | The data shows a USB device being connected at the times of the unusual logins, so we assume he transferred files and leaked company information through these sites.
Initial Data Analysis: As an initial approach, we first explored the data by visualizing it in a spreadsheet (filtering, sorting, grouping, plotting) to find out which approaches we could take. However, after reading each of the individual data files, we concluded that exploring the files individually cannot give us any meaningful and useful representation. Thus, we planned to form a hypothesis and then analyze and combine the files based on it.

Initial hypothesis: Since our main goal is to find the attackers and the five W's (who, what, when, where, why), we hypothesized that none of the malicious activities would be possible without the help of some insiders. So we assumed that there must be some employees in the company who are either the malicious actors themselves or are involved indirectly by feeding important information to the attackers. Based on this assumption, we set out to find those anomalous employees. Based on the datasets, we divided the employee activities into three parts that we assumed would be useful in the search for malicious employees: i) employees' email activities (assuming that the infiltrators might try to contact attackers through email); ii) employees' login/logoff activities, i.e., accessing PCs and using removable storage media such as USB devices in PCs (assuming the infiltrators might try to store important information on USB devices to help the attackers); iii) employees' website access (assuming the infiltrator might access phishing or unknown websites to feed information to others, or try to hack himself).

Our plan to connect all data files together: Each of our group members worked on analyzing one of the activities mentioned above. One of us tried to find unusual employees based on their login and device activities, with the idea that if we can find employees whose access differs from their own normal activity and from that of other users, we can then look at the outside persons they emailed, the dates on which they emailed them, and whether they sent any attachments. Another tried to find infiltrators based on any unusual websites that employees accessed and the number of times they accessed them. To sum up, if we can find the anomalous employees, we can find out whom they contacted and sent information to, at what time, through what medium, and how.
i) Analyzing anomalous e-mail activities: We assumed that anomalous e-mail activity by employees might serve as a basis for identifying suspects. To find anomalous e-mail activity, we conducted a network-based clustering analysis. From the email data, we extracted the e-mail activity in which receivers are linked to senders who are corporate employees. From this we derived a network in which each node represents a user and a link between two nodes represents e-mail transmission from sender to receiver, with duplicate links removed. Figure 1 shows a randomly sampled subnetwork of the entire email network in a force-directed layout. In total, the derived network consists of 4,936 nodes and 112,240 links. Using this network, we calculated various network properties for each user, including average shortest path length, clustering coefficient, closeness centrality, stress, edge count, in-degree, out-degree, betweenness centrality, and neighborhood connectivity (Table 2). Each feature was scaled by z-scoring. Finally, a Gaussian Mixture Model was fitted to the network property data to cluster e-mail activity and identify anomalies.
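The first steps of this pipeline (removing duplicate links, counting in-degree and out-degree per user, and z-scoring a feature) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the team used its own C++ code, Cytoscape and R, the user ids and e-mail pairs below are hypothetical toy data, and the remaining network properties and the Gaussian Mixture Model fit are not shown.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_degree_table(emails):
    """Build in/out-degree per user from (sender, receiver) pairs.

    Duplicate sender->receiver links are removed first, as in the text.
    """
    links = set(emails)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for sender, receiver in links:
        out_deg[sender] += 1
        in_deg[receiver] += 1
    users = sorted(set(out_deg) | set(in_deg))
    return {u: {"in_degree": in_deg[u], "out_degree": out_deg[u]} for u in users}


def z_score(values):
    """Scale a list of numbers to zero mean and unit (population) std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: U* are employees, X1 is an outsider.
emails = [
    ("U1", "U2"), ("U1", "U2"),  # duplicate link, counted once
    ("U1", "U3"), ("U2", "U3"), ("U3", "X1"),
]
table = build_degree_table(emails)
users = list(table)
scaled_out = dict(zip(users, z_score([table[u]["out_degree"] for u in users])))
```

In a fuller version, every column of the property table would be scaled the same way before clustering.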
Figure 1. Randomly sampled email activity network with force-directed layout
Out of 4,936 nodes, 30 nodes were randomly sampled which spans 23 employees and 7 outsiders. The colors correspond to role of employees or indicate outsiders. The directed edges represent e-mail transmission between sender and receiver.
Table 2. Network properties calculated based on the e-mail activity network (for 20 users).
The network property table (Table 2), which lists the network properties for each user, contains 10 dimensions. The role dimension was excluded when the Gaussian Mixture Model was fitted to the data. The fitted model yielded 8 clusters of employees showing similar patterns in the email activity network (Figure 2). To visualize the clusters in a two-dimensional space, a dimensionality reduction technique was used. Figure 3 shows the result of principal component analysis (PCA) with these 8 clusters.
Figure 2. Result of fitting Gaussian Mixture Model into e-mail activity network
Figure 3. Dimensionality reduction for network properties with 8 clusters derived from Gaussian Mixture Model
Using a heatmap visualization (Figure 4), we observed how many employees in each role were assigned to each cluster. For example, out of 36 administrative assistants (the second row in Figure 4), 63.9% (23 people) were assigned to Cluster 4, whereas only 2.8% (1 person) was labeled as Cluster 6. In the same way, we were able to identify major and minor email activity clusters for each role. Employees assigned to minor clusters (fewer than 13% of the employees in a role) were classified as an anomalous email activity group. Finally, the identities of the employees in the anomalous email activity group were listed (Table 3). We were able to single out 59 individuals (about 6% of all employees) showing anomalous email activity. These 59 individuals were considered to require further investigation, because their anomalous email activity may indicate that they are conspiring with outsiders to conduct malicious activity. To narrow down the suspects with more evidence, the result of the login activity analysis was integrated with that of the email activity analysis. We identified two suspects, a statistician (user id: AXC0137, name: Axel Xerxes Chapman) and a production line worker (user id: CSF0929, name: Chaney Sean Fuentes), who show anomalous activity in both email and login records (highlighted in Table 3).

Tools for
Data preprocessing: own source code (C++)
Network visualization and analysis: Cytoscape, Microsoft Excel
Clustering visualization and analysis: R, Microsoft Excel, own source code (C++)
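The minor-cluster rule described above (a cluster holding fewer than 13% of a role's employees) could be expressed as follows. This is a hedged sketch rather than the team's C++/R code: the function name, the input layout and the user ids are hypothetical, and only the 13% threshold comes from the text.

```python
from collections import Counter

MINOR_SHARE = 0.13  # from the text: fewer than 13% of a role's employees


def flag_minor_clusters(assignments, minor_share=MINOR_SHARE):
    """Return users whose cluster is a minor one within their role.

    assignments maps user_id -> (role, cluster).
    """
    role_sizes = Counter(role for role, _ in assignments.values())
    cell_sizes = Counter(assignments.values())  # users per (role, cluster)
    flagged = []
    for user, (role, cluster) in assignments.items():
        share = cell_sizes[(role, cluster)] / role_sizes[role]
        if share < minor_share:
            flagged.append(user)
    return sorted(flagged)


# Hypothetical toy data: ten administrative assistants, nine in
# cluster 4 and one in cluster 6 (a 10% share, below the threshold).
assignments = {f"A{i:02d}": ("AdministrativeAssistant", 4) for i in range(9)}
assignments["A09"] = ("AdministrativeAssistant", 6)
anomalous = flag_minor_clusters(assignments)
```

Each row of the heatmap in Figure 4 corresponds to the per-role shares computed inside the loop.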
Figure 4. Heatmap for identifying anomalous email activities for each role
Table 3. The 59 suspects identified by their anomalous email activities
ii) Analyzing employees' login activities: With this analysis we try to identify employees who are accessing their PCs in a way that differs from their normal pattern. While skimming through the data, we observed that most employees have access to only one PC, with a few exceptions. We found that the employees who accessed more than one PC are all IT admins, which was already stated in the data. We also observed that most employees access PCs between around 6AM and 6PM. So we divided the 24-hour timestamps into 4 categories (0-3) based on whether an employee is working within those hours or not. After categorizing each employee's PC access on each date, we visualized the result using Andromeda to find clusters and outliers and to identify the employees who follow a different pattern. We also computed user similarities based on these categories and tried to visualize clusters based on the similarity scores by plotting them in one dimension, using both Andromeda and Python Pandas. However, with 980 employees, we failed to identify anything from that.
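A minimal sketch of the per-day categorization, assuming the four categories are the ones listed later with Tables 4 and 5 (0 = no activity, 1 = daytime, 2 = night-time, 3 = both) and that "daytime" means 6AM up to but not including 6PM. The exact boundaries and the function names are assumptions for illustration, not taken from the team's code.

```python
from datetime import datetime

DAY_START_HOUR = 6   # 6 AM, from the observed working window
DAY_END_HOUR = 18    # 6 PM (assumed to be exclusive)


def is_daytime(ts):
    """True if the timestamp falls inside the assumed daytime window."""
    return DAY_START_HOUR <= ts.hour < DAY_END_HOUR


def daily_category(timestamps):
    """Categorize one user's activity on one date.

    0 = no activity, 1 = daytime only, 2 = night-time only, 3 = both.
    """
    has_day = any(is_daytime(t) for t in timestamps)
    has_night = any(not is_daytime(t) for t in timestamps)
    if has_day and has_night:
        return 3
    if has_day:
        return 1
    if has_night:
        return 2
    return 0


# Example with made-up timestamps for a single date.
office_login = datetime(2017, 7, 3, 9, 0)
late_login = datetime(2017, 7, 3, 23, 15)
category = daily_category([office_login, late_login])
```

Applying `daily_category` to every (user, date) pair gives the kind of matrix that can then be clustered or compared across users.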
Figure 5: Analyzing similarities among employees' login activities
We observe that employee BLW0787, who is outside the normal pattern, is a computer programmer who was employed throughout the whole six months and also used storage media. Next, we tried to identify the employees who connected any storage media to their respective devices at their respective times. Our first observation from the device and login data was that each employee uses only one storage medium and uses it on the same PC that we found in their login activities. We also found that the timestamps at which the employees use their media match the times at which they were accessing their PCs. Thus, we found nothing unusual there. One important finding was that only 215 of the 980 employees use storage media. We computed the similarities among the employees accessing storage media on each date within the 6 months. We then compared the two similarity scores: the one for the employees' login activities and the one for their use of storage media.
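The text does not state which similarity measure was used. As one plausible reading, the sketch below scores two users by the fraction of dates on which their daily activity categories agree. The measure, the function names and the toy profiles are all assumptions made for illustration.

```python
def category_similarity(a, b):
    """Fraction of dates on which two users have the same category.

    a and b are equal-length lists of daily categories (0-3).
    """
    if len(a) != len(b):
        raise ValueError("profiles must cover the same dates")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def similarity_matrix(profiles):
    """Pairwise similarity for a dict of user_id -> daily categories."""
    users = sorted(profiles)
    return {
        (u, v): category_similarity(profiles[u], profiles[v])
        for u in users
        for v in users
    }


# Hypothetical profiles over four dates.
profiles = {
    "U1": [1, 1, 1, 0],
    "U2": [1, 1, 2, 0],
    "U3": [2, 2, 2, 2],
}
sims = similarity_matrix(profiles)
```

Computing one matrix from login categories and another from storage-media categories gives the two scores that are compared per user pair.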
Figure 6: Similarities of employees' use of storage media vs. their login activities

Here we observe that users with very similar login activities vary widely from each other in their use of storage media. Since login and device activity patterns alone did not reveal anything useful, we next looked for patterns in users' login and device-connection activity by role. We assumed that 'users with similar roles have similar login and device-connection activities'. So we ran a k-means clustering algorithm over the users' login and device activities. Since we divided the users' login and device patterns into three categories (0-2: morning activity, night activity, and both morning and night activity), we used k = 3. After assigning each user to one of three clusters, we grouped the users by role and identified the dominating cluster for each role. According to the data, the users are divided into 42 different roles. From Figure 9, we observe that cluster 2, which is essentially daytime activity, is the dominating cluster for most roles. Some users fall into cluster 1, which implies day activity on weekends, and cluster 3 contains the users who have both regular day and night-time activity. Since at least 15 technicians have night login patterns, we set a threshold of 5 to avoid flagging too many users as anomalous and to identify only the users who behave very differently from their roles: if fewer than 5 users of the same role fall into a cluster, we put those users on a suspect list. In addition, some roles (president, vice president, nurse, nurse practitioner and security guard) either have only one user or are roles for which unusual activity is to be expected. So we keep them out of suspicion.
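The clustering step could look roughly like the following. This is a minimal pure-Python sketch of k-means with k = 3, not the team's Java implementation: the two features per user (share of days with daytime-only activity and share of days with night activity), the toy values, and the deterministic initialisation from the first k points are all assumptions made for illustration.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples.

    Centroids start at the first k points (an assumption, chosen so
    the example is deterministic). Returns (labels, centroids).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: mean of each cluster's members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Hypothetical users: (share of daytime-only days, share of night days).
points = [
    (0.9, 0.0), (0.1, 0.8), (0.5, 0.5),
    (1.0, 0.0), (0.0, 0.9), (0.45, 0.55),
]
labels, centroids = kmeans(points, k=3)
```

With the labels in hand, counting users per (role, cluster) pair and applying the threshold of 5 described above yields the suspect list.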
Clustering and ruling out the users in the dominating cluster gives us 42 users on the suspicion list. Next, we manually checked the login activities to find users who generally have daytime/night-time activity but have very unusual night activity on some days (red-marked users in Table 6). This left only five users in total, and among these five, users CSF0929 and AXC0137 also show very unusual email activity in the email network analysis.

Table 4: Login activity of users categorized by day and night time activities over 6 months of data
Table 4 and Table 5 show the analysis and preprocessing of all users' login and device activities over six months, categorized from 0 to 3, corresponding to no activity, daytime, night-time and both.
Table 5: Device activity of users categorized by day and night time activities over 6 months of data
Table 6: 48 users who have been identified as behaving differently from their cluster
AMS0762 Mathematician
WKC0202 SecurityGuard
CDT0311 Salesman Night Activity
OHE0350 Physicist
AYD0147 Accountant
CPS0014 Director
UKW0099 ElectricalEngineer
KAK0992 Mathematician
RSP0404 TestEngineer
NKB0411 ElectricalEngineer
AXC0137 Statistician Night Activity
IDM0326 Technician
SUV0051 Physicist Night Activity
AXB0237 MechanicalEngineer
RHB0200 SecurityGuard Night Activity
HML0159 Manager
CBM0387 ElectricalEngineer
WMD0345 Physicist
ACC0950 ComputerTrainer Night Activity
CHP0446 Salesman
CCN0067 AdministrativeAssistant
GWC0187 PurchasingClerk
CBN0398 TestEngineer
DRW0195 Attorney
CZB0191 AdministrativeAssistant
AVJ0078 AdministrativeAssistant
BLW0787 ComputerProgrammer
KAG0412 MechanicalEngineer
CCM0786 ComputerProgrammer
RNR0344 Physicist
LGW0987 FinancialAnalyst
BLM0712 Salesman
DCB0109 Manager
PRM0153 Director
Figure 7: login activity of user AXC0137 over 6 months
From Figure 7 and Figure 8, we observe that user AXC0137 and user CSF0929 usually log in during the daytime, except on a few days when they have unusual night-time activity in addition to their usual office hours.
Table 7: Unusual login and device activity time of user CSF0929
Login Activity (PC4442) Date | Login Time | Device Activity Date | Device Time
07/01/2017 | 2:23AM-3:53AM, 4:09AM-5:15AM | 07/01/2017 | 2:23AM-3:53AM, 4:09AM-5:50AM
07/02/2017 | 9:57PM-10:40PM | 07/02/2017 | 9:57PM-10:40PM
07/09/2017 | 5:12AM-5:15AM | 07/09/2017 | 1:07AM-2:51AM, 5:12AM-5:15AM
07/14/2017 | 1:42AM-6:24AM | 07/14/2017 | 2:05AM-4:26AM, 5:44AM-5:50AM
07/16/2017 | 3:52AM-6:52AM | 07/16/2017 | 4:11AM-4:22AM, 5:28AM-5:43AM
Table 8: Unusual login and device activity time of user AXC0137 (No record of connecting devices)
Login Activity (PC9532) Date | Time
05/1/2017 | 6:52AM
05/7/2017 | 10:03PM-3:41AM
05/16/2017 | 8:48PM-1:02AM
05/19/2017 | 2:58AM-05:38
06/05/2017 | 5:39AM-5:49AM
10/26/2017 | 3:49AM-5:57AM
Figure 8: login activity of user CSF0929 over 6 months
After analyzing the login and device-connection activity of the two users, we find very interesting and suspicious patterns. First, user AXC0137 does not have any device-connection records, but he has six unusual logins at night or in the early morning before office hours over six months, in addition to his usual office hours. This user has always accessed PC9532, which is his own office PC (from emplyer_info.csv). We assume that only he has access to that PC. User CSF0929 has both suspicious login and device activity at corresponding times. What is unusual is that this user was employed for only three months, from 5/7/17 to 7/29/17. His general work hours are 8:30AM-5:40PM. Also, his device was connected to his PC4442 only during the hours of his unusual logins.
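The claim that CSF0929's device connections coincide with off-hours logins can be checked mechanically. The sketch below uses two rows transcribed from Table 7 (07/02/2017 and 07/14/2017) and the stated work hours of 8:30AM-5:40PM. The helper names are hypothetical, and the code assumes each session starts and ends on the same calendar date.

```python
from datetime import datetime, time

FMT = "%m/%d/%Y %I:%M%p"
WORK_START = time(8, 30)   # general work hours from the text
WORK_END = time(17, 40)


def parse_session(date, start, end):
    """Parse one same-day session into a (start, end) datetime pair."""
    return (
        datetime.strptime(f"{date} {start}", FMT),
        datetime.strptime(f"{date} {end}", FMT),
    )


def within(inner, outer):
    """True if the inner session lies entirely inside the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def outside_work_hours(session):
    """True if the session ends before work starts or begins after it ends."""
    start, end = session
    return end.time() <= WORK_START or start.time() >= WORK_END


# Two rows from Table 7 (user CSF0929, PC4442).
logins = [
    parse_session("07/02/2017", "9:57PM", "10:40PM"),
    parse_session("07/14/2017", "1:42AM", "6:24AM"),
]
devices = [
    parse_session("07/02/2017", "9:57PM", "10:40PM"),
    parse_session("07/14/2017", "2:05AM", "4:26AM"),
    parse_session("07/14/2017", "5:44AM", "5:50AM"),
]

all_devices_during_logins = all(
    any(within(d, l) for l in logins) for d in devices
)
all_logins_off_hours = all(outside_work_hours(l) for l in logins)
```

Running the same check over every row of Table 7 would also surface any device session that falls outside a recorded login.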
Figure 9: Generating a heat map to identify dominating cluster for each role
Visualization tools: Andromeda, Python Matplotlib
Clustering computation: Java, Eclipse (own source code, K-Means)
Analysis tools: Python Pandas, numpy, Microsoft Excel, Java Eclipse (to compute user similarity scores and login and device activity patterns) [using own source code]

iii) Analysing employees' website activities: With this analysis we try to find which users are accessing malicious websites. Our initial approach was to find outliers among the websites. Our task was then to compare the least-visited websites with a set of malicious websites available online.
Fig. 10: Showing the outliers among the websites
Once we had the list of least-visited websites, we tried to compare it with websites known to be malicious. While doing this, we realized that obtaining a list of malicious websites was difficult. Hence, we changed our hypothesis and tried to find each user's website visit pattern. For each user, we counted the number of times each website was visited and plotted the counts individually to look for malicious activity.
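Both counting steps described above (visits per website overall, and visits per website for each user) reduce to simple tallies. The sketch below is an illustration only: the team used Python Pandas, Excel and shell scripts, whereas this version uses only the standard library, and the user ids and log rows are hypothetical.

```python
from collections import Counter, defaultdict


def visits_per_site(log):
    """Total number of visits to each website across all users."""
    return Counter(site for _, site in log)


def visits_per_user(log):
    """For each user, the number of visits to each website."""
    per_user = defaultdict(Counter)
    for user, site in log:
        per_user[user][site] += 1
    return per_user


def least_visited(site_counts, n):
    """The n least-visited sites, ties broken alphabetically."""
    return sorted(site_counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]


# Hypothetical (user, website) log rows.
log = [
    ("U1", "google.com"), ("U1", "google.com"), ("U2", "google.com"),
    ("U2", "facebook.com"), ("U1", "facebook.com"), ("U2", "wikileaks.com"),
]
site_counts = visits_per_site(log)
user_counts = visits_per_user(log)
rare_sites = least_visited(site_counts, 1)
```

The per-user counters are what would be plotted individually, and the rare sites are the candidates to compare against a list of known malicious websites.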
Fig. 11: Showing website activity for each individual user

The next step is to compute the user similarity matrix for all users and find outliers among them. We are not sure whether the results will be fruitful, but this could be a way to analyze the data further. As decided, we also tried to find outliers among the websites. For that, we clustered the websites according to the categories shown in the table below:
Table 9: Clustering the websites based on various categories
Website Category | Examples
Shopping websites | Amazon.com, target.com, bestbuy.com, ...
Search engines | Google.com, ask.com, yahoo.com, ...
News websites | Foxnews.com, bbcnews.com, ...
Banking websites | Bankofamerica.com, chase.com, discovercard.com, ...
Business sharing websites | Aweber.com, addthis.com, ...
Weather report | Cnn.com, bbc.com, cbssports.com, ...
As we categorised the websites, we found one website that redirects to a different website: rr.com redirects to "mail.twc.com", which asks the user to input their email address and password. Hence, we extracted all the users who accessed this website. The table below shows part of the user list.

Initial hypothesis based on website visit activity of all users: rr.com uses URL forwarding, which can be used for hostile purposes such as phishing attacks or malware distribution. Users who have accessed this website might be doing something illegal and can be counted as suspicious.
Table 10: Table showing which users accessed rr.com
Now, it is very clear from the table that our hypothesis was wrong: far too many users accessed this website, and it is almost certain that not all of them are engaged in suspicious activity. Since our previous analysis failed, we tried a different approach and looked for users who seemed suspicious based on both email activity and logon activity. This gave us user AXC0137 (a statistician by profession) in common. We then analysed the website activity of this user. The figure below shows a screenshot of part of the list of websites visited by user AXC0137:
Fig. 12: Showing the website activity of user AXC0137
Since this user visited many websites, we could not display all of them. Our previous analyses indicated that this user tried to leak information to outsiders (outside the organization). The website analysis shows that the user visited websites like klout.com (website to share content online), logmein.com (website to access computers from any device), 1and1.com (claiming domains), domaintools.com (helps security analysts turn threat data into threat intelligence) and various other websites.

From the analysis of email activity and device activity, user CSF0929 was also identified as suspicious. Hence, we further analysed his website activity. Since he is a production line worker by profession, he did not access many websites, and those he did access were mostly common websites like google.com and facebook.com. Two of his website visits were outliers, viz. Wikileaks.com (website to leak/host sensitive information) and aweber.com (email marketing website). A production line worker accessing these two websites differs from his/her regular website access activity.

Visualization tools: Tableau
Analysis tools: Python Pandas, Microsoft Excel, Shell script (to analyze data)
Contribution:
Table 11: Work distribution among members
Planned Work | Member
Planning 5W's | All members
Getting an overview of whole data and making hypothesis | All members
Analyzing anomalous e-mail activities | Min
Analyzing logon activity and device accessing activities of users | Anika
Analyzing user visiting websites | Payel
Incorporating all the analysis with users/employees and trying to find out the intruders | All members