Splunk Live University of Alberta 2015
Greg Dostatni, Team Lead, Application Hosting
Splunk at the University of Alberta
Copyright © 2015 Splunk Inc.
• At U of A since 2007
• Responsible for 10-person team managing applications and databases university-wide
• Splunk user since 2013
• I’ve eaten BBQ chicken intestines on a stick. Yummy.
• splunk> take the sh out of IT
The University of Alberta
• Public research university based in Edmonton and founded in 1908
• 39,000+ students and 18,000 employees
• 5 campuses and 18 faculties
• One of the top 100 universities worldwide
IT at the University of Alberta
Central IT group for authentication, wireless and core services
Independent IT groups for most faculties and departments
University-wide initiative to consolidate more of IT
Need to standardize IT operations and tame diverse technology stacks
Application Hosting Objectives
• Centralize more of IT
• Build and manage shared environments
• Develop custom services as needed
• Roll out/upgrade applications
• Investigate performance problems
IT
Libraries
LMS
Public website + CMS
Ticketing
Billing systems
Research group servers
Other applications and databases
Challenges after Restructuring IT
• More interdependencies among teams
• Massive volume of data, housed in silos
• “Running blind” – no understanding of the data
• Time-consuming to gather data for incidents
Splunk Timeline
Sept. 2013
• Pilot deployed
• Splunk as syslog target
• Log aggregation test; no need for backup
March 2014
• Data loss concerns from restarting Splunk
• Management relying on Splunk reports
• Splunk not in production
Sept. 2014
• Management notification of syslog data loss
• Incidents escalated
• Splunk in production?
April 2015
• Funding to rebuild Splunk environment
• New hardware, clustering with dedicated storage
• 400 data sources
• 133 sourcetypes
Splunk at the University of Alberta
Infrastructure Applications
(mail, authentication)
Networking and Security
(switches, IPS)
Application Hosting
(apps, databases)
Example: Troubleshooting Authentication Systems
Before
• 12GB/day, 20 machines
• No aggregation
• Reactive issue response based on user feedback
• Manual investigations
• Delay in getting data
After
• Centralized data
• ½ hour to troubleshoot
• Proactive alerts for issues
• Easy access to infrastructure data
• Real-time reporting
Example: Performance Monitoring
Track and correlate request response times to gauge user satisfaction
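One way to turn response times into a user-satisfaction number is an Apdex-style score. The slides do not say which measure was used, so this is only a sketch under that assumption; the function name and the 0.5-second threshold are hypothetical.

```python
def apdex(response_times, threshold=0.5):
    """Apdex-style score in [0, 1]; higher means more satisfied users.

    Assumption: requests at or under `threshold` seconds count as satisfied,
    requests up to 4x the threshold as tolerating, the rest as frustrated.
    """
    if not response_times:
        return None  # no requests, nothing to score
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical sample: two fast requests, one slow, one very slow.
score = apdex([0.1, 0.2, 1.0, 3.0], threshold=0.5)
```

Tracking a score like this over time, per application, lets a drop in user experience stand out before users report it.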
Example: First Responders App
Dashboards for initial incident review
Example: Proactive Alerts
Trigger alerts on both the count and percentage of messages
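A minimal sketch of an alert condition that checks both the count and the percentage, so a few errors on a quiet system or a tiny share on a busy one do not page anyone. The function name and both thresholds are hypothetical, not taken from the slides.

```python
def should_alert(error_count, total_count, min_count=50, min_fraction=0.05):
    """Return True only when errors are numerous AND a large share of traffic.

    Assumption: `min_count` and `min_fraction` are illustrative thresholds.
    """
    if total_count == 0:
        return False  # no messages at all, nothing to alert on
    enough_errors = error_count >= min_count
    large_share = error_count / total_count >= min_fraction
    return enough_errors and large_share
```

For example, 100 errors out of 1,000 messages would fire, while 10 errors out of 20 (too few) or 100 out of 100,000 (too small a share) would not.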
Example: Executive Dashboards
Splunk Deployment Takeaways
Successes
• Visibility cutting through team boundaries
• More advanced initial incident investigation
• Openness - signed standard IT agreement for access to Splunk data
• Management loves reports
• Defusing situations with rapid access to facts
Challenges
• Accepting syslog data directly
• Log standardization
• Figuring out what to look at in the logs to understand “good” system behavior
Aha! Moments
Transactions
• End-to-end monitoring of 4M+ email messages per day (greylisting, spam filtering, Google)
• Used transactions to combine logs across systems into single, message-centric log
• Ability to easily search for anomalies
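The idea behind a message-centric log can be sketched outside Splunk as grouping events from several systems by a shared message ID. This illustrates the concept and is not the team's actual search; the field names, system names, and sample events are hypothetical.

```python
from collections import defaultdict


def build_transactions(events):
    """Group events by message ID and order each group by time."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["message_id"]].append(event)
    return {
        message_id: sorted(group, key=lambda e: e["time"])
        for message_id, group in grouped.items()
    }


# Hypothetical events from three mail-path systems.
events = [
    {"time": 3, "system": "google", "message_id": "A1"},
    {"time": 1, "system": "greylist", "message_id": "A1"},
    {"time": 2, "system": "spamfilter", "message_id": "A1"},
    {"time": 1, "system": "greylist", "message_id": "B2"},
]
transactions = build_transactions(events)

# Anomaly check: messages whose last recorded hop is not the final system.
incomplete = [
    message_id
    for message_id, group in transactions.items()
    if group[-1]["system"] != "google"
]
```

Once each message is a single record, anomalies such as a message that never reached the last hop become a simple filter.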
Generic Alerts
• Created alert to catch errors across systems in real time
• Used existing alert and removed host specification to create the generic alert
• Catches errors that were not in Splunk at the moment the alert was created
10-second Query
• 10-second window = ~35,000 events
• Statistics to rank likely events triggering issues
• New Splunk window to analyze unusual messages
• Ability to examine small slice of time in detail while running statistics over longer period of time
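A rough sketch of ranking messages in a short window against a longer baseline: each message's share of the window is divided by its share of the baseline, so messages that are unusually frequent in the window rise to the top. The slides do not name the statistic used, so the scoring, the function name, and the sample messages are assumptions.

```python
from collections import Counter


def rank_unusual(window_msgs, baseline_msgs):
    """Rank window messages by how over-represented they are vs. the baseline."""
    window = Counter(window_msgs)
    baseline = Counter(baseline_msgs)
    window_total = sum(window.values())
    baseline_total = sum(baseline.values())
    scores = {}
    for msg, count in window.items():
        window_share = count / window_total
        # Add-one smoothing keeps messages unseen in the baseline finite.
        baseline_share = (baseline[msg] + 1) / (baseline_total + 1)
        scores[msg] = window_share / baseline_share
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: timeouts are rare overall but common in the short window.
window_msgs = ["login ok"] * 8 + ["ldap timeout"] * 2
baseline_msgs = ["login ok"] * 990 + ["ldap timeout"] * 10
ranked = rank_unusual(window_msgs, baseline_msgs)
```

Here "ldap timeout" ranks first because its share of the window is far larger than its share of the baseline.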
“Splunk allows us to erase these lines and any analyst can see all the data from
anywhere and investigate a problem from end to end.”
Thank you