How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, BrightSource - DevOpsDays Tel Aviv 2015

How We Analyzed 1000 Dumps in One Day DINA GOLDSHTEIN EMBEDDED TEAM LEADER, BRIGHTSOURCE ENERGY BLOGS.MICROSOFT.CO.IL/DINAZIL/ @DINAGOZIL

Upload: devopsdays-tel-aviv

Post on 14-Feb-2017


TRANSCRIPT

Page 1: How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOpsDays Tel Aviv 2015

How We Analyzed

1000 Dumps in One Day

DINA GOLDSHTEIN

EMBEDDED TEAM LEADER, BRIGHTSOURCE ENERGY

BLOGS.MICROSOFT.CO.IL/DINAZIL/

@DINAGOZIL

Page 2:

Agenda

What we do and why we need dumps

Manual analysis process

The holy grail: automatic dump analysis

Our automatic triage workflow

Page 3:

About Us

BrightSource Energy builds solar power plants

Power plants have control software

Control software crashes

Page 4:

Our Production Environment

The office (development) network is connected to the Internet

The production (power plant) network is isolated

There is a (very slow) one-way link from production to development

Page 5:

In the Beginning…

Mask all crashes by a nice error dialog and an “orderly” shut-down

Analyze errors using very extensive log files from all components

Alas, last error in log doesn’t always correspond to the fiend

Need to know exact exception, when it occurred and where!

Page 6:

Crash Dumps

A dump is a snapshot of a process’s memory: threads, heap, exceptions, locks, etc.

Various tools can open dump files and see what’s inside

Page 7:

How???

An executable can be compiled with debug information: the symbols

Symbol files (.PDB) contain information that allows debuggers to match addresses and other information in the file to names of DLLs, functions, variables, lines of code, etc.

Page 9:

Symbol Server

Symbols can be provided to the debugger explicitly

But they can also reside in a Symbol Server (stored by name and hash)

The debugger can download debugging symbols automatically for the right product version
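
As an illustration of how a debugger is pointed at a symbol server, the sketch below composes a symbol path in the `srv*<local cache>*<server>` form and stores it in the `_NT_SYMBOL_PATH` environment variable, which WinDbg and other dbghelp-based tools read. The class name, the cache folder and the server share are hypothetical; the slides do not say where BrightSource keeps its symbols.

```csharp
using System;

// Hypothetical helper, not part of the talk's code.
public static class SymbolPath
{
    // Builds a symbol path in the "srv*<local cache>*<server>" form.
    public static string Build(string localCache, string symbolServer)
    {
        if (string.IsNullOrWhiteSpace(localCache))
            throw new ArgumentException("Local cache folder is required.", nameof(localCache));
        if (string.IsNullOrWhiteSpace(symbolServer))
            throw new ArgumentException("Symbol server location is required.", nameof(symbolServer));
        return $"srv*{localCache}*{symbolServer}";
    }

    // Sets _NT_SYMBOL_PATH for the current process only.
    public static void ApplyToCurrentProcess(string symbolPath)
    {
        Environment.SetEnvironmentVariable("_NT_SYMBOL_PATH", symbolPath);
    }
}
```

For example, `SymbolPath.Build(@"C:\SymbolCache", @"\\buildserver\symbols")` yields `srv*C:\SymbolCache*\\buildserver\symbols` (both locations are made up for the example).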

Page 10:

Production Crashes

We can’t attach a debugger or do remote analysis of production errors

Windows can be configured to automatically save a dump when a process crashes

When crashes occur, dump files are generated and transmitted to a central location and then to the office network
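
The slides do not say which mechanism was used to make Windows save dumps. One documented option is the Windows Error Reporting `LocalDumps` registry key under `HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting`. The sketch below assumes that option; the class name and the dump folder passed in are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Win32;

// Hypothetical helper, not part of the talk's code.
public static class LocalDumpsConfig
{
    public const string KeyPath =
        @"SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps";

    // Values read by Windows Error Reporting.
    // DumpType 2 requests a full dump; 1 would request a mini dump.
    public static IReadOnlyList<(string Name, object Value, RegistryValueKind Kind)>
        BuildValues(string dumpFolder, int dumpCount)
    {
        return new List<(string, object, RegistryValueKind)>
        {
            ("DumpFolder", dumpFolder, RegistryValueKind.ExpandString),
            ("DumpCount", dumpCount, RegistryValueKind.DWord),
            ("DumpType", 2, RegistryValueKind.DWord),
        };
    }

    // Writes the values under HKLM; needs Windows and administrator rights.
    public static void Apply(string dumpFolder, int dumpCount)
    {
        if (!OperatingSystem.IsWindows())
            throw new PlatformNotSupportedException("LocalDumps is a Windows feature.");
        using RegistryKey key = Registry.LocalMachine.CreateSubKey(KeyPath);
        foreach (var (name, value, kind) in BuildValues(dumpFolder, dumpCount))
            key.SetValue(name, value, kind);
    }
}
```

Full dumps are chosen here because the dump has to contain the heap for later analysis; that choice is an assumption and goes with the multi-gigabyte dump sizes the talk mentions.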

Page 11:

Manual Dump Analysis

With high failure rates, we’re talking dozens of dumps per day from a single facility

Many errors are exact duplicates

Manual analysis means:

◦ Copy dump to my machine (it’s not uncommon for a dump to be 2-3GB)

◦ Copy debugger support files and symbols (if no symbol server is present)

◦ Open dump in debugger (Visual Studio/WinDbg)

◦ Locate the exception and call stack

◦ Triage and open a bug for the relevant developer

◦ Probably around 10 minutes per dump…
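
At roughly 10 minutes per dump, 1000 dumps would take about 10,000 minutes, or close to 167 hours of manual work. Since many errors are exact duplicates, one way to cut that down is to group crashes by a signature once the exception and call stack are known. The sketch below assumes the signature is the exception type plus the top frames of the stack; the `CrashInfo` record and all names in it are hypothetical, and the slides do not describe how duplicates were actually detected.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical summary of one analyzed dump.
public sealed record CrashInfo(string DumpFile, string ExceptionType, IReadOnlyList<string> Frames);

public static class CrashGrouping
{
    // Signature = exception type plus the top N frames of the call stack.
    public static string Signature(CrashInfo crash, int topFrames = 3) =>
        crash.ExceptionType + "|" + string.Join("|", crash.Frames.Take(topFrames));

    // Groups dumps whose signatures match, largest group first.
    public static List<IGrouping<string, CrashInfo>> GroupDuplicates(
        IEnumerable<CrashInfo> crashes, int topFrames = 3) =>
        crashes.GroupBy(c => Signature(c, topFrames))
               .OrderByDescending(g => g.Count())
               .ToList();
}
```

With grouping like this, only one dump per group needs a human look and one bug per group needs to be opened.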

Page 12:

Automatic Dump Analysis

ClrMD is a NuGet package which provides a debugger API for dumps and live processes

◦ Works with both native and managed code

The core of our automatic solution uses ClrMD for automatic dump analysis and triage:

◦ Exception information

◦ Call stack

◦ Likely faulting component

Recently became open source on GitHub
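
The slides do not explain how the "likely faulting component" is chosen. One plausible heuristic, sketched below as an assumption, is to walk the call stack from the top and pick the first frame whose module is not part of the platform. The `Frame` record, the prefix list and the module names in the usage note are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stack frame: the module it lives in and the method name.
public sealed record Frame(string Module, string Method);

public static class FaultLocator
{
    // Assumed module-name prefixes that belong to the platform, not to our code.
    private static readonly string[] PlatformPrefixes =
        { "System", "Microsoft", "mscorlib", "clr", "ntdll", "kernel32", "KERNELBASE" };

    private static bool IsPlatform(string module) =>
        PlatformPrefixes.Any(p => module.StartsWith(p, StringComparison.OrdinalIgnoreCase));

    // Returns the module of the topmost non-platform frame,
    // or "(unknown)" when every frame belongs to the platform.
    public static string LikelyFaultingModule(IEnumerable<Frame> callStack)
    {
        Frame first = callStack.FirstOrDefault(f => !IsPlatform(f.Module));
        return first == null ? "(unknown)" : first.Module;
    }
}
```

For a stack whose top frame is in `mscorlib` and whose next frame is in a module named `PlantControl.Tracking`, this returns `PlantControl.Tracking`.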

Page 13:

Some Code…

target = DataTarget.LoadCrashDump(dumpPath);
if (target.ClrVersions.Count > 0) {
    // Managed code present: fetch the matching DAC and create a runtime view
    ClrInfo dacVersion = target.ClrVersions[0];
    string dacLocation = dacVersion.TryDownloadDac();
    runtime = target.CreateRuntime(dacLocation);
}

// Ask the debugger engine which event (and which thread) produced the dump
var dc = (IDebugControl)target.DebuggerInterface;
dc.GetLastEventInformation(out eventType, out processId, out threadIndex,
    extraInformation, extraInformationSize, out extraInformationUsed,
    description, descriptionSize, out descriptionUsed);

// Look up the OS (system) thread ID for that thread
var dso = (IDebugSystemObjects)target.DebuggerInterface;
var sysIds = new uint[count];
dso.GetThreadIdsByIndex(threadIndex, count, null, sysIds);

// If the faulting thread is managed, read its current CLR exception
// (IsThreadManaged is a helper that is not shown on the slide)
if (IsThreadManaged(sysIds[0])) {
    var td = runtime.Threads.First(t => t.OSThreadId == sysIds[0]);
    clrException = td.CurrentException;
}

Page 14:

Our Dump Analysis Workflow

At the end of a shift, operators copy dumps to a network share in the office network

A script goes over the dumps one by one and uses ClrMD to find the root cause of the error

According to a configuration file, the faulting module’s owner is alerted and a ticket is opened in Redmine
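
The format of the configuration file is not shown in the talk. As a minimal sketch, assume it holds one `Module=owner` pair per line; the code below parses such lines and looks up the owner of a faulting module, falling back to a default owner. The class name, file format and all module and owner names are hypothetical, and opening the Redmine ticket itself is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the talk's code.
public static class OwnerRouting
{
    // Parses "Module=owner" lines; blank lines and lines starting with '#' are skipped.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var owners = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("#")) continue;
            string[] parts = line.Split('=', 2);
            if (parts.Length != 2) continue; // ignore malformed lines
            owners[parts[0].Trim()] = parts[1].Trim();
        }
        return owners;
    }

    // Returns the owner of the module, or the fallback when the module is not listed.
    public static string OwnerFor(
        IReadOnlyDictionary<string, string> owners, string module, string fallback) =>
        owners.TryGetValue(module, out string owner) ? owner : fallback;
}
```

A fallback owner keeps dumps from unlisted modules from being dropped silently; whoever receives them can extend the file.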

Page 15:

From Hours to Seconds

Manual, tedious, error-prone dump analysis by red-eyed developers…

…Automatic, happy, untiring ninja script

Page 16:

DEMO

ANALYZE 74 DUMPS IN A FEW MINUTES

Page 17:

Summary

What we do and why we need dumps

Manual analysis process

The holy grail: automatic dump analysis

Our automatic triage workflow

Resources:

◦ The slides: http://tinyurl.com/dumpstlv

◦ ClrMD on GitHub

◦ DumpAnalyzer on GitHub

◦ msos on GitHub

Page 18:

Questions?

Thank You!

DINA GOLDSHTEIN

EMBEDDED TEAM LEADER, BRIGHTSOURCE ENERGY

BLOGS.MICROSOFT.CO.IL/DINAZIL/

@DINAGOZIL

"Retouched Kitty" by Ozan Kilic is licensed under Creative Commons Attribution 2.0