Ftrace How-to, version 0.4 - Latinoware
Red Hat
Daniel ’bristot’ de Oliveira
October 21, 2016
Who am I?
I am Daniel :-)
I am from Brazil! But I’m Italian too...
That means 9 FIFA World Cup championships o/
My father did not allow me to become a truck driver... So I started to study:
B.Sc. Computer Science 2009
M.Sc. Automation Engineering 2014
Ph.D. Automation Engineering 2019
Before Red Hat: 5 years with embedded Linux
at Red Hat: SEG on SBR-Kernel: Real-time and performance
What is trace?
Run-time information
Hey... function foo, called bar...
Hey... function bar returned in 2 us
Hey... the code crossed here, and the var X is 10
Can be enabled/disabled at runtime
Low overhead... mainly when disabled (this is really important)
Generates a *lot* of data!
Dozens of trace lines per microsecond, per CPU!
Trace techniques
Static trace - Compiled in the code
Trace of functions - In the function calls
Dynamic trace - Added dynamically
Kernel tracing
Trace techniques
Static trace - tracepoints
Trace of functions - ftrace
Dynamic trace - kprobes
Ftrace provides an interface for all three techniques
Go!
Please, boot your RHEL7/Fedora VMs
Or run on your machine! it is safe :-)
Ftrace’s interface
Ftrace is built into the kernel
Accessible via debugfs
echo to set
cat to get
On Fedora and on RHEL7 it is mounted by default at:
/sys/kernel/debug/
On RHEL6:
mount -t debugfs debugfs /sys/kernel/debug/
Ftrace’s interface is at /sys/kernel/debug/tracing/
Ftrace’s interface
[root@btt-rhel7 ~]# cd /sys/kernel/debug/tracing/
[root@btt-rhel7 tracing]# ls
available_events max_graph_depth stack_trace_filter
available_filter_functions options trace
available_tracers per_cpu trace_clock
buffer_size_kb printk_formats trace_marker
buffer_total_size_kb README trace_options
current_tracer saved_cmdlines trace_pipe
dyn_ftrace_total_info set_event trace_stat
enabled_functions set_ftrace_filter tracing_cpumask
events set_ftrace_notrace tracing_max_latency
free_buffer set_ftrace_pid tracing_on
function_profile_enabled set_graph_function tracing_thresh
instances snapshot uprobe_events
kprobe_events stack_max_size uprobe_profile
kprobe_profile stack_trace
Starting from function tracer
Trace of kernel functions
Only kernel and only functions
Only kernel functions - no user-space
No macros and no inline functions
Basically: how does it work?
gcc -pg adds a call to mcount at the beginning of each function
mcount receives the address of its caller and of the caller’s caller
mcount calls* the function tracer’s function
which saves the information in the trace buffer
Usual question: wow, so does it mean a lot of overhead?
No: only a small overhead when enabled, and a "nop" when disabled:
When disabled, all mcount calls are turned into nops.
This lecture by Steven explains how it works: video.linux.com/videos/removing-stop-machine-from-the-tracing-infrastructure
Basic ftrace’s interface
available_tracers
cat: show the available tracers
current_tracer
cat: show the current tracer
echo: set the current tracer
trace
cat: print the trace buffer
echo: clear the trace buffer
tracing_on
echo 1: turn the trace on
echo 0: turn the trace off
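The four files above are enough to script a complete tracing session. What follows is a minimal sketch, not an official tool: the function name capture_trace and the TRACING variable are made up for this example, and the default path is the debugfs location used in this how-to.

```shell
# Hypothetical helper, for illustration only.
# TRACING may be overridden; the default is the path used in this how-to.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# usage: capture_trace TRACER [SECONDS]
capture_trace() (
    tracer=$1
    seconds=${2:-1}
    cd "$TRACING" || exit 1
    # Refuse tracers that the running kernel does not offer.
    grep -qw -- "$tracer" available_tracers || {
        echo "tracer '$tracer' is not available" >&2
        exit 1
    }
    echo 0 > tracing_on             # stop tracing while setting up
    echo "$tracer" > current_tracer # select the tracer
    echo > trace                    # clear the trace buffer
    echo 1 > tracing_on             # start tracing
    sleep "$seconds"
    echo 0 > tracing_on             # stop tracing
    cat trace                       # print what was collected
    echo nop > current_tracer       # go back to the default tracer
)
```

As root, something like capture_trace function 1 | head -20 should print the first lines collected during one second.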
Basic ftrace’s interface
[root@btt-rhel7 tracing]# cat available_tracers
blk function_graph wakeup_rt wakeup function nop
[root@btt-rhel7 tracing]# cat current_tracer
nop
[root@btt-rhel7 tracing]# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
Using function tracer
[root@btt-rhel7 tracing]# echo function > current_tracer
[root@btt-rhel7 tracing]# echo 1 > tracing_on
[root@btt-rhel7 tracing]# head -15 trace
# tracer: function
#
# entries-in-buffer/entries-written: 71715/71715 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-2274 [002] .... 2553.416814: mutex_unlock <-rb_simple_write
bash-2274 [002] .... 2553.416816: __fsnotify_parent <-vfs_write
bash-2274 [002] .... 2553.416817: fsnotify <-vfs_write
bash-2274 [002] .... 2553.416817: __srcu_read_lock <-fsnotify
Stopping the trace
[root@btt-rhel7 tracing]# echo 0 > tracing_on
[root@btt-rhel7 tracing]# echo nop > current_tracer
[root@btt-rhel7 tracing]# echo > trace
Graph tracer
It traces the call of functions
But also the return of functions
So, can I get the execution time of a function? YES!
But it has a cost: it is more expensive than the function tracer
But not that much
Function graph tracer
[root@btt-rhel7 tracing]# echo function_graph > current_tracer
[root@btt-rhel7 tracing]# echo 1 > tracing_on
[root@btt-rhel7 tracing]# head -20 trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
3) | tick_do_update_jiffies64() {
3) 0.045 us | _raw_spin_lock();
3) | do_timer() {
3) | update_wall_time() {
3) 0.046 us | _raw_spin_lock_irqsave();
3) 0.047 us | _raw_spin_unlock_irqrestore();
3) 0.617 us | }
3) 0.040 us | calc_global_load();
3) 1.138 us | }
3) 0.042 us | _raw_spin_unlock();
3) 1.938 us | }
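Since the DURATION column is plain text, it can be summed per function with awk. This is only a sketch under some assumptions: the name graph_leaf_times is made up, and only leaf calls (the "name();" lines) are counted, because the closing "}" lines carry a duration but not the function name.

```shell
# Hypothetical helper: per-function totals from function_graph output.
# usage: graph_leaf_times < trace
# Prints: function  number_of_calls  total_time_in_us
graph_leaf_times() {
    awk -F'|' '
        # Keep only lines with a duration ("... us") and a leaf call ("name();").
        $1 ~ /us[ \t]*$/ && $2 ~ /\(\);/ {
            n = split($1, left, " ")   # e.g. "3)" "0.045" "us"
            dur = left[n - 1]          # the number right before "us"
            name = $2
            gsub(/[ \t;]|\(\)/, "", name)
            total[name] += dur
            calls[name]++
        }
        END {
            for (name in total)
                printf "%s %d %.3f\n", name, calls[name], total[name]
        }
    ' | sort
}
```

For example: graph_leaf_times < /sys/kernel/debug/tracing/trace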
Jumping to Tracepoints
Trace points placed in the kernel’s code
Low overhead, mainly when disabled
Each one runs a callback that writes to ftrace’s buffer
They are also known as trace events (e.g. in perf)
Organized by subsystem
subsystem:tracepoint_name
Basic tracepoint’s interface
available_events
cat: show the available events
set_event
cat: show the enabled events
echo: enable/clear events
Basic tracepoint’s interface
[root@btt-rhel7 tracing]# cat available_events | grep irq_handler
irq:irq_handler_exit
irq:irq_handler_entry
[root@btt-rhel7 tracing]# cat available_events | wc -l
1200
[root@btt-rhel7 tracing]# echo irq:irq_handler_exit > set_event
[root@btt-rhel7 tracing]# cat set_event
irq:irq_handler_exit
[root@btt-rhel7 tracing]# echo irq:irq_handler_entry >> set_event
[root@btt-rhel7 tracing]# cat available_events | grep sched_wakeup >> set_event
[root@btt-rhel7 tracing]# cat set_event
irq:irq_handler_exit
irq:irq_handler_entry
sched:sched_wakeup_new
sched:sched_wakeup
[root@btt-rhel7 tracing]# echo > set_event
[root@btt-rhel7 tracing]# cat set_event
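The grep into set_event pattern used above can be wrapped in a small helper. This is a sketch, assuming the same two files; the names enable_events, disable_all_events and TRACING are hypothetical.

```shell
# Hypothetical helpers, for illustration only.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# usage: enable_events PATTERN
# Appends every available event whose name matches PATTERN to set_event.
enable_events() (
    cd "$TRACING" || exit 1
    matches=$(grep -- "$1" available_events) || {
        echo "no event matches '$1'" >&2
        exit 1
    }
    # One event per line; >> keeps the events that are already enabled.
    printf '%s\n' "$matches" >> set_event
)

# usage: disable_all_events
disable_all_events() (
    cd "$TRACING" || exit 1
    echo > set_event                # an empty write clears the list
)
```

For example, enable_events sched_wakeup enables both sched:sched_wakeup and sched:sched_wakeup_new, as in the session above.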
Tracepoints output
[root@btt-rhel7 tracing]# cat available_events | grep irq_handler > set_event
[root@btt-rhel7 tracing]# head -20 trace
# tracer: nop
#
# entries-in-buffer/entries-written: 150/150 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [001] d.h. 3623.817286: irq_handler_entry: irq=42 name=virtio0-input.0
<idle>-0 [001] d.h. 3623.817290: irq_handler_exit: irq=42 ret=handled
<idle>-0 [003] d.h. 3624.175584: irq_handler_entry: irq=14 name=ata_piix
<idle>-0 [003] d.h. 3624.175681: irq_handler_exit: irq=14 ret=handled
<idle>-0 [003] d.h. 3624.175689: irq_handler_entry: irq=14 name=ata_piix
<idle>-0 [003] d.h. 3624.175706: irq_handler_exit: irq=14 ret=handled
<idle>-0 [001] d.h. 3624.186418: irq_handler_entry: irq=42 name=virtio0-input.0
<idle>-0 [001] d.h. 3624.186421: irq_handler_exit: irq=42 ret=handled
<idle>-0 [001] d.h. 3625.264161: irq_handler_entry: irq=42 name=virtio0-input.0
Ftrace and tracepoints - together is better
<idle>-0 [002] .N.. 173.728450: schedule_preempt_disabled <-cpu_startup_entry
<idle>-0 [002] .N.. 173.728450: __schedule <-schedule_preempt_disabled
<idle>-0 [002] .N.. 173.728450: rcu_note_context_switch <-__schedule
<idle>-0 [002] .N.. 173.728450: _raw_spin_lock_irq <-__schedule
<idle>-0 [002] dN.. 173.728451: pre_schedule_idle <-__schedule
<idle>-0 [002] dN.. 173.728451: idle_exit_fair <-pre_schedule_idle
<idle>-0 [002] dN.. 173.728451: put_prev_task_idle <-__schedule
<idle>-0 [002] dN.. 173.728451: pick_next_task_fair <-__schedule
<idle>-0 [002] dN.. 173.728451: clear_buddies <-pick_next_task_fair
<idle>-0 [002] dN.. 173.728452: __dequeue_entity <-pick_next_task_fair
<idle>-0 [002] d... 173.728452: sched_switch: prev_comm=swapper/2 prev_pid=0
prev_prio=120 prev_state=R ==> next_comm=virt-what next_pid=2325 next_prio=120
grep-2325 [002] d... 173.728454: finish_task_switch <-__schedule
grep-2325 [002] .... 173.728455: __mmdrop <-finish_task_switch
grep-2325 [002] .... 173.728455: pgd_free <-__mmdrop
grep-2325 [002] .... 173.728455: _raw_spin_lock <-pgd_free
grep-2325 [002] .... 173.728455: _raw_spin_unlock <-pgd_free
grep-2325 [002] .... 173.728455: free_pages <-pgd_free
grep-2325 [002] .... 173.728456: free_pages.part.63 <-free_pages
grep-2325 [002] .... 173.728456: __free_pages <-free_pages.part.63
But it is too much information!
All the functions are too much!
It is possible to filter the trace of functions
And it is also possible to filter tracepoints based on their data.
Let’s try it, starting with ftrace.
Ftrace’s filter interface
available_filter_functions
cat: show the functions that can be filtered
set_ftrace_filter
cat: show the functions that will be traced
echo: set/clear the functions that will be traced
set_ftrace_notrace
cat: show the functions that will NOT be traced
echo: set/clear the functions that will NOT be traced
set_ftrace_pid
cat: show the pid that will be traced
echo: set/clear the pid that will be traced
Filtering the trace of functions
[root@btt-rhel7 tracing]# cat available_filter_functions | wc -l
29428
[root@btt-rhel7 tracing]# echo mutex_lock > set_ftrace_filter
[root@btt-rhel7 tracing]# echo mutex_unlock >> set_ftrace_filter
[root@btt-rhel7 tracing]# cat set_ftrace_filter
mutex_unlock
mutex_lock
[root@btt-rhel7 tracing]# echo > set_ftrace_filter
[root@btt-rhel7 tracing]# cat set_ftrace_filter
#### all functions enabled ####
Filtering the trace of functions
[root@btt-rhel7 tracing]# echo mutex_lock mutex_unlock > set_ftrace_filter
[root@btt-rhel7 tracing]# echo function > current_tracer
[root@btt-rhel7 tracing]# echo 2294 > set_ftrace_pid
[root@btt-rhel7 tracing]# echo 1 > tracing_on
[root@btt-rhel7 tracing]# head -20 trace
# tracer: function
#
# entries-in-buffer/entries-written: 490/490 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-2294 [001] .... 111801.975119: mutex_unlock <-rb_simple_write
bash-2294 [001] .... 111801.975134: mutex_lock <-trace_array_put
bash-2294 [001] .... 111801.975135: mutex_unlock <-trace_array_put
bash-2294 [001] .... 111801.975360: mutex_lock <-n_tty_write
bash-2294 [001] .... 111801.975368: mutex_unlock <-n_tty_write
bash-2294 [001] .... 111801.975369: mutex_unlock <-tty_write_unlock
bash-2294 [001] .... 111801.975462: mutex_lock <-tty_ioctl
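Even a filtered trace can be long, so it helps to summarize it. The sketch below counts how many times each function appears in function tracer output like the one above; the name count_traced_calls is made up, and the parsing assumes the "function <-caller" layout shown in this how-to.

```shell
# Hypothetical helper: count calls per function in function tracer output.
# usage: count_traced_calls < trace
count_traced_calls() {
    awk '
        /^#/ { next }                  # skip the header lines
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^<-/) {      # "<-caller" follows the traced function
                    count[$(i - 1)]++
                    break
                }
        }
        END { for (f in count) printf "%d %s\n", count[f], f }
    ' | sort -k1,1nr -k2
}
```

For example: count_traced_calls < /sys/kernel/debug/tracing/trace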
Filter and function graph tracer
[root@btt-rhel7 tracing]# echo function_graph > current_tracer
[root@btt-rhel7 tracing]# head -20 trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) 0.127 us | mutex_lock();
2) 0.112 us | mutex_unlock();
2) 0.055 us | mutex_unlock();
2) 0.127 us | mutex_lock();
2) 0.122 us | mutex_lock();
2) 0.129 us | mutex_lock();
2) 0.049 us | mutex_lock();
2) 0.050 us | mutex_unlock();
2) 0.206 us | mutex_lock();
2) 0.063 us | mutex_lock();
2) 0.081 us | mutex_unlock();
2) 0.063 us | mutex_unlock();
2) 0.054 us | mutex_lock();
2) 0.062 us | mutex_unlock();
2) 0.087 us | mutex_unlock();
2) 0.066 us | mutex_unlock();
Function filtering: wildcards and modules
[root@btt-rhel7 tracing]# echo mutex_* > set_ftrace_filter
[root@btt-rhel7 tracing]# cat set_ftrace_filter
mutex_spin_on_owner
mutex_unlock
mutex_lock
mutex_trylock
mutex_lock_interruptible
mutex_lock_killable
[root@btt-rhel7 tracing]# echo :mod:dm_mirror:* > set_ftrace_filter
[root@btt-rhel7 tracing]# head -10 set_ftrace_filter
mirror_iterate_devices [dm_mirror]
mirror_postsuspend [dm_mirror]
mirror_status [dm_mirror]
mirror_resume [dm_mirror]
fail_mirror [dm_mirror]
wakeup_mirrord [dm_mirror]
delayed_wake_fn [dm_mirror]
free_context [dm_mirror]
mirror_dtr [dm_mirror]
trigger_event [dm_mirror]
...
Function filtering: graph function
set_graph_function: the function graph tracer turns the trace on at the call of this function, and off on its return
[root@btt-rhel7 tracing]# echo ttwu_do_wakeup > set_graph_function
[root@btt-rhel7 tracing]# echo function_graph > current_tracer
[root@btt-rhel7 tracing]# echo 1 > tracing_on
[root@btt-rhel7 tracing]# head -20 trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
3) | ttwu_do_wakeup() {
3) | check_preempt_curr() {
3) 0.077 us | resched_task();
3) 0.619 us | }
3) 1.066 us | }
1) | ttwu_do_wakeup() {
1) | check_preempt_curr() {
1) | check_preempt_wakeup() {
1) 0.078 us | update_curr();
1) 0.076 us | wakeup_gran.isra.54();
1) 1.175 us | }
1) 1.679 us | }
1) 2.159 us | }
Filtering tracepoints
There’s no need to filter which tracepoint - you already filtered it by choosing :-)
But you can filter under which conditions a tracepoint is printed, based on its fields.
Tracepoints are more than just *printks*
They are structured information
Basic tracepoint’s filtering interface
Do you recall that events are classified by subsystems?
Each event’s options are in the directory:
events/$SUBSYSTEM/$EVENT_NAME
e.g.: events/irq/irq_handler_entry
Inside each directory there are these files:
id: the ID of the event
enable: echo 1 to enable, 0 to disable
filter: get/set filter options
format: information about the data gathered by this tracepoint
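The filter and enable files can be driven from a tiny helper. This is a sketch only: trace_event_if, untrace_event and TRACING are hypothetical names, and the filter expression must use fields listed in the event's format file.

```shell
# Hypothetical helpers, for illustration only.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# usage: trace_event_if SUBSYSTEM EVENT 'FILTER EXPRESSION'
trace_event_if() (
    dir=$TRACING/events/$1/$2
    [ -d "$dir" ] || { echo "no such event: $1:$2" >&2; exit 1; }
    # The kernel rejects expressions that use unknown fields.
    echo "$3" > "$dir/filter" || exit 1
    echo 1 > "$dir/enable"
)

# usage: untrace_event SUBSYSTEM EVENT
untrace_event() (
    dir=$TRACING/events/$1/$2
    [ -d "$dir" ] || exit 1
    echo 0 > "$dir/enable"          # disable the event
    echo 0 > "$dir/filter"          # writing 0 clears the filter
)
```

For example: trace_event_if irq irq_handler_entry 'irq == 14'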
Filtering tracepoints: without filter
[root@btt-rhel7 tracing]# cat available_events | grep irq:irq_
irq:irq_handler_exit
irq:irq_handler_entry
[root@btt-rhel7 tracing]# cat available_events | grep irq:irq_ > set_event
[root@btt-rhel7 tracing]# tail -10 trace
<idle>-0 [001] d.h. 1543.014323: irq_handler_entry: irq=43 name=virtio0-input.0
<idle>-0 [001] d.h. 1543.014328: irq_handler_exit: irq=43 ret=handled
<idle>-0 [001] d.h. 1543.015088: irq_handler_entry: irq=43 name=virtio0-input.0
<idle>-0 [001] d.h. 1543.015090: irq_handler_exit: irq=43 ret=handled
kworker/3:0-2299 [003] d.h. 1543.232015: irq_handler_entry: irq=14 name=ata_piix
kworker/3:0-2299 [003] d.h. 1543.232147: irq_handler_exit: irq=14 ret=handled
kworker/3:0-2299 [003] d.h. 1543.232158: irq_handler_entry: irq=14 name=ata_piix
kworker/3:0-2299 [003] d.h. 1543.232196: irq_handler_exit: irq=14 ret=handled
<idle>-0 [001] d.h. 1543.534487: irq_handler_entry: irq=43 name=virtio0-input.0
<idle>-0 [001] d.h. 1543.534492: irq_handler_exit: irq=43 ret=handled
Filtering tracepoints!
[root@btt-rhel7 tracing]# cd events/irq/irq_handler_entry/
[root@btt-rhel7 irq_handler_entry]# ls
enable filter format id
[root@btt-rhel7 irq_handler_entry]# cat format
name: irq_handler_entry
ID: 114
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int irq; offset:8; size:4; signed:1;
field:__data_loc char[] name; offset:12; size:4; signed:1;
print fmt: "irq=%d name=%s", REC->irq, __get_str(name)
[root@btt-rhel7 irq_handler_entry]# echo 'irq == 14' > filter
[root@btt-rhel7 irq_handler_entry]# cd ../irq_handler_exit/
[root@btt-rhel7 irq_handler_exit]# echo 'irq == 14' > filter
Filtered tracepoints!
[root@btt-rhel7 irq_handler_exit]# cd ../../../
[root@btt-rhel7 tracing]# tail -10 trace
kworker/3:0-2299 [003] d.h. 2305.087986: irq_handler_entry: irq=14 name=ata_piix
kworker/3:0-2299 [003] d.h. 2305.088010: irq_handler_exit: irq=14 ret=handled
kworker/3:0-2299 [003] d.h. 2307.135803: irq_handler_entry: irq=14 name=ata_piix
kworker/3:0-2299 [003] d.h. 2307.135852: irq_handler_exit: irq=14 ret=handled
kworker/3:0-2299 [003] d.h. 2307.135858: irq_handler_entry: irq=14 name=ata_piix
kworker/3:0-2299 [003] d.h. 2307.135873: irq_handler_exit: irq=14 ret=handled
kworker/3:0-2299 [003] d.h. 2309.183882: irq_handler_entry: irq=14 name=ata_piix
kworker/3:0-2299 [003] d.h. 2309.183966: irq_handler_exit: irq=14 ret=handled
kworker/3:0-2299 [003] d.h. 2309.183973: irq_handler_entry: irq=14 name=ata_piix
kworker/3:0-2299 [003] d.h. 2309.183998: irq_handler_exit: irq=14 ret=handled
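With entry and exit events side by side, the time spent in each handler is just the difference between two timestamps. The sketch below pairs them per CPU and per IRQ; the name irq_durations is made up, and the parsing assumes the line layout shown in this how-to.

```shell
# Hypothetical helper: time between irq_handler_entry and irq_handler_exit.
# usage: irq_durations < trace
# Prints: [cpu] irq=N <duration> us
irq_durations() {
    awk '
        # Convert a "seconds.microseconds:" timestamp to microseconds.
        function usecs(ts,    p) {
            sub(/:$/, "", ts)
            split(ts, p, /\./)
            return p[1] * 1000000 + p[2]
        }
        {
            for (i = 4; i < NF; i++) {
                if ($i != "irq_handler_entry:" && $i != "irq_handler_exit:")
                    continue
                key = $(i - 3) " " $(i + 1)   # "[cpu] irq=N"
                now = usecs($(i - 1))         # the timestamp precedes the event
                if ($i == "irq_handler_entry:")
                    start[key] = now
                else if (key in start) {
                    printf "%s %d us\n", key, now - start[key]
                    delete start[key]
                }
                break
            }
        }
    '
}
```

For example: irq_durations < /sys/kernel/debug/tracing/trace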
A more complex filter!
[root@btt-rhel7 tracing]# cd events/sched/sched_wakeup
[root@btt-rhel7 sched_wakeup]# cat format
name: sched_wakeup
ID: 311
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char comm[32]; offset:8; size:16; signed:1;
field:pid_t pid; offset:24; size:4; signed:1;
field:int prio; offset:28; size:4; signed:1;
field:int success; offset:32; size:4; signed:1;
field:int target_cpu; offset:36; size:4; signed:1;
print fmt: "comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
REC->comm, REC->pid, REC->prio, REC->success, REC->target_cpu
[root@btt-rhel7 sched_wakeup]# echo "prio < 100" > filter
[root@btt-rhel7 sched_wakeup]# echo 1 > enable
Let’s put more fun on it!
[root@btt-rhel7 sched_wakeup]# cd ../sched_switch/
[root@btt-rhel7 sched_switch]# cat format
name: sched_switch
ID: 309
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[32]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[32]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==>
next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state & (1024-1) ? __print_flags(REC->prev_state &
(1024-1), "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" },
{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R", REC->prev_state & 1024 ? "+" : "", REC->next_comm,
REC->next_pid, REC->next_prio
[root@btt-rhel7 sched_switch]# echo "(prev_state == 1 && prev_prio < 100) || next_prio < 100 " > filter
[root@btt-rhel7 sched_switch]# echo 1 > enable
Let’s put fun on it!
[root@btt-rhel7 sched_switch]# cd ../../../
[root@btt-rhel7 tracing]# tail -9 trace
<idle>-0 [001] dNh. 6155.077138: sched_wakeup: comm=watchdog/1 pid=19 prio=0 success=1
target_cpu=001
<idle>-0 [001] d... 6155.077165: sched_switch: prev_comm=swapper/1 prev_pid=0
prev_prio=120 prev_state=R ==> next_comm=watchdog/1
next_pid=19 next_prio=0
watchdog/1-19 [001] d... 6155.077181: sched_switch: prev_comm=watchdog/1 prev_pid=19
prev_prio=0 prev_state=S ==>
next_comm=swapper/1 next_pid=0 next_prio=120
<idle>-0 [002] dNh. 6155.089144: sched_wakeup: comm=watchdog/2 pid=24 prio=0 success=1
target_cpu=002
<idle>-0 [002] d... 6155.089166: sched_switch: prev_comm=swapper/2 prev_pid=0
prev_prio=120 prev_state=R ==> next_comm=watchdog/2
next_pid=24 next_prio=0
watchdog/2-24 [002] d... 6155.089181: sched_switch: prev_comm=watchdog/2 prev_pid=24
prev_prio=0 prev_state=S ==> next_comm=swapper/2
next_pid=0 next_prio=120
<idle>-0 [003] dNh. 6155.101158: sched_wakeup: comm=watchdog/3 pid=29 prio=0 success=1
target_cpu=003
<idle>-0 [003] d... 6155.101176: sched_switch: prev_comm=swapper/3 prev_pid=0
prev_prio=120 prev_state=R ==> next_comm=watchdog/3
next_pid=29 next_prio=0
watchdog/3-29 [003] d... 6155.101189: sched_switch: prev_comm=watchdog/3 prev_pid=29
prev_prio=0 prev_state=S ==> next_comm=swapper/3
next_pid=0 next_prio=120
Ah! per_cpu trace! And trace_pipe! And buffer size!
That is simple! And useful!
Each CPU has a directory inside the per_cpu/ directory
For example, for CPU 2: per_cpu/cpu2/
Each CPU has its own trace at: per_cpu/cpuX/trace
Trace pipe: run cat per_cpu/cpuX/trace_pipe
It is also available for all CPUs at once: trace_pipe
The size of the trace buffer is defined per CPU in the file buffer_size_kb
Triggering
Ok, it is nice to filter, but sometimes we need more!
I want to start the trace after the occurrence of an event
And I want to stop the trace after another event happens!
Or I want to enable an event after the call of a function
Or I want to get the stack trace when a tracepoint is hit
ok, let’s try it!
Triggering on function trace
The interface for triggering is the filter file: set_ftrace_filter
echo 'function:action:times' > set_ftrace_filter to set
echo '!function:action:times' > set_ftrace_filter to clear
Let’s start by turning the tracing on and off
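The trigger format can be wrapped in two small helpers. This is a sketch that follows the format on the slide; set_function_trigger, clear_function_trigger and TRACING are hypothetical names. It appends with >> so that plain function filters already in the file are kept.

```shell
# Hypothetical helpers, for illustration only.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# usage: set_function_trigger FUNCTION ACTION [TIMES]
set_function_trigger() (
    spec="$1:$2${3:+:$3}"           # e.g. irq_exit:traceoff:5
    printf '%s\n' "$spec" >> "$TRACING/set_ftrace_filter"
)

# usage: clear_function_trigger FUNCTION ACTION [TIMES]
clear_function_trigger() (
    spec="$1:$2${3:+:$3}"
    # A leading "!" removes the trigger instead of adding it.
    printf '!%s\n' "$spec" >> "$TRACING/set_ftrace_filter"
)
```

For example: set_function_trigger irq_exit traceoff 5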
Triggering trace on and off - from a function
[root@btt-rhel7 tracing]# echo 0 > tracing_on
[root@btt-rhel7 tracing]# echo irq_exit:traceoff:5 irq_enter:traceon:5 > set_ftrace_filter
[root@btt-rhel7 tracing]# echo function > current_tracer
[root@btt-rhel7 tracing]# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 70/70 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-2372 [001] d... 2344.896123: irq_enter <-smp_apic_timer_interrupt
bash-2372 [001] d... 2344.896124: rcu_irq_enter <-irq_enter
bash-2372 [001] d.h. 2344.896124: exit_idle <-smp_apic_timer_interrupt
bash-2372 [001] d.h. 2344.896124: local_apic_timer_interrupt <-smp_apic_timer_interrupt
bash-2372 [001] d.h. 2344.896125: hrtimer_interrupt <-local_apic_timer_interrupt
bash-2372 [001] d.h. 2344.896125: _raw_spin_lock <-hrtimer_interrupt
[...]
bash-2372 [001] d.h. 2344.896132: _raw_spin_unlock <-hrtimer_interrupt
bash-2372 [001] d.h. 2344.896132: tick_program_event <-hrtimer_interrupt
bash-2372 [001] d.h. 2344.896132: clockevents_program_event <-tick_program_event
bash-2372 [001] d.h. 2344.896132: ktime_get <-clockevents_program_event
bash-2372 [001] d.h. 2344.896132: lapic_next_deadline <-clockevents_program_event
Triggering events on and off - from a function
[root@btt-rhel7 tracing]# echo 'irq_exit:disable_event:sched:sched_wakeup' > set_ftrace_filter
[root@btt-rhel7 tracing]# echo 'irq_enter:enable_event:sched:sched_wakeup' > set_ftrace_filter
[root@btt-rhel7 tracing]# head -20 trace
# tracer: nop
#
# entries-in-buffer/entries-written: 467/15199 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [003] dNh. 5605.176671: sched_wakeup:
comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003
<idle>-0 [003] dNh. 5605.996414: sched_wakeup:
comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003
<idle>-0 [003] dNh. 5609.176670: sched_wakeup:
comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003
<idle>-0 [003] dNh. 5613.176671: sched_wakeup:
comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003
<idle>-0 [003] dNh. 5613.890256: sched_wakeup:
comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003
<idle>-0 [003] dNh. 5615.996420: sched_wakeup:
comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003
<idle>-0 [003] dNh. 5617.176668: sched_wakeup:
comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003
Triggering on events
The interface is the "trigger" file in the event directory
Format:
# echo '[!]command[:count] [if filter]' > trigger
Commands:
enable_event/disable_event
traceon/traceoff
snapshot
stacktrace
not available on RHEL7 :-( (yet?)
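Because the trigger file may be missing, a script should check for it before writing. The sketch below builds the command string in the format above; set_event_trigger and TRACING are hypothetical names.

```shell
# Hypothetical helper, for illustration only.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# usage: set_event_trigger SUBSYSTEM EVENT COMMAND [COUNT] [FILTER]
set_event_trigger() (
    file=$TRACING/events/$1/$2/trigger
    [ -e "$file" ] || {
        echo "no trigger file for $1:$2 (kernel too old?)" >&2
        exit 1
    }
    # Build "command[:count] [if filter]".
    spec="$3${4:+:$4}${5:+ if $5}"
    printf '%s\n' "$spec" >> "$file"
)
```

For example: set_event_trigger sched sched_wakeup stacktrace 10 'prio < 100'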
Triggering a stack trace - from an event
[root@kiron bristot]# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup
[root@kiron sched_wakeup]# ls
enable filter format id trigger
[root@kiron sched_wakeup]# echo 'stacktrace:10 if prio < 100' > trigger
[root@kiron sched_wakeup]# cat ../../../trace
<idle>-0 [003] dNh. 7762.589836: <stack trace>
=> ftrace_raw_event_sched_wakeup_template
=> ttwu_do_wakeup
=> ttwu_do_activate.constprop.90
=> try_to_wake_up
=> wake_up_process
=> hrtimer_wakeup
=> __run_hrtimer
=> hrtimer_interrupt
=> local_apic_timer_interrupt
=> smp_apic_timer_interrupt
=> apic_timer_interrupt
=> cpuidle_enter
=> cpu_startup_entry
=> start_secondary
pulseaudio-2972 [003] dN.. 7762.590148: <stack trace>
trace-cmd
A command line tool for ftrace
It is useful for collecting data on customers’ systems
If you know how to use ftrace, you know how to use trace-cmd
Tip: ftrace and vmcore
It is possible to extract the trace from a vmcore!
It helps to understand what happened before the crash
crash> extend /usr/lib64/crash/extensions/trace.so
crash> trace dump -t data.dat
crash> pwd
/cores/retrace/tasks/968181176/misc
crash> ls
bt-a bt-filter data.dat dwysocha-automated-analysis.txt
dwysocha-rhst-search-rip-string.txt retrace-log run_crash
sys sys-c
More info
LWN -> Kernel -> Kernel Index -> Kernel Tracing
Kernel Documentation: Documentation/trace/
ping bristot@sbr-kernel
Part I
Understanding the Linux kernel execution model
Operating system: What books say it is:
IMHO: Netherlands’ Flag!
Before starting...
Let’s redefine hardware
Another point of view of Hardware
Another point of view of Hardware
And we fit the kernel here:
And protect it
And the kernel runs...
How does the kernel run?
There are two ways to run the kernel’s code
Either by ‘calling the kernel’
Or by running a kernel thread
Calling the kernel
We can think of the kernel as a library of functions that are activated to serve an event
These events are generated either:
by the hardware, or
by the software.
How does the kernel receive these events?
Via interrupts
What is an interrupt?
Interrupts are events that indicate that a condition exists somewhere in the system, the processor, or within the currently executing program or task that requires the attention of a processor. They typically result in a forced transfer of execution from the currently running program or task to a special software routine or task called an interrupt handler - Intel 64 and IA-32 Architectures Software Developer’s Manual.
Types of interrupts
Hardware activated:
Asynchronous
Software activated:
Synchronous
Exceptions:
Faults: correctable; the offending instruction is retried
Traps: often for debugging; the instruction is not retried
Aborts: not correctable; severe errors!
Software interrupts:
System calls!
A Hardware Interrupt
Hardware Interrupt: Another point of view
How about process?
What is a process?
A process is a virtual memory context
Running on a protected ring
Where the threads run
Process and Threads
A process is a ‘virtual’ environment where threads have their:
code
data
stack
resources: e.g. sockets, file descriptors, and so on.
And they run: threads are scheduled to run on a processor
There’s no ‘software layer between the thread and the processor’
OK, that flag, I mean, diagram fits Java :P
But sometimes a thread needs more resources...
These resources are managed by the kernel
So: threads run with Operating System support!
Not ON the Operating System.
Hey kernel! I need a resource!
How does a thread ask the kernel for a resource?
Hey kernel! I need a resource!
It runs the kernel :-)...
Threads running on kernel space
A thread can run kernel code in kernel-space
And we say that the kernel runs on behalf of the thread
Each thread has a stack in kernel context
How does a thread jump to ‘kernel context on ring 0’?
Generating a software interrupt o/
Thread running on kernel: system call
Thread running on kernel: or via exceptions
Another point of view:
Kernel threads
They are threads that run in the kernel address space.
They are like regular threads - but they only run in kernel space.
Finally, the kernel threads:
So, we have the following ways to run the kernel’s code
IRQ - Hardware activated
Soft IRQ
Process threads:
Via system call
Via exceptions
Kernel threads:
Run only in kernel-space
This explains how! But not when!
How does the system decide to run an IRQ or a thread?
Hardware IRQs
They are asynchronous
The kernel can’t control when they will run
They start running: they are not scheduled to run!
But it can control whether they can be activated
Only maskable interrupts
They run until they finish: the kernel can’t put them to sleep
But they can suffer interference from another IRQ
And they can block on spinlocks
IRQ running
Threads
They are activated in the kernel context
sched_wakeup
Because they go to sleep in the kernel context
Mainly via system call
Most common states
S - Interruptible
D - Uninterruptible
R is the Runnable state
But it does not mean that they are running!
Another thread can be running at the time
And the thread is waiting to be scheduled...
States of a thread
Thread sleeping/waking up
Schedulers
Real-Time dynamic priority: DEADLINE
Each task has a:
Period
Execution time - or budget
Deadline
Closer deadline - higher priority
Real-Time fixed priority: FIFO/RR
Each task has a fixed priority, out of 99 possible:
User-space: 1 (lowest) to 99 (highest)
Kernel-space: 98 (lowest) to 0 (highest)
The highest priority thread runs
Tasks with the same priority:
FIFO: each task runs until it finishes
RR: tasks share the CPU time in a round-robin fashion
Schedulers
Fair Scheduler: OTHER
It provides the same amount of CPU time to each runnable task in a period.
A less nice task receives more time in a period.
The nice value is internally mapped to a priority
Kernel-space: 100 to 139.
IDLE: Waits on kernel
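The prio values seen in the sched_wakeup and sched_switch traces are the kernel-space ones. A small sketch of the usual mapping, assuming kernel prio = 99 - real-time priority and kernel prio = 120 + nice; the function names are made up for this example.

```shell
# Hypothetical helpers: user-space priority -> prio shown in the trace.

# usage: rt_prio_to_kernel RT_PRIORITY   (FIFO/RR, 1..99)
rt_prio_to_kernel() {
    echo $((99 - $1))
}

# usage: nice_to_kernel NICE             (OTHER, -20..19)
nice_to_kernel() {
    echo $((120 + $1))
}
```

This is why a filter such as 'prio < 100' selects only real-time tasks: every fair task has a prio of 100 or more.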
Scheduling
Is there a Deadline task ready?
Get the one with the closest deadline
Is there an RT task ready?
Get the one with the highest priority
Is there a Fair task ready?
Get the next one to run in the fair fashion
Otherwise, enter the idle state.
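The order of the questions above is the whole point, so here it is as a toy function. It is only an illustration of the decision order, not kernel code; pick_class and its arguments (the number of ready tasks in each class) are made up.

```shell
# Toy model of the decision order; not kernel code.
# usage: pick_class N_DEADLINE N_RT N_FAIR   (numbers of ready tasks)
pick_class() {
    if   [ "$1" -gt 0 ]; then echo deadline   # closest deadline first
    elif [ "$2" -gt 0 ]; then echo rt         # then the highest RT priority
    elif [ "$3" -gt 0 ]; then echo fair       # then the fair scheduler
    else                      echo idle       # nothing to run
    fi
}
```

For example, pick_class 0 1 5 prints rt: a single ready RT task runs before any number of fair tasks.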
Scheduling
Conclusion
Applications do not run on the OS - they run on the hardware
The OS is responsible for providing the environment and resources
The kernel is activated by interrupts.
From hardware, and
From software.
Threads run on kernel-space
Threads sleep on kernel space
and the kernel schedules the threads
The end.
Thanks for listening.