eBPF Trace from Kernel to Userspace

74
eBPF Trace from Kernel to Userspace Gary Lin SUSE Labs Software Engineer Technology Sharing Day 2016

Transcript of eBPF Trace from Kernel to Userspace

Page 1: eBPF Trace from Kernel to Userspace

eBPFTrace from Kernel to Userspace

Gary LinSUSE LabsSoftware Engineer

Technology Sharing Day

2016

Page 2: eBPF Trace from Kernel to Userspace

Tracer

Page 3: eBPF Trace from Kernel to Userspace

tick_nohz_idle_enterset_cpu_sd_state_idleup_write__tick_nohz_idle_enterktime_getuprobe_mmapread_hpetvma_set_page_protvma_wants_writenotify

rcu_needs_cpufputget_next_timer_interrupt_raw_spin_lockhrtimer_get_next_event_raw_spin_lock_irqsave_raw_spin_unlock_irqrestoresyscall_trace_leave _raw_write_unlock_irqrestore__audit_syscall_exitpath_putdputmntput

up_write

rax: 0x0000000000000000rbx: 0xffff88012b5a5a28rcx: 0xffff8800987c18e0rdx: 0x0000000000000000rsi: 0xffff88012b439f20rdi: 0xffff88012b464628rbp: 0xffff8800959e3d98

Page 4: eBPF Trace from Kernel to Userspace

kprobe

Kernel

Userspace

uprobe

/sys/kernel/debug/tracing/kprobe_events

/sys/kernel/debug/tracing/uprobe_events

Page 5: eBPF Trace from Kernel to Userspace

eBPF

Page 6: eBPF Trace from Kernel to Userspace

BPF?

Page 7: eBPF Trace from Kernel to Userspace

Berkeley Packet Filter

Page 8: eBPF Trace from Kernel to Userspace

BPF

No RedBPF Program

Page 9: eBPF Trace from Kernel to Userspace

The BSD Packet Filter:A New Architecture for User-level

Packet CaptureDecember 19, 1992

Page 10: eBPF Trace from Kernel to Userspace

SCO lawsuit, August 2003

Page 11: eBPF Trace from Kernel to Userspace

Old

Page 12: eBPF Trace from Kernel to Userspace

Stable

Page 13: eBPF Trace from Kernel to Userspace

BPF ASM

ldh [12]jne #0x800, dropldb [23]jneq #1, drop# get a random uint32 numberld randmod #4jneq #1, dropret #-1drop: ret #0

Page 14: eBPF Trace from Kernel to Userspace

BPF Bytecode

struct sock_filter code[] = { { 0x28, 0, 0, 0x0000000c }, { 0x15, 0, 8, 0x000086dd }, { 0x30, 0, 0, 0x00000014 }, { 0x15, 2, 0, 0x00000084 }, { 0x15, 1, 0, 0x00000006 }, { 0x15, 0, 17, 0x00000011 }, { 0x28, 0, 0, 0x00000036 }, { 0x15, 14, 0, 0x00000016 }, { 0x28, 0, 0, 0x00000038 }, { 0x15, 12, 13, 0x00000016 }, ...};

Page 15: eBPF Trace from Kernel to Userspace

Virtual Machinekind of

Page 16: eBPF Trace from Kernel to Userspace

BPF JIT

Page 17: eBPF Trace from Kernel to Userspace

BPFBytecode

NativeMachine

CodeBPF JIT

Page 18: eBPF Trace from Kernel to Userspace

$ find arch/ -name bpf_jit*arch/sparc/net/bpf_jit_comp.carch/sparc/net/bpf_jit_asm.Sarch/sparc/net/bpf_jit.harch/arm/net/bpf_jit_32.carch/arm/net/bpf_jit_32.harch/arm64/net/bpf_jit_comp.carch/arm64/net/bpf_jit.harch/powerpc/net/bpf_jit_comp.carch/powerpc/net/bpf_jit_asm.Sarch/powerpc/net/bpf_jit.harch/s390/net/bpf_jit_comp.carch/s390/net/bpf_jit.Sarch/s390/net/bpf_jit.harch/mips/net/bpf_jit.carch/mips/net/bpf_jit_asm.Sarch/mips/net/bpf_jit.harch/x86/net/bpf_jit_comp.carch/x86/net/bpf_jit.S

Page 19: eBPF Trace from Kernel to Userspace

Stable and Efficient

Page 20: eBPF Trace from Kernel to Userspace

eBPF

Page 21: eBPF Trace from Kernel to Userspace

Extended BPF

Page 22: eBPF Trace from Kernel to Userspace

eBPF

userspacekernel

eBPFProgramBPF_PROG_LOAD

At most4096instructions

Page 23: eBPF Trace from Kernel to Userspace

Extended RegisterseBPF VerifiereBPF MapProbe Event

Page 24: eBPF Trace from Kernel to Userspace

Extended RegisterseBPF VerifiereBPF MapProbe Event

Page 25: eBPF Trace from Kernel to Userspace

Classic BPF: 32 bitExtended BPF: 64 bit

Page 26: eBPF Trace from Kernel to Userspace

Classic BPF: A, X (2)Extended BPF: R0 – R9 (10)

R10 (read-only)

Page 27: eBPF Trace from Kernel to Userspace

For x86_64 JIT

R0 → raxR1 → rdiR2 → rsiR3 → rdxR4 → rcxR5 → r8R6 → rbxR7 → r13R8 → r14R9 → r15R10 → rbp

Page 28: eBPF Trace from Kernel to Userspace

BPF Calling Convention

● R0

Return value from in-kernel function, and exit value for eBPF program

● R1 – R5

Arguments from eBPF program to in-kernel function

● R6 – R9

Callee saved registers that in-kernel function will preserve

● R10

Read-only frame pointer to access stack

Page 29: eBPF Trace from Kernel to Userspace

Extended RegisterseBPF VerifiereBPF MapProbe Event

Page 30: eBPF Trace from Kernel to Userspace

Two-Step Verification

Page 31: eBPF Trace from Kernel to Userspace

Step 1

Directed Acyclic GraphCheck

Page 32: eBPF Trace from Kernel to Userspace

Loops

Unreachable Instructions

Page 33: eBPF Trace from Kernel to Userspace

Loops

Unreachable Instructions

Page 34: eBPF Trace from Kernel to Userspace

Step 2

Simulate the Execution

Page 35: eBPF Trace from Kernel to Userspace

Read a never-written register

Do arithmetic of two valid pointer

Load/store registers of invalid types

Read stack before writing data into stack

Page 36: eBPF Trace from Kernel to Userspace

Read a never-written register

Do arithmetic of two valid pointer

Load/store registers of invalid types

Read stack before writing data into stack

Page 37: eBPF Trace from Kernel to Userspace

Extended RegisterseBPF VerifiereBPF MapProbe Event

Page 38: eBPF Trace from Kernel to Userspace

eBPF

userspacekernel

UserProgram

Map BPF_MAP_*

Page 39: eBPF Trace from Kernel to Userspace

eBPF Map Types

● BPF_MAP_TYPE_HASH● BPF_MAP_TYPE_ARRAY● BPF_MAP_TYPE_PROG_ARRAY● BPF_MAP_TYPE_PERF_EVENT_ARRAY

Page 40: eBPF Trace from Kernel to Userspace

eBPF Map Syscalls

● BPF_MAP_CREATE● BPF_MAP_LOOKUP_ELEM● BPF_MAP_UPDATE_ELEM● BPF_MAP_DELETE_ELEM● BPF_MAP_GET_NEXT_KEY

Page 41: eBPF Trace from Kernel to Userspace

Extended RegisterseBPF VerifiereBPF MapProbe Event

Page 42: eBPF Trace from Kernel to Userspace

New ioctl request

PERF_EVENT_IOC_SET_BPF

Page 43: eBPF Trace from Kernel to Userspace

Kprobe

Page 44: eBPF Trace from Kernel to Userspace

BPF_PROG_LOAD

User Program

eBPF

userspace

kernel

KernelProgram

kprobe

Eventfd

fd

PERF_EVENT_IOC_SET_BPF

fd

Attach

Page 45: eBPF Trace from Kernel to Userspace

Registration

perf_tp_event_init() kernel/events/core.cperf_trace_init() kernel/trace/trace_event_perf.cperf_trace_event_init() kernel/trace/trace_event_perf.cperf_trace_event_reg() kernel/trace/trace_event_perf.c

ret = tp_event->class->reg(tp_event, TRACE_REG_PERF_REGISTER, NULL);

kprobe_register() kernel/trace/trace_kprobe.cenable_trace_kprobe() kernel/trace/trace_kprobe.cenable_kprobe() kernel/kprobes.c

Page 46: eBPF Trace from Kernel to Userspace

Attach

perf_ioctl() kernel/events/core.c_perf_ioctl() kernel/events/core.c

case PERF_EVENT_IOC_SET_BPF: return perf_event_set_bpf_prog(event, arg);

perf_event_set_bpf_prog() kernel/events/core.c

prog = bpf_prog_get(prog_fd); event->tp_event->prog = prog;

Page 47: eBPF Trace from Kernel to Userspace

Dispatch Event

kprobe_dispatcher() kernel/trace/trace_kprobe.ckprobe_perf_func() kernel/trace/trace_kprobe.c

if (prog && !trace_call_bpf(prog, regs)) Return;

trace_call_bpf() kernel/trace/bpf_trace.cBPF_PROG_RUN() include/linux/filter.h__bpf_prog_run() kernel/bpf/core.c

Page 48: eBPF Trace from Kernel to Userspace

kfree_skb(struct sk_buff *skb){ if (unlikely(!skb)) return; ….}

kprobe

eBPF

BPF bytecode Read Map

BPF bytecode Map

BPF_PROG_LOAD BPF_MAP_*

userspace

kernel

bpf_tracer.c

Page 49: eBPF Trace from Kernel to Userspace

Uprobe

Page 50: eBPF Trace from Kernel to Userspace

BPF_PROG_LOAD

User Program

eBPF

userspace

kernel

KernelProgram

uprobe

Eventfd

fd

PERF_EVENT_IOC_SET_BPF

fd

Attach

Page 51: eBPF Trace from Kernel to Userspace

__libc_malloc(size_t *bytes){ arena_lookup(ar_ptr);

arena_lock(ar_ptr, bytes); ….}

uprobe

eBPF

BPF bytecode

BPF bytecode

userspace

kernel

bpf_tracer.c

glibc

Page 52: eBPF Trace from Kernel to Userspace

How to use eBPF?

Page 53: eBPF Trace from Kernel to Userspace

Linux Kernel >= 4.1

Page 54: eBPF Trace from Kernel to Userspace

Kernel Config

● CONFIG_BPF=y● CONFIG_BPF_SYSCALL=y● CONFIG_BPF_JIT=y● CONFIG_HAVE_BPF_JIT=y● CONFIG_BPF_EVENTS=y

Page 55: eBPF Trace from Kernel to Userspace

BPF ASM

Page 56: eBPF Trace from Kernel to Userspace

BPF ASMRestricted C

Page 57: eBPF Trace from Kernel to Userspace

LLVM >= 3.7

Page 58: eBPF Trace from Kernel to Userspace

clang:llc:

--emit-llvm--march=bpf

Page 59: eBPF Trace from Kernel to Userspace

C codeLLVM

IR BitcodeBPF Bytecodeclang llc

Page 60: eBPF Trace from Kernel to Userspace

User Program

eBPF

userspace

kernel

eBPF MAP

KernelProgram

As simpleas possible

Whatever you want

Page 61: eBPF Trace from Kernel to Userspace

BPF Compiler Collection

Page 62: eBPF Trace from Kernel to Userspace

obs://Base:System/bcc

Page 63: eBPF Trace from Kernel to Userspace

C & Python Library

Built-in BPF compiler

Page 64: eBPF Trace from Kernel to Userspace

Hello World

from bcc import BPF

bpf_prog="""void kprobe__sys_clone(void *ctx) { bpf_trace_printk(“Hello, World\\n”);}"""

BPF(text=bpf_prog).trace_print()

Page 65: eBPF Trace from Kernel to Userspace

Access Map

In bitehist.c:

BPF_HISTOGRAM(dist);dist.increment(bpf_log2l(req->__data_len / 1024));

In bitehist.py:

b = BPF(src_file = "bitehist.c")b["dist"].print_log2_hist("kbytes")

Page 66: eBPF Trace from Kernel to Userspace

Access Map (Cont’)

# ./bitehist.pyTracing... Hit Ctrl-C to end.^C kbytes : count distribution 0 -> 1 : 8 |****** | 2 -> 3 : 0 | | 4 -> 7 : 51 |****************************************| 8 -> 15 : 8 |****** | 16 -> 31 : 1 | | 32 -> 63 : 3 |** | 64 -> 127 : 2 |* |

Page 67: eBPF Trace from Kernel to Userspace

memleak.py

if not kernel_trace: print("Attaching to malloc and free in pid %d," "Ctrl+C to quit." % pid) bpf_program.attach_uprobe(name="c", sym="malloc", fn_name="alloc_enter", pid=pid) bpf_program.attach_uretprobe(name="c", sym="malloc", fn_name="alloc_exit", pid=pid) bpf_program.attach_uprobe(name="c", sym="free", fn_name="free_enter", pid=pid)else: print("Attaching to kmalloc and kfree, Ctrl+C to quit.") bpf_program.attach_kprobe(event="__kmalloc", fn_name="alloc_enter") bpf_program.attach_kretprobe(event="__kmalloc", fn_name="alloc_exit") bpf_program.attach_kprobe(event="kfree", fn_name="free_enter")

Page 68: eBPF Trace from Kernel to Userspace

memleak.py (alloc_enter)

BPF_HASH(sizes, u64);BPF_HASH(allocs, u64, struct alloc_info_t);

int alloc_enter(struct pt_regs *ctx, size_t size) { ... u64 pid = bpf_get_current_pid_tgid(); u64 size64 = size; sizes.update(&pid, &size64); ...}

Page 69: eBPF Trace from Kernel to Userspace

memleak.py (alloc_exit)

BPF_HASH(sizes, u64);BPF_HASH(allocs, u64, struct alloc_info_t);

int alloc_exit(struct pt_regs *ctx) { u64 address = ctx->ax; u64 pid = bpf_get_current_pid_tgid(); u64* size64 = sizes.lookup(&pid); struct alloc_info_t info = {0};

if (size64 == 0) return 0; // missed alloc entry

info.size = *size64; sizes.delete(&pid);

info.timestamp_ns = bpf_ktime_get_ns(); info.num_frames = grab_stack(ctx, &info) - 2; allocs.update(&address, &info); ...}

Page 70: eBPF Trace from Kernel to Userspace

memleak.py (free)

BPF_HASH(sizes, u64);BPF_HASH(allocs, u64, struct alloc_info_t);

int free_enter(struct pt_regs *ctx, void *address){ u64 addr = (u64)address; struct alloc_info_t *info = allocs.lookup(&addr); if (info == 0) return 0;

allocs.delete(&addr); ...}

Page 71: eBPF Trace from Kernel to Userspace

Demo

Page 72: eBPF Trace from Kernel to Userspace

Question?

Page 73: eBPF Trace from Kernel to Userspace

ThankYou

Page 74: eBPF Trace from Kernel to Userspace

References

● Documentation/networking/filter.txt

● http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html

● https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/

● https://suchakra.wordpress.com/2015/08/12/bpf-internals-ii/

● https://lkml.org/lkml/2013/9/30/627

● https://lwn.net/Articles/612878/

● https://lwn.net/Articles/650953/

● https://github.com/iovisor/bcc