eBPF Memory Leak Detection — Q&A Preparation
Based on slides 27–33 of the CSITEP presentation. Organized by topic.
1. eBPF vs Traditional Tools (Slide 28)
Q1: What's the overhead of eBPF memleak in production?
A: Typically under 5% CPU, but the cost scales with how often the probed functions are called. Each uprobe/kprobe hit adds a few microseconds. For functions called fewer than 10K times per second the impact is negligible; on extremely hot paths (>100K calls/sec) it can become noticeable.
Q2: Why not just use ASan in production? It's only 2-3x overhead.
A: A 2-3x slowdown means 2-3x the hardware to sustain the same throughput. ASan also requires recompiling with -fsanitize=address, which changes the binary, and it cannot be attached to a process that is already running. eBPF probes can be attached and detached at any time without restarting the target.
Q3: Can eBPF detect use-after-free at all?
A: Not directly. eBPF memleak tracks alloc/free pairing: it knows whether memory was freed, but not whether it was accessed afterward. For use-after-free you still need Valgrind or ASan, which maintain shadow memory that tracks the state of every byte.
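To make the limitation concrete, here is a minimal Python model of alloc/free pairing. It is an illustrative sketch, not the tool's BPF code: the event tuple layout and the name `outstanding_allocations` are assumptions made for this example. The only state kept is a table of live addresses, so an access to an already-freed address leaves no trace in it.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event layout for illustration: (kind, address, size).
# kind is "alloc" or "free"; size is ignored for "free" events.
Event = Tuple[str, int, int]


def outstanding_allocations(events: Iterable[Event]) -> Dict[int, int]:
    """Return {address: size} for allocations that were never freed."""
    live: Dict[int, int] = {}
    for kind, addr, size in events:
        if kind == "alloc":
            live[addr] = size
        elif kind == "free":
            # A free of an untracked address (for example, memory that was
            # allocated before tracing started) is simply ignored.
            live.pop(addr, None)
    return live


events = [
    ("alloc", 0x1000, 64),
    ("alloc", 0x2000, 128),
    ("free", 0x1000, 0),
    # A read of 0x1000 at this point would be a use-after-free, but it
    # produces no alloc/free event, so this model cannot observe it.
]
print(outstanding_allocations(events))  # {8192: 128}
```

The remaining entry (address 0x2000, 128 bytes) is what a leak report would list as outstanding.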
Q4: What about memory leak detection in Go or Java programs?
A: Go and Java use garbage collectors, so traditional malloc/free leaks don't apply to their managed heaps. eBPF memleak is primarily for C/C++/Rust programs that use manual memory management or the system allocator. For Go, use pprof; for Java, use JVM heap dumps. Native code reached through cgo or JNI still allocates with malloc, so memleak remains applicable to that part of a process.
Q5: Does it work on containers/Kubernetes?
A: Yes. eBPF runs in the host kernel, so it can trace any process on the host whether or not it runs in a container. You just need to specify the target PID (as seen from the host PID namespace) or the binary path.
Q6: What Linux kernel version do we actually need?
A: The minimum is 4.9 for basic BPF tracing functionality. For stable memleak with stack traces we recommend 4.14+. Non-root usage via CAP_BPF requires 5.8+.
Q7: Valgrind vs eBPF memleak — when to use which?
A: Use Valgrind during development for comprehensive checking (use-after-free, overflow, double-free, leaks with byte-level precision). Use eBPF in production for leak hunting (no code changes, low overhead, attaches to a running process). The two complement each other.
2. What We Capture (Slide 29)
Q8: Does memleak slow down malloc/free calls?
A: Slightly. Each uprobe hit adds approximately 2-4 microseconds to the call. For a typical application calling malloc a few thousand times per second this is imperceptible. At millions of calls per second the cost adds up to whole CPU cores, so you would sample rather than trace every call.
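The scaling can be checked with simple arithmetic: overhead in CPU cores is roughly call rate times per-hit cost. The sketch below assumes 3 microseconds per hit (the midpoint of the 2-4 microsecond range quoted above); the function name `probe_cpu_cores` is made up for this example, and real costs vary with hardware and kernel version.

```python
def probe_cpu_cores(calls_per_sec: float, cost_us_per_hit: float) -> float:
    """CPU cores consumed by probe overhead: call rate x per-hit cost."""
    return calls_per_sec * cost_us_per_hit * 1e-6


# Assumed per-hit cost of 3 microseconds, for illustration only.
for rate in (5_000, 100_000, 2_000_000):
    cores = probe_cpu_cores(rate, 3.0)
    print(f"{rate:>9} calls/s -> {cores:.3f} cores of probe overhead")
```

Under that assumption, 5,000 calls/sec costs about 1.5% of one core, 100,000 calls/sec about 30% of a core, and 2,000,000 calls/sec about six cores, which is why sampling becomes necessary on the hottest paths.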
Q9: How do you get the stack trace? Does it require debug symbols?
A: We use the kernel's BPF stack trace helper, which walks frame pointers. The binary must be built with -fno-omit-frame-pointer for reliable user-space stacks. Debug symbols (DWARF) are not required at collection time, but they are needed afterward to translate addresses into function names and line numbers.
Q10: What if the program uses a custom allocator instead of malloc?
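A: It depends on where the allocator gets its memory. A wrapper or pool that ultimately calls libc malloc is still caught. Drop-in replacements such as jemalloc, tcmalloc, and mimalloc do not call libc malloc: they export their own malloc/free and obtain memory from the kernel with mmap or brk, so the uprobes must be attached to the malloc/free symbols in that allocator's library (or in the binary itself, if the allocator is statically linked). A fully custom pool allocator that uses mmap directly and never calls malloc needs uprobes on its own allocation and release functions.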
Q11: Does it work with C++ new/delete?
A: Yes. In the default libstdc++ and libc++ implementations, new calls operator new, which calls malloc, and delete calls operator delete, which calls free. Hooking malloc/free therefore covers C++ allocations, unless the program replaces operator new/delete with an implementation that bypasses malloc.
Q12: Can you filter by specific process or thread?
A: Yes. You can target a specific PID, and the tool records the TID of each allocation, so results can be filtered by thread in post-processing.
Q13: How much memory does the eBPF memleak tool itself consume?
A: The BPF maps grow in proportion to the number of outstanding allocations being tracked. For a typical application with thousands of active allocations, that is a few megabytes. Each unique stack trace is stored once and referenced by ID.
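The "stored once, referenced by ID" idea can be modeled in a few lines of Python. This is only a sketch of the data layout, not the tool's implementation: `StackTable`, the frame names, and the addresses are all hypothetical. In the kernel, the BPF stack trace map plays the role of the table and the stack helper returns the ID.

```python
from typing import Dict, List, Tuple

Stack = Tuple[str, ...]


class StackTable:
    """Stores each unique call stack once and hands out a small integer ID."""

    def __init__(self) -> None:
        self._ids: Dict[Stack, int] = {}
        self.stacks: List[Stack] = []

    def intern(self, stack: Stack) -> int:
        stack_id = self._ids.get(stack)
        if stack_id is None:
            # First time this stack is seen: store it and assign the next ID.
            stack_id = len(self.stacks)
            self._ids[stack] = stack_id
            self.stacks.append(stack)
        return stack_id


table = StackTable()
allocs: Dict[int, Tuple[int, int]] = {}  # address -> (size, stack_id)

# Hypothetical hot path that allocates 1000 times from the same call stack.
hot_path = ("malloc", "parse_header", "handle_request", "main")
for i in range(1000):
    allocs[0x1000 + 64 * i] = (64, table.intern(hot_path))

print(len(allocs), "tracked allocations,", len(table.stacks), "stored stack")
```

Each tracked allocation carries only a size and a small ID, so memory use is dominated by the number of outstanding allocations rather than by stack depth.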
Q14: How do you distinguish intentional long-lived allocations from real leaks?
A: A real leak typically shows: (1) allocations that keep growing over time rather than a one-time initialization cost; (2) many allocations from the same call stack accumulating; (3) memory growing in proportion to load or uptime. The "continuous leak detection" in our HTML report flags pattern (1) automatically.
3. Kernel-Space Tracing (Slide 30)
Q15: Is it safe to trace kernel allocations in production?
A: Yes. kprobes are a well-established kernel tracing mechanism, and the BPF verifier rejects any probe program that could crash the kernel or access invalid memory. Per-hit overhead is similar to user-space, a few microseconds, but kernel allocation paths can be hit far more often, so keep an eye on total CPU cost.
Q16: What about SLUB/SLAB allocator tracking?
A: kmem_cache_alloc is the SLAB/SLUB interface, and we hook it directly. This covers most kernel object allocations (inodes, dentries, sk_buffs, etc.). kmalloc is also backed by SLAB/SLUB internally.
Q17: Can you trace specific kernel modules?
A: Not at capture time: the tool captures all kernel allocations. Filter in post-processing by kernel function name patterns, keeping the stacks that contain the module's functions.
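A post-processing filter of this kind might look like the sketch below. It assumes the log has already been parsed into a mapping from call stack (a tuple of function names, innermost first) to outstanding bytes; that representation, the `mydrv_*` module functions, and the name `filter_by_function` are all hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Tuple

Stack = Tuple[str, ...]


def filter_by_function(
    outstanding: Dict[Stack, int], patterns: List[str]
) -> Dict[Stack, int]:
    """Keep stacks where at least one frame matches one of the glob patterns."""
    return {
        stack: nbytes
        for stack, nbytes in outstanding.items()
        if any(fnmatchcase(frame, pat) for frame in stack for pat in patterns)
    }


# Hypothetical parsed data: stack -> outstanding bytes.
outstanding = {
    ("kmem_cache_alloc", "mydrv_alloc_buffer", "mydrv_ioctl"): 40960,
    ("kmem_cache_alloc", "other_subsys_alloc", "other_subsys_open"): 8192,
}
print(filter_by_function(outstanding, ["mydrv_*"]))
```

Only the stack that passes through a `mydrv_` function is kept. Matching on a shared function-name prefix works when the module follows a naming convention; otherwise, list its functions explicitly.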
Q18: What if memleak itself crashes? Does it affect the traced process?
A: No. eBPF programs are verified by the kernel before loading, so they cannot crash the kernel or the traced process. If the user-space component crashes, the kernel releases the file descriptors it held and the BPF probes are cleaned up automatically. The traced process continues unaffected.
4. HTML Visual Report (Slides 31–33)
Q19: How does the "continuous leak detection" algorithm work?
A: It compares each allocation source across all snapshots. If a source's total bytes increase in every consecutive snapshot (monotonically increasing), it is flagged as a suspected continuous leak. This filters out one-time allocations and focuses on memory that is genuinely growing.
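The rule described above can be sketched as follows. This is one reading of it, not the report generator's actual code: it assumes each snapshot is a mapping from allocation source to total bytes, treats a source missing from a snapshot as 0 bytes, and requires a strict increase between every pair of consecutive snapshots. The source names are made up.

```python
from typing import Dict, List

Snapshot = Dict[str, int]  # allocation source -> total bytes in that snapshot


def continuous_leak_suspects(snapshots: List[Snapshot]) -> List[str]:
    """Sources whose bytes strictly increase in every consecutive snapshot."""
    if len(snapshots) < 2:
        return []  # growth cannot be judged from fewer than two snapshots
    sources = set()
    for snap in snapshots:
        sources.update(snap)
    suspects = []
    for source in sorted(sources):
        # Assumption: a source absent from a snapshot holds 0 bytes there.
        series = [snap.get(source, 0) for snap in snapshots]
        if all(a < b for a, b in zip(series, series[1:])):
            suspects.append(source)
    return suspects


snapshots = [
    {"parse_header": 1000, "init_config": 4096, "cache_fill": 2000},
    {"parse_header": 1800, "init_config": 4096, "cache_fill": 5000},
    {"parse_header": 2600, "init_config": 4096, "cache_fill": 5000},
]
print(continuous_leak_suspects(snapshots))  # ['parse_header']
```

Here `init_config` is a one-time allocation and `cache_fill` grows once and then plateaus, so only `parse_header` is flagged.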
Q20: Can the report handle very large logs (hours of tracing)?
A: Yes. The report aggregates data by snapshot interval, so even hours of tracing produce a manageable number of data points. We have tested with thousands of snapshots without issues.
Q21: Can we integrate this into CI/CD?
A: Yes. Run the service under memleak for N seconds, parse the output, and fail the build if outstanding allocations exceed a threshold or continuous growth is detected. HTML report generation can also be automated as a post-processing step.
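A build gate along these lines could be as small as the sketch below. It assumes the memleak output has already been parsed into per-snapshot mappings of source to outstanding bytes; the parsing step, the `ci_gate` name, the threshold value, and the sample data are all hypothetical.

```python
from typing import Dict, List

Snapshot = Dict[str, int]  # allocation source -> outstanding bytes


def ci_gate(snapshots: List[Snapshot], max_outstanding_bytes: int) -> List[str]:
    """Return failure reasons; an empty list means the build passes."""
    failures: List[str] = []
    if not snapshots:
        return failures
    # Check 1: total outstanding bytes at the end of the run.
    final_total = sum(snapshots[-1].values())
    if final_total > max_outstanding_bytes:
        failures.append(
            f"outstanding {final_total} bytes exceeds {max_outstanding_bytes}"
        )
    # Check 2: any source that grows in every consecutive snapshot.
    for source in sorted(set().union(*snapshots)):
        series = [snap.get(source, 0) for snap in snapshots]
        if len(series) >= 2 and all(a < b for a, b in zip(series, series[1:])):
            failures.append(f"continuous growth in {source}")
    return failures


snapshots = [
    {"parse_header": 1000, "init_config": 4096},
    {"parse_header": 1800, "init_config": 4096},
    {"parse_header": 2600, "init_config": 4096},
]
failures = ci_gate(snapshots, max_outstanding_bytes=1_000_000)
exit_code = 1 if failures else 0
print(exit_code, failures)
```

A real pipeline step would pass `exit_code` to `sys.exit` so the build fails. Short runs with only a few snapshots can flag warm-up growth, so the run length N and the threshold need tuning per service.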
Q22: What's do_anonymous_page? Why does it dominate kernel allocations?
A: do_anonymous_page handles page faults for anonymous memory (heap, stack, mmap without file backing). It is normal for it to dominate, because every user-space malloc that ends up touching a new page goes through this path. It typically reflects the program's working set, not a kernel bug.
Q23: How do we handle false positives in reports?
A: Common false positives and how to handle them: (1) startup allocations: check the growth pattern, which should plateau; (2) cache/pool allocations: they grow and then stabilize, so the continuous-growth detector ignores them; (3) intentional leaks (daemon patterns): create a baseline and compare against it. You can also maintain a whitelist to suppress known non-issues.
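The baseline comparison and the whitelist can be combined in one small post-processing step. The sketch below assumes both the baseline and the current run are available as mappings from allocation source to outstanding bytes; the function name `growth_since_baseline` and all source names are hypothetical.

```python
from typing import Dict, Set


def growth_since_baseline(
    baseline: Dict[str, int], current: Dict[str, int], whitelist: Set[str]
) -> Dict[str, int]:
    """Bytes grown beyond the baseline per source, skipping whitelisted ones."""
    growth: Dict[str, int] = {}
    for source, nbytes in current.items():
        if source in whitelist:
            continue  # known non-issue, suppressed on purpose
        delta = nbytes - baseline.get(source, 0)
        if delta > 0:
            growth[source] = delta
    return growth


baseline = {"daemon_init": 65536, "session_new": 1024}
current = {"daemon_init": 65536, "session_new": 9216, "log_buffer": 4096}
print(growth_since_baseline(baseline, current, whitelist={"log_buffer"}))
# {'session_new': 8192}
```

The intentional `daemon_init` retention cancels out against the baseline, `log_buffer` is suppressed by the whitelist, and only the growth in `session_new` is left to investigate.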
5. General / Architecture
Q24: Does this work on ARM/aarch64?
A: Yes. eBPF bytecode is architecture-independent and runs in the kernel's BPF virtual machine; both x86_64 and aarch64 are fully supported. The one caveat is stack unwinding, which may require different compilation flags on ARM.
Q25: Can we run memleak on multiple processes simultaneously?
A: Yes. Run separate instances targeting different PIDs, or use system-wide mode to trace all processes. System-wide mode captures every malloc/free on the system, so expect correspondingly higher overhead.
Q26: What privileges do we need? Can we avoid root?
A: Options: (1) root, which always works; (2) CAP_BPF + CAP_PERFMON (Linux 5.8+), the minimum capabilities without full root; (3) CAP_SYS_ADMIN, which works but is overly broad. In production we recommend a dedicated service account with only CAP_BPF + CAP_PERFMON.
Q27: What's the recommended workflow for investigating a suspected leak?
A:
- Confirm the symptom: check RSS growth in monitoring
- Attach memleak: run for 2-5 minutes with periodic snapshots
- Generate the HTML report: look for continuously growing sources
- Identify the top offenders: expand stack traces to find the code path
- Analyze the code: determine whether it is a real leak or intentional retention
- Fix and verify: patch the code, then run memleak again to confirm the leak is gone
Q28: Can we save raw data and generate the report later?
A: Yes. memleak writes text logs to stdout or a file. Collect the raw output, transfer it to another machine, and run the HTML report generator offline.
Q29: How does Rust leak memory if it has ownership?
A: Several ways: (1) Rc/Arc reference cycles, where the reference count never reaches zero; (2) mem::forget(), which skips the destructor; (3) Box::leak(), an intentional leak that yields a 'static reference; (4) FFI boundaries, where memory is allocated in C code; (5) collections that grow without bound. Rust guarantees memory safety (no UB), not freedom from leaks. Because Rust's default allocator calls malloc/free underneath, our uprobe approach works out of the box.
Q30: What's the maximum duration you can run memleak?
A: There is no hard limit. The BPF hash map has a configurable maximum size (default ~10240 entries). Once the map is full, new allocations are silently dropped from tracking (the allocation itself still succeeds), so leaks that start after that point can go unreported. We have run it for 24+ hours in production without issues. The map size can be tuned at startup.
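The effect of a full map can be illustrated with a bounded table in Python. This is a model of the behavior described above, not the tool's code: `BoundedAllocMap` is a hypothetical name, and the `dropped` counter is added here only to make the silent drops visible.

```python
from typing import Dict


class BoundedAllocMap:
    """Tracks live allocations up to max_entries; extra ones go untracked."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.live: Dict[int, int] = {}  # address -> size
        self.dropped = 0  # illustrative only; the real drop is silent

    def on_alloc(self, addr: int, size: int) -> bool:
        if addr not in self.live and len(self.live) >= self.max_entries:
            # Map is full: the allocation succeeds in the program,
            # but it is not recorded here.
            self.dropped += 1
            return False
        self.live[addr] = size
        return True

    def on_free(self, addr: int) -> None:
        # Frees of untracked addresses are ignored.
        self.live.pop(addr, None)


m = BoundedAllocMap(max_entries=3)
for i in range(5):
    m.on_alloc(0x1000 + 16 * i, 16)
print(len(m.live), "tracked,", m.dropped, "dropped")  # 3 tracked, 2 dropped

m.on_free(0x1000)        # frees a slot
m.on_alloc(0x9000, 32)   # tracked again now that there is room
print(len(m.live), "tracked,", m.dropped, "dropped")  # 3 tracked, 2 dropped
```

Any leak among the dropped allocations never shows up in the report, which is the reason to size the map above the expected number of outstanding allocations for long runs.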