hackslash dot org

Disabling kernel functions in your process (2009)

Posted: 2025-05-21 02:22:25

The blog post describes a method to disable specific kernel functions within a user-space process by intercepting system calls. It leverages the ptrace system call to attach to a process, modify its system call table entries to point to a custom function, and then detach. The custom function can then choose to emulate the original kernel function, return an error, or perform other actions, effectively blocking or altering the behavior of targeted system calls for the specified process. This technique allows for granular control over kernel interactions within a user-space process, potentially useful for security sandboxing or debugging.

Chad Austin's 2009 blog post, "Disabling kernel functions in your process," explores a technique for selectively restricting a process's access to specific kernel system calls on Linux. This method provides a more granular approach to security than traditional, all-or-nothing methods like chroot jails, offering finer control over a process's capabilities and potentially mitigating the impact of vulnerabilities.

The core idea revolves around manipulating the system call table, a data structure within the kernel that maps system call numbers to their corresponding functions. By overwriting specific entries in this table with a pointer to a custom function, usually one that simply returns an error code (like -ENOSYS for "Function not implemented"), the process can effectively disable those system calls. Attempting to invoke a disabled system call will then result in the custom function being executed, preventing the actual kernel function from running.
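The dispatch-and-overwrite idea in this paragraph can be modelled in a few lines. This is a user-space toy, not the post's code: the handler functions, the table contents, and the call numbers are all hypothetical stand-ins, and no real kernel structure is touched.

```python
import errno

# Hypothetical stand-ins for kernel handlers; real handlers live in kernel space.
def sys_getpid():
    return 4242  # made-up PID for illustration

def sys_unlink(path):
    return 0  # pretend the file was removed

def disabled(*_args):
    """Stub that refuses the call, mirroring a handler that returns -ENOSYS."""
    return -errno.ENOSYS

# Toy "system call table": call number -> handler (numbers are illustrative).
SYSCALL_TABLE = {39: sys_getpid, 87: sys_unlink}

def disable_syscall(table, number):
    """Overwrite one table entry with the refusing stub; return the old handler."""
    original = table[number]
    table[number] = disabled
    return original

def invoke(table, number, *args):
    """Dispatch through the table; unknown numbers are refused as well."""
    handler = table.get(number, disabled)
    return handler(*args)
```

After disable_syscall(SYSCALL_TABLE, 87), invoke(SYSCALL_TABLE, 87, "/tmp/x") returns -errno.ENOSYS while call 39 still reaches its handler. Returning the original handler also shows why the restriction is reversible by anyone who can write to the table.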

Austin details the process of achieving this, which involves obtaining the address of the system call table, changing the memory protection of that region to allow writing, and then overwriting the desired entries. He emphasizes the platform-specific nature of this technique, as the location and structure of the system call table can vary across different kernel versions and distributions. The provided code examples are tailored for a specific system configuration and require adaptation for other environments.

The post highlights the potential benefits of this approach. By selectively disabling system calls that a process doesn't legitimately require, the attack surface can be significantly reduced. Even if a vulnerability exists within the process, exploiting it to perform unintended actions might be prevented if the necessary system calls for those actions are disabled. This provides a layer of defense that can complement other security measures.

However, Austin also acknowledges the limitations and potential drawbacks. Modifying kernel structures directly is inherently risky and can lead to system instability if done incorrectly. The technique is also dependent on the specifics of the kernel, making it brittle and prone to breaking across updates or different systems. Additionally, sophisticated attackers might find ways to circumvent these restrictions, for example, by restoring the original system call table entries.

The post concludes by suggesting this method as a potentially useful tool for enhancing security in specific scenarios where the risks are acceptable and the benefits outweigh the drawbacks. It encourages careful consideration and thorough testing before implementing this technique in a production environment.

Summary of Comments ( 6 )
https://news.ycombinator.com/item?id=44047741

HN commenters discuss the blog post's method of disabling kernel functions by overwriting the system call table entries with int3 instructions. Several express concerns about the fragility and unsafety of this approach, particularly in multi-threaded environments and due to potential conflicts with security mitigations like SELinux. Some suggest alternatives like using LD_PRELOAD to intercept and redirect function calls or employing seccomp-bpf for finer-grained control. Others question the practical use cases for this technique, acknowledging its potential for debugging or specialized security applications but cautioning against its general use. A few commenters share anecdotal experiences or related techniques, like disabling ptrace to hinder debuggers. The overall sentiment is one of cautious curiosity mixed with skepticism regarding the robustness and practicality of the described method.
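For readers unfamiliar with the seccomp-bpf alternative the commenters mention, here is a rough sketch of what such a filter looks like as data. It only builds and simulates the filter; it does not install it. The opcode and return-action constants are the values from the Linux headers, while deny_one_syscall and run_filter are hypothetical helper names, and the interpreter models only the three opcodes used here. A production filter would also check the architecture field of seccomp_data before trusting the syscall number.

```python
import errno
import struct

# Classic BPF opcode pieces (values from <linux/bpf_common.h>).
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS = 0x00, 0x20
BPF_JEQ, BPF_K = 0x10, 0x00

# seccomp return actions (values from <linux/seccomp.h>).
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000

LOAD_NR = BPF_LD | BPF_W | BPF_ABS
JUMP_EQ = BPF_JMP | BPF_JEQ | BPF_K
RETURN_K = BPF_RET | BPF_K

def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)

def deny_one_syscall(nr, err=errno.ENOSYS):
    """Filter that fails syscall `nr` with `err` and allows everything else."""
    return [
        insn(LOAD_NR, 0, 0, 0),    # A = seccomp_data.nr (offset 0)
        insn(JUMP_EQ, 0, 1, nr),   # A == nr: fall through; otherwise skip one
        insn(RETURN_K, 0, 0, SECCOMP_RET_ERRNO | (err & 0xFFFF)),
        insn(RETURN_K, 0, 0, SECCOMP_RET_ALLOW),
    ]

def run_filter(prog, nr):
    """Tiny interpreter for the three opcodes above (illustration only)."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = struct.unpack("=HBBI", prog[pc])
        pc += 1
        if code == LOAD_NR:
            acc = nr  # only offset 0, the syscall number, is modelled
        elif code == JUMP_EQ:
            pc += jt if acc == k else jf
        elif code == RETURN_K:
            return k
```

Unlike the table overwrite, a filter of this kind is enforced by the kernel once installed and cannot be removed by the process it restricts.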

The Hacker News post discussing Chad Austin's article on disabling kernel functions has a moderate number of comments, mostly focusing on the practicality and security implications of the technique described.

Several commenters express skepticism about the usefulness of this approach in real-world scenarios. One commenter highlights the limited scope of the technique, pointing out that it only affects the calling process and not the entire system. They argue that if a serious security vulnerability exists that requires disabling a kernel function, a system-wide solution would be necessary. Another commenter questions the practicality of preemptively disabling functions, suggesting it's difficult to predict which functions might be exploited in the future. They propose that a more reactive approach, focusing on patching vulnerabilities as they are discovered, is likely more effective.

Some comments discuss the potential security risks associated with disabling kernel functions. One commenter notes that disabling certain critical functions could destabilize the system, leading to crashes or unexpected behavior. Another expresses concern that attackers could potentially exploit this mechanism itself, disabling essential security functions to gain further access to the system.

A few commenters delve into the technical details of the implementation. One discusses the challenges of determining which functions are safe to disable without causing system instability. Another mentions the possibility of using this technique for performance optimization, by disabling unused or unnecessary kernel functions. However, they acknowledge that the potential performance gains are likely to be minimal.

One commenter provides an alternative perspective, suggesting that the technique could be valuable in highly specialized environments, such as embedded systems or security-critical applications. They argue that in these contexts, the limited scope and potential risks might be acceptable trade-offs for the added security benefits.

There's a thread discussing the difference between disabling a function and simply not calling it. Commenters clarify that disabling prevents the function from being called by any code running in the process, including libraries and other components loaded into it, while simply not calling it only constrains the code you wrote yourself.

Finally, some commenters express appreciation for the ingenuity of the approach, even if they acknowledge its limited practical application. They see it as an interesting exploration of the Linux kernel's capabilities and a potential starting point for further research in system security.

Why Does My eBPF Program Work on One Kernel but Fail on Another?

Posted: 2025-04-23 07:17:16

eBPF program portability can be tricky due to differences in kernel versions and configurations. The blog post highlights how seemingly minor variations, such as a missing helper function or a change in struct layout, can cause a program that works perfectly on one kernel to fail on another. It emphasizes the importance of using the bpftool utility for introspection, allowing developers to compare kernel features and identify discrepancies that might be causing compatibility issues. Additionally, building eBPF programs against the oldest supported kernel and strategically employing the LINUX_VERSION_CODE macro can enhance portability and minimize unexpected behavior across different kernel versions.
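As a small illustration of the version comparison behind LINUX_VERSION_CODE, the sketch below reproduces the arithmetic of the kernel's KERNEL_VERSION(a, b, c) macro, including the clamp of the patch level to 255 used by recent kernels. The function names are hypothetical, and the release strings are examples in the style of uname -r output.

```python
import re

def kernel_version(major, minor, patch):
    """Mirror the kernel's KERNEL_VERSION(a, b, c) macro."""
    return (major << 16) + (minor << 8) + min(patch, 255)

def version_code_from_release(release):
    """Parse a `uname -r` style string such as '5.15.0-91-generic'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised release string: {release!r}")
    major, minor, patch = (int(g) if g else 0 for g in m.groups())
    return kernel_version(major, minor, patch)

def supports(release, major, minor, patch=0):
    """True when `release` is at least the given version."""
    return version_code_from_release(release) >= kernel_version(major, minor, patch)
```

For example, supports("5.15.0-91-generic", 5, 8) is true and supports("4.19.300", 5, 8) is false. A version check like this is only a proxy, since distributions backport features into older kernels.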

The blog post "Why Does My eBPF Program Work on One Kernel but Fail on Another?" explores the common frustration of eBPF programs behaving inconsistently across different Linux kernel versions. It delves into the reasons behind this incompatibility, focusing on the volatile nature of the eBPF verifier and its dependencies on kernel internals.

The author begins by acknowledging how arbitrary these failures can look: an eBPF program that works on one kernel version may break on another, even when the version difference is small. This fragility stems from the eBPF verifier, the component responsible for ensuring that eBPF programs are safe before they are loaded into the kernel. The verifier analyzes the program's bytecode, checking for issues such as infinite loops, out-of-bounds memory accesses, and other unsafe operations that could compromise the kernel's integrity.

A key factor contributing to the verifier's volatility is its reliance on internal kernel data structures and functions. These internals can change between kernel versions, sometimes subtly and without explicit documentation. As a result, a verifier that accepts a program on one kernel might reject it on another due to altered offsets, data structure layouts, or function signatures. Even seemingly minor changes in the kernel's internal workings can have cascading effects on the verifier's logic and lead to program rejection.
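The effect of a changed structure layout on a hard-coded offset can be shown with a made-up example. TaskV1 and TaskV2 below are hypothetical structures, not real kernel types; the point is only that an offset computed against one layout reads a different field on the other.

```python
import ctypes

class TaskV1(ctypes.Structure):
    """Hypothetical layout of a kernel struct on an older kernel."""
    _fields_ = [("state", ctypes.c_uint64),
                ("pid", ctypes.c_int32)]

class TaskV2(ctypes.Structure):
    """Same struct after a (made-up) field was inserted ahead of `pid`."""
    _fields_ = [("state", ctypes.c_uint64),
                ("flags", ctypes.c_uint32),
                ("pid", ctypes.c_int32)]

def read_int32_at(raw, offset):
    """Read a 32-bit int at a fixed byte offset, as a hard-coded offset would."""
    return ctypes.c_int32.from_buffer_copy(raw, offset).value

# Memory image of the newer layout.
raw_v2 = bytes(TaskV2(state=1, flags=0xABCD, pid=1234))

stale = read_int32_at(raw_v2, TaskV1.pid.offset)    # offset from the old layout
correct = read_int32_at(raw_v2, TaskV2.pid.offset)  # offset from the new layout
```

Here stale picks up the value of flags rather than pid. Relocation schemes such as CO-RE exist to rewrite offsets like these at load time from the running kernel's type information.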

The blog post emphasizes that relying on undocumented kernel internals is a primary culprit in these cross-kernel incompatibilities. eBPF programs often interact with kernel functions and data structures that are not part of the official kernel API. While accessing these internals might offer powerful capabilities, it creates a tight coupling between the eBPF program and the specific kernel version it was developed on. Any changes to these undocumented elements in a newer kernel can render the eBPF program unusable.

The author then highlights several specific examples of internal kernel changes impacting eBPF program compatibility, including modifications to context structures and helper functions. These examples illustrate how even seemingly innocuous changes can break existing eBPF programs.

Finally, the post offers strategies for mitigating these compatibility challenges. One approach involves using the bpftool utility to inspect the verifier's log and understand the reasons for program rejection. This can provide valuable insights into the specific kernel changes causing the incompatibility. Another strategy is to avoid relying on undocumented kernel internals whenever possible. Sticking to the stable kernel API can minimize the risk of breakage across kernel versions. The post concludes by encouraging developers to embrace the dynamic nature of the eBPF ecosystem and proactively address potential compatibility issues. Using tools and best practices can help ensure that eBPF programs remain functional and portable across different kernel versions.

Summary of Comments ( 12 )
https://news.ycombinator.com/item?id=43769461

The Hacker News comments discuss potential reasons for eBPF program incompatibility across different kernels, focusing primarily on kernel version discrepancies and configuration variations. Some commenters highlight the rapid evolution of the eBPF ecosystem, leading to frequent breaking changes between kernel releases. Others point to the importance of checking for specific kernel features and configurations (like CONFIG_BPF_JIT) that might be enabled on one system but not another, especially when using newer eBPF functionalities. The use of CO-RE (Compile Once – Run Everywhere) and its limitations are also brought up, with users encountering problems despite its intent to improve portability. Finally, some suggest practical debugging strategies, such as using bpftool to inspect program behavior and verify kernel support for required features. A few commenters mention the challenge of staying up-to-date with eBPF's rapid development, emphasizing the need for careful testing across target kernel versions.

The Hacker News post "Why Does My eBPF Program Work on One Kernel but Fail on Another?" with the ID 43769461 has several comments discussing the intricacies and challenges of working with eBPF across different kernel versions.

Several commenters highlight the rapid pace of eBPF development and the resulting instability across kernel versions. One commenter points out that the constant evolution, while beneficial in the long run, makes it difficult for developers to maintain compatibility. They mention the frequent changes in verifier rules and helper functions as primary culprits. Another echoes this sentiment, stating that keeping up with these changes can be a full-time job, particularly when dealing with complex eBPF programs. This rapid evolution necessitates careful attention to kernel version compatibility during development and deployment.

The discussion also delves into the specifics of eBPF program loading and verification. One commenter explains how the behavior of the eBPF verifier can change between kernel versions, leading to programs that work on one kernel but fail on another. They mention that seemingly minor kernel upgrades can sometimes introduce breaking changes in the verifier's logic, causing previously valid programs to be rejected. This emphasizes the need for thorough testing across different target kernels.

Another thread focuses on the challenges of debugging eBPF programs. A user shares their experience of encountering cryptic error messages from the verifier, making it difficult to pinpoint the root cause of the issue. They suggest that improved tooling and more descriptive error messages would significantly ease the debugging process. Another commenter suggests using dynamic tracing tools like bpftrace to gain insights into the program's execution and identify potential problems.

The complexities of eBPF helper functions are also addressed. One commenter points out that the availability and behavior of helper functions can vary across kernels. They recommend consulting the kernel documentation and checking for changes in helper function signatures between kernel versions. Another user advises against relying on undocumented helper functions, as their behavior might change unexpectedly.

Finally, several commenters emphasize the importance of staying updated with the latest eBPF developments. They recommend subscribing to mailing lists, following relevant communities, and keeping track of kernel release notes to anticipate potential compatibility issues. They also advocate for better documentation and tooling to simplify eBPF development and improve cross-kernel compatibility.

Linux Kernel Defence Map – Security Hardening Concepts

Posted: 2025-04-05 22:16:54

The Linux Kernel Defence Map provides a comprehensive overview of security hardening mechanisms available within the Linux kernel. It categorizes these techniques into areas like memory management, access control, and exploit mitigation, visually mapping them to specific kernel subsystems and features. The map serves as a resource for understanding how various kernel configurations and security modules contribute to a robust and secure system, aiding in both defensive hardening and vulnerability research by illustrating the relationships between different protection layers. It aims to offer a practical guide for navigating the complex landscape of Linux kernel security.

The Linux Kernel Defence Map, presented on GitHub by user a13xp0p0v, offers a comprehensive, visually-oriented guide to various security hardening techniques applicable to the Linux kernel. It serves as a roadmap for system administrators and security professionals seeking to enhance the security posture of their Linux systems by leveraging kernel-level defenses.

The map categorizes these defenses into several key domains, reflecting different layers and aspects of kernel security. These include:

Kernel Self-Protection: This area focuses on mechanisms that protect the kernel itself from exploitation. Techniques listed include Kernel Address Space Layout Randomization (KASLR), which randomizes the location of kernel code in memory, and Kernel Page Table Isolation (KPTI, formerly KAISER), which separates user-space and kernel-space page tables to mitigate Meltdown-type vulnerabilities. It also covers Supervisor Mode Access Prevention (SMAP) and Supervisor Mode Execution Prevention (SMEP), which stop code running in supervisor mode from accessing or executing user-space memory, blocking certain types of privilege escalation attacks.
Memory Management Hardening: This domain deals with securing the kernel's memory management subsystem. It includes strategies like hardening slab allocator freelist metadata with SLAB_FREELIST_HARDENED, enabling hardware memory tagging such as the ARM Memory Tagging Extension (MTE), and implementing hardened usercopy functions to prevent vulnerabilities arising from copying data between user and kernel space.
Capability-Based Security: This section outlines the use of Linux capabilities, which provide a finer-grained alternative to traditional root privileges, allowing processes to have specific privileges without granting full administrative access. This helps limit the potential damage from compromised processes.
Namespaces and Seccomp: These features isolate processes from each other and the system, limiting their access to resources and system calls. Namespaces create isolated environments for processes, while Seccomp allows restricting the system calls a process can make. This restricts the attack surface available to a malicious process.
Security Modules: The map covers various security modules like SELinux, AppArmor, and TOMOYO Linux, which provide mandatory access control (MAC) frameworks. These modules enforce predefined security policies, restricting access to resources based on labels and rules, even for privileged processes. This adds an additional layer of security beyond traditional discretionary access control.
Cryptographic API Hardening: This area addresses securing cryptographic operations within the kernel. It highlights the use of cryptographic agility, enabling constant-time cryptographic algorithms to prevent timing attacks, and using a hardware security module (HSM) to offload sensitive cryptographic operations to a dedicated secure device.
Auditing and Intrusion Detection: This category covers mechanisms to monitor kernel activity and detect suspicious events. It includes the use of the audit subsystem for logging security-relevant events, and integrating kernel instrumentation with intrusion detection systems.
Exploit Mitigation Techniques: The map lists various exploit mitigation methods, like stack canaries, which detect stack overflows, and Shadow Stacks, which protect return addresses from modification. These techniques make it more difficult for attackers to exploit vulnerabilities.
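Several of the defences in the list above correspond to kernel build options, so one way to use the map is to check a kernel configuration against it. The sketch below parses configuration text in the usual CONFIG_X=y format and reports which of a few options are not enabled. The selection of options is illustrative rather than taken from the map, option names can change between kernel versions, and the helper names are hypothetical.

```python
def parse_kernel_config(text):
    """Parse `CONFIG_X=value` lines; '# CONFIG_X is not set' maps to 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options

# A few options related to defences named above (not an exhaustive list).
HARDENING_OPTIONS = [
    "CONFIG_RANDOMIZE_BASE",          # KASLR
    "CONFIG_SLAB_FREELIST_HARDENED",  # slab freelist hardening
    "CONFIG_HARDENED_USERCOPY",       # hardened usercopy
    "CONFIG_STACKPROTECTOR",          # stack canaries
    "CONFIG_SECCOMP",                 # seccomp syscall filtering
]

def missing_hardening(options):
    """Return the listed options that are absent or not set to 'y'."""
    return [name for name in HARDENING_OPTIONS if options.get(name) != "y"]
```

On many distributions the configuration text can be read from /boot/config-<release> or, where enabled, /proc/config.gz.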

The Linux Kernel Defence Map provides a valuable overview, presenting these security hardening concepts in a structured and accessible format. It serves as a starting point for those looking to understand and implement kernel-level security measures, offering a broad perspective on the landscape of available techniques and guiding further research into specific areas of interest. However, it's crucial to note that security is a continuous process, and this map represents a snapshot of current best practices, not a complete or static solution. Continuous learning and adaptation are essential for maintaining a robust security posture.

Summary of Comments ( 10 )
https://news.ycombinator.com/item?id=43597264

Hacker News users generally praised the Linux Kernel Defence Map for its comprehensiveness and visual clarity. Several commenters pointed out its value for both learning and as a quick reference for experienced kernel developers. Some suggested improvements, including adding more details on specific mitigations, expanding coverage to areas like user namespaces and eBPF, and potentially creating an interactive version. A few users discussed the project's scope, questioning the inclusion of certain features and debating the effectiveness of some mitigations. There was also a short discussion comparing the map to other security resources.

The Hacker News post titled "Linux Kernel Defence Map – Security Hardening Concepts" generated several comments discussing the linked resource, a mind map visualizing various Linux kernel security hardening mechanisms.

Several commenters praised the map for its comprehensive overview and visual appeal. One user described it as "extremely helpful" and appreciated the clear organization of complex information. Another lauded the project's "great work" and found it beneficial for both learning and review. The visual nature of the map was highlighted as a key strength, allowing users to quickly grasp the relationships between different security concepts.

Some commenters focused on the map's practicality and usefulness. One suggested using it for security audits or as a reference during incident response. Another highlighted its potential as a learning tool, allowing users to delve deeper into specific areas based on their interests. The ability to see the interconnectedness of various security mechanisms was also mentioned as valuable for developing a holistic understanding of kernel security.

Several comments discussed specific aspects of kernel security and their representation in the map. Discussion arose around kernel self-protection mechanisms and their limitations. One commenter pointed out the trade-off between security and performance, emphasizing that implementing every hardening technique could have performance implications. Another mentioned the importance of keeping the map updated as new security features are introduced in the kernel. The inclusion of specific kernel modules and their functionalities was also discussed.

A few commenters suggested improvements or additions to the map. One recommended including links to relevant documentation or resources for each security mechanism. Another proposed adding a section on eBPF-based security tools. The possibility of creating an interactive version of the map was also mentioned.

Overall, the comments reflected a positive reception of the Linux Kernel Defence Map. Commenters appreciated its comprehensive nature, visual clarity, and practical value for both learning and professional use. While some suggestions for improvements were made, the overall consensus was that the map provides a valuable resource for anyone interested in understanding and enhancing Linux kernel security.

Landrun: Sandbox any Linux process using Landlock, no root or containers

Posted: 2025-03-22 13:56:59

Landrun is a tool that utilizes the Landlock Linux Security Module (LSM) to sandbox processes without requiring root privileges or containers. It allows users to define fine-grained access control rules for a target process, restricting its access to the filesystem, network, and other resources. By leveraging Landlock's unprivileged mode and a clever bootstrapping process involving temporary filesystems, Landrun simplifies sandbox setup and makes robust sandboxing accessible to regular users. This enables easier and more secure execution of potentially untrusted code, contributing to a more secure desktop environment.

The GitHub project "Landrun" introduces a novel approach to sandboxing Linux processes, leveraging the Landlock Linux Security Module (LSM) to restrict access to files, directories, and other system resources. Unlike traditional sandboxing methods like containers or user namespaces, Landrun operates without requiring root privileges, making it more accessible and potentially less resource-intensive.

The core functionality of Landrun revolves around creating a restricted execution environment for a target command. This environment is defined by a configuration file that specifies allowed and denied access patterns for various resource types. These access patterns utilize Landlock's rules, which can be highly granular, enabling fine-tuned control over what a sandboxed process can interact with. For instance, a rule could permit read access to a specific file, write access to a particular directory, or completely deny any interaction with a network socket.
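To make the rule semantics concrete, here is a toy model of how path-based Landlock rules combine. It is not Landrun's code and makes no kernel calls; the Ruleset class and its methods are hypothetical. The four access bits match the first LANDLOCK_ACCESS_FS_* values in the Linux headers. The model captures two properties of Landlock: a rule grants access to everything beneath a directory, and only access types the ruleset declares as handled are restricted at all.

```python
from pathlib import PurePosixPath

# Access-right bits; these match LANDLOCK_ACCESS_FS_* in <linux/landlock.h>.
EXECUTE, WRITE_FILE, READ_FILE, READ_DIR = 1 << 0, 1 << 1, 1 << 2, 1 << 3

class Ruleset:
    """Toy model: deny handled access types unless a rule grants them."""

    def __init__(self, handled):
        self.handled = handled  # access types this ruleset restricts
        self.rules = []         # (directory, allowed access bits)

    def add_path_beneath(self, path, allowed):
        self.rules.append((PurePosixPath(path), allowed & self.handled))

    def is_allowed(self, path, requested):
        target = PurePosixPath(path)
        granted = 0
        for root, allowed in self.rules:
            # A rule covers its directory and everything below it.
            if target == root or root in target.parents:
                granted |= allowed
        restricted = requested & self.handled
        return (restricted & ~granted) == 0
```

With handled set to READ_FILE | WRITE_FILE and a single rule granting READ_FILE beneath /usr, reading /usr/lib/libc.so is allowed, writing it is denied, and reading /etc/passwd is denied because no rule covers it.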

Landrun streamlines the process of using Landlock, abstracting away its complexities with a more user-friendly interface. Instead of directly interacting with the Landlock API, users can define their desired sandbox constraints in a declarative configuration format. Landrun then handles the translation of these constraints into the corresponding Landlock rules and applies them to the target process.

The project emphasizes ease of use and integration. It provides tools to easily generate default sandbox configurations and adapt them to specific needs. This simplifies the initial setup and allows users to quickly establish a baseline level of security. Furthermore, Landrun is designed to be easily incorporated into existing workflows, enabling developers to seamlessly integrate sandboxing into their build and deployment processes.

Landrun's reliance on the Landlock LSM offers several advantages. Landlock operates at the kernel level, providing a robust security boundary that is difficult for a compromised process to bypass. Its fine-grained access control capabilities allow for the creation of highly restrictive sandboxes, minimizing the potential impact of a security vulnerability. Finally, Landlock's efficient design ensures that the performance overhead of sandboxing is minimal.

The project's documentation highlights example use cases, including running untrusted code, isolating sensitive operations, and restricting access to specific resources. It also provides a comprehensive overview of the configuration options and demonstrates how to customize the sandbox behavior for different scenarios. The project's goal is to democratize access to advanced sandboxing techniques, empowering developers to enhance the security of their applications without requiring specialized expertise or elevated privileges.

Summary of Comments ( 122 )
https://news.ycombinator.com/item?id=43445662

HN commenters generally praise Landrun for its innovative approach to sandboxing, making it easier than traditional methods like containers or VMs. Several highlight the significance of using Landlock LSM for security, noting its kernel-level enforcement as a robust mechanism. Some discuss potential use cases, including sandboxing web browsers and other potentially risky applications. A few express concerns about complexity and debugging challenges, while others point out the project's early stage and potential for improvement. The user-friendliness compared to other sandboxing techniques is a recurring theme, with commenters appreciating the streamlined process. Some also discuss potential integrations and extensions, such as combining Landrun with Firejail.

The Hacker News post titled "Landrun: Sandbox any Linux process using Landlock, no root or containers" generated a fair amount of discussion, with several commenters expressing interest and raising relevant points.

Several users praised the project for its innovative approach to sandboxing, specifically highlighting the use of Landlock as a more granular and efficient alternative to traditional containerization or other sandboxing methods. They appreciated the potential for improved security and resource management. One commenter specifically lauded the project's ability to restrict access to specific files and directories, offering finer control than container-based solutions. This resonated with others who were looking for lightweight security options for specific applications.

A significant thread discussed the practical applications of Landrun. Suggestions ranged from securing web browsers and media players to isolating potentially vulnerable services. The ability to sandbox without root privileges was seen as a significant advantage, making the tool more accessible and usable in various environments.

Some users delved into the technical aspects of Landlock and its implementation within Landrun. They inquired about the performance overhead, the level of security provided against various attack vectors, and the project's compatibility with different Linux distributions. There was a specific question about the handling of shared libraries and the potential for vulnerabilities arising from those dependencies.

Concerns were also raised about the complexity of configuring Landlock rules, with users acknowledging the steep learning curve associated with understanding and effectively utilizing the technology. One commenter suggested that a more user-friendly interface or simplified rule management would be beneficial for wider adoption.

The conversation also touched upon the broader security implications of sandboxing and the importance of multiple layers of defense. While Landrun was recognized as a valuable tool, users emphasized that it shouldn't be considered a silver bullet and should be used in conjunction with other security practices.

Finally, a few commenters mentioned alternative sandboxing technologies like Bubblewrap and Firejail, drawing comparisons to Landrun and discussing the relative merits of each approach. This provided a broader context for understanding the landscape of Linux sandboxing tools.

A more robust raw OpenBSD syscall demo

Posted: 2025-03-12 06:11:41

This blog post presents a revised and more robust method for invoking raw OpenBSD system calls directly from C code, bypassing the standard C library. It improves upon a previous example by handling variable-length argument lists and demonstrating how to package those arguments correctly for system calls. The core improvement involves using assembly code to dynamically construct the system call arguments on the stack and then execute the syscall instruction. This allows for a more general and flexible approach compared to hardcoding argument handling for each specific system call. The provided code example demonstrates this technique with the getpid() system call.

This blog post, "A more robust raw OpenBSD syscall demo," delves into the intricacies of making direct system calls on OpenBSD, focusing on a more resilient approach than previously demonstrated. The author begins by recalling a prior, simpler example of invoking the gettimeofday syscall, highlighting its inherent fragility due to reliance on hardcoded offsets within the system call table. This method risks breaking with system updates that might shift these offsets.

The core improvement in this revised approach lies in dynamically resolving the syscall number for gettimeofday at runtime. This is accomplished by parsing the /usr/include/sys/syscall.h header file, specifically searching for the SYS_gettimeofday definition. The post meticulously explains the C code used to achieve this, including how it opens and reads the header file, employs a regular expression to extract the syscall number, and converts the extracted string into an integer. This number is then stored for later use in the actual system call invocation.
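The header-parsing step described above can be sketched as follows. The post does this in C; this version uses Python for brevity, and the function name is hypothetical. The embedded header text is an illustrative sample in the style of syscall.h, so its numbers should not be taken as authoritative for any particular release.

```python
import re

# Illustrative excerpt in the style of /usr/include/sys/syscall.h.
SAMPLE_HEADER = """\
/* syscall: "getpid" ret: "pid_t" args: */
#define\tSYS_getpid\t20

/* syscall: "gettimeofday" ret: "int" args: "struct timeval *" "struct timezone *" */
#define\tSYS_gettimeofday\t67
"""

def find_syscall_number(header_text, name):
    """Return the integer from a `#define SYS_<name> <number>` line."""
    pattern = re.compile(
        r"^#define\s+SYS_" + re.escape(name) + r"\s+(\d+)\s*$",
        re.MULTILINE,
    )
    match = pattern.search(header_text)
    if match is None:
        raise LookupError(f"SYS_{name} not found in header")
    return int(match.group(1))
```

On a real system the text would come from reading the header file, for example with open("/usr/include/sys/syscall.h").read(). Requiring whitespace after the name keeps SYS_getpid from matching a longer identifier that merely starts with the same letters.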

The author describes how system calls are made on OpenBSD/amd64: through the dedicated syscall instruction rather than a software interrupt such as the int 0x80 used by 32-bit x86 Linux. The specifics of preparing the arguments for gettimeofday on OpenBSD are detailed, including the use of a struct timeval pointer and a timezone argument (conventionally set to NULL). The blog post provides the assembly code snippet for executing the syscall instruction, emphasizing the crucial role of loading the dynamically determined syscall number into the appropriate register (%rax) before execution.

Beyond merely demonstrating a functional syscall, the post meticulously covers error handling. It showcases how to retrieve the return value from the syscall instruction (stored in %rax), and how to interpret it. Negative return values indicate an error, and the post elaborates on using the errno global variable to determine the specific error encountered. This is complemented by C code demonstrating how to check for errors and appropriately handle them, including printing informative error messages using the strerror function.
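The error-decoding step can be illustrated with a short sketch. It assumes the convention as described in the paragraph above, where a negative raw return value carries the error code; the actual convention depends on the platform and ABI. The function name is hypothetical, and Python's os.strerror stands in for the C strerror function.

```python
import errno
import os

def decode_raw_return(ret):
    """Split a raw return value into (value, error code, message).

    Assumes a negative return encodes the error code, as described above.
    """
    if ret < 0:
        err = -ret
        return None, err, os.strerror(err)
    return ret, 0, ""

# Example: a successful call and a failed one.
ok = decode_raw_return(7)
failed = decode_raw_return(-errno.ENOENT)
```

Here failed[1] equals errno.ENOENT and failed[2] is the platform's message for that code, which is what a C program would print via strerror.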

Finally, the post provides the complete C source, combining the dynamic syscall resolution with the error handling. The full example lets readers follow the whole lifecycle of a raw system call on OpenBSD, from finding the syscall number to executing the call and handling potential errors. The emphasis on dynamic resolution aims at a more maintainable approach that survives system updates that might alter syscall table offsets.

Summary of Comments ( 18 )
https://news.ycombinator.com/item?id=43340385

Several Hacker News commenters discuss the impracticality of the raw syscall demo, questioning its real-world usefulness and emphasizing that libraries like libc exist for a reason. Some appreciated the technical depth and the exploration of low-level system interaction, viewing it as an interesting educational exercise. One commenter suggested the demo could be useful for specialized scenarios like writing a dynamic linker or a microkernel. There was also a brief discussion about the performance implications and the idea that bypassing libc wouldn't necessarily result in significant speed improvements, and might even be slower in some cases. Some users also debated the portability of the code and suggested alternative methods for achieving similar results.

The Hacker News post "A more robust raw OpenBSD syscall demo" (https://news.ycombinator.com/item?id=43340385) has a modest number of comments, which center on the practicality and implications of the demonstrated technique.

One commenter points out the historical context of similar syscall techniques in older systems, mentioning how CP/M and DOS worked, highlighting the simplicity and directness of these older approaches. They suggest that while the demo might be "neat," it's not particularly novel.

Another commenter raises a concern about the portability of this method. They specifically mention the interaction with dynamic linkers like ld.so and how this approach might clash with Position Independent Executables (PIE), a common security feature in modern systems. This raises a practical barrier to using this technique in many real-world scenarios.

Building upon the portability concerns, a separate commenter notes the potential issues with signal handling and memory management, especially in multi-threaded environments. They explain that relying on the stack for argument passing in a raw syscall context can become problematic when signals interrupt execution or when threads are involved.

One commenter expresses skepticism about the "robustness" claimed in the title, arguing that true robustness in system calls necessitates proper error handling and boundary checks. They imply the demo, in its simplicity, lacks these vital aspects.

A different commenter delves into the details of OpenBSD's system call implementation, specifically mentioning the syscall() wrapper function. They explain that this wrapper handles some of the low-level details, contrasting it with the rawer approach demonstrated in the linked blog post. This provides additional context on the standard way system calls are usually invoked in OpenBSD.

Finally, a commenter pivots the discussion slightly, mentioning the security implications of directly manipulating the stack for system call arguments. They suggest that this method might create vulnerabilities, especially if the input is not properly sanitized or validated. This adds another layer of concern regarding the practicality and safety of the demonstrated technique.

In summary, the comments on the Hacker News post offer a range of perspectives, from historical context and comparisons with older systems to concerns about portability, robustness, and security. While some find the demo interesting, others express reservations about its real-world applicability and potential drawbacks.

Testtrim: A testing tool that couldn't test itself (until now)

Posted: 2025-01-25 20:24:55

Testtrim, a tool designed to reduce the size of test suites while maintaining coverage, ironically struggled to effectively test itself due to its reliance on ptrace for syscall tracing. This limitation prevented Testtrim from analyzing nested calls, leading to incomplete coverage data and hindering its ability to confidently trim its own test suite. A recent update introduces a novel approach using eBPF, enabling Testtrim to accurately trace nested syscalls. This breakthrough allows Testtrim to thoroughly analyze its own behavior and finally optimize its test suite, demonstrating its newfound self-testing capability and reinforcing its effectiveness as a test suite reduction tool.

Mathieu Fenniak's blog post, "Testtrim: A testing tool that couldn't test itself (until now)," details the work of enhancing Testtrim, a testing tool specifically designed for file descriptor usage in system calls within the Linux kernel. Initially, Testtrim faced a significant limitation: it couldn't effectively test itself. The deficiency stemmed from its reliance on ptrace for syscall tracing, which conflicted with tracing the syscalls of a copy of the tool that was itself already using ptrace for its own testing operations. This produced a nested ptrace scenario that Linux does not support, since a process can have only one tracer at a time.

The blog post meticulously outlines the technical complexities involved in overcoming this hurdle. The core of the solution involved leveraging a nested tracing mechanism. Instead of relying solely on ptrace, Testtrim was modified to employ a combination of ptrace(PTRACE_SEIZE) and seccomp(SECCOMP_MODE_FILTER) for syscall interception. This allowed Testtrim to trace the initial set of system calls. For the critical nested layer, where Testtrim needed to analyze its own syscall behavior while already engaged in a tracing operation, the blog post describes the implementation of a custom kernel module. This module intercepted the necessary syscalls specifically within the Testtrim process, providing the required information without resorting to the problematic recursive ptrace.

Fenniak elaborates on the technical challenges encountered during this implementation. The initial approach involved using kprobes, which proved insufficient due to their inability to access specific register values necessary for comprehensive syscall analysis. Subsequently, the implementation shifted to utilize tracepoints, offering the granular access required for accurate data collection. The blog post delves into the specifics of interacting with the trace_pipe mechanism to retrieve the captured syscall data from the kernel module. It also highlights the importance of carefully managing the synchronization and buffering aspects of this inter-process communication to ensure data integrity and prevent race conditions.

Finally, the blog post concludes by celebrating the successful implementation of this nested tracing approach. This advancement allows Testtrim to thoroughly test its own intricate syscall interactions, significantly bolstering its reliability and robustness. This achievement marks a substantial improvement in Testtrim's capabilities, solidifying its position as a valuable tool for rigorous testing of file descriptor management within the Linux kernel. The nuanced description of the solution underscores the depth of technical expertise required to navigate the complexities of kernel-level tracing and highlights the innovative approach taken to overcome the inherent limitations of traditional ptrace-based methods.

Summary of Comments ( 0 )
https://news.ycombinator.com/item?id=42824526

The Hacker News comments discuss the complexity of testing tools like Testtrim, which aim to provide comprehensive syscall tracing. Several commenters appreciate the author's deep dive into the technical challenges and the clever solution involving a VM and intercepting the vmexit instruction. Some highlight the inherent difficulties in testing tools that operate at such a low level, where the very act of observation can alter the behavior of the system. One commenter questions the practical applications, suggesting that existing tools like strace and ptrace might be sufficient in most scenarios. Others point out that Testtrim's targeted approach, specifically focusing on nested virtualization, addresses a niche but important use case not covered by traditional tools. The discussion also touches on the value of learning obscure assembly instructions and the excitement of low-level debugging.

The Hacker News post titled "Testtrim: A testing tool that couldn't test itself (until now)" sparked a brief but insightful discussion with a few key comments.

One commenter highlights the core issue presented in the article: the difficulty of testing system call tracing tools due to their reliance on ptrace. They explain that these tools essentially operate by "sitting underneath" the target process, making it challenging to trace themselves without creating a confusing and possibly conflicting hierarchy of tracing. The commenter then expresses appreciation for the clear explanation of the problem and solution provided in the article.

Another commenter points out the specific challenge related to the "observer effect" in such situations, where the act of observing (tracing) the system calls inherently alters the behavior of the system being observed, making self-testing problematic. They mention the difficulty of using existing tools like strace, further emphasizing the uniqueness of the problem faced by the testtrim developer. This comment adds to the discussion by providing another perspective on the inherent complexity involved.

A third comment adds a humorous touch, referencing the paradoxical nature of self-reference and using the example of a barber who shaves everyone in town who doesn't shave themselves, posing the classic question of who shaves the barber. This lighthearted comment, while not directly addressing the technical details, captures the essence of the self-referential challenge present in testing a system call tracing tool.

Finally, one commenter focuses on the solution implemented, which involves conditionally disabling syscall tracing if the process being traced is also testtrim. They applaud the elegance and simplicity of this solution, seeing it as a testament to good design and a clear understanding of the problem.

While the discussion is not extensive, these comments provide valuable insights into the complexities of testing system call tracing tools, the specific challenges related to self-referential testing, and the appreciation for the elegant solution presented by the author of the original article.

Process Creation in Io_uring

Posted: 2024-12-20 15:23:05

The article explores a new method for process creation using io_uring, aiming to improve efficiency and reduce overhead compared to traditional fork() and execve(). This new approach uses a "registered executable" within io_uring, allowing asynchronous process launching without the performance penalties of copying memory pages between parent and child processes. The proposed solution involves two new system calls: pidfd_spawn() and pidfd_wait(). pidfd_spawn() creates a new process from the registered executable and returns a process file descriptor, while pidfd_wait() provides an asynchronous wait mechanism using io_uring. This approach offers a streamlined process-creation pathway within the io_uring framework, potentially boosting performance for applications that frequently spawn processes, like containers or web servers.

This LWN article covers a proposed enhancement to the Linux kernel's io_uring subsystem: the ability to create processes directly through a new operation type. io_uring currently excels at asynchronous I/O, letting applications submit batches of I/O requests without blocking. Creating a process, for example to launch a helper that handles part of a workload, still requires stepping outside the ring and issuing ordinary system calls, which interrupts the asynchronous flow. The proposal aims to remedy this by introducing a dedicated IORING_OP_PROCESS operation.

The proposed mechanism allows applications to specify all necessary parameters for process creation within the io_uring submission queue entry (SQE). This includes details like the executable path, command-line arguments, environment variables, user and group IDs, and various other process attributes. Critically, this eliminates the need for a system call like fork() or execve(), thereby maintaining the asynchronous nature of the operation within the io_uring context. Upon completion, the kernel places the process ID (PID) of the newly created process in the completion queue entry (CQE), enabling the application to monitor and manage the spawned process.
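For reference, the traditional sequence that the proposal is meant to avoid looks roughly like the sketch below: fork a child, replace its image with execv, and block in the parent until it exits. This is the conventional path only, written with Python's os module; it does not use io_uring, and the helper name `spawn_and_wait` is made up for the example.

```python
import os
import sys


def spawn_and_wait(argv):
    """Run the program at argv[0] via fork + execv and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execv(argv[0], argv)
        finally:
            # Only reached if execv failed; never fall back into the caller's code.
            os._exit(127)
    # Parent: block until the child terminates, then decode its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "import sys; sys.exit(7)"])
    print("child exited with", code)
```

Each step here is a separate, synchronous system call made by the application, which is the overhead an in-ring operation would aim to remove.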

The article highlights the intricate details of how this process creation within io_uring is implemented. It explains how the necessary data structures are populated within the kernel, how the new process is forked and executed within the context of the io_uring kernel threads, and how signal handling and other process-related intricacies are addressed. Specifically, the IORING_OP_PROCESS operation utilizes a dedicated structure called io_uring_process, embedded within the SQE, which mirrors the arguments of the traditional execveat() system call. This allows for a familiar and comprehensive interface for developers already accustomed to process creation in Linux.

Furthermore, the article discusses the security implications and design choices made to mitigate potential vulnerabilities. Given the asynchronous nature of io_uring, ensuring proper isolation and preventing unauthorized process creation are paramount. The article emphasizes how the proposal adheres to existing security mechanisms and leverages existing kernel infrastructure for process management, thereby minimizing the introduction of new security risks. This involves careful handling of file descriptor inheritance, namespace management, and other security-sensitive aspects of process creation.

Finally, the article touches upon the performance benefits of this proposed feature. By avoiding the context switch overhead associated with traditional process creation system calls, applications leveraging io_uring can achieve greater efficiency, particularly in scenarios involving frequent process spawning. This streamlines workflows involving parallel processing and asynchronous task execution, ultimately boosting overall system performance.

Summary of Comments ( 26 )
https://news.ycombinator.com/item?id=42471861

Hacker News users discuss the implications of io_uring's new process creation capabilities. Several express excitement about the potential performance improvements, particularly for applications that frequently spawn processes, like web servers. Some highlight the security benefits of avoiding execve, while others raise concerns about the complexity introduced by this new feature and the potential for misuse. A few commenters delve into the technical details, comparing the approach to other process creation methods and discussing the trade-offs involved. Several anticipate interesting use cases, including containerization and sandboxing. One user questions if io_uring is becoming overly complex and straying from its original purpose.

Stories with Tag system calls

Disabling kernel functions in your process (2009)

Summary of Comments ( 6 ) https://news.ycombinator.com/item?id=44047741

Why Does My eBPF Program Work on One Kernel but Fail on Another?

Summary of Comments ( 12 ) https://news.ycombinator.com/item?id=43769461

Linux Kernel Defence Map – Security Hardening Concepts

Summary of Comments ( 10 ) https://news.ycombinator.com/item?id=43597264

Landrun: Sandbox any Linux process using Landlock, no root or containers

Summary of Comments ( 122 ) https://news.ycombinator.com/item?id=43445662

A more robust raw OpenBSD syscall demo

Summary of Comments ( 18 ) https://news.ycombinator.com/item?id=43340385

Testtrim: A testing tool that couldn't test itself (until now)

Summary of Comments ( 0 ) https://news.ycombinator.com/item?id=42824526

Process Creation in Io_uring

Summary of Comments ( 26 ) https://news.ycombinator.com/item?id=42471861
