In a production environment, filesystem faults might occur due to various incidents such as disk failures and administrator errors. As a Chaos Engineering platform, Chaos Mesh has supported simulating I/O faults in a filesystem ever since its early versions. By simply adding an IOChaos CustomResourceDefinition (CRD), we can watch how the filesystem fails and returns errors.
However, before Chaos Mesh 1.0, this experiment was not easy and could consume a lot of resources. We needed to inject sidecar containers into the Pod through mutating admission webhooks and rewrite the ENTRYPOINT command. Even if no fault was injected, the injected sidecar container caused a substantial amount of overhead.
Chaos Mesh 1.0 has changed all this. Now, we can use IOChaos to inject faults to a filesystem at runtime. This simplifies the process and greatly reduces system overhead. This blog post introduces how we implement the IOChaos experiment without using a sidecar.
To simulate I/O faults at runtime, we need to inject faults into a filesystem after the program issues system calls (such as reads and writes) but before the requests arrive at the target filesystem. We can do that in one of two ways:
- Use Berkeley Packet Filter (BPF); however, it cannot be used to inject delay.
- Add a filesystem layer called ChaosFS before the target filesystem. ChaosFS uses the target filesystem as the backend and receives requests from the operating system. The entire call link is target program syscall -> Linux kernel -> ChaosFS -> target filesystem. Because ChaosFS is customizable, we can inject delays and errors as we want. Therefore, ChaosFS is our choice.
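To make the idea of an interposed layer concrete, here is a minimal Python sketch. It is not ChaosFS itself: the class name FaultInjectingFile and its parameters are hypothetical, and it only wraps a single file's reads rather than implementing a filesystem. It shows the core pattern, which is to delay or fail a request first and otherwise forward it to the backend.

```python
import errno
import os
import random
import time


class FaultInjectingFile:
    """Hypothetical stand-in for a ChaosFS-style layer: each read is
    optionally delayed or failed, then forwarded to the backend file."""

    def __init__(self, backend_path, delay_s=0.0, error_rate=0.0, err=errno.EIO):
        self.fd = os.open(backend_path, os.O_RDONLY)
        self.delay_s = delay_s        # latency added to every read
        self.error_rate = error_rate  # probability in [0, 1] of failing a read
        self.err = err                # errno reported for injected failures

    def read(self, size):
        if self.delay_s:
            time.sleep(self.delay_s)                        # injected delay
        if random.random() < self.error_rate:
            raise OSError(self.err, os.strerror(self.err))  # injected error
        return os.read(self.fd, size)                       # forward to backend

    def close(self):
        os.close(self.fd)
```

With error_rate=0.0 the wrapper is transparent, and with error_rate=1.0 every read fails with the chosen errno, which mirrors how a layer in front of the target filesystem can turn faults on and off.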
But ChaosFS has several problems:
- If ChaosFS reads and writes files in the target filesystem, we need to mount ChaosFS to a different path than the target path specified in the Pod configuration. ChaosFS cannot be mounted to the path of the target directory.
- We need to mount ChaosFS before the target program starts running. This is because the newly-mounted ChaosFS takes effect only on files that are newly opened by the program in the target filesystem.
- We need to mount ChaosFS to the target container's mnt namespace. For details, see the mount_namespaces(7) Linux manual page.
Before Chaos Mesh 1.0, we used the mutating admission webhook to implement IOChaos. This technique addressed the three problems listed above and allowed us to:
- Run scripts in the target container. This action changed the target directory of the ChaosFS's backend filesystem (for example, from /mnt/a to /mnt/a_bak) so that we could mount ChaosFS to the target path (/mnt/a).
- Modify the command that starts the Pod. For example, we could modify the original command so that the program was launched through the waitfs.sh script. The waitfs.sh script kept checking whether the filesystem was successfully mounted. If it was mounted, the script started the program.
- Add a new container in the Pod to run ChaosFS. This container needed to share a volume with the target container (for example, /mnt), and then we mounted this volume to the target directory (for example, /mnt/a). We also enabled the proper mount propagation so that this volume's mount propagated to the host (shared) and then from the host to the target container (slave).
These three approaches allowed us to inject I/O faults while the program was running. However, the injection was far from convenient:
- We could only inject faults into a volume subdirectory, not into the entire volume. The workaround was to use mount --move to move the mount point of the target volume.
- We had to explicitly write commands in the Pod rather than implicitly use the image commands. Otherwise, the /waitfs.sh script could not properly start the program after the filesystem was mounted.
- The corresponding container needed to have a proper configuration for mount propagation. Due to potential privacy and security issues, we could not modify the configuration via the mutating admission webhook.
- The injection configuration was troublesome. Worse still, once the configuration was in place, we had to create a new Pod before we could inject faults.
- We could not withdraw ChaosFS while the program was running. Even if no fault or error was injected, the performance was greatly affected.
What about cracking these tough nuts without the mutating admission webhook? Let's get back and think a bit about the reason why we used the mutating admission webhook to add a container in which ChaosFS runs. We do that to mount the filesystem to the target container.
In fact, there is another solution. Instead of adding containers to the Pod, we can first use the setns Linux system call to modify the namespace of the current process and then use the mount call to mount ChaosFS to the target container. Suppose that the filesystem to inject is /mnt. The new injection process is as follows:
- Use setns for the current process to enter the mnt namespace of the target container.
- Use mount --move to move /mnt to /mnt_bak.
- Mount ChaosFS to /mnt using /mnt_bak as the backend.
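The steps above can be sketched as a command plan. This is only an illustration under stated assumptions: it uses the util-linux nsenter tool (which calls setns before executing a command) instead of calling setns directly, and the chaosfs binary name and its two positional arguments are hypothetical placeholders, not the real Chaos Mesh command line. The function only builds the commands; running them would require root privileges on the node.

```python
def injection_plan(pid, target="/mnt", chaosfs_bin="chaosfs"):
    """Return the commands for the injection steps as argv lists.

    `pid` is a process inside the target container. `chaosfs_bin` and its
    arguments are hypothetical; nsenter(1) and mount(8) are real tools.
    """
    backup = target.rstrip("/") + "_bak"
    # nsenter joins the mount namespace of `pid` via setns(2), then runs
    # the command that follows "--" inside that namespace.
    enter = ["nsenter", "--target", str(pid), "--mount", "--"]
    return [
        # Move the original mount out of the way.
        enter + ["mount", "--move", target, backup],
        # Serve `target` through the fault-injecting layer, backed by `backup`.
        enter + [chaosfs_bin, backup, target],
    ]
```

For example, injection_plan(1234) yields one command that moves /mnt to /mnt_bak and one that starts the hypothetical ChaosFS binary with /mnt_bak as the backend and /mnt as the mount point.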
After the process is finished, the target container will open, read, and write the files in /mnt through ChaosFS. In this way, delays or faults are injected much more easily. However, there are still two questions to answer:
- How do you handle the files that are already opened by the target process?
- How do you recover the process given that we cannot unmount the filesystem when files are opened?
ptrace answers both questions. We can use ptrace to replace the opened file descriptors (FD) at runtime and replace the current working directory (CWD) and mmap.
ptrace is a powerful tool that makes the target process (the tracee) run any system call or binary program. To make a tracee run a program, ptrace changes the address in the tracee's RIP register so that it points to the program and adds an int3 instruction to trigger a breakpoint. When the binary program stops, we need to restore the registers and memory.
In the x86_64 architecture, the RIP register (also called the instruction pointer) always points to the memory address of the next instruction to be executed. To load the program into the target process's memory space:
- Use ptrace to call mmap in the target program to allocate the needed memory.
- Write the binary program to the newly allocated memory and make the RIP register point to it.
- After the binary program stops, call munmap to clean up the memory section.
As a best practice, we often replace ptrace POKE_TEXT writes with process_vm_writev because process_vm_writev is more efficient when there is a large amount of data to write.
Using ptrace, we are able to make a process replace its own FD. Now we only need a method to make that replacement happen. This method is the dup2 system call.
Use dup2 to replace the file descriptor
The signature of the dup2 function is int dup2(int oldfd, int newfd);. It is used to create a copy of the old FD (oldfd). This copy has an FD number of newfd. If newfd already corresponds to the FD of an opened file, the FD on the file that's already opened is automatically closed.
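The behavior of dup2 described above can be observed with a short Python sketch, since os.dup2 is a thin wrapper over the system call. The helper name dup2_demo and the temporary file names are made up for this illustration.

```python
import os
import tempfile


def dup2_demo():
    """Show that after dup2(oldfd, newfd), newfd refers to oldfd's file."""
    with tempfile.TemporaryDirectory() as d:
        path_a = os.path.join(d, "a")
        path_b = os.path.join(d, "b")
        a = os.open(path_a, os.O_WRONLY | os.O_CREAT, 0o600)
        b = os.open(path_b, os.O_WRONLY | os.O_CREAT, 0o600)
        os.dup2(a, b)          # b's previous file is closed; b now aliases a
        os.write(b, b"via-b")  # the data lands in file "a", not "b"
        os.close(a)
        os.close(b)
        with open(path_a, "rb") as f:
            content_a = f.read()
        with open(path_b, "rb") as f:
            content_b = f.read()
    return content_a, content_b
```

The write through the second descriptor ends up in the first file, and the second file stays empty because its descriptor was closed by dup2.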
For example, the current process opens /var/run/__chaosfs__test__/a whose FD is 1. To replace this opened file with /var/run/test/a, this process performs the following operations:
- Uses the fcntl system call to get the OFlags (the flags parameter used by the open system call) of /var/run/__chaosfs__test__/a.
- Uses the lseek system call to get the current seek location of FD 1.
- Uses the open system call to open /var/run/test/a using the same OFlags. Assume that the FD is 2.
- Uses lseek to change the seek location of the newly opened FD 2.
- Uses dup2(2, 1) to replace the FD 1 of /var/run/__chaosfs__test__/a with the newly opened FD 2.
- Closes FD 2.
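The sequence above can be sketched in Python, where fcntl.fcntl, os.lseek, os.open, and os.dup2 map directly to the corresponding system calls. This is a minimal illustration that runs inside the current process; in Chaos Mesh the same steps are executed by the tracee through ptrace. The function name replace_fd is hypothetical.

```python
import fcntl
import os


def replace_fd(fd, new_path):
    """Re-point `fd` at `new_path`, keeping its open flags and offset."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)     # flags of the opened file
    offset = os.lseek(fd, 0, os.SEEK_CUR)      # current seek location
    # Defensive: never replay creation-time flags on the new open call.
    flags &= ~(os.O_CREAT | os.O_EXCL | os.O_TRUNC)
    new_fd = os.open(new_path, flags)          # open the replacement file
    try:
        os.lseek(new_fd, offset, os.SEEK_SET)  # restore the seek location
        os.dup2(new_fd, fd)                    # fd now refers to new_path
    finally:
        os.close(new_fd)                       # drop the temporary FD
```

After replace_fd returns, reads and writes on the original FD number continue at the same offset but operate on the new file, so the caller does not need to know that the swap happened.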
After the process is finished, FD 1 of the current process points to /var/run/test/a. So that we can inject faults, any subsequent operations on the target file go through the Filesystem in Userspace (FUSE). FUSE is a software interface for Unix and Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code.
The combined functionality of ptrace and dup2 makes it possible for the tracer to make the tracee replace the opened FD by itself. Now, we need to write a binary program and make the target process run it:
- Write a piece of assembly code according to the two sections above and the usage of syscall instructions. Here is an example of the assembly code.
- Use an assembler to translate the code into a binary program. We use dynasm-rs as the assembler.
- Use ptrace to make the target process run this program. When the program runs, the FD is replaced at runtime.
In the implementation above, we assume that:
- The threads of the target process are POSIX threads and share the opened files.
- When the target process creates threads using the clone function, the CLONE_FILES parameter is passed.
Therefore, Chaos Mesh only replaces the FD of the first thread in the thread group.
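The assumption that threads share opened files can be checked with a small experiment. The Python sketch below relies on CPython threads being POSIX threads on Linux, which share one FD table; the helper name fd_swap_seen_by_other_thread is made up for this illustration. One thread swaps an FD with dup2, and the main thread then reads through the same FD number.

```python
import os
import tempfile
import threading


def fd_swap_seen_by_other_thread():
    """Swap an FD in a worker thread and read it from the main thread."""
    with tempfile.TemporaryDirectory() as d:
        paths = [os.path.join(d, name) for name in ("a", "b")]
        for path, data in zip(paths, (b"A", b"B")):
            with open(path, "wb") as f:
                f.write(data)
        fd = os.open(paths[0], os.O_RDONLY)   # opened by the main thread

        def swap():
            new_fd = os.open(paths[1], os.O_RDONLY)
            os.dup2(new_fd, fd)               # replace fd inside the worker
            os.close(new_fd)

        worker = threading.Thread(target=swap)
        worker.start()
        worker.join()
        data = os.read(fd, 1)                 # main thread sees the new file
        os.close(fd)
    return data
```

Because the FD table is shared, replacing the FD in one thread is enough for every thread in the group, which is why a single replacement per thread group suffices under the assumptions above.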
The following diagram illustrates the overall I/O fault injection process:
In this diagram, each horizontal line corresponds to a thread that runs in the direction of the arrows. The Mount/Umount Filesystem and Replace FD tasks are carefully arranged in sequence. Given the process above, this arrangement makes a lot of sense.
I've discussed how we implement fault injection to simulate I/O faults at runtime (see chaos-mesh/toda). However, the current implementation is far from perfect:
- Generation numbers are not supported.
- ioctl is not supported.
- Chaos Mesh does not immediately determine whether a filesystem is successfully mounted. It does so only after one second.
This is the first post in a series on Chaos Mesh implementation. If you want to see how other types of fault injection are implemented, stay tuned.