Why does /usr/bin/echo work just the same as /////usr//bin//////echo?
Recently, I sent a patch to the trace-cmd utility and received a suggestion to change my implementation to something like this:
// omitted for brevity
snprintf(buf, PATH_MAX, "%s/%s", path, exec_filename);
My initial reaction was that a corner case was not being considered: when the value of path happens to end with /. That would cause buf to end up with values like /usr/bin//echo.
However, the person who suggested it was no beginner at all. He is in fact an extremely respected developer within the kernel community, and I thought it was unlikely for him to make such a mistake.
So I decided to run a few tests myself and, to my great surprise, Linux doesn’t really care if you have extraneous "/" characters between folders 😨. How come I didn’t know that?
Trying to find the answer
I knew that something in the OS was handling that corner case, but I didn’t know exactly where that piece of code lived, and I was determined to find it.
In my mind, it could be in any of the following:
- libC / bash (depending on the way the program is executed)
- kernel
- filesystem implementation
First attempt — tracing syscalls
In order to know whether libC or bash had something to do with that, I traced which values were sent to the execve syscall:
# with the right amount of '/' chars
$ strace /usr/bin/echo 2>&1 | grep execve
execve("/usr/bin/echo", ["/usr/bin/echo"], 0x7ffd14a710e0 /* 57 vars */) = 0
# with the extraneous '/' chars in between
$ strace //usr///bin////echo 2>&1 | grep execve
execve("//usr///bin////echo", ["//usr///bin////echo"], 0x7ffd401e6870 /* 57 vars */) = 0
As the trace shows, the string is sent as-is to the kernel, which takes libC and bash out of the equation. Now I needed to find where in the kernel that was handled. I guess my question could be phrased as:
Is the filesystem implementation (ext4, zfs, etc.) the one responsible for dealing with such cases, or is the kernel sanitising the path and breaking it into segments/tokens regardless of which filesystem is being used?
The problem now was: how to test something in a controlled environment that could either prove or disprove my hypothesis? 🤔
Second attempt — gdb’ing a small linux/kernel
I speculated that if I could attach GDB to a kernel running in QEMU, then I would be able to set a breakpoint on the execve syscall and definitively figure out the answer by myself.
However, I couldn’t stop thinking that in a fully-fledged Linux distribution that syscall would be called dozens of times, which would make my debugging life harder than it needed to be.
My idea was that if I could reduce how many programs were running in userspace to as close as possible to a single program (ideally bash), then whichever program called execve would be the one I executed myself.
The solution I found was to build an initramfs cpio image using buildroot. It went something like this:
cd buildroot
make O=../minimal_linux_fs qemu_x86_64_defconfig
cd ../minimal_linux_fs
make menuconfig
# enable the root cpio target filesystem && disable other stuff I didn't need
make -j$(nproc)
# once everything is compiled
ls -larth images/
total 21M
drwxr-xr-x. 1 paulo paulo 48 Jan 19 16:48 .
-rw-r--r--. 1 paulo paulo 21M Jan 19 18:31 rootfs.cpio
drwxr-xr-x. 1 paulo paulo 532 Jan 19 21:07 ..
Then I started a custom kernel, compiled with CONFIG_DEBUG_INFO, in QEMU and had it wait for GDB:
# -kernel: custom kernel
# -initrd: the initramfs built above
# -append "nokaslr": disable kaslr, otherwise GDB wouldn't be able to set a breakpoint
# -s -S: wait for gdb to connect on port 1234
qemu-system-x86_64 \
    -M pc \
    -kernel ../linux/arch/x86_64/boot/bzImage \
    -initrd images/rootfs.cpio \
    -append "nokaslr" \
    -m 128 -d guest_errors -no-reboot -no-shutdown \
    -s -S
Last but not least, I started GDB:
# -ex "target remote :1234": connect to QEMU
# -ex "layout src": show code found in the debug_* ELF sections (oversimplification)
gdb \
    -ex "target remote :1234" \
    -ex "layout src" ./vmlinux
Once everything was initialised, I put a breakpoint on the function that handles the execve syscall.
After I stepped into a few functions, I finally found the ‘culprit’. The kernel increments the pointer for as long as the next character is /. That way, //usr/bin or /usr//bin is eventually broken into usr and bin before the dentry (and subsequently the inode) can actually be found.
I put together a simplified version of that routine so that you can roughly understand how it does it.
static int link_path_walk(const char *name, struct nameidata *nd)
{
    // ...
    if (IS_ERR(name))
        return PTR_ERR(name);
    while (*name=='/')
        name++;
    // ...
    /* At this point we know we have a real path component. */
    for(;;) {
        // ...
        /*
         * If it wasn't NUL, we know it was '/'. Skip that
         * slash, and continue until no more slashes.
         */
        do {
            name++;
        } while (unlikely(*name == '/'));
        if (unlikely(!*name)) {
            /* last component of nested symlink */
            name = nd->stack[--depth].name;
            link = walk_component(nd, 0);
        } else {
            /* not the last component */
            link = walk_component(nd, WALK_MORE);
        }
    }
}
If you are interested in reading the full source code, this is where you would find it: https://github.com/torvalds/linux/blob/v6.1/fs/namei.c#L2238-L2338
Conclusion
The kernel performs the path sanity check and clean-up before the filesystem implementation gets a chance to run. So the suggestion I got during the code review was spot on and makes the code simpler after all.
Well, I hope you enjoyed reading this as much as I enjoyed figuring it out. Until next time :)
Paulo Almeida
Update
After I posted this on several platforms (Mastodon, Reddit and so on), one user was kind enough to point me to something written in the POSIX specification.
This behaviour is actually required by POSIX.2007:
“Multiple successive <slash> characters are considered to be the same as one <slash>, except for the case of exactly two leading <slash> characters.”
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_271
The meaning of exactly two leading <slash> characters is implementation-defined.