* Orphan filesystems after mount namespace destruction and tmpfs "leak"
@ 2026-02-02 17:50 Kiryl Shutsemau
2026-02-02 18:43 ` Al Viro
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Kiryl Shutsemau @ 2026-02-02 17:50 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Hugh Dickins, Baolin Wang
Cc: linux-mm, linux-fsdevel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2080 bytes --]
Hi,
In the Meta fleet, we saw a problem where destroying a container didn't
lead to freeing the shmem memory attributed to a tmpfs mounted inside
that container. It triggered an OOM when a new container attempted to
start.
Investigation has shown that this happened because a process outside of
the container kept a file from the tmpfs mapped. The mapped file is
small (4k), but it holds all the contents of the tmpfs (~47GiB) from
being freed.
When a tmpfs filesystem is mounted inside a mount namespace (e.g., a
container), and a process outside that namespace holds an open file
descriptor to a file on that tmpfs, the tmpfs superblock remains in
kernel memory indefinitely after:
1. All processes inside the mount namespace have exited.
2. The mount namespace has been destroyed.
3. The tmpfs is no longer visible in any mount namespace.
The superblock persists with mnt_ns = NULL in its mount structures,
keeping all tmpfs contents pinned in memory until the external file
descriptor is closed.
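For illustration, a minimal Python sketch of the scenario (not the attached
reproducer; the /mnt mountpoint and sizes are made up, and it needs root on a
glibc system):

#!/usr/bin/env python3
"""Sketch: tmpfs mounted in a private mount namespace stays pinned by an
fd held outside that namespace, even after the namespace is destroyed."""
import ctypes
import os
import sys
import time

libc = ctypes.CDLL("libc.so.6", use_errno=True)
CLONE_NEWNS = 0x00020000
MS_REC = 0x4000
MS_PRIVATE = 0x40000

pid = os.fork()
if pid == 0:
    # "Container": private mount namespace with its own tmpfs.
    if libc.unshare(CLONE_NEWNS) != 0:
        sys.exit("unshare: " + os.strerror(ctypes.get_errno()))
    # Best effort: stop mount events from propagating back to the parent.
    libc.mount(b"none", b"/", None, ctypes.c_ulong(MS_REC | MS_PRIVATE), None)
    if libc.mount(b"tmpfs", b"/mnt", b"tmpfs", ctypes.c_ulong(0), b"size=1G") != 0:
        sys.exit("mount: " + os.strerror(ctypes.get_errno()))
    with open("/mnt/small", "wb") as f:
        f.write(b"x" * 4096)            # the 4k file that gets pinned
    with open("/mnt/big", "wb") as f:
        for _ in range(256):            # ~256 MiB of shmem standing in for ~47 GiB
            f.write(b"\0" * (1 << 20))
    time.sleep(5)                       # give the parent time to grab an fd
    os._exit(0)                         # last process: the mount namespace dies

time.sleep(1)
# A process *outside* the namespace pins the small file via /proc/<pid>/root.
fd = os.open(f"/proc/{pid}/root/mnt/small", os.O_RDONLY)
os.waitpid(pid, 0)
# The mount namespace is gone and the tmpfs is visible nowhere, yet the whole
# superblock (including /mnt/big) stays in memory until fd is closed.
print("namespace destroyed, tmpfs pinned by fd", fd)
time.sleep(600)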
The problem is not specific to tmpfs, but for filesystems with backing
storage, the memory impact is not as severe since the page cache is
reclaimable.
The obvious solution to the problem is "Don't do that": the file should
be unmapped/closed upon container destruction.
But I wonder if the kernel can/should do better here? Currently, this
scenario is hard to diagnose. It looks like a leak of shmem pages.
Also, I wonder if the current behavior can lead to data loss on a
filesystem with backing storage:
- The mount namespace where my USB stick was mounted is gone.
- The USB stick is no longer mounted anywhere.
- I can pull the USB stick out.
- Oops, someone was writing there: corruption/data loss.
I am not sure what a possible solution would be here. I can only think
of blocking exit(2) for the last process in the namespace until all
filesystems are cleanly unmounted, but that is not very informative
either.
I have attached a Claude-generated reproducer and a drgn script that
lists orphan tmpfs filesystems.
--
Kiryl Shutsemau / Kirill A. Shutemov
[-- Attachment #2: tmpfs_leak_reproducer.sh --]
[-- Type: application/x-sh, Size: 4656 bytes --]
[-- Attachment #3: monitor_tmpfs.py --]
[-- Type: text/plain, Size: 4804 bytes --]
#!/usr/bin/env drgn
"""
Monitor tmpfs superblocks to detect orphaned/leaked tmpfs filesystems.
Run with: sudo drgn -s /path/to/vmlinux monitor_tmpfs.py
This script lists all tmpfs superblocks and shows which ones have
NULL mount namespaces (orphaned) or are otherwise in unusual states.
"""
from drgn import container_of
from drgn.helpers.linux.list import hlist_for_each_entry, list_for_each_entry
def get_mount_info(sb):
    """Get mount information for a superblock"""
    mounts = []
    try:
        for mnt in list_for_each_entry('struct mount', sb.s_mounts.address_of_(), 'mnt_instance'):
            try:
                mnt_ns = mnt.mnt_ns
                mnt_ns_addr = mnt_ns.value_() if mnt_ns else 0
                # Get device name
                devname = mnt.mnt_devname.string_().decode('utf-8', errors='replace') if mnt.mnt_devname else "?"
                # Get mount point
                mnt_mountpoint = mnt.mnt_mountpoint
                if mnt_mountpoint:
                    name = mnt_mountpoint.d_name.name.string_().decode('utf-8', errors='replace')
                else:
                    name = "?"
                mounts.append({
                    'mnt_ns': mnt_ns_addr,
                    'devname': devname,
                    'mountpoint': name,
                    'mnt_addr': mnt.value_()
                })
            except:
                continue
    except:
        pass
    return mounts


def main():
    # Get the tmpfs/shmem filesystem type
    shmem_fs_type = prog['shmem_fs_type']

    print("=" * 80)
    print("tmpfs Superblock Monitor")
    print("=" * 80)
    print()

    # Collect all tmpfs superblocks
    tmpfs_sbs = []
    for sb in hlist_for_each_entry('struct super_block', shmem_fs_type.fs_supers, 's_instances'):
        tmpfs_sbs.append(sb)

    print(f"Found {len(tmpfs_sbs)} tmpfs superblocks\n")

    orphaned = []
    normal = []

    for sb in tmpfs_sbs:
        sb_addr = sb.value_()
        s_dev = sb.s_dev.value_()
        s_active = sb.s_active.counter.value_()
        mounts = get_mount_info(sb)

        # Check if any mount has NULL namespace
        has_orphaned = any(m['mnt_ns'] == 0 for m in mounts)
        has_normal = any(m['mnt_ns'] != 0 for m in mounts)

        info = {
            'sb': sb,
            'sb_addr': sb_addr,
            's_dev': s_dev,
            's_active': s_active,
            'mounts': mounts,
            'has_orphaned': has_orphaned
        }

        if has_orphaned and not has_normal:
            orphaned.append(info)
        else:
            normal.append(info)

    # Print orphaned superblocks first (these are the leaked ones)
    if orphaned:
        print("!!! ORPHANED tmpfs SUPERBLOCKS (potential leaks) !!!")
        print("-" * 80)
        for info in orphaned:
            sb = info['sb']
            print(f"Superblock: 0x{info['sb_addr']:016x}")
            print(f"  s_dev: {info['s_dev']}")
            print(f"  s_active: {info['s_active']}")

            # Try to list some files
            try:
                root = sb.s_root
                if root:
                    print("  Root dentry contents:")
                    count = 0
                    for child in hlist_for_each_entry('struct dentry', root.d_children, 'd_sib'):
                        name = child.d_name.name.string_().decode('utf-8', errors='replace')
                        inode = child.d_inode
                        if inode:
                            size = inode.i_size.value_()
                            print(f"    - {name} ({size} bytes)")
                        count += 1
                        if count >= 10:
                            print("    ... (more files)")
                            break
            except Exception as e:
                print(f"  Error listing files: {e}")

            for m in info['mounts']:
                print(f"  Mount: {m['devname']} -> {m['mountpoint']}")
                print(f"    mnt_ns: 0x{m['mnt_ns']:016x} (NULL = orphaned)")
            print()
        print()

    # Print summary of normal superblocks
    print("Normal tmpfs superblocks:")
    print("-" * 80)
    for info in normal:
        devnames = set(m['devname'] for m in info['mounts'])
        namespaces = set(m['mnt_ns'] for m in info['mounts'])
        print(f"0x{info['sb_addr']:016x} s_dev={info['s_dev']:<4} "
              f"active={info['s_active']:<3} "
              f"mounts={len(info['mounts']):<3} "
              f"devnames={devnames}")

    print()
    print("=" * 80)
    print(f"Summary: {len(orphaned)} orphaned, {len(normal)} normal")
    if orphaned:
        print("WARNING: Orphaned tmpfs superblocks detected - these may be leaking memory!")
    print("=" * 80)


if __name__ == "__main__":
    main()
^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Orphan filesystems after mount namespace destruction and tmpfs "leak"
2026-02-02 17:50 Orphan filesystems after mount namespace destruction and tmpfs "leak" Kiryl Shutsemau
@ 2026-02-02 18:43 ` Al Viro
2026-02-02 19:43 ` Kiryl Shutsemau
2026-02-02 20:03 ` Askar Safin
2026-02-03 14:58 ` Christian Brauner
2 siblings, 1 reply; 6+ messages in thread
From: Al Viro @ 2026-02-02 18:43 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Christian Brauner, Jan Kara, Hugh Dickins, Baolin Wang, linux-mm,
linux-fsdevel, linux-kernel
On Mon, Feb 02, 2026 at 05:50:30PM +0000, Kiryl Shutsemau wrote:
> In the Meta fleet, we saw a problem where destroying a container didn't
> lead to freeing the shmem memory attributed to a tmpfs mounted inside
> that container. It triggered an OOM when a new container attempted to
> start.
>
> Investigation has shown that this happened because a process outside of
> the container kept a file from the tmpfs mapped. The mapped file is
> small (4k), but it holds all the contents of the tmpfs (~47GiB) from
> being freed.
>
> When a tmpfs filesystem is mounted inside a mount namespace (e.g., a
> container), and a process outside that namespace holds an open file
> descriptor to a file on that tmpfs, the tmpfs superblock remains in
> kernel memory indefinitely after:
>
> 1. All processes inside the mount namespace have exited.
> 2. The mount namespace has been destroyed.
> 3. The tmpfs is no longer visible in any mount namespace.
Yes? That's precisely what should happen as long as something's opened
on a filesystem.
> The superblock persists with mnt_ns = NULL in its mount structures,
> keeping all tmpfs contents pinned in memory until the external file
> descriptor is closed.
Yes.
> The problem is not specific to tmpfs, but for filesystems with backing
> storage, the memory impact is not as severe since the page cache is
> reclaimable.
>
> The obvious solution to the problem is "Don't do that": the file should
> be unmapped/closed upon container destruction.
Or remove the junk there from time to time, if you don't want it to stay
until the filesystem shutdown...
> But I wonder if the kernel can/should do better here? Currently, this
> scenario is hard to diagnose. It looks like a leak of shmem pages.
>
> Also, I wonder if the current behavior can lead to data loss on a
> filesystem with backing storage:
> - The mount namespace where my USB stick was mounted is gone.
> - The USB stick is no longer mounted anywhere.
> - I can pull the USB stick out.
> - Oops, someone was writing there: corruption/data loss.
>
> I am not sure what a possible solution would be here. I can only think
> of blocking exit(2) for the last process in the namespace until all
> filesystems are cleanly unmounted, but that is not very informative
> either.
That's insane - if nothing else, the process that holds the sucker
opened may very well be waiting for the one you've blocked.
You are getting exactly what you asked for - same as you would on
lazy umount, for that matter.
Filesystem may be active without being attached to any namespace;
it's an intentional behaviour. What's more, it _is_ visible to
ustat(2), as well as lsof(1) and similar userland tools in case
of opened file keeping it busy.
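For reference, a rough userspace check in that spirit: compare the st_dev of
every open fd and mapped file under /proc against the devices listed in the
caller's /proc/self/mountinfo. This is a sketch, not an lsof replacement: it
must run as root, and it only knows about the caller's mount namespace, so
mounts that exist only in other namespaces show up as false positives.

#!/usr/bin/env python3
"""Sketch: find processes referencing files on filesystems that are no
longer mounted in this mount namespace."""
import glob
import os


def mounted_devs():
    """Device numbers of everything mounted in the caller's mount namespace."""
    devs = set()
    with open("/proc/self/mountinfo") as f:
        for line in f:
            major, minor = line.split()[2].split(":")   # field 3 is major:minor
            devs.add(os.makedev(int(major), int(minor)))
    return devs


def main():
    devs = mounted_devs()
    # Both open fds and mapped files keep a superblock alive, so check both.
    for link in glob.glob("/proc/[0-9]*/fd/*") + glob.glob("/proc/[0-9]*/map_files/*"):
        try:
            st = os.stat(link)          # stats the referenced file itself
            target = os.readlink(link)
        except OSError:
            continue                    # process or fd went away, or no access
        if st.st_dev not in devs:
            pid = link.split("/")[2]
            print(f"pid {pid}: reference to unmounted fs "
                  f"{os.major(st.st_dev)}:{os.minor(st.st_dev)} -> {target}")


if __name__ == "__main__":
    main()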
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Orphan filesystems after mount namespace destruction and tmpfs "leak"
2026-02-02 18:43 ` Al Viro
@ 2026-02-02 19:43 ` Kiryl Shutsemau
0 siblings, 0 replies; 6+ messages in thread
From: Kiryl Shutsemau @ 2026-02-02 19:43 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jan Kara, Hugh Dickins, Baolin Wang, linux-mm,
linux-fsdevel, linux-kernel
On Mon, Feb 02, 2026 at 06:43:56PM +0000, Al Viro wrote:
> > I am not sure what a possible solution would be here. I can only think
> > of blocking exit(2) for the last process in the namespace until all
> > filesystems are cleanly unmounted, but that is not very informative
> > either.
>
> That's insane - if nothing else, the process that holds the sucker
> opened may very well be waiting for the one you've blocked.
Good point. I obviously don't understand the lifecycle here.
> You are getting exactly what you asked for - same as you would on
> lazy umount, for that matter.
>
> Filesystem may be active without being attached to any namespace;
> it's an intentional behaviour. What's more, it _is_ visible to
> ustat(2), as well as lsof(1) and similar userland tools in case
> of opened file keeping it busy.
I can only see the opened file, not the rest of the filesystem, right?
Do you see the USB stick scenario as problematic? It is weird to me that
umount would fail with -EBUSY here, but killing the whole namespace is fine.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Orphan filesystems after mount namespace destruction and tmpfs "leak"
2026-02-02 17:50 Orphan filesystems after mount namespace destruction and tmpfs "leak" Kiryl Shutsemau
2026-02-02 18:43 ` Al Viro
@ 2026-02-02 20:03 ` Askar Safin
2026-02-03 14:58 ` Christian Brauner
2 siblings, 0 replies; 6+ messages in thread
From: Askar Safin @ 2026-02-02 20:03 UTC (permalink / raw)
To: kas
Cc: baolin.wang, brauner, hughd, jack, linux-fsdevel, linux-kernel,
linux-mm, viro
Kiryl Shutsemau <kas@kernel.org>:
> Hi,
I think I know how to fix the problem.
The problem is in your container manager. It seems that the container
manager doesn't unmount filesystems, or unmounts them with MNT_DETACH
(i.e. lazily).
What you should do is make your container manager actually iterate over
all filesystems and unmount them without MNT_DETACH.
Of course, this will not work in your scenario: the umount call will
fail. But at the very least you will get a failing syscall, i.e. a
visible failure instead of a silent leak.
You may go further: if umount on a tmpfs fails, then simply go and
remove all files in that tmpfs, and then unmount it using MNT_DETACH.
In fact, this will not remove every file; it will remove all the files
it can remove, which is exactly what you need: it frees the whole
47 GiB of data and keeps just the 4k that is actually busy.
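For illustration, a rough sketch of that teardown order for one mountpoint.
The path is hypothetical, it needs root and a glibc system, and it has to run
while the container's mounts are still reachable (e.g. from inside the
container's mount namespace, before it is torn down):

#!/usr/bin/env python3
"""Sketch: clean umount first; on EBUSY delete what we can, then detach."""
import ctypes
import errno
import os
import shutil

libc = ctypes.CDLL("libc.so.6", use_errno=True)
MNT_DETACH = 2


def teardown(mountpoint):
    # First try a clean, non-lazy umount so failures are visible.
    if libc.umount2(mountpoint.encode(), 0) == 0:
        return
    err = ctypes.get_errno()
    if err != errno.EBUSY:
        raise OSError(err, os.strerror(err), mountpoint)
    # Busy: delete everything that can be deleted, so only the files that
    # are actually pinned keep their memory, then lazily detach the mount.
    for entry in os.scandir(mountpoint):
        try:
            if entry.is_dir(follow_symlinks=False):
                shutil.rmtree(entry.path, ignore_errors=True)
            else:
                os.unlink(entry.path)
        except OSError:
            pass
    if libc.umount2(mountpoint.encode(), MNT_DETACH) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), mountpoint)


if __name__ == "__main__":
    teardown("/run/containers/demo/rootfs/dev/shm")   # hypothetical path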
Feel free to ask me any questions.
--
Askar Safin
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Orphan filesystems after mount namespace destruction and tmpfs "leak"
2026-02-02 17:50 Orphan filesystems after mount namespace destruction and tmpfs "leak" Kiryl Shutsemau
2026-02-02 18:43 ` Al Viro
2026-02-02 20:03 ` Askar Safin
@ 2026-02-03 14:58 ` Christian Brauner
2026-02-04 17:04 ` Theodore Tso
2 siblings, 1 reply; 6+ messages in thread
From: Christian Brauner @ 2026-02-03 14:58 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Alexander Viro, Jan Kara, Hugh Dickins, Baolin Wang, linux-mm,
linux-fsdevel, linux-kernel
On Mon, Feb 02, 2026 at 05:50:30PM +0000, Kiryl Shutsemau wrote:
> Hi,
>
> In the Meta fleet, we saw a problem where destroying a container didn't
> lead to freeing the shmem memory attributed to a tmpfs mounted inside
> that container. It triggered an OOM when a new container attempted to
> start.
>
> Investigation has shown that this happened because a process outside of
> the container kept a file from the tmpfs mapped. The mapped file is
> small (4k), but it holds all the contents of the tmpfs (~47GiB) from
> being freed.
>
> When a tmpfs filesystem is mounted inside a mount namespace (e.g., a
> container), and a process outside that namespace holds an open file
> descriptor to a file on that tmpfs, the tmpfs superblock remains in
> kernel memory indefinitely after:
>
> 1. All processes inside the mount namespace have exited.
> 2. The mount namespace has been destroyed.
> 3. The tmpfs is no longer visible in any mount namespace.
>
> The superblock persists with mnt_ns = NULL in its mount structures,
> keeping all tmpfs contents pinned in memory until the external file
> descriptor is closed.
>
> The problem is not specific to tmpfs, but for filesystems with backing
> storage, the memory impact is not as severe since the page cache is
> reclaimable.
>
> The obvious solution to the problem is "Don't do that": the file should
> be unmapped/closed upon container destruction.
>
> But I wonder if the kernel can/should do better here? Currently, this
> scenario is hard to diagnose. It looks like a leak of shmem pages.
>
> Also, I wonder if the current behavior can lead to data loss on a
> filesystem with backing storage:
> - The mount namespace where my USB stick was mounted is gone.
> - The USB stick is no longer mounted anywhere.
> - I can pull the USB stick out.
> - Oops, someone was writing there: corruption/data loss.
If the USB stick is yanked and the filesystem uses fs_holder_ops, it will
be notified about the sudden device removal and can decide to handle it as
it sees fit. That works for all devices, including log devices or rt
devices or what have you. Usually it will shut the filesystem down and
return EIO to userspace. I switched all major filesystems over to this
model a few kernel releases ago.
> I am not sure what a possible solution would be here. I can only think
None from the kernel's perspective. It's intended semantics that
userspace relies upon (for example, if you have an fdstore then you very
much want that persistence).
We could zap all files and make the fds catatonic. But that's just insane
from the start. It would be a drastic deviation from basic semantics
we've had since forever. It would also be extremely difficult to get
right and performant, because you'd need algorithms to find all of them.
Keeping a global map of all files open on a filesystem instance is just
insane, and so is walking all processes and inspecting their fdtables.
I don't believe we need to do anything here unless you want some
tmpfs-specific black magic where you can issue a shutdown ioctl on tmpfs
that magically frees memory. And I'd still expect that this would fsck
over userspace that doesn't expect this behavior.
> of blocking exit(2) for the last process in the namespace until all
> filesystems are cleanly unmounted, but that is not very informative
> either.
If you just wanted the ability to wait for a filesystem to have gone
away completely, that's an easier thing to achieve: inotify has support
for this via IN_UNMOUNT. But, as you say, I doubt that's actually helpful.
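For completeness, a sketch of what waiting on IN_UNMOUNT looks like. The
watched path is hypothetical; IN_UNMOUNT is an output-only bit, so the watch
asks for an ordinary event and the kernel adds IN_UNMOUNT when the filesystem
finally goes away:

#!/usr/bin/env python3
"""Sketch: block until the filesystem under the watched path is torn down."""
import ctypes
import os
import struct

libc = ctypes.CDLL("libc.so.6", use_errno=True)
IN_ATTRIB = 0x00000004
IN_UNMOUNT = 0x00002000       # set by the kernel when the watched fs goes away

ifd = libc.inotify_init()
if ifd < 0:
    raise OSError(ctypes.get_errno(), "inotify_init failed")
if libc.inotify_add_watch(ifd, b"/mnt/container-tmpfs", IN_ATTRIB) < 0:
    raise OSError(ctypes.get_errno(), "inotify_add_watch failed")

HDR = "iIII"                  # struct inotify_event: wd, mask, cookie, len
while True:
    data = os.read(ifd, 4096)
    off = 0
    while off < len(data):
        wd, mask, cookie, namelen = struct.unpack_from(HDR, data, off)
        if mask & IN_UNMOUNT:
            print("filesystem is gone for good")
            raise SystemExit(0)
        off += struct.calcsize(HDR) + namelen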
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Orphan filesystems after mount namespace destruction and tmpfs "leak"
2026-02-03 14:58 ` Christian Brauner
@ 2026-02-04 17:04 ` Theodore Tso
0 siblings, 0 replies; 6+ messages in thread
From: Theodore Tso @ 2026-02-04 17:04 UTC (permalink / raw)
To: Christian Brauner
Cc: Kiryl Shutsemau, Alexander Viro, Jan Kara, Hugh Dickins,
Baolin Wang, linux-mm, linux-fsdevel, linux-kernel
On Tue, Feb 03, 2026 at 03:58:52PM +0100, Christian Brauner wrote:
> I don't believe we need to do anything here unless you want some
> tmpfs-specific black magic where you can issue a shutdown ioctl on tmpfs
> that magically frees memory. And I'd still expect that this would fsck
> over userspace that doesn't expect this behavior.
I think if we were going to do anything like this, adding FS_IOC_SHUTDOWN
support to tmpfs is the only way we could go. Yeah, it will fsck over
userspace that's not expecting it, but normally, if you're tearing down a
file system, whether it's a read-only iSCSI device that provides a
software package that needs to go away because the iSCSI target has gone
away, or zapping a tmpfs file system, killing the userspace which depends
on it with extreme prejudice *is* actually the right thing. We use
FS_IOC_SHUTDOWN on an ext4 file system that is being served via iSCSI,
and when that happens, killing the container and the userspace processes
running in it as quickly as possible without harming other containers is
the goal.
It might fsck over userspace, granted, but so does "kill -9". :-)
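For reference, issuing FS_IOC_SHUTDOWN from Python on a filesystem that
already implements it (ext4/XFS; tmpfs does not today, which is the point
above). The mountpoint is hypothetical, the ioctl number is derived from
_IOR('X', 125, __u32) in linux/fs.h, and the call needs CAP_SYS_ADMIN and
is destructive:

#!/usr/bin/env python3
"""Sketch: shut down a mounted filesystem in place via FS_IOC_SHUTDOWN."""
import fcntl
import os
import struct

FS_IOC_SHUTDOWN = 0x8004587D          # _IOR('X', 125, __u32)
FS_GOING_DOWN_FULLSYNC = 0            # flush everything, then shut down

fd = os.open("/mnt/ext4", os.O_RDONLY)
fcntl.ioctl(fd, FS_IOC_SHUTDOWN, struct.pack("I", FS_GOING_DOWN_FULLSYNC))
os.close(fd)
# From this point on, every access to the filesystem returns EIO.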
- Ted
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-02-04 17:05 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-02 17:50 Orphan filesystems after mount namespace destruction and tmpfs "leak" Kiryl Shutsemau
2026-02-02 18:43 ` Al Viro
2026-02-02 19:43 ` Kiryl Shutsemau
2026-02-02 20:03 ` Askar Safin
2026-02-03 14:58 ` Christian Brauner
2026-02-04 17:04 ` Theodore Tso