From: Kiryl Shutsemau <kas@kernel.org>
To: Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Hugh Dickins <hughd@google.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Orphan filesystems after mount namespace destruction and tmpfs "leak"
Date: Mon, 2 Feb 2026 17:50:30 +0000
Message-ID: <aYDjHJstnz2V-ZZg@thinkstation>
[-- Attachment #1: Type: text/plain, Size: 2080 bytes --]
Hi,
In the Meta fleet, we saw a problem where destroying a container didn't
lead to freeing the shmem memory attributed to a tmpfs mounted inside
that container. It triggered an OOM when a new container attempted to
start.
Investigation has shown that this happened because a process outside of
the container kept a file from the tmpfs mapped. The mapped file is
small (4 KiB), but it keeps the entire contents of the tmpfs (~47GiB)
from being freed.
When a tmpfs filesystem is mounted inside a mount namespace (e.g., a
container), and a process outside that namespace holds a reference to a
file on that tmpfs (an open file descriptor or, as in our case, a
mapping), the tmpfs superblock remains in kernel memory indefinitely
after:
1. All processes inside the mount namespace have exited.
2. The mount namespace has been destroyed.
3. The tmpfs is no longer visible in any mount namespace.
The superblock persists with mnt_ns = NULL in its mount structures,
keeping all tmpfs contents pinned in memory until the external file
descriptor is closed.
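To make the pinning chain above concrete, here is a deliberately
simplified toy model in Python. It is not kernel code: the class names
and the two reference counts are assumptions chosen for illustration
(an open file pins a mount, a mount pins its superblock), and real
mount and superblock lifetimes involve considerably more state.

```python
class Superblock:
    """Toy superblock: contents are freed when the last mount goes away."""

    def __init__(self, name, size_gib):
        self.name = name
        self.size_gib = size_gib
        self.active = 0
        self.freed = False

    def get(self):
        self.active += 1

    def put(self):
        self.active -= 1
        if self.active == 0:
            self.freed = True  # only now is the tmpfs memory released


class Mount:
    """Toy mount: holds one reference on its superblock while it lives."""

    def __init__(self, sb, mnt_ns):
        self.sb = sb
        self.mnt_ns = mnt_ns
        self.refs = 1  # reference owned by the namespace's mount tree
        sb.get()

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.sb.put()


class OpenFile:
    """Toy open file (or mapping): pins the mount it was opened through."""

    def __init__(self, mnt):
        self.mnt = mnt
        self.closed = False
        mnt.get()

    def close(self):
        if not self.closed:
            self.closed = True
            self.mnt.put()


def destroy_namespace(mounts):
    """Detach every mount from its namespace and drop the tree's reference."""
    for mnt in mounts:
        mnt.mnt_ns = None
        mnt.put()


sb = Superblock("container-tmpfs", size_gib=47)
mnt = Mount(sb, mnt_ns="container-ns")
held = OpenFile(mnt)           # process outside the container
destroy_namespace([mnt])       # container is gone
print(mnt.mnt_ns, sb.freed)    # None False
held.close()                   # outside process lets go
print(sb.freed)                # True
```

In this model the superblock outlives the namespace for exactly as long
as the outside reference exists, which matches the mnt_ns = NULL state
described above.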
The problem is not specific to tmpfs, but for filesystems with backing
storage, the memory impact is not as severe since the page cache is
reclaimable.
The obvious solution to the problem is "Don't do that": the file should
be unmapped/closed upon container destruction.
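As a rough illustration of "don't do that", the sketch below shows a
holder that releases both its mapping and its descriptor in one call.
The MappedFile class is hypothetical, and how the outside process
learns that the container is being torn down is left open; the only
point is that the mapping and the descriptor each hold their own
reference, so both have to go.

```python
import mmap
import os
import tempfile


class MappedFile:
    """Hypothetical holder of a read-only mapping of a container-owned file."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_RDONLY)
        try:
            self._map = mmap.mmap(self._fd, 0, access=mmap.ACCESS_READ)
        except Exception:
            os.close(self._fd)
            self._fd = None
            raise

    @property
    def released(self):
        return self._fd is None

    def read(self, n):
        """Return the first n bytes of the mapping."""
        return self._map[:n]

    def release(self):
        """Drop the mapping and the descriptor; safe to call twice."""
        if self._fd is None:
            return
        self._map.close()   # unmaps the file
        os.close(self._fd)  # closes the descriptor
        self._fd = None


# Stand-in for a small file living on the container's tmpfs.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 4096)
    tmp.flush()
    held = MappedFile(tmp.name)
    first = held.read(4)
    held.release()          # call this from the container teardown path
```

Closing only the descriptor would not be enough here, because the
mapping keeps the file referenced until it is unmapped.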
But I wonder if the kernel can/should do better here? Currently, this
scenario is hard to diagnose. It looks like a leak of shmem pages.
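One way to narrow this down from userspace, sketched here under some
assumptions, is to compare the device numbers in /proc/<pid>/maps
against those listed in /proc/self/mountinfo and report file mappings
whose device is not mounted anywhere in the current namespace. The
function names are made up for this example. Expect false positives:
memfd, SysV shm and shared anonymous mappings live on a kernel-internal
shmem mount, filesystems mounted only in other live namespaces will
also show up, and some filesystems may report device numbers that
differ from mountinfo.

```python
import os


def visible_devs(mountinfo_text):
    """Return the (major, minor) pairs mounted in this mount namespace.

    The third field of each mountinfo line is "major:minor" in decimal.
    """
    devs = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        major, minor = fields[2].split(":")
        devs.add((int(major), int(minor)))
    return devs


def hidden_mappings(maps_text, visible):
    """Yield (dev, inode, path) for file mappings on devices not in `visible`.

    The fourth field of each maps line is "major:minor" in hexadecimal.
    """
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        major, minor = (int(part, 16) for part in fields[3].split(":"))
        inode = int(fields[4])
        if inode == 0:  # anonymous mapping, [heap], [stack], ...
            continue
        if (major, minor) not in visible:
            path = fields[5].rstrip() if len(fields) > 5 else ""
            yield (major, minor), inode, path


def scan():
    """Print suspicious mappings for every process we are allowed to read."""
    with open("/proc/self/mountinfo") as f:
        visible = visible_devs(f.read())
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/maps") as f:
                maps_text = f.read()
        except OSError:  # process exited or permission denied
            continue
        for dev, inode, path in hidden_mappings(maps_text, visible):
            print(f"pid {pid}: dev {dev[0]}:{dev[1]} inode {inode} {path}")
```

Running scan() as root covers all processes; a mapping-only scan like
this misses plain open descriptors, which would need a similar walk
over /proc/<pid>/fd.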
Also, I wonder if the current behavior can lead to data loss on a
filesystem with backing storage:
- The mount namespace where my USB stick was mounted is gone.
- The USB stick is no longer mounted anywhere.
- I can pull the USB stick out.
- Oops, someone was writing there: corruption/data loss.
I am not sure what a possible solution would be here. I can only think
of blocking exit(2) for the last process in the namespace until all
filesystems are cleanly unmounted, but that is not very informative
either.
I have attached a Claude-generated reproducer and a drgn script that
lists orphan tmpfs filesystems.
--
Kiryl Shutsemau / Kirill A. Shutemov
[-- Attachment #2: tmpfs_leak_reproducer.sh --]
[-- Type: application/x-sh, Size: 4656 bytes --]
[-- Attachment #3: monitor_tmpfs.py --]
[-- Type: text/plain, Size: 4804 bytes --]
#!/usr/bin/env drgn
"""
Monitor tmpfs superblocks to detect orphaned/leaked tmpfs filesystems.
Run with: sudo drgn -s /path/to/vmlinux monitor_tmpfs.py
This script lists all tmpfs superblocks and shows which ones have
NULL mount namespaces (orphaned) or are otherwise in unusual states.
"""
from drgn.helpers.linux.list import hlist_for_each_entry, list_for_each_entry


def get_mount_info(sb):
    """Get mount information for a superblock"""
    mounts = []
    try:
        # Walk every struct mount attached to this superblock
        for mnt in list_for_each_entry('struct mount', sb.s_mounts.address_of_(), 'mnt_instance'):
            try:
                mnt_ns = mnt.mnt_ns
                mnt_ns_addr = mnt_ns.value_() if mnt_ns else 0
                # Get device name
                devname = mnt.mnt_devname.string_().decode('utf-8', errors='replace') if mnt.mnt_devname else "?"
                # Get mount point
                mnt_mountpoint = mnt.mnt_mountpoint
                if mnt_mountpoint:
                    name = mnt_mountpoint.d_name.name.string_().decode('utf-8', errors='replace')
                else:
                    name = "?"
                mounts.append({
                    'mnt_ns': mnt_ns_addr,
                    'devname': devname,
                    'mountpoint': name,
                    'mnt_addr': mnt.value_()
                })
            except Exception:
                # Skip mounts whose fields cannot be read
                continue
    except Exception:
        # Return whatever was collected if the list walk itself fails
        pass
    return mounts


def main():
    # Get the tmpfs/shmem filesystem type
    shmem_fs_type = prog['shmem_fs_type']
    print("=" * 80)
    print("tmpfs Superblock Monitor")
    print("=" * 80)
    print()
    # Collect all tmpfs superblocks
    tmpfs_sbs = []
    for sb in hlist_for_each_entry('struct super_block', shmem_fs_type.fs_supers, 's_instances'):
        tmpfs_sbs.append(sb)
    print(f"Found {len(tmpfs_sbs)} tmpfs superblocks\n")
    orphaned = []
    normal = []
    for sb in tmpfs_sbs:
        sb_addr = sb.value_()
        s_dev = sb.s_dev.value_()
        s_active = sb.s_active.counter.value_()
        mounts = get_mount_info(sb)
        # Check if any mount has NULL namespace
        has_orphaned = any(m['mnt_ns'] == 0 for m in mounts)
        has_normal = any(m['mnt_ns'] != 0 for m in mounts)
        info = {
            'sb': sb,
            'sb_addr': sb_addr,
            's_dev': s_dev,
            's_active': s_active,
            'mounts': mounts,
            'has_orphaned': has_orphaned
        }
        if has_orphaned and not has_normal:
            orphaned.append(info)
        else:
            normal.append(info)
    # Print orphaned superblocks first (these are the leaked ones)
    if orphaned:
        print("!!! ORPHANED tmpfs SUPERBLOCKS (potential leaks) !!!")
        print("-" * 80)
        for info in orphaned:
            sb = info['sb']
            print(f"Superblock: 0x{info['sb_addr']:016x}")
            print(f" s_dev: {info['s_dev']}")
            print(f" s_active: {info['s_active']}")
            # Try to list some files
            try:
                root = sb.s_root
                if root:
                    print(" Root dentry contents:")
                    count = 0
                    for child in hlist_for_each_entry('struct dentry', root.d_children, 'd_sib'):
                        name = child.d_name.name.string_().decode('utf-8', errors='replace')
                        inode = child.d_inode
                        if inode:
                            size = inode.i_size.value_()
                            print(f" - {name} ({size} bytes)")
                        count += 1
                        if count >= 10:
                            print(" ... (more files)")
                            break
            except Exception as e:
                print(f" Error listing files: {e}")
            for m in info['mounts']:
                print(f" Mount: {m['devname']} -> {m['mountpoint']}")
                print(f" mnt_ns: 0x{m['mnt_ns']:016x} (NULL = orphaned)")
            print()
        print()
    # Print summary of normal superblocks
    print("Normal tmpfs superblocks:")
    print("-" * 80)
    for info in normal:
        devnames = set(m['devname'] for m in info['mounts'])
        namespaces = set(m['mnt_ns'] for m in info['mounts'])
        print(f"0x{info['sb_addr']:016x} s_dev={info['s_dev']:<4} "
              f"active={info['s_active']:<3} "
              f"mounts={len(info['mounts']):<3} "
              f"devnames={devnames}")
    print()
    print("=" * 80)
    print(f"Summary: {len(orphaned)} orphaned, {len(normal)} normal")
    if orphaned:
        print("WARNING: Orphaned tmpfs superblocks detected - these may be leaking memory!")
    print("=" * 80)


if __name__ == "__main__":
    main()