linux-mm.kvack.org archive mirror
From: Felix Abecassis <fabecassis@nvidia.com>
To: Pedro Falcato <pfalcato@suse.de>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	Zi Yan <ziy@nvidia.com>, John Hubbard <jhubbard@nvidia.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>
Subject: Re: OOM kill of privileged processes when exhausting a single NUMA node
Date: Thu, 26 Jun 2025 20:15:16 -0700
Message-ID: <aF4MxLlwvdf2jHDX@DESKTOP-SESBGTA.localdomain>
In-Reply-To: <btenbxjm4eheurdl2oexxy4h5diphmfy5cugiscfv6nljqhfki@2xxhbwtfslsj>

On Fri, Jun 27, 2025 at 12:21:57AM +0100, Pedro Falcato wrote:
> On Thu, Jun 26, 2025 at 10:27:36PM +0000, Felix Abecassis wrote:
> > Hello linux-mm team,
> >
> > I have found an interesting behavior in the Linux kernel: an unprivileged user
> > with access to user namespaces can cause privileged processes to be killed due
> > to an OOM situation on a single NUMA node, even if the system has plenty of
> > memory available on other NUMA nodes.
> >
> > This might lead to a local denial of service in some situations, so please
> > review and let me know if the current behavior is expected.
> >
> > The steps are simple:
> > 1. Use a Linux system with multiple NUMA nodes
> > 2. Enable unprivileged user namespaces (often distro dependent)
> > 3. As an unprivileged user, create a user namespace + mount namespace
> >    and mount a tmpfs bound to NUMA node 1
> > 4. Attempt to fill the tmpfs with more data than it can possibly store
> > 5. The OOM killer will kill a significant number of system daemons
> >    (UID 0).
> >
> 
> I agree that this is somewhat unintended tmpfs behavior, but you can
> (probably) pull this off in other ways:
> 
> - use set_mempolicy()/mbind() to bind to a NUMA node and use a big mmap() mapping
> - just use a lot of memory
> 
> and it's not limited to NUMA either.
> 
> AFAIK user namespaces aren't really isolating in the sense that you need a
> cgroup on top to further control software you don't trust (or want to limit
> for other reasons)
> 

Yes, but inside a user namespace you can mount a tmpfs bound to a single NUMA
node, and that is the key to triggering the bug I described. You would not
have the same problem when trying to fill /tmp (assuming it is a tmpfs):
it is not bound to a single NUMA node, so you would be limited by the memory
cgroup. In addition, a tmpfs defaults to size=50% of RAM.
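
To check whether a given tmpfs carries a NUMA policy, something like the
sketch below may help. The function names are made up for illustration, and it
assumes that /proc/mounts reports the mpol= option for tmpfs mounts in the
current mount namespace, which I have not verified on every kernel.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.

# Print the mpol= value from a comma-separated mount option string,
# or nothing if the string has no memory policy option.
mpol_of() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^mpol=//p'
}

# Print "<mountpoint> <policy>" for every tmpfs that has an mpol= option.
# Reads a mounts table in /proc/mounts format (default: /proc/mounts).
list_tmpfs_policies() {
    while read -r _dev target fstype opts _rest; do
        [ "$fstype" = tmpfs ] || continue
        policy=$(mpol_of "$opts")
        [ -n "$policy" ] && echo "$target $policy"
    done < "${1:-/proc/mounts}"
    return 0
}
```

Running list_tmpfs_policies inside the unshare'd shell from my reproducer should
list /dev/shm with bind:1 if the assumption above holds, while a default /tmp
would not appear at all.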

> 
> And in this case the particular problem is that tmpfs really can't track
> what process "owns" a file, even if O_TMPFILE was specified. So you can quite
> trivially run out of memory in a regular Linux distro by filling up the /tmp
> (if tmpfs, of course), if you have write perms for /tmp, which by default you do.
> 
> 
> The only alarming bit (to me) is that cgroups don't work in this case as well.
> The most ad hoc solution I have would be to possibly limit the tmpfs size to
> memory.max. Adding the memcg folks for more comments.
>

The cgroup memory limit cannot help here because we never get close to it. The
process is limited to 2GB and attempts to write ~1.5GB to the tmpfs. After
writing about 1GB (the size of NUMA node 1), it triggers the OOM kill of
privileged processes, even though the system still has more than 7GB of free
memory on NUMA node 0.
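
As a back-of-the-envelope check of those numbers, here is a rough sketch. The
helper names are hypothetical, and the model is deliberately simplistic: it
ignores memory already in use on the node and other charges to the cgroup.
The inputs are the dd parameters (bs=64K count=25000), the 1024MB node and the
2G memory.max from my original report.

```shell
#!/bin/sh
# All sizes are in MiB. Helper names are hypothetical, for illustration only.

# MiB written by "dd bs=<$1>K count=<$2>" (integer division, rounds down).
dd_write_mib() {
    echo $(( $1 * $2 / 1024 ))
}

# Which limit a write of $1 MiB reaches first, given a NUMA node of $2 MiB
# and a memory cgroup limit of $3 MiB. Prints "node", "memcg" or "none".
first_limit_hit() {
    write=$1 node=$2 memcg=$3
    if [ "$node" -le "$memcg" ]; then
        limit=$node name=node
    else
        limit=$memcg name=memcg
    fi
    if [ "$write" -gt "$limit" ]; then echo "$name"; else echo none; fi
}

write=$(dd_write_mib 64 25000)
echo "write=${write} MiB, first limit hit: $(first_limit_hit "$write" 1024 2048)"
# prints: write=1562 MiB, first limit hit: node
```

In other words, with a tmpfs bound to node 1 the node runs out well before
memory.max is reached, so the memcg never gets a chance to intervene.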

> --
> Pedro
> 
> > The possible mitigations I currently know of are: create a swap space, disable
> > unprivileged user namespaces, or set sysctl vm.oom_kill_allocating_task=1.
> >
> > To be 100% clear, this does not require elevated privileges, and we are only
> > using a fraction of the total system memory.
> >
> > Below is an example on an Ubuntu 25.04 VM under qemu where I hotplugged a new
> > NUMA node with 1GB of memory. I also placed the current process under a 2GB
> > memory cgroup to show that it's not an effective mitigation.
> >
> > $ uname -a
> > Linux ubuntu 6.14.0-22-generic #22-Ubuntu SMP PREEMPT_DYNAMIC Wed May 21 15:01:51 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
> >
> > $ id -u
> > 1000
> >
> > # Enable unprivileged user namespaces (this is an Ubuntu feature)
> > $ sudo sysctl kernel.apparmor_restrict_unprivileged_userns=0
> >
> > $ sudo sh -c 'echo 2G > /sys/fs/cgroup/user.slice/user-1000.slice/memory.max'
> >
> > $ numastat -mzc
> >
> > Per-node system memory usage (in MBs):
> > Token Unaccepted not in hash table.
> > Token Unaccepted not in hash table.
> >                  Node 0 Node 1 Total
> >                  ------ ------ -----
> > MemTotal           7940   1024  8964
> > MemFree            7533   1024  8557
> > MemUsed             407      0   407
> > Active              176      0   176
> > Inactive             44      0    44
> > Active(anon)         42      0    42
> > Active(file)        134      0   134
> > Inactive(file)       44      0    44
> > Unevictable          26      0    26
> > Mlocked              26      0    26
> > Dirty                 0      0     0
> > FilePages           186      0   186
> > Mapped               57      0    57
> > AnonPages            59      0    59
> > Shmem                 1      0     1
> > KernelStack           2      0     2
> > PageTables            2      0     2
> > Slab                 84      0    84
> > SReclaimable         17      0    17
> > SUnreclaim           68      0    68
> > KReclaimable         17      0    17
> >
> > $ unshare -U -r -m sh -xc 'mount -t tmpfs -o mpol=bind:1 tmpfs /dev/shm ; dd if=/dev/zero of=/dev/shm/file bs=64K count=25000'
> > + mount -t tmpfs -o mpol=bind:1 tmpfs /dev/shm
> > + dd if=/dev/zero of=/dev/shm/file bs=64K count=25000
> > [  294.046130] Out of memory: Killed process 1074 (systemd) total-vm:21968kB, anon-rss:2048kB, file-rss:10164kB, shmem-rss:0kB, UID:1000 pgtables:88kB oom_score_adj:100
> > [  294.052224] Out of memory: Killed process 1076 ((sd-pam)) total-vm:21992kB, anon-rss:1772kB, file-rss:1832kB, shmem-rss:0kB, UID:1000 pgtables:76kB oom_score_adj:100
> > [  294.058446] Out of memory: Killed process 821 (unattended-upgr) total-vm:121388kB, anon-rss:13272kB, file-rss:16004kB, shmem-rss:0kB, UID:0 pgtables:140kB oom_score_adj:0
> > [  294.064551] Out of memory: Killed process 423 (systemd-resolve) total-vm:23200kB, anon-rss:2560kB, file-rss:11504kB, shmem-rss:0kB, UID:990 pgtables:88kB oom_score_adj:0
> > [  294.070491] Out of memory: Killed process 789 (udisksd) total-vm:470572kB, anon-rss:1920kB, file-rss:11840kB, shmem-rss:0kB, UID:0 pgtables:136kB oom_score_adj:0
> > [  294.076371] Out of memory: Killed process 848 (ModemManager) total-vm:391392kB, anon-rss:1792kB, file-rss:10516kB, shmem-rss:0kB, UID:0 pgtables:124kB oom_score_adj:0
> > [  294.082350] Out of memory: Killed process 733 (systemd-network) total-vm:20804kB, anon-rss:1296kB, file-rss:10068kB, shmem-rss:0kB, UID:998 pgtables:76kB oom_score_adj:0
> > [  294.088273] Out of memory: Killed process 1141 ((resolved)) total-vm:20604kB, anon-rss:1280kB, file-rss:8556kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
> > [  294.094350] Out of memory: Killed process 788 (systemd-logind) total-vm:18896kB, anon-rss:896kB, file-rss:7968kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:0
> > [  294.100461] Out of memory: Killed process 1151 ((resolved)) total-vm:20604kB, anon-rss:1280kB, file-rss:7732kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:0
> > [  294.106462] Out of memory: Killed process 1154 ((networkd)) total-vm:20604kB, anon-rss:1280kB, file-rss:8036kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:0
> > [  294.112592] Out of memory: Killed process 1155 ((resolved)) total-vm:20604kB, anon-rss:1280kB, file-rss:8648kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:0
> > [  294.118725] Out of memory: Killed process 1161 ((networkd)) total-vm:20604kB, anon-rss:1280kB, file-rss:8648kB, shmem-rss:0kB, UID:998 pgtables:84kB oom_score_adj:0
> > [  294.124827] Out of memory: Killed process 1165 ((resolved)) total-vm:20604kB, anon-rss:1280kB, file-rss:8484kB, shmem-rss:0kB, UID:0 pgtables:88kB oom_score_adj:0
> > [  294.131138] Out of memory: Killed process 1169 ((networkd)) total-vm:20604kB, anon-rss:1280kB, file-rss:8604kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
> > [  294.137548] Out of memory: Killed process 1177 ((resolved)) total-vm:20604kB, anon-rss:1280kB, file-rss:8592kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:0
> > [  294.144659] Out of memory: Killed process 1187 ((networkd)) total-vm:20604kB, anon-rss:1280kB, file-rss:8800kB, shmem-rss:0kB, UID:998 pgtables:80kB oom_score_adj:0
> > [  294.151118] Out of memory: Killed process 1179 (systemd-logind) total-vm:18728kB, anon-rss:1024kB, file-rss:7972kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
> > [  294.157569] Out of memory: Killed process 1194 ((networkd)) total-vm:20604kB, anon-rss:1280kB, file-rss:8596kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
> > [  294.163877] Out of memory: Killed process 417 (systemd-timesyn) total-vm:91608kB, anon-rss:896kB, file-rss:7132kB, shmem-rss:0kB, UID:996 pgtables:88kB oom_score_adj:0
> > [  294.170240] Out of memory: Killed process 783 (polkitd) total-vm:306832kB, anon-rss:640kB, file-rss:7264kB, shmem-rss:0kB, UID:988 pgtables:96kB oom_score_adj:0
> > [  294.176668] Out of memory: Killed process 1200 ((imesyncd)) total-vm:20604kB, anon-rss:1280kB, file-rss:7776kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:0
> > [  294.183107] Out of memory: Killed process 1205 (9) total-vm:20136kB, anon-rss:1152kB, file-rss:6584kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
> > [  294.189627] Out of memory: Killed process 1210 ((imesyncd)) total-vm:20604kB, anon-rss:1280kB, file-rss:7844kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
> > [  294.196227] Out of memory: Killed process 1209 ((d-logind)) total-vm:20140kB, anon-rss:1280kB, file-rss:7284kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
> > [  294.202956] Out of memory: Killed process 1212 ((imesyncd)) total-vm:20604kB, anon-rss:1280kB, file-rss:8568kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:0
> > [  294.209719] Out of memory: Killed process 1223 ((imesyncd)) total-vm:20604kB, anon-rss:1280kB, file-rss:8556kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
> > [  294.216356] Out of memory: Killed process 851 (rsyslogd) total-vm:220676kB, anon-rss:1280kB, file-rss:4292kB, shmem-rss:0kB, UID:101 pgtables:80kB oom_score_adj:0
> > [  294.223146] Out of memory: Killed process 1220 (systemd-logind) total-vm:18728kB, anon-rss:1024kB, file-rss:8044kB, shmem-rss:0kB, UID:0 pgtables:88kB oom_score_adj:0
> > [  294.229888] Out of memory: Killed process 1234 ((systemd)) total-vm:21992kB, anon-rss:1664kB, file-rss:8852kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:100
> > [  294.236624] Out of memory: Killed process 952 (login) total-vm:11220kB, anon-rss:768kB, file-rss:4616kB, shmem-rss:0kB, UID:0 pgtables:64kB oom_score_adj:0
> > [  294.243266] Out of memory: Killed process 940 (cron) total-vm:7512kB, anon-rss:256kB, file-rss:2760kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0
> > [  294.249871] Out of memory: Killed process 956 (agetty) total-vm:8516kB, anon-rss:128kB, file-rss:2492kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:0
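
To see at a glance how many of the killed processes belonged to which UID, a
log like the one quoted above can be summarised with a small filter. This is
only a sketch: kills_by_uid is a made-up name, and it assumes the "Out of
memory: Killed process ... UID:<n> ..." line format shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print one
# "UID <uid>: <count>" line per UID that had a process OOM-killed.
kills_by_uid() {
    sed -n 's/.*Out of memory: Killed process .* UID:\([0-9][0-9]*\) .*/\1/p' |
        sort -n | uniq -c | awk '{ print "UID " $2 ": " $1 }'
}

# Example usage (reading the kernel log may require privileges):
#   dmesg | kills_by_uid
```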

