Subject: an idea concerning i_mmap_rwsem vs fork/exec/exit
From: Mateusz Guzik @ 2025-04-30 22:56 UTC
  To: linux-mm

I'll note upfront that the biggest singular problem in a real
workload like kernel building is the lru vec handling. But let's
pretend that's fixed. ;)

One of the problems when executing binaries at scale is the tracking
of mappings backed by heavily shared inodes like libc or the loader.
Each process ends up with 5 mappings per inode -- that's 5 times it
serializes on the inode's i_mmap_rwsem to add them when execing and
forking (removal is batched, so there is only one acquire on that
side). That's terrible from a scalability standpoint.
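
For reference, here is roughly what the current scheme boils down to
(a simplified sketch modeled on the kernel's vma_interval_tree usage,
not the literal code):

static void link_file_vma(struct vm_area_struct *vma)
{
	struct address_space *mapping = vma->vm_file->f_mapping;

	/* one inode-wide rwsem shared by every process mapping the file */
	i_mmap_lock_write(mapping);
	vma_interval_tree_insert(vma, &mapping->i_mmap);
	i_mmap_unlock_write(mapping);
}

An exec of a dynamically linked binary does this 5 times for libc and
5 more for the loader, all funneling through the same two rwsems.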

Example: cat /proc/self/maps
[snip]
7d6076800000-7d6076828000 r--p 00000000 fc:01 23335661
  /usr/lib/x86_64-linux-gnu/libc.so.6
7d6076828000-7d60769b0000 r-xp 00028000 fc:01 23335661
  /usr/lib/x86_64-linux-gnu/libc.so.6
7d60769b0000-7d60769ff000 r--p 001b0000 fc:01 23335661
  /usr/lib/x86_64-linux-gnu/libc.so.6
7d60769ff000-7d6076a03000 r--p 001fe000 fc:01 23335661
  /usr/lib/x86_64-linux-gnu/libc.so.6
7d6076a03000-7d6076a05000 rw-p 00202000 fc:01 23335661
  /usr/lib/x86_64-linux-gnu/libc.so.6
[snip]
7d6076ba4000-7d6076ba5000 r--p 00000000 fc:01 23333229
  /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7d6076ba5000-7d6076bd0000 r-xp 00001000 fc:01 23333229
  /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7d6076bd0000-7d6076bda000 r--p 0002c000 fc:01 23333229
  /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7d6076bda000-7d6076bdc000 r--p 00036000 fc:01 23333229
  /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7d6076bdc000-7d6076bde000 rw-p 00038000 fc:01 23333229
  /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2

While one can argue something can be done to reduce the lock hold
time, repeated acquisition of the same lock is a known pattern that
puts a low ceiling on achievable scalability.

Regardless of what kind of data structure is used to represent these
mappings, the state needs to get decentralized in some capacity.

One option would be for mm_structs with mappings backed by a given
inode to add themselves to a list in that inode. Then the repeat
calls only need to concern themselves with modifying per-mm tracking.
The problem here is that there can be thousands of processes and
walking all the mappings (for rmap, truncation and the like) becomes
impractical.
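
Roughly, with made-up names just to illustrate the shape:

/* hypothetical per-mm link hung off the inode; all names made up */
struct i_mmap_link {
	struct list_head	node;	/* on a list in the address_space */
	struct mm_struct	*mm;
	struct rb_root_cached	vmas;	/* this mm's mappings of the inode */
};

The first mapping by a given mm would take the inode-wide lock once
to add the link; later additions/removals only touch link->vmas under
per-mm locking. The killer is the other side: unmap_mapping_range()
and the rmap walkers now have to visit every link.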

Another option is to distribute the tree per-CPU. This again can be a
problem on bigger boxen and I'm not confident it is all that
warranted.

imo a perfectly sensible way out is to merely distribute the state
with one instance per -- say -- 8 CPUs. This would be a tradeoff
between scalability and the total count of nodes to visit when
walking.
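
Something like this (hypothetical, all names made up):

struct i_mmap_shard {
	struct rw_semaphore	lock;
	struct rb_root_cached	tree;
} ____cacheline_aligned_in_smp;

/* e.g. max(1, num_possible_cpus() / 8) shards, sized at allocation */
static struct i_mmap_shard *i_mmap_pick_shard(struct i_mmap_shard *shards)
{
	return &shards[raw_smp_processor_id() / 8];
}

Insertion locks only the local shard (the vma has to remember which
shard it went into so removal hits the same tree), while rmap walkers
iterate all shards.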

I think it would also make sense to make it dynamic. For example,
start with the current centralized state and trylock on addition. If
failed trylocks go past a threshold, convert it to the distributed
state. Then future additions/removals are largely deserialized, while
comparatively rarely used binaries don't use extra memory.
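
In pseudo-kernel-C (the counter, threshold and conversion routine are
all made up, and the sketch glosses over racing a conversion against
a concurrent slow-path acquire):

static void i_mmap_insert(struct address_space *mapping,
			  struct vm_area_struct *vma)
{
	/* fast path: uncontended inodes keep the centralized tree */
	if (down_write_trylock(&mapping->i_mmap_rwsem)) {
		vma_interval_tree_insert(vma, &mapping->i_mmap);
		up_write(&mapping->i_mmap_rwsem);
		return;
	}

	/* hypothetical: count failed trylocks, convert past a threshold */
	if (atomic_inc_return(&mapping->i_mmap_contended) >
	    I_MMAP_CONVERT_THRESHOLD)
		i_mmap_convert_to_shards(mapping); /* must quiesce users */

	down_write(&mapping->i_mmap_rwsem);
	vma_interval_tree_insert(vma, &mapping->i_mmap); /* or the shard */
	up_write(&mapping->i_mmap_rwsem);
}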

This is a rough outline for someone interested; maybe someone will
have a better idea. Extra points for going through with it. ;)

cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>

