[PATCH AUTOSEL 6.19-5.10] unshare: fix unshare_fs() handling

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Waiman Long <longman@redhat.com>,
	Christian Brauner <brauner@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	david@kernel.org, kees@kernel.org, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 6.19-5.10] unshare: fix unshare_fs() handling
Date: Thu,  5 Mar 2026 10:36:45 -0500	[thread overview]
Message-ID: <20260305153704.106918-2-sashal@kernel.org> (raw)
In-Reply-To: <20260305153704.106918-1-sashal@kernel.org>

From: Al Viro <viro@zeniv.linux.org.uk>

[ Upstream commit 6c4b2243cb6c0755159bd567130d5e12e7b10d9f ]

There's an unpleasant corner case in unshare(2), when we have a
CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that
case copy_mnt_ns() gets passed current->fs instead of a private copy,
which causes interesting warts in proof of correctness]

> I guess if private means fs->users == 1, the condition could still be true.

Unfortunately, it's worse than just a convoluted proof of correctness.
Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS
(and current->fs->users == 1).

We pass current->fs to copy_mnt_ns(), all right.  Suppose it succeeds and
flips current->fs->{pwd,root} to corresponding locations in the new namespace.
Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM).
We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's
destroyed and its mount tree is dissolved, but...  current->fs->root and
current->fs->pwd are both left pointing to now detached mounts.

They are pinning those, so it's not a UAF, but it leaves the calling
process with unshare(2) failing with -ENOMEM _and_ leaving it with
pwd and root on detached isolated mounts.  The last part is clearly a bug.

There is other fun related to that mess (races with pivot_root(), including
the one between pivot_root() and fork(), of all things), but this one
is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new
fs_struct even if it hadn't been shared in the first place".  Sure, we could
go for something like "if both CLONE_NEWNS *and* one of the things that might
end up failing after copy_mnt_ns() call in create_new_namespaces() are set,
force allocation of new fs_struct", but let's keep it simple - the cost
of copy_fs_struct() is trivial.

Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets
a freshly allocated fs_struct, yet to be attached to anything.  That
seriously simplifies the analysis...

FWIW, that bug had been there since the introduction of unshare(2) ;-/

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://patch.msgid.link/20260207082524.GE3183987@ZenIV
Tested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have a thorough understanding of the bug and the fix. Let me
compile the analysis.

## Analysis

### What the Commit Fixes

This is a real bug in `unshare(2)` that has existed since its
introduction in 2006. The bug is in `unshare_fs()` at
`kernel/fork.c:3085`:

When `unshare(CLONE_NEWNS | CLONE_NEWCGROUP)` is called with `fs->users
== 1`:
1. `unshare_fs()` skips allocating a new `fs_struct` (early return at
   line 3085)
2. `ksys_unshare()` passes `current->fs` directly to
   `create_new_namespaces()`
3. `copy_mnt_ns()` succeeds and **mutates** `current->fs->pwd` and
   `current->fs->root` to point into the new namespace
4. `copy_cgroup_ns()` fails (e.g., `-ENOMEM`)
5. Error unwind calls `put_mnt_ns()`, destroying the new namespace and
   dissolving its mount tree
6. **Result**: The process's `pwd` and `root` are left pointing to
   detached mounts

The process is left in a broken state — `unshare()` returns an error but
has silently corrupted the caller's filesystem context.

### The Fix

A single-line change adding `!(unshare_flags & CLONE_NEWNS) &&` to the
early-return condition. This forces allocation of a fresh `fs_struct`
whenever `CLONE_NEWNS` is set, ensuring `copy_mnt_ns()` never mutates
the live `current->fs`.

### Stable Kernel Criteria Assessment

1. **Obviously correct and tested**: Yes — authored by Al Viro (VFS
   maintainer), signed off by Christian Brauner (VFS maintainer), tested
   by Waiman Long (Red Hat). The logic is straightforward.

2. **Fixes a real bug**: Yes — a process can end up with pwd/root on
   detached mounts after a failed `unshare()`. This is a correctness bug
   that can break container runtimes and namespace tools.

3. **Important issue**: Yes — namespace operations are critical for
   containers (Docker, Kubernetes, LXC). A failed `unshare()` silently
   corrupting process state is serious.

4. **Small and contained**: Yes — a one-line change in a single file.

5. **No new features**: Correct — this only fixes error handling
   behavior.

6. **Applies cleanly**: The code at line 3085 (`if (fs->users == 1)`)
   has been unchanged since Al Viro's 2009 commit `498052bba55e`, so it
   should exist identically in all stable trees.

### Risk Assessment

**Risk: Very Low**
- The change only affects the `CLONE_NEWNS` path in `unshare()`
- The "cost" is an extra `copy_fs_struct()` allocation when
  `CLONE_NEWNS` is used with `fs->users == 1` — trivial overhead
- No behavioral change for the success path (the new fs_struct replaces
  the old one anyway)
- Only the failure path is meaningfully affected (preventing corruption)

### User Impact

**Impact: Medium-High**
- Affects any process calling `unshare(CLONE_NEWNS | <other_ns_flag>)`
  where a later namespace copy fails
- Container runtimes commonly use these flag combinations
- The bug has existed since Linux 2.6.16 (introduction of `unshare(2)`)
- Waiman Long's testing confirmed `copy_mnt_ns()` with `current->fs` was
  called 13 times during normal boot

### Verification

- **Read `kernel/fork.c:3077-3093`**: Confirmed the `unshare_fs()`
  function and the buggy `if (fs->users == 1)` early return at line 3085
- **Read `kernel/nsproxy.c:87-161`**: Confirmed
  `create_new_namespaces()` calls `copy_mnt_ns()` first (line 98), then
  `copy_cgroup_ns()` (line 123), with error unwind calling
  `put_mnt_ns()` (line 157) — confirming the failure scenario
- **Read `kernel/fork.c:3149-3150`**: Confirmed `CLONE_NEWNS` implies
  `CLONE_FS` (`unshare_flags |= CLONE_FS`), so `unshare_fs()` will
  always enter the main body when `CLONE_NEWNS` is set
- **git blame**: Confirmed the buggy line dates back to commit
  `498052bba55e` (Al Viro, 2009), itself a refactor of the original
  `unshare(2)` code from 2006
- **Mailing list (patch.msgid.link)**: Al Viro posted this as
  `[PATCH][RFC] bug in unshare(2) failure recovery`. Waiman Long tested
  and confirmed the fix eliminates the problematic code path. No
  objections raised.
- **git log --grep**: Found the fix as commit `6c4b2243cb6c0`

This is a clear, well-understood bug fix by the VFS maintainer with
testing confirmation from Red Hat. The fix is minimal (one line),
surgical, low-risk, and addresses a real correctness bug in a core
syscall that has existed for 20 years. It meets all stable kernel
criteria.

**YES**

 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8ec..68ccbaea7398a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3082,7 +3082,7 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 		return 0;

 	/* don't need lock here; in the worst case we'll do useless copy */
-	if (fs->users == 1)
+	if (!(unshare_flags & CLONE_NEWNS) && fs->users == 1)
 		return 0;

 	*new_fsp = copy_fs_struct(fs);
-- 
2.51.0

          parent reply	other threads:[~2026-03-05 15:37 UTC|newest]

Thread overview: expand[flat|nested]  mbox.gz  Atom feed
 [parent not found: <20260305153704.106918-1-sashal@kernel.org>]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260305153704.106918-2-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=brauner@kernel.org \
    --cc=david@kernel.org \
    --cc=juri.lelli@redhat.com \
    --cc=kees@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longman@redhat.com \
    --cc=mingo@redhat.com \
    --cc=patches@lists.linux.dev \
    --cc=peterz@infradead.org \
    --cc=stable@vger.kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox