From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 4 Jun 2025 16:11:50 -0700
From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com,
 vbabka@suse.cz, peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org,
 mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com,
 brauner@kernel.org, josef@toxicpanda.com, yebin10@huawei.com,
 linux@weissschuh.net, willy@infradead.org, osalvador@suse.de,
 andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu,
 tjmercier@google.com, kaleshsingh@google.com, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org, surenb@google.com
Subject: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
Message-ID: <20250604231151.799834-7-surenb@google.com>
In-Reply-To: <20250604231151.799834-1-surenb@google.com>
References: <20250604231151.799834-1-surenb@google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Mailer: git-send-email 2.49.0.1266.g31b7d2e469-goog
With maple_tree supporting vma tree traversal under RCU and per-vma locks,
/proc/pid/maps can be read while holding individual vma locks instead of
locking the entire address space. A completely lockless approach would be
quite complex, the main issue being that get_vma_name() uses callbacks
which might not work correctly with a stable vma copy and require the
original (unstable) vma.

When per-vma lock acquisition fails, we take the mmap_lock for reading,
lock the vma, release the mmap_lock and continue. This guarantees forward
progress for the reader even during lock contention. This will interfere
with the writer, but only for a very short time while we are acquiring the
per-vma lock, and only when there was contention on the vma the reader is
interested in.

One case requires special handling: when the vma changes between the time
it was found and the time it got locked. A problematic case would be if
the vma shrank so that its start moved higher in the address space and a
new vma was installed at the beginning:

reader found:               |--------VMA A--------|
VMA is modified:            |-VMA B-|----VMA A----|
reader locks modified VMA A
reader reports VMA A:       |  gap  |----VMA A----|

This would result in reporting a gap in the address space that does not
exist. To prevent this we retry the lookup after locking the vma, but only
when we identify a gap and detect that the address space was changed after
we found the vma.

This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such as
monitoring/data collection services) from blocking address space updates.

Note that this change has a userspace-visible disadvantage: it allows for
sub-page data tearing, as opposed to the previous mechanism where data
tearing could happen only between pages of generated output data.
Since current userspace considers data tearing between pages to be
acceptable, we assume it will be able to handle sub-page data tearing
as well.

Signed-off-by: Suren Baghdasaryan
---
 fs/proc/internal.h |   6 ++
 fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 175 insertions(+), 8 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 96122e91c645..3728c9012687 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -379,6 +379,12 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct vma_iterator iter;
+	loff_t last_pos;
+#ifdef CONFIG_PER_VMA_LOCK
+	bool mmap_locked;
+	unsigned int mm_wr_seq;
+	struct vm_area_struct *locked_vma;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 27972c0749e7..36d883c4f394 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
 }
 #endif
 
-static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
-						loff_t *ppos)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
+					  struct vm_area_struct *vma,
+					  unsigned long last_pos,
+					  bool mm_unstable)
+{
+	vma = vma_start_read(priv->mm, vma);
+	if (IS_ERR_OR_NULL(vma))
+		return NULL;
+
+	/* Check if the vma we locked is the right one. */
+	if (unlikely(vma->vm_mm != priv->mm))
+		goto err;
+
+	/* vma should not be ahead of the last search position. */
+	if (unlikely(last_pos >= vma->vm_end))
+		goto err;
+
+	/*
+	 * vma ahead of last search position is possible but we need to
+	 * verify that it was not shrunk after we found it, and another
+	 * vma has not been installed ahead of it. Otherwise we might
+	 * observe a gap that should not be there.
+	 */
+	if (mm_unstable && last_pos < vma->vm_start) {
+		/* Verify only if the address space changed since vma lookup. */
+		if ((priv->mm_wr_seq & 1) ||
+		    mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
+			vma_iter_init(&priv->iter, priv->mm, last_pos);
+			if (vma != vma_next(&priv->iter))
+				goto err;
+		}
+	}
+
+	priv->locked_vma = vma;
+
+	return vma;
+err:
+	vma_end_read(vma);
+	return NULL;
+}
+
+static void unlock_vma(struct proc_maps_private *priv)
+{
+	if (priv->locked_vma) {
+		vma_end_read(priv->locked_vma);
+		priv->locked_vma = NULL;
+	}
+}
+
+static const struct seq_operations proc_pid_maps_op;
+
+static inline bool lock_content(struct seq_file *m,
+				struct proc_maps_private *priv)
+{
+	/*
+	 * smaps and numa_maps perform page table walk, therefore require
+	 * mmap_lock but maps can be read with locked vma only.
+	 */
+	if (m->op != &proc_pid_maps_op) {
+		if (mmap_read_lock_killable(priv->mm))
+			return false;
+
+		priv->mmap_locked = true;
+	} else {
+		rcu_read_lock();
+		priv->locked_vma = NULL;
+		priv->mmap_locked = false;
+	}
+
+	return true;
+}
+
+static inline void unlock_content(struct proc_maps_private *priv)
+{
+	if (priv->mmap_locked) {
+		mmap_read_unlock(priv->mm);
+	} else {
+		unlock_vma(priv);
+		rcu_read_unlock();
+	}
+}
+
+static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
+					   loff_t last_pos)
 {
-	struct vm_area_struct *vma = vma_next(&priv->iter);
+	struct vm_area_struct *vma;
+	int ret;
+
+	if (priv->mmap_locked)
+		return vma_next(&priv->iter);
+
+	unlock_vma(priv);
+	/*
+	 * Record sequence number ahead of vma lookup.
+	 * Odd seqcount means address space modification is in progress.
+	 */
+	mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+	vma = vma_next(&priv->iter);
+	if (!vma)
+		return NULL;
+
+	vma = trylock_vma(priv, vma, last_pos, true);
+	if (vma)
+		return vma;
+
+	/* Address space got modified, vma might be stale. Re-lock and retry */
+	rcu_read_unlock();
+	ret = mmap_read_lock_killable(priv->mm);
+	rcu_read_lock();
+	if (ret)
+		return ERR_PTR(ret);
+
+	/* Lookup the vma at the last position again under mmap_read_lock */
+	vma_iter_init(&priv->iter, priv->mm, last_pos);
+	vma = vma_next(&priv->iter);
+	if (vma) {
+		vma = trylock_vma(priv, vma, last_pos, false);
+		WARN_ON(!vma); /* mm is stable, has to succeed */
+	}
+	mmap_read_unlock(priv->mm);
+
+	return vma;
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline bool lock_content(struct seq_file *m,
+				struct proc_maps_private *priv)
+{
+	return mmap_read_lock_killable(priv->mm) == 0;
+}
+
+static inline void unlock_content(struct proc_maps_private *priv)
+{
+	mmap_read_unlock(priv->mm);
+}
+
+static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
+					   loff_t last_pos)
+{
+	return vma_next(&priv->iter);
+}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma;
+
+	vma = get_next_vma(priv, *ppos);
+	if (IS_ERR(vma))
+		return vma;
+
+	/* Store previous position to be able to restart if needed */
+	priv->last_pos = *ppos;
 	if (vma) {
-		*ppos = vma->vm_start;
+		/*
+		 * Track the end of the reported vma to ensure position changes
+		 * even if previous vma was merged with the next vma and we
+		 * found the extended vma with the same vm_start.
+		 */
+		*ppos = vma->vm_end;
 	} else {
 		*ppos = -2UL;
 		vma = get_gate_vma(priv->mm);
@@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return NULL;
 	}
 
-	if (mmap_read_lock_killable(mm)) {
+	if (!lock_content(m, priv)) {
 		mmput(mm);
 		put_task_struct(priv->task);
 		priv->task = NULL;
 		return ERR_PTR(-EINTR);
 	}
 
+	if (last_addr > 0)
+		*ppos = last_addr = priv->last_pos;
 	vma_iter_init(&priv->iter, mm, last_addr);
 	hold_task_mempolicy(priv);
 	if (last_addr == -2UL)
 		return get_gate_vma(mm);
 
-	return proc_get_vma(priv, ppos);
+	return proc_get_vma(m, ppos);
 }
 
 static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
@@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
 		*ppos = -1UL;
 		return NULL;
 	}
-	return proc_get_vma(m->private, ppos);
+	return proc_get_vma(m, ppos);
 }
 
 static void m_stop(struct seq_file *m, void *v)
@@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
 		return;
 
 	release_task_mempolicy(priv);
-	mmap_read_unlock(mm);
+	unlock_content(priv);
 	mmput(mm);
 	put_task_struct(priv->task);
 	priv->task = NULL;
-- 
2.49.0.1266.g31b7d2e469-goog