From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com,
    vbabka@suse.cz, peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org,
    mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org,
    adobriyan@gmail.com, brauner@kernel.org, josef@toxicpanda.com,
    yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org,
    osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com,
    christophe.leroy@csgroup.eu, tjmercier@google.com,
    kaleshsingh@google.com, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kselftest@vger.kernel.org, surenb@google.com
Subject: [PATCH v5 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY
Date: Tue, 24 Jun 2025 12:33:52 -0700
Message-ID: <20250624193359.3865351-1-surenb@google.com>

Reading /proc/pid/maps requires read-locking mmap_lock, which prevents
any other task from concurrently modifying the address space. This
guarantees coherent reporting of virtual address ranges, however it can
block important updates from happening. Oftentimes /proc/pid/maps
readers are low-priority monitoring tasks, and when they block
high-priority tasks the result is priority inversion.

Locking the entire address space is required to present a fully
coherent picture of the address space, however even the current
implementation does not strictly guarantee that, because it outputs
vmas in page-size chunks and drops mmap_lock between chunks. Address
space modifications are possible while mmap_lock is dropped, and
userspace reading the content is expected to deal with possible
concurrent address space modifications. Considering these relaxed
rules, holding mmap_lock is not strictly needed as long as we can
guarantee that a concurrently modified vma is reported either in its
original form or after it was modified.

This patchset switches from holding mmap_lock while reading
/proc/pid/maps to taking per-vma locks as we walk the vma tree. This
reduces the contention with tasks modifying the address space because
they would have to contend for the same vma as opposed to the entire
address space. The same is done for the PROCMAP_QUERY ioctl, which
locks only the vma that falls into the requested range instead of the
entire address space.

The previous version of this patchset [1] tried to perform
/proc/pid/maps reading under RCU, however its implementation is quite
complex and the results are worse than with the new version because it
still relied on mmap_lock speculation, which retries if any part of the
address space gets modified. The new implementation is both simpler and
results in less contention. Note that a similar approach would not work
for /proc/pid/smaps reading, as it also walks the page table and that
is not RCU-safe.
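
To illustrate what "userspace is expected to deal with concurrent
modifications" can mean for a reader, here is a minimal userspace
sketch. It is not part of this series. The names struct maps_entry,
parse_maps_line() and scan_self_maps() are hypothetical, and the sketch
assumes each line fits into the 4096-byte buffer. It parses the leading
"start-end perms" fields and counts entries that start below the end of
the previously reported entry, which is how a vma reported twice would
show up to a reader.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the leading fields of one maps line. */
struct maps_entry {
	unsigned long start;
	unsigned long end;
	char perms[5];		/* e.g. "r-xp" plus the terminating NUL */
};

/* Parse "start-end perms ..."; returns 0 on success, -1 otherwise. */
int parse_maps_line(const char *line, struct maps_entry *e)
{
	assert(line && e);
	if (sscanf(line, "%lx-%lx %4s", &e->start, &e->end, e->perms) != 3)
		return -1;
	return e->start < e->end ? 0 : -1;
}

/*
 * Walk /proc/self/maps once. *entries receives the number of parsed
 * lines, *overlaps the number of entries that begin below the end of
 * the previously reported entry (a range that was reported twice).
 * Unparsable lines are skipped instead of being treated as fatal.
 */
int scan_self_maps(size_t *entries, size_t *overlaps)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[4096];
	struct maps_entry cur;
	unsigned long prev_end = 0;

	if (!f)
		return -1;
	*entries = 0;
	*overlaps = 0;
	while (fgets(line, sizeof(line), f)) {
		if (parse_maps_line(line, &cur))
			continue;
		if (*entries && cur.start < prev_end)
			(*overlaps)++;
		prev_end = cur.end;
		(*entries)++;
	}
	fclose(f);
	return 0;
}
```

A monitoring tool built along these lines would treat a non-zero
overlap count as a reason to re-read the file or to merge the
duplicated ranges, not as an error.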

Paul McKenney designed a test [2] to measure mmap/munmap latencies
while concurrently reading /proc/pid/maps. The test has a pair of
processes scanning /proc/PID/maps, and another process unmapping and
remapping 4K pages from a 128MB range of anonymous memory. At the end
of each 10 second run, the latency of each mmap() or munmap() operation
is measured, and for each run the maximum and mean latency are printed.
The map/unmap process is started first, its PID is passed to the
scanners, and then the map/unmap process waits until both scanners are
running before starting its timed test. The scanners keep scanning
until the specified /proc/PID/maps file disappears. This test
registered close to a 10x improvement in update latencies:

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.011     0.008     0.455
    0.011     0.008     0.472
    0.011     0.008     0.535
    0.011     0.009     0.545
    ...
    0.011     0.014     2.875
    0.011     0.014     2.913
    0.011     0.014     3.007
    0.011     0.015     3.018

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.006     0.005     0.036
    0.006     0.005     0.039
    0.006     0.005     0.039
    0.006     0.005     0.039
    ...
    0.006     0.006     0.403
    0.006     0.006     0.474
    0.006     0.006     0.479
    0.006     0.006     0.498

The patchset also adds a number of tests to check for /proc/pid/maps
data coherency. They are designed to detect any unexpected data tearing
while performing some common address space modifications (vma split,
resize and remap). Even before these changes, reading /proc/pid/maps
could return inconsistent data because the file is read page-by-page
with mmap_lock being dropped between the pages. An example of
user-visible inconsistency is the same vma being printed twice: once
before it was modified and then again after the modifications. For
example, if a vma was extended, it might be found and reported twice.
What is not expected is to see a gap where there should have been a vma
both before and after the modification.

This patchset increases the chances of such tearing, therefore it's
even more important now to test for unexpected inconsistencies.

In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:

Merges with changes to existing vmas:
1. Merge both - mapping a vma over another one and between two vmas
   which can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
   completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
   partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
   and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing
   one and partially over its left neighbor;

Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
   merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start of an existing one;

Splits:
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;

If such merges or splits happen concurrently with the /proc/pid/maps
reading, we might report a vma twice, once before the modification and
once after it is modified:

Case 1 might report the overwritten and previous vma along with the
final merged vma;
Case 2 might report the previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused
by shrinking of the right neighbor;
Case 4 might report the overwritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused
by shrinking of the left neighbor;
Case 6 might report the previous vma and the gap along with the final
merged vma;
Case 7 might report the previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering
the gap;
Case 9 might cause us to retry once we detect the temporary gap caused
by shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused
by shrinking of the original vma at the vma end;

In all these cases the retry mechanism prevents us from reporting
possible temporary gaps.
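
The expectation stated above (a vma may be reported twice, but no gap
may appear where a vma existed both before and after the modification)
can be modelled in userspace roughly as follows. This is only a sketch
and not the selftest code added by this series; struct vma_range and
region_fully_reported() are hypothetical names, and the entries are
assumed to be examined in the order in which they were read from
/proc/pid/maps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical representation of one reported "start-end" range. */
struct vma_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Return true if the reported ranges cover [start, end) without a hole.
 * Ranges reported twice (for example a vma in its original form and
 * again in its merged or extended form) are tolerated; only a hole
 * inside the region counts as a tear.
 */
bool region_fully_reported(const struct vma_range *r, size_t n,
			   unsigned long start, unsigned long end)
{
	unsigned long covered_to = start;

	assert(r || n == 0);
	for (size_t i = 0; i < n && covered_to < end; i++) {
		/* Entirely below what is already covered: a duplicate
		 * or an unrelated lower mapping, nothing new to add. */
		if (r[i].end <= covered_to)
			continue;
		/* Starts above the covered part: a hole was reported. */
		if (r[i].start > covered_to)
			return false;
		covered_to = r[i].end;
	}
	return covered_to >= end;
}
```

For example, a reader that saw 0x1000-0x2000 followed by 0x1000-0x3000
for a region that was being extended would pass this check, while a
reader that saw 0x1000-0x2000 followed by 0x3000-0x4000 for a region
known to be fully mapped would not.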

Changes from v4 [4]:
- refactored trylock_vma() and other locking parts into mmap_lock.c,
  per Lorenzo
- renamed {lock|unlock}_content() into {lock|unlock}_vma_range(),
  per Lorenzo
- added clarifying comments for sentinels, per Lorenzo
- introduced is_sentinel_pos() helper function
- fixed position reset logic when last_addr is a sentinel, per Lorenzo
- added Acked-by to the last patch, per Andrii Nakryiko

[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.local/
[4] https://lore.kernel.org/all/20250604231151.799834-1-surenb@google.com/

Suren Baghdasaryan (7):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    remapping
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
    modified
  selftests/proc: add verbose more for tests to facilitate debugging
  mm/maps: read proc/pid/maps under per-vma lock
  mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h                         |   5 +
 fs/proc/task_mmu.c                         | 179 ++++-
 include/linux/mmap_lock.h                  |  11 +
 mm/mmap_lock.c                             |  88 +++
 tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
 5 files changed, 1053 insertions(+), 23 deletions(-)


base-commit: 0b2a863368fb0cf674b40925c55dc8898c5a33af
-- 
2.50.0.714.g196bf9f422-goog