From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett",
	Lorenzo Stoakes, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng
Subject: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
Date: Fri, 30 May 2025 22:44:39 +1200
Message-Id: <20250530104439.64841-1-21cnbao@gmail.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Barry Song

Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.

Currently, the mmap_lock is always held during these operations, even
when unnecessary. This causes lock contention and can lead to severe
priority inversion: low-priority threads, such as Android's
HeapTaskDaemon, hold the lock and block higher-priority threads.

This patch enables the use of per-VMA locks when the advised range
lies entirely within a single VMA, avoiding the need for full VMA
traversal. In practice, userspace heaps rarely issue MADV_DONTNEED
across multiple VMAs.

Tangquan's testing shows that over 99.5% of memory reclaimed by
Android benefits from this per-VMA lock optimization. After extended
runtime, 217,735 madvise calls from HeapTaskDaemon used the per-VMA
path, while only 1,231 fell back to mmap_lock.

To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the
complexity of userfaultfd_remove().

Cc: "Liam R. Howlett"
Cc: Lorenzo Stoakes
Cc: David Hildenbrand
Cc: Vlastimil Babka
Cc: Jann Horn
Cc: Suren Baghdasaryan
Cc: Lokesh Gidra
Cc: Tangquan Zheng
Signed-off-by: Barry Song
---
-v2:
 * try to hide the per-vma lock in madvise_lock, per Lorenzo;
 * ideally, for vector_madvise(), we would be able to choose the lock
   type for each iteration; for the moment, we still use the global
   lock.
-v1:
 https://lore.kernel.org/linux-mm/20250527044145.13153-1-21cnbao@gmail.com/

 mm/madvise.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 70 insertions(+), 9 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 8433ac9b27e0..d408ffa404b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -51,6 +51,7 @@ struct madvise_walk_private {
 struct madvise_behavior {
 	int behavior;
 	struct mmu_gather *tlb;
+	struct vm_area_struct *vma;
 };
 
 /*
@@ -1553,6 +1554,21 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 	return unmapped_error;
 }
 
+/*
+ * Call the visit function on the single vma with the per_vma lock
+ */
+static inline
+int madvise_single_locked_vma(struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, void *arg,
+		int (*visit)(struct vm_area_struct *vma,
+			struct vm_area_struct **prev, unsigned long start,
+			unsigned long end, void *arg))
+{
+	struct vm_area_struct *prev;
+
+	return visit(vma, &prev, start, end, arg);
+}
+
 #ifdef CONFIG_ANON_VMA_NAME
 static int madvise_vma_anon_name(struct vm_area_struct *vma,
 				 struct vm_area_struct **prev,
@@ -1603,7 +1619,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif /* CONFIG_ANON_VMA_NAME */
 
-static int madvise_lock(struct mm_struct *mm, int behavior)
+static int __madvise_lock(struct mm_struct *mm, int behavior)
 {
 	if (is_memory_failure(behavior))
 		return 0;
@@ -1617,7 +1633,7 @@ static int madvise_lock(struct mm_struct *mm, int behavior)
 	return 0;
 }
 
-static void madvise_unlock(struct mm_struct *mm, int behavior)
+static void __madvise_unlock(struct mm_struct *mm, int behavior)
 {
 	if (is_memory_failure(behavior))
 		return;
@@ -1628,6 +1644,46 @@ static void madvise_unlock(struct mm_struct *mm, int behavior)
 	mmap_read_unlock(mm);
 }
 
+static int madvise_lock(struct mm_struct *mm, unsigned long start,
+		unsigned long len, struct madvise_behavior *madv_behavior)
+{
+	int behavior = madv_behavior->behavior;
+
+	/*
+	 * MADV_DONTNEED is commonly used with userspace heaps and most often
+	 * affects a single VMA. In these cases, we can use per-VMA locks to
+	 * reduce contention on the mmap_lock.
+	 */
+	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) {
+		struct vm_area_struct *vma;
+		unsigned long end;
+
+		start = untagged_addr(start);
+		end = start + len;
+		vma = lock_vma_under_rcu(mm, start);
+		if (!vma)
+			goto out;
+		if (end > vma->vm_end || userfaultfd_armed(vma)) {
+			vma_end_read(vma);
+			goto out;
+		}
+		madv_behavior->vma = vma;
+		return 0;
+	}
+
+out:
+	return __madvise_lock(mm, behavior);
+}
+
+static void madvise_unlock(struct mm_struct *mm,
+		struct madvise_behavior *madv_behavior)
+{
+	if (madv_behavior->vma)
+		vma_end_read(madv_behavior->vma);
+	else
+		__madvise_unlock(mm, madv_behavior->behavior);
+}
+
 static bool madvise_batch_tlb_flush(int behavior)
 {
 	switch (behavior) {
@@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
 		unsigned long start, size_t len_in,
 		struct madvise_behavior *madv_behavior)
 {
+	struct vm_area_struct *vma = madv_behavior->vma;
 	int behavior = madv_behavior->behavior;
+	struct blk_plug plug;
 	unsigned long end;
 	int error;
 
 	if (is_memory_failure(behavior))
 		return madvise_inject_error(behavior, start, start + len_in);
-	start = untagged_addr_remote(mm, start);
+	start = untagged_addr(start);
 	end = start + PAGE_ALIGN(len_in);
 	blk_start_plug(&plug);
 	if (is_madvise_populate(behavior))
 		error = madvise_populate(mm, start, end, behavior);
+	else if (vma)
+		error = madvise_single_locked_vma(vma, start, end,
+				madv_behavior, madvise_vma_behavior);
 	else
 		error = madvise_walk_vmas(mm, start, end, madv_behavior,
 				madvise_vma_behavior);
@@ -1817,13 +1878,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
-	error = madvise_lock(mm, behavior);
+	error = madvise_lock(mm, start, len_in, &madv_behavior);
 	if (error)
 		return error;
 	madvise_init_tlb(&madv_behavior, mm);
 	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
 	madvise_finish_tlb(&madv_behavior);
-	madvise_unlock(mm, behavior);
+	madvise_unlock(mm, &madv_behavior);
 
 	return error;
 }
@@ -1847,7 +1908,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 	total_len = iov_iter_count(iter);
 
-	ret = madvise_lock(mm, behavior);
+	ret = __madvise_lock(mm, behavior);
 	if (ret)
 		return ret;
 	madvise_init_tlb(&madv_behavior, mm);
@@ -1880,8 +1941,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 			/* Drop and reacquire lock to unwind race. */
 			madvise_finish_tlb(&madv_behavior);
-			madvise_unlock(mm, behavior);
-			madvise_lock(mm, behavior);
+			__madvise_unlock(mm, behavior);
+			__madvise_lock(mm, behavior);
 			madvise_init_tlb(&madv_behavior, mm);
 			continue;
 		}
@@ -1890,7 +1951,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
 	madvise_finish_tlb(&madv_behavior);
-	madvise_unlock(mm, behavior);
+	__madvise_unlock(mm, behavior);
 
 	ret = (total_len - iov_iter_count(iter)) ? : ret;
-- 
2.39.3 (Apple Git-146)