Date: Wed, 18 Jun 2025 21:05:42 +0800
From: Lance Yang
Subject: Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
To: David Hildenbrand, Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett", Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng, Lance Yang, Lorenzo Stoakes, Zi Li
References: <20250607220150.2980-1-21cnbao@gmail.com> <309d22ca-6cd9-4601-8402-d441a07d9443@lucifer.local>

On 2025/6/18 18:18, David Hildenbrand wrote:
> On 18.06.25 11:52, Barry Song wrote:
>> On Wed, Jun 18, 2025 at 10:25 AM Lance Yang wrote:
>>>
>>> Hi all,
>>>
>>> Crazy, the per-VMA lock
>>> for madvise is an absolute game-changer ;)
>>>
>>> On 2025/6/17 21:38, Lorenzo Stoakes wrote:
>>> [...]
>>>>
>>>> On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
>>>>> From: Barry Song
>>>>>
>>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>>> frequently than other madvise options, particularly in native and Java
>>>>> heaps for dynamic memory management.
>>>>>
>>>>> Currently, the mmap_lock is always held during these operations, even when
>>>>> unnecessary. This causes lock contention and can lead to severe priority
>>>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
>>>>> hold the lock and block higher-priority threads.
>>>>>
>>>>> This patch enables the use of per-VMA locks when the advised range lies
>>>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>>>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>>>>
>>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>>> only 1,231 fell back to mmap_lock.
>>>>>
>>>>> To simplify handling, the implementation falls back to the standard
>>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>>>>> userfaultfd_remove().
>>>>>
>>>>> Many thanks to Lorenzo's work[1] on:
>>>>> "Refactor the madvise() code to retain state about the locking mode
>>>>> utilised for traversing VMAs.
>>>>>
>>>>> Then use this mechanism to permit VMA locking to be done later in the
>>>>> madvise() logic and also to allow altering of the locking mode to permit
>>>>> falling back to an mmap read lock if required."
>>>>>
>>>>> One important point, as pointed out by Jann[2], is that
>>>>> untagged_addr_remote() requires holding mmap_lock. This is because
>>>>> address tagging on x86 and RISC-V is quite complex.
>>>>>
>>>>> Until untagged_addr_remote() becomes atomic—which seems unlikely in
>>>>> the near future—we cannot support per-VMA locks for remote processes.
>>>>> So for now, only local processes are supported.
>>>
>>> Just to put some numbers on it, I ran a micro-benchmark with 100
>>> parallel threads, where each thread calls madvise() on its own 1GiB

Correction: it uses 256MiB chunks per thread, not 1GiB ...

>>> chunk of 64KiB mTHP-backed memory. The performance gain is huge:
>>>
>>> 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
>>> faster)
>>> 2) MADV_FREE     saw its average time drop from 0.3078s to 0.1095s (~64%
>>> faster)
>>
>> Thanks for the report, Lance. I assume your micro-benchmark includes some
>> explicit or implicit operations that may require mmap_write_lock().
>> As mmap_read_lock() only waits for writers and does not block other
>> mmap_read_lock() calls.
>
> The number rather indicate that one test was run with (m)THPs enabled
> and the other not? Just a thought. The locking overhead from my
> experience is not that significant.
>

Both tests were run with 64KiB mTHP enabled on an Intel(R) Xeon(R) Silver
4314 CPU.
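For anyone who wants to double-check the percentages quoted above, they are just the relative drop in average time. A minimal sketch of that arithmetic follows; percent_time_saved() is a hypothetical helper name used only for illustration and is not part of the benchmark.

```c
#include <assert.h>

/*
 * Relative reduction in elapsed time, in percent.
 * percent_time_saved() is a hypothetical helper for illustration only.
 */
double percent_time_saved(double before_s, double after_s)
{
    /* The baseline must be positive, otherwise the ratio is meaningless. */
    assert(before_s > 0.0);
    return (before_s - after_s) / before_s * 100.0;
}
```

With the averages above, percent_time_saved(0.0508, 0.0270) comes out a little under 47 and percent_time_saved(0.3078, 0.1095) a little over 64, matching the rounded figures.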
The micro-benchmark code is as follows:

```
#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NUM_THREADS 100
#define MMAP_SIZE (512L * 1024 * 1024)
#define WRITE_START (128L * 1024 * 1024)
#define WRITE_SIZE (256L * 1024 * 1024)

/* Fallbacks in case the libc headers do not provide these. */
#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif
#ifndef MADV_DONTNEED
#define MADV_DONTNEED 4
#endif
#ifndef MADV_FREE
#define MADV_FREE 8
#endif

typedef struct {
    int id;
    int madvise_option;
} thread_data_t;

void *thread_function(void *arg)
{
    thread_data_t *data = (thread_data_t *)arg;

    /* Reserve 512MiB; only the middle 256MiB is made accessible below. */
    uint8_t *mmap_area = mmap(NULL, MMAP_SIZE, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mmap_area == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    if (mprotect(mmap_area + WRITE_START, WRITE_SIZE,
                 PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        munmap(mmap_area, MMAP_SIZE);
        return NULL;
    }

    if (madvise(mmap_area + WRITE_START, WRITE_SIZE, MADV_HUGEPAGE) != 0) {
        perror("madvise hugepage");
        munmap(mmap_area, MMAP_SIZE);
        return NULL;
    }

    /* Touch every byte so the whole range is populated before timing. */
    for (size_t i = 0; i < WRITE_SIZE; i++) {
        mmap_area[WRITE_START + i] = 255;
    }

    /* Time only the madvise() call under test. */
    struct timespec start_time, end_time;
    clock_gettime(CLOCK_MONOTONIC, &start_time);

    if (madvise(mmap_area + WRITE_START, WRITE_SIZE,
                data->madvise_option) != 0) {
        perror("madvise");
    }

    clock_gettime(CLOCK_MONOTONIC, &end_time);

    double elapsed_time = (end_time.tv_sec - start_time.tv_sec) +
                          (end_time.tv_nsec - start_time.tv_nsec) / 1e9;
    printf("Thread %d elapsed time: %.6f seconds\n", data->id, elapsed_time);

    munmap(mmap_area, MMAP_SIZE);
    return NULL;
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <madvise_option>\n", argv[0]);
        fprintf(stderr, " 1: MADV_DONTNEED\n");
        fprintf(stderr, " 2: MADV_FREE\n");
        return EXIT_FAILURE;
    }

    int madvise_option;
    int choice = atoi(argv[1]);
    if (choice == 1) {
        madvise_option = MADV_DONTNEED;
    } else if (choice == 2) {
        madvise_option = MADV_FREE;
    } else {
        fprintf(stderr, "Invalid madvise_option. Use 1 for MADV_DONTNEED or 2 for MADV_FREE.\n");
        return EXIT_FAILURE;
    }

    pthread_t threads[NUM_THREADS];
    thread_data_t thread_data[NUM_THREADS];
    int i;

    for (i = 0; i < NUM_THREADS; i++) {
        thread_data[i].id = i;
        thread_data[i].madvise_option = madvise_option;
        /* Bail out early so we never join a thread that was not created. */
        if (pthread_create(&threads[i], NULL, thread_function,
                           &thread_data[i]) != 0) {
            perror("pthread_create");
            return EXIT_FAILURE;
        }
    }

    for (i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    sleep(10);
    return 0;
}
```

Thanks,
Lance
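The benchmark prints one line per thread, while the figures quoted earlier in the thread are averages. One possible way to reduce that output to an average is sketched below. parse_elapsed() and mean_elapsed() are hypothetical helpers introduced here for illustration; they are not part of the benchmark, and only assume the "Thread %d elapsed time: %.6f seconds" format printed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse one line of benchmark output.
 * Returns 1 when both the thread id and the time were parsed, 0 otherwise.
 * Hypothetical helper, matched to the printf() format used by the benchmark.
 */
int parse_elapsed(const char *line, double *seconds)
{
    int id;

    assert(line != NULL && seconds != NULL);
    return sscanf(line, "Thread %d elapsed time: %lf seconds",
                  &id, seconds) == 2;
}

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
double mean_elapsed(const double *samples, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return n ? sum / (double)n : 0.0;
}
```

A caller would feed each output line to parse_elapsed(), collect the successfully parsed values into an array, and pass that array to mean_elapsed().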