From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 28 Jan 2025 06:08:31 +0800
Subject: Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
To: Lokesh Gidra
Cc: "Liam R. Howlett", aarcange@redhat.com, akpm@linux-foundation.org, axelrasmussen@google.com, bgeffon@google.com, david@redhat.com, jannh@google.com, kaleshsingh@google.com, kernel-team@android.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, ngeoffray@google.com, peterx@redhat.com, rppt@kernel.org, ryan.roberts@arm.com, selinux@vger.kernel.org, surenb@google.com, timmurray@google.com, willy@infradead.org
References: <20240215182756.3448972-5-lokeshgidra@google.com> <20250123041427.1987-1-21cnbao@gmail.com>
On Fri, Jan 24, 2025 at 2:45 AM Lokesh Gidra wrote:
>
> On Thu, Jan 23, 2025 at 8:52 AM Liam R. Howlett wrote:
> >
> > * Barry Song <21cnbao@gmail.com> [250122 23:14]:
> > > > All userfaultfd operations, except write-protect, opportunistically
> > > > use per-vma locks to lock vmas. On failure, attempt again inside
> > > > mmap_lock critical section.
> > > >
> > > > Write-protect operation requires mmap_lock as it iterates over
> > > > multiple vmas.
> > >
> > > Hi Lokesh,
> > >
> > > Apologies for reviving this old thread. We truly appreciate the
> > > excellent work you've done in transitioning many userfaultfd
> > > operations to per-VMA locks.
> > >
> > > However, we've noticed that userfaultfd still remains one of the
> > > largest users of mmap_lock for write operations; the other large
> > > user, binder, has recently been addressed by Carlos Llamas's
> > > "binder: faster page installations" series:
> > >
> > > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
> > >
> > > The HeapTaskDaemon (Java GC) might frequently perform
> > > userfaultfd_register() and userfaultfd_unregister() operations, both
> > > of which require the mmap_lock in write mode to either split or
> > > merge VMAs.
> > > Since HeapTaskDaemon is a lower-priority background task, there are
> > > cases where, after acquiring the mmap_lock, it gets preempted by
> > > other tasks. As a result, even high-priority threads waiting for the
> > > mmap_lock, whether in writer or reader mode, can end up experiencing
> > > significant delays. (The delay can reach several hundred
> > > milliseconds in the worst case.)
>
> Do you happen to have some trace that I can take a look at?

We observed a rough trace in Android Studio showing the HeapTaskDaemon
stuck in a runnable state after holding the mmap_lock for 1 second, while
other threads were waiting for the lock. Our team will assist in collecting
a detailed trace, but everyone is currently on an extended Chinese New Year
holiday. Apologies, this may delay the process until after February 8.

> >
> > This needs an RFC or proposal or a discussion - certainly not a reply
> > to an old v7 patch set. I'd want neon lights and stuff directing
> > people to this topic.
> >
> > >
> > > We haven't yet identified an ideal solution for this. However, the
> > > Java heap appears to behave like a "volatile" vma in its usage. A
> > > somewhat simplistic idea would be to designate a specific region of
> > > the user address space as "volatile" and restrict all "volatile"
> > > VMAs to this isolated region.
> >
> > I'm going to assume the uffd changes are in the volatile area? But
> > really, maybe you mean the opposite.. I'll just assume I guessed
> > correctly here. Because both sides of this are competing for the
> > write lock.
> >
> > >
> > > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag
> > > will be mapped to the volatile space, while those without it will be
> > > mapped to the non-volatile space.
> > >
> > >    ┌────────────────┐ TASK_SIZE
> > >    │                │
> > >    │                │
> > >    │                │ mmap VOLATILE
> > >    ├────────────────┤
> > >    │                │
> > >    │                │
> > >    │                │
> > >    │                │
> > >    │                │ default mmap
> > >    │                │
> > >    │                │
> > >    └────────────────┘
> >
> > No, this is way too complicated for what you are trying to work
> > around.
> >
> > You are proposing a segmented layout of the virtual memory area so
> > that an optional (userfaultfd) component can avoid a lock - which
> > already has another optional (vma locking) workaround.
> >
> > I think we need to stand back and look at what we're doing here in
> > regards to userfaultfd and how it interacts with everything.  Things
> > have gotten complex and we're going in the wrong direction.
> >
> > I suggest there is an easier way to avoid the contention, and maybe
> > try to rectify some of the uffd code to fit better with the evolved
> > use cases and vma locking.
> >
> > >
> > > VMAs in the volatile region are assigned their own
> > > volatile_mmap_lock, which is independent of the mmap_lock for the
> > > non-volatile region. Additionally, we ensure that no single VMA
> > > spans the boundary between the volatile and non-volatile regions.
> > > This separation prevents the frequent modifications of a small
> > > number of volatile VMAs from blocking other operations on a large
> > > number of non-volatile VMAs.
> > >
> > > The implementation itself wouldn't be overly complex, but the
> > > design might come across as somewhat hacky.
>
> I agree with others.
> Your proposal sounds too radical and doesn't seem necessary to me. I'd
> like to see the traces and understand how real/frequent the issue is.

No worries. I figured the idea might not be well received, since it was
more of a hack; it was just an attempt to illustrate that some VMAs might
contribute more mmap_lock contention ("volatile") while others might not.

> > >
> > > Lastly, I have two questions:
> > >
> > > 1. Have you observed similar issues where userfaultfd continues to
> > > cause lock contention and priority inversion?
>
> We haven't seen any such cases so far. But due to some other reasons,
> we are seriously considering temporarily increasing the GC-thread's
> priority when it is running a stop-the-world pause.
>
> > >
> > > 2. If so, do you have any ideas or suggestions on how to address
> > > this problem?
>
> There are userspace solutions possible to reduce/eliminate the number
> of times userfaultfd register/unregister are done during a GC. I
> didn't do it due to the added complexity it would introduce to the
> GC's code.
>
> > These are good questions.
> >
> > I have a few of my own about what you described:
> >
> > - What is causing your application to register/unregister so many
> >   uffds?
>
> In every GC invocation, we have two userfaultfd_register() + mremap()
> calls in a stop-the-world pause, and then two userfaultfd_unregister()
> calls at the end of GC. The problematic ones ought to be the ones in
> the pause, as we want to keep it as short as possible. The reason we
> register/unregister the heap during GC is so that the overhead of
> userfaults can be avoided when GC is not active.
>
> >
> > - Do the writes to the vmas overlap the register/unregister area
> >   today? That is, do you have writes besides register/unregister
> >   going into your proposed volatile area, or uffd modifications
> >   happening in the 'default mmap' area you specify above?
>
> That shouldn't be the case. The access to uffd-registered VMAs should
> start *after* registration.
> That's the reason it is done in a pause.
> AFAIK, the source of contention is if some native (non-Java) thread,
> which is not participating in the pause, does a mmap_lock write
> operation (mmap/munmap/mprotect/mremap/mlock etc.) elsewhere in the
> address space. The heap can't be involved.

Exactly. Essentially, we observe that the GC holds the mmap_lock but gets
preempted for an extended period, causing other tasks performing mmap-like
operations to wait for the GC to release the lock.

> >
> > Barry, this is a good LSF topic - will you be there?  I hope to
> > attend.
> >
> > Something along the lines of "Userfaultfd contention, interactions,
> > and mitigations".

Thank you for your interest in this topic.

It's unlikely that a travel budget will be available, so I won't be
attending in person. I might apply for virtual attendance to participate
in some discussions, but I don't plan to run a session remotely; too many
things can go wrong.

> >
> > Thanks,
> > Liam

Thanks
Barry