From mboxrd@z Thu Jan 1 00:00:00 1970
From: Lokesh Gidra
Date: Thu, 23 Jan 2025 10:45:37 -0800
Subject: Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
References: <20240215182756.3448972-5-lokeshgidra@google.com> <20250123041427.1987-1-21cnbao@gmail.com>
To: "Liam R. Howlett", Barry Song <21cnbao@gmail.com>,
 lokeshgidra@google.com, aarcange@redhat.com, akpm@linux-foundation.org,
 axelrasmussen@google.com, bgeffon@google.com, david@redhat.com,
 jannh@google.com, kaleshsingh@google.com, kernel-team@android.com,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, ngeoffray@google.com, peterx@redhat.com,
 rppt@kernel.org, ryan.roberts@arm.com, selinux@vger.kernel.org,
 surenb@google.com, timmurray@google.com, willy@infradead.org

On Thu, Jan 23, 2025 at 8:52 AM Liam R. Howlett wrote:
>
> * Barry Song <21cnbao@gmail.com> [250122 23:14]:
> > > All userfaultfd operations, except write-protect, opportunistically use
> > > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > > critical section.
> > >
> > > Write-protect operation requires mmap_lock as it iterates over multiple
> > > vmas.
> >
> > Hi Lokesh,
> >
> > Apologies for reviving this old thread. We truly appreciate the excellent work
> > you’ve done in transitioning many userfaultfd operations to per-VMA locks.
> >
> > However, we’ve noticed that userfaultfd still remains one of the largest users
> > of mmap_lock for write operations, with the other, binder, having been recently
> > addressed by Carlos Llamas's "binder: faster page installations" series:
> >
> > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
> >
> > The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > lower-priority background task, there are cases where, after acquiring the
> > mmap_lock, it gets preempted by other tasks.
> > As a result, even high-priority
> > threads waiting for the mmap_lock, whether in writer or reader mode, can
> > end up experiencing significant delays. (The delay can reach several hundred
> > milliseconds in the worst case.)

Do you happen to have some trace that I can take a look at?

>
> This needs an RFC or proposal or a discussion - certainly not a reply to
> an old v7 patch set. I'd want neon lights and stuff directing people to
> this topic.
>
> >
> > We haven’t yet identified an ideal solution for this. However, the Java heap
> > appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> > idea would be to designate a specific region of the user address space as
> > "volatile" and restrict all "volatile" VMAs to this isolated region.
>
> I'm going to assume the uffd changes are in the volatile area? But
> really, maybe you mean the opposite.. I'll just assume I guessed
> correctly here. Because, both sides of this are competing for the write
> lock.
>
> >
> > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> > mapped to the volatile space, while those without it will be mapped to the
> > non-volatile space.
> >
> >         ┌────────────┐TASK_SIZE
> >         │            │
> >         │            │
> >         │            │mmap VOLATILE
> >         ┼────────────┤
> >         │            │
> >         │            │
> >         │            │
> >         │            │
> >         │            │default mmap
> >         │            │
> >         │            │
> >         └────────────┘
>
> No, this is way too complicated for what you are trying to work around.
>
> You are proposing a segmented layout of the virtual memory area so that
> an optional (userfaultfd) component can avoid a lock - which already has
> another optional (vma locking) workaround.
>
> I think we need to stand back and look at what we're doing here in
> regards to userfaultfd and how it interacts with everything. Things
> have gotten complex and we're going in the wrong direction.
>
> I suggest there is an easier way to avoid the contention, and maybe try
> to rectify some of the uffd code to fit better with the evolved use
> cases and vma locking.
>
> >
> > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > which is independent of the mmap_lock for the non-volatile region.
> > Additionally, we ensure that no single VMA spans the boundary between
> > the volatile and non-volatile regions. This separation prevents the
> > frequent modifications of a small number of volatile VMAs from blocking
> > other operations on a large number of non-volatile VMAs.
> >
> > The implementation itself wouldn’t be overly complex, but the design
> > might come across as somewhat hacky.

I agree with the others. Your proposal sounds too radical and doesn't seem
necessary to me. I'd like to see the traces and understand how
real/frequent the issue is.

> >
> > Lastly, I have two questions:
> >
> > 1. Have you observed similar issues where userfaultfd continues to
> >    cause lock contention and priority inversion?

We haven't seen any such cases so far. However, for other reasons, we are
seriously considering temporarily increasing the GC thread's priority
while it is running a stop-the-world pause.

> >
> > 2. If so, do you have any ideas or suggestions on how to address this
> >    problem?

There are userspace solutions that could reduce or eliminate the number of
userfaultfd register/unregister calls made during a GC. I didn't implement
them because of the complexity they would add to the GC's code.

>
> These are good questions.
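A minimal sketch of the temporary priority boost mentioned above, written
in Python purely for illustration. It assumes the boost is expressed as a
nice value; boosted_priority() and the target of -5 are hypothetical and do
not describe what the actual GC code does or would do.

```python
import os
from contextlib import contextmanager


@contextmanager
def boosted_priority(target_nice=-5):
    """Temporarily lower the caller's nice value (i.e. raise its priority).

    Yields True if the boost was applied and False otherwise.
    target_nice is a hypothetical value chosen for illustration.
    """
    original = os.getpriority(os.PRIO_PROCESS, 0)
    boosted = False
    if target_nice < original:
        try:
            os.setpriority(os.PRIO_PROCESS, 0, target_nice)
            boosted = True
        except PermissionError:
            # Lowering the nice value needs CAP_SYS_NICE or a suitable
            # RLIMIT_NICE; without it, carry on at the old priority.
            pass
    try:
        yield boosted
    finally:
        if boosted:
            # Restoring only raises the nice value again, which an
            # unprivileged caller is allowed to do.
            os.setpriority(os.PRIO_PROCESS, 0, original)


# Usage: the stop-the-world work would run inside the with-block.
with boosted_priority() as applied:
    pass
```

The boost is skipped when the caller already runs at or below the target
nice value, so leaving the block never needs to lower the nice value again.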
>
> I have a few of my own about what you described:
>
> - What is causing your application to register/unregister so many uffds?

In every GC invocation, we have two userfaultfd_register() + mremap() calls
in a stop-the-world pause, and then two userfaultfd_unregister() calls at
the end of the GC. The problematic ones ought to be those in the pause, as
we want to keep it as short as possible. We register/unregister the heap
during GC so that the overhead of userfaults is avoided when GC is not
active.

>
> - Do the writes to the vmas overlap the register/unregister area
>   today? That is, do you have writes besides register/unregister going
>   into your proposed volatile area or uffd modifications happening in
>   the 'default mmap' area you specify above?

That shouldn't be the case. Accesses to uffd-registered VMAs should start
*after* registration; that is why registration is done in a pause. AFAIK,
the source of contention is a native (non-Java) thread, which is not
participating in the pause, performing an mmap_lock write operation
(mmap/munmap/mprotect/mremap/mlock etc.) elsewhere in the address space.
The heap can't be involved.

>
> Barry, this is a good LSF topic - will you be there? I hope to attend.
>
> Something along the lines of "Userfaultfd contention, interactions, and
> mitigations".
>
> Thanks,
> Liam
>
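To make the per-cycle cost described above concrete, here is a small
accounting sketch in Python. It assumes each listed call takes mmap_lock in
write mode exactly once and that a cycle contains a single mremap(); the
thread only says "two userfaultfd_register() + mremap()". GC_CYCLE and
write_lock_acquisitions() are hypothetical names, not part of any real API.

```python
# Calls made in one GC cycle, per the description in this thread.
# Each entry is (call name, issued inside the stop-the-world pause?).
# Counting a single mremap() per cycle is an assumption.
GC_CYCLE = (
    ("userfaultfd_register", True),
    ("userfaultfd_register", True),
    ("mremap", True),
    ("userfaultfd_unregister", False),
    ("userfaultfd_unregister", False),
)


def write_lock_acquisitions(cycles, ops=GC_CYCLE):
    """Count mmap_lock write-mode acquisitions over `cycles` GC cycles,
    assuming every call in `ops` takes the lock in write mode once."""
    in_pause = sum(1 for _name, paused in ops if paused) * cycles
    total = len(ops) * cycles
    return {
        "in_pause": in_pause,
        "after_pause": total - in_pause,
        "total": total,
    }


print(write_lock_acquisitions(10))
```

Passing a different ops tuple models a variant of the cycle, for example one
that registers less often, at whatever cost that variant carries outside GC.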