From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Date: Thu, 24 Apr 2025 10:44:19 -0700
Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
To: "Liam R. Howlett", Suren Baghdasaryan, akpm@linux-foundation.org, lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz, peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com, brauner@kernel.org, josef@toxicpanda.com, yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org, osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org
References: <20250418174959.1431962-1-surenb@google.com> <20250418174959.1431962-8-surenb@google.com>
On Thu, Apr 24, 2025 at 9:42 AM Liam R. Howlett wrote:
>
> * Andrii Nakryiko [250424 12:04]:
> > On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan wrote:
> > >
> > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett wrote:
> > > >
> > > > * Andrii Nakryiko [250423 18:06]:
> > > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan wrote:
> > > > > >
> > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko wrote:
> > > > > > >
> > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > >
> > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > > > pin pointer fields used when generating the output (currently only
> > > > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > > > space modifications, wait for them to end and retry.
> > > > > > > > While we take the mmap_lock for reading during such contention,
> > > > > > > > we do that momentarily only to record the new mm_wr_seq counter.
> > > > > > > > This change is designed to reduce
> > > > > > >
> > > > > > > This is probably a stupid question, but why do we need to take a lock
> > > > > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > > > > even for reads, and still record this seq counter. And then detect
> > > > > > > whether there were any modifications in between. Why does this change
> > > > > > > need the more heavy-weight mmap_read_lock to do speculative reads?
> > > > > >
> > > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > > > to finish what it's doing, and then we continue by recording a new
> > > > > > sequence counter value and calling mmap_read_unlock. This is what
> > > > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > > > dance we do in get_vma_snapshot(). I think that should work.
> > > > >
> > > > > Ok, yeah, it would be great to avoid taking a lock in the common case!
> > > >
> > > > We can check this counter once per 4k block and maintain the same
> > > > 'tearing' that exists today instead of per-vma. Not that anyone said
> > > > they had an issue with changing it, but since we're on this road anyway
> > > > I thought I'd point out where we could end up.
> > >
> > > We would need to run that check on the last call to show_map() right
> > > before seq_file detects the overflow and flushes the page. On
> > > contention we will also be throwing away more prepared data (up to a
> > > page worth of records) vs only the last record.
> > > All in all I'm not convinced this is worth doing unless increased
> > > chances of data tearing are identified as a problem.
> >
> > Yep, I agree, with filling out 4K of data we run into much higher
> > chances of conflict, IMO. Not worth it, I'd say.
>
> Sounds good.
>
> If this is an issue we do have a path forward still, although it's less
> desirable.
>
> > > > I am concerned about live-locking in either scenario, but I haven't
> > > > looked too deeply into this pattern.
> > > >
> > > > I also don't love (as usual) the lack of ensured forward progress.
> > >
> > > Hmm. Maybe we should add a retry limit on
> > > mmap_lock_speculate_try_begin() and once the limit is hit we just take
> > > the mmap_read_lock and proceed with it? That would prevent a
> > > hyperactive writer from blocking the reader's forward progress
> > > indefinitely.
> >
> > Came here to say the same. I'd add a small number of retries (3-5?)
> > and then fall back to the read-locked approach. The main challenge is
> > to keep all this logic nicely isolated from the main VMA
> > search/printing logic.
> >
> > For a similar pattern in uprobes, we don't even bother to retry, we
> > just fall back to mmap_read_lock and proceed, under the assumption that
> > this is going to be very rare and thus not important from the overall
> > performance perspective.
>
> In this problem space we are dealing with a herd of readers caused by
> writers delaying an ever-growing line of readers, right?

I'm assuming that the common case is that there is no writer: we attempt
a lockless vma read, but then (very rarely) a writer comes in and starts
to change something, disrupting the read.

In the uprobe vma lookup speculation case we don't even attempt the
lockless read if there is an active writer, we just fall back to
mmap_read_lock. So I guess in that case we don't really need many
retries. Just check if there is an active writer, and if so, take the
mmap_read_lock.
If there was no writer, speculate, and when done double-check that
nothing changed. If something changed, retry with mmap_read_lock. Does
that sound more reasonable?

>
> Assuming there is a backup caused by a writer, then I don't know if the
> retry is going to do anything more than heat the data centre.

In this scenario, yes, I agree that retrying isn't useful, because the
writer is probably going to be quite a lot slower than the fast readers.
So, as above, perhaps no retries are needed beyond the single
lockless -> mmap_read_lock retry. Just a quick
mmap_lock_speculate_try_begin() check at the start.

BTW, I realized that my references to uprobe speculation come with no
context or code pointers. I'm talking about
find_active_uprobe_speculative() in kernel/events/uprobes.c, if you are
curious.

> The readers that take the read lock will get the data, while the others
> who arrive during the read-locked time can try lockless, but will most
> likely have a run time that extends beyond the readers holding the lock
> and will probably be interrupted by the writer.
>
> We can predict the new readers will also not make it through in time
> because the earlier ones failed. The new readers will then take the
> lock and grow the line of readers.
>
> Does that make sense?

I think so, though I'm not 100% sure I got all the points you are
raising. But see above, and let me know if my thoughts make sense to
you :)

> Thanks,
> Liam