From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 571C1C3ABB0 for ; Mon, 5 May 2025 16:38:43 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 97FD46B008A; Mon, 5 May 2025 12:38:40 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 92E156B008C; Mon, 5 May 2025 12:38:40 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7F63A6B0092; Mon, 5 May 2025 12:38:40 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 5F2FE6B008A for ; Mon, 5 May 2025 12:38:40 -0400 (EDT) Received: from smtpin30.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 10BD31A02CE for ; Mon, 5 May 2025 16:38:42 +0000 (UTC) X-FDA: 83409412884.30.1F65961 Received: from mail-qt1-f170.google.com (mail-qt1-f170.google.com [209.85.160.170]) by imf18.hostedemail.com (Postfix) with ESMTP id 27ED51C000D for ; Mon, 5 May 2025 16:38:40 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=nEnGUgmA; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf18.hostedemail.com: domain of surenb@google.com designates 209.85.160.170 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1746463120; a=rsa-sha256; cv=none; b=8bFMbncP/otSRpzBUF4MuyzPWYc/A6yFU5D6kCT+69VbDJn6epuIRALHksiw2dpdS3O7hG M3qupS4kbdFomc1EOBMFloLqblEePAbQdPyjd5ecNwSk4xXT5DjFGnQvw315hnpgZcydAa 1rt37IOREAchWO4kXh6he5VWJM5Fg68= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1746463120; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=e8cGfqMPxInbFVttVOXD20cN2ryppIoxJ2ihtNYujm8=; b=5qI8FsguCpKJ+SltJNdsrHL1tGn06SnLKFZMRobmNkdByZ6wK+3wI9yHZ8cTRLuIqFHDPz WpLnIzfvIFIzFNiUnAjrg23vberC/gKBvvE67r2c0a+y9kAWzGO52ZTt9ZqKJm3Yu9Utkb BnIKX1R3bjnVnsrkClLAOPJW4qLReBY= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=nEnGUgmA; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf18.hostedemail.com: domain of surenb@google.com designates 209.85.160.170 as permitted sender) smtp.mailfrom=surenb@google.com Received: by mail-qt1-f170.google.com with SMTP id d75a77b69052e-47666573242so3061cf.0 for ; Mon, 05 May 2025 09:38:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1746463119; x=1747067919; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=e8cGfqMPxInbFVttVOXD20cN2ryppIoxJ2ihtNYujm8=; b=nEnGUgmAShPt25sIvEznMG5XXqy31Y/EuLwexgZmc/nJruC+jbc0O8hz3veq6qhW5a VyczzD1/0XO4TcVlFhf+MfWqNYIXT5D4kSrg9/oZPoBtaVOG+T+zFdG+cmfsrHE2wXwN X9qcIrnMgwx8ZfwxC8KnOehwVOMRDhAbuUKxxPHMQ9qVjtzYGll4AmthImQ+6vzeYK/s 0nnMMWl3gQTcC4au9yW0rtPwiP5H96xgSRi65kDdKMQuirITHGDaOx4Wn/31yYdUjzus 166K1v4vVCzHjDFj1U7n8eNdchTbwL44lS64BSs+fRe4ZzWo9pPfckJO3B21GdDLjB26 /k1Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1746463119; x=1747067919; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=e8cGfqMPxInbFVttVOXD20cN2ryppIoxJ2ihtNYujm8=; b=u0HojaLrIFmMLLsQF3K8aJYbsdT/8SjkbPeMN/4otB6qEZQ8esKin6z/+uvKVaM6xR e7V0w5dktZAbMKGhMrweydEegqACsNA3hJPmaa4qx5wggFvSDUOMr1WykdhYLLs4cOG4 2Gj9fDm0bUyISuEcdjJAecCKyvi/2PUWYUO22wKxADLd8eJ+byKiNz2kJyh0J+NLMh4w qd78fEuoPQRzUIFpOqSZq1Izh0nI9KGPMGUTjKh6UTmE3AHEFMNdWHGpxYK9WLaaZUMY DSbnZSbuVPlP1fzX6ainen8Tv3DA3DTZM7P06nciQFj/RqUX/3Yy1S9FBDKWHIhneMFo 7dbQ== X-Forwarded-Encrypted: i=1; AJvYcCXDghZo3xckmu66e9z/X0Gt6bFY1exFpwpWH3/d3pIHg8DFxqD8x3zCskJKApyr8S7yF/cGjJf0Qw==@kvack.org X-Gm-Message-State: AOJu0Yzb9MrxREukACcP4qFJ3J23menjOfbezjBm1Hx9R3qYb8LbP5rP 49nt3aEWSukrt+Nu5PuxaRGOgikERxDD5fmCyJDACzvRO9Y4DHIkkQmegc9Xlh5A3GJ7nd9lnWG prQvWf6S1LPHW7Ut5pDQNpm+ACyaNSlej/SqL X-Gm-Gg: ASbGncsQ+duoWza8gtwDkm4pAVtD0BI+xIHpn5jm+msRdci8plNgH6//57EJjqmtDH4 ugIgPl1SGI/yAqUs2wF26O5vM8mlHeqYCxcJu1RSBtVhkcfv3WU2lC2txhnSrg1Svs7l7gIWIUm g0TiqGglvRzRdT2foJzNZ9 X-Google-Smtp-Source: AGHT+IFWiSPtprBCFhUiQJ87HBkeklAt1J+ZsCn+rVHWym0jK4XLeniq9xKP5/CRbfUkWSCmfruKW7KYmB/qC+RSHJY= X-Received: by 2002:a05:622a:15d1:b0:486:a185:3136 with SMTP id d75a77b69052e-490cc658b8bmr651011cf.14.1746463118837; Mon, 05 May 2025 09:38:38 -0700 (PDT) MIME-Version: 1.0 References: <20250418174959.1431962-1-surenb@google.com> <20250418174959.1431962-8-surenb@google.com> <20250505-wachen-konform-3fe08f1b3214@brauner> In-Reply-To: <20250505-wachen-konform-3fe08f1b3214@brauner> From: Suren Baghdasaryan Date: Mon, 5 May 2025 09:38:27 -0700 X-Gm-Features: ATxdqUExh_T1r-sy8rTeLU-t84R_-ihK_Dse0jJQ6kQZSv2WSV8VGLRmcYSeIkA Message-ID: Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU To: Christian Brauner Cc: Jann Horn , Al Viro , linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz, peterx@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com, josef@toxicpanda.com, yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org, osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 27ED51C000D X-Stat-Signature: deyci7p1bnipjadxgwjxt655s7kwsbjk X-Rspam-User: X-HE-Tag: 1746463120-447614 X-HE-Meta: U2FsdGVkX1/hMaxhNDAjpqDqCXNJbAkR+WJL4WQA8bszbdLZxQvuluXwRzRUP0+PqejoBn2b5tNhOI0lmL5YlZub9QwyFef2Z+Ww4i1dl2dKl0obfwNMFVHKJn7mWw9k89dMqMEAiQlVY4soyAbTnyO6+MTvOhz71G2HW8xQibBjH3cGI1oEpyQhha3J9KzBFX8mEf3qy89DpNPCd0epv933Q4mEY8ttyKDd4nOxlYIXGNummkpPj7/4MN+U+Uyq9lijGHBwhDmpBCYSgCrY/vPw17OmVr7XytW56M4o1mpNMF7afF9Jh3klxyfZ+L1CNisKZSbL7lQGPNhwYKvuncSVq3xK7KU/P/yAsY+NnWcfpPdXejTw9ejdQaD7X3ibOp88K6unnarblq9lC7dYv40o0GrHdmdkQDGvzWffkns79HGpieS1ysah/QXeCaT43XbLSeo5+0DbaDc7jSMdR8ZEhntL98t4wmlKUfMai59TtRNnLNLfSwNAJkbvAmH7sjWxpMeOtBn0G8/YznWU30axLBipvkmz//amfxpvnyv2ypjA/rHkRK5m6H3HRUaaJyLkwLzLuF2vhoBzsc9OXGK1srlJHfeGVLPxGM1rvxlOxWcgvtZOF5kdTyX2o2KNWxDr+/pb/G6G6hNK9e4RqWeN0Bvs6nSLsZOD86nwhNvUyJpFpq2f8ujnv2A7zUnWuNak3B6CrBq4Uf+xY/W/lfNYEfWCYREfRTmpm0FhkXjIRC51nHMH+eRNogcEEVT8WEGy8HsIN2FusLFk5Ltmpi6fN5nYJbBU/vPu4n5aJuBZWaFGctCnXgh21O2fWw/YVu2gc3QfMph7XRH+G9b2pwkKxywxbEdcUaR47ch99oap6qnImtJneBUHxVC1r4EN9y6iI5Kwo0mYA9HsgD7iW5ykWZGDv/AacFS5bf0ZJi+6MYlxr4wYkcndTrFW8lP5zZW4x51apUeNOPbmrx+ sXXVLfZR WqAr8H0ELO5z2Td1R0MNgDlII8FM/y6BpYin3QkZO6xbdSk+nR8lNP+Dtj1nEcMHlLNddnK7B2tcVGSOpWiIcRPM6gk8xFWD2p4tqWlxBwBjag7sZ+Bo31A/6lR8zDgr7YwpMJUQSxtTL1stlDjSqWFwBXMKFWdg7RDzrJu5chCzXmf+fGMiX6ZxNZvkEMj37T36m0hPdZbSXSxDJEuD5o0cXuJxlqr4J/y3dxDUpWGnVBufaFkgjTmOu5g== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, May 5, 2025 at 6:50=E2=80=AFAM Christian Brauner wrote: > > On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote: > > On Tue, Apr 29, 2025 at 8:04=E2=80=AFPM Suren Baghdasaryan wrote: > > > On Tue, Apr 29, 2025 at 10:21=E2=80=AFAM Jann Horn = wrote: > > > > > > > > Hi! > > > > > > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu > > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow = I > > > > forgot all about that...) > > > > > > Does this fact affect your previous comments? Just want to make sure > > > I'm not missing something... > > > > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing > > down a VMA, and using get_file_rcu() for the lockless lookup, I did > > not realize that you could actually also race with all the other > > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I > > did not think about whether any of those code paths might leave a VMA > > with a dangling ->vm_file pointer. > > > > I guess maybe that means you really do need to do the lookup from the > > copied data, as you did in your patch; and that might require calling > > get_file_active() on the copied ->vm_file pointer (instead of > > get_file_rcu()), even though I think that is not really how > > get_file_active() is supposed to be used (it's supposed to be used > > when you know the original file hasn't been freed yet). Really what > > I think it's fine for get_file_active() to be used in this way. That > ->vm_file pointer usage should get a fat comment above it explaining how > what you're doing is safe. Got it. Will use it in my next version. Thanks! > > > you'd want for that is basically a raw __get_file_rcu(), but that is > > static and I think Christian wouldn't want to expose more of these > > internals outside VFS... > > Yeah, no. I don't want that to be usable outside of that file. > > > (In that case, all the stuff below about get_file_rcu() would be moot.) > > > > Or you could pepper WRITE_ONCE() over all the places that write > > ->vm_file, and ensure that ->vm_file is always NULLed before its > > reference is dropped... but that seems a bit more ugly to me. > > > > > > On Tue, Apr 29, 2025 at 7:09=E2=80=AFPM Suren Baghdasaryan wrote: > > > > > On Tue, Apr 29, 2025 at 8:40=E2=80=AFAM Jann Horn wrote: > > > > > > On Fri, Apr 18, 2025 at 7:50=E2=80=AFPM Suren Baghdasaryan wrote: > > > > > > > With maple_tree supporting vma tree traversal under RCU and v= ma and > > > > > > > its important members being RCU-safe, /proc/pid/maps can be r= ead under > > > > > > > RCU and without the need to read-lock mmap_lock. However vma = content > > > > > > > can change from under us, therefore we make a copy of the vma= and we > > > > > > > pin pointer fields used when generating the output (currently= only > > > > > > > vm_file and anon_name). Afterwards we check for concurrent ad= dress > > > > > > > space modifications, wait for them to end and retry. While we= take > > > > > > > the mmap_lock for reading during such contention, we do that = momentarily > > > > > > > only to record new mm_wr_seq counter. This change is designed= to reduce > > > > > > > mmap_lock contention and prevent a process reading /proc/pid/= maps files > > > > > > > (often a low priority task, such as monitoring/data collectio= n services) > > > > > > > from blocking address space updates. > > > > > > [...] > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > > > > --- a/fs/proc/task_mmu.c > > > > > > > +++ b/fs/proc/task_mmu.c > > > > > > [...] > > > > > > > +/* > > > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they a= re used by > > > > > > > + * show_map_vma. > > > > > > > + */ > > > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, = struct vm_area_struct *vma) > > > > > > > +{ > > > > > > > + struct vm_area_struct *copy =3D &priv->vma_copy; > > > > > > > + int ret =3D -EAGAIN; > > > > > > > + > > > > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > > > > + goto out; > > > > > > > > > > > > I think this uses get_file_rcu() in a different way than intend= ed. > > > > > > > > > > > > As I understand it, get_file_rcu() is supposed to be called on = a > > > > > > pointer which always points to a file with a non-zero refcount = (except > > > > > > when it is NULL). That's why it takes a file** instead of a fil= e* - if > > > > > > it observes a zero refcount, it assumes that the pointer must h= ave > > > > > > been updated in the meantime, and retries. Calling get_file_rcu= () on a > > > > > > pointer that points to a file with zero refcount, which I think= can > > > > > > happen with this patch, will cause an endless loop. > > > > > > (Just as background: For other usecases, get_file_rcu() is supp= osed to > > > > > > still behave nicely and not spuriously return NULL when the fil= e* is > > > > > > concurrently updated to point to another file*; that's what tha= t loop > > > > > > is for.) > > > > > > > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixab= le by > > > > > checking the return value of get_file_rcu() and retrying speculat= ion > > > > > if it changed. > > > > > > > > I think you could probably still end up looping endlessly in get_fi= le_rcu(). > > > > (Just to be clear: What I meant here is that get_file_rcu() loops > > *internally*; get_file_rcu() is not guaranteed to ever return if the > > pointed-to file has a zero refcount.) > > > > > By "retrying speculation" I meant it in the sense of > > > get_vma_snapshot() retry when it takes the mmap_read_lock and then > > > does mmap_lock_speculate_try_begin to restart speculation. I'm also > > > thinking about Liam's concern of guaranteeing forward progress for th= e > > > reader. Thinking maybe I should not drop mmap_read_lock immediately o= n > > > contention but generate some output (one vma or one page worth of > > > vmas) before dropping mmap_read_lock and proceeding with speculation. > > > > Hm, yeah, I guess you need that for forward progress... > > > > > > > > (If my understanding is correct, maybe we should document that = more > > > > > > explicitly...) > > > > > > > > > > Good point. I'll add comments for get_file_rcu() as a separate pa= tch. > > > > > > > > > > > > > > > > > Also, I think you are introducing an implicit assumption that > > > > > > remove_vma() does not NULL out the ->vm_file pointer (because t= hat > > > > > > could cause tearing and could theoretically lead to a torn poin= ter > > > > > > being accessed here). > > > > > > > > > > > > One alternative might be to change the paths that drop referenc= es to > > > > > > vma->vm_file (search for vma_close to find them) such that they= first > > > > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after = that, > > > > > > maybe with a new helper like this: > > > > > > > > > > > > static void vma_fput(struct vm_area_struct *vma) > > > > > > { > > > > > > struct file *file =3D vma->vm_file; > > > > > > > > > > > > if (file) { > > > > > > WRITE_ONCE(vma->vm_file, NULL); > > > > > > fput(file); > > > > > > } > > > > > > } > > > > > > > > > > > > Then on the lockless lookup path you could use get_file_rcu() o= n the > > > > > > ->vm_file pointer _of the original VMA_, and store the returned= file* > > > > > > into copy->vm_file. > > > > > > > > > > Ack. Except for storing the return value of get_file_rcu(). I thi= nk > > > > > once we detect that get_file_rcu() returns a different file we s= hould > > > > > bail out and retry. The change in file is an indication that the = vma > > > > > got changed from under us, so whatever we have is stale. > > > > > > > > What does "different file" mean here - what file* would you compare > > > > the returned one against? > > > > > > Inside get_vma_snapshot() I would pass the original vma->vm_file to > > > get_file_rcu() and check if it returns the same value. If the value > > > got changed we jump to /* Address space got modified, vma might be > > > stale. Re-lock and retry. */ section. That should work, right? > > > > Where do you get an "original vma->vm_file" from? > > > > To be clear, get_file_rcu(p) returns one of the values that *p had > > while get_file_rcu(p) is running.