From: Suren Baghdasaryan
Date: Sun, 21 Jan 2024 23:23:47 -0800
Subject: Re: [RFC 0/3] reading proc/pid/maps under RCU
To: Vlastimil Babka
Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk, brauner@kernel.org,
 jack@suse.cz, dchinner@redhat.com, casey@schaufler-ca.com,
 ben.wolsieffer@hefring.com, paulmck@kernel.org, david@redhat.com,
 avagin@google.com, usama.anjum@collabora.com, peterx@redhat.com,
 hughd@google.com, ryan.roberts@arm.com, wangkefeng.wang@huawei.com,
 Liam.Howlett@oracle.com, yuzhao@google.com, axelrasmussen@google.com,
 lstoakes@gmail.com, talumbau@google.com, willy@infradead.org,
 mgorman@techsingularity.net, jhubbard@nvidia.com, vishal.moola@gmail.com,
 mathieu.desnoyers@efficios.com, dhowells@redhat.com, jgg@ziepe.ca,
 sidhartha.kumar@oracle.com, andriy.shevchenko@linux.intel.com,
 yangxingui@huawei.com, keescook@chromium.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, kernel-team@android.com
References: <20240115183837.205694-1-surenb@google.com>
 <1bc8a5df-b413-4869-8931-98f5b9e82fe5@suse.cz>
 <74005ee1-b6d8-4ab5-ba97-92bec302cc4b@suse.cz>

On Thu, Jan 18, 2024 at 9:58 AM Suren Baghdasaryan wrote:
>
> On Tue, Jan 16, 2024 at 9:57 AM Suren Baghdasaryan wrote:
> >
> > On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka wrote:
> > >
> > > On 1/16/24 15:42, Vlastimil Babka wrote:
> > > > On 1/15/24 19:38, Suren Baghdasaryan wrote:
> > > >
> > > > Hi,
> > > >
> > > >> The issue this patchset is trying to address is mmap_lock contention when
> > > >> a low priority task (monitoring, data collecting, etc.)
> > > >> blocks a higher priority task from making updates to the address space.
> > > >> The contention is due to the mmap_lock being held for read when reading
> > > >> proc/pid/maps.
> > > >> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
> > > >> locks make VMA access RCU-safe. This provides an opportunity for lock-less
> > > >> reading of proc/pid/maps. We still need to overcome a couple of obstacles:
> > > >> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> > > >> RCU-safe;
> > > >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> > > >> page boundaries only, does not get worse.
> > > >
> > > > Hm I thought we were to only choose this more complicated approach in case
> > > > additional tearing becomes a problem, and at first assume that if software
> > > > can deal with page boundary tearing, it can deal with sub-page tearing too?
> >
> > Hi Vlastimil,
> > Thanks for the feedback!
> > Yes, originally I thought we wouldn't be able to avoid additional
> > tearing without a big change but then realized it's not that hard, so
> > I tried to keep the change in behavior transparent to the userspace.
>
> In the absence of other feedback I'm going to implement and post the
> originally envisioned approach: remove the validation step and avoid any
> possibility of blocking, while allowing for sub-page tearing. Will use
> Matthew's rwsem_wait() to deal with possible inconsistent maple_tree
> state.

I posted v1 at
https://lore.kernel.org/all/20240122071324.2099712-1-surenb@google.com/
In the RFC I used mm_struct.mm_lock_seq to detect if mm is being
changed, but then I realized that won't work: mm_struct.mm_lock_seq is
incremented after mm is changed and right before mmap_lock is
write-unlocked. Instead I need a counter that changes once we
write-lock mmap_lock and before any mm changes. So the new patchset
introduces a separate counter to detect possible mm changes.
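
A toy, single-threaded model of the ordering problem described above, for
illustration only. It is not kernel code and is not taken from the patchset;
every name in it (toy_mm, unlock_seq, lock_seq, ...) is made up. It assumes
only what the paragraph above states: one counter is bumped right before
write-unlock, the other right after write-lock and before any change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an address space; not struct mm_struct. */
struct toy_mm {
	unsigned long unlock_seq;	/* bumped right before write-unlock */
	unsigned long lock_seq;		/* bumped right after write-lock */
	int layout;			/* stands in for the VMA tree contents */
};

/* Counter values a lock-less reader recorded before it started reading. */
struct toy_snapshot {
	unsigned long unlock_seq;
	unsigned long lock_seq;
};

static void toy_write_lock(struct toy_mm *mm)   { mm->lock_seq++; }
static void toy_modify(struct toy_mm *mm)       { mm->layout++; }
static void toy_write_unlock(struct toy_mm *mm) { mm->unlock_seq++; }

static struct toy_snapshot toy_take_snapshot(const struct toy_mm *mm)
{
	struct toy_snapshot snap = { mm->unlock_seq, mm->lock_seq };
	return snap;
}

static bool toy_unlock_seq_changed(const struct toy_mm *mm,
				   const struct toy_snapshot *snap)
{
	return mm->unlock_seq != snap->unlock_seq;
}

static bool toy_lock_seq_changed(const struct toy_mm *mm,
				 const struct toy_snapshot *snap)
{
	return mm->lock_seq != snap->lock_seq;
}

/*
 * Reader takes a snapshot, then a writer locks and modifies but has not
 * unlocked yet. Returns true if the unlock-time counter missed the change
 * while the lock-time counter caught it.
 */
bool toy_midwrite_scenario(void)
{
	struct toy_mm mm = { 0, 0, 0 };
	struct toy_snapshot snap = toy_take_snapshot(&mm);

	toy_write_lock(&mm);
	toy_modify(&mm);	/* layout already differs from the snapshot */

	bool missed = !toy_unlock_seq_changed(&mm, &snap);
	bool caught = toy_lock_seq_changed(&mm, &snap);

	toy_write_unlock(&mm);
	/* The unlock-time counter moves only now, after the change is done. */
	assert(toy_unlock_seq_changed(&mm, &snap));

	return missed && caught;
}
```

In this model a reader comparing only unlock_seq would accept data read while
the writer was mid-update, which is the window a counter bumped at write-lock
time is meant to close. Real readers and writers run concurrently and need
proper memory ordering, which this sketch deliberately leaves out.
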
In addition, I could not use rwsem_wait() and instead had to take
mmap_lock for read to wait for the writer to finish and then record
the new counter while holding mmap_lock for read. That prevents
concurrent mm changes while we are recording the new counter value.

> Thanks,
> Suren.
>
> >
> > > >
> > > >> The patchset deals with these issues but there is a downside which I would
> > > >> like to get input on:
> > > >> This change introduces unfairness towards the reader of proc/pid/maps,
> > > >> which can be blocked by an overly active/malicious address space modifier.
> > > >
> > > > So this is a consequence of the validate() operation, right? We could avoid
> > > > this if we allowed sub-page tearing.
> >
> > Yes, if we don't care about sub-page tearing then we could get rid of
> > the validate step and this issue with updaters blocking the reader would
> > go away. If we choose that direction there will be one more issue to
> > fix, namely the maple_tree temporary inconsistent state when a VMA is
> > replaced with another one and we might observe NULL there. We might be
> > able to use Matthew's rwsem_wait() to deal with that issue.
> >
> > > >
> > > >> A few ways I thought we could address this issue are:
> > > >> 1. Fall back to taking mmap_lock after several lock-less retries (or
> > > >> some time limit).
> > > >> 2. Employ lock-less reading only if the reader has low priority,
> > > >> indicating that blocking it is not critical.
> > > >> 3. Introduce a separate procfs file which publishes the same data in
> > > >> a lock-less manner.
> > >
> > > Oh and if this option 3 becomes necessary, then such a new file shouldn't
> > > validate() either, and whoever wants to avoid the reader contention and
> > > converts their monitoring to the new file will have to account for this
> > > possible extra tearing from the start.
> > > So I would suggest trying to change the existing file with no validate()
> > > first, and if existing userspace gets broken, employ option 3. This would
> > > mean no validate() in either case?
> >
> > Yes but I was trying to avoid introducing an additional file which
> > publishes the same content in a slightly different way. We will have
> > to explain when userspace should use one vs the other and that would
> > require going into low level implementation details, I think. Don't
> > know if that's acceptable/preferable.
> > Thanks,
> > Suren.
> >
> > >
> > > >> I imagine a combination of these approaches can also be employed.
> > > >> I would like to get feedback on this from the Linux community.
> > > >>
> > > >> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> > > >> can be replaced with the more efficient rwsem_wait() proposed by Matthew
> > > >> in [1].
> > > >>
> > > >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> > > >>
> > > >> Suren Baghdasaryan (3):
> > > >>   mm: make vm_area_struct anon_name field RCU-safe
> > > >>   seq_file: add validate() operation to seq_operations
> > > >>   mm/maps: read proc/pid/maps under RCU
> > > >>
> > > >>  fs/proc/internal.h        |   3 +
> > > >>  fs/proc/task_mmu.c        | 130 ++++++++++++++++++++++++++++++++++----
> > > >>  fs/seq_file.c             |  24 ++++++-
> > > >>  include/linux/mm_inline.h |  10 ++-
> > > >>  include/linux/mm_types.h  |   3 +-
> > > >>  include/linux/seq_file.h  |   1 +
> > > >>  mm/madvise.c              |  30 +++++++--
> > > >>  7 files changed, 181 insertions(+), 20 deletions(-)
> > > >>
> > > >
> > >