From: Suren Baghdasaryan <surenb@google.com>
Date: Fri, 14 Apr 2023 14:51:59 -0700
Subject: Re: [PATCH 1/1] mm: handle swap page faults if the faulting page can be locked
To: Matthew Wilcox
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@suse.com,
    josef@toxicpanda.com, jack@suse.cz, ldufour@linux.ibm.com,
    laurent.dufour@fr.ibm.com, michel@lespinasse.org,
    liam.howlett@oracle.com, jglisse@google.com, vbabka@suse.cz,
    minchan@google.com, dave@stgolabs.net, punit.agrawal@bytedance.com,
    lstoakes@gmail.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@android.com
References: <20230414180043.1839745-1-surenb@google.com>

On Fri, Apr 14, 2023 at 1:32 PM Matthew Wilcox wrote:
>
> On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > > - We can call migration_entry_wait(). This will wait for PG_locked to
> > > become clear (in migration_entry_wait_on_locked()). As previously
> > > discussed offline, I think this is safe to do while holding the VMA
> > > locked.
>
> Just to be clear, this particular use of PG_locked is not during I/O,
> it's during page migration. This is a few orders of magnitude
> different.
>
> > > - We can call swap_readpage() if we allocate a new folio. I haven't
> > > traced through all this code to tell if it's OK.
>
> ... whereas this will wait for I/O. If we decide that's not OK, we'll
> need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.
>
> > > So ... I believe this is all OK, but we're definitely now willing to
> > > wait for I/O from the swap device while holding the VMA lock when we
> > > weren't before. And maybe we should make a bigger deal of it in the
> > > changelog.
> > >
> > > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > maybe we should be waiting for the folio lock with the VMA locked.
> >
> > Wouldn't that cause holding the VMA lock for the duration of swap I/O
> > (something you said we want to avoid in the previous paragraph) and
> > effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> > disk transfer") for VMA locks?
>
> I'm not certain we want to avoid holding the VMA lock for the duration
> of an I/O. Here's how I understand the rationale for avoiding holding
> the mmap_lock while we perform I/O (before the existence of the VMA lock):
>
>  - If everybody is doing page faults, there is no specific problem;
>    we all hold the lock for read and multiple page faults can be handled
>    in parallel.
>  - As soon as one thread attempts to manipulate the tree (eg calls
>    mmap()), all new readers must wait (as the rwsem is fair), and the
>    writer must wait for all existing readers to finish. That's
>    potentially milliseconds for an I/O during which time all page faults
>    stop.
>
> Now we have the per-VMA lock, faults which can be handled without taking
> the mmap_lock can still be satisfied, as long as that VMA is not being
> modified. It is rare for a real application to take a page fault on a
> VMA which is being modified.
>
> So modifications to the tree will generally not take VMA locks on VMAs
> which are currently handling faults, and new faults will generally not
> find a VMA which is write-locked.
>
> When we find a locked folio (presumably for I/O, although folios are
> locked for other reasons), if we fall back to taking the mmap_lock
> for read, we increase contention on the mmap_lock and make the page
> fault wait on any mmap() operation.

Do you mean we increase mmap_lock contention by holding the mmap_lock
from the start of the page fault retry until the point where we drop it
in __folio_lock_or_retry()?

> If we simply sleep waiting for the
> I/O, we make any mmap() operation _which touches this VMA_ wait for
> the I/O to complete. But I think that's OK, because new page faults
> can continue to be serviced ... as long as they don't need to take
> the mmap_lock.

Ok, so we will potentially block VMA writers for the duration of the
I/O... Stupid question: why was this a bigger problem for mmap_lock?
Potentially our address space can consist of only one anon VMA, so
locking that VMA vs mmap_lock should be the same from a swap page
fault's POV. Maybe mmap_lock is taken for write in some other
important cases where the VMA lock is not needed?
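For context, what d065bd810b6d buys us under mmap_lock is, very roughly,
the following shape in the fault path (a simplified sketch, not the exact
code; the real logic lives in folio_lock_or_retry()/__folio_lock_or_retry()):

	/*
	 * Simplified sketch of the d065bd810b6d behaviour: if the folio
	 * is locked (typically for I/O), do not sleep with mmap_lock
	 * held. Drop the lock, wait for the I/O without blocking mmap()
	 * writers, and let the arch fault handler retry the fault.
	 */
	if (!folio_trylock(folio)) {
		mmap_read_unlock(vma->vm_mm);
		folio_wait_locked(folio);	/* wait for I/O, lock not held */
		return VM_FAULT_RETRY;		/* re-fault once it is done */
	}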
>
> So ... I think what we _really_ want here is ...
>
> +++ b/mm/filemap.c
> @@ -1690,7 +1690,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
>  bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>  			   unsigned int flags)
>  {
> -	if (fault_flag_allow_retry_first(flags)) {
> +	if (!(flags & FAULT_FLAG_VMA_LOCK) &&
> +	    fault_flag_allow_retry_first(flags)) {
>  		/*
>  		 * CAUTION! In this case, mmap_lock is not released
>  		 * even though return 0.
> @@ -1710,7 +1711,8 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>
>  		ret = __folio_lock_killable(folio);
>  		if (ret) {
> -			mmap_read_unlock(mm);
> +			if (!(flags & FAULT_FLAG_VMA_LOCK))
> +				mmap_read_unlock(mm);
>  			return false;
>  		}
>  	} else {
>
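One more thing we would have to sort out with the above: the
do_swap_page() caller currently assumes a false return means mmap_lock
was already dropped for it. With FAULT_FLAG_VMA_LOCK it would have to
drop the VMA lock itself, something like this (a sketch on top of the
current caller; the vma_end_read() placement is an assumption, not part
of your diff):

	locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
	if (!locked) {
		/*
		 * Sketch: with FAULT_FLAG_VMA_LOCK nothing was unlocked
		 * on our behalf (we only get here if the killable folio
		 * lock was interrupted), so release the VMA read lock
		 * before asking for a retry.
		 */
		if (vmf->flags & FAULT_FLAG_VMA_LOCK)
			vma_end_read(vma);
		ret |= VM_FAULT_RETRY;
		goto out_release;
	}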