From: Barry Song <21cnbao@gmail.com>
Date: Tue, 7 May 2024 20:24:46 +1200
Subject: Re: [PATCH v3 3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
To: Ryan Roberts
Cc: David Hildenbrand, akpm@linux-foundation.org, linux-mm@kvack.org, baolin.wang@linux.alibaba.com, chrisl@kernel.org, hanchuanhua@oppo.com, hannes@cmpxchg.org, hughd@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org, surenb@google.com, v-songbaohua@oppo.com, willy@infradead.org, xiang@kernel.org, ying.huang@intel.com, yosryahmed@google.com, yuzhao@google.com, ziy@nvidia.com
In-Reply-To: <0bca057d-7344-40a6-a981-9a7a9347a19f@arm.com>
References: <20240503005023.174597-1-21cnbao@gmail.com> <20240503005023.174597-4-21cnbao@gmail.com> <7548e30c-d56a-4a57-ab87-86c9c8e523b1@arm.com> <0d20d8af-e480-4eb8-8606-1e486b13fd7e@redhat.com> <0bca057d-7344-40a6-a981-9a7a9347a19f@arm.com>

On Tue, May 7, 2024 at 8:14 PM Ryan Roberts wrote:
>
> On 06/05/2024 09:31, David Hildenbrand wrote:
> > On 06.05.24 10:20, Barry Song wrote:
> >> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand wrote:
> >>>
> >>> On 04.05.24 01:40, Barry Song wrote:
> >>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts wrote:
> >>>>>
> >>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>> From: Barry Song
> >>>>>>
> >>>>>> There could arise a necessity to obtain the first pte_t from a swap
> >>>>>> pte_t located in the middle. For instance, this may occur within the
> >>>>>> context of do_swap_page(), where a page fault can potentially occur in
> >>>>>> any PTE of a large folio. To address this, the following patch introduces
> >>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
> >>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
> >>>>>
> >>>>> You mean pte_next_swp_offset()?
> >>>>
> >>>> yes.
> >>>>
> >>>>>
> >>>>>> will directly invoke it with delta = 1.
> >>>>>>
> >>>>>> Suggested-by: "Huang, Ying"
> >>>>>> Signed-off-by: Barry Song
> >>>>>> ---
> >>>>>>  mm/internal.h | 25 +++++++++++++++++++++----
> >>>>>>  1 file changed, 21 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/internal.h b/mm/internal.h
> >>>>>> index c5552d35d995..cfe4aed66a5c 100644
> >>>>>> --- a/mm/internal.h
> >>>>>> +++ b/mm/internal.h
> >>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>>>>>  }
> >>>>>>
> >>>>>>  /**
> >>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> >>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> >>>>>> + * forward or backward by delta
> >>>>>>   * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >>>>>>   *       non_swap_entry() must be false.
> >>>>>> + * @delta: The direction and the offset we are moving; forward if delta
> >>>>>> + *         is positive; backward if delta is negative
> >>>>>>   *
> >>>>>> - * Increments the swap offset, while maintaining all other fields, including
> >>>>>> + * Moves the swap offset, while maintaining all other fields, including
> >>>>>>   * swap type, and any swp pte bits. The resulting pte is returned.
> >>>>>>   */
> >>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
> >>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
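
The hunk is trimmed in the quote above; for reference, the resulting helper
ends up looking roughly like the below (an untested sketch, assuming it keeps
the soft-dirty, exclusive and uffd-wp swp pte bit handling of the existing
pte_next_swp_offset(), which then becomes a delta == 1 wrapper):

static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
	swp_entry_t entry = pte_to_swp_entry(pte);
	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
						   (swp_offset(entry) + delta)));

	/* Carry over whichever software swp pte bits the old pte had set. */
	if (pte_swp_soft_dirty(pte))
		new = pte_swp_mksoft_dirty(new);
	if (pte_swp_exclusive(pte))
		new = pte_swp_mkexclusive(new);
	if (pte_swp_uffd_wp(pte))
		new = pte_swp_mkuffd_wp(new);

	return new;
}

static inline pte_t pte_next_swp_offset(pte_t pte)
{
	return pte_move_swp_offset(pte, 1);
}
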
> >>>>>
> >>>>> We have equivalent functions for pfn:
> >>>>>
> >>>>>   pte_next_pfn()
> >>>>>   pte_advance_pfn()
> >>>>>
> >>>>> Although the latter takes an unsigned long and only moves forward currently. I
> >>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
> >>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
> >>>>>
> >>>>> I guess we don't have a need for that and it adds more churn.
> >>>>
> >>>> We might have a need in the below case.
> >>>> A forks B, then A and B share large folios. B unmaps/exits, then the large
> >>>> folios of process A become single-mapped.
> >>>> Right now, while writing A's folios, we are CoWing A's large folios into
> >>>> many small folios. I believe we can reuse the entire large folio instead
> >>>> of doing nr_pages CoWs and page faults.
> >>>> In this case, we might want to get the first PTE from vmf->pte.
> >>>
> >>> Once we have COW reuse for large folios in place (I think you know that
> >>> I am working on that), it might make sense to "COW-reuse around",
> >>
> >> TBH, I don't know if you are working on that. Please Cc me next time :-)
> >
> > I could have sworn I mentioned it to you already :)
> >
> > See
> >
> > https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
> >
> > I'll follow up on that soonish (now that batching is upstream and the large
> > mapcount is on its way upstream).
> >
> >>
> >>> meaning we look if some neighboring PTEs map the same large folio and
> >>> map them writable as well. But whether it's really worth it, increasing page
> >>> fault latency, is to be decided separately.
> >>
> >> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
> >> Perhaps we can discover a more cost-effective method to signify that a large
> >> folio is probably singly mapped?
> >
> > Yes, precisely what I am up to!
> >
> >> and only call "multi-PTEs" reuse while that
> >> condition is true in PF and avoid increasing latency always?
> >
> > I'm thinking along those lines:
> >
> > If we detect that it's exclusive, we can certainly map the current PTE
> > writable. Then, we can decide how much (and if) we want to fault around
> > writable as an optimization.
> >
> > For smallish large folios, it might make sense to try faulting around most of
> > the folio.
> >
> > For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
> > to fault around the whole thing -- especially if there is little benefit to be
> > had from contig-pte bits.
> >
> >>
> >>>
> >>>>
> >>>> Another case might be:
> >>>> A forks B, and we write either A or B; we might CoW an entire large folio
> >>>> instead of CoWing nr_pages small folios.
> >>>>
> >>>> Case 1 seems more useful; I might have a go after some days. Then we might
> >>>> see pte_move_pfn().
> >>> pte_move_pfn() does sound odd to me.
>
> Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
> Perhaps just pte_advance_swp_offset() with a negative value is clearer about
> what it's doing?

I am not a native speaker, but the dictionary says advance: "move forward in a
purposeful way; a forward movement." Now we are moving backward or forward :-)

> >>> It might not be required to
> >>> implement the optimization described above. (It's easier to simply read
> >>> another PTE, check if it maps the same large folio, and to batch from there.)
>
> Yes agreed.
>
> >>>
> >>
> >> It appears that your proposal suggests potential reusability as follows: if we
> >> have a large folio containing 16 PTEs, you might consider reusing only 4 by
> >> examining PTEs "around" but not necessarily all 16 PTEs.
> >> please correct me if my understanding is wrong.
> >>
> >> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
> >> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
> >> in nr_pages, thus enabling complete reuse of the whole folio.
> >
> > Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
> > there might be easier and cleaner.
> >
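
I guess something like the below is what you mean -- a rough, untested sketch
for the present-PTE (CoW reuse) case, with NULL, VMA-range and page-table
boundary checks omitted, and the batching call itself left as a comment since
its exact arguments aren't the point here:

	struct vm_area_struct *vma = vmf->vma;
	struct page *page = vm_normal_page(vma, vmf->address, ptep_get(vmf->pte));
	struct folio *folio = page_folio(page);
	/* "X": the faulting page's index within the large folio */
	long idx = folio_page_idx(folio, page);
	pte_t *first_ptep = vmf->pte - idx;
	unsigned long first_addr = vmf->address - idx * PAGE_SIZE;
	pte_t first_pte = ptep_get(first_ptep);

	if (pte_present(first_pte) &&
	    vm_normal_folio(vma, first_addr, first_pte) == folio) {
		/*
		 * The folio's first page is still mapped where we expect it;
		 * batch from first_ptep (e.g. via folio_pte_batch()) to see
		 * how many PTEs still map this folio before deciding whether
		 * to reuse it as a whole.
		 */
	}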