References: <2fe5ce7e-9c5c-4df4-b4fc-9fd3d9b2dccb@arm.com>
 <20231104093423.170054-1-v-songbaohua@oppo.com>
 <2c98be67-657e-4c65-bf6b-3d70ff596c64@arm.com>
In-Reply-To: <2c98be67-657e-4c65-bf6b-3d70ff596c64@arm.com>
From: Barry Song <21cnbao@gmail.com>
Date: Thu, 9 Nov 2023 05:04:17 +0800
Subject: Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
To: Ryan Roberts
Cc: steven.price@arm.com, akpm@linux-foundation.org, david@redhat.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.com,
 shy828301@gmail.com, wangkefeng.wang@huawei.com, willy@infradead.org,
 xiang@kernel.org, ying.huang@intel.com, yuzhao@google.com, Barry Song,
 nd@arm.com

On Thu, Nov 9, 2023 at 4:21 AM Ryan Roberts wrote:
>
> On 08/11/2023 11:23, Barry Song wrote:
> > On Wed, Nov 8, 2023 at 2:05 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts wrote:
> >>>
> >>> On 04/11/2023 09:34, Barry Song wrote:
> >>>>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> >>>>> and if failing then arch_prepare_to_swap() would need to put things back
> >>>>> how they were with calls to mte_invalidate_tags() (although I think
> >>>>> you'd actually want to refactor to create a function which takes a
> >>>>> struct page *).
> >>>>>
> >>>>> Steve
> >>>>
> >>>> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> >>>> One tricky thing is that we are restoring one page rather than folio
> >>>> in arch_restore_swap() as we are only swapping in one page at this
> >>>> stage.
> >>>>
> >>>> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> >>>>
> >>>> This patch makes MTE tags saving and restoring support large folios,
> >>>> then we don't need to split them into base pages for swapping on
> >>>> ARM64 SoCs with MTE.
> >>>>
> >>>> This patch moves arch_prepare_to_swap() to take folio rather than
> >>>> page, as we support THP swap-out as a whole. And this patch also
> >>>> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> >>>> needs it.
> >>>>
> >>>> Signed-off-by: Barry Song
> >>>> ---
> >>>>  arch/arm64/include/asm/pgtable.h | 21 +++------------
> >>>>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> >>>>  include/linux/huge_mm.h          | 12 ---------
> >>>>  include/linux/pgtable.h          |  2 +-
> >>>>  mm/page_io.c                     |  2 +-
> >>>>  mm/swap_slots.c                  |  2 +-
> >>>>  6 files changed, 51 insertions(+), 32 deletions(-)
> >>>>
> >>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >>>> index b19a8aee684c..d8f523dc41e7 100644
> >>>> --- a/arch/arm64/include/asm/pgtable.h
> >>>> +++ b/arch/arm64/include/asm/pgtable.h
> >>>> @@ -45,12 +45,6 @@
> >>>>          __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >>>>
> >>>> -static inline bool arch_thp_swp_supported(void)
> >>>> -{
> >>>> -        return !system_supports_mte();
> >>>> -}
> >>>> -#define arch_thp_swp_supported arch_thp_swp_supported
> >>>> -
> >>>>  /*
> >>>>   * Outside of a few very special situations (e.g. hibernation), we always
> >>>>   * use broadcast TLB invalidation instructions, therefore a spurious page
> >>>> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >>>>  #ifdef CONFIG_ARM64_MTE
> >>>>
> >>>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> >>>> -static inline int arch_prepare_to_swap(struct page *page)
> >>>> -{
> >>>> -        if (system_supports_mte())
> >>>> -                return mte_save_tags(page);
> >>>> -        return 0;
> >>>> -}
> >>>> +#define arch_prepare_to_swap arch_prepare_to_swap
> >>>> +extern int arch_prepare_to_swap(struct folio *folio);
> >>>>
> >>>>  #define __HAVE_ARCH_SWAP_INVALIDATE
> >>>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> >>>> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> >>>>  }
> >>>>
> >>>>  #define __HAVE_ARCH_SWAP_RESTORE
> >>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>>> -{
> >>>> -        if (system_supports_mte())
> >>>> -                mte_restore_tags(entry, &folio->page);
> >>>> -}
> >>>> +#define arch_swap_restore arch_swap_restore
> >>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >>>>
> >>>>  #endif /* CONFIG_ARM64_MTE */
> >>>>
> >>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> >>>> index a31833e3ddc5..14a479e4ea8e 100644
> >>>> --- a/arch/arm64/mm/mteswap.c
> >>>> +++ b/arch/arm64/mm/mteswap.c
> >>>> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >>>>          mte_free_tag_storage(tags);
> >>>>  }
> >>>>
> >>>> +static inline void __mte_invalidate_tags(struct page *page)
> >>>> +{
> >>>> +        swp_entry_t entry = page_swap_entry(page);
> >>>> +        mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> >>>> +}
> >>>> +
> >>>>  void mte_invalidate_tags_area(int type)
> >>>>  {
> >>>>          swp_entry_t entry = swp_entry(type, 0);
> >>>> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> >>>>          }
> >>>>          xa_unlock(&mte_pages);
> >>>>  }
> >>>> +
> >>>> +int arch_prepare_to_swap(struct folio *folio)
> >>>> +{
> >>>> +        int err;
> >>>> +        long i;
> >>>> +
> >>>> +        if (system_supports_mte()) {
> >>>> +                long nr = folio_nr_pages(folio);
> >>>
> >>> nit: there should be a clear line between variable declarations and logic.
> >>
> >> right.
> >>
> >>>
> >>>> +                for (i = 0; i < nr; i++) {
> >>>> +                        err = mte_save_tags(folio_page(folio, i));
> >>>> +                        if (err)
> >>>> +                                goto out;
> >>>> +                }
> >>>> +        }
> >>>> +        return 0;
> >>>> +
> >>>> +out:
> >>>> +        while (--i)
> >>>
> >>> If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
> >>> then it will wrap and run ~forever. I think you meant `while (i--)`?
> >>
> >> nop. if i=0 and we goto out, that means the page0 has failed to save tags,
> >> there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
> >> saved, we restore 0,1,2 and we don't restore 3.
> >
> > I am terribly sorry for my previous noise. You are right, Ryan. i
> > actually meant i--.
>
> No problem - it saves me from writing a long response explaining why --i is
> wrong, at least!
>
> >
> >>
> >>>
> >>>> +                __mte_invalidate_tags(folio_page(folio, i));
> >>>> +        return err;
> >>>> +}
> >>>> +
> >>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>>> +{
> >>>> +        if (system_supports_mte()) {
> >>>> +                /*
> >>>> +                 * We don't support large folios swap in as whole yet, but
> >>>> +                 * we can hit a large folio which is still in swapcache
> >>>> +                 * after those related processes' PTEs have been unmapped
> >>>> +                 * but before the swapcache folio is dropped, in this case,
> >>>> +                 * we need to find the exact page which "entry" is mapping
> >>>> +                 * to. If we are not hitting swapcache, this folio won't be
> >>>> +                 * large
> >>>> +                 */
> >>>
> >>> So the currently defined API allows a large folio to be passed but the caller is
> >>> supposed to find the single correct page using the swap entry? That feels quite
> >>> nasty to me.
> >>> And that's not what the old version of the function was doing; it
> >>> always assumed that the folio was small and passed the first page (which also
> >>> doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
> >>> to fix that. If the old version is correct, then I guess this version is wrong.
> >>
> >> the original version(mainline) is wrong but it works as once we find the SoCs
> >> support MTE, we will split large folios into small pages. so only small pages
> >> will be added into swapcache successfully.
> >>
> >> but now we want to swap out large folios even on SoCs with MTE as a whole,
> >> we don't split, so this breaks the assumption do_swap_page() will always get
> >> small pages.
> >
> > let me clarify this more. The current mainline assumes arch_swap_restore()
> > always get a folio with only one page. this is true as we split large folios
> > if we find SoCs have MTE. but since we are dropping the split now, that means
> > a large folio can be gotten by do_swap_page(). we have a chance that
> > try_to_unmap_one() has been done but folio is not put. so PTEs will have swap
> > entry but folio is still there, and do_swap_page() to hit cache directly and
> > the folio won't be released.
> >
> > but after getting the large folio in do_swap_page, it still only takes one
> > basepage particularly for the faulted PTE and maps this 4KB PTE only. so it
> > uses the faulted swap_entry and the folio as parameters to call
> > arch_swap_restore() which can be something like:
> >
> > do_swap_page()
> > {
> >         arch_swap_restore(the swap entry for the faulted 4KB PTE, large folio);
> > }
>
> OK, I understand what's going on, but it seems like a bad API decision. I think
> Steve is saying the same thing; If its only intended to operate on a single
> page, it would be much clearer to pass the actual page rather than the folio;
> i.e. leave the complexity of figuring out the target page to the caller, which
> understands all this.

Right.

>
> As a side note, if the folio is still in the cache, doesn't that imply that the
> tags haven't been torn down yet? So perhaps you can avoid even making the call
> in this case?

Right, but in practice that is very hard, because arch_swap_restore() is
always called unconditionally and it is hard to find a decent condition to
check before calling it. That is why we already make lots of redundant
arch_swap_restore() calls right now.

For example, A forks B, C, D, E, F and G. Now A, B, C, D, E, F and G share
one page before CoW. After the page is swapped out, if B is the first process
to swap in, B adds the page to the swapcache and restores the MTE tags. After
that, A, C, D, E, F and G directly hit the page swapped in by B, and each of
them restores the MTE tags again. So the tags are restored 7 times, although
only B actually needs to do it.

So it seems we could add a condition that lets only B do the restore. But
that won't work, because we can't guarantee that B is the first process to do
the PTE mapping. A, C, D, E, F and G can map their PTEs earlier than B even
though B is the one that did the swap-in I/O. Swap-in (adding the page to the
swapcache) and PTE mapping are not done atomically, and PTE mapping needs to
take the PTL. So after B has done the swap-in, A, C, D, E, F and G can still
begin to use the page earlier than B.

So it turns out that whoever maps the page first should restore the MTE tags,
but the question is: how could A, B, C, D, E, F or G know whether it is the
first one to map the page to a PTE?

>
> >>
> >>>
> >>> Thanks,
> >>> Ryan
> >
> > Thanks
> > Barry
>

Thanks
Barry
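
For reference, here is a minimal user-space sketch of the error-unwind loop
discussed in the review above, written with `while (i--)`. It is not kernel
code: save_one(), invalidate_one(), the saved[] array and the fail_at
parameter are hypothetical stand-ins for mte_save_tags(),
__mte_invalidate_tags() and the per-page tag storage, used only to make the
loop behaviour checkable.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PAGES 4

/* Hypothetical stand-in for per-page saved tag state. */
static bool saved[NR_PAGES];

/* Hypothetical stand-in for mte_save_tags(): fails when i == fail_at. */
static int save_one(long i, long fail_at)
{
	if (i == fail_at)
		return -1;
	saved[i] = true;
	return 0;
}

/* Hypothetical stand-in for __mte_invalidate_tags(). */
static void invalidate_one(long i)
{
	saved[i] = false;
}

/*
 * Save pages 0..nr-1. If page i fails, undo pages i-1..0 and return the
 * error. Pass fail_at = -1 to simulate a run with no failure.
 */
static int prepare_all(long nr, long fail_at)
{
	long i;
	int err;

	assert(nr <= NR_PAGES);

	for (i = 0; i < nr; i++) {
		err = save_one(i, fail_at);
		if (err)
			goto out;
	}
	return 0;

out:
	/*
	 * Post-decrement: the test sees the old value of i, the body sees
	 * i - 1. With i == 0 the body never runs; with i == 3 the body runs
	 * for 2, 1 and 0. A pre-decrement `while (--i)` would stop before
	 * page 0 when i > 0, and would start from -1 when i == 0.
	 */
	while (i--)
		invalidate_one(i);
	return err;
}

/* Count how many pages currently hold saved state. */
static long count_saved(void)
{
	long i, n = 0;

	for (i = 0; i < NR_PAGES; i++)
		n += saved[i];
	return n;
}
```

Calling prepare_all(4, 3) saves pages 0, 1 and 2, fails on page 3 and then
clears all three again, while prepare_all(4, 0) fails immediately and touches
nothing, which matches the behaviour Ryan describes for `while (i--)`.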