From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5C6A9C5478C for ; Fri, 23 Feb 2024 09:46:17 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C141D6B0075; Fri, 23 Feb 2024 04:46:16 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id BC3726B0078; Fri, 23 Feb 2024 04:46:16 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A654A6B007B; Fri, 23 Feb 2024 04:46:16 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 933776B0075 for ; Fri, 23 Feb 2024 04:46:16 -0500 (EST) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 3E8AB160F47 for ; Fri, 23 Feb 2024 09:46:16 +0000 (UTC) X-FDA: 81822587952.25.4DAD2E8 Received: from mail-ua1-f43.google.com (mail-ua1-f43.google.com [209.85.222.43]) by imf16.hostedemail.com (Postfix) with ESMTP id 6819C18000B for ; Fri, 23 Feb 2024 09:46:14 +0000 (UTC) Authentication-Results: imf16.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Ic3F2cGq; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf16.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.43 as permitted sender) smtp.mailfrom=21cnbao@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1708681574; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=2M7A9OXj74j9IPRozvW1OVuzfe+3xdUDawNt5dPev8s=; b=PbAgNTukauuJvhAiMpQWfTIVYAjkYi0yn0GMTgEpQRrtUksYA0T4TtSIv5VmqLOkOlN2yA XjIkGBGZ47+Hg/A+aKSkxwPf+uO65/uNavyOEk4BaX9oXzR08t27R9GZiw5VNFj1zoJl27 ERZkjMPtoo5d2S5X4AE2VU7vMPtJcyY= ARC-Authentication-Results: i=1; imf16.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Ic3F2cGq; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf16.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.43 as permitted sender) smtp.mailfrom=21cnbao@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1708681574; a=rsa-sha256; cv=none; b=DS00T0JDayDeXpCncP6+OW1fJkcsJIFP/iPgJwAzh5I9mGBJ4bh8mZA7Aq2RBbY4BCJN5W aVx53SSRTamPodNmkpJLaFNX8U1Ms3QJfZ/3UrlgYh/+KZ1SlqFEB+Robg7bLycskgEi+h wNgVx5a9jKsb64LN9UDLGMaFWWcJu7c= Received: by mail-ua1-f43.google.com with SMTP id a1e0cc1a2514c-7d5bbbe592dso95646241.0 for ; Fri, 23 Feb 2024 01:46:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1708681573; x=1709286373; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=2M7A9OXj74j9IPRozvW1OVuzfe+3xdUDawNt5dPev8s=; b=Ic3F2cGq+7gkHVw+XizbkXiWTFZXmqh0lqVmOeGNMGL3R3DlYiBBjQXdlShPqgUp9/ N2R9ecWlrNVyasITagI0sc3D2i0w8nkXs+i8nuUE3LdTDgvZbPmqlG5dUTExlUR+LFzB +EAHPJuYadV4aWTFubuQbPUdcRtzlnUNt9psVwIvI830WtuqDL4qOxc5+Fj4/LuiD2oV KQWK5xidP+zf0c5r/9oNMjER71TUzg9RMZ6fw9LLwuTrcoAiKwYUziQCBVEQA+Kf9Vs1 5GSRSLGNBSuDHp5b0rcGyLKmo5z/3Qh9V1n+X9S105wwJ0Jk+LMBCd/m/SWgOnTJvsyw A+Mg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1708681573; x=1709286373; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=2M7A9OXj74j9IPRozvW1OVuzfe+3xdUDawNt5dPev8s=; b=dHzWpQd54G+Mt1syJBAm1j/ay6Sp+HSdDjjkR61sa25mRa5ELzMub9d6KQoqVD07Rr D+lq8ew+4Ay0adQJOWXWFcTeV/oKlE4XxN4igW5WNY4tFdC/Du/qboaqAbtTP8bgdNSc grE5cK+sO6dQR+Rx5mihIQmJr2CJ3p8+Zmdx/uavN5axZXIW4I+91+PMxnLHfK53j7j8 6sjK8SKZW+s2yv9KV7D23H4TajpzxbyJgCsqqqT5y//uxjlxUDalHn8JkBrtiyXxYwwR zZIkj1RPqot+S7OCxIP9cRrsTgphuH/v5UoMd5ojJB+Va/1Sbx96/zU36RQofA61FWYH ktZw== X-Forwarded-Encrypted: i=1; AJvYcCWB6byZBFoBwjIjlLAZgWH2wruiyS6yWFmx5DtZf2/HlQiFbpuHB+U30j44VpVMSk5VgbY9oifo7IzqGgs3ZmM0Fv4= X-Gm-Message-State: AOJu0YxxUKobZ1FrSkTUdG2M+HrDG10CkxvUS5NAeEsPqRDYVWlk0BKl M64VSOMd57ORhqGx+6mxp5fEmRbpDiYNB5tMjcri6xuYKaxb2QFWKfkCg9Z/UgzTXYKabcVIYkf fjzXSI6HG2mrGD524GOSdJCcMSLs= X-Google-Smtp-Source: AGHT+IGTOYkpTHymA0CLqK81q3hxRHctyivYZqYvk894pQGeM/8NnWaokwxnxIg6HftWAlCLa9kLVmeCbgfrWMLOk1k= X-Received: by 2002:a67:f41a:0:b0:470:5c4a:86ab with SMTP id p26-20020a67f41a000000b004705c4a86abmr1295771vsn.0.1708681573263; Fri, 23 Feb 2024 01:46:13 -0800 (PST) MIME-Version: 1.0 References: <20231025144546.577640-5-ryan.roberts@arm.com> <20240222070544.133673-1-21cnbao@gmail.com> <1a9fcdcd-c0dd-46dd-9c03-265a6988eeb2@redhat.com> In-Reply-To: <1a9fcdcd-c0dd-46dd-9c03-265a6988eeb2@redhat.com> From: Barry Song <21cnbao@gmail.com> Date: Fri, 23 Feb 2024 22:46:01 +1300 Message-ID: Subject: Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting To: David Hildenbrand Cc: ryan.roberts@arm.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com, willy@infradead.org, xiang@kernel.org, ying.huang@intel.com, yuzhao@google.com, chrisl@kernel.org, surenb@google.com, hanchuanhua@oppo.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 6819C18000B X-Rspam-User: X-Rspamd-Server: rspam02 X-Stat-Signature: uy6pbksgbm9sbsxswf7mcyehxijh3tdu X-HE-Tag: 1708681574-521546 X-HE-Meta: U2FsdGVkX18FCjFeFkl3T4ENPeGSr6ogDjuv5xmprIeP80b5BK7Qe+R0DW17ecrE6E/MEPpV/hwbslR5pUUCoCTcnBjR+GP8xxApb1VdyydPaYj4RZ4G+M0hQsY5Zn0PdxyyNtcVZW3prT7KzkrHMi/HCCDjlVx4yi1I442eoUCWUciJF5fN6nHP+CrhP/LaERZj0nd6f2sPtulXHMTdMOKEeT5J/vpHm12uxomvQaIBaTKxrpnZvwkfgVbGvrNXM2Jc1aalE5bsZhWAms4INF4Iu22U/a1oK7SzsrfiztSd1swnT3AtSN+3iXClrPNVVIQxUJAxryXSrM7Q/Bmx26nFjgLa5fgmho6smgv7dGlS6lFlXRBZRLw/t6/sEJ3p03d103Po+qQowwPt0zM0ubdJFrT/HelUw+akuH4VsVHetVF9XMOIAFt+b6z5eV2QC7EmTcR37acZOaioNkDFXpZ7iaCEk+rQQFDOq9Nhh2ex54CZG4AsRQwD/ga52geCS0Nc+3fzmbtCjV35jdi4Ry1bLOUK+vXGjwmfFcdcJ52r55nTzMUG2jt5XF9HfhD1+T6BnrWppXKp2u2tZcxW8dQuo+T2V6uaqkeYDNz9c6YJvO2w5bT4BMzu4QS7P/NNRLbhqaBhqPbDXfp5X5EPqneHxuFnZ7cR64U01ubb7THoqq+TO2mDF6ZUFVZAkr2iRhHzXNN/eJKJ/9hB8CPYtHAQLpgGKDTArZEnclIpEG0phNa2rbiXT0C+0Fw9KQpz7wuVpidvGegpvAj9rezAJTKlFoYL11fHcvtzFt4dVcsxWyCoNFOK69bMBOnwv4t90WA3uF6eqb6GKTnkFu0ZKXptSWWUeZIX7ayI+eTJ3SqLriXyqGRsXsUV7Vo2nZyK7AZBUvl/6gtfTy4MSe++v+rfRg4MycN7q1OXtyweodyKAr2IrsHkmyGZaKXvUD6OuQB1Oqa6DDa/S0TgJn+ i01yNEn/ w5wj8q+O6KRpoq7qsbbF5GRkZmijNfDTiweO223o4j72FmraqZaiFfwsVWQ7Sl4K5DBwBXMvppR3quDCvgv5eRgLCEUsaeBRiYwCzb1g4FnVIgV87RRKpmqkShPyEay6wGfR+Kpkpfd6jWO705jSRTcrQginM7AMujpjVmIVzkvjnaG6oUiGsOYvkfmWap79O6PijSLHaRJIneWnQ1HR/l7m/lj9nAi85E3osGglAMKQn5AiTRkzgGF6oZ8eGjOta9KYb4gLuiSzZ4rv+WbWWGvDN2O2xp8SBfimS2MSz0xGU1m3Bno7++mZqcRthCz9+DO+TLjmRHKyktmTQdMqqOdabebep4DVeMAw90HEZQPVJJHXaVLq9J0jUO4Mm57mBp0YW2ST6LfromZXoAEp16MobwW+6vyNZ0AGtt5pVn9JKCY7M4nX2sZ3WIQskreFUoY64DEMdUdSHA64= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Feb 22, 2024 at 11:09=E2=80=AFPM David Hildenbrand wrote: > > On 22.02.24 08:05, Barry Song wrote: > > Hi Ryan, > > > >> diff --git a/mm/vmscan.c b/mm/vmscan.c > >> index 2cc0cb41fb32..ea19710aa4cd 100644 > >> --- a/mm/vmscan.c > >> +++ b/mm/vmscan.c > >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct l= ist_head *folio_list, > >> if (!can_split_folio(folio, NULL)= ) > >> goto activate_locked; > >> /* > >> - * Split folios without a PMD map= right > >> - * away. Chances are some or all = of the > >> - * tail pages can be freed withou= t IO. > >> + * Split PMD-mappable folios with= out a > >> + * PMD map right away. Chances ar= e some > >> + * or all of the tail pages can b= e freed > >> + * without IO. > >> */ > >> - if (!folio_entire_mapcount(folio)= && > >> + if (folio_test_pmd_mappable(folio= ) && > >> + !folio_entire_mapcount(folio)= && > >> split_folio_to_list(folio, > >> folio_lis= t)) > >> goto activate_locked; > > > > I ran a test to investigate what would happen while reclaiming a partia= lly > > unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEE= D > > 4KB~64KB, and keep the first subpage 0~4KiB. > > IOW, something that already happens with ordinary THP already IIRC. > > > > > My test wants to address my three concerns, > > a. whether we will have leak on swap slots > > b. whether we will have redundant I/O > > c. whether we will cause races on swapcache > > > > what i have done is printing folio->_nr_pages_mapped and dumping 16 swa= p_map[] > > at some specific stage > > 1. just after add_to_swap (swap slots are allocated) > > 2. before and after try_to_unmap (ptes are set to swap_entry) > > 3. before and after pageout (also add printk in zram driver to dump all= I/O write) > > 4. before and after remove_mapping > > > > The below is the dumped info for a particular large folio, > > > > 1. after add_to_swap > > [ 27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1 > > [ 27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-= 40-40-40-40-40-40-40 > > > > as you can see, > > _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40) > > > > > > 2. before and after try_to_unmap > > [ 27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr= :1 > > [ 27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43= pte_p:ffff0000c36a8f80 > > [ 27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:= 0 > > [ 27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-= 40-40-40-40-40-40-40 > > > > as you can see, one pte is set to swp_entry, and _nr_pages_mapped becom= es > > 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1 > > > > 3. before and after pageout > > [ 27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0 > > [ 27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-= 40-40-40-40-40-40-40 > > [ 27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0 > > [ 27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1 > > [ 27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2 > > [ 27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3 > > [ 27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4 > > [ 27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5 > > [ 27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6 > > [ 27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7 > > [ 27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8 > > [ 27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9 > > [ 27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba > > [ 27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb > > [ 27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc > > [ 27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd > > [ 27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be > > [ 27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf > > [ 27.276102] vmscan: after pageout and before needs_release shrink_fo= lio_list 2124 mapnr:0 > > > > as you can see, obviously, we have done redundant I/O - 16 zram_write_p= age though > > 4~64KiB has been zap_pte_range before, we still write them to zRAM. > > > > 4. before and after remove_mapping > > [ 27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-= 40-40-40-40-40-40-40 > > [ 27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapn= r:0 offset:0 > > [ 27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00= -00-00-00-00 > > > > as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1. > > all SWAP_HAS_CACHE has been removed. This is perfect and there is no sw= ap > > slot leak at all! > > > > Thus, only two concerns are left for me, > > 1. as we don't split anyway, we have done 15 unnecessary I/O if a large= folio > > is partially unmapped. > > 2. large folio is added as a whole as a swapcache covering the range wh= ose > > part has been zapped. I am not quite sure if this will cause some probl= ems > > while some concurrent do_anon_page, swapin and swapout occurs between 3= and > > 4 on zapped subpage1~subpage15. still struggling.. my brain is explodin= g... > > Just noting: I was running into something different in the past with > THP. And it's effectively the same scenario, just swapout and > MADV_DONTNEED reversed. > > Imagine you swapped out a THP and the THP it still is in the swapcache. > > Then you unmap/zap some PTEs, freeing up the swap slots. > > In zap_pte_range(), we'll call free_swap_and_cache(). There, we run into > the "!swap_page_trans_huge_swapped(p, entry)", and we won't be calling > __try_to_reclaim_swap(). I guess you mean swap_page_trans_huge_swapped(p, entry) not !swap_page_trans_huge_swapped(p, entry) ? at that time, swap_page_trans_huge_swapped should be true as there are stil= l some entries whose swap_map=3D0x41 or above (SWAP_HAS_CACHE and swap_count >=3D 1) static bool swap_page_trans_huge_swapped(struct swap_info_struct *si, swp_entry_t entry, unsigned int nr_pages) { ... for (i =3D 0; i < nr_pages; i++) { if (swap_count(map[offset + i])) { ret =3D true; break; } } unlock_out: unlock_cluster_or_swap_info(si, ci); return ret; } So this will stop the swap free even for those ptes which have been zapped? Another case I have reported[1] is that while reclaiming a large folio, in try_to_unmap_one, we are calling page_vma_mapped_walk(). as it only begins to hold PTL after it hits a valid pte, a paralel break-before-make might make 0nd, 1st and beginning PTEs zero, try_to_unmap_one can read intermediate ptes value, thus we can run into this below case. afte try_to_unmap_one, pte 0 - untouched, present pte pte 1 - untouched, present pte pte 2 - swap entries pte 3 - swap entries ... pte n - swap entries or pte 0 - untouched, present pte pte 1 - swap entries pte 2 - swap entries pte 3 - swap entries ... pte n - swap entries etc. Thus, after try_to_unmap, the folio is still mapped with some ptes becoming swap entries, some PTEs are still present. it might be staying in swapcache for a long time with a broken CONT-PTE. I also hate that and hope for a SYNC way to let large folio hold PTL from t= he 0nd pte, thus, it won't get intermediate PTEs from other break-before-make. This also doesn't increase PTL contention as my test shows we will always get PTL for a large folio after skipping zero, one or two PTEs in try_to_unmap_one. but skipping 1 or 2 is really bad and sad, breaking a large folio from the = same whole to nr_pages different segments. [1] https://lore.kernel.org/linux-mm/CAGsJ_4wo7BiJWSKb1K_WyAai30KmfckMQ3-mC= JPXZ892CtXpyQ@mail.gmail.com/ > > So we won't split the large folio that is in the swapcache and it will > continue consuming "more memory" than intended until fully evicted. > > > > > To me, it seems safer to split or do some other similar optimization if= we find a > > large folio has partial map and unmap. > > I'm hoping that we can avoid any new direct users of _nr_pages_mapped if > possible. > Is _nr_pages_mapped < nr_pages a reasonable case to split as we have known the folio has at least some subpages zapped? > If we find that the folio is on the deferred split list, we might as > well just split it right away, before swapping it out. That might be a > reasonable optimization for the case you describe. i tried to change Ryan's code as below @@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, * PMD map right away. Chances are = some * or all of the tail pages can be = freed * without IO. + * Similarly, split PTE-mapped foli= os if + * they have been already deferred_split. */ - if (folio_test_pmd_mappable(folio) = && - !folio_entire_mapcount(folio) &= & - split_folio_to_list(folio, - folio_list)= ) + if (((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) || + (!folio_test_pmd_mappable(folio) && !list_empty(&folio->_deferred_list))) + && split_folio_to_list(folio, folio_list)) goto activate_locked; } if (!add_to_swap(folio)) { It seems to work as expected. only one I/O is left for a large folio with 16 PTEs but 15 of them have been zapped before. > > -- > Cheers, > > David / dhildenb > Thanks Barry