From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2D535C5475B for ; Thu, 29 Feb 2024 01:33:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 829406B00A2; Wed, 28 Feb 2024 20:33:53 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 7B2466B00A3; Wed, 28 Feb 2024 20:33:53 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 605256B00A4; Wed, 28 Feb 2024 20:33:53 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 4A83A6B00A2 for ; Wed, 28 Feb 2024 20:33:53 -0500 (EST) Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 177E8C1245 for ; Thu, 29 Feb 2024 01:33:53 +0000 (UTC) X-FDA: 81843119946.18.2CD37A7 Received: from mail-vk1-f178.google.com (mail-vk1-f178.google.com [209.85.221.178]) by imf08.hostedemail.com (Postfix) with ESMTP id 47070160003 for ; Thu, 29 Feb 2024 01:33:51 +0000 (UTC) Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=lp2tYBlY; spf=pass (imf08.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.221.178 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709170431; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=qLYSx3phCLrW6DjF1IDFBmQIkWWNdBzi6bZC46PrVDQ=; b=IeFn0Jtzqe03oBnE+yDu9DEaWR8MR9wblWdWH9rHxPMyioqGW+/xS4lRuqRJIy26H+Uwtc 6+Z/zPV1Rlahc61gktAG+uVTh8DwFT+AHYaQLaIMRSgsmASrxJ/XCx3mfrdYbfNEj8vBq4 Ob2ironBCsjryqSNASdK9IE4KfdK4Sg= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709170431; a=rsa-sha256; cv=none; b=LEaR/E5+R4MMh8CWkEyMQwRSh1nshBW+yO4kvYgZMaxgQZESVFj8Z/b9+oKESnQuibyX/B mIPs/QCsb7vqmr3K6fbtPmcWTAEeHHFulogBsFjlC4QG0IwT75/iGqSmubG8DlgufdSlxw 14uiuPhVGV+b7WpvSpObw2Lbt3HYfqM= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=lp2tYBlY; spf=pass (imf08.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.221.178 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com Received: by mail-vk1-f178.google.com with SMTP id 71dfb90a1353d-4d332d0db9cso80361e0c.3 for ; Wed, 28 Feb 2024 17:33:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1709170430; x=1709775230; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=qLYSx3phCLrW6DjF1IDFBmQIkWWNdBzi6bZC46PrVDQ=; b=lp2tYBlYTBCvcm1BgkzCWbEf4jOwFdEfAS3J3nAWjz/GrzSBLaxSKpNl+W5Ogkt3BY wKRsuoRAShb4nRsaSEa2xOUL02P4vKsvnQU6C3woQI2WPRfH7vXwlHYMwzf4JwuX28He BNCo3r1TGupes0YFBrdiZW8GKlzFDnhttMO9kTnW898cSose8/nQDEyKp/sAgXmUM4S4 ezZyAZY313TAnJD3h113TxuIZ2bWuOgVa2coDaFw1WTQyDBirWwO08US6a2aAVsxD+Nt h4N+3XhWhRgdgf4gvZjHIAkvRhiytOjJkPYazWhXGJWcBEJ0kBw+J3Mg0NCMWtJuETTa p8Ig== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709170430; x=1709775230; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=qLYSx3phCLrW6DjF1IDFBmQIkWWNdBzi6bZC46PrVDQ=; b=SNJIdQaFvltEq6jdd1R6nExNym2Z97kV0jeJP3NbJtM6ZDG4uhiEvdGPhPVcxJUNx4 LpjeiFrxgChD5KE1rJ6UMU8NpUFsGpqoyEe8nJb97FQS9JcqM0kkMBS78kMYITOs8iug 1gJ7rrEuBDVtyHr2jPzTTLtUUo4pPACdEI3lPzSNnzlSb/7vSbPQ1wF9m1bFQPqXKilN vW4B2QmIT579fenMqcrhiW1M1Q5nrMKMRdJ2VpH147PRjGruKvaOuxeAwRClPXVr2csd Mnqx3qxHeQVXjUGVT/Pu5KhCjcJ/KdHJ8tpHADXelMnr4egDdE93P9w338hfoXJBs88s cYUw== X-Forwarded-Encrypted: i=1; AJvYcCWKyrjiv3GjZyTgPL4ox9UKqciGjGHU+C2QML7zHk5EQlKaGhtJF7QYGDxHHyKf2e4+Q3r92P1+P0qLslGrIvPrqfc= X-Gm-Message-State: AOJu0YxaXZ+H8LqW/UOrvLdavyxnbhhIHhz8dSYBJdh3hEdTD1Voq4kF 48uTOSA8vcq2qGLVsJIBTH5s0m6+8/T8ZdTdcSkVtws9efm9W/M5IUCQyU6NiQtYft75KopPA0a 7Id+xcy5qfhzdiMtyUgyQO2NpptU= X-Google-Smtp-Source: AGHT+IEfiVS1j/W6czzK/rI3j/StrrCUE/xhlh23zu9XRt4ZT2hWFqS4BXH74iLSO6xMePmpAnzQkVCmMWfjwSoHNPs= X-Received: by 2002:a1f:ea82:0:b0:4cd:b718:4b08 with SMTP id i124-20020a1fea82000000b004cdb7184b08mr603758vkh.11.1709170430033; Wed, 28 Feb 2024 17:33:50 -0800 (PST) MIME-Version: 1.0 References: <20240229003753.134193-1-21cnbao@gmail.com> <20240229003753.134193-5-21cnbao@gmail.com> In-Reply-To: From: Barry Song <21cnbao@gmail.com> Date: Thu, 29 Feb 2024 14:33:38 +1300 Message-ID: Subject: Re: [PATCH RFC v2 4/5] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in To: Yosry Ahmed Cc: akpm@linux-foundation.org, david@redhat.com, linux-mm@kvack.org, ryan.roberts@arm.com, chrisl@kernel.org, linux-kernel@vger.kernel.org, mhocko@suse.com, shy828301@gmail.com, steven.price@arm.com, surenb@google.com, wangkefeng.wang@huawei.com, willy@infradead.org, xiang@kernel.org, ying.huang@intel.com, yuzhao@google.com, kasong@tencent.com, nphamcs@gmail.com, chengming.zhou@linux.dev, hannes@cmpxchg.org, linux-arm-kernel@lists.infradead.org, Barry Song , Hugh Dickins , Minchan Kim , SeongJae Park Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 47070160003 X-Rspam-User: X-Stat-Signature: isoswwirfdrmt83pk66ru4o1hc4zxi8s X-Rspamd-Server: rspam03 X-HE-Tag: 1709170431-1314 X-HE-Meta: U2FsdGVkX18F5agQop7Q6F+HAxb/1b2wq3Dqau7dmE32Wjb/UToYuSaw0FzcqmApFtqKwJLbGujFV44BR9/0cQIAws0rvKpK1vYK5BUfka5q5fsSH9DBmWkZdYbxgTukuY7tuLT28BiXfVKwtUOWdfDV6OPo50FKj/4dsZtE/8fUP5UdS5nW+bMDF5m8/XH3BazS+h1lWyUdaBCogp2DmaHi9RgaxcIz0F2T9uci5RzRpTyLftge3d2e6/wuZDDUNST0T+PYFKP0j+WwxAemw0K5++O949gqZYJTkyxpifp6GovIJacYKI12rgNs92fgMtF2w4Ywg8NDT8/e8llvnfvhsQTsiMPrAWT51K+YMyPtbIs5U4vqrIPgYSgYtUibfKcVdH0VADzEWX6jXQciZqSIYBlMsRRgTKV8x5V/R4ieIiEKj07M1b+y7JVcK+qR4Z2F3q1BNn0jkPBTZuOHbkOI4Z2pzcWdVM14LyJhAttX9zONp6Sw3SqUntHAgd5hYuSakDxVfH5pppBrm3clipMeN2CbmXTPj2MNv32F6oiok0HB8SHEeAunOA71ZHujzfEPHHYwBDk47kuBSSabaXvtQyUhkB5zrVpss/J3acB5C2QeIl2m2HJptoTG+jLFZH9t0apz5JzOEs3EuLddHzOpISkKFh1YA7Jlt2/4dqkGh+U+VfP8PaCAlJzlNcXBampOjUwkOYDC9yuyMVC4zXgvDw9Fifoku+1NAAcMDWfIq9I6BTRasj+rXPGsKu1vpCq5BhmMPvqO2W6qwdVmTf54pxeOYD4kZhjtLJ3FUuHHtWw8VWBtBRjHBpZxhaBK216kDznalhuZejFMiOurT5UodDDFZXwjYIg4hHm8sDHw9XRWWH1Ei329GjE0t5UZrkJKstAqc88zU38FsosOLbfVo6IaPOHaNhpxNePES/nl5iQMBBIQnusOHz/alloHUbZgtlHhS8MvQ0kuPA8 4wpE39v3 qVQz05zzUcaMxD9GnGu3rtl0dZtckIscJTcNlntSCKgWAYVQsM6WjTJX/5Xf1yDATEqIHRSQaZYtJ1kRLAsRohx9TyB4LH0vPrrPFVfDKMQ9jnC/Fq9EYraIG4rW6TySW0/RUtjb+j+Q5Fqo9JQz2i4hqgu7WWiZPfCaNw1zd0h0fo2ilX6uIa3y18NHkHdu0UNFAltACiGFXq5S+B3Wk0Fn4wJu3fVhWmA0wRSFChzSVObxIE/iKdgpy+ET+ZMOAfIf5/ME39VBKESgSPB72xqnK8Ae4e6X3dYjj204L7B4MH3mkKnKhXi1sh39gjlXLrQbf6jTarfYCiqpXIoOkBQAy8gdBkh6VU41xYzDmkee7LVQLdAwUpF9BEtJCxXZGcqvmoaQKdLkHchTxSm4BYfP7RbRA3gXBVzABrgVWu+IwBBdWvxO3CwPZcZect7JofTwgyttNsEM8oq8sBtVGB2dInuPVfnDDElZkbQRx31Wkhg0OUiEhEFI29EVxFHOiGCP/QV71D+ihdHewRbrOMWI6Lm7ba989N9HACWd69QR4+WUfL6/DG2MFh4CmdS9xItcmfFaJ+A5AUmzzNUOpRCDhxdzJtrbH0CY3BouAX6DVGyxBwb/lIvNM4h+fAaquvdo4/Hx8t1SVYjoIJeZxvss8yDV6jPYmoBnVOz4G5Fm+xXc+lbvQ1KEGQZyxJTUTVQroD5uVoVtjymHVS2UVSzgcM3ejZVDQzt1qu7GtbzI2yvs5Bdlbl5lrFzgXUnbxGVhIbySQC+VSQJH+ULZztSa1S05oydHEBP3bzgvcZvn2WR62iPav8bJ/eRChvvllhwSwWKMo/48tLpP6q+0aVLQySl5ls51OCc2M X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Feb 29, 2024 at 2:12=E2=80=AFPM Yosry Ahmed = wrote: > > On Wed, Feb 28, 2024 at 4:59=E2=80=AFPM Barry Song <21cnbao@gmail.com> wr= ote: > > > > On Thu, Feb 29, 2024 at 1:52=E2=80=AFPM Yosry Ahmed wrote: > > > > > > On Wed, Feb 28, 2024 at 4:39=E2=80=AFPM Barry Song <21cnbao@gmail.com= > wrote: > > > > > > > > From: Barry Song > > > > > > > > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") s= upports > > > > one entry only, to support large folio swap-in, we need to handle m= ultiple > > > > swap entries. > > > > > > > > Cc: Kairui Song > > > > Cc: "Huang, Ying" > > > > Cc: Yu Zhao > > > > Cc: David Hildenbrand > > > > Cc: Chris Li > > > > Cc: Hugh Dickins > > > > Cc: Johannes Weiner > > > > Cc: Matthew Wilcox (Oracle) > > > > Cc: Michal Hocko > > > > Cc: Minchan Kim > > > > Cc: Yosry Ahmed > > > > Cc: Yu Zhao > > > > Cc: SeongJae Park > > > > Signed-off-by: Barry Song > > > > --- > > > > include/linux/swap.h | 1 + > > > > mm/swap.h | 1 + > > > > mm/swapfile.c | 117 ++++++++++++++++++++++++++-------------= ---- > > > > 3 files changed, 72 insertions(+), 47 deletions(-) > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > index b3581c976e5f..2691c739d9a4 100644 > > > > --- a/include/linux/swap.h > > > > +++ b/include/linux/swap.h > > > > @@ -480,6 +480,7 @@ extern int add_swap_count_continuation(swp_entr= y_t, gfp_t); > > > > extern void swap_shmem_alloc(swp_entry_t); > > > > extern int swap_duplicate(swp_entry_t); > > > > extern int swapcache_prepare(swp_entry_t); > > > > +extern int swapcache_prepare_nr(swp_entry_t, int nr); > > > > extern void swap_free(swp_entry_t); > > > > extern void swap_nr_free(swp_entry_t entry, int nr_pages); > > > > extern void swapcache_free_entries(swp_entry_t *entries, int n); > > > > diff --git a/mm/swap.h b/mm/swap.h > > > > index fc2f6ade7f80..1cec991efcda 100644 > > > > --- a/mm/swap.h > > > > +++ b/mm/swap.h > > > > @@ -42,6 +42,7 @@ void delete_from_swap_cache(struct folio *folio); > > > > void clear_shadow_from_swap_cache(int type, unsigned long begin, > > > > unsigned long end); > > > > void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr= y); > > > > +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t e= ntry, int nr); > > > > struct folio *swap_cache_get_folio(swp_entry_t entry, > > > > struct vm_area_struct *vma, unsigned long addr); > > > > struct folio *filemap_get_incore_folio(struct address_space *mappi= ng, > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > > > index c0c058ee7b69..c8c8b6dbaeda 100644 > > > > --- a/mm/swapfile.c > > > > +++ b/mm/swapfile.c > > > > @@ -3308,7 +3308,7 @@ void si_swapinfo(struct sysinfo *val) > > > > } > > > > > > > > /* > > > > - * Verify that a swap entry is valid and increment its swap map co= unt. > > > > + * Verify that nr swap entries are valid and increment their swap = map count. > > > > * > > > > * Returns error code in following case. > > > > * - success -> 0 > > > > @@ -3318,66 +3318,73 @@ void si_swapinfo(struct sysinfo *val) > > > > * - swap-cache reference is requested but the entry is not used. = -> ENOENT > > > > * - swap-mapped reference requested but needs continued swap coun= t. -> ENOMEM > > > > */ > > > > -static int __swap_duplicate(swp_entry_t entry, unsigned char usage= ) > > > > +static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned= char usage) > > > > { > > > > struct swap_info_struct *p; > > > > struct swap_cluster_info *ci; > > > > unsigned long offset; > > > > - unsigned char count; > > > > - unsigned char has_cache; > > > > - int err; > > > > + unsigned char count[SWAPFILE_CLUSTER]; > > > > + unsigned char has_cache[SWAPFILE_CLUSTER]; > > > > > > > Hi Yosry, > > > > Thanks for reviewing! > > > > > I am not closely following this series, but a couple of things caught= my eyes. > > > > > > Is this reasonable for stack usage? > > > > SWAPFILE_CLUSTER is not huge. typically 512 or 256. > > So that's 1K of stack usage out of 16K total on x86. I think this may > be a lot for a single function to use, but perhaps others will > disagree. > > > > > #ifdef CONFIG_THP_SWAP > > #define SWAPFILE_CLUSTER HPAGE_PMD_NR > > > > #define swap_entry_size(size) (size) > > #else > > #define SWAPFILE_CLUSTER 256 > > > > If this is still a concern, I can move it to a bitmap. > > > > > > > > > + int err, i; > > > > > > > > p =3D swp_swap_info(entry); > > > > > > > > offset =3D swp_offset(entry); > > > > ci =3D lock_cluster_or_swap_info(p, offset); > > > > > > > > - count =3D p->swap_map[offset]; > > > > - > > > > - /* > > > > - * swapin_readahead() doesn't check if a swap entry is vali= d, so the > > > > - * swap entry could be SWAP_MAP_BAD. Check here with lock h= eld. > > > > - */ > > > > - if (unlikely(swap_count(count) =3D=3D SWAP_MAP_BAD)) { > > > > - err =3D -ENOENT; > > > > - goto unlock_out; > > > > - } > > > > + for (i =3D 0; i < nr; i++) { > > > > + count[i] =3D p->swap_map[offset + i]; > > > > > > > > - has_cache =3D count & SWAP_HAS_CACHE; > > > > - count &=3D ~SWAP_HAS_CACHE; > > > > - err =3D 0; > > > > - > > > > - if (usage =3D=3D SWAP_HAS_CACHE) { > > > > - > > > > - /* set SWAP_HAS_CACHE if there is no cache and entr= y is used */ > > > > - if (!has_cache && count) > > > > - has_cache =3D SWAP_HAS_CACHE; > > > > - else if (has_cache) /* someone else add= ed cache */ > > > > - err =3D -EEXIST; > > > > - else /* no users remaini= ng */ > > > > + /* > > > > + * swapin_readahead() doesn't check if a swap entry= is valid, so the > > > > + * swap entry could be SWAP_MAP_BAD. Check here wit= h lock held. > > > > + */ > > > > + if (unlikely(swap_count(count[i]) =3D=3D SWAP_MAP_B= AD)) { > > > > err =3D -ENOENT; > > > > + goto unlock_out; > > > > + } > > > > > > > > > > Here we immediately exit if there is an error, but we don't below, we > > > just keep overwriting the error every iteration as far as I can tell. > > > Also, it doesn't seem like we do any cleanups if we hit an error > > > halfway through. Should we undo previously updated swap entries, or a= m > > > I missing something here? > > > > we are safely immediately exiting because we don't change swap_map > > till we finish all checks. while all checks are done, we write them by > > WRITE_ONCE(p->swap_map[offset + i], count[i] | has_cache[i]); > > at the end. > > I see, but I think we may be overwriting the error from each iteration be= low? you are right, Yosry, i used to have the below to break. don't know when I accidentally dropped it :-) diff --git a/mm/swapfile.c b/mm/swapfile.c index c8c8b6dbaeda..091966056aec 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3369,6 +3369,9 @@ static int __swap_duplicate_nr(swp_entry_t entry, int nr, unsigned char usage) err =3D -ENOMEM; } else err =3D -ENOENT; /* unused swap en= try */ + + if (err) + break; } if (!err) { > > > > > > > > > > > > > > - } else if (count || has_cache) { > > > > - > > > > - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) > > > > - count +=3D usage; > > > > - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) > > > > - err =3D -EINVAL; > > > > - else if (swap_count_continued(p, offset, count)) > > > > - count =3D COUNT_CONTINUED; > > > > - else > > > > - err =3D -ENOMEM; > > > > - } else > > > > - err =3D -ENOENT; /* unused swap en= try */ > > > > - > > > > - if (!err) > > > > - WRITE_ONCE(p->swap_map[offset], count | has_cache); > > > > + has_cache[i] =3D count[i] & SWAP_HAS_CACHE; > > > > + count[i] &=3D ~SWAP_HAS_CACHE; > > > > + err =3D 0; > > > > + > > > > + if (usage =3D=3D SWAP_HAS_CACHE) { > > > > + > > > > + /* set SWAP_HAS_CACHE if there is no cache = and entry is used */ > > > > + if (!has_cache[i] && count[i]) > > > > + has_cache[i] =3D SWAP_HAS_CACHE; > > > > + else if (has_cache[i]) /* someone = else added cache */ > > > > + err =3D -EEXIST; > > > > + else /* no users= remaining */ > > > > + err =3D -ENOENT; > > > > + } else if (count[i] || has_cache[i]) { > > > > + > > > > + if ((count[i] & ~COUNT_CONTINUED) < SWAP_MA= P_MAX) > > > > + count[i] +=3D usage; > > > > + else if ((count[i] & ~COUNT_CONTINUED) > SW= AP_MAP_MAX) > > > > + err =3D -EINVAL; > > > > + else if (swap_count_continued(p, offset + i= , count[i])) > > > > + count[i] =3D COUNT_CONTINUED; > > > > + else > > > > + err =3D -ENOMEM; > > > > + } else > > > > + err =3D -ENOENT; /* unused= swap entry */ > > > > + } > > > > > > > > + if (!err) { > > > > + for (i =3D 0; i < nr; i++) > > > > + WRITE_ONCE(p->swap_map[offset + i], count[i= ] | has_cache[i]); > > > > Here is the place where we really write data. Before that, we only > > touched temp variables. > > > > > > + } > > > > unlock_out: > > > > unlock_cluster_or_swap_info(p, ci); > > > > return err; > > > > } > > thanks Barry