From: Kairui Song <ryncsn@gmail.com>
Date: Wed, 10 Jan 2024 11:32:21 +0800
Subject: Re: [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
To: "Huang, Ying"
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Hugh Dickins,
 Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
 David Hildenbrand, linux-kernel@vger.kernel.org
In-Reply-To: <875y039utw.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240102175338.62012-1-ryncsn@gmail.com>
 <20240102175338.62012-9-ryncsn@gmail.com>
 <875y039utw.fsf@yhuang6-desk2.ccr.corp.intel.com>

Huang, Ying wrote on Tue, Jan 9, 2024 at 09:11:
>
> Kairui Song writes:
>
> > From: Kairui Song
> >
> > There are two places where swapin is not caused by a direct anon page fault:
> > - shmem swapin, invoked indirectly through the shmem mapping
> > - swapoff
> >
> > They used to construct a
> > pseudo vmfault struct for the swapin function.
> > Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). The
> > swapoff path is still using one.
> >
> > Introduce a helper for them both; this helps save stack usage for the
> > swapoff path, and helps apply a unified swapin cache and readahead
> > policy check.
> >
> > Due to the missing vmfault info, the caller has to pass in the
> > mempolicy explicitly, making it different from swapin_entry, so name
> > it swapin_entry_mpol.
> >
> > This commit converts swapoff to use this helper; follow-up commits
> > will convert shmem to use it too.
> >
> > Signed-off-by: Kairui Song
> > ---
> >  mm/swap.h       |  9 +++++++++
> >  mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
> >  mm/swapfile.c   | 15 ++++++---------
> >  3 files changed, 47 insertions(+), 17 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 9180411afcfe..8f790a67b948 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >                                     struct mempolicy *mpol, pgoff_t ilx);
> >  struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
> >                             struct vm_fault *vmf, enum swap_cache_result *result);
> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> > +                                struct mempolicy *mpol, pgoff_t ilx,
> > +                                enum swap_cache_result *result);
> >
> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >  {
> > @@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
> >       return NULL;
> >  }
> >
> > +static inline struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> > +             struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
> > +{
> > +     return NULL;
> > +}
> > +
> >  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> >  {
> >       return 0;
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 21badd4f0fc7..3edf4b63158d 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >   * in.
> >   */
> >  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> > -                                   struct vm_fault *vmf, void *shadow)
> > +                                   struct mempolicy *mpol, pgoff_t ilx,
> > +                                   void *shadow)
> >  {
> > -     struct vm_area_struct *vma = vmf->vma;
> >       struct folio *folio;
> >
> > -     /* skip swapcache */
> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -                             vma, vmf->address, false);
> > +     folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
> > +                             mpol, ilx, numa_node_id());
> >       if (folio) {
> >               if (mem_cgroup_swapin_charge_folio(folio, NULL,
> >                                                  GFP_KERNEL, entry)) {
> > @@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> >               goto done;
> >       }
> >
> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > -             folio = swapin_direct(entry, gfp_mask, vmf, shadow);
> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> >               cache_result = SWAP_CACHE_BYPASS;
> >       } else {
> > -             mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >               if (swap_use_vma_readahead())
> >                       folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> >               else
> >                       folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > -             mpol_cond_put(mpol);
> >               cache_result = SWAP_CACHE_MISS;
> >       }
> > +     mpol_cond_put(mpol);
> >  done:
> >       if (result)
> >               *result = cache_result;
> > @@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> >       return folio;
> >  }
> >
> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> > +                                struct mempolicy *mpol, pgoff_t ilx,
> > +                                enum swap_cache_result *result)
> > +{
> > +     enum swap_cache_result cache_result;
> > +     void *shadow = NULL;
> > +     struct folio *folio;
> > +
> > +     folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
> > +     if (folio) {
> > +             cache_result = SWAP_CACHE_HIT;
> > +     } else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> > +             cache_result = SWAP_CACHE_BYPASS;
> > +     } else {
> > +             folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > +             cache_result = SWAP_CACHE_MISS;
> > +     }
> > +
> > +     if (result)
> > +             *result = cache_result;
> > +
> > +     return folio;
> > +}
> > +
> >  #ifdef CONFIG_SYSFS
> >  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
> >                                     struct kobj_attribute *attr, char *buf)
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 5aa44de11edc..2f77bf143af8 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >       do {
> >               struct folio *folio;
> >               unsigned long offset;
> > +             struct mempolicy *mpol;
> >               unsigned char swp_count;
> >               swp_entry_t entry;
> > +             pgoff_t ilx;
> >               int ret;
> >               pte_t ptent;
> >
> > -             struct vm_fault vmf = {
> > -                     .vma = vma,
> > -                     .address = addr,
> > -                     .real_address = addr,
> > -                     .pmd = pmd,
> > -             };
> > -
> >               if (!pte++) {
> >                       pte = pte_offset_map(pmd, addr);
> >                       if (!pte)
> > @@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >               pte_unmap(pte);
> >               pte = NULL;
> >
> > -             folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> > -                                  &vmf, NULL);
> > +             mpol = get_vma_policy(vma, addr, 0, &ilx);
> > +             folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
> > +                                       mpol, ilx, NULL);
> > +             mpol_cond_put(mpol);
> >               if (!folio) {
> >                       /*
> >                        * The entry could have been freed, and will not
>
> IIUC,
> after the change, we will always use cluster readahead for
> swapoff.  This may be OK.  But, at least we need some test results
> which show that this will not cause any issue for this behavior
> change.  And the behavior change should be described explicitly in
> the patch description.

Hi Ying,

Actually there is a swap_use_no_readahead check in swapin_entry_mpol,
so when readahead is not needed (SYNC_IO), it's just skipped.

And I think VMA readahead is not helpful for swapoff: swapoff is
already walking the VMA, mostly uninterrupted in kernel space. With or
without VMA readahead, it will issue IO page by page.

The benchmark result I posted before is actually VMA readahead vs.
no-readahead for ZRAM; sorry I didn't make that clear. No-readahead is
clearly faster.

For an actual block device, cluster readahead might be a good choice
for swapoff: all pages will be read during swapoff, there has to be
enough memory for all the swapcached pages to stay in memory or
swapoff will fail anyway, and cluster reads are faster for block
devices.

>
> And I don't think it's a good abstraction to make swapin_entry_mpol()
> always use cluster swapin, while swapin_entry() will try to use vma
> swapin.  I think we can add "struct mempolicy *mpol" and "pgoff_t ilx"
> to swapin_entry() as parameters, and use them if vmf == NULL.  If we
> want to enforce cluster swapin in the swapoff path, it will be better
> to add some comments to describe why.

Good suggestion. I thought extending swapin_entry might make its
argument list too long, but it seems mpol and ilx are the only things
needed here. I'll update it.
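To make the direction concrete, here is a rough, untested sketch of
what that unified swapin_entry() could look like, built only from the
helpers already in this series (the final signature and comments may
well differ):

struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
                           struct vm_fault *vmf, struct mempolicy *mpol,
                           pgoff_t ilx, enum swap_cache_result *result)
{
        enum swap_cache_result cache_result;
        void *shadow = NULL;
        struct folio *folio;

        /* Swap cache lookup first, recording any shadow entry. */
        folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
        if (folio) {
                cache_result = SWAP_CACHE_HIT;
                goto done;
        }

        /* With a fault context, derive the policy from the VMA. */
        if (vmf)
                mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);

        if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
                /* SYNC_IO device: bypass swap cache and readahead. */
                folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
                cache_result = SWAP_CACHE_BYPASS;
        } else if (vmf && swap_use_vma_readahead()) {
                /* VMA readahead needs a real fault context. */
                folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
                cache_result = SWAP_CACHE_MISS;
        } else {
                /* No fault context (swapoff/shmem): cluster readahead. */
                folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
                cache_result = SWAP_CACHE_MISS;
        }

        if (vmf)
                mpol_cond_put(mpol);
done:
        if (result)
                *result = cache_result;
        return folio;
}

Callers with a fault context would keep passing their vmf and let the
helper derive the policy itself, while swapoff would pass vmf == NULL
with its own mpol/ilx, and the comment at the cluster-readahead branch
is where the "why" for swapoff can be documented.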