From: Kairui Song <ryncsn@gmail.com>
Date: Tue, 21 Nov 2023 14:35:27 +0800
Subject: Re: [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead
To: Chris Li
Cc: linux-mm@kvack.org, Andrew Morton, "Huang, Ying", David Hildenbrand,
 Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
 linux-kernel@vger.kernel.org
References: <20231119194740.94101-1-ryncsn@gmail.com> <20231119194740.94101-6-ryncsn@gmail.com>

Chris Li wrote on Tue, 21 Nov 2023 at 14:18:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song wrote:
> >
> > From: Kairui Song
> >
> > This makes swapin_readahead a main entry for swapin
> > pages, prepare for optimizations in later commits.
> >
> > This also makes swapoff able to make use of readahead checking
> > based on entry. Swapping off a 10G ZRAM (lzo-rle) is faster:
> >
> > Before:
> > time swapoff /dev/zram0
> > real    0m12.337s
> > user    0m0.001s
> > sys     0m12.329s
> >
> > After:
> > time swapoff /dev/zram0
> > real    0m9.728s
> > user    0m0.001s
> > sys     0m9.719s
> >
> > What's more, because swapoff now also makes use of the no-readahead
> > swapin helper, this also fixes a bug for the no-readahead case (e.g. ZRAM):
> > when a process that previously swapped out some memory is moved to a new
> > cgroup, and the original cgroup is dead, swapping off the swap device
> > accounts the swapped-in pages to the process doing the swapoff
> > instead of to the new cgroup the process was moved to.
> >
> > This can be easily reproduced by:
> > - Set up a ramdisk (e.g. ZRAM) swap.
> > - Create memory cgroups A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C.
> > - The swapped-in pages are accounted to cgroup C.
> >
> > This patch fixes that by making the swapped-in pages accounted to cgroup B.
>
> Can you help me understand where the code makes this behavior change?
> As far as I can tell, the patch just allows swapoff to use the code
> path that avoids readahead and the swap cache. I haven't figured
> out where the code makes the swapin charge to B.

Yes, swapoff always calls swapin_readahead to swap in pages.
Before this patch, the page allocate/charge path is like this
(here we assume there is only a ZRAM device, so VMA readahead is used):

swapoff
  swapin_readahead
    swap_vma_readahead
      __read_swap_cache_async
        alloc_pages_mpol
        // *** Page charge happens here, and
        // note the second argument is NULL
        mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry)

After the patch:

swapoff
  swapin_readahead
    swapin_no_readahead
      vma_alloc_folio
      // *** Page charge happens here, and
      // note the second argument is vma->vm_mm
      mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)

In the previous call path (swap_vma_readahead), the mm struct info is not
passed down to the final allocation/charge, but swapin_no_readahead can
simply pass the mm struct to the allocation/charge.
mem_cgroup_swapin_charge_folio takes the memcg of the mm_struct as the
charge target when the entry's memcg is not online.

> Does it need to be ZRAM? Will zswap or SSD work in your example?

The "swapoff moves memory charge out of a dying cgroup" issue exists for
all swap devices; this patch only changes the behavior for ZRAM (which
bypasses the swapcache and uses swapin_no_readahead).

> >
> > The same bug exists for the readahead path too; we'll fix it in later
> > commits.
>
> As discussed in another email, this behavior change is against the
> current memcg memory charging model.
> Better to separate this behavior change from the code cleanup.

Good suggestion.
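To make the accounting difference concrete, the charge-target fallback described above can be modeled with a small userspace sketch. This is only an illustration of the behavior discussed in this thread, not the kernel implementation; the types and the `pick_charge_target` helper are made up here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel structures (hypothetical, for illustration). */
struct memcg { const char *name; bool online; };
struct mm_struct { struct memcg *memcg; };

/*
 * Model of the charge-target choice: prefer the memcg recorded for the
 * swap entry, but when that cgroup is no longer online, fall back to the
 * memcg of the mm passed in -- and when no mm is given (as in the old
 * readahead path, which passed NULL), fall back to the memcg of the task
 * doing the swapoff.
 */
static struct memcg *pick_charge_target(struct memcg *entry_memcg,
                                        struct mm_struct *mm,
                                        struct memcg *swapoff_task_memcg)
{
	if (entry_memcg && entry_memcg->online)
		return entry_memcg;
	if (mm)
		return mm->memcg;
	return swapoff_task_memcg;
}
```

With cgroup A dead and P1 moved to B, passing NULL (old path) lands the charge in the swapoff caller's cgroup C, while passing P1's mm (new path) lands it in B.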
>
> >
> > Signed-off-by: Kairui Song
> > ---
> >  mm/memory.c     | 22 +++++++---------------
> >  mm/swap.h       |  6 ++----
> >  mm/swap_state.c | 33 ++++++++++++++++++++++++++-------
> >  mm/swapfile.c   |  2 +-
> >  4 files changed, 36 insertions(+), 27 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index fba4a5229163..f4237a2e3b93 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3792,6 +3792,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         rmap_t rmap_flags = RMAP_NONE;
> >         bool exclusive = false;
> >         swp_entry_t entry;
> > +       bool swapcached;
> >         pte_t pte;
> >         vm_fault_t ret = 0;
> >
> > @@ -3855,22 +3856,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                 swapcache = folio;
> >
> >         if (!folio) {
> > -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > -                   __swap_count(entry) == 1) {
> > -                       /* skip swapcache and readahead */
> > -                       page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -                                                  vmf);
> > -                       if (page)
> > -                               folio = page_folio(page);
> > +               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > +                                       vmf, &swapcached);
> > +               if (page) {
> > +                       folio = page_folio(page);
> > +                       if (swapcached)
> > +                               swapcache = folio;
> >                 } else {
> > -                       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -                                               vmf);
> > -                       if (page)
> > -                               folio = page_folio(page);
> > -                       swapcache = folio;
> > -               }
> > -
> > -               if (!folio) {
> >                         /*
> >                          * Back out if somebody else faulted in this pte
> >                          * while we released the pte lock.
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index ea4be4791394..f82d43d7b52a 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -55,9 +55,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >                                     struct mempolicy *mpol, pgoff_t ilx);
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > -                             struct vm_fault *vmf);
> > -struct page *swapin_no_readahead(swp_entry_t entry, gfp_t flag,
> > -                                struct vm_fault *vmf);
> > +                             struct vm_fault *vmf, bool *swapcached);
> >
> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >  {
> > @@ -89,7 +87,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> >  }
> >
> >  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > -                       struct vm_fault *vmf)
> > +                       struct vm_fault *vmf, bool *swapcached)
> >  {
> >         return NULL;
> >  }
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 45dd8b7c195d..fd0047ae324e 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -316,6 +316,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> >         release_pages(pages, nr);
> >  }
> >
> > +static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> > +{
> > +       return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> > +}
> > +
> >  static inline bool swap_use_vma_readahead(void)
> >  {
> >         return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > @@ -861,8 +866,8 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >   * Returns the struct page for entry and addr after the swap entry is read
> >   * in.
> >   */
> > -struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > -                                struct vm_fault *vmf)
> > +static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > +                                       struct vm_fault *vmf)
> >  {
> >         struct vm_area_struct *vma = vmf->vma;
> >         struct page *page = NULL;
> > @@ -904,6 +909,8 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >   * @entry: swap entry of this memory
> >   * @gfp_mask: memory allocation flags
> >   * @vmf: fault information
> > + * @swapcached: pointer to a bool used as indicator if the
> > + *              page is swapped in through swapcache.
> >   *
> >   * Returns the struct page for entry and addr, after queueing swapin.
> >   *
> > @@ -912,17 +919,29 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >   * or vma-based(ie, virtual address based on faulty address) readahead.
> >   */
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > -                             struct vm_fault *vmf)
> > +                             struct vm_fault *vmf, bool *swapcached)
>
> At this point the function name "swapin_readahead" no longer matches
> what it does, because readahead is just one of its behaviors; it can
> also swap in without readahead. It needs a better name. This function
> has become a generic swapin_entry.

I renamed this function in later commits; I can rename it here to avoid
confusion.

> >  {
> >         struct mempolicy *mpol;
> > -       pgoff_t ilx;
> >         struct page *page;
> > +       pgoff_t ilx;
> > +       bool cached;
> >
> >         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > -       page = swap_use_vma_readahead() ?
> > -               swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> > -               swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > +       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > +               page = swapin_no_readahead(entry, gfp_mask, vmf);
> > +               cached = false;
> > +       } else if (swap_use_vma_readahead()) {
> > +               page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > +               cached = true;
> > +       } else {
>
> Notice that the flavor of swapin used is actually a property of the
> swap device: for a given device, the same swapin function is always
> called.
>
> One idea is to have a VFS-like API for swap devices. This could be a
> swapin_entry callback in a swap_ops struct, with different swap devices
> registering different swapin_entry hooks. That way we don't need to
> look at the device flags to decide what to do; we can just call
> swap_ops->swapin_entry(), and the function pointer will point to the
> right implementation. Then we don't need these three levels of if/else.
> It is also more flexible for registering custom implementations of swap
> layouts. Something to consider for the future; you don't have to
> implement it in this series.

Interesting idea, we may look into this later.

> > +               page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > +               cached = true;
> > +       }
> >         mpol_cond_put(mpol);
> > +
> > +       if (swapcached)
> > +               *swapcached = cached;
> > +
> >         return page;
> >  }
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 756104ebd585..0142bfc71b81 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1874,7 +1874,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >         };
> >
> >         page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -                               &vmf);
> > +                               &vmf, NULL);
> >         if (page)
> >                 folio = page_folio(page);
> >         }
> > --
> > 2.42.0
> >

Thanks!
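P.S. A rough userspace sketch of the VFS-like swap_ops dispatch Chris described above. All names here are hypothetical (nothing like this exists in the kernel today); it only shows how a per-device function pointer would replace the three-way flag check:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of a swap_ops table, for illustration only. */
struct swap_device;

struct swap_ops {
	/* Per-device swapin entry point, chosen once at registration time. */
	int (*swapin_entry)(struct swap_device *dev);
};

struct swap_device {
	const struct swap_ops *ops;
};

/* One implementation per device flavor, instead of flag checks. */
static int swapin_sync_no_readahead(struct swap_device *dev)
{
	(void)dev;
	return 1; /* would do the swapcache-bypassing synchronous swapin */
}

static int swapin_with_readahead(struct swap_device *dev)
{
	(void)dev;
	return 2; /* would do the readahead + swapcache swapin */
}

static const struct swap_ops zram_like_ops = {
	.swapin_entry = swapin_sync_no_readahead,
};
static const struct swap_ops ssd_like_ops = {
	.swapin_entry = swapin_with_readahead,
};

/* The caller no longer inspects SWP_SYNCHRONOUS_IO-style flags. */
static int do_swapin(struct swap_device *dev)
{
	return dev->ops->swapin_entry(dev);
}
```

Each device registers its ops once, and every later swapin goes straight through the pointer.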