From: Kairui Song <ryncsn@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: wangkefeng.wang@huawei.com, akpm@linux-foundation.org,
baolin.wang@linux.alibaba.com, chrisl@kernel.org,
david@redhat.com, hanchuanhua@oppo.com, hannes@cmpxchg.org,
hch@infradead.org, hughd@google.com, kaleshsingh@google.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
mhocko@suse.com, minchan@kernel.org, nphamcs@gmail.com,
ryan.roberts@arm.com, senozhatsky@chromium.org,
shakeel.butt@linux.dev, shy828301@gmail.com, surenb@google.com,
v-songbaohua@oppo.com, willy@infradead.org, xiang@kernel.org,
ying.huang@intel.com, yosryahmed@google.com
Subject: Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
Date: Sat, 17 Aug 2024 00:50:00 +0800
Message-ID: <CAMgjq7AeVkmOiVFa4-iP4nbyX3LHft_5wfvV_raj_N+twdzMKQ@mail.gmail.com>
In-Reply-To: <20240815230612.77266-1-21cnbao@gmail.com>

On Fri, Aug 16, 2024 at 7:06 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >
> >
> >
> > On 2024/8/15 17:47, Kairui Song wrote:
> > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
> > >>
> > >> From: Chuanhua Han <hanchuanhua@oppo.com>
> > >
> > > Hi Chuanhua,
> > >
> > >>
> > ...
> >
> > >> +
> > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >> +{
> > >> + struct vm_area_struct *vma = vmf->vma;
> > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >> + unsigned long orders;
> > >> + struct folio *folio;
> > >> + unsigned long addr;
> > >> + swp_entry_t entry;
> > >> + spinlock_t *ptl;
> > >> + pte_t *pte;
> > >> + gfp_t gfp;
> > >> + int order;
> > >> +
> > >> + /*
> > >> + * If uffd is active for the vma we need per-page fault fidelity to
> > >> + * maintain the uffd semantics.
> > >> + */
> > >> + if (unlikely(userfaultfd_armed(vma)))
> > >> + goto fallback;
> > >> +
> > >> + /*
> > >> + * A large swapped out folio could be partially or fully in zswap. We
> > >> + * lack handling for such cases, so fallback to swapping in order-0
> > >> + * folio.
> > >> + */
> > >> + if (!zswap_never_enabled())
> > >> + goto fallback;
> > >> +
> > >> + entry = pte_to_swp_entry(vmf->orig_pte);
> > >> + /*
> > >> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > >> + * and suitable for swapping THP.
> > >> + */
> > >> + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > >> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> > >> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > >> + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> > >> +
> > >> + if (!orders)
> > >> + goto fallback;
> > >> +
> > >> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
> > >> + if (unlikely(!pte))
> > >> + goto fallback;
> > >> +
> > >> + /*
> > >> + * For do_swap_page, find the highest order where the aligned range is
> > >> + * completely swap entries with contiguous swap offsets.
> > >> + */
> > >> + order = highest_order(orders);
> > >> + while (orders) {
> > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> > >> + break;
> > >> + order = next_order(&orders, order);
> > >> + }
> > >> +
> > >> + pte_unmap_unlock(pte, ptl);
> > >> +
> > >> + /* Try allocating the highest of the remaining orders. */
> > >> + gfp = vma_thp_gfp_mask(vma);
> > >> + while (orders) {
> > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > >> + if (folio)
> > >> + return folio;
> > >> + order = next_order(&orders, order);
> > >> + }
> > >> +
> > >> +fallback:
> > >> +#endif
> > >> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> > >> +}
> > >> +
> > >> +
> > >> /*
> > >> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > >> * but allow concurrent faults), and pte mapped but not yet locked.
> > >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >> if (!folio) {
> > >> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >> __swap_count(entry) == 1) {
> > >> - /*
> > >> - * Prevent parallel swapin from proceeding with
> > >> - * the cache flag. Otherwise, another thread may
> > >> - * finish swapin first, free the entry, and swapout
> > >> - * reusing the same entry. It's undetectable as
> > >> - * pte_same() returns true due to entry reuse.
> > >> - */
> > >> - if (swapcache_prepare(entry, 1)) {
> > >> - /* Relax a bit to prevent rapid repeated page faults */
> > >> - schedule_timeout_uninterruptible(1);
> > >> - goto out;
> > >> - }
> > >> - need_clear_cache = true;
> > >> -
> > >> /* skip swapcache */
> > >> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > >> - vma, vmf->address, false);
> > >> + folio = alloc_swap_folio(vmf);
> > >> page = &folio->page;
> > >> if (folio) {
> > >> __folio_set_locked(folio);
> > >> __folio_set_swapbacked(folio);
> > >>
> > >> + nr_pages = folio_nr_pages(folio);
> > >> + if (folio_test_large(folio))
> > >> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > >> + /*
> > >> + * Prevent parallel swapin from proceeding with
> > >> + * the cache flag. Otherwise, another thread may
> > >> + * finish swapin first, free the entry, and swapout
> > >> + * reusing the same entry. It's undetectable as
> > >> + * pte_same() returns true due to entry reuse.
> > >> + */
> > >> + if (swapcache_prepare(entry, nr_pages)) {
> > >> + /* Relax a bit to prevent rapid repeated page faults */
> > >> + schedule_timeout_uninterruptible(1);
> > >> + goto out_page;
> > >> + }
> > >> + need_clear_cache = true;
> > >> +
> > >> if (mem_cgroup_swapin_charge_folio(folio,
> > >> vma->vm_mm, GFP_KERNEL,
> > >> entry)) {
> > >> ret = VM_FAULT_OOM;
> > >> goto out_page;
> > >> }
> > >
> > > After your patch, with a kernel build test, I'm seeing kernel log
> > > spam like this:
> > > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> > > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > ............
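For reference, that warning comes from pagefault_out_of_memory() in
mm/oom_kill.c: it fires when a fault handler returns VM_FAULT_OOM with no
memcg OOM pending, and it is rate-limited, which is where the "callbacks
suppressed" line comes from. Roughly (an abridged sketch; exact code may
differ between kernel versions):

    void pagefault_out_of_memory(void)
    {
            static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                          DEFAULT_RATELIMIT_BURST);

            /* A memcg-constrained fault completes its OOM handling here. */
            if (mem_cgroup_oom_synchronize(true))
                    return;

            if (fatal_signal_pending(current))
                    return;

            if (__ratelimit(&pfoom_rs))
                    pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
    }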
> > >
> > > And there is heavy performance loss for memcg-limited workloads with
> > > mTHP enabled.
> > >
> > > After some debugging, the problematic part is the
> > > mem_cgroup_swapin_charge_folio() call above.
> > > Under memcg pressure, the charge fails easily for an mTHP: a single
> > > 64K swap-in requires much more aggressive reclaim to succeed.
> > >
> > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the log spam is gone
> > > and mTHP swap-in has a much higher success rate. But this might not
> > > be the right way to fix it.
> > >
> > > For this particular issue, maybe you can change the charge order: try
> > > charging first and, if it succeeds, use mTHP; if it fails, fall back to 4K?
> >
> > This is what we did in alloc_anon_folio(), see 085ff35e7636
> > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"):
> > 1) fall back earlier
> > 2) use the same GFP flags for allocation and charge
> >
> > but it seems a little more complicated for the swapin charge
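For comparison, a simplified sketch of what alloc_anon_folio() does after
085ff35e7636 (stat counters and other details omitted; see mm/memory.c for
the real code):

    /* The same gfp is used for both the allocation and the memcg
     * charge, and a failed charge falls back to the next smaller
     * order rather than failing the fault.
     */
    gfp = vma_thp_gfp_mask(vma);
    while (orders) {
            addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
            folio = vma_alloc_folio(gfp, order, vma, addr, true);
            if (folio) {
                    if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                            folio_put(folio);       /* charge failed */
                            goto next;
                    }
                    return folio;
            }
    next:
            order = next_order(&orders, order);
    }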
>
> Kefeng, thanks! I guess we can continue using the same approach and
> it's not too complicated.
>
> Kairui, sorry for the trouble, and thanks for the report! Could you
> check whether the solution below resolves the issue? On phones, we don't
> encounter the scenario you're facing.
>
> From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
> From: Barry Song <v-songbaohua@oppo.com>
> Date: Fri, 16 Aug 2024 10:51:48 +1200
> Subject: [PATCH] mm: fallback to next_order if charging mTHP fails
>
> When memcg approaches its limit, charging an mTHP becomes difficult.
> At this point, when the charge fails, we fall back to the next lower
> order to avoid repeatedly retrying larger orders.
>
> Reported-by: Kairui Song <ryncsn@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/memory.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ed3603aaf31..6cba28ef91e7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> folio = vma_alloc_folio(gfp, order, vma, addr, true);
> - if (folio)
> - return folio;
> + if (folio) {
> + if (!mem_cgroup_swapin_charge_folio(folio,
> + vma->vm_mm, gfp, entry))
> + return folio;
> + folio_put(folio);
> + }
> order = next_order(&orders, order);
> }
>
> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
> need_clear_cache = true;
>
> - if (mem_cgroup_swapin_charge_folio(folio,
> + if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
> vma->vm_mm, GFP_KERNEL,
> entry)) {
> ret = VM_FAULT_OOM;
> --
> 2.34.1
>
Hi Barry,
After applying the fix, the log spam is gone. Thanks for the fix.
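If I read the patch right, the two charge sites now split cleanly:
alloc_swap_folio() charges each large order with the same gfp it uses for
the allocation (the alloc_anon_folio() pattern Kefeng mentioned), while
do_swap_page() keeps its GFP_KERNEL charge only for the order-0 fallback,
which is what the nr_pages == 1 guard expresses.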
>
> Thanks
> Barry
>