From: Kairui Song <ryncsn@gmail.com>
Date: Sat, 17 Aug 2024 00:50:00 +0800
Subject: Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
To: Barry Song <21cnbao@gmail.com>
Cc: wangkefeng.wang@huawei.com, akpm@linux-foundation.org,
    baolin.wang@linux.alibaba.com, chrisl@kernel.org, david@redhat.com,
    hanchuanhua@oppo.com, hannes@cmpxchg.org, hch@infradead.org,
    hughd@google.com, kaleshsingh@google.com, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, mhocko@suse.com, minchan@kernel.org,
    nphamcs@gmail.com, ryan.roberts@arm.com, senozhatsky@chromium.org,
    shakeel.butt@linux.dev, shy828301@gmail.com, surenb@google.com,
    v-songbaohua@oppo.com, willy@infradead.org, xiang@kernel.org,
    ying.huang@intel.com, yosryahmed@google.com
In-Reply-To: <20240815230612.77266-1-21cnbao@gmail.com>
References: <20ed69ad-5dad-446b-9f01-86ad8b1c67fa@huawei.com> <20240815230612.77266-1-21cnbao@gmail.com>

On Fri, Aug 16, 2024 at 7:06 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang wrote:
> >
> >
> > On 2024/8/15 17:47, Kairui Song wrote:
> > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
> > >>
> > >> From: Chuanhua Han
> > >
> > > Hi Chuanhua,
> > >
> >
> > ...
>
> > >> +
> > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >> +{
> > >> +       struct vm_area_struct *vma = vmf->vma;
> > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >> +       unsigned long orders;
> > >> +       struct folio *folio;
> > >> +       unsigned long addr;
> > >> +       swp_entry_t entry;
> > >> +       spinlock_t *ptl;
> > >> +       pte_t *pte;
> > >> +       gfp_t gfp;
> > >> +       int order;
> > >> +
> > >> +       /*
> > >> +        * If uffd is active for the vma we need per-page fault fidelity to
> > >> +        * maintain the uffd semantics.
> > >> +        */
> > >> +       if (unlikely(userfaultfd_armed(vma)))
> > >> +               goto fallback;
> > >> +
> > >> +       /*
> > >> +        * A large swapped out folio could be partially or fully in zswap. We
> > >> +        * lack handling for such cases, so fallback to swapping in order-0
> > >> +        * folio.
> > >> +        */
> > >> +       if (!zswap_never_enabled())
> > >> +               goto fallback;
> > >> +
> > >> +       entry = pte_to_swp_entry(vmf->orig_pte);
> > >> +       /*
> > >> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > >> +        * and suitable for swapping THP.
> > >> +        */
> > >> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > >> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> > >> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > >> +       orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> > >> +
> > >> +       if (!orders)
> > >> +               goto fallback;
> > >> +
> > >> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> > >> +                                 vmf->address & PMD_MASK, &ptl);
> > >> +       if (unlikely(!pte))
> > >> +               goto fallback;
> > >> +
> > >> +       /*
> > >> +        * For do_swap_page, find the highest order where the aligned range is
> > >> +        * completely swap entries with contiguous swap offsets.
> > >> +        */
> > >> +       order = highest_order(orders);
> > >> +       while (orders) {
> > >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> > >> +                       break;
> > >> +               order = next_order(&orders, order);
> > >> +       }
> > >> +
> > >> +       pte_unmap_unlock(pte, ptl);
> > >> +
> > >> +       /* Try allocating the highest of the remaining orders. */
> > >> +       gfp = vma_thp_gfp_mask(vma);
> > >> +       while (orders) {
> > >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > >> +               if (folio)
> > >> +                       return folio;
> > >> +               order = next_order(&orders, order);
> > >> +       }
> > >> +
> > >> +fallback:
> > >> +#endif
> > >> +       return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
> > >> +                              vmf->address, false);
> > >> +}
> > >> +
> > >> +
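For readers following the hunk above: can_swapin_thp() itself is not shown
here. As a minimal illustrative sketch, assuming hypothetical naming and
ignoring locking details (this is not the actual implementation), the check
it has to make is that every PTE in the order-aligned range still holds a
swap entry on the same device, with offsets contiguous page by page:

static inline bool can_swapin_thp_sketch(pte_t *ptep, int nr_pages)
{
	swp_entry_t first = {}, entry;
	pte_t pte;
	int i;

	for (i = 0; i < nr_pages; i++) {
		pte = ptep_get(ptep + i);
		/* Every slot must still be a genuine swap entry. */
		if (!is_swap_pte(pte))
			return false;
		entry = pte_to_swp_entry(pte);
		if (non_swap_entry(entry))
			return false;
		if (i == 0) {
			first = entry;
			continue;
		}
		/* Same swap device, offsets advancing one page per PTE. */
		if (swp_type(entry) != swp_type(first) ||
		    swp_offset(entry) != swp_offset(first) + i)
			return false;
	}
	return true;
}

Only when such a contiguous run exists at some order does the order scan
above stop; otherwise it steps down with next_order() until it runs out.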
> > >>  /*
> > >>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > >>   * but allow concurrent faults), and pte mapped but not yet locked.
> > >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>         if (!folio) {
> > >>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >>                     __swap_count(entry) == 1) {
> > >> -                       /*
> > >> -                        * Prevent parallel swapin from proceeding with
> > >> -                        * the cache flag. Otherwise, another thread may
> > >> -                        * finish swapin first, free the entry, and swapout
> > >> -                        * reusing the same entry. It's undetectable as
> > >> -                        * pte_same() returns true due to entry reuse.
> > >> -                        */
> > >> -                       if (swapcache_prepare(entry, 1)) {
> > >> -                               /* Relax a bit to prevent rapid repeated page faults */
> > >> -                               schedule_timeout_uninterruptible(1);
> > >> -                               goto out;
> > >> -                       }
> > >> -                       need_clear_cache = true;
> > >> -
> > >>                         /* skip swapcache */
> > >> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > >> -                                               vma, vmf->address, false);
> > >> +                       folio = alloc_swap_folio(vmf);
> > >>                         page = &folio->page;
> > >>                         if (folio) {
> > >>                                 __folio_set_locked(folio);
> > >>                                 __folio_set_swapbacked(folio);
> > >>
> > >> +                               nr_pages = folio_nr_pages(folio);
> > >> +                               if (folio_test_large(folio))
> > >> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > >> +                               /*
> > >> +                                * Prevent parallel swapin from proceeding with
> > >> +                                * the cache flag. Otherwise, another thread may
> > >> +                                * finish swapin first, free the entry, and swapout
> > >> +                                * reusing the same entry. It's undetectable as
> > >> +                                * pte_same() returns true due to entry reuse.
> > >> +                                */
> > >> +                               if (swapcache_prepare(entry, nr_pages)) {
> > >> +                                       /* Relax a bit to prevent rapid repeated page faults */
> > >> +                                       schedule_timeout_uninterruptible(1);
> > >> +                                       goto out_page;
> > >> +                               }
> > >> +                               need_clear_cache = true;
> > >> +
> > >>                                 if (mem_cgroup_swapin_charge_folio(folio,
> > >>                                                         vma->vm_mm, GFP_KERNEL,
> > >>                                                         entry)) {
> > >>                                         ret = VM_FAULT_OOM;
> > >>                                         goto out_page;
> > >>                                 }
> > >
> > > After your patch, with a kernel build test, I'm seeing kernel log
> > > spamming like this:
> > > [  101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> > > [  101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > ............
> > >
> > > And heavy performance loss with workloads limited by memcg, mTHP enabled.
> > >
> > > After some debugging, the problematic part is the
> > > mem_cgroup_swapin_charge_folio() call above.
> > > When under pressure, the cgroup charge fails easily for mTHP. One 64K
> > > swapin then requires much more aggressive reclaim to succeed.
> > >
> > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is
> > > gone and mTHP swapin should have a much higher success rate. But this
> > > might not be the right way.
> > >
> > > For this particular issue, maybe you can change the charge order: try
> > > charging first; if it succeeds, use mTHP, and if it fails, fall back
> > > to 4K?
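As a minimal sketch of that suggestion (illustrative only; Barry's patch
below implements essentially this), the allocation loop in alloc_swap_folio()
would charge each candidate order before committing to it, and drop to the
next order when the charge fails:

	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			/* Charge up front; on failure, free the folio and
			 * retry a smaller order rather than failing the
			 * whole fault with VM_FAULT_OOM. */
			if (!mem_cgroup_swapin_charge_folio(folio,
						vma->vm_mm, gfp, entry))
				return folio;
			folio_put(folio);
		}
		order = next_order(&orders, order);
	}

With this shape, the order-0 fallback folio is still charged at the
do_swap_page() call site, so only large folios take the early-charge path.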
> >
> > This is what we did in alloc_anon_folio(), see commit 085ff35e7636
> > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"):
> > 1) fall back earlier
> > 2) use the same GFP flags for allocation and charge
> >
> > but it seems a little more complicated for the swapin charge
>
> Kefeng, thanks! I guess we can continue using the same approach and
> it's not too complicated.
>
> Kairui, sorry for the trouble and thanks for the report! Could you
> check if the solution below resolves the issue? On phones, we don't
> encounter the scenarios you're facing.
>
> From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
> From: Barry Song
> Date: Fri, 16 Aug 2024 10:51:48 +1200
> Subject: [PATCH] mm: fall back to next_order if charging mTHP fails
>
> When memcg approaches its limit, charging an mTHP becomes difficult.
> At this point, when the charge fails, we fall back to the next order
> to avoid repeatedly retrying larger orders.
>
> Reported-by: Kairui Song
> Signed-off-by: Barry Song
> ---
>  mm/memory.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ed3603aaf31..6cba28ef91e7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>         while (orders) {
>                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>                 folio = vma_alloc_folio(gfp, order, vma, addr, true);
> -               if (folio)
> -                       return folio;
> +               if (folio) {
> +                       if (!mem_cgroup_swapin_charge_folio(folio,
> +                                               vma->vm_mm, gfp, entry))
> +                               return folio;
> +                       folio_put(folio);
> +               }
>                 order = next_order(&orders, order);
>         }
>
> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                         }
>                         need_clear_cache = true;
>
> -                       if (mem_cgroup_swapin_charge_folio(folio,
> +                       if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
>                                                 vma->vm_mm, GFP_KERNEL,
>                                                 entry)) {
>                                 ret = VM_FAULT_OOM;
> --
> 2.34.1
>

Hi Barry,

After applying the fix, the spamming log is gone. Thanks for the fix!

>
> Thanks
> Barry
>