Message-ID: <20ed69ad-5dad-446b-9f01-86ad8b1c67fa@huawei.com>
Date: Thu, 15 Aug 2024 21:27:53 +0800
Subject: Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
To: Kairui Song, Chuanhua Han,
 Barry Song <21cnbao@gmail.com>
From: Kefeng Wang
References: <20240726094618.401593-1-21cnbao@gmail.com>
 <20240802122031.117548-1-21cnbao@gmail.com>
 <20240802122031.117548-3-21cnbao@gmail.com>

On 2024/8/15 17:47, Kairui Song wrote:
> On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> From: Chuanhua Han
>
> Hi Chuanhua,
>
>> ...
>> +
>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	unsigned long orders;
>> +	struct folio *folio;
>> +	unsigned long addr;
>> +	swp_entry_t entry;
>> +	spinlock_t *ptl;
>> +	pte_t *pte;
>> +	gfp_t gfp;
>> +	int order;
>> +
>> +	/*
>> +	 * If uffd is active for the vma we need per-page fault fidelity to
>> +	 * maintain the uffd semantics.
>> +	 */
>> +	if (unlikely(userfaultfd_armed(vma)))
>> +		goto fallback;
>> +
>> +	/*
>> +	 * A large swapped out folio could be partially or fully in zswap. We
>> +	 * lack handling for such cases, so fallback to swapping in order-0
>> +	 * folio.
>> +	 */
>> +	if (!zswap_never_enabled())
>> +		goto fallback;
>> +
>> +	entry = pte_to_swp_entry(vmf->orig_pte);
>> +	/*
>> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +	 * and suitable for swapping THP.
>> +	 */
>> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
>> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>> +	orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
>> +
>> +	if (!orders)
>> +		goto fallback;
>> +
>> +	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> +				  vmf->address & PMD_MASK, &ptl);
>> +	if (unlikely(!pte))
>> +		goto fallback;
>> +
>> +	/*
>> +	 * For do_swap_page, find the highest order where the aligned range is
>> +	 * completely swap entries with contiguous swap offsets.
>> +	 */
>> +	order = highest_order(orders);
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
>> +			break;
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +	pte_unmap_unlock(pte, ptl);
>> +
>> +	/* Try allocating the highest of the remaining orders. */
>> +	gfp = vma_thp_gfp_mask(vma);
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +		if (folio)
>> +			return folio;
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +fallback:
>> +#endif
>> +	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
>> +}
>> +
>> +
>>  /*
>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>   * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	if (!folio) {
>>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>  		    __swap_count(entry) == 1) {
>> -			/*
>> -			 * Prevent parallel swapin from proceeding with
>> -			 * the cache flag. Otherwise, another thread may
>> -			 * finish swapin first, free the entry, and swapout
>> -			 * reusing the same entry. It's undetectable as
>> -			 * pte_same() returns true due to entry reuse.
>> -			 */
>> -			if (swapcache_prepare(entry, 1)) {
>> -				/* Relax a bit to prevent rapid repeated page faults */
>> -				schedule_timeout_uninterruptible(1);
>> -				goto out;
>> -			}
>> -			need_clear_cache = true;
>> -
>>  			/* skip swapcache */
>> -			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> -						vma, vmf->address, false);
>> +			folio = alloc_swap_folio(vmf);
>>  			page = &folio->page;
>>  			if (folio) {
>>  				__folio_set_locked(folio);
>>  				__folio_set_swapbacked(folio);
>>
>> +				nr_pages = folio_nr_pages(folio);
>> +				if (folio_test_large(folio))
>> +					entry.val = ALIGN_DOWN(entry.val, nr_pages);
>> +				/*
>> +				 * Prevent parallel swapin from proceeding with
>> +				 * the cache flag. Otherwise, another thread may
>> +				 * finish swapin first, free the entry, and swapout
>> +				 * reusing the same entry. It's undetectable as
>> +				 * pte_same() returns true due to entry reuse.
>> +				 */
>> +				if (swapcache_prepare(entry, nr_pages)) {
>> +					/* Relax a bit to prevent rapid repeated page faults */
>> +					schedule_timeout_uninterruptible(1);
>> +					goto out_page;
>> +				}
>> +				need_clear_cache = true;
>> +
>>  				if (mem_cgroup_swapin_charge_folio(folio,
>>  							vma->vm_mm, GFP_KERNEL,
>>  							entry)) {
>>  					ret = VM_FAULT_OOM;
>>  					goto out_page;
>>  				}
>
> After your patch, with build kernel test, I'm seeing kernel log
> spamming like this:
> [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> ............
>
> And heavy performance loss with workloads limited by memcg, mTHP enabled.
>
> After some debugging, the problematic part is the
> mem_cgroup_swapin_charge_folio call above.
> When under pressure, cgroup charge fails easily for mTHP. One 64k
> swapin will require a much more aggressive reclaim to succeed.
>
> If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is
> gone and mTHP swapin should have a much higher success rate.
> But this might not be the right way.
>
> For this particular issue, maybe you can change the charge order: try
> charging first; if it succeeds, use mTHP, and if it fails, fall back to 4K?

This is what we did in alloc_anon_folio(), see commit 085ff35e7636 ("mm:
memory: move mem_cgroup_charge() into alloc_anon_folio()"):

1) fall back earlier
2) use the same GFP flags for allocation and charge

but it seems a little more complicated for the swapin charge.
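For illustration, here is a rough, untested sketch of what that charge-first
ordering could look like in alloc_swap_folio()'s allocation loop, loosely
following the alloc_anon_folio() approach from 085ff35e7636 (and ignoring
that 'entry' would first need the same ALIGN_DOWN to the folio size that
do_swap_page() applies). The existing mem_cgroup_swapin_charge_folio() call
in do_swap_page() would then have to be skipped or moved for folios already
charged here, and the final order-0 fallback would still need its own charge:

	/* Try the highest remaining order, charging before committing to it. */
	gfp = vma_thp_gfp_mask(vma);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			/*
			 * If the memcg charge fails for this order, drop the
			 * large folio and retry with the next smaller order
			 * instead of failing the fault with VM_FAULT_OOM.
			 */
			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
							    gfp, entry))
				return folio;
			folio_put(folio);
		}
		order = next_order(&orders, order);
	}

That way hitting a memcg limit only degrades the swapin to a smaller order
rather than failing the whole fault, which should also make the VM_FAULT_OOM
spamming above go away.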
>
>> -				mem_cgroup_swapin_uncharge_swap(entry, 1);
>> +				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>>
>>  				shadow = get_shadow_from_swap_cache(entry);
>>  				if (shadow)
>> @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		goto out_nomap;
>>  	}
>>
>> +	/* allocated large folios for SWP_SYNCHRONOUS_IO */
>> +	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
>> +		unsigned long nr = folio_nr_pages(folio);
>> +		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
>> +		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
>> +		pte_t *folio_ptep = vmf->pte - idx;
>> +
>> +		if (!can_swapin_thp(vmf, folio_ptep, nr))
>> +			goto out_nomap;
>> +
>> +		page_idx = idx;
>> +		address = folio_start;
>> +		ptep = folio_ptep;
>> +		goto check_folio;
>> +	}
>> +
>>  	nr_pages = 1;
>>  	page_idx = 0;
>>  	address = vmf->address;
>> @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		folio_add_lru_vma(folio, vma);
>>  	} else if (!folio_test_anon(folio)) {
>>  		/*
>> -		 * We currently only expect small !anon folios, which are either
>> -		 * fully exclusive or fully shared. If we ever get large folios
>> -		 * here, we have to be careful.
>> +		 * We currently only expect small !anon folios which are either
>> +		 * fully exclusive or fully shared, or new allocated large folios
>> +		 * which are fully exclusive. If we ever get large folios within
>> +		 * swapcache here, we have to be careful.
>>  		 */
>> -		VM_WARN_ON_ONCE(folio_test_large(folio));
>> +		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
>>  		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>>  		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>>  	} else {
>> @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  out:
>>  	/* Clear the swap cache pin for direct swapin after PTL unlock */
>>  	if (need_clear_cache)
>> -		swapcache_clear(si, entry, 1);
>> +		swapcache_clear(si, entry, nr_pages);
>>  	if (si)
>>  		put_swap_device(si);
>>  	return ret;
>> @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		folio_put(swapcache);
>>  	}
>>  	if (need_clear_cache)
>> -		swapcache_clear(si, entry, 1);
>> +		swapcache_clear(si, entry, nr_pages);
>>  	if (si)
>>  		put_swap_device(si);
>>  	return ret;
>> --
>> 2.34.1
>>
>>
>