Message-ID: <20ed69ad-5dad-446b-9f01-86ad8b1c67fa@huawei.com>
Date: Thu, 15 Aug 2024 21:27:53 +0800
Subject: Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
To: Kairui Song, Chuanhua Han,
 Barry Song <21cnbao@gmail.com>
From: Kefeng Wang
References: <20240726094618.401593-1-21cnbao@gmail.com>
 <20240802122031.117548-1-21cnbao@gmail.com>
 <20240802122031.117548-3-21cnbao@gmail.com>

On 2024/8/15 17:47, Kairui Song wrote:
> On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> From: Chuanhua Han
>
> Hi Chuanhua,
>
>> ...
>> +
>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	unsigned long orders;
>> +	struct folio *folio;
>> +	unsigned long addr;
>> +	swp_entry_t entry;
>> +	spinlock_t *ptl;
>> +	pte_t *pte;
>> +	gfp_t gfp;
>> +	int order;
>> +
>> +	/*
>> +	 * If uffd is active for the vma we need per-page fault fidelity to
>> +	 * maintain the uffd semantics.
>> +	 */
>> +	if (unlikely(userfaultfd_armed(vma)))
>> +		goto fallback;
>> +
>> +	/*
>> +	 * A large swapped out folio could be partially or fully in zswap. We
>> +	 * lack handling for such cases, so fallback to swapping in order-0
>> +	 * folio.
>> +	 */
>> +	if (!zswap_never_enabled())
>> +		goto fallback;
>> +
>> +	entry = pte_to_swp_entry(vmf->orig_pte);
>> +	/*
>> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +	 * and suitable for swapping THP.
>> +	 */
>> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
>> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>> +	orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
>> +
>> +	if (!orders)
>> +		goto fallback;
>> +
>> +	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> +				  vmf->address & PMD_MASK, &ptl);
>> +	if (unlikely(!pte))
>> +		goto fallback;
>> +
>> +	/*
>> +	 * For do_swap_page, find the highest order where the aligned range is
>> +	 * completely swap entries with contiguous swap offsets.
>> +	 */
>> +	order = highest_order(orders);
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
>> +			break;
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +	pte_unmap_unlock(pte, ptl);
>> +
>> +	/* Try allocating the highest of the remaining orders. */
>> +	gfp = vma_thp_gfp_mask(vma);
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +		if (folio)
>> +			return folio;
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +fallback:
>> +#endif
>> +	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
>> +}
>> +
>> +
>>  /*
>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>   * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	if (!folio) {
>>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>  		    __swap_count(entry) == 1) {
>> -			/*
>> -			 * Prevent parallel swapin from proceeding with
>> -			 * the cache flag. Otherwise, another thread may
>> -			 * finish swapin first, free the entry, and swapout
>> -			 * reusing the same entry. It's undetectable as
>> -			 * pte_same() returns true due to entry reuse.
>> -			 */
>> -			if (swapcache_prepare(entry, 1)) {
>> -				/* Relax a bit to prevent rapid repeated page faults */
>> -				schedule_timeout_uninterruptible(1);
>> -				goto out;
>> -			}
>> -			need_clear_cache = true;
>> -
>>  			/* skip swapcache */
>> -			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> -						vma, vmf->address, false);
>> +			folio = alloc_swap_folio(vmf);
>>  			page = &folio->page;
>>  			if (folio) {
>>  				__folio_set_locked(folio);
>>  				__folio_set_swapbacked(folio);
>>
>> +				nr_pages = folio_nr_pages(folio);
>> +				if (folio_test_large(folio))
>> +					entry.val = ALIGN_DOWN(entry.val, nr_pages);
>> +				/*
>> +				 * Prevent parallel swapin from proceeding with
>> +				 * the cache flag. Otherwise, another thread may
>> +				 * finish swapin first, free the entry, and swapout
>> +				 * reusing the same entry. It's undetectable as
>> +				 * pte_same() returns true due to entry reuse.
>> +				 */
>> +				if (swapcache_prepare(entry, nr_pages)) {
>> +					/* Relax a bit to prevent rapid repeated page faults */
>> +					schedule_timeout_uninterruptible(1);
>> +					goto out_page;
>> +				}
>> +				need_clear_cache = true;
>> +
>>  				if (mem_cgroup_swapin_charge_folio(folio,
>>  							vma->vm_mm, GFP_KERNEL,
>>  							entry)) {
>>  					ret = VM_FAULT_OOM;
>>  					goto out_page;
>>  				}
>
> After your patch, with build kernel test, I'm seeing kernel log
> spamming like this:
> [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> ............
>
> And heavy performance loss with workloads limited by memcg, mTHP enabled.
>
> After some debugging, the problematic part is the
> mem_cgroup_swapin_charge_folio call above.
> When under pressure, cgroup charge fails easily for mTHP. One 64k
> swapin will require a much more aggressive reclaim to succeed.
>
> If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is
> gone and mTHP swapin should have a much higher success rate.
> But this might not be the right way.
>
> For this particular issue, maybe you can change the charge order: try
> charging first; if it succeeds, use mTHP, and if it fails, fall back to 4K?

This is what we did in alloc_anon_folio(), see commit 085ff35e7636 ("mm:
memory: move mem_cgroup_charge() into alloc_anon_folio()"):

1) fall back earlier
2) use the same GFP flags for allocation and charge

but it seems a little more complicated for the swapin charge.
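For illustration, here is a rough, untested sketch of what that charge-first
ordering could look like in alloc_swap_folio()'s allocation loop, loosely
following the alloc_anon_folio() approach from 085ff35e7636 (and ignoring
that 'entry' would first need the same ALIGN_DOWN to the folio size that
do_swap_page() applies). The existing mem_cgroup_swapin_charge_folio() call
in do_swap_page() would then have to be skipped or moved for folios already
charged here, and the final order-0 fallback would still need its own charge:

	/* Try the highest remaining order, charging before committing to it. */
	gfp = vma_thp_gfp_mask(vma);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			/*
			 * If the memcg charge fails for this order, drop the
			 * large folio and retry with the next smaller order
			 * instead of failing the fault with VM_FAULT_OOM.
			 */
			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
							    gfp, entry))
				return folio;
			folio_put(folio);
		}
		order = next_order(&orders, order);
	}

That way hitting a memcg limit only degrades the swapin to a smaller order
rather than failing the whole fault, which should also make the VM_FAULT_OOM
spamming above go away.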
>
>> -				mem_cgroup_swapin_uncharge_swap(entry, 1);
>> +				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>>
>>  				shadow = get_shadow_from_swap_cache(entry);
>>  				if (shadow)
>> @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		goto out_nomap;
>>  	}
>>
>> +	/* allocated large folios for SWP_SYNCHRONOUS_IO */
>> +	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
>> +		unsigned long nr = folio_nr_pages(folio);
>> +		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
>> +		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
>> +		pte_t *folio_ptep = vmf->pte - idx;
>> +
>> +		if (!can_swapin_thp(vmf, folio_ptep, nr))
>> +			goto out_nomap;
>> +
>> +		page_idx = idx;
>> +		address = folio_start;
>> +		ptep = folio_ptep;
>> +		goto check_folio;
>> +	}
>> +
>>  	nr_pages = 1;
>>  	page_idx = 0;
>>  	address = vmf->address;
>> @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		folio_add_lru_vma(folio, vma);
>>  	} else if (!folio_test_anon(folio)) {
>>  		/*
>> -		 * We currently only expect small !anon folios, which are either
>> -		 * fully exclusive or fully shared. If we ever get large folios
>> -		 * here, we have to be careful.
>> +		 * We currently only expect small !anon folios which are either
>> +		 * fully exclusive or fully shared, or new allocated large folios
>> +		 * which are fully exclusive. If we ever get large folios within
>> +		 * swapcache here, we have to be careful.
>>  		 */
>> -		VM_WARN_ON_ONCE(folio_test_large(folio));
>> +		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
>>  		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>>  		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>>  	} else {
>> @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  out:
>>  	/* Clear the swap cache pin for direct swapin after PTL unlock */
>>  	if (need_clear_cache)
>> -		swapcache_clear(si, entry, 1);
>> +		swapcache_clear(si, entry, nr_pages);
>>  	if (si)
>>  		put_swap_device(si);
>>  	return ret;
>> @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		folio_put(swapcache);
>>  	}
>>  	if (need_clear_cache)
>> -		swapcache_clear(si, entry, 1);
>> +		swapcache_clear(si, entry, nr_pages);
>>  	if (si)
>>  		put_swap_device(si);
>>  	return ret;
>> --
>> 2.34.1
>>
>>
>