From: Ryan Roberts <ryan.roberts@arm.com>
Date: Thu, 4 Jul 2024 14:39:49 +0100
Subject: Re: huge zero page confusion
To: "Kirill A. Shutemov"
Cc: Hugh Dickins, Mel Gorman, Linux-MM, Catalin Marinas, David Hildenbrand,
 Matthew Wilcox
References: <1cfae0c0-96a2-4308-9c62-f7a640520242@arm.com>

On 04/07/2024 12:41, Kirill A. Shutemov wrote:
> On Wed, Jul 03, 2024 at 06:37:48PM +0100, Ryan Roberts wrote:
>> Hi Kirill, Hugh, Mel,
>>
>> We recently had a problem reported at [1]: because the aarch64 architecture
>> requires that atomic RMW instructions raise a read fault followed by a write
>> fault, the read fault pulls in a huge zero page, and the write fault then
>> shatters it, installing small zero pages for every PTE in the PMD region
>> except the faulting address, which gets a writable private page.
>>
>> A number of ways to solve that problem were discussed. But it got me
>> wondering why we have this behaviour for the huge zero page in general?
>> It seems odd to me. Surely it would be less effort, and more in line with
>> the app's expectations, to notice the huge zero page in the PMD, remove it,
>> and install a THP, as would have been done if pmd_none() were true? Or, if
>> there is a reason to shatter on write, why not do away with the huge zero
>> page, save some memory, and just install a PMD's worth of small zero pages
>> on fault?
>>
>> Perhaps replacing the huge zero page with a THP on write fault would have
>> been the better behaviour at the time, but changing that behaviour now
>> risks a memory bloat regression in some workloads?
>
> Yeah, I agree that a WP fault on the huge zero page should give a THP. I
> think treating the huge zero page as a none PMD on write page fault should
> be safe and reasonable.

So you're not concerned about the potential for memory bloat regressions in
apps that are deployed today? I'm a bit nervous to make the change without a
bunch of testing...
>
> Untested patch is below.
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2aa986a5cd1b..04c252303951 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -552,6 +552,11 @@ static inline bool thp_migration_supported(void)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline bool pmd_none_or_zero_folio(pmd_t pmd)
> +{
> +	return pmd_none(pmd) || is_huge_zero_pmd(pmd);
> +}
> +
>  static inline int split_folio_to_list_to_order(struct folio *folio,
>  		struct list_head *list, int new_order)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 89932fd0f62e..fdd5236004bc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -951,7 +951,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>  	__folio_mark_uptodate(folio);
>  
>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> -	if (unlikely(!pmd_none(*vmf->pmd))) {
> +	if (unlikely(!pmd_none_or_zero_folio(*vmf->pmd))) {

Hmm, I'm not sure about this; wouldn't we need to "uninstall" the huge zero
page somehow? I'm guessing a TLB invalidation and a ref count decrement on the
zero page (assuming it's ref counted... perhaps it's not).

>  		goto unlock_release;
>  	} else {
>  		pmd_t entry;
> @@ -1536,8 +1536,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>  	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
>  	VM_BUG_ON_VMA(!vma->anon_vma, vma);
>  
> -	if (is_huge_zero_pmd(orig_pmd))
> -		goto fallback;
> +	VM_BUG_ON(is_huge_zero_pmd(orig_pmd));
>  
>  	spin_lock(vmf->ptl);
>  
> @@ -1606,7 +1605,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>  unlock_fallback:
>  	folio_unlock(folio);
>  	spin_unlock(vmf->ptl);
> -fallback:
>  	__split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
>  	return VM_FAULT_FALLBACK;
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f47a533014e..cc12deeb0593 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5488,15 +5488,15 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  	if (pud_trans_unstable(vmf.pud))
>  		goto retry_pud;
>  
> -	if (pmd_none(*vmf.pmd) &&
> +	vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
> +
> +	if (pmd_none_or_zero_folio(vmf.orig_pmd) &&
>  	    thp_vma_allowable_order(vma, vm_flags,
>  				    TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) {
>  		ret = create_huge_pmd(&vmf);
>  		if (!(ret & VM_FAULT_FALLBACK))
>  			return ret;
>  	} else {
> -		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
> -
>  		if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
>  			VM_BUG_ON(thp_migration_supported() &&
>  				  !is_pmd_migration_entry(vmf.orig_pmd));
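
Thinking out loud on my "uninstall" question above: this is very roughly what I
imagine the zero-page case in __do_huge_pmd_anonymous_page() would need.
Completely untested, and the handling of the deposited page table is a guess on
my part:

	/*
	 * Untested guess: if a huge zero page is mapped here rather than the
	 * PMD being none, clear it and flush the stale read-only TLB entries
	 * before installing the newly allocated THP.
	 */
	if (unlikely(is_huge_zero_pmd(*vmf->pmd)))
		pmdp_huge_clear_flush(vma, haddr, vmf->pmd);

	/*
	 * The huge zero page itself is refcounted per-mm (the reference is
	 * taken when it is first installed and dropped at __mmput()), so I
	 * don't think a per-mapping put is needed -- but the page table that
	 * was deposited when the zero page was mapped would presumably need
	 * to be withdrawn or reused here.
	 */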