From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DD460C3DA7A for ; Thu, 5 Jan 2023 10:44:23 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5FE1A8E0005; Thu, 5 Jan 2023 05:44:23 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 5AC988E0002; Thu, 5 Jan 2023 05:44:23 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 44D058E0005; Thu, 5 Jan 2023 05:44:23 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 3204A8E0002 for ; Thu, 5 Jan 2023 05:44:23 -0500 (EST) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id F3F3A140D2E for ; Thu, 5 Jan 2023 10:44:22 +0000 (UTC) X-FDA: 80320411164.29.5FEBFC0 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf03.hostedemail.com (Postfix) with ESMTP id C65B720002 for ; Thu, 5 Jan 2023 10:44:20 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=Pio0qY3f; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf03.hostedemail.com: domain of david@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=david@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1672915460; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=dHGUlysATM3xDPfSPCYiDsP85l+XY5+rxynfVJoLVqg=; b=ol6dVADhrfDMKCloq27Ooy1sN32Y+xlGqC/nsHl3Cby3gNsTNqxSXZ0H4Wd2UVya80/t+I 2NgSigVW2sCqdt5jNYJsXKWeZnXuHQL3FZpxGV+D7CBrR/1UgviOy2iV0Si4O55GWLPn9a wAQ7fmlrbSDuAwyivp4VI5PqRd8oLGg= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=Pio0qY3f; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf03.hostedemail.com: domain of david@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=david@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1672915460; a=rsa-sha256; cv=none; b=wrXBB6HqIfAJupaHDEVUrq2FMy/l4whBcjFOllYGRo8sRJhsaXs2b2CURmpG4J3Nz0o5/Y AZk385djwENvKO1NUhvTqAqK6HQIfXyCfik+/TjeTkQv1MKWWtOref40PnzdkVaEzGWGvm aU0+d3yrIOYMAq+cQ+CaSm8yLqH9Ddc= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1672915460; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=dHGUlysATM3xDPfSPCYiDsP85l+XY5+rxynfVJoLVqg=; b=Pio0qY3f2TOyY42/Ogwpr7EoqTLiILZbxuDxuF5PXlWKgZQ5v/wsQ5rat32jS+QBRdKLTd B73FtPi1133a9W31akERhpWxmP0O4A2Xuug8+9akvRT9Ga6Z/20OCLZVlu28UrtxtU/DB+ bbyOuQ1nDYCdbQNP1crYiAWS6zxVQhA= Received: from mail-wr1-f70.google.com (mail-wr1-f70.google.com [209.85.221.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-169-J7XXT3b2MRuJDjbzrPB5vQ-1; Thu, 05 Jan 2023 05:44:16 -0500 X-MC-Unique: J7XXT3b2MRuJDjbzrPB5vQ-1 Received: by mail-wr1-f70.google.com with SMTP id o5-20020adfba05000000b0029064ccbe46so2298179wrg.9 for ; Thu, 05 Jan 2023 02:44:16 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:subject:organization:from :references:cc:to:content-language:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=dHGUlysATM3xDPfSPCYiDsP85l+XY5+rxynfVJoLVqg=; b=dQxrY18j4mYuDGEshDUbPLc29b2tuuu5jeEULi6PjG+s1GmfQsTJpM6yvIfPnemz4H MZCQL6SHqBxzOf8VfIw13GMvSCqm969zzVvA2peY/140Jkrnyvym6Sr9c0urp+pKVecw i5FDs25MUPuAIzP4i/+1vsmC2YKIdUfKkYfsc8FwsI0120658HaHLOWGip2+IBppCH5L Myh6p0sZzQpDWdaSedOfJaEoAeeGEP3YLwYSfIqJqguCuIKoemAOpuISfXr5nEZIA3L/ k4aOYxwKEnY0q9rkfz8i4c39cmFZZ/DPlSNC45cqnkzPc/sI3MKulzGq3TzXn4zqGOmc Qnww== X-Gm-Message-State: AFqh2konh14+06QePxSmsCeDLSOtuhXekVrxr/Bmp/CNm8hegvwW9qfh gm4JJsu+c52OKc5RR+n3uNPK8D55INsV50smx4OQ8Yu+E2F+OMHoMpOKKDStcrKekHz1Ty4DyAZ S9Q1Qgi8FBh0= X-Received: by 2002:a05:600c:4da0:b0:3d2:3a53:2cd6 with SMTP id v32-20020a05600c4da000b003d23a532cd6mr36441754wmp.9.1672915455426; Thu, 05 Jan 2023 02:44:15 -0800 (PST) X-Google-Smtp-Source: AMrXdXsl4R+7ZBPxCWTZs7iHoDXZlo48wqxcAg3HTZQ0UbVr1JGxsNvD/iszjBEz6cdqocLUbGnWkw== X-Received: by 2002:a05:600c:4da0:b0:3d2:3a53:2cd6 with SMTP id v32-20020a05600c4da000b003d23a532cd6mr36441736wmp.9.1672915455163; Thu, 05 Jan 2023 02:44:15 -0800 (PST) Received: from ?IPV6:2003:cb:c707:6e00:ff02:ec7a:ded5:ec1e? (p200300cbc7076e00ff02ec7aded5ec1e.dip0.t-ipconnect.de. [2003:cb:c707:6e00:ff02:ec7a:ded5:ec1e]) by smtp.gmail.com with ESMTPSA id o19-20020a05600c339300b003cff309807esm1823807wmp.23.2023.01.05.02.44.14 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 05 Jan 2023 02:44:14 -0800 (PST) Message-ID: <86d5f618-800d-9672-56c4-9309ef222a39@redhat.com> Date: Thu, 5 Jan 2023 11:44:13 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0 To: Yang Shi Cc: Hugh Dickins , Andrew Morton , Jann Horn , Zach O'Keefe , Song Liu , linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <32be06f-f64-6632-4c36-bed7c0695a3b@google.com> <7ff97950-b524-db06-9ad6-e98b80dcfefa@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH] mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vma In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: C65B720002 X-Stat-Signature: za4uh6dgjsxd54eo9jhabx6otoa97oxf X-HE-Tag: 1672915460-271508 X-HE-Meta: U2FsdGVkX184rHCC0jN6pQNAQ5vhHCQKQYEw9mmXoKSg8EvytY0Y3YtJph1ApTJ5vEtHFdemY/wCJpnRUmfApDFBASVac3zweyllPfeCBSxzFij9L3W3NQTQjLQcZEVlKCJ6gVGgnI35dEpmQKkOF6J0sLATcQxWCK2wFItw8B3bhepx3cTRz+qUyIgpYS56G4c0yrfD2u4yjXSyl9xWB/LPTMyOKEjo6BDTMrfqkbt7gnk7toyS0ITvsAgYoUrcgbRpmZG0a4M86//oAoX52DemKlBEHan+/ZW8psBuBw421EsgAr1/kKGEzYQgZXKYGC7L+GpvRhWHoGYA8qw71yrFMf3pJr1JO+2t66hr4usNyt+zUc8Ioh40enGwZahmZK8n9aYXBE/qqrWuSep4MbsjV5ZJa32KdJufjKXCjx2YqT/Mn6JivnFdgwtgXrmRmViBMytF37vANUwlszE+Ckn1IrDoHG1o+JhZng19ByFO0e3/+a6C/nDeUcsZVNEfYiyd3+wwS6EWSY6KGhtJGqSbggmF2g0Sz8i/Yfy8x8vRcXm36zK66ytwYlcz6EUgc7lbxF22zXDC8VXdyfozZzbGVb/Gv4JPMRXc198WkCk6akY5en93ofKJBA3FBeMoq6SWetQDJHpHjQkLVMBhOb0A+ytkqtfzQh58Fyv7E6RAYs47RYEEmxHBSBY4cxYB/89bcBzOsy0Ja3yYMZBb6k3FLhFRgOcexB/sJk931avcDtUuPo0hwEuC3cftN8gYIsHT42NaCgJF2/zd3xs2DHNFpFW25ra7OEyGKZOdPF/jmOgeNLLsbmuhNM36XwwWerhKEIGWowyA2dw0MEC6D0zaFYRYhFi6p0x8M0rEMxzRlUT/OT06UJ3h2KE1q3atI8IrZlZbHlAxCiaL6/vIXAN6Lvb8d6uXdHfyOlrWUO2oaad6FpAQOrHAfRllXB7xDopH8FXkw840IgcA4ND 2fkwEQ2l lQjlru4HrqjsA22EeFtW6BakfN2IJ7SoQkqsCI1uH6ev/wDc97yO9Ft7Y8vDUCNMr0rvmb/nIRE9rJLYbN/iYl+nIkoWwQjg7xKYMrJ/x3rcbx2o9O3lOBHal7IUNBDVzzDc+YhBJspSFiCfr0XF6UWyiYkxQuxj6AYwBNoUnz2h08Mh4xFdrm4q65vuKzZg8575f X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 05.01.23 01:03, Yang Shi wrote: > On Wed, Jan 4, 2023 at 1:20 AM David Hildenbrand wrote: >> >>>> Or am I wrong? >>>> >>>>> Is anon_vma lock required? Almost not: if any page other than expected >>>>> subpage of the non-anon huge page is found in the page table, collapse is >>>>> aborted without making any change. However, it is possible that an anon >>>>> page was CoWed from this extent in another mm or vma, in which case a >>>>> concurrent lookup might look here: so keep it away while clearing pmd >>>>> (but perhaps we shall go back to using pmd_lock() there in future). >>>>> >>>>> Note that collapse_pte_mapped_thp() is exceptional in freeing a page table >>>>> without having cleared its ptes: I'm uneasy about that, and had thought >>>>> pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem, >>>>> and we would have to move the mmu_notification if clearing those ptes. >>>>> >>>>> Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table >>>>> retraction") >>>>> Signed-off-by: Hugh Dickins >>>>> Cc: Jann Horn >>>>> Cc: Yang Shi >>>>> Cc: David Hildenbrand >>>>> Cc: Zach O'Keefe >>>>> Cc: Song Liu >>>>> Cc: [5.4+] >>>>> --- >>>>> What this fixes is not a dangerous instability! But I suggest Cc stable >>>>> because uprobes "healing" has regressed in that way, so this should follow >>>>> 8d3c106e19e8 into those stable releases where it was backported (and may >>>>> want adjustment there - I'll supply backports as needed). >>>> >>>> If it's really something that doesn't matter in practice (e.g., -1% >>>> performance while debugging :) ), I guess no CC is needed. If there are real >>>> production workloads that suffer, I guess ccing stable is fine. >>> >>> It's about recovering performance *after* debugging. It is not something >>> that is of any value to me personally, nor (so far as I know) to anyone >>> whom I work with. But it is something which Song Liu went to the trouble >>> to make possible in his "THP aware uprobe" series three years ago, and it >>> is something which Jann unintentionally regressed in his recent commit: >>> so I thought it proper to reinstate where regressed. >> >> Right, although I wonder if that original series fixed a real >> performance issue or was more a "this makes sense, let's just optimize >> this corner case by some serious complexity". I hope it's not the latter :) >> >>> >>> (What I do have more of an investment in, is for MADV_COLLAPSE to be able >>> to collapse some extents in a large vma where some other extent got CoWed, >>> so giving the whole vma an anon_vma. But that's not an issue for -stable, >>> and I cannot tell you offhand whether undoing this anon_vma exclusion is >>> enough to enable that or not - I suspect not, I suspect a result code or >>> switch statement needs to be adjusted too.) >> >> Yeah, having a single COWed page in a large MAP_PRIVATE file/shmem >> mapping would disable collapse, so it's the right thing to do. >> >> Thinking about it some more, and the effective code change, stable >> doesn't sound wrong. >> >>>> >>>> >>>> Side note: set_huge_pmd() wins the award of "ugliest mm function of early >>>> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be >>>> writable or not. Turns out it's communicated via vm_fault->flags. Just >>>> horrible. >>> >>> I firmly disagree - it's from 2022! and much too small to be ugliest; >>> but I haven't thought about the aspect that is bothering you there. >> >> The ugliest I stumbled over in early 2023 -- until January 2nd :D >> >>> >>> What's bothered me most about it, is the way its name, and the naming of >>> the do_set_pmd() it interfaces with, give no hint that they are entirely >>> about file (or shmem) vmas, and would not work right on anon vmas >>> (I forget whether it's just a matter of which stats updated, or more). >> >> Yes. I dug very deep into in-place collapse yesterday because I was >> briefly concerned about anon THP, and it took me longer to understand >> that whole machinery than it should (and that anon THP never ever >> collapse in-place). >> >> Some of that khugepaged stuff needs some *serious* cleanups and >> refactoring. do_set_pmd() is not an exception. >> >> >> Some more examples: >> >> if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) { >> ... >> hpage_collapse_scan_file() >> } else { >> hpage_collapse_scan_pmd() >> ... >> } >> >> >> 1) hpage_collapse_scan_pmd() is only for anon memory. Totally obvious >> from the name. But why are we potentially calling it for VMAs that >> are not applicable? For maximum David confusion? > > IIRC the VMAs are checked before, what do you mean about "not > applicable"? But anyway khugepaged/MADV_COLLAPSE does release and I assume when CONFIG_SHMEM=n with ordinary file-thp we'll end up calling it. > reacquire mmap_lock multiple times, so there are multiple places to > check VMAs validity. > hpage_collapse_scan_pmd() should be renamed to something like hpage_collapse_scan_an/on() and the duplicate code in khugepaged_scan_mm_slot() and madvise_collapse() should be factored out into something like: hpage_collapse_scan(vma, addr, cc) { if (vma->vm_file) { ... hpage_collapse_scan_file() ... } else if (vma_is_anonymous(vma)) { hpage_collapse_scan_anon() } else { WARN_ON_ONCE(); } } Any CONFIG_SHMEM etc. optimizations to compile that code out should go into hpage_collapse_scan_file() IMHO. ... also properly checking for ordinary file THP support. ... and we'd really decide on a terminology "transhuge", "hugepage", "hpage", it's a mess. hpage might be easiest, or simply "thp". We just need a way to distinguish all that stuff from hugetlb. >> >> 2) "IS_ENABLED(CONFIG_SHMEM) && vma->vm_file" is also supposed to cover >> ordinary file-thp. Totally obvious from the IS_ENABLED(CONFIG_SHMEM) >> ... I probably spent 30minutes understanding what's happening here. >> Just misleading and wrong without CONFIG_SHMEM. >> >> >> ... and what's easier to get than this magic set of boolean flags: >> >> hugepage_vma_check(vma, vma->vm_flags, false, false, true) > > This is not perfect. I was thinking about changing them to one flag, > just like TTU_ flags used by try_to_unmap(). That may make things > cleaner. > We should provide similar flags to hugepage_vma_revalidate() and just replace the "cc" parameter by a way to indicate is_khugepaged. Passing in cc is just overkill. We'd name the functions thp_vma_validate() and thp_vma_revalidate() or sth. like that. >> >> ... and obviously >> hugepage_vma_revalidate() >> is supposed to be a follow up to a previous >> hugepage_vma_check() >> and totally different from >> transhuge_vma_suitable() >> >> Hard to make it even less consistent. > > This was after my cleanup, it was much messier before. And I did add > comments to make them more understandable, but anyway better naming is > definitely welcome. Yeah, I appreciate any previous and any future cleanups in that area. For example: why even *care* about the complexity of installing a PMD in collapse_pte_mapped_thp() using set_huge_pmd() just for MADV_COLLAPSE? Sure, we avoid a single page fault afterwards, but is this *really* worth the extra code here? I mean, after we installed the PMD, the page could just get reclaimed either way, so there is no guarantee that we have a PMD mapped once we return to user space IIUC. Anyhow, don't want to hijack this thread. I was just forced to understand that code and a lot jumped at me :) -- Thanks, David / dhildenb