From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <126c3b71-1acc-4851-9986-4228cb8a8660@arm.com>
Date: Thu, 7 Dec 2023 12:08:22 +0000
MIME-Version: 1.0
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP
Content-Language: en-GB
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei, Yu Zhao,
 Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
 Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
 David Rientjes, Vlastimil Babka, Hugh Dickins, Kefeng Wang,
 Barry Song <21cnbao@gmail.com>, Alistair Popple
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
References: <20231204102027.57185-1-ryan.roberts@arm.com> <20231204102027.57185-5-ryan.roberts@arm.com> <71040a8c-4ea1-4f21-8ac8-65f7c25c217e@redhat.com>
From: Ryan Roberts <ryan.roberts@arm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins , Kefeng Wang , Barry Song <21cnbao@gmail.com>, Alistair Popple Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org References: <20231204102027.57185-1-ryan.roberts@arm.com> <20231204102027.57185-5-ryan.roberts@arm.com> <71040a8c-4ea1-4f21-8ac8-65f7c25c217e@redhat.com> From: Ryan Roberts In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Stat-Signature: f4e6esj4ac3xqhm16op6uez1ys61dpqr X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: CA45D1A000A X-Rspam-User: X-HE-Tag: 1701950907-518847 X-HE-Meta: U2FsdGVkX19+K3jWyveLYsNlUsMRTHJ5Ks5BqsmzMp3g6VUuJhe0DHlLiX7AFLxkwODVhrkoCE3snKJSb48kcK+tqXM7hiFJtR+bVzkplJMvZyZLP6Qb4HwhThB1dOGh+Ej0pNTdCHmtZkRhTy69kCt1PqpCfELSlb73afdRR9y1z8xHP+RsclZBkCuXEZt/UZOfUG+OheR1wbRZyv4qhDTl43RHkQenVfV6GJzyReoagaQVdKSXDJhwh55wltZA3ANM0OUbo5j7ChRWLA/QGDZ8P3NEbbeit/R97HgRgLZuSH8TElLC5xA4aGMkFIsAMpXNZ7kx2Y3PhO+7pt9oRWrbISehbduvgaJ97JFvtmt5MPcv3jvI+N+QnLiFJcojdbREDiL7uGzYBYC9bJuJJzW5Hd81F9yPlsvj36zqGOswxbKfxLWACiOHAPDtd/do+MyVJ0SYG/4QbkuzlizD7uMYSaZF5iWhYu10TM5gjuB5hnYi5Ge6E3AbKjazad4q4fFtHPwb/IUWY6OUhPdR6440mdS6PMFmEXjqQYAqhSwy1c7ohNPzLoiUqJihTvHmQvSiTwTbfw8I9q1k0fZgf+AMXfZNoUza5qa0FPzMvZdjSkHyP4b8ZGA/GDiSR48n0C9drOrveZ4VOe4o3v+EVgWWkUsLlv3XdJbFki7sBianGE5yegtawlMW09F946q3wFga8mRg8ZoQPRaNjrKkn0GhkyQIL816yv50W0sxfqqurBQHjwG3ta1kVZFIHSA3WvjuGfmPUuOKNqTv5N1cuQhBL1zUf38EphcSGjPjR5NQd+QW3OF0wwqLboCzvL7ZP71tk2gr8giUTtHwhflACKMZfZlHnoY+mmXzm7JscDPE39+Bbz8Ze37SbVCEaZitdm2BaLAwg1vdSCfz4i3N/xmLr4wmCnHVeZkPYIoUPucLTGWTX7XxDnpDlD9mLh4i31Sx2oRCcGDPkGWztMT +1j61sx/ XqKnBifAZR+Jv56OoD4p7jibjtJzQmB6H7fqWGnv5Tae1NROYJjpBG7hlo+okLrOA4mMpEqjsQ/t58PovIAtLSf8Vgx9e7YVAGZm0dfB9hqYrNPBXC+UrL2wGFLa5GHD4mgHxCfzc/krF4d1cXJX5MkpRtXBc47IU9EWgeuQkAvSmSQku8tBrer+JBVZYKcfRSFnncGdLbkY43KRxj4hUdnGNWxfEUnwNJ0oL X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 07/12/2023 11:08, David Hildenbrand wrote: > [...] > >>> >>> Nit: the orders = ... order = ... looks like this might deserve a helper >>> function that makes this easier to read. >> >> To be honest, the existing function that I've modified is a bit of a mess. > > It's all an ugly mess and I hate it. > > It would be cleanest if we'd just have "thp_vma_configured_orders()" that gives > us all configured orders for the given VMA+flags combination. No passing in of > orders, try handling the masking in the caller. > > Then, we move that nasty "transhuge_vma_suitable" handling for !in_pf out of > there and handle that in the callers. The comment "Huge fault does the check in > fault handlers. And this check is not suitable for huge PUD fault handlers." > already makes me angry, what a mess. My thp_vma_suitable_order[s]() does now at least work correctly for PUD. > > > Then, we'd have a thp_vma_fitting_orders() / thp_vma_is_fitting_order() function > that does the filtering only based on the given address + vma size/alignment. > That's roughly "thp_vma_suitable_orders()". > > > Finding a good name to combine both could be something like > "thp_vma_possible_orders()". > > > Would make more sense to me (but again, German guy, so it's probably all wrong). > > >> thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a >> page fault, because the page fault handlers already do that check themselves. 
>>
>>>
>>> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
>>> helper might be reasonable where that is handled internally.
>>
>> Because thp_vma_suitable_orders() will handle it safely and is inline, so it
>> should be just as efficient? This would go away with the refactoring described
>> above.
>
> Right. Won't win in a beauty contest. Some simple helper might make this much
> easier to digest.
>
>>
>>>
>>> Comment: For order-0 we'll always perform a function call to both
>>> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
>>> fast and efficient check if any THP orders are even enabled for this VMA, and
>>> in that case just fallback before doing more expensive checks.
>>
>> thp_vma_allowable_orders() is inline as you mentioned.
>>
>> I was deliberately trying to keep all the decision logic in one place
>> (thp_vma_suitable_orders) because it's already pretty complicated. But if you
>> insist, how about this in the header:
>>
>> static inline
>> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>                        unsigned long vm_flags, bool smaps,
>>                        bool in_pf, bool enforce_sysfs,
>>                        unsigned long orders)
>> {
>>     /* Optimization to check if required orders are enabled early. */
>>     if (enforce_sysfs && vma_is_anonymous(vma)) {
>>         unsigned long mask = READ_ONCE(huge_anon_orders_always);
>>
>>         if (vm_flags & VM_HUGEPAGE)
>>             mask |= READ_ONCE(huge_anon_orders_madvise);
>>         if (hugepage_global_always() ||
>>             ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>>             mask |= READ_ONCE(huge_anon_orders_inherit);
>>
>>         orders &= mask;
>>         if (!orders)
>>             return 0;
>>
>>         enforce_sysfs = false;
>>     }
>>
>>     return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
>>                       enforce_sysfs, orders);
>> }
>>
>> Then the above check can be removed from __thp_vma_allowable_orders() - it will
>> still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.
>>
>
> Better. I still kind-of hate having to pass in orders here. Such masking is
> better done in the caller (see above how it might be done when moving the
> transhuge_vma_suitable() check out).
>
>>
>>>
>>>> +
>>>> +    if (!orders)
>>>> +        goto fallback;
>>>> +
>>>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>> +    if (!pte)
>>>> +        return ERR_PTR(-EAGAIN);
>>>> +
>>>> +    order = first_order(orders);
>>>> +    while (orders) {
>>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>> +        vmf->pte = pte + pte_index(addr);
>>>> +        if (pte_range_none(vmf->pte, 1 << order))
>>>> +            break;
>>>
>>> Comment: Likely it would make sense to scan only once and determine the "largest
>>> none range" around that address, having the largest suitable order in mind.
>>
>> Yes, that's how I used to do it, but Yu Zhao requested simplifying to this,
>> IIRC. Perhaps this is an optimization opportunity for later?
>
> Yes, definitely.
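For context, since both loop shapes under discussion hinge on it:
pte_range_none() tests whether a candidate range of PTEs is still completely
unpopulated. In spirit it is just a linear scan; a sketch of the idea, not
necessarily the exact hunk from the series:

/*
 * Sketch: return true if all 'nr_pages' PTEs starting at 'pte' are none,
 * i.e. the candidate range is still completely unpopulated, so a large
 * folio can be mapped there without overwriting anything.
 */
static bool pte_range_none(pte_t *pte, int nr_pages)
{
	int i;

	for (i = 0; i < nr_pages; i++) {
		if (!pte_none(ptep_get(pte + i)))
			return false;
	}

	return true;
}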
>
>>
>>>
>>>> +        order = next_order(&orders, order);
>>>> +    }
>>>> +
>>>> +    vmf->pte = NULL;
>>>
>>> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
>>> variable will make this code look less magical. Unless I am missing something
>>> important :)
>>
>> Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
>> approach that was suggested by Yu Zhao IIRC). But since I did some refactoring
>> based on some comments from JohnH, I see I don't need that anymore. Agreed; it
>> will be much clearer just to use a local variable. Will fix.
>>
>>>
>>>> +    pte_unmap(pte);
>>>> +
>>>> +    gfp = vma_thp_gfp_mask(vma);
>>>> +
>>>> +    while (orders) {
>>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>> +        if (folio) {
>>>> +            clear_huge_page(&folio->page, addr, 1 << order);
>>>> +            return folio;
>>>> +        }
>>>> +        order = next_order(&orders, order);
>>>> +    }
>>>> +
>>>
>>> Question: would it make sense to combine both loops? I suspect memory
>>> allocations with pte_offset_map()/kmap are problematic.
>>
>> They are both operating on separate orders; next_order() is "consuming" an order
>> by removing the current one from the orders bitfield and returning the next one.
>>
>> So the first loop starts at the highest order and keeps checking lower orders
>> until one fully fits in the VMA. And the second loop starts at the first order
>> that was found to fully fit and loops to lower orders until an allocation is
>> successful.
>
> Right, but you know from the first loop which order is applicable (and will be
> fed to the second loop) and could just pte_unmap(pte) + tryalloc. If that fails,
> remap and try with the next orders.

You mean something like this?

pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
if (!pte)
    return ERR_PTR(-EAGAIN);

order = highest_order(orders);
while (orders) {
    addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
    if (!pte_range_none(pte + pte_index(addr), 1 << order)) {
        order = next_order(&orders, order);
        continue;
    }

    pte_unmap(pte);

    folio = vma_alloc_folio(gfp, order, vma, addr, true);
    if (folio) {
        clear_huge_page(&folio->page, vmf->address, 1 << order);
        return folio;
    }

    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
    if (!pte)
        return ERR_PTR(-EAGAIN);

    order = next_order(&orders, order);
}

pte_unmap(pte);

I don't really like that, because if a high order folio allocation fails, then
you end up calling pte_range_none() again for the next lower order; once that
check has succeeded for an order it shouldn't be required for any lower orders.
In this case you also have lots of pte map/unmap.

The original version feels more efficient to me.

> That would make the code certainly easier to understand. That "orders" magic of
> constructing, filtering, walking is confusing :)
>
>
> I might find some time today to see if there is an easy way to cleanup all what
> I spelled out above. It really is a mess. But likely that cleanup could be
> deferred (but you're touching it, so ... :) ).

I'm going to ignore the last 5 words. I heard the "that cleanup could be
deferred" part loud and clear though :)

>
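A closing note on the "orders" mechanics, for anyone puzzled by the walk:
orders is a bitfield with one bit per candidate order; highest_order() picks
the top set bit and next_order() clears the current bit and returns the next
highest. A self-contained userspace illustration of the same walk (demo code
mirroring the semantics described above, not the kernel helpers themselves;
assumes a 64-bit unsigned long):

#include <stdio.h>

/* Index of the highest set bit, or -1 if the mask is empty. */
static int highest_order(unsigned long orders)
{
	return orders ? 63 - __builtin_clzl(orders) : -1;
}

/* "Consume" the current order from the mask, return the next lower one. */
static int next_order(unsigned long *orders, int prev)
{
	*orders &= ~(1UL << prev);
	return highest_order(*orders);
}

int main(void)
{
	/* Candidate orders 5, 3 and 2 enabled. */
	unsigned long orders = (1UL << 5) | (1UL << 3) | (1UL << 2);
	int order = highest_order(orders);

	/* Walks 5 -> 3 -> 2, exactly like the two fallback loops above. */
	while (orders) {
		printf("trying order %d\n", order);
		order = next_order(&orders, order);
	}
	return 0;
}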