From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 64377C4167B for ; Thu, 7 Dec 2023 11:23:07 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D5DE86B0087; Thu, 7 Dec 2023 06:23:06 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id D0DF26B0088; Thu, 7 Dec 2023 06:23:06 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BAEAB6B0089; Thu, 7 Dec 2023 06:23:06 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id AC7056B0087 for ; Thu, 7 Dec 2023 06:23:06 -0500 (EST) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 7A4121C057B for ; Thu, 7 Dec 2023 11:23:06 +0000 (UTC) X-FDA: 81539785572.01.7D5A2C3 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf14.hostedemail.com (Postfix) with ESMTP id 863F2100009 for ; Thu, 7 Dec 2023 11:23:04 +0000 (UTC) Authentication-Results: imf14.hostedemail.com; dkim=none; spf=pass (imf14.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com; dmarc=pass (policy=none) header.from=arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1701948184; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=n1qw5FBvQG4ZQd8+AY0SoFv2851ps+tEWS9GwrW6a8I=; b=ziYlgjMugXH6T3MJQ2SHl+4yatLr5B/wVcYRBU9GGU8D5ordcS75BGgank7ejTWHs2sHAj 2mXQOIMZ/JamU9IYmtxUzLbUCoSDjnG7agIf9x/aGtgQCKz1HKHD0cTRp0reiauO17NkDb zpedu0rEJwwJ8jGiyc9l4ZyaXAPB9lk= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1701948184; a=rsa-sha256; cv=none; b=YV4z7hoGsIAFK6gWq62YbiekOaeOETaepiLzn2ll9rXp3Bgmc28Pi4ef5zYqwC/wNWhh5Z Qz6CO3R6/PUdtc0dnUBllwYoAsZ0AAnytDD87YxPjyaH14UW8gc/sWtf4IapSlfkcClFIu HryYkCPeVLYbFbsim3eZVxtHUqKUSgA= ARC-Authentication-Results: i=1; imf14.hostedemail.com; dkim=none; spf=pass (imf14.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com; dmarc=pass (policy=none) header.from=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 5D9EC12FC; Thu, 7 Dec 2023 03:23:49 -0800 (PST) Received: from [10.1.32.134] (XHFQ2J9959.cambridge.arm.com [10.1.32.134]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id A05F73F762; Thu, 7 Dec 2023 03:23:00 -0800 (PST) Message-ID: <4aa520f0-7c84-4e93-88bf-aee6d8d3ea70@arm.com> Date: Thu, 7 Dec 2023 11:22:59 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface Content-Language: en-GB To: David Hildenbrand , Andrew Morton , Matthew Wilcox , Yin Fengwei , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A. Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins , Kefeng Wang , Barry Song <21cnbao@gmail.com>, Alistair Popple Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org References: <20231204102027.57185-1-ryan.roberts@arm.com> <20231204102027.57185-4-ryan.roberts@arm.com> <004ed23d-5571-4474-b7fe-7bc08817c165@redhat.com> <378afc6b-f93a-48ad-8aa6-ab171f3b9613@redhat.com> From: Ryan Roberts In-Reply-To: <378afc6b-f93a-48ad-8aa6-ab171f3b9613@redhat.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 863F2100009 X-Rspam-User: X-Stat-Signature: i1xe4fqhzxg9fww7eztp9aximpiykjx6 X-Rspamd-Server: rspam03 X-HE-Tag: 1701948184-241160 X-HE-Meta: U2FsdGVkX1/+jeMrh8Yc4dudRIGdfJ1NHbHMjMXk1An5j8ZtiCDWe5Tjomb/ymhKddErRExI5LmKpEpSQWn4pucpD1nD3pjxYu7UO/NhgPwsEfI7pxk7lFcKV44Duw5d48jqC2IHBiEs+s7ik0sv6W7PcbjkMqfw6nBiNtu7vOGDFT/bUc4UqwlhzIs5uocTw2JTTmXmbPhYo5GvFvVaK6G9ivVVIBaTfnFd6fMZHP/4AsFqqeOJdji5BJV9E1sbxc8dJ7mSKp0j3BHubT5gv34v0mNzm2zthJF3sdEiNvikS9d8lrltLx0fDbAqYiJWomhfGFCVPk2+BOW11lRNUy3BMicUu1SgjxnByVj8qxHivd76UHr4qwGoxOECRtZ5BInz/l2h2O8w7sQDIRzENjmT6XXhaWYzaljZ2Pu8YUwu0Q0YBh8v4qxDxQjFBByzyipQaT41RhAeEGzD/Sa5ggLKB0stdUH82PpOwxfukaDh62f8r0u/SgFlzPht7cX+OwRKaVZF+7pIAT2a6AQR4sHq+kHKYwY9l1fJUPkFAiuiQ2UrBZTLEj3Ost+1hcy8HDvylKuLEShmu6XBGTgRaErUknsnBqBA2gK796vjolZVJ9n2B6L5t2u6Jh9ZJ/+f7A2ZA2DaY5BPyBtubeTqhbNoJQVIBgEuFTgiGVTNcTu+NUmla5ZnSN8AKPJItLhacA49WH0BzQxpPCvRVyeZX8qVouiPqJLeZwFuhzqc7Z/hVxLppA50ZedaAO2mbVOAOhMYTn3pqn38Ls2VAeGojLgV9ySLI4UKHrZMC5+Kus30/wcxWKSqfdPrUTv/jXOX4/PwFbq4g+f9u9M/jEDwNJxjI1njhe0tWuegSbEJyGL8S+24YoYPTOS07fzk66QdMCl4PSXyLz+SspF2poQD6ztIyOJVP6MKHNApcE3IZ46PXds4NzE2WmIBBfYgx7F264Z2zsmjOrj/agX499H 7kW3htKJ X+J+7K52L4TBegMDxeuCY9viyImXEi34t5e9qjbkVCE7FHPvY1H/9hhzX5/Ra/sXUnkvQ4tchXpf1s2IRJxiPqbF5Cug8ldFxm+MVZbSniNEusaUI/Gkxhku6PuxJVQ2k0yh34uHj5KCBgXhY/zFrPNwc2A2B2bZu+j9o998kz1nB/JJ8y+r/a4OtPa1AsCailZPfkRaMIbAmP+EalpP6SD9euhpBGBmI9EkQ X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 07/12/2023 11:13, David Hildenbrand wrote: >>> >>>> + >>>>        if (!vma->vm_mm)        /* vdso */ >>>> -        return false; >>>> +        return 0; >>>>          /* >>>>         * Explicitly disabled through madvise or prctl, or some >>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, >>>> unsigned long vm_flags, >>>>         * */ >>>>        if ((vm_flags & VM_NOHUGEPAGE) || >>>>            test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) >>>> -        return false; >>>> +        return 0; >>>>        /* >>>>         * If the hardware/firmware marked hugepage support disabled. >>>>         */ >>>>        if (transparent_hugepage_flags & (1 << >>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED)) >>>> -        return false; >>>> +        return 0; >>>>          /* khugepaged doesn't collapse DAX vma, but page fault is fine. */ >>>>        if (vma_is_dax(vma)) >>>> -        return in_pf; >>>> +        return in_pf ? orders : 0; >>>>          /* >>>>         * khugepaged special VMA and hugetlb VMA. >>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, >>>> unsigned long vm_flags, >>>>         * VM_MIXEDMAP set. >>>>         */ >>>>        if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED)) >>>> -        return false; >>>> +        return 0; >>>>          /* >>>> -     * Check alignment for file vma and size for both file and anon vma. >>>> +     * Check alignment for file vma and size for both file and anon vma by >>>> +     * filtering out the unsuitable orders. >>>>         * >>>>         * Skip the check for page fault. Huge fault does the check in fault >>>> -     * handlers. And this check is not suitable for huge PUD fault. >>>> +     * handlers. >>>>         */ >>>> -    if (!in_pf && >>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE))) >>>> -        return false; >>>> +    if (!in_pf) { >>>> +        int order = first_order(orders); >>>> +        unsigned long addr; >>>> + >>>> +        while (orders) { >>>> +            addr = vma->vm_end - (PAGE_SIZE << order); >>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order))) >>>> +                break; >>> >>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the >>> compiler is smart enough to optimize the loop and everyything else out. >> >> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive, >> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order() >> (that's basically how it is laid out already, just all in one function). Is that >> what you are requesting? > > You got the spirit, yes. > >>> >>> [...] >>> >>>> + >>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj, >>>> +                     struct kobj_attribute *attr, >>>> +                     const char *buf, size_t count) >>>> +{ >>>> +    int order = to_thpsize(kobj)->order; >>>> +    ssize_t ret = count; >>>> + >>>> +    if (sysfs_streq(buf, "always")) { >>>> +        set_bit(order, &huge_anon_orders_always); >>>> +        clear_bit(order, &huge_anon_orders_inherit); >>>> +        clear_bit(order, &huge_anon_orders_madvise); >>>> +    } else if (sysfs_streq(buf, "inherit")) { >>>> +        set_bit(order, &huge_anon_orders_inherit); >>>> +        clear_bit(order, &huge_anon_orders_always); >>>> +        clear_bit(order, &huge_anon_orders_madvise); >>>> +    } else if (sysfs_streq(buf, "madvise")) { >>>> +        set_bit(order, &huge_anon_orders_madvise); >>>> +        clear_bit(order, &huge_anon_orders_always); >>>> +        clear_bit(order, &huge_anon_orders_inherit); >>>> +    } else if (sysfs_streq(buf, "never")) { >>>> +        clear_bit(order, &huge_anon_orders_always); >>>> +        clear_bit(order, &huge_anon_orders_inherit); >>>> +        clear_bit(order, &huge_anon_orders_madvise); >>> >>> Note: I was wondering for a second if some concurrent cames could lead to an >>> inconsistent state. I think in the worst case we'll simply end up with "never" >>> on races. >> >> You mean if different threads try to write different values to this file >> concurrently? Or if there is a concurrent fault that tries to read the flags >> while they are being modified? > > I thought about what you said first, but what you said last might also apply. As > long as "nothing breaks", all good. > >> >> I thought about this for a long time too and wasn't sure what was best. The >> existing global enabled store impl clears the bits first then sets the bit. With >> this approach you can end up with multiple bits set if there is a race to set >> diffierent values, and you can end up with a faulting thread seeing never if it >> reads the bits after they have been cleared but before setting them. > > Right, but user space is playing stupid games and can win stupid prices. As long > as nothing breaks, we're good. > >> >> I decided to set the new bit before clearing the old bits, which is different; A >> racing fault will never see "never" but as you say, a race to set the file could >> result in "never" being set. >> >> On reflection, it's probably best to set the bit *last* like the global control >> does? > > Probably might just slap a simple spinlock in there, so at least the writer side > is completely serialized. Then you can just set the bit last. It's unlikely that > readers will actually run into issues, and if they ever would, we could use some > rcu magic to let them read a consistent state. I'd prefer to leave it as it is now; clear first, set last without any explicit serialization. I've convinced myself that nothing breaks and its the same pattern used by the global control so its consistent. Unless you're insisting on the spin lock?