From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <856f4317-43d9-41bb-8096-1eef69c86d3b@arm.com>
Date: Thu, 7 Dec 2023 11:44:03 +0000
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface
From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei, Yu Zhao,
 Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
 Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
 David Rientjes, Vlastimil Babka, Hugh Dickins, Kefeng Wang,
 Barry Song <21cnbao@gmail.com>, Alistair Popple
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org
References: <20231204102027.57185-1-ryan.roberts@arm.com>
 <20231204102027.57185-4-ryan.roberts@arm.com>
 <004ed23d-5571-4474-b7fe-7bc08817c165@redhat.com>
 <378afc6b-f93a-48ad-8aa6-ab171f3b9613@redhat.com>
 <4aa520f0-7c84-4e93-88bf-aee6d8d3ea70@arm.com>
 <44fdd46b-ad46-4ae2-a20f-20374acdf464@redhat.com>
In-Reply-To: <44fdd46b-ad46-4ae2-a20f-20374acdf464@redhat.com>
Content-Type: text/plain; charset=UTF-8
MIME-Version: 1.0

On 07/12/2023 11:25, David Hildenbrand wrote:
> On 07.12.23 12:22, Ryan Roberts wrote:
>> On 07/12/2023 11:13, David Hildenbrand wrote:
>>>>>
>>>>>> +
>>>>>>         if (!vma->vm_mm)        /* vdso */
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>           /*
>>>>>>          * Explicitly disabled through madvise or prctl, or some
>>>>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags,
>>>>>>          * */
>>>>>>         if ((vm_flags & VM_NOHUGEPAGE) ||
>>>>>>             test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>         /*
>>>>>>          * If the hardware/firmware marked hugepage support disabled.
>>>>>>          */
>>>>>>         if (transparent_hugepage_flags & (1 <<
>>>>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>           /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>>>>>         if (vma_is_dax(vma))
>>>>>> -        return in_pf;
>>>>>> +        return in_pf ? orders : 0;
>>>>>>           /*
>>>>>>          * khugepaged special VMA and hugetlb VMA.
>>>>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags,
>>>>>>          * VM_MIXEDMAP set.
>>>>>>          */
>>>>>>         if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>           /*
>>>>>> -     * Check alignment for file vma and size for both file and anon vma.
>>>>>> +     * Check alignment for file vma and size for both file and anon vma by
>>>>>> +     * filtering out the unsuitable orders.
>>>>>>          *
>>>>>>          * Skip the check for page fault. Huge fault does the check in fault
>>>>>> -     * handlers. And this check is not suitable for huge PUD fault.
>>>>>> +     * handlers.
>>>>>>          */
>>>>>> -    if (!in_pf &&
>>>>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>>>>> -        return false;
>>>>>> +    if (!in_pf) {
>>>>>> +        int order = first_order(orders);
>>>>>> +        unsigned long addr;
>>>>>> +
>>>>>> +        while (orders) {
>>>>>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>>>>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>>>>> +                break;
>>>>>
>>>>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>>>>> compiler is smart enough to optimize the loop and everything else out.
>>>>
>>>> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
>>>> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
>>>> (that's basically how it is laid out already, just all in one function). Is
>>>> that
>>>> what you are requesting?
>>>
>>> You got the spirit, yes.
>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>> +
>>>>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>>>>> +                     struct kobj_attribute *attr,
>>>>>> +                     const char *buf, size_t count)
>>>>>> +{
>>>>>> +    int order = to_thpsize(kobj)->order;
>>>>>> +    ssize_t ret = count;
>>>>>> +
>>>>>> +    if (sysfs_streq(buf, "always")) {
>>>>>> +        set_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>> +    } else if (sysfs_streq(buf, "inherit")) {
>>>>>> +        set_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>> +    } else if (sysfs_streq(buf, "madvise")) {
>>>>>> +        set_bit(order, &huge_anon_orders_madvise);
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +    } else if (sysfs_streq(buf, "never")) {
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>
>>>>> Note: I was wondering for a second if some concurrent games could lead to an
>>>>> inconsistent state. I think in the worst case we'll simply end up with "never"
>>>>> on races.
>>>>
>>>> You mean if different threads try to write different values to this file
>>>> concurrently? Or if there is a concurrent fault that tries to read the flags
>>>> while they are being modified?
>>>
>>> I thought about what you said first, but what you said last might also apply. As
>>> long as "nothing breaks", all good.
>>>
>>>>
>>>> I thought about this for a long time too and wasn't sure what was best. The
>>>> existing global enabled store impl clears the bits first then sets the bit.
>>>> With
>>>> this approach you can end up with multiple bits set if there is a race to set
>>>> different values, and you can end up with a faulting thread seeing never if it
>>>> reads the bits after they have been cleared but before setting them.
>>>
>>> Right, but user space is playing stupid games and can win stupid prizes. As long
>>> as nothing breaks, we're good.
>>>
>>>>
>>>> I decided to set the new bit before clearing the old bits, which is
>>>> different; A
>>>> racing fault will never see "never" but as you say, a race to set the file
>>>> could
>>>> result in "never" being set.
>>>>
>>>> On reflection, it's probably best to set the bit *last* like the global control
>>>> does?
>>>
>>> Probably might just slap a simple spinlock in there, so at least the writer side
>>> is completely serialized. Then you can just set the bit last. It's unlikely that
>>> readers will actually run into issues, and if they ever would, we could use some
>>> rcu magic to let them read a consistent state.
>>
>> I'd prefer to leave it as it is now; clear first, set last without any explicit
>> serialization. I've convinced myself that nothing breaks and it's the same
>> pattern used by the global control so it's consistent. Unless you're insisting on
>> the spin lock?
>
> No, not at all. But it would certainly remove any possible concerns :)

OK fine, you win :). I'll add a spin lock on the writer side.