Date: Sat, 13 Jul 2024 20:57:59 +0800
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: David Hildenbrand, Gavin Shan, Matthew Wilcox
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 akpm@linux-foundation.org, william.kucharski@oracle.com,
 ryan.roberts@arm.com, shan.gavin@gmail.com
Subject: Re: [PATCH] mm/huge_memory: Avoid PMD-size page cache if needed
In-Reply-To: <11c95a82-cf13-414b-b489-1dd48255e022@redhat.com>
References: <20240711104840.200573-1-gshan@redhat.com>
 <63a0364b-a2e0-48c2-b255-e976112deeb1@redhat.com>
 <11c95a82-cf13-414b-b489-1dd48255e022@redhat.com>

On 2024/7/13 12:17, David Hildenbrand wrote:
> On 13.07.24 06:01, Baolin Wang wrote:
>>
>>
>> On 2024/7/13 09:03, David Hildenbrand wrote:
>>> On 12.07.24 07:39, Gavin Shan wrote:
>>>> On 7/12/24 7:03 AM, David Hildenbrand wrote:
>>>>> On 11.07.24 22:46, Matthew Wilcox wrote:
>>>>>> On Thu, Jul 11, 2024 at 08:48:40PM +1000, Gavin Shan wrote:
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -136,7 +136,8 @@ unsigned long __thp_vma_allowable_orders(struct
>>>>>>> vm_area_struct *vma,
>>>>>>>             while (orders) {
>>>>>>>                 addr = vma->vm_end - (PAGE_SIZE << order);
>>>>>>> -            if (thp_vma_suitable_order(vma, addr, order))
>>>>>>> +            if (!(vma->vm_file && order > MAX_PAGECACHE_ORDER) &&
>>>>>>> +                thp_vma_suitable_order(vma, addr, order))
>>>>>>>                     break;
>>>>>>
>>>>>> Why does 'orders' even contain potential orders that are larger than
>>>>>> MAX_PAGECACHE_ORDER?
>>>>>>
>>>>>> We do this at the top:
>>>>>>
>>>>>>            orders &= vma_is_anonymous(vma) ?
>>>>>>                            THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>>>>>>
>>>>>> include/linux/huge_mm.h:#define THP_ORDERS_ALL_FILE
>>>>>> (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>>>>>>
>>>>>> ... and that seems very wrong.  We support all kinds of orders for
>>>>>> files, not just PMD order.  We don't support PUD order at all.
>>>>>>
>>>>>> What the hell is going on here?
>>>>>
>>>>> yes, that's just absolutely confusing. I mentioned to Ryan recently
>>>>> that we should clean that up (I wanted to look into that, but am
>>>>> happy if someone else can help).
>>>>>
>>>>> There should likely be different defines for
>>>>>
>>>>> DAX (PMD|PUD)
>>>>>
>>>>> SHMEM (PMD) -- but soon more. Not sure if we want a separate ANON_SHMEM
>>>>> for the time being. Hm. But shmem is already handled separately, so
>>>>> maybe we can just ignore shmem here.
>>>>>
>>>>> PAGECACHE (1 .. MAX_PAGECACHE_ORDER)
>>>>>
>>>>> ? But it's still unclear to me.
>>>>>
>>>>> At least DAX must stay special I think, and PAGECACHE should be
>>>>> capped at MAX_PAGECACHE_ORDER.
>>>>>
>>>>
>>>> David, I can help to clean it up. Could you please help to confirm the
>>>> following
>>>
>>> Thanks!
>>>
>>>> changes are exactly what you're suggesting? Hopefully, there is
>>>> nothing I've missed.
>>>> The original issue is fixed by these changes. With the changes
>>>> applied, madvise(MADV_COLLAPSE)
>>>> returns with errno -22 in the test program.
>>>>
>>>> The Fixes tag needs to be adjusted as well.
>>>>
>>>> Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs
>>>> interface")
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 2aa986a5cd1b..45909efb0ef0 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -74,7 +74,12 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>>     /*
>>>>      * Mask of all large folio orders supported for file THP.
>>>>      */
>>>> -#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>>>
>>> DAX doesn't have any MAX_PAGECACHE_ORDER restrictions (like hugetlb). So
>>> this should be
>>>
>>> /*
>>>  * FSDAX never splits folios, so the MAX_PAGECACHE_ORDER limit does not
>>>  * apply here.
>>>  */
>>> THP_ORDERS_ALL_FILE_DAX    (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>>>
>>> Something like that
>>>
>>>> +#define THP_ORDERS_ALL_FILE_DAX                \
>>>> +       ((BIT(PMD_ORDER) | BIT(PUD_ORDER)) & (BIT(MAX_PAGECACHE_ORDER + 1) - 1))
>>>> +#define THP_ORDERS_ALL_FILE_DEFAULT    \
>>>> +       ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
>>>> +#define THP_ORDERS_ALL_FILE            \
>>>> +       (THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT)
>>>
>>> Maybe we can get rid of THP_ORDERS_ALL_FILE (to prevent abuse) and fix up
>>> THP_ORDERS_ALL instead.
>>>
>>>>     /*
>>>>      * Mask of all large folio orders supported for THP.
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 2120f7478e55..4690f33afaa6 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -88,9 +88,17 @@ unsigned long __thp_vma_allowable_orders(struct
>>>> vm_area_struct *vma,
>>>>            bool smaps = tva_flags & TVA_SMAPS;
>>>>            bool in_pf = tva_flags & TVA_IN_PF;
>>>>            bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
>>>> +       unsigned long supported_orders;
>>>> +
>>>>            /* Check the intersection of requested and supported
>>>> orders. */
>>>> -       orders &= vma_is_anonymous(vma) ?
>>>> -                       THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>>>> +       if (vma_is_anonymous(vma))
>>>> +               supported_orders = THP_ORDERS_ALL_ANON;
>>>> +       else if (vma_is_dax(vma))
>>>> +               supported_orders = THP_ORDERS_ALL_FILE_DAX;
>>>> +       else
>>>> +               supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;
>>>
>>> This is what I had in mind.
>>>
>>> But, do we have to special-case shmem as well or will that be handled
>>> correctly?
>>
>> For anonymous shmem, it is now the same as anonymous THP, which can
>> utilize THP_ORDERS_ALL_ANON.
>> For tmpfs, we currently only support PMD-sized THP
>> (larger orders will be supported in the future). Therefore, I think we
>> can reuse THP_ORDERS_ALL_ANON for shmem now:
>>
>> if (vma_is_anonymous(vma) || shmem_file(vma->vm_file))
>>     supported_orders = THP_ORDERS_ALL_ANON;
>> ......
>>
>
>
> It should be THP_ORDERS_ALL_FILE_DEFAULT (the MAX_PAGECACHE_ORDER
> limitation applies).

Yes, indeed, I missed the MAX_PAGECACHE_ORDER limitation.
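
For reference, a rough, untested sketch of how the final check in
__thp_vma_allowable_orders() could look with the above folded in. It
assumes the THP_ORDERS_ALL_FILE_DAX / THP_ORDERS_ALL_FILE_DEFAULT defines
from Gavin's draft, with DAX left uncapped as David suggested; shmem/tmpfs
then simply falls into the default file case and never sees orders above
MAX_PAGECACHE_ORDER:

	unsigned long supported_orders;

	/* Check the intersection of requested and supported orders. */
	if (vma_is_anonymous(vma))
		supported_orders = THP_ORDERS_ALL_ANON;
	else if (vma_is_dax(vma))
		/* FSDAX never splits folios, so it is not capped at MAX_PAGECACHE_ORDER */
		supported_orders = THP_ORDERS_ALL_FILE_DAX;
	else
		/* shmem/tmpfs and regular page cache: orders 1..MAX_PAGECACHE_ORDER */
		supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;

	orders &= supported_orders;
	if (!orders)
		return 0;

With that, no extra shmem special case would be needed in this function.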