From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Sat, 13 Jul 2024 12:01:17 +0800
Subject: Re: [PATCH] mm/huge_memory: Avoid PMD-size page cache if needed
To: David Hildenbrand, Gavin Shan, Matthew Wilcox
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
    william.kucharski@oracle.com, ryan.roberts@arm.com, shan.gavin@gmail.com
References: <20240711104840.200573-1-gshan@redhat.com> <63a0364b-a2e0-48c2-b255-e976112deeb1@redhat.com>

On 
2024/7/13 09:03, David Hildenbrand wrote:
> On 12.07.24 07:39, Gavin Shan wrote:
>> On 7/12/24 7:03 AM, David Hildenbrand wrote:
>>> On 11.07.24 22:46, Matthew Wilcox wrote:
>>>> On Thu, Jul 11, 2024 at 08:48:40PM +1000, Gavin Shan wrote:
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -136,7 +136,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>>           while (orders) {
>>>>>               addr = vma->vm_end - (PAGE_SIZE << order);
>>>>> -            if (thp_vma_suitable_order(vma, addr, order))
>>>>> +            if (!(vma->vm_file && order > MAX_PAGECACHE_ORDER) &&
>>>>> +                thp_vma_suitable_order(vma, addr, order))
>>>>>                   break;
>>>>
>>>> Why does 'orders' even contain potential orders that are larger than
>>>> MAX_PAGECACHE_ORDER?
>>>>
>>>> We do this at the top:
>>>>
>>>>          orders &= vma_is_anonymous(vma) ?
>>>>                          THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>>>>
>>>> include/linux/huge_mm.h:#define THP_ORDERS_ALL_FILE \
>>>>          (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>>>>
>>>> ... and that seems very wrong.  We support all kinds of orders for
>>>> files, not just PMD order.  We don't support PUD order at all.
>>>>
>>>> What the hell is going on here?
>>>
>>> Yes, that's just absolutely confusing. I mentioned to Ryan lately
>>> that we should clean that up (I wanted to look into that, but am
>>> happy if someone else can help).
>>>
>>> There should likely be different defines for
>>>
>>> DAX (PMD|PUD)
>>>
>>> SHMEM (PMD) -- but soon more. Not sure if we want a separate
>>> ANON_SHMEM for the time being. Hm. But shmem is already handled
>>> separately, so maybe we can just ignore shmem here.
>>>
>>> PAGECACHE (1 .. MAX_PAGECACHE_ORDER)
>>>
>>> ? But it's still unclear to me.
>>>
>>> At least DAX must stay special I think, and PAGECACHE should be
>>> capped at MAX_PAGECACHE_ORDER.
>>>
>> David, I can help to clean it up.
>> Could you please help to confirm the following
>
> Thanks!
>
>> changes are exactly what you're suggesting? Hopefully, there is
>> nothing I've missed.
>> The original issue is fixed by these changes. With the changes
>> applied, madvise(MADV_COLLAPSE) returns errno -22 (EINVAL) in the
>> test program.
>>
>> The Fixes tag needs to be adjusted as well.
>>
>> Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface")
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2aa986a5cd1b..45909efb0ef0 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -74,7 +74,12 @@ extern struct kobj_attribute shmem_enabled_attr;
>>   /*
>>    * Mask of all large folio orders supported for file THP.
>>    */
>> -#define THP_ORDERS_ALL_FILE    (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>
> DAX doesn't have any MAX_PAGECACHE_ORDER restrictions (like hugetlb). So
> this should be
>
> /*
>  * FSDAX never splits folios, so the MAX_PAGECACHE_ORDER limit does not
>  * apply here.
>  */
> THP_ORDERS_ALL_FILE_DAX (BIT(PMD_ORDER) | BIT(PUD_ORDER))
>
> Something like that.
>
>> +#define THP_ORDERS_ALL_FILE_DAX                \
>> +       ((BIT(PMD_ORDER) | BIT(PUD_ORDER)) & (BIT(MAX_PAGECACHE_ORDER + 1) - 1))
>> +#define THP_ORDERS_ALL_FILE_DEFAULT    \
>> +       ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
>> +#define THP_ORDERS_ALL_FILE            \
>> +       (THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT)
>
> Maybe we can get rid of THP_ORDERS_ALL_FILE (to prevent abuse) and fix
> up THP_ORDERS_ALL instead.
>
>>   /*
>>    * Mask of all large folio orders supported for THP.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2120f7478e55..4690f33afaa6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -88,9 +88,17 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>          bool smaps = tva_flags & TVA_SMAPS;
>>          bool in_pf = tva_flags & TVA_IN_PF;
>>          bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
>> +       unsigned long supported_orders;
>> +
>>          /* Check the intersection of requested and supported orders. */
>> -       orders &= vma_is_anonymous(vma) ?
>> -                       THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>> +       if (vma_is_anonymous(vma))
>> +               supported_orders = THP_ORDERS_ALL_ANON;
>> +       else if (vma_is_dax(vma))
>> +               supported_orders = THP_ORDERS_ALL_FILE_DAX;
>> +       else
>> +               supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;
>
> This is what I had in mind.
>
> But do we have to special-case shmem as well, or will that be handled
> correctly?

For anonymous shmem, it is now the same as anonymous THP, which can
utilize THP_ORDERS_ALL_ANON. For tmpfs, we currently only support
PMD-sized THP (we will support larger orders in the future). Therefore,
I think we can reuse THP_ORDERS_ALL_ANON for shmem for now:

if (vma_is_anonymous(vma) || shmem_file(vma->vm_file))
	supported_orders = THP_ORDERS_ALL_ANON;

......