From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 0A4D6C433EF
	for <linux-mm@archiver.kernel.org>; Wed,  1 Jun 2022 11:27:04 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 575668D0009; Wed,  1 Jun 2022 07:27:04 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 5240E8D0006; Wed,  1 Jun 2022 07:27:04 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 3E6328D0009; Wed,  1 Jun 2022 07:27:04 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id 2EA968D0006
	for <linux-mm@kvack.org>; Wed,  1 Jun 2022 07:27:04 -0400 (EDT)
Received: from smtpin31.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay02.hostedemail.com (Postfix) with ESMTP id D42D9341AA
	for <linux-mm@kvack.org>; Wed,  1 Jun 2022 11:27:03 +0000 (UTC)
X-FDA: 79529440326.31.00192F9
Received: from out30-42.freemail.mail.aliyun.com (out30-42.freemail.mail.aliyun.com [115.124.30.42])
	by imf04.hostedemail.com (Postfix) with ESMTP id 913A740067
	for <linux-mm@kvack.org>; Wed,  1 Jun 2022 11:26:39 +0000 (UTC)
X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=alimailimapcm10staff010182156082;MF=rongwei.wang@linux.alibaba.com;NM=1;PH=DS;RN=5;SR=0;TI=SMTPD_---0VF3iTWj_1654082812;
Received: from 30.240.97.18(mailfrom:rongwei.wang@linux.alibaba.com fp:SMTPD_---0VF3iTWj_1654082812)
          by smtp.aliyun-inc.com(127.0.0.1);
          Wed, 01 Jun 2022 19:26:53 +0800
Message-ID: <08fb5da7-a390-1880-da3f-e1d480047caa@linux.alibaba.com>
Date: Wed, 1 Jun 2022 19:26:51 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0)
 Gecko/20100101 Thunderbird/101.0
Subject: Re: mm/khugepaged: collapse file/shmem compound pages
Content-Language: en-US
To: Zach O'Keefe <zokeefe@google.com>
Cc: Matthew Wilcox <willy@infradead.org>, David Rientjes
 <rientjes@google.com>, "linux-mm@kvack.org" <linux-mm@kvack.org>,
 Hugh Dickins <hughd@google.com>
References: <CAAa6QmTLOLoygZeGgnsVHH_+wV78cN45aqbsYnXPACjNME7jCw@mail.gmail.com>
 <Yo5+c6vFyLgjtVsG@casper.infradead.org>
 <CAAa6QmQhqHrKa5M4vRCAPtOa4pTet6MfELprN2Wb0rv46PSjTA@mail.gmail.com>
 <Yo71pPTkB7taSb9Y@casper.infradead.org>
 <CAAa6QmRfCrGc1RYyy_o4dGiKZJ8ZehBH9Lfg5g09SwXvBXx7HQ@mail.gmail.com>
 <YpBJy9wQXABZeHLL@casper.infradead.org>
 <CAAa6QmQgAYJonu=mbv5NZ3DuYOphc9wj2PYV3gBg3=DH_aSM-A@mail.gmail.com>
 <YpGbnbi44JqtRg+n@casper.infradead.org>
 <CAAa6QmT8WBsbZLunGDJQtZDmaibfU=MPsoG0cxFGkCj=qcr24Q@mail.gmail.com>
 <cd781b7f-f5f2-4a44-a338-1ccc0f03d86d@linux.alibaba.com>
 <CAAa6QmRo=SjCGSdhHyxmkzxyiR8S9EUdRr0CXSenWaa+-7e5bg@mail.gmail.com>
From: Rongwei Wang <rongwei.wang@linux.alibaba.com>
In-Reply-To: <CAAa6QmRo=SjCGSdhHyxmkzxyiR8S9EUdRr0CXSenWaa+-7e5bg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Authentication-Results: imf04.hostedemail.com;
	dkim=none;
	dmarc=pass (policy=none) header.from=alibaba.com;
	spf=pass (imf04.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.42 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com
X-Stat-Signature: iz6u4bsmmta8d5c1dbpi7i3da4rsih9t
X-Rspam-User: 
X-Rspamd-Server: rspam05
X-Rspamd-Queue-Id: 913A740067
X-HE-Tag: 1654082799-318981
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>


On 6/1/22 1:19 PM, Zach O'Keefe wrote:
> On Sun, May 29, 2022 at 6:25 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 5/30/22 5:36 AM, Zach O'Keefe wrote:
>>> On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
>>>>> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>>> Because PageTransCompound() does not do what it says on the tin.
>>>>>>
>>>>>> static inline int PageTransCompound(struct page *page)
>>>>>> {
>>>>>>           return PageCompound(page);
>>>>>> }
>>>>>>
>>>>>> So any compound page is treated as if it's a PMD-sized page.
>>>>>
>>>>> Right - therein lies the problem :) I think I misattributed your
>>>>> comment "we'll simply skip over it because the code believes that
>>>>> means it's already a PMD" as a solution, not as the current state of
>>>>> things. What we need to be able to do is:
>>>>>
>>>>> 1) If folio order == 0: do what we've been doing
>>>>> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
>>>>> pmd-mapped. If it is, we're done. If not, continue to step (3)
>>>>
>>>> I would not do that part.  Just leave it alone and assume everything's
>>>> good.
>>>
>>> Sorry if I keep pressing the issue here - but why not check? If the
>>> goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
>>> memory at the pmd level, then these pte-mapped hugepages that we might
>>> discover in step (2) are actually the cheapest memory to collapse
>>> since we can do the collapse in-place.
>>>
>>>>> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
>>>>> it's ~ same as step (1)
>>>>
>>>> Yes, exactly this.
>>>>
>>>>>>> I thought the point / benefit of khugepaged was precisely to try and
>>>>>>> find places where we can collapse many pte entries into a single pmd
>>>>>>> mapping?
>>>>>>
>>>>>> Ideally, yes.  But if a file is mapped at an address which isn't
>>>>>> PMD-aligned, it can't.  Maybe it should just decline to operate in that
>>>>>> case.
>>>>>
>>>>> To make sure I'm not missing anything here: It's not actually
>>>>> important that the file is mapped at a pmd-aligned address. All that
>>>>> is important is that the region of memory being collapsed is
>>>>> pmd-aligned. If we wanted to collapse memory mapped to the start of
>>>>> the file, then sure, the file has to be mapped suitably.
>>>>
>>>> Ah, what you're probably missing is that for file pages/folios, they
>>>> have to be naturally aligned.  The data structure underlying the
>>>> page cache simply can't cope with askew pages.  (It kind of can under
>>>> some circumstances that are so complicated that they shouldn't be
>>>> explained, and it's far easier just to say "folios must be naturally
>>>> aligned within the file")
>>>
>>> I'm trying to understand what you mean by "naturally aligned" here.
>>> I'm operating under the assumption that all file pages map to
>>> page-sized offsets within a file (e.g. linear_page_address()) and that
>>> files are mapped at a page-aligned address. In the event we want to
>>> collapse file-backed memory, if the virtual address of said memory is
>>> hugepage-aligned, I don't think it's necessary that the address maps
>>> to a hugepage-sized offset in the file? I.e. on x86, the file could
>> Hi, Zach
>>
>> I'm not sure I understand your question correctly. We submitted a patch set to
>> support using file THP transparently, like THP [1].
>>
>> [1]https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@linux.alibaba.com/
>>
>> In this patch, I remember that we need to check whether '(vma->vm_start >>
>> PAGE_SHIFT) - vma->vm_pgoff' is aligned to HPAGE_PMD_NR, like this:
>>
>> +static inline bool vma_is_hugetext(struct vm_area_struct *vma,
>> +                                  unsigned long vm_flags)
>> +{
>> +       if (!(vm_flags & VM_EXEC))
>> +               return false;
>> +
>> +       if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
>> +               return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>> +                               HPAGE_PMD_NR);
>> +
>> +       return false;
>> +}
>>
>> This is a little different from anon THP.
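
To make the arithmetic in the alignment check quoted above concrete, here is a minimal userspace sketch. It only mirrors the IS_ALIGNED() expression from vma_is_hugetext(); the EX_* constants (4 KiB pages, 2 MiB PMD-sized pages, as on x86-64) and the function name ex_vma_file_aligned() are assumptions made up for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 values: 4 KiB base pages, 2 MiB PMD-sized huge pages. */
#define EX_PAGE_SHIFT   12
#define EX_HPAGE_PMD_NR 512ULL  /* 2 MiB / 4 KiB */

/*
 * Hypothetical userspace mirror of the alignment test in vma_is_hugetext():
 * the difference between the VMA's first virtual page number and its file
 * page offset must be a multiple of HPAGE_PMD_NR.
 */
static bool ex_vma_file_aligned(uint64_t vm_start, uint64_t vm_pgoff)
{
    uint64_t delta = (vm_start >> EX_PAGE_SHIFT) - vm_pgoff;

    /* EX_HPAGE_PMD_NR is a power of two, so a mask test is enough. */
    return (delta & (EX_HPAGE_PMD_NR - 1)) == 0;
}
```

For example, a VMA starting at 0x401000 with vm_pgoff 1 passes the test (1025 - 1 = 1024 pages, a multiple of 512), while the same start address with vm_pgoff 0 does not.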
>>> itself be mapped to the start of the last page in a 2MiB region, X,
>>
>> Maybe it's related to the ELF 'Align' parameter on your current system?
>> If 'Align' is set to 2MB (check with 'readelf -l /path/exec'), it probably
>> meets the above alignment check.
>>
>> The default Align value depends on the binutils version, and it can also
>> be set at compile time with the '-z max-page-size=<align size>' option.
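
The 'Align' column printed by 'readelf -l' is the p_align field of each program header. As a rough sketch only, the check below looks at the first executable PT_LOAD segment and reports whether it is laid out on 2 MiB boundaries. The function name ex_text_segment_pmd_aligned() and the EX_PMD_SIZE constant are assumptions for illustration, and passing this check does not by itself guarantee that the kernel maps the segment at a suitably aligned address.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed PMD size: 2 MiB, as on x86-64 with 4 KiB base pages. */
#define EX_PMD_SIZE (2ULL << 20)

/*
 * Hypothetical helper: return true if the first executable PT_LOAD
 * segment has p_align >= 2 MiB and its virtual address and file offset
 * are congruent modulo 2 MiB. For a well-formed ELF file the second
 * condition follows from the first; it is kept as an explicit check.
 */
static bool ex_text_segment_pmd_aligned(const Elf64_Phdr *phdr, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (phdr[i].p_type != PT_LOAD || !(phdr[i].p_flags & PF_X))
            continue;
        return phdr[i].p_align >= EX_PMD_SIZE &&
               (phdr[i].p_vaddr - phdr[i].p_offset) % EX_PMD_SIZE == 0;
    }
    return false; /* no executable PT_LOAD segment found */
}
```

The program headers would come from the ELF file itself (e_phoff and e_phnum in the ELF header); reading them is left out to keep the sketch short.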
>>
>> Hope it is helpful:)
>>
>> -wrw
> 
> Hey Rongwei!
> 
> Thanks for the code / help. Took a little bit, but hughd has
> enlightened me on the problem (thanks Hugh!). Likewise, apologies for
> not understanding your previous comment regarding folio alignment,
> Matthew.
> 
> Also, thanks for linking your patchset, and sorry for missing it
> previously. It seems we're interested in the same problem! Hopefully
> this work can be beneficial to your use case as well.
Hi Zach!

Thank you, too. Recently, we have been trying to use process_madvise() + DAMON
to find hot .text, especially on x86, and then collapse it into huge pages.
It seems that process_madvise(MADV_COLLAPSE) is feasible for this.
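
A minimal sketch of how such a call could look, under several assumptions: the hot range [start, end) is a hypothetical input (for example a region reported by a monitoring tool), the ex_* names are made up for this illustration, PMD size is taken to be 2 MiB, and the fallback values for MADV_COLLAPSE (25) and SYS_process_madvise (440) should be checked against the kernel headers actually in use. The sketch trims the range to whole PMD-sized regions before issuing the syscall, which makes explicit which parts could be collapsed at all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25            /* assumption: verify against your headers */
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440     /* assumption: verify against your headers */
#endif

#define EX_PMD_SIZE ((uintptr_t)2 << 20)   /* assumed 2 MiB PMD size */

struct ex_range {
    uintptr_t start;
    size_t len;
};

/* Shrink [start, end) to the PMD-aligned sub-range it fully covers. */
static bool ex_trim_to_pmd(uintptr_t start, uintptr_t end, struct ex_range *out)
{
    uintptr_t s = (start + EX_PMD_SIZE - 1) & ~(EX_PMD_SIZE - 1); /* round up */
    uintptr_t e = end & ~(EX_PMD_SIZE - 1);                       /* round down */

    if (s >= e)
        return false;   /* no whole PMD-sized region inside the range */
    out->start = s;
    out->len = e - s;
    return true;
}

/*
 * Ask the kernel to collapse the PMD-aligned part of [start, end) in the
 * process referred to by pidfd. Returns 0 without calling the kernel if
 * nothing is collapsible, otherwise the raw syscall result (-1 on error).
 */
static long ex_collapse_remote(int pidfd, uintptr_t start, uintptr_t end)
{
    struct ex_range r;

    if (!ex_trim_to_pmd(start, end, &r))
        return 0;

    struct iovec iov = { .iov_base = (void *)r.start, .iov_len = r.len };
    return syscall(SYS_process_madvise, pidfd, &iov, (size_t)1,
                   MADV_COLLAPSE, 0U);
}
```

The pidfd would come from pidfd_open() on the target process, and the call needs the usual process_madvise() permissions on that process.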

Anyway, thanks for your nice work.
-wrw
> 
> Thanks again for your time,
> Zach
> 
> 
> 
>>> and that wouldn't prevent us from collapsing the 2MiB region starting
>>> at X+4KiB.
>>>
>>>>>>>> shmem still expects folios to be of order either 0 or PMD_ORDER.
>>>>>>>> That assumption extends into the swap code and I haven't had the heart
>>>>>>>> to go and fix all those places yet.  Plus Neil was doing major surgery
>>>>>>>> to the swap code in the most recent development cycle and I didn't want
>>>>>>>> to get in his way.
>>>>>>>>
>>>>>>>> So I am absolutely fine with khugepaged allocating a PMD-size folio for
>>>>>>>> any inode that claims mapping_large_folio_support().  If any filesystems
>>>>>>>> break, we'll fix them.
>>>>>>>
>>>>>>> Just for clarification, what is the equivalent code today that
>>>>>>> enforces mapping_large_folio_support()? I.e. today, khugepaged can
>>>>>>> successfully collapse file without checking if the inode supports it
>>>>>>> (we only check that it's a regular file not opened for writing).
>>>>>>
>>>>>> Yeah, that's a dodgy hack which needs to go away.  But we need a lot
>>>>>> more filesystems converted to supporting large folios before we can
>>>>>> delete it.  Not your responsibility; I'm doing my best to encourage
>>>>>> fs maintainers to do this part.
>>>>>
>>>>> Got it. In the meantime, do we want to check the old conditions +
>>>>> mapping_large_folio_support()?
>>>>
>>>> Yes, that should work.  khugepaged should be free to create large
>>>> folios if the underlying filesystem supports them OR (executable,
>>>> read-only, CONFIG_THP_RO, etc, etc).
>>>
>>> Thanks for confirming!
>>>
>>>>>>> Also, just to check, there isn't anything wrong with following
>>>>>>> collapse_file()'s approach, even for folios of 0 < order <
>>>>>>> HPAGE_PMD_ORDER? I.e this part:
>>>>>>>
>>>>>>>    * Basic scheme is simple, details are more complex:
>>>>>>>    *  - allocate and lock a new huge page;
>>>>>>>    *  - scan page cache replacing old pages with the new one
>>>>>>>    *    + swap/gup in pages if necessary;
>>>>>>>    *    + fill in gaps;
>>>>>>>    *    + keep old pages around in case rollback is required;
>>>>>>>    *  - if replacing succeeds:
>>>>>>>    *    + copy data over;
>>>>>>>    *    + free old pages;
>>>>>>>    *    + unlock huge page;
>>>>>>>    *  - if replacing failed;
>>>>>>>    *    + put all pages back and unfreeze them;
>>>>>>>    *    + restore gaps in the page cache;
>>>>>>>    *    + unlock and free huge page;
>>>>>>>    */
>>>>>>
>>>>>> Correct.  At least, as far as I know!  Working on folios has been quite
>>>>>> the education for me ...
>>>>>
>>>>> Great! Well, perhaps I'll run into a snafu here or there (and
>>>>> hopefully learn something myself) but this gives me enough confidence
>>>>> to naively give it a try and see what happens!
>>>>>
>>>>> Again, thank you very much for your time, help and advice with this,
>>>>
>>>> You're welcome!  Thanks for putting in some work on this project!
>>>
>>> No problem! Hopefully this can benefit a bunch of existing users.
>>>
>>> Thanks again,
>>> Zach