From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <08fb5da7-a390-1880-da3f-e1d480047caa@linux.alibaba.com>
Date: Wed, 1 Jun 2022 19:26:51 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Thunderbird/101.0
Subject: Re: mm/khugepaged: collapse file/shmem compound pages
Content-Language: en-US
To: Zach O'Keefe
Cc: Matthew Wilcox, David Rientjes, "linux-mm@kvack.org", Hugh Dickins
References:
From: Rongwei Wang
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
List-ID:

On 6/1/22 1:19 PM, Zach O'Keefe wrote:
> On Sun, May 29, 2022 at 6:25 PM Rongwei Wang wrote:
>>
>> On 5/30/22 5:36 AM, Zach O'Keefe wrote:
>>> On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox wrote:
>>>>
>>>> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
>>>>> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox wrote:
>>>>>> Because PageTransCompound() does not do what it says on the tin.
>>>>>>
>>>>>> static inline int PageTransCompound(struct page *page)
>>>>>> {
>>>>>>         return PageCompound(page);
>>>>>> }
>>>>>>
>>>>>> So any compound page is treated as if it's a PMD-sized page.
>>>>>
>>>>> Right - therein lies the problem :) I think I misattributed your
>>>>> comment "we'll simply skip over it because the code believes that
>>>>> means it's already a PMD" as a solution, not as the current state of
>>>>> things. What we need to be able to do is:
>>>>>
>>>>> 1) If folio order == 0: do what we've been doing
>>>>> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
>>>>> pmd-mapped. If it is, we're done. If not, continue to step (3)
>>>>
>>>> I would not do that part. Just leave it alone and assume everything's
>>>> good.
>>>
>>> Sorry if I keep pressing the issue here - but why not check? If the
>>> goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
>>> memory at the pmd level, then these pte-mapped hugepages that we might
>>> discover in step (2) are actually the cheapest memory to collapse
>>> since we can do the collapse in-place.
>>>
>>>>> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
>>>>> it's ~ same as step (1)
>>>>
>>>> Yes, exactly this.
>>>>
>>>>>>> I thought the point / benefit of khugepaged was precisely to try and
>>>>>>> find places where we can collapse many pte entries into a single pmd
>>>>>>> mapping?
>>>>>>
>>>>>> Ideally, yes. But if a file is mapped at an address which isn't
>>>>>> PMD-aligned, it can't. Maybe it should just decline to operate in that
>>>>>> case.
>>>>>
>>>>> To make sure I'm not missing anything here: It's not actually
>>>>> important that the file is mapped at a pmd-aligned address. All that
>>>>> is important is that the region of memory being collapsed is
>>>>> pmd-aligned. If we wanted to collapse memory mapped to the start of
>>>>> the file, then sure, the file has to be mapped suitably.
>>>>
>>>> Ah, what you're probably missing is that for file pages/folios, they
>>>> have to be naturally aligned. The data structure underlying the
>>>> page cache simply can't cope with askew pages. (It kind of can under
>>>> some circumstances that are so complicated that they shouldn't be
>>>> explained, and it's far easier just to say "folios must be naturally
>>>> aligned within the file")
>>>
>>> I'm trying to understand what you mean by "naturally aligned" here.
>>> I'm operating under the assumption that all file pages map to
>>> page-sized offsets within a file (e.g. linear_page_address()) and that
>>> files are mapped at a page-aligned address.
>>> In the event we want to
>>> collapse file-backed memory, if the virtual address of said memory is
>>> hugepage-aligned, I don't think it's necessary that the address maps
>>> to a hugepage-sized offset in the file? I.e. on x86, the file could
>> Hi, Zach
>>
>> I'm not sure I understood your question correctly. We submitted a patch
>> set so that file THP can be used transparently, like THP [1].
>>
>> [1] https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@linux.alibaba.com/
>>
>> In this patch, I remember that we need to check whether '(vma->vm_start >>
>> PAGE_SHIFT) - vma->vm_pgoff' is aligned to HPAGE_PMD_NR, like this:
>>
>> +static inline bool vma_is_hugetext(struct vm_area_struct *vma,
>> +                unsigned long vm_flags)
>> +{
>> +        if (!(vm_flags & VM_EXEC))
>> +                return false;
>> +
>> +        if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
>> +                return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>> +                                HPAGE_PMD_NR);
>> +
>> +        return false;
>> +}
>>
>> This is a little different from anon THP.
>>> itself be mapped to the start of the last page in a 2MiB region, X,
>>
>> Maybe it's related to the ELF 'Align' parameter on your current system?
>> If 'Align' is set to 2MB (check with 'readelf -l /path/exec'), it
>> probably meets the above alignment check.
>>
>> The default Align value depends on the binutils version, and it can also
>> be set at build time with the '-z max-page-size=' option.
>>
>> Hope this helps :)
>>
>> -wrw
>
> Hey Rongwei!
>
> Thanks for the code / help. Took a little bit, but hughd has
> enlightened me on the problem (thanks Hugh!). Likewise, apologies for
> not understanding your previous comment regarding folio alignment,
> Matthew.
>
> Also, thanks for linking your patchset, and sorry for missing it
> previously. It seems we're interested in the same problem! Hopefully
> this work can be beneficial to your use case as well.

Hi Zach! Thank you, too.
Recently, we have been trying to use process_madvise() together with DAMON
to find hot .text, especially on x86, and then collapse it into huge
pages. It seems that process_madvise(MADV_COLLAPSE) is feasible for this.

Anyway, thanks for your nice work.

-wrw

>
> Thanks again for your time,
> Zach
>
>
>>> and that wouldn't prevent us from collapsing the 2MiB region starting
>>> at X+4KiB.
>>>
>>>>>>>> shmem still expects folios to be of order either 0 or PMD_ORDER.
>>>>>>>> That assumption extends into the swap code and I haven't had the heart
>>>>>>>> to go and fix all those places yet. Plus Neil was doing major surgery
>>>>>>>> to the swap code in the most recent development cycle and I didn't want
>>>>>>>> to get in his way.
>>>>>>>>
>>>>>>>> So I am absolutely fine with khugepaged allocating a PMD-size folio for
>>>>>>>> any inode that claims mapping_large_folio_support(). If any filesystems
>>>>>>>> break, we'll fix them.
>>>>>>>
>>>>>>> Just for clarification, what is the equivalent code today that
>>>>>>> enforces mapping_large_folio_support()? I.e. today, khugepaged can
>>>>>>> successfully collapse file without checking if the inode supports it
>>>>>>> (we only check that it's a regular file not opened for writing).
>>>>>>
>>>>>> Yeah, that's a dodgy hack which needs to go away. But we need a lot
>>>>>> more filesystems converted to supporting large folios before we can
>>>>>> delete it. Not your responsibility; I'm doing my best to encourage
>>>>>> fs maintainers to do this part.
>>>>>
>>>>> Got it. In the meantime, do we want to check the old conditions +
>>>>> mapping_large_folio_support()?
>>>>
>>>> Yes, that should work. khugepaged should be free to create large
>>>> folios if the underlying filesystem supports them OR (executable,
>>>> read-only, CONFIG_THP_RO, etc, etc).
>>>
>>> Thanks for confirming!
>>>
>>>>>>> Also, just to check, there isn't anything wrong with following
>>>>>>> collapse_file()'s approach, even for folios of 0 < order <
>>>>>>> HPAGE_PMD_ORDER?
>>>>>>> I.e. this part:
>>>>>>>
>>>>>>> * Basic scheme is simple, details are more complex:
>>>>>>> *  - allocate and lock a new huge page;
>>>>>>> *  - scan page cache replacing old pages with the new one
>>>>>>> *    + swap/gup in pages if necessary;
>>>>>>> *    + fill in gaps;
>>>>>>> *    + keep old pages around in case rollback is required;
>>>>>>> *  - if replacing succeeds:
>>>>>>> *    + copy data over;
>>>>>>> *    + free old pages;
>>>>>>> *    + unlock huge page;
>>>>>>> *  - if replacing failed;
>>>>>>> *    + put all pages back and unfreeze them;
>>>>>>> *    + restore gaps in the page cache;
>>>>>>> *    + unlock and free huge page;
>>>>>>> */
>>>>>>
>>>>>> Correct. At least, as far as I know! Working on folios has been quite
>>>>>> the education for me ...
>>>>>
>>>>> Great! Well, perhaps I'll run into a snafu here or there (and
>>>>> hopefully learn something myself) but this gives me enough confidence
>>>>> to naively give it a try and see what happens!
>>>>>
>>>>> Again, thank you very much for your time, help and advice with this,
>>>>
>>>> You're welcome! Thanks for putting in some work on this project!
>>>
>>> No problem! Hopefully this can benefit a bunch of existing users.
>>>
>>> Thanks again,
>>> Zach
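
For reference, the alignment arithmetic discussed in this thread can be tried out in userspace. The following is only a rough sketch, not kernel code: the sketch_* names are hypothetical, and the constants assume x86-64 with 4KiB base pages and 2MiB PMD-sized pages. It mirrors the IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, HPAGE_PMD_NR) test from the quoted vma_is_hugetext() and adds a helper that shows which file page backs a given virtual address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 values: 4KiB base pages, 2MiB PMD-sized huge pages. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_HPAGE_PMD_NR 512ULL  /* base pages per PMD-sized page */

/*
 * Hypothetical userspace mirror of the quoted check: true when virtual
 * page numbers and file page offsets of the mapping are congruent
 * modulo HPAGE_PMD_NR. Unsigned wraparound is harmless here because
 * 512 divides 2^64.
 */
static inline bool sketch_file_aligned(uint64_t vm_start, uint64_t vm_pgoff)
{
        uint64_t delta = (vm_start >> SKETCH_PAGE_SHIFT) - vm_pgoff;

        return (delta & (SKETCH_HPAGE_PMD_NR - 1)) == 0;
}

/* File page index backing virtual address addr (addr >= vm_start). */
static inline uint64_t sketch_file_index(uint64_t vm_start, uint64_t vm_pgoff,
                                         uint64_t addr)
{
        return vm_pgoff + ((addr - vm_start) >> SKETCH_PAGE_SHIFT);
}
```

With the example from the thread (file offset 0 mapped at the last page of the 2MiB region X = 0x400000, i.e. vm_start = 0x5ff000), sketch_file_aligned() returns false, and the hugepage-aligned virtual address 0x600000 (X + 2MiB) corresponds to file page 1. A PMD-sized folio starting there would not be naturally aligned within the file, which is the constraint described above for page cache folios.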