From: "Zach O'Keefe" <zokeefe@google.com>
Date: Fri, 27 May 2022 09:27:33 -0700
Subject: Re: mm/khugepaged: collapse file/shmem compound pages
To: Matthew Wilcox
Cc: David Rientjes, "linux-mm@kvack.org"

On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox wrote:
>
> On Thu, May 26, 2022 at 05:54:27PM -0700, Zach O'Keefe wrote:
> > On Wed, May 25, 2022 at 8:36 PM Matthew Wilcox wrote:
> > > On Wed, May 25, 2022 at 06:23:52PM -0700, Zach O'Keefe wrote:
> > > > On Wed, May 25, 2022 at 12:07 PM Matthew Wilcox wrote:
> > > > > Anyway, the meaning behind that comment is that the PageTransCompound()
> > > > > test is going to be true on any compound page (TransCompound doesn't
> > > > > check that the page is necessarily a THP). So that particular test should
> > > > > be folio_test_pmd_mappable(), but there are probably other things which
> > > > > ought to be changed, including converting the entire file from dealing
> > > > > in pages to dealing in folios.
> > > >
> > > > Right, at this point, the page might be a pmd-mapped THP, or it could
> > > > be a pte-mapped compound page (I'm unsure if we can encounter compound
> > > > pages outside hugepages).
> > >
> > > Today, there is a way. We can find a folio with an order between 0 and
> > > PMD_ORDER if the underlying filesystem supports large folios and the
> > > file is executable and we've enabled CONFIG_READ_ONLY_THP_FOR_FS.
> > > In this case, we'll simply skip over it because the code believes that
> > > means it's already a PMD.
> >
> > I think I'm missing something here - sorry. If the folio order is <
> > HPAGE_PMD_ORDER, why does the code think it's a pmd?
>
> Because PageTransCompound() does not do what it says on the tin.
>
> static inline int PageTransCompound(struct page *page)
> {
>         return PageCompound(page);
> }
>
> So any compound page is treated as if it's a PMD-sized page.

Right - therein lies the problem :) I think I misattributed your comment
"we'll simply skip over it because the code believes that means it's
already a PMD" as a solution, not as the current state of things.

What we need to be able to do is:

1) If folio order == 0: do what we've been doing
2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
   pmd-mapped. If it is, we're done.
   If not, continue to step (3)
3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully it's
   ~ same as step (1)

> > > > If we could tell it's already pmd-mapped, we're done :) IIUC,
> > > > folio_test_pmd_mappable() is a necessary but not sufficient condition
> > > > to determine this.
> > >
> > > It is necessary, but from khugepaged's point of view, it's sufficient
> > > because khugepaged's job is to create PMD-sized folios -- it's not up to
> > > khugepaged to ensure that PMD-sized folios are actually mapped using
> > > a PMD.
> >
> > I thought the point / benefit of khugepaged was precisely to try and
> > find places where we can collapse many pte entries into a single pmd
> > mapping?
>
> Ideally, yes. But if a file is mapped at an address which isn't
> PMD-aligned, it can't. Maybe it should just decline to operate in that
> case.

To make sure I'm not missing anything here: It's not actually important
that the file is mapped at a pmd-aligned address. All that is important
is that the region of memory being collapsed is pmd-aligned. If we
wanted to collapse memory mapped to the start of the file, then sure,
the file has to be mapped suitably.

> > > There may be some other component of the system (eg DAMON?)
> > > which has chosen to temporarily map the PMD-sized folio using PTEs
> > > in order to track whether the memory is all being used. It may also
> > > be the case that (for file-based memory), the VMA is mis-aligned and
> > > despite creating a PMD-sized folio, it can't be mapped with a PMD.
> >
> > AFAIK DAMON doesn't do this pmd splitting to do subpage tracking for
> > THPs. Also, I believe retract_page_tables() does make the check to see
> > if the address is suitably hugepage aligned/sized.
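(An aside, to spell out what I mean by "suitably hugepage
aligned/sized" above: the rounding arithmetic, as a freestanding
userspace sketch, not the actual kernel code. The 2MB PMD size and the
has_aligned_pmd_range() name are just for illustration here; the real
per-VMA checks live in mm/khugepaged.c.)

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: assume 2MB PMD-sized huge pages (x86-64 default). */
#define HPAGE_PMD_SIZE (2ULL << 20)
#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))

/*
 * A region [start, end) contains a collapsible PMD-sized range iff the
 * first PMD-aligned address at or above 'start', plus HPAGE_PMD_SIZE,
 * still fits at or below 'end'. This mirrors the hstart/hend round-up /
 * round-down idiom khugepaged uses, stripped of all kernel context.
 */
static int has_aligned_pmd_range(uint64_t start, uint64_t end)
{
	uint64_t hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; /* round up */
	uint64_t hend = end & HPAGE_PMD_MASK;                         /* round down */

	return hstart < hend;
}
```

So a VMA spanning exactly [2MB, 4MB) qualifies, while one starting a
single page later at 0x201000 no longer has room for an aligned 2MB
range before 0x400000.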
>
> Maybe not DAMON itself, but it's something that various people are
> talking about doing; trying to determine whether THPs are worth using or
> whether userspace has made the magic go-faster call without knowing
> whether the valuable 2MB page is being entirely used.

Got it - thanks for clarifying.

> > > shmem still expects folios to be of order either 0 or PMD_ORDER.
> > > That assumption extends into the swap code and I haven't had the heart
> > > to go and fix all those places yet. Plus Neil was doing major surgery
> > > to the swap code in the most recent development cycle and I didn't want
> > > to get in his way.
> > >
> > > So I am absolutely fine with khugepaged allocating a PMD-size folio for
> > > any inode that claims mapping_large_folio_support(). If any filesystems
> > > break, we'll fix them.
> >
> > Just for clarification, what is the equivalent code today that
> > enforces mapping_large_folio_support()? I.e. today, khugepaged can
> > successfully collapse file without checking if the inode supports it
> > (we only check that it's a regular file not opened for writing).
>
> Yeah, that's a dodgy hack which needs to go away. But we need a lot
> more filesystems converted to supporting large folios before we can
> delete it. Not your responsibility; I'm doing my best to encourage
> fs maintainers to do this part.

Got it. In the meantime, do we want to check the old conditions +
mapping_large_folio_support()?

> > Also, just to check, there isn't anything wrong with following
> > collapse_file()'s approach, even for folios of 0 < order <
> > HPAGE_PMD_ORDER?
> > I.e. this part:
> >
> >  * Basic scheme is simple, details are more complex:
> >  *  - allocate and lock a new huge page;
> >  *  - scan page cache replacing old pages with the new one
> >  *    + swap/gup in pages if necessary;
> >  *    + fill in gaps;
> >  *    + keep old pages around in case rollback is required;
> >  *  - if replacing succeeds:
> >  *    + copy data over;
> >  *    + free old pages;
> >  *    + unlock huge page;
> >  *  - if replacing failed;
> >  *    + put all pages back and unfreeze them;
> >  *    + restore gaps in the page cache;
> >  *    + unlock and free huge page;
> >  */
>
> Correct. At least, as far as I know! Working on folios has been quite
> the education for me ...

Great! Well, perhaps I'll run into a snafu here or there (and hopefully
learn something myself) but this gives me enough confidence to naively
give it a try and see what happens!

Again, thank you very much for your time, help and advice with this,

Best,
Zach
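P.S. For concreteness, the three cases I listed earlier in this mail,
written out as a plain userspace sketch. The classify() name, the enum,
and the pmd_mapped flag (standing in for a real "is this actually
mapped by a PMD?" check, which is exactly what PageTransCompound()
can't tell us) are all made up for illustration; none of this is actual
khugepaged code.

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_ORDER 9 /* illustrative: 2MB folios of 4KB base pages */

enum collapse_action {
	COLLAPSE_AS_BEFORE,	/* (1) order-0 page: existing collapse path */
	COLLAPSE_DONE,		/* (2) already a pmd-mapped THP: nothing to do */
	COLLAPSE_NEW_MAGIC,	/* (3) pte-mapped compound page: new handling */
};

/* Decide what to do with a folio of the given order. */
static enum collapse_action classify(unsigned int order, bool pmd_mapped)
{
	if (order == 0)
		return COLLAPSE_AS_BEFORE;
	if (order == HPAGE_PMD_ORDER && pmd_mapped)
		return COLLAPSE_DONE;
	/* order > 0 but not pmd-mapped: falls through to step (3) */
	return COLLAPSE_NEW_MAGIC;
}
```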