From: "Zach O'Keefe"
Date: Sun, 29 May 2022 14:36:29 -0700
Subject: Re: mm/khugepaged: collapse file/shmem compound pages
To: Matthew Wilcox
Cc: David Rientjes, "linux-mm@kvack.org"

On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox wrote:
>
> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
> > On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox wrote:
> > > Because PageTransCompound() does not do what it says on the tin.
> > >
> > > static inline int PageTransCompound(struct page *page)
> > > {
> > >         return PageCompound(page);
> > > }
> > >
> > > So any compound page is treated as if it's a PMD-sized page.
> >
> > Right - therein lies the problem :) I think I misattributed your
> > comment "we'll simply skip over it because the code believes that
> > means it's already a PMD" as a solution, not as the current state of
> > things. What we need to be able to do is:
> >
> > 1) If folio order == 0: do what we've been doing
> > 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
> >    pmd-mapped. If it is, we're done. If not, continue to step (3)
>
> I would not do that part. Just leave it alone and assume everything's
> good.

Sorry if I keep pressing the issue here - but why not check? If the
goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
memory at the pmd level, then these pte-mapped hugepages that we might
discover in step (2) are actually the cheapest memory to collapse
since we can do the collapse in-place.

> > 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
> >    it's ~ same as step (1)
>
> Yes, exactly this.
>
> > > > I thought the point / benefit of khugepaged was precisely to try and
> > > > find places where we can collapse many pte entries into a single pmd
> > > > mapping?
> > >
> > > Ideally, yes. But if a file is mapped at an address which isn't
> > > PMD-aligned, it can't. Maybe it should just decline to operate in that
> > > case.
> >
> > To make sure I'm not missing anything here: It's not actually
> > important that the file is mapped at a pmd-aligned address. All that
> > is important is that the region of memory being collapsed is
> > pmd-aligned. If we wanted to collapse memory mapped to the start of
> > the file, then sure, the file has to be mapped suitably.
>
> Ah, what you're probably missing is that for file pages/folios, they
> have to be naturally aligned. The data structure underlying the
> page cache simply can't cope with askew pages. (It kind of can under
> some circumstances that are so complicated that they shouldn't be
> explained, and it's far easier just to say "folios must be naturally
> aligned within the file")

I'm trying to understand what you mean by "naturally aligned" here.
I'm operating under the assumption that all file pages map to
page-sized offsets within a file (e.g. linear_page_address()) and that
files are mapped at a page-aligned address. In the event we want to
collapse file-backed memory, if the virtual address of said memory is
hugepage-aligned, I don't think it's necessary that the address maps
to a hugepage-sized offset in the file? I.e. on x86, the file could
itself be mapped to the start of the last page in a 2MiB region, X,
and that wouldn't prevent us from collapsing the 2MiB region starting
at X+4KiB.

> > > > > shmem still expects folios to be of order either 0 or PMD_ORDER.
> > > > > That assumption extends into the swap code and I haven't had the heart
> > > > > to go and fix all those places yet. Plus Neil was doing major surgery
> > > > > to the swap code in the most recent development cycle and I didn't want
> > > > > to get in his way.
> > > > >
> > > > > So I am absolutely fine with khugepaged allocating a PMD-size folio for
> > > > > any inode that claims mapping_large_folio_support(). If any filesystems
> > > > > break, we'll fix them.
> > > >
> > > > Just for clarification, what is the equivalent code today that
> > > > enforces mapping_large_folio_support()? I.e. today, khugepaged can
> > > > successfully collapse file without checking if the inode supports it
> > > > (we only check that it's a regular file not opened for writing).
> > >
> > > Yeah, that's a dodgy hack which needs to go away. But we need a lot
> > > more filesystems converted to supporting large folios before we can
> > > delete it. Not your responsibility; I'm doing my best to encourage
> > > fs maintainers to do this part.
> >
> > Got it. In the meantime, do we want to check the old conditions +
> > mapping_large_folio_support()?
>
> Yes, that should work. khugepaged should be free to create large
> folios if the underlying filesystem supports them OR (executable,
> read-only, CONFIG_THP_RO, etc, etc).

Thanks for confirming!

> > > > Also, just to check, there isn't anything wrong with following
> > > > collapse_file()'s approach, even for folios of 0 < order <
> > > > HPAGE_PMD_ORDER? I.e. this part:
> > > >
> > > >  * Basic scheme is simple, details are more complex:
> > > >  *  - allocate and lock a new huge page;
> > > >  *  - scan page cache replacing old pages with the new one
> > > >  *    + swap/gup in pages if necessary;
> > > >  *    + fill in gaps;
> > > >  *    + keep old pages around in case rollback is required;
> > > >  *  - if replacing succeeds:
> > > >  *    + copy data over;
> > > >  *    + free old pages;
> > > >  *    + unlock huge page;
> > > >  *  - if replacing failed;
> > > >  *    + put all pages back and unfreeze them;
> > > >  *    + restore gaps in the page cache;
> > > >  *    + unlock and free huge page;
> > > >  */
> > >
> > > Correct. At least, as far as I know! Working on folios has been quite
> > > the education for me ...
> >
> > Great! Well, perhaps I'll run into a snafu here or there (and
> > hopefully learn something myself) but this gives me enough confidence
> > to naively give it a try and see what happens!
> >
> > Again, thank you very much for your time, help and advice with this,
>
> You're welcome! Thanks for putting in some work on this project!

No problem! Hopefully this can benefit a bunch of existing users.

Thanks again,
Zach
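
For reference, the X / X+4KiB alignment question above can be worked through with a small userspace sketch. This is not kernel code: struct toy_vma, the toy_* helpers, the TOY_* constants (4KiB pages, 2MiB PMD, as on x86) and the example address are all assumptions made for illustration. It only shows the arithmetic: a virtual range can be PMD-aligned while the file pages behind it start at an index that is not a multiple of 512, so those pages cannot form one naturally aligned PMD-order folio in the page cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 4KiB base pages and 2MiB PMD mappings, as on x86. */
#define TOY_PAGE_SHIFT      12
#define TOY_PAGE_SIZE       (UINT64_C(1) << TOY_PAGE_SHIFT)
#define TOY_HPAGE_PMD_ORDER 9	/* 2MiB / 4KiB = 512 pages */
#define TOY_HPAGE_PMD_NR    (UINT64_C(1) << TOY_HPAGE_PMD_ORDER)
#define TOY_HPAGE_PMD_SIZE  (TOY_PAGE_SIZE << TOY_HPAGE_PMD_ORDER)

/* Hypothetical stand-in for the two VMA fields that matter here. */
struct toy_vma {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/* File page index that backs virtual address addr in this mapping. */
static inline uint64_t toy_page_index(const struct toy_vma *vma, uint64_t addr)
{
	return ((addr - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}

/* True if addr sits on a 2MiB boundary in the virtual address space. */
static inline bool toy_va_pmd_aligned(uint64_t addr)
{
	return (addr & (TOY_HPAGE_PMD_SIZE - 1)) == 0;
}

/* True if index is where a naturally aligned PMD-order folio could start. */
static inline bool toy_index_pmd_aligned(uint64_t index)
{
	return (index & (TOY_HPAGE_PMD_NR - 1)) == 0;
}

/* The case from the text: file offset 0 mapped at X, the last page of a 2MiB region. */
static inline void toy_check_example(void)
{
	const uint64_t region = UINT64_C(0x7f0000000000);	/* 2MiB-aligned, arbitrary */
	const uint64_t x = region + TOY_HPAGE_PMD_SIZE - TOY_PAGE_SIZE;
	const struct toy_vma vma = { .vm_start = x, .vm_pgoff = 0 };
	const uint64_t addr = x + TOY_PAGE_SIZE;	/* X + 4KiB */

	assert(toy_va_pmd_aligned(addr));		/* virtual range is PMD-aligned */
	assert(toy_page_index(&vma, addr) == 1);	/* but it starts at file page 1 */
	assert(!toy_index_pmd_aligned(toy_page_index(&vma, addr)));
}
```

In this layout the 2MiB virtual range starting at X+4KiB covers file pages 1 through 512, which straddles two naturally aligned 512-page blocks of the file. If the same file were instead mapped with vm_start on a 2MiB boundary and vm_pgoff of 0, both checks would agree.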