References: <20220414180612.3844426-1-zokeefe@google.com>
From: "Zach O'Keefe"
Date: Sun, 17 Apr 2022 10:23:27 -0700
Subject: Re: [PATCH v2 00/12] mm: userspace hugepage collapse
To: Peter Xu
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox, Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm@kvack.org, Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe, "Kirill A.
Shutemov", Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Sat, Apr 16, 2022 at 12:26 PM Peter Xu wrote:
>
> Hi, Zach,
>
> On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
> > On Fri, Apr 15, 2022 at 6:39 AM Peter Xu wrote:
> > >
> > > On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> > > > Hey Peter,
> > > >
> > > > Thanks for taking the time to review!
> > > >
> > > > On Thu, Apr 14, 2022 at 5:04 PM Peter Xu wrote:
> > > > >
> > > > > Hi, Zach,
> > > > >
> > > > > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > > > > process_madvise(2)
> > > > > >
> > > > > > Performs a synchronous collapse of the native pages
> > > > > > mapped by the list of iovecs into transparent hugepages.
> > > > > >
> > > > > > Allocation semantics are the same as khugepaged, and depend on
> > > > > > (1) the active sysfs settings
> > > > > > /sys/kernel/mm/transparent_hugepage/enabled and
> > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > > > > > the VMA flags of the memory range being collapsed.
> > > > > >
> > > > > > Collapse eligibility criteria differ from khugepaged in that
> > > > > > the sysfs files
> > > > > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > > > > > are ignored.
> > > > >
> > > > > The userspace khugepaged idea definitely makes sense to me, though I'm
> > > > > curious how the line is drawn on the different behaviors here by explicitly
> > > > > ignoring the max_ptes_* entries.
> > > > >
> > > > > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > > > > the userspace, then IMHO it makes more sense to start with all the policies
> > > > > that apply to khugepaged already, including max_ptes_*.
> > > > >
> > > > > I can understand the willingness to provide even stronger semantics here
> > > > > than khugepaged since the userspace could have very clear knowledge of how
> > > > > to provision the memories (better than a kernel scanner). It's just that
> > > > > IMHO it could be slightly confusing if the new interface only partially
> > > > > applies the khugepaged rules.
> > > > >
> > > > > No strong opinion here. It could already have been a trade-off after the
> > > > > discussion from the RFC with Michal which I read.. Just curious about how
> > > > > you made that design decision so feel free to read it as a pure question.
> > > > >
> > > >
> > > > Understand your point here. The allocation and max_ptes_* semantics are
> > > > split between khugepaged-like and fault-like, respectively - which
> > > > could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> > > > flag to control the former's behavior, but agreed to keep things
> > > > simple to start, and expand the interface if/when necessary. I opted
> > > > to ignore max_ptes_* as the default since I envisioned that early
> > > > adopters would "just want it to work". One such example would be
> > > > backing executable text by hugepages on program load when many pages
> > > > haven't been demand-paged in yet.
> > > >
> > > > What do you think?
> > >
> > > I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> > > blurred.
> > >
> > > To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> > > khugepaged on this range, and with current thread context". IMHO any
> > > feature bits then can be supplementing special needs, and I'll take the thp
> > > backing executable example to be one of the (good?) reasons we'd need an
> > > extra flag for ignoring the max_ptes_* knobs.
> > >
> > > So personally if I were you maybe I'll start with the simple scheme of that
> > > (even if it won't immediately service a thing) but then add either the
> > > defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> > > descriptions about why we need each of the feature flags. IMHO numbers
> > > would be even more helpful when there's specific use cases on the show.
> > >
> > > Or, perhaps you think all potential MADV_COLLAPSE users should literally
> > > skip max_ptes_* limitations always?
> > >
> >
> > Thanks for your time and valuable feedback here, Peter. I had a response typed
> > up, but after a few iterations became increasingly unsatisfied with my
> > own response.
> >
> > I think this feature should be able to stand on its own without
> > consideration of a userspace khugepaged, as we have existing concrete
> > examples where it would be useful. In these cases, and I assume almost
> > all other use-cases outside userspace khugepaged, max_ptes_* should be
> > ignored as the fundamental assumption of MADV_COLLAPSE is that the
> > user knows better, and IMHO, khugepaged heuristics shouldn't tell
> > users they are wrong.
>
> Valid point. And actually right after I replied I thought similarly on
> whether we need to connect the two interfaces at all..
>
> It's just that it's very easy to go think like that after reading the cover
> letter since that's exactly what it is comparing to.
> :)
>

Yes, this is my fault :) After others have had a chance to review / align,
I'll include the immediate use cases in the v3 cover letter as well,
rather than deep in individual patch messages.

> There's definitely a different view on user/kernel level of things, then
> it sounds reasonable to me if we add a new interface it by default has
> stronger semantics, otherwise we may not bother with it, given
> MADV_HUGEPAGE's existence.
>

Yes, good point.

> So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> as you said. And that's a real pure question I asked above, and I feel
> like your answer is actually "yes" we should always ignore the max_ptes_*
> fields until there's a proof that it'll be helpful.
>
> >
> > But this, as you mention, unsatisfactorily blurs the semantics of
> > MADV_COLLAPSE: "act like khugepaged here, but not here".
> >
> > As such, WDYT about the reverse-side of the coin of what you proposed:
> > to not couple the default behavior of MADV_COLLAPSE with khugepaged at
> > all? I.e., not tie the allocation semantics to
> > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
> > flags as necessary when/if a reimplementation of khugepaged in
> > userspace proves fruitful.
>
> Let's see whether others have thoughts, but what you proposed here makes
> sense to me.
>

Great! Sounds good to me.

Thank you again for your time, questions, and feedback!

Best,
Zach

> Thanks,
>
> --
> Peter Xu
>
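
For readers following the max_ptes_* question in this thread, the eligibility difference can be sketched as a toy model. This is not kernel code. The names KhugepagedLimits, RangeState and may_collapse are hypothetical, the 512-PTE hugepage size assumes x86-64 with 4 KiB base pages, and the default limit values are my understanding of the khugepaged defaults, so treat them as assumptions. A real collapse can also fail for reasons not modeled here (allocation failure, ineligible VMA).

```python
from dataclasses import dataclass

# Assumption: 2 MiB PMD-sized hugepage made of 4 KiB base pages.
HPAGE_PMD_NR = 512


@dataclass(frozen=True)
class KhugepagedLimits:
    """Hypothetical stand-in for the khugepaged/max_ptes_* sysfs knobs."""
    max_ptes_none: int = HPAGE_PMD_NR - 1    # assumed default
    max_ptes_swap: int = HPAGE_PMD_NR // 8   # assumed default
    max_ptes_shared: int = HPAGE_PMD_NR // 2  # assumed default


@dataclass(frozen=True)
class RangeState:
    """PTE counts observed in one hugepage-sized, hugepage-aligned range."""
    none: int = 0    # PTEs that are empty (never faulted in)
    swap: int = 0    # PTEs that point at swapped-out pages
    shared: int = 0  # PTEs that map pages shared with other mappings


def may_collapse(state: RangeState, limits: KhugepagedLimits,
                 enforce_limits: bool) -> bool:
    """Return True if the range passes the max_ptes_* checks.

    enforce_limits=True models khugepaged-like behavior.
    enforce_limits=False models the proposed MADV_COLLAPSE default,
    which ignores max_ptes_* entirely.
    """
    if not enforce_limits:
        # Limits ignored; other failure modes are not modeled.
        return True
    return (state.none <= limits.max_ptes_none
            and state.swap <= limits.max_ptes_swap
            and state.shared <= limits.max_ptes_shared)


# Freshly loaded executable text: most pages not demand-paged in yet,
# on a system whose admin set max_ptes_none to 0.
strict = KhugepagedLimits(max_ptes_none=0)
cold_text = RangeState(none=400)
print(may_collapse(cold_text, strict, enforce_limits=True))
print(may_collapse(cold_text, strict, enforce_limits=False))
```

With the strict limit, the khugepaged-like check rejects the mostly unpopulated text range while the MADV_COLLAPSE-like check accepts it, which is the "just want it to work" case described above.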