From: James Houghton
Date: Wed, 21 Dec 2022 19:02:53 -0500
Subject: Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
To: Mike Kravetz
Cc: Peter Xu, Muchun Song, David Hildenbrand, David Rientjes,
 Axel Rasmussen, Mina Almasry, "Zach O'Keefe", Manish Mishra,
 Naoya Horiguchi, "Dr . David Alan Gilbert", "Matthew Wilcox (Oracle)",
 Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20221021163703.3218176-1-jthoughton@google.com>
 <20221021163703.3218176-34-jthoughton@google.com>

On Wed, Dec 21, 2022 at 5:32 PM Mike Kravetz wrote:
>
> On 12/21/22 17:10, Peter Xu wrote:
> > On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> > > On 12/21/22 15:21, James Houghton wrote:
> > > > Thanks for bringing this up, Peter. I think the main reason was:
> > > > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > > > and is not supported.
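As a rough illustration of the point quoted above (separate UFFD_FEATUREs
tell userspace what is and is not supported), userspace could test the
feature mask that the UFFDIO_API ioctl reports in the `features` field of
struct uffdio_api. This is only a sketch: the two SKETCH_* bit values and
the helper name are made up for the example and are not the values this
series defines.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Placeholder bit positions, chosen only for this sketch. The real values
 * would be whatever the uapi header ends up assigning to the HGM features.
 */
#define SKETCH_FEATURE_MINOR_HGM (UINT64_C(1) << 62)
#define SKETCH_FEATURE_WP_HGM    (UINT64_C(1) << 63)

/*
 * Return true only if every bit in `wanted` is set in `advertised`.
 * `advertised` stands in for the feature mask the kernel reports back
 * after the UFFDIO_API handshake.
 */
static bool uffd_has_features(uint64_t advertised, uint64_t wanted)
{
	return (advertised & wanted) == wanted;
}
```

With per-operation feature bits, a caller can ask about 4K UFFDIO_CONTINUE
and 4K UFFDIO_WRITEPROTECT independently instead of inferring both from a
single "HGM is present" signal.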
> > >
> > > IIRC, I think we wanted to initially limit the usage to the very
> > > specific use case (live migration). The idea is that we could then
> > > expand usage as more use cases came to light.
> > >
> > > Another good thing is that userfaultfd has versioning built into the
> > > API. Thus a user can determine if HGM is enabled in their running
> > > kernel.
> >
> > I don't worry much on this one, afaiu if we have any way to enable hgm then
> > the user can just try enabling it on a test vma, just like when an app
> > wants to detect whether a new madvise() is present on the current host OS.

That would be enough to test if HGM was merely present, but not if
specific features like 4K UFFDIO_CONTINUEs or 4K UFFDIO_WRITEPROTECTs
were available. You could always check these by making a HugeTLB VMA
and setting it up correctly for userfaultfd/etc., but that's a little
messy.

> >
> > Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
> > would work too.

I'm not opposed to this.

> >
> > >
> > > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > > > allowed as of this patch series, but it could be allowed in the
> > > > future. To add support in the same way as this series, we would add
> > > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > > > having to add another feature isn't great; is this what you're
> > > > concerned about?
> > > >
> > > > Considering MADV_ENABLE_HUGETLB...
> > > > 1. If a user provides this, then the contract becomes: "the kernel may
> > > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > > > high-granularities, provided the support exists", but it becomes
> > > > unclear to userspace to know what's supported and what isn't.
> > > > 2. We would then need to keep track if a user explicitly enabled it,
> > > > or if it got enabled automatically in response to memory poison, for
> > > > example. Not a big problem, just a complication. (Otherwise, if HGM
> > > > got enabled for poison, suddenly userspace would be allowed to do
> > > > things it wasn't allowed to do before.)
> >
> > We could alternatively have two flags for each vma: (a) hgm_advised and (b)
> > hgm_enabled. (a) always sets (b) but not vice versa. We can limit poison
> > to set (b) only. For this patchset, it can be all about (a).

My thoughts exactly. :)

> >
> > > > 3. This API makes sense for enabling HGM for something outside of
> > > > userfaultfd, like MADV_DONTNEED.
> > >
> > > I think #3 is key here. Once we start applying HGM to things outside
> > > userfaultfd, then more thought will be required on APIs. The API is
> > > somewhat limited by design until the basic functionality is in place.
> >
> > Mike, could you elaborate what's the major concern of having hgm used
> > outside uffd and live migration use cases?
> >
> > I feel like I miss something here. I can understand we want to limit the
> > usage only when the user specifies using hgm because we want to keep the
> > old behavior intact. However if we want another way to enable hgm it'll
> > still need one knob anyway even outside uffd, and I thought that'll service
> > the same purpose, or maybe not?
>
> I am not opposed to using hgm outside the use cases targeted by this series.
>
> It seems that when we were previously discussing the API we spent a bunch of
> time going around in circles trying to get the API correct. That is expected
> as it is more difficult to take all users/uses/abuses of the API into account.
>
> Since the initial use case was fairly limited, it seemed like a good idea to
> limit the API to userfaultfd. In this way we could focus on the underlying
> code/implementation and then expand as needed. Of course, with an eye on
> anything that may be a limiting factor in the future.
>
> I was not aware of the uffd-wp use case, and am more than happy to discuss
> expanding the API.

So considering two API choices:

1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for
UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for
UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start
allowing high-granularity choices (not sure if this is bad; we started
allowing it for HugeTLB recently with no other API change, AFAIA).

2. MADV_ENABLE_HGM or something similar. The changes to
UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically,
provided they are implemented.

I don't mind one way or the other. Peter, I assume you prefer #2.
Mike, what about you? If we decide on something other than #1, I'll
make the change before sending v1 out.

- James
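
For reference, the two-flag idea from earlier in the thread (hgm_advised
and hgm_enabled, where advising always enables but poison handling only
enables) could be sketched roughly as follows. This is a standalone
illustration, not code from the series: the struct and the helper names
are hypothetical, and a real implementation would presumably keep this
state in the VMA rather than in a separate struct.

```c
#include <stdbool.h>

/* Hypothetical per-VMA HGM state for this sketch only. */
struct hgm_state {
	bool advised;	/* (a) userspace explicitly asked for HGM */
	bool enabled;	/* (b) HGM is in use on this VMA for any reason */
};

/* Userspace opt-in: (a) always sets (b). */
static void hgm_advise(struct hgm_state *s)
{
	s->advised = true;
	s->enabled = true;
}

/* Kernel-internal enable, e.g. for memory poison: sets (b) only. */
static void hgm_enable_for_poison(struct hgm_state *s)
{
	s->enabled = true;
}

/*
 * High-granularity requests from userspace are gated on (a), so HGM
 * being turned on for poison does not grant userspace anything new.
 */
static bool hgm_user_op_allowed(const struct hgm_state *s)
{
	return s->advised;
}
```

Keeping the userspace gate on `advised` alone addresses the complication
raised in point 2 of the quoted list: poison handling can split mappings
without changing what userspace is allowed to request.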