Date: Wed, 21 Dec 2022 17:10:08 -0500
From: Peter Xu
To: Mike Kravetz
Cc: James Houghton, Muchun Song, David Hildenbrand, David Rientjes,
    Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
    Naoya Horiguchi, "Dr. David Alan Gilbert", "Matthew Wilcox (Oracle)",
    Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
References: <20221021163703.3218176-1-jthoughton@google.com>
    <20221021163703.3218176-34-jthoughton@google.com>

On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> On 12/21/22 15:21, James Houghton wrote:
> > On Wed, Dec 21, 2022 at 2:23 PM Peter Xu wrote:
> > >
> > > James,
> > >
> > > On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > > > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu wrote:
> > > > >
> > > > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > > > enable HGM. Userspace can check if the feature exists in
> > > > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > > > support and therefore did not enable HGM.
> > > > > >
> > > > > > Signed-off-by: James Houghton
> > > > >
> > > > > It's still slightly a pity that this can only be enabled by an uffd context
> > > > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> > > >
> > > > The idea here is that, for applications that can conceivably benefit
> > > > from HGM, we have a mechanism for enabling it for that application. So
> > > > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > > > prefer this approach over something more general like MADV_ENABLE_HGM
> > > > or something.
> > >
> > > Sorry to get back to this very late - I know this has been discussed since
> > > the very early stage of the feature, but is there any reasoning behind?
> > >
> > > When I start to think seriously on applying this to process snapshot with
> > > uffd-wp I found that the minor mode trick won't easily play - normally
> > > that's a case where all the pages were there mapped huge, but when the app
> > > wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> > > pages, probably some size that the user can specify. It'll be non-trivial
> > > to enable HGM during that phase using MINOR mode because in that case the
> > > pages are all mapped.
> > >
> > > For the long term, I am just still worried the current interface is still
> > > not as flexible.
> >
> > Thanks for bringing this up, Peter. I think the main reason was:
> > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > and is not supported.
>
> IIRC, I think we wanted to initially limit the usage to the very
> specific use case (live migration). The idea is that we could then
> expand usage as more use cases came to light.
>
> Another good thing is that userfaultfd has versioning built into the
> API. Thus a user can determine if HGM is enabled in their running
> kernel.

I don't worry much about this one: afaiu, if we have any way to enable hgm
then the user can just try enabling it on a test vma, just like when an app
wants to detect whether a new madvise() is present on the current host OS.
Besides, I'm wondering whether something like /sys/kernel/mm/hugepages/hgm
would work too.

>
> > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > allowed as of this patch series, but it could be allowed in the
> > future. To add support in the same way as this series, we would add
> > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > having to add another feature isn't great; is this what you're
> > concerned about?
> >
> > Considering MADV_ENABLE_HUGETLB...
> > 1. If a user provides this, then the contract becomes: "the kernel may
> > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > high-granularities, provided the support exists", but it becomes
> > unclear to userspace to know what's supported and what isn't.
> > 2. We would then need to keep track if a user explicitly enabled it,
> > or if it got enabled automatically in response to memory poison, for
> > example. Not a big problem, just a complication.
> > (Otherwise, if HGM
> > got enabled for poison, suddenly userspace would be allowed to do
> > things it wasn't allowed to do before.)

We could alternatively have two flags for each vma: (a) hgm_advised and (b)
hgm_enabled. (a) always sets (b) but not vice versa. We can limit poison to
set (b) only. For this patchset, it can be all about (a).

> > 3. This API makes sense for enabling HGM for something outside of
> > userfaultfd, like MADV_DONTNEED.
>
> I think #3 is key here. Once we start applying HGM to things outside
> userfaultfd, then more thought will be required on APIs. The API is
> somewhat limited by design until the basic functionality is in place.

Mike, could you elaborate on what the major concern is with having hgm used
outside the uffd and live migration use cases? I feel like I'm missing
something here.

I can understand that we want to limit the usage to when the user explicitly
asks for hgm, because we want to keep the old behavior intact. However, if we
want another way to enable hgm, it'll still need one knob anyway even outside
uffd, and I thought that would serve the same purpose, or maybe not?

Thanks,

-- 
Peter Xu
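
For illustration only, the two-flag idea mentioned above (hgm_advised and hgm_enabled) could be modelled roughly as below. This is a standalone userspace sketch, not kernel code: the struct and every helper name are hypothetical, and none of them exist in the kernel or in this patch series. It only shows the intended relationship, where an explicit opt-in sets both flags, poison handling sets (b) only, and userspace-visible high-granularity operations are gated on (a).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vma state; names are illustrative assumptions only. */
struct vma_hgm_state {
	bool hgm_advised;	/* (a) userspace explicitly opted in to HGM */
	bool hgm_enabled;	/* (b) high-granularity mappings may exist */
};

/* Invariant from the proposal: (a) implies (b), but not vice versa. */
static void hgm_check_invariant(const struct vma_hgm_state *s)
{
	assert(!s->hgm_advised || s->hgm_enabled);
}

/* Explicit opt-in (uffd feature or some other knob): sets (a) and (b). */
static void hgm_advise(struct vma_hgm_state *s)
{
	s->hgm_advised = true;
	s->hgm_enabled = true;
	hgm_check_invariant(s);
}

/* Kernel-internal enablement, e.g. for memory poison: sets (b) only. */
static void hgm_enable_for_poison(struct vma_hgm_state *s)
{
	s->hgm_enabled = true;
	hgm_check_invariant(s);
}

/*
 * Gate for userspace-visible high-granularity operations (for example a
 * high-granularity UFFDIO_CONTINUE): allowed only with explicit opt-in,
 * so poison alone never changes what userspace is permitted to do.
 */
static bool hgm_user_ops_allowed(const struct vma_hgm_state *s)
{
	return s->hgm_advised;
}
```

With this split, a vma that only went through hgm_enable_for_poison() still rejects user-initiated high-granularity operations, which is the behavior change James was worried about in point 2.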