From: "Zach O'Keefe"
Date: Mon, 9 Jan 2023 16:01:00 -0800
Subject: Re: [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM
To: David Hildenbrand
Cc: James Houghton, Mike Kravetz, Muchun Song, Peter Xu, David Rientjes, Axel Rasmussen, Mina Almasry, Manish Mishra, Naoya Horiguchi, "Dr. David Alan Gilbert", "Matthew Wilcox (Oracle)", Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <797b85c0-ec50-f340-30dd-5a63b51dc45a@redhat.com>
References: <20230105101844.1893104-1-jthoughton@google.com> <20230105101844.1893104-10-jthoughton@google.com> <797b85c0-ec50-f340-30dd-5a63b51dc45a@redhat.com>

On Thu, Jan 5, 2023 at 7:29 AM David Hildenbrand wrote:
>
> On 05.01.23 11:18, James Houghton wrote:
> > Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
> > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> > applied to non-HugeTLB memory in the future, if such an application is
> > to arise.
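For illustration only, here is a minimal userspace sketch of the opt-in described above. It assumes a kernel carrying this (unmerged) series. It takes MADV_SPLIT = 26 from the alpha hunk quoted below and assumes the generic header uses the same value. `hgm_enable()` is a hypothetical helper name. On a kernel without the series, madvise() rejects the unknown advice with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Assumption: MADV_SPLIT comes from this proposed series and is not in
 * released uapi headers; 26 is the value the quoted alpha hunk uses,
 * assumed here to match asm-generic/mman-common.h.
 */
#ifndef MADV_SPLIT
#define MADV_SPLIT 26
#endif

/*
 * Hypothetical helper: opt a HugeTLB address range into high-granularity
 * mapping (HGM). Returns 0 on success or a negative errno value.
 * Per the commit message, this opt-in cannot be undone.
 */
int hgm_enable(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_SPLIT) != 0)
		return -errno;
	return 0;
}
```

Following the expected use-case in the commit message, this call would be made on the MAP_SHARED primary mapping before registering it with userfaultfd.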
> >
> > MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> > address ranges:
> > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> >    alignment.
> > 2. read()ing a page fault event from a userfaultfd will yield a
> >    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> >    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> >
> > There is no way to disable the API changes that come with issuing
> > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> > table mappings that come from the extended functionality that comes with
> > using MADV_SPLIT.
> >
> > For post-copy live migration, the expected use-case is:
> > 1. mmap(MAP_SHARED, some_fd) primary mapping
> > 2. mmap(MAP_SHARED, some_fd) alias mapping
> > 3. MADV_SPLIT the primary mapping
> > 4. UFFDIO_REGISTER/etc. the primary mapping
> > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> >    corresponding PAGE_SIZE sections in the primary mapping.
> >
> > More API changes may be added in the future.
> >
> > Signed-off-by: James Houghton
> > ---
> >  arch/alpha/include/uapi/asm/mman.h     |  2 ++
> >  arch/mips/include/uapi/asm/mman.h      |  2 ++
> >  arch/parisc/include/uapi/asm/mman.h    |  2 ++
> >  arch/xtensa/include/uapi/asm/mman.h    |  2 ++
> >  include/linux/hugetlb.h                |  2 ++
> >  include/uapi/asm-generic/mman-common.h |  2 ++
> >  mm/hugetlb.c                           |  3 +--
> >  mm/madvise.c                           | 26 ++++++++++++++++++++++++++
> >  8 files changed, 39 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 763929e814e9..7a26f3648b90 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -78,6 +78,8 @@
> >
> >  #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */
>
> I think we should make a split more generic, such that it also splits
> (pte-maps) a THP.
> Has that been discussed?

Thanks James / David.

MADV_SPLIT for THP has come up a few times. It first came up during the
initial RFC about hugepage collapse in process context, as the natural
inverse operation required by a generic userspace-managed hugepage
daemon. The second case, which is more immediately practical, is
avoiding stranding THPs on the deferred split queue (and thus still
incurring the memcg charge) for too long [1]. However, its exact
semantics and API have yet to be discussed and fleshed out (though I'm
planning to do exactly this in the near term).

Just as James has co-opted MADV_COLLAPSE for hugetlb, we can co-opt
MADV_SPLIT for THP when the time comes, which I think makes a lot of
sense.

Hopefully I can get my ducks in order and start a discussion about this
imminently.

Best,
Zach

[1] https://lore.kernel.org/linux-mm/YZ9kUD5AG6inbUEg@xz-m1.local/

> --
> Thanks,
>
> David / dhildenb
>
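For illustration, here is a sketch of step 5 of the post-copy flow in the quoted commit message. It rests on several assumptions:

- The primary mapping has already been split with MADV_SPLIT.
- The primary mapping is registered with UFFDIO_REGISTER in UFFDIO_REGISTER_MODE_MINOR, the mode whose faults UFFDIO_CONTINUE resolves.
- `fault_addr` is the address read from the userfaultfd.
- `round_down_pow2()` and `resolve_minor_fault()` are hypothetical names.

With HGM the reported address is already PAGE_SIZE-rounded (item 2 in the quoted list), so the rounding only matters when UFFD_FEATURE_EXACT_ADDRESS is used. Per item 1, continuing a single PAGE_SIZE section like this is only allowed once MADV_SPLIT has been applied.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Round addr down to a multiple of size; size must be a power of two. */
uint64_t round_down_pow2(uint64_t addr, uint64_t size)
{
	return addr & ~(size - 1);
}

/*
 * Hypothetical minor-fault handler (assumes MADV_SPLIT was applied):
 *   primary: base of the userfaultfd-registered (MINOR mode) mapping
 *   alias:   base of the second mapping of the same file
 *   src:     base of a buffer holding the migrated memory contents
 * Copies the faulting PAGE_SIZE section into the alias mapping, then
 * asks the kernel to map that section in the primary mapping.
 * Returns 0 on success or a negative errno value.
 */
int resolve_minor_fault(int uffd, char *primary, char *alias,
			const char *src, uint64_t fault_addr,
			uint64_t page_size)
{
	uint64_t start = round_down_pow2(fault_addr, page_size);
	uint64_t off = start - (uint64_t)(uintptr_t)primary;

	/* Populate the shared page cache through the alias mapping. */
	memcpy(alias + off, src + off, page_size);

#ifdef UFFDIO_CONTINUE
	struct uffdio_continue cont = {
		.range = { .start = start, .len = page_size },
		.mode = 0,
	};
	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) != 0)
		return -errno;
	return 0;
#else
	return -ENOSYS; /* uapi headers predate UFFDIO_CONTINUE */
#endif
}
```

In a real handler, `page_size` would come from sysconf(_SC_PAGESIZE), and `fault_addr` from `msg.arg.pagefault.address` of a `struct uffd_msg` read() from the userfaultfd.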