From: James Houghton
Date: Tue, 21 Feb 2023 08:33:15 -0800
Subject: Re: [PATCH v2 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM
To: Mina Almasry
Cc: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton,
    David Hildenbrand, David Rientjes, Axel Rasmussen, "Zach O'Keefe",
    Manish Mishra, Naoya Horiguchi, "Dr. David Alan Gilbert",
    "Matthew Wilcox (Oracle)", Vlastimil Babka, Baolin Wang, Miaohe Lin,
    Yang Shi, Frank van der Linden, Jiaqi Yan, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
References: <20230218002819.1486479-1-jthoughton@google.com>
    <20230218002819.1486479-10-jthoughton@google.com>

On Fri, Feb 17, 2023 at 5:58 PM
Mina Almasry wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton wrote:
> >
> > Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable
> > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> > applied to non-HugeTLB memory in the future, if such an application is
> > to arise.
> >
> > MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> > address ranges:
> > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> >    alignment.
> > 2. read()ing a page fault event from a userfaultfd will yield a
> >    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> >    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> >
> > There is no way to disable the API changes that come with issuing
> > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> > table mappings that come from the extended functionality that comes with
> > using MADV_SPLIT.
> >
>
> So is a hugetlb page or VMA that has been MADV_SPLIT + MADV_COLLAPSE
> distinct from a hugetlb page or vma that has not been? I thought
> COLLAPSE would reverse the effects on SPLIT completely.

Right now, MADV_COLLAPSE does *not* completely undo the effects of an
MADV_SPLIT. The API changes that come from MADV_SPLIT aren't undone
with an MADV_COLLAPSE.

>
> > For post-copy live migration, the expected use-case is:
> > 1. mmap(MAP_SHARED, some_fd) primary mapping
> > 2. mmap(MAP_SHARED, some_fd) alias mapping
> > 3. MADV_SPLIT the primary mapping
> > 4. UFFDIO_REGISTER/etc. the primary mapping
> > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> >    corresponding PAGE_SIZE sections in the primary mapping.
> >
>
> Huh, so MADV_SPLIT doesn't actually split an existing PMD mapping into
> high granularity mappings. Instead it says that future mappings may be
> high granularity? I assume they may not even be high granularity, like
> if the alias mapping faulted in a full hugetlb page (without
> UFFDIO_CONTINUE) that page would be regular mapped not high
> granularity mapped.

MADV_SPLIT just means "userspace is aware that they are able to start
mapping HugeTLB pages at high-granularity". Right now the only way to
get high-granularity mappings is with UFFDIO_CONTINUE, but there may
be other ways in the future.

As of this series, if you MADV_SPLIT a HugeTLB VMA and you aren't
using userfaultfd minor faults, it's basically a no-op. The mappings
that are created will still be huge. I could change this, but I don't
really see a reason to right now.

>
> This may be bikeshedding but I do think a clearer name is warranted.
> Maybe MADV_MAY_SPLIT or something.

I agree -- MADV_MAY_SPLIT more accurately describes the HugeTLB
functionality. I really don't mind what the MADV is called.

I think enabling the high-granularity userfaultfd bits with a
userfaultfd feature[1] worked reasonably well. There is some API
discussion in that thread[1].

[1]: https://lore.kernel.org/linux-mm/20221021163703.3218176-34-jthoughton@google.com/
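
For anyone following along, the primary/alias flow from the quoted
commit message can be sketched roughly as below. This is an
illustration under several assumptions, not code from the series: a
plain temporary file stands in for the HugeTLB fd, memset() stands in
for copying in migrated memory, alias_copy_demo() is a made-up name,
and the userfaultfd steps are left as comments. MADV_SPLIT is not
defined in mainline headers, so the madvise() call is guarded and
compiles out without the patched uapi headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustration only. Returns 0 when data written through the alias
 * mapping is visible through the primary mapping, -1 otherwise.
 */
static int alias_copy_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *primary, *alias;
	size_t len;
	FILE *f;
	int fd, ret = -1;

	assert(page > 0);
	len = 4 * (size_t)page;

	/* Stand-in for the HugeTLB-backed file (an assumption). */
	f = tmpfile();
	if (!f)
		return -1;
	fd = fileno(f);
	if (ftruncate(fd, (off_t)len) != 0)
		goto out_close;

	/* Steps 1 and 2: two MAP_SHARED mappings of the same file. */
	primary = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (primary == MAP_FAILED)
		goto out_close;
	alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (alias == MAP_FAILED)
		goto out_unmap;

	/* Step 3: only built when the patched headers define MADV_SPLIT. */
#ifdef MADV_SPLIT
	(void)madvise(primary, len, MADV_SPLIT);
#endif

	/* Step 4 (UFFDIO_REGISTER on the primary mapping) is omitted. */

	/* Step 5: fill one PAGE_SIZE piece through the alias. */
	memset(alias + page, 0xab, (size_t)page);
	/* With HGM, UFFDIO_CONTINUE on primary + page would follow here. */

	/* The primary sees the copied page; untouched pages stay zero. */
	if (primary[page] == 0xab && primary[0] == 0)
		ret = 0;

	munmap(alias, len);
out_unmap:
	munmap(primary, len);
out_close:
	fclose(f);
	return ret;
}
```

Because the alias is never registered with userfaultfd, writes through
it do not fault back to userspace. That is why the copy goes through
the alias and only the UFFDIO_CONTINUE targets the primary.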