Date: Tue, 13 Jun 2023 12:17:23 -0300
From: Jason Gunthorpe
To: James Houghton
Cc: Dan Williams, Mike Kravetz, David Hildenbrand, Miaohe Lin,
    Naoya Horiguchi, Peter Xu, Yosry Ahmed, linux-mm@kvack.org,
    Michal Hocko, Matthew Wilcox, David Rientjes, Axel Rasmussen,
    lsf-pc@lists.linux-foundation.org, Jiaqi Yan, jane.chu@oracle.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs

On Fri, Jun 09, 2023 at 01:20:19PM -0700, James Houghton wrote:

> So, we could:
> 1. Do what HGM does and have the kernel unmap the 4K page in the
> userspace page tables.
> 2. On-the-fly change the VMA for our hugepage to not be HugeTLB
> anymore, and re-map all the good 4K pages.
> 3. Tell userspace that it must change its mapping from HugeTLB to
> something else, and move the good 4K pages into the new mapping.

> (2) feels like more complexity than (1). If a user created a
> MAP_HUGETLB mapping and now it isn't HugeTLB, that feels wrong.
>
> (3) today isn't possible, but with Jiaqi's improvement to hugetlbfs
> read() it becomes possible. We'll need to have an extra 1G of memory
> while we are doing this copying/recovery, and it isn't transparent at
> all.

It is transparent to the VM; it just sees a longer EPT fault response
time if it touches that range.

> (3) is additionally painful when considering live migration. We have
> to keep the 4K page unmapped after the migration (to keep it poisoned
> from the guest's perspective), but the page is no longer *actually*
> poisoned on the host. To get the memory we need to back our
> fake-poisoned pages with tmpfs, we would need to free our 1G page.
> Getting that page back later isn't trivial.

Why does this change with #1? As David says, you can't transparently
"fix" the page, so when you migrate a VM with unavailable pages you
must migrate those unavailable pages too, regardless of whether the
kernel or userspace made them unavailable.

So, regardless, you end up with a VM that has holes in its address
map. I guess if the hole is created from a PTE map of a 1G hugetlbfs
page it is easier to "heal" back to a full 1G map, but this healing
could also be done by copying.

It seems to me the main value of the kernel-side approach is that it
eliminates the copies and shortens the time the 1G page is unavailable
to the guest.

> So (1) still seems like the most natural solution, so the question
> becomes: how exactly do we implement 4K unmapping? And that brings us
> back to the main question about how HGM should be implemented in
> general.

IMHO, if you can do it in userspace with a copy, you can solve your
urgent customer need and then have more time to do the big kernel
rework required to optimize it with kernel support.

Jason
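
As a rough illustration of the copy-based recovery in option (3), here is a minimal Python sketch. It is not code from this thread or from any VMM. It assumes that reading a poisoned 4K page fails with EIO (which is roughly what the improved hugetlbfs read() is expected to report) and models that with a hypothetical `read_page` callback. `copy_good_pages` and `fake_source` are illustrative names only, and the sizes are tiny compared with a real 1G page.

```python
import errno
import mmap

PAGE_SIZE = 4096  # base page size assumed for this sketch


def copy_good_pages(read_page, npages, page_size=PAGE_SIZE):
    """Copy every readable page into a fresh anonymous mapping.

    read_page(i) is a hypothetical callback standing in for a pread()
    on the hugetlbfs file: it returns page_size bytes for page i, or
    raises OSError(EIO) if that page is poisoned.
    Returns the new mapping and the list of page indices left as holes.
    """
    dst = mmap.mmap(-1, npages * page_size)  # anonymous, zero-filled
    holes = []
    for i in range(npages):
        try:
            data = read_page(i)
        except OSError as err:
            if err.errno != errno.EIO:
                raise
            # Poisoned page: record the hole. A VMM would keep this
            # range unavailable to the guest instead of exposing zeros.
            holes.append(i)
            continue
        dst[i * page_size:(i + 1) * page_size] = data
    return dst, holes


def fake_source(npages, poisoned, page_size=PAGE_SIZE):
    """Build a read_page callback over synthetic data for testing."""
    def read_page(i):
        if i in poisoned:
            raise OSError(errno.EIO, "poisoned page")
        return bytes([i % 255 + 1]) * page_size
    return read_page
```

The extra memory James mentions shows up here as `dst`: the new mapping has to exist alongside the old one until the copy finishes, and the list of holes is what would have to travel with the VM on live migration.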