From: James Houghton
Date: Wed, 24 May 2023 13:26:36 -0700
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
To: Mike Kravetz
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Peter Xu, Michal Hocko, Matthew Wilcox, David Hildenbrand, David Rientjes, Axel Rasmussen, Jiaqi Yan
In-Reply-To: <20230306191944.GA15773@monkey>

On Mon, Mar 6, 2023 at 11:19 AM Mike Kravetz wrote:
>
> This is past the deadline, so feel free to ignore.  However, ...
>
> James Houghton has been working on the concept of HugeTLB High Granularity
> Mapping (HGM) as discussed here:
> https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
>
> The primary motivation for this work is post-copy live migration of VMs backed
> by hugetlb pages via userfaultfd.
> A followup use case is more gracefully
> handling memory errors/poison on hugetlb pages.
>
> As can be seen by the size of James's patch set, the required changes for
> HGM are a bit complex and involved.  This is also complicated the need
> choosing a 'mapcount strategy' as the previous scheme used by hugetlb
> will no longer work.
>
> A HGM for hugetlbfs session would present the current approach and challenges.
> While much of the work is confined to hugetlb, there is a bit spill over to
> other mm areas: specifically page table walking.  A discussion on ways to
> move forward with this effort would be appreciated.
> --
> Mike Kravetz

Hi everyone,

If you came to the HGM session at LSF/MM/BPF, thank you! I want to
address some of the feedback I got and restate the importance of HGM,
especially as it relates to handling memory poison.

## Memory poison is a problem

HGM allows us to unmap poison at 4K granularity instead of unmapping
the entire hugetlb page. For applications that use HugeTLB, losing the
entire hugepage can be catastrophic.

For example, if a hypervisor is using 1G pages for guest memory, a
single poisoned page costs the VM 1G of its physical address space
(even losing 2M will most likely kill the VM). If we can limit the
poisoning to only 4K, the VM will most likely be able to recover. This
improved recoverability applies to other HugeTLB users as well, like
databases.

## Adding a new filesystem has risks, and unification will take years

Most of the feedback I got from the HGM session was to simply avoid
adding new code to HugeTLB, and instead to make a new device or
filesystem.

Creating a new device or filesystem could work, but it leaves existing
HugeTLB users with no answer for memory poison.
Users would need to switch to the new device/filesystem if they want
better hwpoison handling, and it will probably take years for the new
device/filesystem to support all the features that HugeTLB supports
today (so beyond PUD+ mappings, we would need page table sharing, page
struct freeing, and even private mappings/CoW).

If we make a new filesystem and are unable to implement the HugeTLB
uapi exactly with that filesystem, we will be stuck, unable to remove
HugeTLB. We would strongly like to avoid coexisting HugeTLB
implementations (similar to cgroup v1 and cgroup v2) if at all possible.

Instead of making a new filesystem, we could add HugeTLB-like features
to tmpfs, such as support for gigantic page allocations (from bootmem
or CMA, like HugeTLB). This path would work to mostly unify HugeTLB
with tmpfs, but existing HugeTLB users would still have to wait many
years before poison can be handled more efficiently. (And some users
care about things like hugetlb_cgroup!)

## HGM doesn’t hinder future unification

HGM doesn’t add any new special cases into mm code; it takes advantage
of the special cases that already exist to support HugeTLB. HGM also
isn’t adding a completely novel feature that can’t be replicated by
THPs: PTE-mapping of THPs is already supported.

HGM solves a problem that HugeTLB users have right now: unnecessarily
large portions of memory are poisoned. Unless we fix HugeTLB itself, we
will have to spend years effectively rewriting HugeTLB and telling users
to switch to the new system that gets built.

Given all this, I think we should continue to move forward with HGM
unless there is another feasible way to solve poisoning for existing
HugeTLB users. Also, I encourage everyone to read the series itself
(it's not all that complicated!).

- James