From: David Hildenbrand
Organization: Red Hat
Date: Wed, 7 Jun 2023 09:38:35 +0200
To: David Rientjes, Mike Kravetz
Cc: James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Peter Xu, Michal Hocko, Matthew Wilcox, Axel Rasmussen, Jiaqi Yan
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
Message-ID: <7c42a738-d082-3338-dfb5-fd28f75edc58@redhat.com>
In-Reply-To: <7e0ce268-f374-8e83-2b32-7c53f025fec5@google.com>
References: <20230306191944.GA15773@monkey> <20230602172723.GA3941@monkey> <7e0ce268-f374-8e83-2b32-7c53f025fec5@google.com>

On 07.06.23 00:40, David Rientjes wrote:
> On Fri, 2 Jun 2023, Mike Kravetz wrote:
>
>> The benefit of HGM in the case of memory errors is fairly obvious. As
>> mentioned above, when a memory error is encountered on a hugetlb page,
>> that entire hugetlb page becomes inaccessible to the application. Losing
>> 1G or even 2M of data is often catastrophic for an application. There
>> is often no way to recover.
>> It just makes sense that recovering from
>> the loss of 4K of data would generally be easier and more likely to be
>> possible. Today, when Oracle DB encounters a hard memory error on a
>> hugetlb page it will shut down. Plans are currently in place to repair and
>> recover from such errors if possible. Isolating the area of data loss
>> to a single 4K page significantly increases the likelihood of repair and
>> recovery.
>>
>> Today, when a memory error is encountered on a hugetlb page an
>> application is 'notified' of the error by a SIGBUS, as well as the
>> virtual address of the hugetlb page and its size. This makes sense as
>> hugetlb pages are accessed by a single page table entry, so you get all
>> or nothing. As mentioned by James above, this is catastrophic for VMs
>> as the hypervisor has just been told that 2M or 1G is now inaccessible.
>> With HGM, we can isolate such errors to 4K.
>>
>> Backing VMs with hugetlb pages is a real use case today. We are seeing
>> memory errors on such hugetlb pages with the result being VM failures.
>> One of the advantages of backing VMs with THPs is that they are split in
>> the case of memory errors. HGM would allow similar functionality.
>
> Thanks for this context, Mike, it's very useful.
>
> I think everybody is aligned on the desire to map memory at smaller
> granularities for multiple use cases and it's fairly clear that these use
> cases are critically important to multiple stakeholders.
>
> I think the open question is whether this functionality is supported in
> hugetlbfs (like with HGM) or that there is a hard requirement that we must
> use THP for this support.
>
> I don't think that hugetlbfs is feature frozen, but if there's a strong
> bias toward not merging additional complexity into the subsystem that
> would be useful to know.
> I personally think the critical use cases described

Having attended that session, I thought it was clear that the majority
of the people who spoke up expressed "no more added complexity". So I
think there is a clear, strong bias, at least among the people attending
that session.

> above justify the added complexity of HGM to hugetlb and we wouldn't be
> blocked by the long standing (15+ years) desire to mesh hugetlb into the
> core MM subsystem before we can stop the pain associated with memory
> poisoning and live migration.
>
> Are there strong objections to extending hugetlb for this support?

I don't want to get too involved in this discussion (busy), but I
absolutely agree with the points that were raised at LSF/MM:

(A) hugetlb is complicated and very special (many things are not
    integrated with core-mm, so we need special-casing all over the
    place). [example: what is a pte?]

(B) We added a bunch of complexity in the past that some people
    considered very important (and it was not feature frozen, right?
    ;) ). Looking back, we might just not have done some of that, or
    done it differently/cleaner -- better integrated in the core.
    (PMD sharing, MAP_PRIVATE, a reservation mechanism that still
    requires preallocation because it fails with NUMA/fork, ...)

(C) Unifying hugetlb and the core looks like it's getting more and more
    out of reach, maybe even impossible with all the complexity we added
    over the years (well, and keep adding).

Sure, HGM for the purpose of better hwpoison handling makes sense. But
hugetlb is probably 20 years old and hwpoison handling probably 13 years
old. So we managed to get quite far without that optimization.

Absolutely, HGM for better postcopy live migration also makes sense; I
guess nobody disagrees on that.

But as discussed in that session, maybe we should just start anew and
implement something that integrates nicely with the core, instead of
making hugetlb more complicated and even more special.

Now, we all know, nobody wants to do the heavy lifting for that, that's
why we're discussing how to get in yet another complicated feature.
Maybe we can manage to reduce complexity and integrate some parts nicer
with core-mm, I don't know.

Don't get me wrong, Mike is the maintainer, I'm just reading along and
voicing what I observed in the LSF/MM session (well, I mixed in some of
my own opinion ;) ).

-- 
Cheers,

David / dhildenb