From: Yosry Ahmed
Date: Wed, 7 Jun 2023 00:51:37 -0700
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
To: David Hildenbrand
Cc: David Rientjes, Mike Kravetz, James Houghton, Naoya Horiguchi,
 Miaohe Lin, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
 Peter Xu, Michal Hocko, Matthew Wilcox, Axel Rasmussen, Jiaqi Yan
In-Reply-To: <7c42a738-d082-3338-dfb5-fd28f75edc58@redhat.com>
References: <20230306191944.GA15773@monkey> <20230602172723.GA3941@monkey>
 <7e0ce268-f374-8e83-2b32-7c53f025fec5@google.com>
 <7c42a738-d082-3338-dfb5-fd28f75edc58@redhat.com>

On Wed, Jun 7, 2023 at 12:38 AM David Hildenbrand wrote:
>
> On 07.06.23 00:40, David Rientjes wrote:
> > On Fri, 2 Jun 2023, Mike Kravetz wrote:
> >
> >> The benefit of HGM in the case of memory errors is fairly obvious. As
> >> mentioned above, when a memory error is encountered on a hugetlb page,
> >> that entire hugetlb page becomes inaccessible to the application. Losing,
> >> 1G or even 2M of data is often catastrophic for an application. There
> >> is often no way to recover.
> >> It just makes sense that recovering from
> >> the loss of 4K of data would generally be easier and more likely to be
> >> possible. Today, when Oracle DB encounters a hard memory error on a
> >> hugetlb page it will shutdown. Plans are currently in place repair and
> >> recover from such errors if possible. Isolating the area of data loss
> >> to a single 4K page significantly increases the likelihood of repair and
> >> recovery.
> >>
> >> Today, when a memory error is encountered on a hugetlb page an
> >> application is 'notified' of the error by a SIGBUS, as well as the
> >> virtual address of the hugetlb page and it's size. This makes sense as
> >> hugetlb pages are accessed by a single page table entry, so you get all
> >> or nothing. As mentioned by James above, this is catastrophic for VMs
> >> as the hypervisor has just been told that 2M or 1G is now inaccessible.
> >> With HGM, we can isolate such errors to 4K.
> >>
> >> Backing VMs with hugetlb pages is a real use case today. We are seeing
> >> memory errors on such hugetlb pages with the result being VM failures.
> >> One of the advantages of backing VMs with THPs is that they are split in
> >> the case of memory errors. HGM would allow similar functionality.
> >
> > Thanks for this context, Mike, it's very useful.
> >
> > I think everybody is aligned on the desire to map memory at smaller
> > granularities for multiple use cases and it's fairly clear that these use
> > cases are critically important to multiple stakeholders.
> >
> > I think the open question is whether this functionality is supported in
> > hugetlbfs (like with HGM) or that there is a hard requirement that we must
> > use THP for this support.
> >
> > I don't think that hugetlbfs is feature frozen, but if there's a strong
> > bias toward not merging additional complexity into the subsystem that
> > would useful to know. I personally think the critical use cases described
I personally think the critical use cases descri= bed > > At least I, attending that session, thought that it was clear that the > majority of the people speaking up clearly expressed "no more added > complexity". So I think there is a clear strong bias, at least from the > people attending that session. > > > > above justify the added complexity of HGM to hugetlb and we wouldn't be > > blocked by the long standing (15+ years) desire to mesh hugetlb into th= e > > core MM subsystem before we can stop the pain associated with memory > > poisoning and live migration. > > > > Are there strong objections to extending hugetlb for this support? > > I don't want to get too involved in this discussion (busy), but I > absolutely agree on the points that were raised at LSF/MM that > > (A) hugetlb is complicated and very special (many things not integrated > with core-mm, so we need special-casing all over the place). [example: > what is a pte?] > > (B) We added a bunch of complexity in the past that some people > considered very important (and it was not feature frozen, right? ;) ). > Looking back, we might just not have done some of that, or done it > differently/cleaner -- better integrated in the core. (PMD sharing, > MAP_PRIVATE, a reservation mechanism that still requires preallocation > because it fails with NUMA/fork, ...) > > (C) Unifying hugetlb and the core looks like it's getting more and more > out of reach, maybe even impossible with all the complexity we added > over the years (well, and keep adding). > > Sure, HGM for the purpose of better hwpoison handling makes sense. But > hugetlb is probably 20 years old and hwpoison handling probably 13 years > old. So we managed to get quite far without that optimization. > > Absolutely, HGM for better postcopy live migration also makes sense, I > guess nobody disagrees on that. 
>
>
> But as discussed in that session, maybe we should just start anew and
> implement something that integrates nicely with the core , instead of
> making hugetlb more complicated and even more special.
>
>
> Now, we all know, nobody wants to do the heavy lifting for that, that's
> why we're discussing how to get in yet another complicated feature.

If nobody wants to do the heavy lifting, and unifying hugetlb with core
MM is becoming impossible as you state, then does adding another
feature to hugetlb (one that we all agree is useful for multiple use
cases) really make things worse? In other words, if someone decides
tomorrow to do the heavy lifting, how much harder does that become
because of HGM, if at all?

I am far from an expert here, just an observer, but if the answer to
the above question is "HGM doesn't actually make it worse" or "HGM
only makes things slightly harder", then I naively think that it's
something we should do, from a pure cost-benefit analysis.

Again, I don't have a lot of context here, and I understand everyone's
frustration with the current state of hugetlb. Just my 2 cents.

>
> Maybe we can manage to reduce complexity and integrate some parts nicer
> with core-mm, I don't know.
>
>
> Don't get me wrong, Mike is the maintainer, I'm just reading along and
> voicing what I observed in the LSF/MM session (well, I mixed in some of
> my own opinion ;) ).
>
> --
> Cheers,
>
> David / dhildenb