From: Jiaqi Yan
Date: Fri, 7 Jun 2024 15:26:16 -0700
Subject: Re: [PATCH v1 1/3] mm/memory-failure: userspace controls soft-offlining hugetlb pages
To: naoya.horiguchi@nec.com, muchun.song@linux.dev, linmiaohe@huawei.com
Cc: akpm@linux-foundation.org, shuah@kernel.org, corbet@lwn.net, osalvador@suse.de, rientjes@google.com, duenwen@google.com, fvdl@google.com, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, Jane Chu
References: <20240531213439.2958891-1-jiaqiyan@google.com> <20240531213439.2958891-2-jiaqiyan@google.com>
In-Reply-To: <20240531213439.2958891-2-jiaqiyan@google.com>

+CC Jane.

On Fri, May 31, 2024 at 2:34 PM Jiaqi Yan wrote:
>
> Correctable memory errors are very common on servers with large
> amounts of memory, and are corrected by ECC. Soft offline is the
> kernel's additional recovery handling for memory pages having
> (excessive) corrected memory errors. The impacted page is migrated to
> a healthy page if mapped/in use; the original page is discarded for
> any future use.
>
> The actual policy on whether (and when) to soft offline should be
> maintained by userspace, especially in the case of HugeTLB hugepages.
> Soft offline dissolves a hugepage, either in-use or free, into
> chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.
> If userspace has not acknowledged such behavior, it may be surprised
> when a later mmap of hugepages returns MAP_FAILED due to lack of
> hugepages. In addition, discarding an entire 1G memory page only
> because of corrected memory errors sounds very costly, and the kernel
> had better not do it under the hood. But today there are at least 2
> such cases:
> 1. GHES driver sees both GHES_SEV_CORRECTED and
>    CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
> 2. RAS Correctable Errors Collector counts correctable errors per
>    PFN and soft offlines the page when the counter for a PFN reaches
>    the threshold.
> In both cases, userspace has no control of the soft offline performed
> by the kernel's memory failure recovery.
>
> This commit gives userspace control of soft-offlining HugeTLB
> pages: the kernel only soft offlines a hugepage if userspace has
> opted in for that specific hugepage size. The interface to userspace
> is a new sysfs entry called softoffline_corrected_errors under the
> /sys/kernel/mm/hugepages/hugepages-${size}kB directory:
> * When softoffline_corrected_errors=0, skip soft offlining for all
>   hugepages of size ${size}kB.
> * When softoffline_corrected_errors=1, soft offline as before this
>   patch series.
>
> So the granularity of the control is per hugepage size, and the
> setting is kept in the corresponding hstate. By default
> softoffline_corrected_errors is 1 to preserve existing kernel
> behavior.
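
For readers who want to picture the userspace side of the interface described above, here is a minimal sketch in C. It is not part of the patch. It assumes the sysfs entry exists exactly as proposed in this series; the helper names knob_path() and set_softoffline() and the root parameter are hypothetical and only there so the sketch can be pointed at a scratch directory instead of the real /sys/kernel/mm/hugepages.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the path of the proposed
 * softoffline_corrected_errors entry for one hugepage size.
 * root is normally "/sys/kernel/mm/hugepages".
 * Returns 0 on success, -1 if the path does not fit in out.
 */
static int knob_path(char *out, size_t len, const char *root,
                     unsigned long size_kb)
{
    int n = snprintf(out, len,
                     "%s/hugepages-%lukB/softoffline_corrected_errors",
                     root, size_kb);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write 0 (skip soft offline) or 1 (soft offline
 * as before) for one hugepage size. Returns 0 on success, -1 on error.
 */
static int set_softoffline(const char *root, unsigned long size_kb,
                           int enable)
{
    char path[512];
    FILE *f;
    int ok;

    /* The proposed entry only accepts 0 or 1. */
    if (enable != 0 && enable != 1)
        return -1;
    if (knob_path(path, sizeof(path), root, size_kb) != 0)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%d\n", enable) > 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

With this sketch, something like set_softoffline("/sys/kernel/mm/hugepages", 1048576UL, 0) would opt 1G hugepages out of soft offline, assuming the caller has permission to write the entry.
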
>
> Signed-off-by: Jiaqi Yan
> ---
>  include/linux/hugetlb.h | 17 +++++++++++++++++
>  mm/hugetlb.c            | 34 ++++++++++++++++++++++++++++++++++
>  mm/memory-failure.c     |  7 +++++++
>  3 files changed, 58 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2b3c3a404769..55f9e9593cce 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -685,6 +685,7 @@ struct hstate {
>         int next_nid_to_free;
>         unsigned int order;
>         unsigned int demote_order;
> +       unsigned int softoffline_corrected_errors;
>         unsigned long mask;
>         unsigned long max_huge_pages;
>         unsigned long nr_huge_pages;
> @@ -1029,6 +1030,16 @@ void hugetlb_unregister_node(struct node *node);
>   */
>  bool is_raw_hwpoison_page_in_hugepage(struct page *page);
>
> +/*
> + * For certain hugepage size, when a hugepage has corrected memory error(s):
> + * - Return 0 if userspace wants to disable soft offlining the hugepage.
> + * - Return > 0 if userspace allows soft offlining the hugepage.
> + */
> +static inline int hugetlb_softoffline_corrected_errors(struct folio *folio)
> +{
> +       return folio_hstate(folio)->softoffline_corrected_errors;
> +}
> +
>  #else /* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>
> @@ -1226,6 +1237,12 @@ static inline bool hugetlbfs_pagecache_present(
>  {
>         return false;
>  }
> +
> +static inline int hugetlb_softoffline_corrected_errors(struct folio *folio)
> +{
> +       return 1;
> +}
> +
>  #endif /* CONFIG_HUGETLB_PAGE */
>
>  static inline spinlock_t *huge_pte_lock(struct hstate *h,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6be78e7d4f6e..a184e28ce592 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4325,6 +4325,38 @@ static ssize_t demote_size_store(struct kobject *kobj,
>  }
>  HSTATE_ATTR(demote_size);
>
> +static ssize_t softoffline_corrected_errors_show(struct kobject *kobj,
> +                                                 struct kobj_attribute *attr,
> +                                                 char *buf)
> +{
> +       struct hstate *h = kobj_to_hstate(kobj, NULL);
> +
> +       return sysfs_emit(buf, "%d\n", h->softoffline_corrected_errors);
> +}
> +
> +static ssize_t softoffline_corrected_errors_store(struct kobject *kobj,
> +                                                  struct kobj_attribute *attr,
> +                                                  const char *buf,
> +                                                  size_t count)
> +{
> +       int err;
> +       unsigned long input;
> +       struct hstate *h = kobj_to_hstate(kobj, NULL);
> +
> +       err = kstrtoul(buf, 10, &input);
> +       if (err)
> +               return err;
> +
> +       /* softoffline_corrected_errors is either 0 or 1. */
> +       if (input > 1)
> +               return -EINVAL;
> +
> +       h->softoffline_corrected_errors = input;
> +
> +       return count;
> +}
> +HSTATE_ATTR(softoffline_corrected_errors);
> +
>  static struct attribute *hstate_attrs[] = {
>         &nr_hugepages_attr.attr,
>         &nr_overcommit_hugepages_attr.attr,
> @@ -4334,6 +4366,7 @@ static struct attribute *hstate_attrs[] = {
>  #ifdef CONFIG_NUMA
>         &nr_hugepages_mempolicy_attr.attr,
>  #endif
> +       &softoffline_corrected_errors_attr.attr,
>         NULL,
>  };
>
> @@ -4655,6 +4688,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>         h = &hstates[hugetlb_max_hstate++];
>         mutex_init(&h->resize_lock);
>         h->order = order;
> +       h->softoffline_corrected_errors = 1;
>         h->mask = ~(huge_page_size(h) - 1);
>         for (i = 0; i < MAX_NUMNODES; ++i)
>                 INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 16ada4fb02b7..7094fc4c62e2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2776,6 +2776,13 @@ int soft_offline_page(unsigned long pfn, int flags)
>                 return -EIO;
>         }
>
> +       if (PageHuge(page) &&
> +           !hugetlb_softoffline_corrected_errors(page_folio(page))) {
> +               pr_info("soft offline: %#lx: hugetlb page is ignored\n", pfn);
> +               put_ref_page(pfn, flags);
> +               return -EINVAL;
> +       }
> +
>         mutex_lock(&mf_mutex);
>
>         if (PageHWPoison(page)) {
> --
> 2.45.1.288.g0e0cd299f1-goog
>
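
To make the quoted behavior easier to follow, the sketch below models it in plain userspace C: the store path accepts only 0 or 1, and soft offline of a hugetlb page is refused with -EINVAL when the value is 0. This is an illustration under assumptions, not kernel code. struct hstate_model, model_store() and model_soft_offline() are hypothetical names, and strtoul() only approximates kstrtoul() (for example, it tolerates leading whitespace, which kstrtoul() does not).

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the one hstate field the patch adds. */
struct hstate_model {
    unsigned int softoffline_corrected_errors;
};

/*
 * Approximates the quoted store handler: parse a base-10 value,
 * allow one trailing newline, and accept only 0 or 1.
 * Returns 0 on success or a negative errno value.
 */
static int model_store(struct hstate_model *h, const char *buf)
{
    char *end;
    unsigned long input;

    errno = 0;
    input = strtoul(buf, &end, 10);
    if (errno == ERANGE)
        return -ERANGE;
    if (end == buf)
        return -EINVAL;         /* no digits at all */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;         /* trailing garbage */

    /* softoffline_corrected_errors is either 0 or 1. */
    if (input > 1)
        return -EINVAL;

    h->softoffline_corrected_errors = (unsigned int)input;
    return 0;
}

/*
 * Models the check added to soft_offline_page(): a hugetlb page is
 * skipped with -EINVAL when its hstate has the setting at 0.
 * Non-hugetlb pages are unaffected. Returning 0 here only means
 * "proceed with the rest of soft offline".
 */
static int model_soft_offline(const struct hstate_model *h, int is_hugetlb)
{
    if (is_hugetlb && !h->softoffline_corrected_errors)
        return -EINVAL;
    return 0;
}
```

In this model the default of 1 keeps the pre-patch behavior, and only an explicit write of 0 makes model_soft_offline() refuse hugetlb pages.
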