From: Jiaqi Yan
Date: Fri, 14 Jun 2024 09:30:59 -0700
Subject: Re: [PATCH v2 1/3] mm/memory-failure: userspace controls soft-offlining pages
To: Lance Yang, linmiaohe@huawei.com
Cc: nao.horiguchi@gmail.com, jane.chu@oracle.com, muchun.song@linux.dev,
    akpm@linux-foundation.org, shuah@kernel.org, corbet@lwn.net,
    osalvador@suse.de, rientjes@google.com, duenwen@google.com,
    fvdl@google.com, linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
    linux-doc@vger.kernel.org
References: <20240611215544.2105970-1-jiaqiyan@google.com>
    <20240611215544.2105970-2-jiaqiyan@google.com>

On Fri, Jun 14, 2024 at 1:35 AM Lance Yang wrote:
>
> Hi Jiaqi,
>
> On Wed, Jun 12, 2024 at 5:56 AM Jiaqi Yan wrote:
> >
> > Correctable memory errors are very common on servers with large
> > amount of memory, and are corrected by ECC. Soft offline is kernel's
> > additional recovery handling for memory pages having (excessive)
> > corrected memory errors. Impacted page is migrated to a healthy page
> > if inuse; the original page is discarded for any future use.
> >
> > The actual policy on whether (and when) to soft offline should be
> > maintained by userspace, especially in case of an 1G HugeTLB page.
> > Soft-offline dissolves the HugeTLB page, either in-use or free, into
> > chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.
> > If userspace has not acknowledged such behavior, it may be surprised
> > when later mmap hugepages MAP_FAILED due to lack of hugepages.
> > In case of a transparent hugepage, it will be split into 4K pages
> > as well; userspace will stop enjoying the transparent performance.
> >
> > In addition, discarding the entire 1G HugeTLB page only because of
> > corrected memory errors sounds very costly and kernel better not
> > doing under the hood. But today there are at least 2 such cases:
> > 1. GHES driver sees both GHES_SEV_CORRECTED and
> >    CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
> > 2. RAS Correctable Errors Collector counts correctable errors per
> >    PFN and when the counter for a PFN reaches threshold
> > In both cases, userspace has no control of the soft offline performed
> > by kernel's memory failure recovery.
> >
> > This commit gives userspace the control of softofflining any page:
> > kernel only soft offlines raw page / transparent hugepage / HugeTLB
> > hugepage if userspace has agreed to. The interface to userspace is a
> > new sysctl called enable_soft_offline under /proc/sys/vm. By default
> > enable_soft_line is 1 to preserve existing behavior in kernel.
>
> s/enable_soft_line/enable_soft_offline

Will fix this typo in v3.
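
To make the interface concrete, here is a minimal userspace sketch of how
an agent might check the knob before relying on soft offline. The helper
name read_soft_offline_knob() and its path parameter are illustrative
assumptions, not part of the patch; on a kernel carrying this series the
path would be /proc/sys/vm/enable_soft_offline.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a 0/1 sysctl value from a procfs-style file.
 *
 * Returns 0 or 1 on success, and -1 if the file cannot be opened or
 * does not contain 0 or 1 (for example on a kernel without the sysctl).
 * The function name and the path parameter are hypothetical.
 */
int read_soft_offline_knob(const char *path)
{
	FILE *f;
	int val;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* The sysctl handler prints a single decimal integer. */
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);

	return (val == 0 || val == 1) ? val : -1;
}
```

A caller would pass "/proc/sys/vm/enable_soft_offline" and could treat -1
as "sysctl not available", which is how an older kernel would look.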

>
> >
> > Signed-off-by: Jiaqi Yan
> > ---
> >  mm/memory-failure.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index d3c830e817e3..23415fe03318 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -68,6 +68,8 @@ static int sysctl_memory_failure_early_kill __read_mostly;
> >
> >  static int sysctl_memory_failure_recovery __read_mostly = 1;
> >
> > +static int sysctl_enable_soft_offline __read_mostly = 1;
> > +
> >  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
> >
> >  static bool hw_memory_failure __read_mostly = false;
> > @@ -141,6 +143,15 @@ static struct ctl_table memory_failure_table[] = {
> >  		.extra1		= SYSCTL_ZERO,
> >  		.extra2		= SYSCTL_ONE,
> >  	},
> > +	{
> > +		.procname	= "enable_soft_offline",
> > +		.data		= &sysctl_enable_soft_offline,
> > +		.maxlen		= sizeof(sysctl_enable_soft_offline),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec_minmax,
> > +		.extra1		= SYSCTL_ZERO,
> > +		.extra2		= SYSCTL_ONE,
> > +	}
> >  };
> >
> >  /*
> > @@ -2771,6 +2782,11 @@ int soft_offline_page(unsigned long pfn, int flags)
> >  	bool try_again = true;
> >  	struct page *page;
> >
> > +	if (!sysctl_enable_soft_offline) {
> > +		pr_info("soft offline: %#lx: OS-wide disabled\n", pfn);
> > +		return -EINVAL;
>
> IMO, "-EPERM" might sound better ;)
>
> Using "-EPERM" indicates that the operation is not permitted due to
> the OS-wide configuration.

Miaohe suggested -EOPNOTSUPP. I agree that both EOPNOTSUPP and EPERM may
be better than EINVAL. But how about EAGAIN? With EAGAIN plus a dmesg
message saying "disabled by /proc/sys/vm/enable_soft_offline", it should
be clear to users that they can try again after setting
/proc/sys/vm/enable_soft_offline=1.

>
> Thanks,
> Lance
>
> > +	}
> > +
> >  	if (!pfn_valid(pfn)) {
> >  		WARN_ON_ONCE(flags & MF_COUNT_INCREASED);
> >  		return -ENXIO;
> > --
> > 2.45.2.505.gda0bf45e8d-goog
> >
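
To illustrate why a distinct, retryable error code could help, below is a
rough userspace-side sketch of how a caller might react to the errno from
a failed soft-offline request. It assumes the EAGAIN idea above were
adopted (it is only a suggestion in this thread) and that EAGAIN is not
returned for other failures on this path, which I have not verified. The
enum and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical follow-up actions for a userspace RAS agent. */
enum soft_offline_action {
	SO_DONE,		/* request succeeded */
	SO_RETRY_AFTER_ENABLE,	/* knob is off: enable it, then retry */
	SO_GIVE_UP,		/* any other failure: do not retry blindly */
};

/*
 * Map the errno of a soft-offline attempt to a follow-up action.
 * 'err' is 0 on success or a positive errno value on failure.
 *
 * ASSUMPTION: the kernel returns EAGAIN only when soft offline is
 * disabled via the sysctl. That is the proposal being discussed,
 * not merged behavior.
 */
enum soft_offline_action classify_soft_offline_errno(int err)
{
	assert(err >= 0);

	if (err == 0)
		return SO_DONE;
	if (err == EAGAIN)
		return SO_RETRY_AFTER_ENABLE;
	return SO_GIVE_UP;
}
```

With EINVAL, EPERM or EOPNOTSUPP the caller cannot tell a disabled knob
apart from an unsupported or invalid request without also parsing dmesg;
with a dedicated code the retry decision can be made from errno alone.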