Date: Tue, 14 Jan 2025 13:57:44 +0100
From: Mauro Carvalho Chehab
To: Jonathan Cameron
Cc: Borislav Petkov, Shiju Jose, "linux-edac@vger.kernel.org",
 "linux-cxl@vger.kernel.org", "linux-acpi@vger.kernel.org",
 "linux-mm@kvack.org", "linux-kernel@vger.kernel.org",
 "tony.luck@intel.com", "rafael@kernel.org", "lenb@kernel.org",
 "mchehab@kernel.org", "dan.j.williams@intel.com", "dave@stgolabs.net",
 "dave.jiang@intel.com", "alison.schofield@intel.com",
 "vishal.l.verma@intel.com", "ira.weiny@intel.com", "david@redhat.com",
 "Vilas.Sridharan@amd.com", "leo.duran@amd.com", "Yazen.Ghannam@amd.com",
 "rientjes@google.com", "jiaqiyan@google.com", "Jon.Grimm@amd.com",
 "dave.hansen@linux.intel.com", "naoya.horiguchi@nec.com",
 "james.morse@arm.com", "jthoughton@google.com", "somasundaram.a@hpe.com",
 "erdemaktas@google.com", "pgonda@google.com", "duenwen@google.com",
 "gthelen@google.com", "wschwartz@amperecomputing.com",
 "dferguson@amperecomputing.com", "wbs@os.amperecomputing.com",
 "nifan.cxl@gmail.com", tanxiaofei, "Zengtao (B)", "Roberto Sassu",
 "kangkang.shen@futurewei.com", wanghuiqiang, Linuxarm
Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
Message-ID: <20250114135738.2b79b73d@foz.lan>
In-Reply-To: <20250109160159.00002add@huawei.com>
References: <20250106121017.1620-1-shiju.jose@huawei.com>
 <20250106121017.1620-5-shiju.jose@huawei.com>
 <20250109091915.GAZ3-Uk3rkuh38cQyy@fat_crate.local>
 <3b2d4275d1d24dbeacee0f192ac4d69b@huawei.com>
 <20250109123222.GBZ3_B1g3Esgu1-MPi@fat_crate.local>
 <20250109142433.00004ea7@huawei.com>
 <20250109151854.GCZ3_o3rf6S24qUbtB@fat_crate.local>
 <20250109160159.00002add@huawei.com>

On Thu, 9 Jan 2025 16:01:59 +0000, Jonathan Cameron wrote:

> > My goal here was to make this user-friendly. Because you need some way of
> > knowing what valid ranges are and in order to trigger the repair, if it needs
> > to happen for a range.

IMO, user-friendliness is important, as it allows people to use the
feature manually. This is useful for debugging purposes and also for
testing whether some hardware is doing the right thing.

Ok, in practice, production will use a userspace tool like rasdaemon,
and/or some scripts [1].

[1] I'd say that rasdaemon should have an initialization phase to discover
    the capabilities that can be discovered.

    As an example, rasdaemon could reserve some sparing memory at init
    time, if the hardware (partially) supports it. For instance, a CXL
    device might not be able to handle rank-mem-sparing, but still be
    able to handle bank-mem-sparing.

> In at least the CXL case I'm fairly sure most of them are not discoverable.
> Until you see errors you have no idea what the memory topology is.

Sure, but some things can be discovered in advance, like which CXL
scrubbing features are supported by a given piece of hardware. If the
hardware supports detecting ranges for row/bank/rank sparing, it would be
nice to have this reported in a way that lets userspace properly set it
at OS init time, if desired by the sysadmins.

> > Or, you can teach the repair logic to ignore invalid ranges and "clamp" things
> > to whatever makes sense.
>
> For that you'd need to have a path to read back what happened.

If sysfs is RW, you have it there already: read the attribute back after
committing the value.

> > Again, I'm looking at it from the usability perspective. I haven't actually
> > needed this scrub+repair functionality yet to know whether the UI makes sense.
> > So yeah, collecting some feedback from real-life use cases would probably give
> > you a lot better understanding of how that UI should be designed... perhaps
> > you won't ever need the ranges, who knows.
> >
> > So yes, preemptively designing stuff like that "in the dark" is kinda hard.
> > :-)
>
> The discoverability is unnecessary for any known usecase.
>
> Ok. Then can we just drop the range discoverability entirely or we go with
> your suggestion and do not support read back of what has been
> requested but instead have the reads return a range if known or "" /
> return -EOPNOTSUPP if simply not known?

It sounds to me that ranges are needed at least to set up mem sparing.

> I can live with that though to me we are heading in the direction of
> a less intuitive interface to save a small number of additional files.
>
> Jonathan

Thanks,
Mauro
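
A minimal userspace sketch of the two points discussed above: an init-time
pass that collects whatever range information a device reports, and a
write followed by a read-back on a RW attribute. The attribute names
(min_bank, max_bank) and the directory layout are assumptions made up for
illustration; they are not taken from the patch series or from any kernel
ABI document, and this is not how rasdaemon is implemented.

```python
import os

# Hypothetical attribute names, for illustration only. They are NOT taken
# from the patch series or from a kernel ABI document.
RANGE_ATTRS = ("min_bank", "max_bank")


def read_attr(dev_dir, name):
    """Return the stripped contents of an attribute file.

    Returns None when the file is missing, when the read fails (for
    example with an errno such as EOPNOTSUPP), or when it is empty; all
    three cases are treated as "not known".
    """
    try:
        with open(os.path.join(dev_dir, name)) as f:
            value = f.read().strip()
    except OSError:
        return None
    return value or None


def discover_ranges(dev_dir):
    """Init-time pass: collect the range attributes the device reports."""
    found = {}
    for name in RANGE_ATTRS:
        value = read_attr(dev_dir, name)
        if value is not None:
            # Base 0 accepts both decimal ("7") and hex ("0x7") strings.
            found[name] = int(value, 0)
    return found


def write_and_read_back(dev_dir, name, value):
    """Write a value, then read the same attribute again.

    With a read-write attribute, the value read after the write shows
    whether the request was kept as-is or adjusted (for example clamped).
    """
    with open(os.path.join(dev_dir, name), "w") as f:
        f.write(f"{value}\n")
    return read_attr(dev_dir, name)
```

A tool could call discover_ranges() once per device at startup and fall
back to waiting for error records whenever the result is empty, which
matches the case where the topology is not known until errors are seen.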