From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B803DE7719A for ; Sat, 11 Jan 2025 17:13:44 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2A0516B008C; Sat, 11 Jan 2025 12:13:44 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 24E226B0092; Sat, 11 Jan 2025 12:13:44 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0C8626B0093; Sat, 11 Jan 2025 12:13:44 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id E31D46B008C for ; Sat, 11 Jan 2025 12:13:43 -0500 (EST) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 6F8F81613BF for ; Sat, 11 Jan 2025 17:13:43 +0000 (UTC) X-FDA: 82995817926.09.1FAB0D9 Received: from mail.alien8.de (mail.alien8.de [65.109.113.108]) by imf18.hostedemail.com (Postfix) with ESMTP id 0DDF01C0009 for ; Sat, 11 Jan 2025 17:13:40 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=alien8.de header.s=alien8 header.b=OgD4Au04; dmarc=pass (policy=none) header.from=alien8.de; spf=pass (imf18.hostedemail.com: domain of bp@alien8.de designates 65.109.113.108 as permitted sender) smtp.mailfrom=bp@alien8.de ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1736615621; a=rsa-sha256; cv=none; b=CDvSeZrKQpY6JKbhtDeFM4quukm3odYb/UXTZ04/iGVaLdTRUvFl1T3u/HGr00PZtvTiiy vOwKjlJS6vB31FHab7V56MKdD0G6T5KYo7zvj+FgzS/DKsBOwKmoJ3fGy46zLeCfIEGjsz 1M7mseBAHlqLzK2dQXqn9OT4nFakFCI= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=alien8.de header.s=alien8 header.b=OgD4Au04; dmarc=pass (policy=none) header.from=alien8.de; spf=pass (imf18.hostedemail.com: domain of bp@alien8.de designates 65.109.113.108 as permitted sender) smtp.mailfrom=bp@alien8.de ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1736615621; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=dy/nLqIHFB1W6aDIAXq0L4N3W+XLgmeHWVFfPlL6OZA=; b=0HASeQPbyHKCV3Qh7xO7VWgK7/PhFz9PUNEohorvs8x83fU1cBRE2m++kTrPVwQlsSRVe2 vflTYpVLaF83u+EfXJwR7SDSmii+J/eIrF6Z8kHdtnbkdIcz5I8GId0H4Ffb6MEw/57qhE gnCHf4t4Adi0IgcsqqcmqL9d7TU2E5s= Received: from localhost (localhost.localdomain [127.0.0.1]) by mail.alien8.de (SuperMail on ZX Spectrum 128k) with ESMTP id C3A8640E01B5; Sat, 11 Jan 2025 17:13:37 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at mail.alien8.de Received: from mail.alien8.de ([127.0.0.1]) by localhost (mail.alien8.de [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id tE6Ly5eTuki8; Sat, 11 Jan 2025 17:13:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=alien8.de; s=alien8; t=1736615614; bh=dy/nLqIHFB1W6aDIAXq0L4N3W+XLgmeHWVFfPlL6OZA=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=OgD4Au04w7yNqKvb1xwuel8mJVIjoFPyfCzdGWssaAU7lktQhh/OxwupNjtI1TFsd gT/ZemcjG4/wDC7AJjQCIh3LRWWrXGR6G3M3pNRb75/8c/AeIE3d+xVfv8xrjX6s7x z869j3MfOlzJdOCIxHQdDOHPbLUElbjM2qTDEYLPB5htGJJoDa62JfzpLJ7GN1+IDq h1vZTYb/QeGbAU+vlBYkQ/Gy1vGCsJLDLzCkLQQEfCWlnCuQwDa5E8m+bXg+AupPH0 MkZXqYuEfI9xhKIkaO/b9ejGfdEn4RENncDCJsdjubNT4h8cBPWAC5xRKukGHHLnm3 cdNQw4sgHFJoWcIy4yp0gzp6DLQsOQ91dRB6CPNasqglSwgCFdBM/5PKBiD2SPxNoE pZvOlozxkvJeNQUQnoAmCNSGEC88sI37sK/wG/9NcqYGFRhB7JJgJkybbjaC7/tlHH tJCxbTjsrtDSkypy4NY7BCHs/jOfxs6JhwiEdLrGvQes+Hb8j+HTp917affjX/IuKr deuuZ7/eqdtlDoMQKAkCTP57V8eFOB6sHwHRGfeEU9GeU6TsirbQIc4V+yw55KlS6G vY7lAMOChfNTgRVc5Z76KsCg6q837EigHDPhlaBJQWrCRh7O6tloA0nJCkECFlliCL IoLbsSWJiqc/5YwU9V6HNueU= Received: from zn.tnic (pd953008e.dip0.t-ipconnect.de [217.83.0.142]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature ECDSA (P-256) server-digest SHA256) (No client certificate requested) by mail.alien8.de (SuperMail on ZX Spectrum 128k) with ESMTPSA id E015140E02BF; Sat, 11 Jan 2025 17:12:50 +0000 (UTC) Date: Sat, 11 Jan 2025 18:12:43 +0100 From: Borislav Petkov To: Jonathan Cameron Cc: Shiju Jose , "linux-edac@vger.kernel.org" , "linux-cxl@vger.kernel.org" , "linux-acpi@vger.kernel.org" , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , "tony.luck@intel.com" , "rafael@kernel.org" , "lenb@kernel.org" , "mchehab@kernel.org" , "dan.j.williams@intel.com" , "dave@stgolabs.net" , "dave.jiang@intel.com" , "alison.schofield@intel.com" , "vishal.l.verma@intel.com" , "ira.weiny@intel.com" , "david@redhat.com" , "Vilas.Sridharan@amd.com" , "leo.duran@amd.com" , "Yazen.Ghannam@amd.com" , "rientjes@google.com" , "jiaqiyan@google.com" , "Jon.Grimm@amd.com" , "dave.hansen@linux.intel.com" , "naoya.horiguchi@nec.com" , "james.morse@arm.com" , "jthoughton@google.com" , "somasundaram.a@hpe.com" , "erdemaktas@google.com" , "pgonda@google.com" , "duenwen@google.com" , "gthelen@google.com" , "wschwartz@amperecomputing.com" , "dferguson@amperecomputing.com" , "wbs@os.amperecomputing.com" , "nifan.cxl@gmail.com" , tanxiaofei , "Zengtao (B)" , Roberto Sassu , "kangkang.shen@futurewei.com" , wanghuiqiang , Linuxarm Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature Message-ID: <20250111171243.GCZ4Kmi5xMtY2ktCHm@fat_crate.local> References: <20250106121017.1620-1-shiju.jose@huawei.com> <20250106121017.1620-5-shiju.jose@huawei.com> <20250109091915.GAZ3-Uk3rkuh38cQyy@fat_crate.local> <3b2d4275d1d24dbeacee0f192ac4d69b@huawei.com> <20250109123222.GBZ3_B1g3Esgu1-MPi@fat_crate.local> <20250109142433.00004ea7@huawei.com> <20250109151854.GCZ3_o3rf6S24qUbtB@fat_crate.local> <20250109160159.00002add@huawei.com> <20250109161902.GDZ3_29rH-sQMV4n0N@fat_crate.local> <20250109183448.000059ec@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20250109183448.000059ec@huawei.com> X-Rspam-User: X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 0DDF01C0009 X-Stat-Signature: k399sfzench3eafd9n3eq1ppwgab9uqg X-HE-Tag: 1736615620-809521 X-HE-Meta: U2FsdGVkX182ggonell6AVsxNT+CgfXx5q9jOBzLvASNioDSl5435G5KVEoTLv6H0qxLAUivhA28TY19ZbLWyIY+eFaUaqQlbL12qqorfgUO3bYSn/JcyKUlO9xUCSAupd+HutJtziRMrwHQkXqsWfyouVJRutiMTsm0SA2YKrHEiW43Y+ezsDtCzhSZP7t4Xs3Mg2R2fKKitbINSG7bPkSJQrp9ehP2diaqCXc9LSo+UP/CpBnAJhlLHXM4YoUKlTU8fk12+W6vI4a7yepAN3IvzgPNcnFv6NeMWJATVeu6sNQbHcbvzPKLOakMmTQpSbfBCkunl2qZHWvpT4Y53hIyE7V9a5AxGF+KzOJGRMNiQZpwvhuB2C8rkLbrc1gSjmA8JQiIMP3uQFrQeoY/b5r7ipaZ99WMSHII5zZznaovSNVl6SW8xIp5TlJqs/JEgT91UqXaWPM6qsCXUJ+UpPd9IzeLInBQ5hl+A6+Ch3zbt/BY7SNlvM+K7pCozR6McpltglGd2imzFy+YpbeMJT9zSOcT6k2esW9F9/O3oAp/6u+3fugI+cq9psNnyr7DzIilWuHzqU4W32w72xX5jPtuiA7TQJtPF8v/KZTcpefQy6DDJKeuR09HrJVveKt4z/WMQ9Cg9wMe2ikOlc94mv64BXYrlWLP50xmKn0ya9sJO2W5Z3wft5XIUwsiXdg/bipj2h6umCDhz3Muwbgm36iwc/WQ0NPKU12lIPQDhHYjeDCvb8Qfp8696TqEqiJbsEwTHJj3T5s45pWppY5m5xVDa3kG9je8xbBDQMUxTW+IPievvYRi8TFV0AB8b0nLbm01xYvxxN+/q1k8QDZIKTr7M8vLDgWy/OG2ohFOEA0PfbNIO4s0nWGSOVzPn9hbZwicxuidLrIytV8iNj8pAzCvUOjRmuA3cZomSXB3qBrl1y/ZXAtB4jz/B4JfeyScf0ls5ixeB0R5R/BDgXI s4jRUUpN RSzELojeyb8HeeSluO2vZ5cKdDs8OTf8degxXFbWLZsav4Xb8ILxxI+hH7PnE7qRd5nNaEdfajFzLagxxhe2hRkroucgrwAAzcpAt+pROoax3cVbyjzQYUvotEi0eDu/G5MvYP2MNdNIGRwyYKxIQ9Vf1zLuKu0ZXDDqC3dy9ulrO2VBZHJEvj5CZlJHUSZ80dfRHmjAdzdv01qHcQUTRVrIAGLFZHJ9/IxvmqP0bhdlU7GIHkCJtbKHIA2xx72CdzpLd+COCiJBGZXwEzdyGpcu0me/HUmdAiAXdgtRMfvw699g6kdgzexg0OIg76dJ0DaYrDRL3bBHAWcoMpNZJItSqbA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000001, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Jan 09, 2025 at 06:34:48PM +0000, Jonathan Cameron wrote: > Today you can. Seems we are talking cross purposes. > > I'm confused. I thought your proposal was for "bank" attribute to present an > allowed range on read. > "bank" attribute is currently written to and read back as the value of the bank on which > to conduct a repair. Maybe this disconnect is down to the fact max_ and min_ > attributes should have been marked as RO in the docs. They aren't controls, > just presentation of limits to userspace. > > Was intent a separate bank_range type attribute rather than max_bank, min_bank? I don't know - I'm just throwing ideas out there. You could do: cat /sys/.../bank and that gives you [ ] So you have all the needed information. Dunno if this would be abusing sysfs rules too much tho. > > > > > In at least the CXL case I'm fairly sure most of them are not discoverable. > > > Until you see errors you have no idea what the memory topology is. > > > > Ok. > > > > > For that you'd need to have a path to read back what happened. > > > > So how is this scrubbing going to work? You get an error, you parse it for all :> > the attributes and you go and write those attributes into the scrub interface > > and it starts scrubbing? > > Repair not scrubbing. They are different things we should keep separate, > scrub corrects the value, if it can, but doesn't change the underlying memory to > new memory cells to avoid repeated errors. Replacing scrub with repair > (which I think was the intent here)... Really? So how is scrubbing defined for CXL? You read memory, do ECC check on it, report any potential errors but write back the *original* wrong value?! I thought the point of scrubbing is to repair it while at it too... > You get error records that describe the error seen in hardware, write back the > values into this interface and tell it to repair the memory. This is not > necessarily a synchronous or immediate thing - instead typically based on > trend analysis. This is just silly: I'm scrubbing, I found an error, I should simply fix it while at it. Why would I need an additional command to repair it?! > As an example, the decision might be that bit of ram threw up 3 errors > over a month including multiple system reboots (for other reasons) and > that is over some threshold so we use a spare memory line to replace it. Right. > Short answer, it needs to be very smart and there isn't a case of one size > fits all - hence suggested approach of making it a user space problem. Making it a userspace problem is pretty much always a sign that the hw design failed. > Given in the systems being considered here, software is triggering the repair, > we want to allow for policy in the decision. Right, you can leave a high-level decision to userspace: repair only when idle, repair only non-correctable errors, blabla but exposing every single aspect of every single error... meh. > In simple cases we could push that policy into the kernel e.g. just repair > the moment we see an error record. > > These repair resources are very limited in number, so immediately repairing > may a bad idea. We want to build up a history of errors before making > such a decision. That can be done in kernel. Yap, we are doing this now: drivers/ras/cec.c Userspace involvement is minimal, if at all. It is mostly controlling the parameters of the leaky bucket. > The decision to repair memory is heavily influenced by policy and time considerations > against device resource constraints. > > Some options that are hard to do in kernel. > > 1. Typical asynchronous error report for a corrected error. > > Tells us memory had an error (perhaps from a scrubbing engine on the device > running checks). No need to take action immediately. Instead build up more data > over time and if lots of errors occur make decision to repair as no we are sure it > is worth doing rather than a single random event. We may tune scrubbing engines > to check this memory more frequently and adjust our data analysis to take that > into account for setting thresholds etc. See above. What happens when your daemon dies and loses all that collected data? > 2. Soft repair across boots. We are actually storing the error records, then only > applying the fix on reboot before using the memory - so maintaining a list > of bad memory and saving it to a file to read back on boot. We could provide > another kernel interface to get this info and reinject it after reboot instead > of doing it in userspace but that is another ABI to design. We did something similar recently: drivers/ras/amd/fmpm.c. It basically "replays" errors from persistent storage as that memory cannot be replaced. > 3. Complex policy across fleets. A lot of work is going on around prediction techniques > that may change the local policy on each node dependent on the overall reliability > patterns of a particular batch of devices and local characteristics, service guarantees > etc. If it is hard repair, then once you've run out you need schedule an engineer > out to replace the DIMM. All complex inputs to the decision. You probably could say here: "repair or report when this and that." or "offline page and report error" and similar high-level decisions by leaving the details to the kernel instead of looking at every possible error in userspace and returning back to the kernel to state your decision. > Similar cases like CPU offlining on repeated errors are done in userspace (e.g. > RAS Daemon) for similar reasons of long term data gathering and potentially > complex algorithms. > > > > > > Ok. Then can we just drop the range discoverability entirely or we go with > > > your suggestion and do not support read back of what has been > > > requested but instead have the reads return a range if known or "" / > > > return -EONOTSUPP if simply not known? > > > > Probably. > > Too many options in the above paragraph so just to check... Probably to which? > If it's a separate attribute from the one we write the control so then > we do what is already done here and don't present the interface at all if > the range isn't discoverable. Probably means I still don't get a warm and fuzzy feeling about this design. As I've noted above. > Ok. Best path is drop the available range support then (so no min_ max_ or > anything to replace them for now). > > Added bonus is we don't have to rush this conversation and can make sure we > come to the right solution driven by use cases. Yap, that sounds like a prudent idea. What I'm trying to say, basically, is, this interface through sysfs is a *lot* of attributes, there's no clear cut use case where we can judge how useful it is and as I alluded to above, I really think that you should leave the high-level decisions to userspace and let the kernel do its job. This'll make your interface a lot simpler. And if you really need to control every single aspect of scrubbing in userspace, then you can always come later with proper design and use case. But again, I really think you should keep as much recovery logic in the kernel and as automatic as possible. Only when you really really need user input, only then you should allow it... I hope I'm making sense here... Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette