Date: Thu, 6 Jun 2024 18:05:33 +0200
From: Borislav Petkov
To: Jonathan Cameron
Cc: Shiju Jose, Dan Williams, "linux-cxl@vger.kernel.org",
 "linux-acpi@vger.kernel.org", "linux-mm@kvack.org", "dave@stgolabs.net",
 "dave.jiang@intel.com", "alison.schofield@intel.com",
 "vishal.l.verma@intel.com", "ira.weiny@intel.com",
 "linux-edac@vger.kernel.org", "linux-kernel@vger.kernel.org",
 "david@redhat.com", "Vilas.Sridharan@amd.com", "leo.duran@amd.com",
 "Yazen.Ghannam@amd.com", "rientjes@google.com", "jiaqiyan@google.com",
 "tony.luck@intel.com", "Jon.Grimm@amd.com", "dave.hansen@linux.intel.com",
 "rafael@kernel.org", "lenb@kernel.org", "naoya.horiguchi@nec.com",
 "james.morse@arm.com", "jthoughton@google.com", "somasundaram.a@hpe.com",
 "erdemaktas@google.com", "pgonda@google.com", "duenwen@google.com",
 "mike.malvestuto@intel.com", "gthelen@google.com",
 "wschwartz@amperecomputing.com", "dferguson@amperecomputing.com",
 "wbs@os.amperecomputing.com", "nifan.cxl@gmail.com", tanxiaofei,
 "Zengtao (B)", "kangkang.shen@futurewei.com", wanghuiqiang, Linuxarm,
 Greg Kroah-Hartman, Jean Delvare, Guenter Roeck, Dmitry Torokhov
Subject: Re: [RFC PATCH v8 01/10] ras: scrub: Add scrub subsystem
Message-ID: <20240606160533.GDZmHeTbhCoJYKSsD2@fat_crate.local>
References: <663d3e58a0f73_1c0a1929487@dwillia2-xfh.jf.intel.com.notmuch>
 <20240509215147.GBZj1Fc06Ieg8EQfnR@fat_crate.local>
 <663d55515a2d9_db82d2941e@dwillia2-xfh.jf.intel.com.notmuch>
 <20240510092511.GBZj3n9ye_BCSepFZy@fat_crate.local>
 <663e55c59d9d_3d7b429475@dwillia2-mobl3.amr.corp.intel.com.notmuch>
 <20240511101705.GAZj9FoVbThp7JUK16@fat_crate.local>
 <6645f0738ead48a79f1baf753fc709c6@huawei.com>
 <20240520125857.00007641@Huawei.com>
 <20240527092131.GBZlRQmxwFTxxyR20q@fat_crate.local>
 <20240528100645.00000765@Huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20240528100645.00000765@Huawei.com>

On Tue, May 28, 2024 at 10:06:45AM +0100, Jonathan Cameron wrote:
> If dealing with disabling, I'd be surprised if it was a normal policy but
> if it were udev script or boot script. If unusual event (i.e. someone is

Yeah, I wouldn't disable it during boot but only around my workload. You
still want automatic scrubs to happen on the system.

> trying to reduce jitter in a benchmark targeting something else) then
> interface is simple enough that an admin can poke it directly.

Right, for benchmarks direct poking is fine.

When it is something more involved like, dunno, HPC doing a heavy
workload and wanting to squeeze out all the performance, then I guess
turning off the scrubbers would be part of the setup script.

So yeah, if this is properly documented, scripting around it is easy.

> To a certain extent this is bounded by what the hardware lets us
> do but agreed we should make sure it 'works' for the usecases we know
> about. Starting point is some more documentation in the patch set
> giving common flows (and maybe some example scripts).

Yap, sounds good. As in: "These are the envisioned usages at the time of
writing..." or so.

> > Do you go and start a scrub cycle by hand?
>
> Typically no, but the option would be there to support an admin who is
> suspicious or who is trying to gather statistics or similar.

Ok.

> That definitely makes sense for NVDIMM scrub as the model there is
> to only ever do it on a demand as a single scrub pass.
> For a cyclic scrub we can spin a policy in rasdaemon or similar to
> possibly crank up the frequency if we are getting lots of 'non scrub'
> faults (i.e. correct error reported on demand accesses).

I was going to suggest that: automating stuff with rasdaemon. It would
definitely simplify talking to that API.
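
For illustration only, the two flows discussed so far (pausing scrub
around a workload, and cranking up the scrub frequency when corrected
errors pile up) could be sketched roughly like this. Everything named
here is an assumption made up for the sketch: the attribute file and its
"0"/"1" semantics, the abstract notion of a "rate", the threshold, the
factor and the cap. None of it is the interface from the patch set, and
this is not how rasdaemon does things today.

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def scrub_paused(enable_attr: Path):
    """Switch background scrub off around a workload, then restore it.

    `enable_attr` is a HYPOTHETICAL sysfs-style file containing "0" or "1";
    the real attribute name and location depend on the final ABI.
    """
    previous = enable_attr.read_text().strip()
    enable_attr.write_text("0\n")
    try:
        yield
    finally:
        # Put back whatever was configured before, even if the workload fails.
        enable_attr.write_text(previous + "\n")


def next_scrub_rate(current: int, corrected_errors: int,
                    threshold: int = 10, factor: int = 2,
                    max_rate: int = 64) -> int:
    """Pick the scrub rate for the next monitoring window.

    "Rate" is an abstract number where higher means more frequent scrubbing.
    The threshold, factor and cap are placeholder assumptions, not tuned values.
    """
    if corrected_errors >= threshold:
        # Many corrected errors on demand accesses: scrub more often, capped.
        return min(current * factor, max_rate)
    return current
```

A setup script would wrap the benchmark or HPC job in
`with scrub_paused(attr):`, while a monitoring daemon would call
`next_scrub_rate()` once per window with the corrected-error count it
has logged for that component.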

> Shiju is our expert on this sort of userspace stats monitoring and
> handling so I'll leave him to come back with a proposal / PoC for doing that.
>
> I can see two motivations though:
> a) Gather better stats on suspect device by ensuring more correctable
>    error detections.
> b) Increase scrubbing on a device which is on its way out but not replaceable
>    yet for some reason.
>
> I would suggest this will be PoC level only for now as it will need
> a lot of testing on large fleets to do anything sophisticated.

Yeah, sounds like a good start.

> > Do you automate it? I wanna say yes because that's miles better than
> > having to explain yet another set of knobs to users.
>
> First instance, I'd expect an UDEV policy so when a new CXL memory
> turns up we set a default value. A cautious admin would have tweaked
> that script to set the default to scrub more often, an admin who
> knows they don't care might turn it off. We can include an example of that
> in next version I think.

Yes, and then hook into rasdaemon so that the moment it logs an error in
some component, it goes and increases scrubbing of that component. But
yeah, you said that above already.

> Absolutely. One area that needs to improve (Dan raised it) is
> association with HPA ranges so we can correlate easily error reports
> with which scrub engine. That can be done with existing version but
> it's fiddlier than it needs to be. This 'might' be a userspace script
> example, or maybe making associations tighter in kernel.

Right.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette