Date: Fri, 15 Nov 2024 12:14:15 +0000
From: Jonathan Cameron
To: Borislav Petkov
CC: Shiju Jose, linux-edac@vger.kernel.org, linux-cxl@vger.kernel.org,
 linux-acpi@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, tony.luck@intel.com, rafael@kernel.org,
 lenb@kernel.org, mchehab@kernel.org, dan.j.williams@intel.com,
 dave@stgolabs.net, gregkh@linuxfoundation.org, sudeep.holla@arm.com,
 jassisinghbrar@gmail.com, dave.jiang@intel.com,
 alison.schofield@intel.com, vishal.l.verma@intel.com,
 ira.weiny@intel.com, david@redhat.com, Vilas.Sridharan@amd.com,
 leo.duran@amd.com, Yazen.Ghannam@amd.com, rientjes@google.com,
 jiaqiyan@google.com, Jon.Grimm@amd.com, dave.hansen@linux.intel.com,
 naoya.horiguchi@nec.com, james.morse@arm.com, jthoughton@google.com,
 somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
 duenwen@google.com, gthelen@google.com, wschwartz@amperecomputing.com,
 dferguson@amperecomputing.com, wbs@os.amperecomputing.com,
 nifan.cxl@gmail.com, tanxiaofei, Zengtao (B), Roberto Sassu,
 kangkang.shen@futurewei.com, wanghuiqiang, Linuxarm
Subject: Re: [PATCH v15 11/15] EDAC: Add memory repair control feature
Message-ID: <20241115121415.00005c76@huawei.com>
In-Reply-To: <20241114133249.GEZzX8ATNyc_Xw1L52@fat_crate.local>
References: <20241101091735.1465-1-shiju.jose@huawei.com>
 <20241101091735.1465-12-shiju.jose@huawei.com>
 <20241104061554.GOZyhmmo9melwI0c6q@fat_crate.local>
 <1ac30acc16ab42c98313c20c79988349@huawei.com>
 <20241111112819.GCZzHqUz1Sz-vcW09c@fat_crate.local>
 <7fd81b442ba3477787f5342e69adbb96@huawei.com>
 <20241114133249.GEZzX8ATNyc_Xw1L52@fat_crate.local>

Hi Borislav,

I'll just jump in on one element.

> > This will work for the CXL PPR feature where the result of the query operation for resources availability
> > return to the command, however for the CXL memory sparing features, the result of the query resources
> > availability command returned later in a Memory Sparing Event Record from the device.
> > Userspace shall issue repair operation with the attributes values received on the Memory Sparing trace event.
> > Thus for the CXL memory sparing feature, query for resources availability and repair operation
> > cannot be combined.
>
> What happens if the resources availability changes between the query and the
> start of the repair operation?
>

Short answer: you get an error return.

The query is an optional step / optimization. You can just skip it.
There is no point in querying if you are going to immediately issue the
command to repair (as that will report an error if you can't do it).

A typical flow where it might be useful is:

1) Lots of corrected errors are reported on a particular part of the
   memory.
2) OS decides enough is enough, that row/bank/nibble should be replaced.
3) Before doing so, it checks that it can actually replace it.
   Otherwise we may be disrupting a gigantic page or similar, where the
   perf cost of just offlining is higher than we want.
4) After the query, the page is offlined etc. (this may or may not be
   necessary depending on the hardware design - we may be able to do it
   'live').
5) 'Try' to repair. Hopefully no one raced with us and used up the
   remaining resources. Given this is typically only driven by
   something like RASDaemon, that race should be a corner case only
   (very unlikely).
6) If repair fails, we can just bring the memory back - but this dance
   was expensive and we will carry on working with less than ideal
   memory (and probably schedule some real maintenance to swap out the
   device).
7) If repair succeeds, bring the memory back, as now we have shiny new
   memory.

We could drop the query for now and bring it back later once more of
the surrounding infrastructure becomes clearer. To me it's a useful
feature, but I appreciate this is early days and we shouldn't always
try for all the bells and whistles on day 1.

> The cat catches fire?

Dog person? :)

Just a nice normal error return to indicate no resources.

Jonathan

>
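
For illustration only, the flow in steps 1) to 7) above could be modelled
roughly as below. This is a sketch under stated assumptions, not the
proposed interface: FakeDevice, NoRepairResources, Outcome and try_repair
are all hypothetical names made up for this example, and nothing here
corresponds to the actual EDAC sysfs attributes, CXL commands or
RASDaemon code. The point it tries to show is that the query only saves
a pointless offline/online cycle, while the repair call itself is what
reports the error if resources ran out in between.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    SKIPPED_NO_RESOURCES = auto()  # query said no, memory never offlined
    REPAIRED = auto()              # step 7: repair succeeded
    REPAIR_FAILED = auto()         # step 6: error return from the repair


class NoRepairResources(Exception):
    """Hypothetical error: the device has no spare resources left."""


@dataclass
class FakeDevice:
    """Stand-in for a memory device; not a real kernel or CXL interface."""
    spares: int
    log: list = field(default_factory=list)

    def query_resources(self) -> bool:
        self.log.append("query")
        return self.spares > 0

    def offline(self) -> None:
        self.log.append("offline")

    def online(self) -> None:
        self.log.append("online")

    def repair(self) -> None:
        self.log.append("repair")
        if self.spares <= 0:
            raise NoRepairResources("no spare resources")
        self.spares -= 1


def try_repair(dev, use_query=True, before_repair=None) -> Outcome:
    # Step 3: optional query, only an optimization to avoid a wasted offline.
    if use_query and not dev.query_resources():
        return Outcome.SKIPPED_NO_RESOURCES
    # Step 4: take the memory out of use (may be unnecessary on some hardware).
    dev.offline()
    if before_repair is not None:
        before_repair(dev)  # hook to simulate someone racing with us
    try:
        # Step 5: 'try' to repair; this is where a lost race shows up.
        dev.repair()
        return Outcome.REPAIRED
    except NoRepairResources:
        return Outcome.REPAIR_FAILED
    finally:
        # Steps 6 and 7: the memory comes back either way.
        dev.online()
```

Calling try_repair(dev, use_query=False) on a device with no spares goes
straight to the offline / repair / online sequence and ends in
REPAIR_FAILED, which is the "just a normal error return" case.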