Message-ID: <1d98c0a9-3981-4a01-890a-00eb763a140c@fujitsu.com>
Date: Fri, 19 Jul 2024 14:24:13 +0800
Subject: Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device
From: Shiyang Ruan
To: qemu-devel@nongnu.org, linux-cxl@vger.kernel.org, linux-mm@kvack.org, linux-edac@vger.kernel.org
Cc: dan.j.williams@intel.com, Jonathan Cameron, dave@stgolabs.net, ira.weiny@intel.com, alison.schofield@intel.com, dave.jiang@intel.com, vishal.l.verma@intel.com, Borislav Petkov, Tony Luck, James Morse, Mauro Carvalho Chehab, Robert Richter, Miaohe Lin, Naoya Horiguchi
References: <20240618165310.877974-1-ruansy.fnst@fujitsu.com>
In-Reply-To: <20240618165310.877974-1-ruansy.fnst@fujitsu.com>
On 2024/6/19 0:53, Shiyang Ruan wrote:
>
> This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev
> to check whether the current poison page has been reported (if yes,
> stop the notifier chain, won't call the following memory_failure()
> to report), into `x86_mce_decoder_chain`.  In this way, if the poison
> page already handled(recorded and reported) in (1) or (2), the other one
> won't duplicate the report.  The record could be clear when
> cxl_clear_poison() is called.

Hi guys,

I'd like to sort out the work I am currently carrying forward, to make
sure I'm not going in the wrong direction.  Please correct me if
anything is wrong.

As we know, the CXL spec defines a POISON feature that a CXL memory
device uses to report that one of its pages is broken.  There are two
major paths for this notification.

1. CPU handling error
   When a process accesses the broken page, the CXL device returns data
   marked as POISON.  When the CPU consumes the POISON, it raises an
   error notification.  To be precise, how the CPU behaves when it
   consumes POISON is architecture dependent.  In my understanding,
   x86-64 raises a Machine Check Exception (MCE) via interrupt #18 in
   this case.

2. CXL device reporting error
   The CXL device detects the broken page by itself and sends a memory
   error signal to the kernel through one of two optional paths.

   2.a. FW-First
        The CXL device sends the error via VDM to the CXL host, the CXL
        host forwards it to system firmware via an interrupt, and
        finally the kernel handles the error.
   2.b. OS-First
        The CXL device sends the error directly to the kernel via
        MSI/MSI-X.

Note: since I'm currently focusing on x86-64, I'll describe x86-64
only.

The following diagram describes the 2 major paths and the 2 optional
sub-paths above.

```
1.  MCE (interrupt #18, while CPU consuming POISON)
     -> do_machine_check()
       -> mce_log()
         -> notify chain (x86_mce_decoder_chain)
           -> memory_failure()

2.a FW-First (optional, CXL device proactively find&report)
     -> CXL device -> Firmware
       -> OS: ACPI->APEI->GHES->CPER -> CXL driver -> trace

2.b OS-First (optional, CXL device proactively find&report)
     -> CXL device -> MSI
       -> OS: CXL driver -> trace
```

For the "1. CPU handling error" path, the current code seems to work
fine.  When I used the error injection feature in QEMU emulation, this
code path was reliably executed.  So, as long as the CPU raises an MCE
when it consumes the POISON, this path has no problem.

So I'm working on the 2.a and 2.b paths, so that a POISON error
reported by the CXL device can be handled by the kernel.  This approach
has two advantages.

- Proactively find&report memory problems

  Even if a process has not read the data yet, the kernel/drivers can
  proactively prevent the process from using corrupted data.

  AFAIK, the current kernel only traces POISON error events from the
  FW-First/OS-First paths.  It does not handle them, nor does it notify
  the processes that are using the POISON page as the MCE path does.
  User space tools like rasdaemon read the trace and log it, but they
  do not handle the POISON page either.  As a result, the user has to
  read the error log from rasdaemon, distinguish whether the POISON
  error comes from CXL memory or DDR memory, and find out which
  applications are affected.
  That is not an easy task and cannot be done in time.  Thus, I'd like
  to add a feature so that this work is done automatically and quickly.
  Once the CXL device reports the POISON error (via FW-First/OS-First),
  the kernel handles it immediately, similar to the flow when an MCE is
  triggered.  This is my first motivation.

- Architecture independent

  As mentioned above, the "1. CPU handling error" path is architecture
  dependent.  This route, on the other hand, can be
  architecture-independent code.  If there is a CPU that has no feature
  similar to the MCE of x86-64, my work will be essential.  (To be
  honest, I did not notice this advantage at first, as mentioned later,
  but I think it is also important.)

Here is the timeline of my development.  Two series of patches have
been sent so far:
 - PATCH: cxl/core: add poison creation event handler [1]
 - PATCH: cxl: avoid duplicating report from MCE & device [2]

[1] https://lore.kernel.org/linux-cxl/20240417075053.3273543-1-ruansy.fnst@fujitsu.com/
[2] https://lore.kernel.org/linux-cxl/20240618165310.877974-1-ruansy.fnst@fujitsu.com/

The 1st patch[1] added a POISON error handler in the "2. CXL device
reporting error" path.  My first version constructed MCE data from the
POISON address and called mce_log() to handle the POISON.  But I was
told that constructing MCE data is architecture dependent while CXL is
not.  So later versions just call memory_failure_queue() in CXL to
handle the POISON error, which avoids the arch-dependent problem.

After many discussions, a new problem was found: as Dan said[3], the
added POISON handling will cause the "duplicate report" problem:

> So, I think all CXL poison notification events should trigger an
> action optional memory_failure(). I expect this needs to make sure
> that duplicates re not a problem. I.e. in the case of CPU consumption
> of CXL poison, that causes a synchronous MF_ACTION_REQUIRED event via
> the MCE path *and* it may trigger the device to send an error record
> for the same page. As far as I can see, duplicate reports (MCE + CXL
> device) are unavoidable.

[3] https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be294f8@dwillia2-mobl3.amr.corp.intel.com.notmuch/

To solve this problem, I made the 2nd patch[2].  Allow me to describe
the background again:

Since a CXL device is a memory device, a CPU consuming a poison page of
the CXL device always triggers an MCE (via interrupt #18) and calls
memory_failure() to handle the POISON page, no matter which *-First
path is configured.  My patch added memory_failure() to the
FW-First/OS-First paths: if the device finds and reports the POISON,
the kernel not only traces it but also calls memory_failure() to handle
it, marked as "ADD" in the figure below.

```
1.  MCE (interrupt #18, while CPU consuming POISON)
     -> do_machine_check()
       -> mce_log()
         -> notify chain (x86_mce_decoder_chain)
           -> memory_failure() <---------------------------- EXISTS
2.a FW-First (optional, CXL device proactively find&report)
     -> CXL device -> Firmware
       -> OS: ACPI->APEI->GHES->CPER -> CXL driver -> trace
                                                  \-> memory_failure()
                                                      ^----- ADD
2.b OS-First (optional, CXL device proactively find&report)
     -> CXL device -> MSI
       -> OS: CXL driver -> trace
                        \-> memory_failure()
                            ^------------------------------- ADD
```

But this way, memory_failure() could be called twice, or even
concurrently, before the POISON page is cleared: once from (1.) and
once from (2.a or 2.b), as shown in the figure above.
memory_failure() has its own mutex, so the two calls won't actually run
at the same time, and the later call is avoided because the HWPoison
bit has already been set.  However, consider this scenario: "CXL device
reports POISON error" triggers the 1st call; the user sees it in the
log and wants to clear the poison by running the `cxl clear-poison`
command; at the same time, a process tries to access this POISON page,
which triggers an MCE (the 2nd call).
Since there is no lock between the 2nd call and the clear-poison
operation, a race condition may occur, which may leave the HWPoison bit
of the page in an unknown state.

Thus, we have to avoid the 2nd call.  This patch[2] introduces a new
notifier_block into `x86_mce_decoder_chain` and a POISON cache list, to
stop the 2nd call of memory_failure().  It checks whether the current
poison page has already been reported (if yes, it stops the notifier
chain, so the following memory_failure() is not called to report it
again).

Looking forward to your comments!

--
Thanks,
Ruan.
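
P.S. To make the dedup idea of [2] easier to discuss, here is a minimal
userspace sketch in C.  It is not the patch itself and not kernel code:
every name in it (sketch_notifier, SKETCH_NOTIFY_STOP, sketch_mce(),
the fixed-size PFN cache, and so on) is a hypothetical stand-in that I
made up for illustration, and it models no locking at all.  It only
shows the intended ordering: a higher-priority CXL notifier consults
the POISON cache and stops the chain before the memory_failure() stage
when the page was already reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's NOTIFY_OK / NOTIFY_STOP. */
enum sketch_notify { SKETCH_NOTIFY_OK, SKETCH_NOTIFY_STOP };

/* Simplified, hypothetical analogue of struct notifier_block. */
struct sketch_notifier {
    enum sketch_notify (*call)(uint64_t pfn);
    int priority;                   /* higher priority runs first */
    struct sketch_notifier *next;
};

#define SKETCH_CACHE_MAX 64
static struct sketch_notifier *chain_head;
static uint64_t poison_cache[SKETCH_CACHE_MAX];  /* reported PFNs */
static size_t poison_count;
unsigned int memory_failure_calls;  /* how often the handler ran */

/* Index of pfn in the cache, or poison_count if it is not recorded. */
static size_t cache_find(uint64_t pfn)
{
    size_t i = 0;
    while (i < poison_count && poison_cache[i] != pfn)
        i++;
    return i;
}

/* Record pfn; returns false if it was already recorded. */
static bool cache_add_new(uint64_t pfn)
{
    if (cache_find(pfn) < poison_count)
        return false;
    assert(poison_count < SKETCH_CACHE_MAX);  /* no overflow handling */
    poison_cache[poison_count++] = pfn;
    return true;
}

/* Stand-in for memory_failure(): only counts invocations. */
static enum sketch_notify sketch_memory_failure(uint64_t pfn)
{
    (void)pfn;
    memory_failure_calls++;
    return SKETCH_NOTIFY_OK;
}

/* CXL notifier: stop the chain if the page was already reported. */
static enum sketch_notify cxl_mce_notify(uint64_t pfn)
{
    return cache_add_new(pfn) ? SKETCH_NOTIFY_OK : SKETCH_NOTIFY_STOP;
}

static struct sketch_notifier cxl_nb = { cxl_mce_notify, 2, NULL };
static struct sketch_notifier mf_nb = { sketch_memory_failure, 1, NULL };

/* Insert into the chain, keeping it sorted by descending priority. */
static void chain_register(struct sketch_notifier *nb)
{
    struct sketch_notifier **p = &chain_head;
    while (*p && (*p)->priority >= nb->priority)
        p = &(*p)->next;
    nb->next = *p;
    *p = nb;
}

void sketch_init(void)
{
    chain_head = NULL;
    poison_count = 0;
    memory_failure_calls = 0;
    chain_register(&mf_nb);
    chain_register(&cxl_nb);
}

/* Path 1: the MCE walks the chain until a notifier asks to stop. */
void sketch_mce(uint64_t pfn)
{
    for (struct sketch_notifier *nb = chain_head; nb; nb = nb->next)
        if (nb->call(pfn) == SKETCH_NOTIFY_STOP)
            break;
}

/* Path 2.a/2.b: a device report handles the page only if it is new. */
void sketch_cxl_device_report(uint64_t pfn)
{
    if (cache_add_new(pfn))
        (void)sketch_memory_failure(pfn);
}

/* Analogue of dropping the record when cxl_clear_poison() is called. */
void sketch_clear_poison(uint64_t pfn)
{
    size_t i = cache_find(pfn);
    if (i < poison_count)
        poison_cache[i] = poison_cache[--poison_count];
}
```

In this model, calling sketch_cxl_device_report() and then sketch_mce()
for the same PFN runs the handler once; after sketch_clear_poison() the
page can be reported again.  The sketch deliberately leaves out the
locking between the cache and the clear operation, which is the part
that still needs care in real code.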