From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 42C01C3DA4A for ; Fri, 16 Aug 2024 07:47:32 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D095D8D0054; Fri, 16 Aug 2024 03:47:31 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id CB9C48D0002; Fri, 16 Aug 2024 03:47:31 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B595D8D0054; Fri, 16 Aug 2024 03:47:31 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 94E588D0002 for ; Fri, 16 Aug 2024 03:47:31 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 448EC1219E6 for ; Fri, 16 Aug 2024 07:47:31 +0000 (UTC) X-FDA: 82457328702.16.F49A50E Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.21]) by imf02.hostedemail.com (Postfix) with ESMTP id DB9B58001C for ; Fri, 16 Aug 2024 07:47:27 +0000 (UTC) Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b="F/irJokg"; spf=pass (imf02.hostedemail.com: domain of ying.huang@intel.com designates 198.175.65.21 as permitted sender) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1723794374; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=grJfqQHdTLVsHBjlqFeUgRjQQnWzl2i9ue8IB0I2Edc=; b=g1bWo1oNGvig2NmyOebliDKdD/X1pgUxR43CDLfwBtpiwbUEO1MUVcoXKeydgAeUHIuoJ6 TZw5ic98KUFeLlRg580ukgllNvpEldj5AbB7AnRHBu8gBSUfHJ/oW2NMm77iZO39FayDR+ 6uqQDXp1unwDwA/Mnd84EdT1/K3stqc= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1723794374; a=rsa-sha256; cv=none; b=s4qun6U2QK6z6zZs3blGD2qxZ/BQBPydPM/wRhH+H7xe+ELNTzuk8u+wTAP3seH4e2NkNv ZJBD2iLmwnq8gooPcyNzL4cW7FNKTr71fsYxLtUkRD8ffZlsOM/ctrMD9ZlbhnYvb8jDSe dOV5jam2R2hYPDPmb/wXp4jdda338ak= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b="F/irJokg"; spf=pass (imf02.hostedemail.com: domain of ying.huang@intel.com designates 198.175.65.21 as permitted sender) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1723794448; x=1755330448; h=from:to:cc:subject:in-reply-to:references:date: message-id:mime-version; bh=OqVTIszw2Wu/WHB8JbwN1IKzYU3OakSmmpks4rcjQ68=; b=F/irJokgnV1uMkdeZ/RbKLlVd4Aj1POEsv8OHPtQ5TnlcD457djZ6iHx VDpf7Kne2oNtfPedN7w0iA2XpiRrVBFaEUDqLsYXL6ZzNeRVE+ChIkm0Z w1SW3L4mFGtVxQ8XlYm/hHcMr8QfBTrqzAOR4V0kf+sN6JoUSC5oXCysY ++J/4PKFCjk5r+yklKP/HXO0GDyJ3xll1aQWHtVwBcdZZFYUEZwOcKuF+ dnwuUkgrO323/bEyknWhNrSWRGhJJwOD5vnimoz2srJnHYfCWDqt2ryqa +5RF8o1mwBXd+gFHJ9IlcQX8V3xtZD8/oUtXRil6bqFrGTF3aDiKLXLqD g==; X-CSE-ConnectionGUID: i2v851kwQeqVS8InYbViEw== X-CSE-MsgGUID: bZ2fSfvYQeKLlBrF2gTDAA== X-IronPort-AV: E=McAfee;i="6700,10204,11165"; a="22051677" X-IronPort-AV: E=Sophos;i="6.10,151,1719903600"; d="scan'208";a="22051677" Received: from orviesa010.jf.intel.com ([10.64.159.150]) by orvoesa113.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Aug 2024 00:47:26 -0700 X-CSE-ConnectionGUID: kzn96sZiSz+afU+mHl02Wg== X-CSE-MsgGUID: r5CXRn9pQq6vcuKL9sOvdA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.10,151,1719903600"; d="scan'208";a="59437266" Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55]) by orviesa010-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Aug 2024 00:47:23 -0700 From: "Huang, Ying" To: Dan Williams Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, Davidlohr Bueso , Jonathan Cameron , Dave Jiang , Alison Schofield , Vishal Verma , Ira Weiny , Alistair Popple , Andy Shevchenko , Bjorn Helgaas , Baoquan He Subject: Re: [PATCH] Resource: fix region_intersects() for CXL memory In-Reply-To: <66bedd0f746ab_1c18294b5@dwillia2-mobl3.amr.corp.intel.com.notmuch> (Dan Williams's message of "Thu, 15 Aug 2024 22:01:03 -0700") References: <20240816020723.771196-1-ying.huang@intel.com> <66bedd0f746ab_1c18294b5@dwillia2-mobl3.amr.corp.intel.com.notmuch> Date: Fri, 16 Aug 2024 15:43:49 +0800 Message-ID: <87y14wj4ju.fsf@yhuang6-desk2.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=ascii X-Stat-Signature: ux8bmdk4uujidafamkw3zou5aesgk43m X-Rspamd-Queue-Id: DB9B58001C X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1723794447-84856 X-HE-Meta: U2FsdGVkX19kt2/EluReGEqc5pJ+6MV9EUdCRCAKYNRv+jZLUdaTEcRl1hRP1n8lWrMOTUcvbCzzQDTuoo2VlnL3QLb+hoSDmhf8tVfzh0OceCjup4laPnnp4PSzc+TyOm5aDAeaUrvetzyzeOGZcms8cVax5divuYJcdVTt1ciws5Yyl6qCQRWdXJhZKhMIesXO/PxFQedTJ2a8EOl//QHLI3sqsejUpOyxknE/g33V6+B7XqRPAxcOBpuzoSPgtHFN0JOw90ZeCHavlor67wGflCjL1K9Xbm6wf7Zgum8OTCPf5DGn6mAdRRr3sYq5mSGeCSrXNhKVIWlm4aokAsm8ZRTxEnbDrmGh2wWysm3ragc81zRXcH0/dB2Jd1Qc+UxtrFavUrqBgFoInxLR9f+guV6Z2+SDyigvPIiw213IPLKS//wBUQy0Oq3pnPFqIL6GG421pwbEsJhjD0vQ+H89G3yz/LzzlKkMam8pkpBjfwlsrlIN6FsWRybhEQuqVIXPdmeIDHOq8VfsPmNThYGB16komrfzKHqyUwnwVv1roOvR0IwrCT3m4qWPDr1cdBDCtab3wNsmq7XfggAZT72y9NTET3bls07I/mlxrOh1juPaOaZFTivpyCJvb7xJ7umKHWm3B47BeowdaN3whOJPkzFrqz/iUn1paRHMUvLsfQB47kYkFivi4dv5tO6zQr9vjpo7W7NDQFePhb649Qdf0ph01O2uYrAchYOxa91HycgYW9dEa93zjWZFxRxMdQp5ShumcBGmcBomPa2v7HmVSlZqjz0rmW669CRGbEPy7eQIwA42GxCxqELfBjM8dFuEWOUQ8DoV9fgCGkp0gGktag638JCBSU/QoJTLcO495irvDaupEV3Xa1cOYtyJOJI2joK7y/YCXuTnz42zp+w3kfgTtqSul7sZJ7z8G6yZSgSj77wYzN1cDEZ2AHUJDZk6fGdHx0l2tj/0Jmw phl54yFD omRKqmcEUwva5+biTFGqXqu2FpW08x832tc4uw8hTTGDwnPBJ5VAJEMVcUhFxqRq6Cc1Hk2Mamqrp1rqtQPer2ov+E1BqSQcCN60ABWyenpaDqpjuD5hDD0PNMCc689V7G+VFyrVgueJfUvPlw44o0o7sj/F1NHbFukSVwyTPosIuKEBhZgDw/HpVbgTuKW9+l2hWHLV2cqcjqY5oo6TgUULDcyGRQ5LlGHJKlxRUf7n6MR6KeBgdiLcRSFtZ/SpW/YEOEdCxl8v721xJQoBYZVOuVJDGUgPfi1iHVAqD0yeFW030V63Z7vvO9nvmTcIE6WkCpmiUYHNqGz9BmP2PdZFwFLhyzj9JCTmNAp+g5TDUtiCM3F1kgcH0UHwA1haR9RXin7LXMsNvOiDJJ4BqODPIbi2Ns1hWt/hEcNsWM4kjG1w= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Dan Williams writes: > Huang Ying wrote: >> On a system with CXL memory installed, the resource tree (/proc/iomem) >> related to CXL memory looks like something as follows. >> >> 490000000-50fffffff : CXL Window 0 >> 490000000-50fffffff : region0 >> 490000000-50fffffff : dax0.0 >> 490000000-50fffffff : System RAM (kmem) >> >> When the following command line is run to try writing some memory in >> CXL memory range, >> >> $ dd if=data of=/dev/mem bs=1k seek=19136512 count=1 >> dd: error writing '/dev/mem': Bad address >> 1+0 records in >> 0+0 records out >> 0 bytes copied, 0.0283507 s, 0.0 kB/s >> >> the command fails as expected. However, the error code is wrong. It >> should be "Operation not permitted" instead of "Bad address". And, >> the following warning is reported in kernel log. >> >> ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff >> WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d >> Modules linked in: cxl_pmem libnvdimm cbc encrypted_keys cxl_pmu >> CPU: 2 UID: 0 PID: 416 Comm: dd Not tainted 6.11.0-rc3-kvm #40 >> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 >> RIP: 0010:__ioremap_caller.constprop.0+0x131/0x35d >> Code: 2d 80 3d 24 6a a1 02 00 75 c1 48 8d 54 24 70 48 8d b4 24 90 00 00 00 48 c7 c7 40 3a 05 8a c6 05 07 6a a1 02 01 e8 a3 a0 01 00 <0f> 0b eb 9d 48 8b 84 24 90 00 00 00 48 8d 4c 24 60 89 ea 48 bf 00 >> RSP: 0018:ffff888105387bf0 EFLAGS: 00010282 >> RAX: 0000000000000000 RBX: 0000000490000fff RCX: 0000000000000000 >> RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffed1020a70f73 >> RBP: 0000000000000000 R08: ffffed100d9fce92 R09: 0000000000000001 >> R10: ffffffff892348e7 R11: ffffed100d9fce91 R12: 0000000490000000 >> R13: 0000000000000001 R14: 0000000000000001 R15: ffff888105387ca0 >> FS: 00007f86c438c740(0000) GS:ffff88806ce00000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 000055ba75b1b818 CR3: 0000000005231000 CR4: 0000000000350eb0 >> Call Trace: >> >> ? __warn+0xd7/0x1b8 >> ? __ioremap_caller.constprop.0+0x131/0x35d >> ? report_bug+0x136/0x19e >> ? __ioremap_caller.constprop.0+0x131/0x35d >> ? handle_bug+0x3c/0x64 >> ? exc_invalid_op+0x13/0x38 >> ? asm_exc_invalid_op+0x16/0x20 >> ? irq_work_claim+0x16/0x38 >> ? __ioremap_caller.constprop.0+0x131/0x35d >> ? tracer_hardirqs_off+0x18/0x16d >> ? kmem_cache_debug_flags+0x16/0x23 >> ? memremap+0xcb/0x184 >> ? iounmap+0xfe/0xfe >> ? lock_sync+0xc7/0xc7 >> ? lock_sync+0xc7/0xc7 >> ? rcu_is_watching+0x1c/0x38 >> ? do_raw_read_unlock+0x37/0x42 >> ? _raw_read_unlock+0x1f/0x2f >> memremap+0xcb/0x184 >> ? pfn_valid+0x159/0x159 >> ? resource_is_exclusive+0xba/0xc5 >> xlate_dev_mem_ptr+0x25/0x2f >> write_mem+0x94/0xfb >> vfs_write+0x128/0x26d >> ? kernel_write+0x89/0x89 >> ? rcu_is_watching+0x1c/0x38 >> ? __might_fault+0x72/0xba >> ? __might_fault+0x72/0xba >> ? rcu_is_watching+0x1c/0x38 >> ? lock_release+0xba/0x13e >> ? files_lookup_fd_raw+0x40/0x4b >> ? __fget_light+0x64/0x89 >> ksys_write+0xac/0xfe >> ? __ia32_sys_read+0x40/0x40 >> ? tracer_hardirqs_off+0x18/0x16d >> ? tracer_hardirqs_on+0x11/0x146 >> do_syscall_64+0x9a/0xfd >> entry_SYSCALL_64_after_hwframe+0x4b/0x53 >> RIP: 0033:0x7f86c4487140 >> Code: 40 00 48 8b 15 c1 9c 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 24 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89 >> RSP: 002b:00007ffca9f62af8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 >> RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f86c4487140 >> RDX: 0000000000000400 RSI: 000055ba75b1a000 RDI: 0000000000000001 >> RBP: 000055ba75b1a000 R08: 0000000000000000 R09: 00007f86c457c080 >> R10: 00007f86c43a84d0 R11: 0000000000000202 R12: 0000000000000000 >> R13: 0000000000000000 R14: 000055ba75b1a000 R15: 0000000000000400 >> >> irq event stamp: 0 >> hardirqs last enabled at (0): [<0000000000000000>] 0x0 >> hardirqs last disabled at (0): [] copy_process+0xb60/0x255f >> softirqs last enabled at (0): [] copy_process+0xb60/0x255f >> softirqs last disabled at (0): [<0000000000000000>] 0x0 >> >> After investigation, we found the following bug. >> >> In the above resource tree, "System RAM" is a descendant of "CXL >> Window 0" instead of a top level resource. So, region_intersects() >> will report no System RAM resources in the CXL memory region >> incorrectly, because it only checks the top level resources. >> Consequently, devmem_is_allowed() will return 1 (allow access via >> /dev/mem) for CXL memory region incorrectly. Fortunately, ioremap() >> doesn't allow to map System RAM and reject the access. >> >> However, region_intersects() needs to be fixed to work correctly with >> the resources tree with CXL Window as above. To fix it, we will >> search matched resources in the descendant resources too. So, we will >> not miss any matched resources anymore. > > Thanks for this Ying! > > I think this needs an explanation of the expected semantics of > region_intersects() and maybe a fixup to meet that expectation, see > below. > >> Signed-off-by: "Huang, Ying" >> Cc: Dan Williams >> Cc: Davidlohr Bueso >> Cc: Jonathan Cameron >> Cc: Dave Jiang >> Cc: Alison Schofield >> Cc: Vishal Verma >> Cc: Ira Weiny >> Cc: Alistair Popple >> Cc: Andy Shevchenko >> Cc: Bjorn Helgaas >> Cc: Baoquan He >> --- >> kernel/resource.c | 44 +++++++++++++++++++++++++++++++++++++------- >> 1 file changed, 37 insertions(+), 7 deletions(-) >> >> diff --git a/kernel/resource.c b/kernel/resource.c >> index 14777afb0a99..c97a5add9394 100644 >> --- a/kernel/resource.c >> +++ b/kernel/resource.c >> @@ -542,18 +542,48 @@ static int __region_intersects(struct resource *parent, resource_size_t start, >> { >> struct resource res; >> int type = 0; int other = 0; >> - struct resource *p; >> + struct resource *p, *dp; >> + resource_size_t ostart, oend; >> + bool is_type; >> >> res.start = start; >> res.end = start + size - 1; >> >> for (p = parent->child; p ; p = p->sibling) { >> - bool is_type = (((p->flags & flags) == flags) && >> - ((desc == IORES_DESC_NONE) || >> - (desc == p->desc))); >> - >> - if (resource_overlaps(p, &res)) >> - is_type ? type++ : other++; >> + if (!resource_overlaps(p, &res)) >> + continue; >> + is_type = (((p->flags & flags) == flags) && >> + ((desc == IORES_DESC_NONE) || (desc == p->desc))); >> + if (is_type) { >> + type++; >> + continue; >> + } >> + /* >> + * Continue to search in descendant resources. Unless >> + * the matched descendant resources cover the whole >> + * overlapped range, increase 'other', because it >> + * overlaps with 'p' at least. >> + */ >> + other++; > > This results in REGION_MIXED whenever the target of the search is found > as a descendant of @parent which I believe is unwanted. This is not the behavior of this patch. There's a "other--" later in this patch. + ostart = max(res.start, p->start); + oend = min(res.end, p->end); + for_each_resource(p, dp, false) { + if (!resource_overlaps(dp, &res)) + continue; + is_type = (((dp->flags & flags) == flags) && + ((desc == IORES_DESC_NONE) || + (desc == dp->desc))); + if (is_type) { + type++; + if (dp->start > ostart) + break; + if (dp->end >= oend) { + other--; <====================== HERE! + break; + } + ostart = dp->end + 1; + } + } } if (type == 0) That is, if the overlapped range is covered by matched (is_type == true) descendant resources completely, other will not increase. So, for resource tree as follows 490000000-52fffffff : CXL Window 0 490000000-50fffffff : region0 490000000-50fffffff : dax0.0 490000000-50fffffff : System RAM (kmem) 510000000-52fffffff 510000000-52fffffff : dax0.1 region_intersects(, 0x490000000, PAGE_SIZE, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) => REGION_INTERSECTS region_intersects(, 0x50f000000, 0x2000000, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) => REGION_MIXED Even for 490000000-52fffffff : CXL Window 0 490000000-50fffffff : region0 490000000-50fffffff : dax0.0 490000000-50fffffff : System RAM (kmem) region_intersects(, 0x50f000000, 0x2000000, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) => REGION_MIXED This isn't perfect, but it looks OK for me. Because for 490000000-50fffffff : System RAM 510000000-52fffffff : CXL Window 0 region_intersects(, 0x50f000000, 0x2000000, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) => REGION_MIXED However, I admit that the original code is hard to be understood, whether is something like below better? for (p = parent->child; p ; p = p->sibling) { if (!resource_overlaps(p, &res)) continue; is_type = (((p->flags & flags) == flags) && ((desc == IORES_DESC_NONE) || (desc == p->desc))); if (is_type) { type++; continue; } /* * Continue to search in descendant resources. Unless * the matched descendant resources cover the whole * overlapped range, increase 'other', because it * overlaps with 'p' at least. */ covered = false; ostart = max(res.start, p->start); oend = min(res.end, p->end); for_each_resource(p, dp, false) { if (!resource_overlaps(dp, &res)) continue; is_type = (((dp->flags & flags) == flags) && ((desc == IORES_DESC_NONE) || (desc == dp->desc))); if (is_type) { type++; if (dp->start > ostart) break; if (dp->end >= oend) { covered = true; break; } ostart = dp->end + 1; } } if (!covered) other++; } > The semantics of region_intersects() has always been within a single > sibling level to date. So, I don't think @other should be incremented > until @is_type is non-zero. It follows that if @is_type is set and > !resource_contains(p, &res) then there is no point in descending because > it is known at that there are no descendants to worry about. Sorry, I don't understand your words here. Can you show your idea with some examples or pseudo code? > I do note that with those changes region_intersects() will start > reporting REGION_MIXED rather than REGION_INTERSECTS when the passed in > range intersects a hole plus the target, but I am not aware of any code > path that expects to ignore resource holes. I.e. better to report holes > as occupied in that case. > > Lastly, it would probably also be useful to special case the "@flags == > 0 && @desc == IORES_DESC_NONE" case because there is no point in > descending the resource tree to look for additional holes. Those are all > visible at the top of the resource tree. -- Best Regards, Huang, Ying