From: Suren Baghdasaryan
Date: Mon, 19 May 2025 09:00:46 -0700
Subject: Re: BUG: unable to handle page fault for address
To: David Wang <00107082@163.com>
Cc: kent.overstreet@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Sun, May 18, 2025 at 2:55 AM David Wang <00107082@163.com> wrote:
>
>
> >>>
> >>> I do notice there are places where counters are referenced "after" free_module, but the logs I attached
> >>> happened "during" free_module():
> >>>
> >>> [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> >>> [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> >>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> >>> ...
> >>> [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> >>> ...
> >>> [Fri May 16 12:05:41 2025] Call Trace:
> >>> [Fri May 16 12:05:41 2025]
> >>> [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> >>> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> >>>
> >>> The call chain is the same as you mentioned above.
> >>
> >>Is this failure happening before or after my fix? With my fix, percpu
> >>data should not be freed at all if tags are still used. Please
> >>clarify.
> >
> >It is before your fix. Your patch does fix the issue.
> >
> >My reproduction procedure:
> >1. enter recovery mode
> >2. install nvidia driver 570.144, which fails with "Unknown symbol drm_client_setup"
> >3. modprobe drm_client_lib
> >4. install nvidia driver 570.144
> >5. install nvidia driver 550.144.03
> >6. reboot and repeat from step 1
> >
> >The error happened in step 4, and the failure in step 2 is crucial: if I run 'modprobe drm_client_lib' at the beginning, no error is observed.
> >
> >There may be something off about how the kernel handles the data.percpu section.
> >The good thing is that it can be reproduced, so I can add debug messages to clear or confirm suspicions.
> >Any suggestion?
> >
> >
> >Thanks
> >David
> >
> >
>
> After poking around logging memory addresses, I think I finally understand what is happening here.
>
> 1. codetag_alloc_module_section allocates memory when loading the module
> 2. module load fails, due to an undefined symbol
> 3. the codetag section memory is not freed
> 4. the module is loaded again, and the module's address happens to reuse the previously used address
> 5. another codetag_alloc_module_section
> 6. percpu section allocation, and then relocation address changes are made to the codetag_alloc_module_section memory
> 7. on module unload, when searching through the maple tree, the codetag memory from step 1 is used,
> which has no relocation address populated at all.
> 8. page fault error, because tag->counters is 0
>
> I used the following change to log the addresses:
>
>
> The offending address is
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -575,6 +575,11 @@ static void release_module_tags(struct module *mod, bool used)
>  	if (!used)
>  		goto release_area;
>
> +	struct alloc_tag *ptag = (struct alloc_tag *)(module_tags.start_addr + mas.index);
> +	pr_info("percpu 0: 0x%llx(0x%llx)\n",
> +		(long long)per_cpu_ptr(ptag->counters, 0),
> +		(long long)ptag->counters
> +	);
>
>
> And got the following:
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee41030(0xffffffffbc57e030)
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee410e0(0xffffffffbc57e0e0)
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee40fa0(0xffffffffbc57dfa0)
> [Sun May 18 16:26:43 2025] percpu 0: 0xffff8edbb28c3000(0x0) <------
>
>
> I think we have spotted two issues in this thread:
>
> 1. when module load fails after the codetag section is allocated, the memory leaks.
> 2. counters may need to be referenced even after the module is unloaded.
>
> #2 has already been addressed by your patch. I will send a simple patch to fix #1.
>
> (I feel so relieved to finally draw a conclusion; I hope there are no silly mistakes here :)

I see. So, layout_and_allocate() succeeds in allocating the codetag memory,
but on a later failure we fail to free it. Makes sense, and your patch looks
good to me.
Thanks!

>
>
> Thanks
> David
>