Date: Mon, 22 Jul 2024 11:28:42 +0200
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: "Zhijian Li (Fujitsu)", "linux-mm@kvack.org"
Cc: "akpm@linux-foundation.org", "linux-kernel@vger.kernel.org", "Yasunori Gotou (Fujitsu)", David Hildenbrand, "Xingtao Yao (Fujitsu)"
Subject: Re: [PATCH v2] mm/page_alloc: Fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()
References: <20240722021059.1076399-1-lizhijian@fujitsu.com> <8323327f-3386-48ba-8554-10a5a6d12a04@kernel.org>

On 7/22/24 11:15 AM, Zhijian Li (Fujitsu) wrote:
> Hi David,
>
> Thanks for your quick reply.
>
> On 22/07/2024 14:44, Vlastimil Babka (SUSE) wrote:
>> On 7/22/24 4:10 AM, Li Zhijian wrote:
>>> It's expected that no page should be left in pcp_list after calling
>>> zone_pcp_disable() in offline_pages(). Previously, it was observed that
>>> offline_pages() gets stuck [1] due to some pages remaining in pcp_list.
>>>
>>> Cause:
>>> There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
>>> involving the pcp->count variable. See the scenario below:
>>>
>>>         CPU0                              CPU1
>>>     ----------------                  ---------------
>>>                                       spin_lock(&pcp->lock);
>>>                                       __rmqueue_pcplist() {
>>> zone_pcp_disable() {
>>>                                         /* list is empty */
>>>                                         if (list_empty(list)) {
>>>                                           /* add pages to pcp_list */
>>>                                           alloced = rmqueue_bulk()
>>>   mutex_lock(&pcp_batch_high_lock)
>>>   ...
>>>   __drain_all_pages() {
>>>     drain_pages_zone() {
>>>       /* read pcp->count, it's 0 here */
>>>       count = READ_ONCE(pcp->count)
>>>       /* 0 means nothing to drain */
>>>                                           /* update pcp->count */
>>>                                           pcp->count += alloced << order;
>>>       ...
>>>                                       ...
>>>                                       spin_unlock(&pcp->lock);
>>>
>>> In this case, after calling zone_pcp_disable(), there are still some
>>> pages in pcp_list. And since these pages in pcp_list are neither movable
>>> nor isolated, offline_pages() gets stuck as a result.
>>>
>>> Solution:
>>> Expand the scope of the pcp->lock to also protect pcp->count in
>>> drain_pages_zone(), to ensure no pages are left in the pcp list after
>>> zone_pcp_disable().
>>>
>>> [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
>>>
>>> Cc: David Hildenbrand
>>> Cc: Vlastimil Babka (SUSE)
>>> Reported-by: Yao Xingtao
>>> Signed-off-by: Li Zhijian
>>
>> Can we find a breaking commit for Fixes: ?
>
> I haven't confirmed the first bad commit because my reproducer cannot run on
> older kernels for some reasons, but I noticed that the count has been read
> without the lock held since the commit below:
>
> 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
>
>>
>>> ---
>>> V2:
>>> - Narrow down the scope of the spin_lock() to limit the draining latency. # Vlastimil and David
>>> - In the above scenario, it's sufficient to read pcp->count once with the lock
>>>   held, and this fully fixed my issue [1] across thousands of runs (it happened
>>>   in more than 5% of runs before).
>>
>> That should be ok indeed, but...
>>
>>> RFC:
>>> https://lore.kernel.org/linux-mm/20240716073929.843277-1-lizhijian@fujitsu.com/
>>> ---
>>>  mm/page_alloc.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 9ecf99190ea2..5388a35c4e9c 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2323,8 +2323,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>>>  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>>>  {
>>>  	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>>> -	int count = READ_ONCE(pcp->count);
>>> +	int count;
>>>
>>> +	spin_lock(&pcp->lock);
>>> +	count = pcp->count;
>>> +	spin_unlock(&pcp->lock);
>>>  	while (count) {
>>>  		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
>>>  		count -= to_drain;
>>
>> It's wasteful to do a lock/unlock cycle just to read the count.
>
> How about,
>
> static void drain_pages_zone(unsigned int cpu, struct zone *zone)
> {
> 	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> 	int count, to_drain;
>
> 	do {
> 		spin_lock(&pcp->lock);
> 		to_drain = min(pcp->count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> 		free_pcppages_bulk(zone, to_drain, pcp, 0);
> 		spin_unlock(&pcp->lock);
> 	} while (to_drain);

Yeah, better than break. But I think you still should use

	count = pcp->count;
	...
	count -= to_drain;
} while (count);

or you make one extra wasteful iteration just to find that to_drain is zero.
(Assuming "it's sufficient to read pcp->count once with lock held", which I
agree with.)

> }
>> It could
>> rather look something like this:
>>
>
> Sorry, I don't follow your code...
>
>> 	while (true)
>> 		spin_lock(&pcp->lock);
>> 		count = pcp->count;
>> 		...
>> 		count -= to_drain;
>> 		if (to_drain)
>> 			drain_zone_pages(...)
>
> Which subroutine does this code belong to, and why does it involve drain_zone_pages?

Yeah, sorry, I meant free_pcppages_bulk()

>> 		...
>> 		spin_unlock(&pcp->lock);
>> 		if (count)
>> 			break;
>
> Thanks
> Zhijian
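
For reference, here is a minimal sketch of how drain_pages_zone() could look
with the suggestions above folded together: one locked read of pcp->count per
iteration, batched freeing via free_pcppages_bulk(), and no extra empty
iteration at the end. This is only a sketch of the approach discussed in this
thread, not necessarily the version that is ultimately merged:

	static void drain_pages_zone(unsigned int cpu, struct zone *zone)
	{
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
		int count;

		do {
			spin_lock(&pcp->lock);
			/*
			 * Read the count under pcp->lock so a concurrent
			 * __rmqueue_pcplist() cannot refill the list unseen.
			 */
			count = pcp->count;
			if (count) {
				int to_drain = min(count,
						pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);

				free_pcppages_bulk(zone, to_drain, pcp, 0);
				count -= to_drain;
			}
			/*
			 * Drop the lock between batches to bound the latency
			 * imposed on allocators contending on pcp->lock.
			 */
			spin_unlock(&pcp->lock);
		} while (count);
	}

Compared to the v2 patch, this avoids a separate lock/unlock cycle whose only
purpose is to read the count; compared to the "} while (to_drain)" variant, it
avoids a final iteration that only observes to_drain == 0.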