Date: Tue, 25 Jan 2022 13:06:17 -0800
From: Minchan Kim
To: Michal Hocko
Cc: Andrew Morton, David Hildenbrand, linux-mm, LKML, Suren Baghdasaryan, John Dias
Subject: Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Content-Type: text/plain; charset=us-ascii

On Tue, Jan 25, 2022 at 10:23:13AM +0100, Michal Hocko wrote:
> On Mon 24-01-22 14:22:03, Minchan Kim wrote:
> [...]
> >        CPU 0                              CPU 1
> >
> >    lru_cache_disable                  lru_cache_disable
> >    ret = atomic_inc_return;(ret = 1)
> >
> >                                       ret = atomic_inc_return;(ret = 2)
> >
> >    lru_add_drain_all(true);
> >                                       lru_add_drain_all(false)
> >                                       mutex_lock() is holding
> >    mutex_lock() is waiting
> >
> >                                       IPI with !force_all_cpus
> >                                       ...
> >                                       ...
> >                                       IPI done but it skipped some CPUs
> >
> >    ..
> >    ..
> >
> > Thus, lru_cache_disable on CPU 1 doesn't run with every CPUs so it
> > introduces race of lru_disable_count so some pages on cores
> > which didn't run the IPI could accept upcoming pages into per-cpu
> > cache.
>
> Yes, that is certainly possible but the question is whether it really
> matters all that much. The race would require also another racer to be
> adding a page to an _empty_ pcp list at the same time.
>
> pagevec_add_and_need_flush
>   1) pagevec_add # add to pcp list
>   2) lru_cache_disabled
>        atomic_read(lru_disable_count) = 0
>   # no flush but the page is on pcp
>
> There is no strong memory ordering between 1 and 2 and that is why we
> need an IPI to enforce it in general IIRC.

Correct.

> But lru_cache_disable is not a strong synchronization primitive. It aims
> at providing a best effort means to reduce false positives, right? IMHO

Nope. See commit d479960e44f27 ("mm: disable LRU pagevec during the
migration temporarily").

Originally, it was designed to close the race fundamentally.

> it doesn't make much sense to aim for perfection because all users of
> this interface already have to live with temporary failures and pcp
> caches is not the only reason to fail - e.g. short lived page pins.

Short-lived page pins are real, but they are not a reason to give up on
making the allocation faster and more deterministic. As I mentioned, the
IPI can easily take up to hundreds of milliseconds once the CPUs are fully
booked. If we reduce that cost, the time can be spent on more productive
work. I am working on making CMA more deterministic, and this patch is one
part of that effort.

> That being said, I would rather live with a best effort and simpler
> implementation approach rather than aim for perfection in this case.
> The scheme is already quite complex and another lock in the mix doesn't

lru_add_drain_all already hides the whole complexity inside, and
lru_cache_disable adds a simple synchronization on top of it to keep the
ordering. That's a natural SW stack and I don't see much complication
here.

> make it any easier to follow. If others believe that another lock makes

Disagree. lru_cache_disable is designed to guarantee closing the race you
are opening again, so the rest of the allocator code doesn't have to
consider the race at all once the per-cpu cache is disabled. It's simpler
and more deterministic, and we could build other things on top of it
(e.g., zone->pcp).
> the implementation more straightforward I will not object but I would go
> with the following.
>
> diff --git a/mm/swap.c b/mm/swap.c
> index ae8d56848602..c140c3743b9e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>   */
>  void lru_cache_disable(void)
>  {
> -	atomic_inc(&lru_disable_count);
> +	int count = atomic_inc_return(&lru_disable_count);
> +
>  #ifdef CONFIG_SMP
>  	/*
>  	 * lru_add_drain_all in the force mode will schedule draining on
> @@ -931,8 +932,28 @@ void lru_cache_disable(void)
>  	 * The atomic operation doesn't need to have stronger ordering
>  	 * requirements because that is enforeced by the scheduling
>  	 * guarantees.
> +	 * Please note that there is a potential for a race condition:
> +	 * CPU0                               CPU1               CPU2
> +	 * pagevec_add_and_need_flush
> +	 *   pagevec_add # to the empty list
> +	 *   lru_cache_disabled
> +	 *     atomic_read # 0
> +	 *                                    lru_cache_disable  lru_cache_disable
> +	 *                                      atomic_inc_return (1)
> +	 *                                                         atomic_inc_return (2)
> +	 *                                      __lru_add_drain_all(true)
> +	 *                                                         __lru_add_drain_all(false)
> +	 *                                                           mutex_lock
> +	 *                                      mutex_lock
> +	 *                                      # skip cpu0 (pagevec_add not visible yet)
> +	 *                                      mutex_unlock
> +	 *                                                           # fail because of pcp(0) pin
> +	 *                                                           queue_work_on(0)
> +	 *
> +	 * but the scheme is a best effort and the above race quite unlikely
> +	 * to matter in real life.
>  	 */
> -	__lru_add_drain_all(true);
> +	__lru_add_drain_all(count == 1);
>  #else
>  	lru_add_and_bh_lrus_drain();
>  #endif
> -- 
> Michal Hocko
> SUSE Labs
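
To make the window concrete for anyone following along, here is a
simplified sketch of the fast path being discussed. It is paraphrased from
mm/swap.c of this era, not the exact upstream code: the page is published
into the per-cpu pagevec in step 1) before lru_disable_count is read in
step 2), and the read itself carries no ordering.

/*
 * Simplified sketch, paraphrased from mm/swap.c -- not the exact
 * upstream code. The page is added to the per-cpu cache in step 1)
 * before the disable counter is read in step 2).
 */
static inline bool lru_cache_disabled(void)
{
	/* plain atomic read; no ordering against the pagevec_add() above */
	return atomic_read(&lru_disable_count);
}

static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
{
	bool ret = false;

	/* 1) publish the page into the per-cpu pagevec ... */
	if (!pagevec_add(pvec, page) || PageCompound(page) ||
	    /* 2) ... then check whether LRU caching is disabled */
	    lru_cache_disabled())
		ret = true;

	return ret;
}

If lru_cache_disable() on another CPU increments the counter between 1)
and 2), and its drain pass skips this CPU because the pagevec looked empty
when the drain was scheduled, the page stays in the per-cpu cache even
though caching is nominally disabled. That is the window the forced drain,
or the extra synchronization argued for above, is meant to close.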