Date: Tue, 25 Jan 2022 13:06:17 -0800
From: Minchan Kim
To: Michal Hocko
Cc: Andrew Morton, David Hildenbrand, linux-mm, LKML, Suren Baghdasaryan, John Dias
Subject: Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Content-Type: text/plain; charset=us-ascii

On Tue, Jan 25, 2022 at 10:23:13AM +0100, Michal Hocko wrote:
> On Mon 24-01-22 14:22:03, Minchan Kim wrote:
> [...]
> >        CPU 0                              CPU 1
> >
> >    lru_cache_disable                  lru_cache_disable
> >    ret = atomic_inc_return;(ret = 1)
> >
> >                                       ret = atomic_inc_return;(ret = 2)
> >
> >    lru_add_drain_all(true);
> >                                       lru_add_drain_all(false)
> >                                       mutex_lock() is holding
> >    mutex_lock() is waiting
> >
> >                                       IPI with !force_all_cpus
> >                                       ...
> >                                       ...
> >                                       IPI done but it skipped some CPUs
> >
> >    ..
> >    ..
> >
> > Thus, lru_cache_disable on CPU 1 doesn't run with every CPUs so it
> > introduces race of lru_disable_count so some pages on cores
> > which didn't run the IPI could accept upcoming pages into per-cpu
> > cache.
>
> Yes, that is certainly possible but the question is whether it really
> matters all that much. The race would require also another racer to be
> adding a page to an _empty_ pcp list at the same time.
>
> pagevec_add_and_need_flush
>   1) pagevec_add # add to pcp list
>   2) lru_cache_disabled
>        atomic_read(lru_disable_count) = 0
>   # no flush but the page is on pcp
>
> There is no strong memory ordering between 1 and 2 and that is why we
> need an IPI to enforce it in general IIRC.

Correct.

> But lru_cache_disable is not a strong synchronization primitive. It aims
> at providing a best effort means to reduce false positives, right? IMHO

Nope. See commit d479960e44f27 ("mm: disable LRU pagevec during the
migration temporarily").

Originally, it was designed to close the race fundamentally.

> it doesn't make much sense to aim for perfection because all users of
> this interface already have to live with temporary failures and pcp
> caches is not the only reason to fail - e.g. short lived page pins.

Short-lived page pins are real, but they are not a reason to give up on
making the allocation faster and more deterministic. As I mentioned, the
IPI can easily take up to hundreds of milliseconds once the CPUs are fully
booked. If we reduce that cost, the time can be spent on more productive
work. I am working on making CMA more deterministic, and this patch is one
part of that effort.

> That being said, I would rather live with a best effort and simpler
> implementation approach rather than aim for perfection in this case.
> The scheme is already quite complex and another lock in the mix doesn't

lru_add_drain_all already hides the whole complexity inside, and
lru_cache_disable adds a simple synchronization on top of it to keep the
ordering. That's a natural SW stack and I don't see much complication
here.

> make it any easier to follow. If others believe that another lock makes

Disagree. lru_cache_disable is designed to guarantee closing the race you
are opening again, so the rest of the allocator code doesn't have to
consider the race at all once the per-cpu cache is disabled. It's simpler
and more deterministic, and we could build other things on top of it
(e.g., zone->pcp).
> the implementation more straightforward I will not object but I would go
> with the following.
>
> diff --git a/mm/swap.c b/mm/swap.c
> index ae8d56848602..c140c3743b9e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>   */
>  void lru_cache_disable(void)
>  {
> -	atomic_inc(&lru_disable_count);
> +	int count = atomic_inc_return(&lru_disable_count);
> +
>  #ifdef CONFIG_SMP
>  	/*
>  	 * lru_add_drain_all in the force mode will schedule draining on
> @@ -931,8 +932,28 @@ void lru_cache_disable(void)
>  	 * The atomic operation doesn't need to have stronger ordering
>  	 * requirements because that is enforeced by the scheduling
>  	 * guarantees.
> +	 * Please note that there is a potential for a race condition:
> +	 * CPU0                               CPU1               CPU2
> +	 * pagevec_add_and_need_flush
> +	 *   pagevec_add # to the empty list
> +	 *   lru_cache_disabled
> +	 *     atomic_read # 0
> +	 *                                    lru_cache_disable  lru_cache_disable
> +	 *                                      atomic_inc_return (1)
> +	 *                                                         atomic_inc_return (2)
> +	 *                                      __lru_add_drain_all(true)
> +	 *                                                         __lru_add_drain_all(false)
> +	 *                                                           mutex_lock
> +	 *                                      mutex_lock
> +	 *                                      # skip cpu0 (pagevec_add not visible yet)
> +	 *                                      mutex_unlock
> +	 *                                                           # fail because of pcp(0) pin
> +	 *                                                           queue_work_on(0)
> +	 *
> +	 * but the scheme is a best effort and the above race quite unlikely
> +	 * to matter in real life.
>  	 */
> -	__lru_add_drain_all(true);
> +	__lru_add_drain_all(count == 1);
>  #else
>  	lru_add_and_bh_lrus_drain();
>  #endif
> -- 
> Michal Hocko
> SUSE Labs
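
To make the window concrete for anyone following along, here is a
simplified sketch of the fast path being discussed. It is paraphrased from
mm/swap.c of this era, not the exact upstream code: the page is published
into the per-cpu pagevec in step 1) before lru_disable_count is read in
step 2), and the read itself carries no ordering.

/*
 * Simplified sketch, paraphrased from mm/swap.c -- not the exact
 * upstream code. The page is added to the per-cpu cache in step 1)
 * before the disable counter is read in step 2).
 */
static inline bool lru_cache_disabled(void)
{
	/* plain atomic read; no ordering against the pagevec_add() above */
	return atomic_read(&lru_disable_count);
}

static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
{
	bool ret = false;

	/* 1) publish the page into the per-cpu pagevec ... */
	if (!pagevec_add(pvec, page) || PageCompound(page) ||
	    /* 2) ... then check whether LRU caching is disabled */
	    lru_cache_disabled())
		ret = true;

	return ret;
}

If lru_cache_disable() on another CPU increments the counter between 1)
and 2), and its drain pass skips this CPU because the pagevec looked empty
when the drain was scheduled, the page stays in the per-cpu cache even
though caching is nominally disabled. That is the window the forced drain,
or the extra synchronization argued for above, is meant to close.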