From: Hillf Danton <hdanton@sina.com>
To: "Ahmed S. Darwish"
Cc: LKML, Konstantin Khlebnikov, linux-mm@kvack.org, Hillf Danton
Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API
Date: Wed, 20 May 2020 19:24:30 +0800
Message-Id: <20200520112430.9044-1-hdanton@sina.com>
In-Reply-To: <20200519214547.352050-3-a.darwish@linutronix.de>
References: <20200519214547.352050-1-a.darwish@linutronix.de>

On Tue, 19 May 2020 23:45:24 +0200 "Ahmed S. Darwish" wrote:
>
> Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> implemented an optimization mechanism to exit the to-be-started LRU
> drain operation (name it A) if another drain operation *started and
> finished* while (A) was blocked on the LRU draining mutex.
>
> This was done through a seqcount latch, which is an abuse of its
> semantics:
>
> 1. Seqcount latching should be used for the purpose of switching
>    between two storage places with sequence protection to allow
>    interruptible, preemptible writer sections. The optimization
>    mechanism has absolutely nothing to do with that.
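Purely for illustration, not part of the patch: the latching that point 1
refers to keeps two copies of the data and uses the low bit of the sequence
count to steer readers to the copy that is not currently being rewritten.
A rough sketch along the lines of the kerneldoc example for
raw_write_seqcount_latch(); the type and function names below are made up:

#include <linux/seqlock.h>
#include <linux/types.h>

struct payload {
	u64 a;
	u64 b;
};

struct latched {
	seqcount_t	seq;		/* initialised with SEQCNT_ZERO() */
	struct payload	copies[2];
};

/* Writer: preemptible, never blocks readers; each copy is rewritten in turn. */
static void latched_update(struct latched *l, const struct payload *new)
{
	raw_write_seqcount_latch(&l->seq);	/* seq odd: readers now use copies[1] */
	l->copies[0] = *new;
	raw_write_seqcount_latch(&l->seq);	/* seq even: readers now use copies[0] */
	l->copies[1] = *new;
}

/* Reader: pick the copy not being rewritten; retry if it changed under us. */
static struct payload latched_read(struct latched *l)
{
	struct payload snap;
	unsigned int seq;

	do {
		seq = raw_read_seqcount_latch(&l->seq);
		snap = l->copies[seq & 1];
	} while (read_seqcount_retry(&l->seq, seq));

	return snap;
}

None of that two-copy machinery is needed by the draining optimization,
which is the point the changelog makes.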
>
> 2. The used raw_write_seqcount_latch() has two smp write memory
>    barriers to always insure one consistent storage place out of the
>    two storage places available. This extra smp_wmb() is redundant for
>    the optimization use case.
>
> Beside the API abuse, the semantics of a latch sequence counter was
> force fitted into the optimization. What was actually meant is to track
> generations of LRU draining operations, where "current lru draining
> generation = x" implies that all generations 0 < n <= x are already
> *scheduled* for draining.
>
> Remove the conceptually-inappropriate seqcount latch usage and manually
> implement the optimization using a counter and SMP memory barriers.
>
> Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
> Signed-off-by: Ahmed S. Darwish
> ---
>  mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bf9a79fed62d..d6910eeed43d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   */
>  void lru_add_drain_all(void)
>  {
> -	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
> -	static DEFINE_MUTEX(lock);
> +	/*
> +	 * lru_drain_gen - Current generation of pages that could be in vectors
> +	 *
> +	 * (A) Definition: lru_drain_gen = x implies that all generations
> +	 *     0 < n <= x are already scheduled for draining.
> +	 *
> +	 * This is an optimization for the highly-contended use case where a
> +	 * user space workload keeps constantly generating a flow of pages
> +	 * for each CPU.
> +	 */
> +	static unsigned int lru_drain_gen;
>  	static struct cpumask has_work;
> -	int cpu, seq;
> +	static DEFINE_MUTEX(lock);
> +	int cpu, this_gen;
>
>  	/*
>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>  	if (WARN_ON(!mm_percpu_wq))
>  		return;
>
> -	seq = raw_read_seqcount_latch(&seqcount);
> +	/*
> +	 * (B) Cache the LRU draining generation number
> +	 *
> +	 * smp_rmb() ensures that the counter is loaded before the mutex is
> +	 * taken. It pairs with the smp_wmb() inside the mutex critical section
> +	 * at (D).
> +	 */
> +	this_gen = READ_ONCE(lru_drain_gen);
> +	smp_rmb();
>
>  	mutex_lock(&lock);
>
>  	/*
> -	 * Piggyback on drain started and finished while we waited for lock:
> -	 * all pages pended at the time of our enter were drained from vectors.
> +	 * (C) Exit the draining operation if a newer generation, from another
> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>  	 */
> -	if (__read_seqcount_retry(&seqcount, seq))
> +	if (unlikely(this_gen != lru_drain_gen))
>  		goto done;
>
> -	raw_write_seqcount_latch(&seqcount);
> +	/*
> +	 * (D) Increment generation number
> +	 *
> +	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
> +	 * section.
> +	 *
> +	 * This pairing must be done here, before the for_each_online_cpu loop
> +	 * below which drains the page vectors.
> +	 *
> +	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
> +	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
> +	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
> +	 * along, adds some pages to its per-cpu vectors, then calls
> +	 * lru_add_drain_all().
> +	 *
> +	 * If the paired smp_wmb() below is done at any later step, e.g. after
> +	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
> +	 * its added pages.
> +	 */
> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> +	smp_wmb();

Writer is inside mutex.
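For reference, stripped of the per-CPU work queueing, the scheme above boils
down to the shape below. This is only an illustrative condensation of the
quoted hunk; queue_and_flush_drain_works() is a made-up stand-in for the
for_each_online_cpu()/flush_work() loop:

#include <linux/compiler.h>
#include <linux/mutex.h>
#include <asm/barrier.h>

static unsigned int gen;		/* plays the role of lru_drain_gen */
static DEFINE_MUTEX(drain_lock);

/* Stand-in for queueing lru_add_drain_per_cpu() work and flushing it. */
static void queue_and_flush_drain_works(void)
{
}

static void drain_all_sketch(void)
{
	unsigned int this_gen;

	/* (B) sample the generation before taking the mutex */
	this_gen = READ_ONCE(gen);
	smp_rmb();

	mutex_lock(&drain_lock);

	/* (C) a newer generation already scheduled the drains we need */
	if (this_gen != gen) {
		mutex_unlock(&drain_lock);
		return;
	}

	/* (D) open a new generation before any vector is drained */
	WRITE_ONCE(gen, gen + 1);
	smp_wmb();

	queue_and_flush_drain_works();

	mutex_unlock(&drain_lock);
}

The reader side (B) runs before the mutex is acquired, while the writer side
(D) runs with the mutex held, which is what the remark above points at.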
>
>  	cpumask_clear(&has_work);
> -
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> @@ -766,7 +803,7 @@
>  {
>  	lru_add_drain();
>  }
> -#endif
> +#endif /* CONFIG_SMP */
>
>  /**
>   * release_pages - batched put_page()
> --
> 2.20.1

The tiny diff below, inspired by this work, tries to take the mutex with
mutex_trylock() by switching the sequence counter's readers and writer
across the mutex boundary.

--- a/mm/swap.c
+++ b/mm/swap.c
@@ -714,10 +714,12 @@ static void lru_add_drain_per_cpu(struct
  */
 void lru_add_drain_all(void)
 {
-	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
+	static unsigned int lru_drain_gen;
 	static DEFINE_MUTEX(lock);
 	static struct cpumask has_work;
-	int cpu, seq;
+	int cpu;
+	int loop = 0;
+	unsigned int seq;

 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -726,18 +728,12 @@ void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;

-	seq = raw_read_seqcount_latch(&seqcount);
+	seq = lru_drain_gen++;
+	smp_mb();

-	mutex_lock(&lock);
-
-	/*
-	 * Piggyback on drain started and finished while we waited for lock:
-	 * all pages pended at the time of our enter were drained from vectors.
-	 */
-	if (__read_seqcount_retry(&seqcount, seq))
-		goto done;
-
-	raw_write_seqcount_latch(&seqcount);
+	if (!mutex_trylock(&lock))
+		return;
+more_work:

 	cpumask_clear(&has_work);

@@ -759,7 +755,11 @@ void lru_add_drain_all(void)
 	for_each_cpu(cpu, &has_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));

-done:
+	smp_mb();
+	if (++seq != lru_drain_gen && loop++ < 128)
+		goto more_work;
+
+	smp_mb();
 	mutex_unlock(&lock);
 }
 #else
--