From: Hillf Danton <hdanton@sina.com>
To: Waiman Long, Eric Dumazet
Cc: linux-mm@kvack.org, linux-kernel
Subject: Re: [PATCH] locking/rwlocks: do not starve writers
Date: Sat, 18 Jun 2022 16:43:22 +0800
Message-Id: <20220618084322.81-1-hdanton@sina.com>
In-Reply-To: <980387bb-f46d-e9ab-96b0-7293df08c447@redhat.com>
References: <20220617091039.2257083-1-eric.dumazet@gmail.com>
	<2dd754f9-3a79-ed17-e423-6b411c3afb69@redhat.com>
	<2730b855-8f99-5a9e-707e-697d3bd9811d@redhat.com>
	<7499dd05-30d1-669c-66b4-5cb06452b476@redhat.com>

On Fri, 17 Jun 2022 14:57:55 -0400 Waiman Long wrote:
> On 6/17/22 13:45, Eric Dumazet wrote:
>> On Fri, Jun 17, 2022 at 7:42 PM Waiman Long wrote:
>>> On 6/17/22 11:24, Eric Dumazet wrote:
>>>> On Fri, Jun 17, 2022 at 5:00 PM Waiman Long wrote:
>>>>> On 6/17/22 10:57, Shakeel Butt wrote:
>>>>>> On Fri, Jun 17, 2022 at 7:43 AM Waiman Long wrote:
>>>>>>> On 6/17/22 08:07, Peter Zijlstra wrote:
>>>>>>>> On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
>>>>>>>>> --- a/kernel/locking/qrwlock.c
>>>>>>>>> +++ b/kernel/locking/qrwlock.c
>>>>>>>>> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
>>>>>>>>>  	/*
>>>>>>>>>  	 * Readers come here when they cannot get the lock without waiting
>>>>>>>>>  	 */
>>>>>>>>> -	if (unlikely(in_interrupt())) {
>>>>>>>>> -		/*
>>>>>>>>> -		 * Readers in interrupt context will get the lock immediately
>>>>>>>>> -		 * if the writer is just waiting (not holding the lock yet),
>>>>>>>>> -		 * so spin with ACQUIRE semantics until the lock is available
>>>>>>>>> -		 * without waiting in the queue.
>>>>>>>>> -		 */
>>>>>>>>> -		atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
>>>>>>>>> -		return;
>>>>>>>>> -	}
>>>>>>>>>  	atomic_sub(_QR_BIAS, &lock->cnts);
>>>>>>>>>
>>>>>>>>>  	trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
>>>>>>>> This is known to break tasklist_lock.
>>>>>>>>
>>>>>>> We certainly can't break the current usage of tasklist_lock.
>>>>>>>
>>>>>>> I am aware of this problem with the networking code and am thinking
>>>>>>> about either relaxing the check to exclude softirq or providing a
>>>>>>> read_lock_unfair() variant for networking use.
>>>>>> read_lock_unfair() for networking use or tasklist_lock use?
>>>>> I meant to say read_lock_fair(), but it could also be the other way
>>>>> around. Thanks for spotting that.
>>>>>
>>>> If only tasklist_lock is problematic and needs the unfair variant,
>>>> then changing a few read_lock() calls for tasklist_lock will be less
>>>> invasive than ~1000 read_lock() calls elsewhere....
>>> After a second thought, I think the right way is to introduce a fair
>>> variant, if needed. If an arch isn't using qrwlock, the native rwlock
>>> implementation will be unfair. In that sense, unfair rwlock is the
>>> default. We would only need to change the relevant network read_lock()
>>> calls to use the fair variant, which will still be unfair if qrwlock
>>> isn't used. We are not going to touch other read_lock() calls that
>>> don't care about fairness.
>>>
>> Hmm... backporting this kind of invasive change to stable kernels will
>> be a daunting task.
>>
>> Were rwlocks always unfair, and have we just been lucky?
>>
> Yes, rwlocks were always unfair and always had this kind of soft lockup
> problem, plus a scalability problem because of cacheline bouncing. That
> was the reason for creating qrwlock, which can at least provide a fair
> rwlock in task context. Now we have systems with more and more CPUs,
> and that is why you are seeing it all over again with the networking
> code.

No fair play without paying the price. If the writer wants to play a fair
game with readers, it has to wait in the queue for at least a tick and only
then force readers to take the slow path by setting _QW_WAITING.

Only for thoughts now.
Hillf

+++ b/kernel/locking/qrwlock.c
@@ -22,16 +22,20 @@ void queued_read_lock_slowpath(struct qr
 	/*
 	 * Readers come here when they cannot get the lock without waiting
 	 */
+	if (_QW_WAITING & atomic_read(&lock->cnts))
+		goto queue;
+
 	if (unlikely(in_interrupt())) {
 		/*
 		 * Readers in interrupt context will get the lock immediately
-		 * if the writer is just waiting (not holding the lock yet),
+		 * if the writer is not waiting long enough,
 		 * so spin with ACQUIRE semantics until the lock is available
 		 * without waiting in the queue.
 		 */
 		atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
 		return;
 	}
+queue:
 	atomic_sub(_QR_BIAS, &lock->cnts);

 	/*
@@ -60,6 +64,8 @@ EXPORT_SYMBOL(queued_read_lock_slowpath)
  */
 void queued_write_lock_slowpath(struct qrwlock *lock)
 {
+	unsigned long start;
+	int qw;
 	int cnts;

 	/* Put the writer into the wait queue */
@@ -70,12 +76,20 @@ void queued_write_lock_slowpath(struct q
 	    atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))
 		goto unlock;

-	/* Set the waiting flag to notify readers that a writer is pending */
-	atomic_or(_QW_WAITING, &lock->cnts);
-
+	start = jiffies;
+	qw = 0;
 	/* When no more readers or writers, set the locked flag */
 	do {
-		cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
+		if (qw == 0 && start + 2 < jiffies) {
+			qw = _QW_WAITING;
+			/*
+			 * Set the waiting flag to notify readers that a writer is pending
+			 * only after waiting long enough - that is the price writer pays
+			 * for fairness
+			 */
+			atomic_or(_QW_WAITING, &lock->cnts);
+		}
+		cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == qw);
 	} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));
 unlock:
 	arch_spin_unlock(&lock->wait_lock);
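
Purely as an illustration of the trade-off, below is a minimal userspace model
of the same "delayed pending bit" idea, written with C11 atomics rather than
the kernel's qrwlock machinery. Everything in it is made up for the sketch -
the model_rwlock type, the QW_*/QR_BIAS values, the nanosecond grace period
standing in for jiffies - and writer-vs-writer queueing (the kernel's
wait_lock) is left out entirely, so treat it as a sketch of the idea, not a
drop-in for the diff above.

/*
 * Userspace model only; none of these names exist in the kernel.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define QW_LOCKED	0x1U	/* a writer holds the lock */
#define QW_WAITING	0x2U	/* a writer is queued and has gone public */
#define QR_BIAS		0x4U	/* one reader reference */

struct model_rwlock {
	atomic_uint cnts;
};

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Reader fast path: back off only once a writer has published QW_WAITING. */
static int model_read_trylock(struct model_rwlock *l)
{
	unsigned int c = atomic_fetch_add(&l->cnts, QR_BIAS) + QR_BIAS;

	if (!(c & (QW_LOCKED | QW_WAITING)))
		return 1;
	atomic_fetch_sub(&l->cnts, QR_BIAS);	/* undo; caller retries or sleeps */
	return 0;
}

static void model_read_unlock(struct model_rwlock *l)
{
	atomic_fetch_sub(&l->cnts, QR_BIAS);
}

/* Writer: spin quietly for grace_ns before raising QW_WAITING. */
static void model_write_lock(struct model_rwlock *l, long long grace_ns)
{
	long long start = now_ns();
	unsigned int pending = 0;

	for (;;) {
		unsigned int expected = pending;

		/* No readers and no other writer: grab the lock. */
		if (atomic_compare_exchange_weak(&l->cnts, &expected,
						 pending | QW_LOCKED))
			return;

		/* Grace period over: publish the pending bit exactly once. */
		if (!pending && now_ns() - start > grace_ns) {
			pending = QW_WAITING;
			atomic_fetch_or(&l->cnts, QW_WAITING);
		}
	}
}

static void model_write_unlock(struct model_rwlock *l)
{
	atomic_fetch_and(&l->cnts, ~(QW_LOCKED | QW_WAITING));
}

int main(void)
{
	struct model_rwlock lock;

	atomic_init(&lock.cnts, 0);

	if (model_read_trylock(&lock)) {
		printf("reader took the fast path\n");
		model_read_unlock(&lock);
	}

	model_write_lock(&lock, 1000000);	/* ~1ms grace, standing in for a tick */
	printf("writer holds the lock\n");
	model_write_unlock(&lock);
	return 0;
}

The point is the same one the diff makes: while the writer is still inside its
grace period it has not set the pending bit, so concurrent readers keep hitting
the fast path exactly as interrupt-context readers do today; only after the
writer has paid its waiting time do new readers start backing off behind it.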