From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7327FD2A520 for ; Wed, 16 Oct 2024 14:23:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 061B66B008A; Wed, 16 Oct 2024 10:23:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 011786B008C; Wed, 16 Oct 2024 10:23:21 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DF40F6B0092; Wed, 16 Oct 2024 10:23:21 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id C17466B008A for ; Wed, 16 Oct 2024 10:23:21 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 53C47160548 for ; Wed, 16 Oct 2024 14:23:10 +0000 (UTC) X-FDA: 82679682792.05.E48E750 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf26.hostedemail.com (Postfix) with ESMTP id A590014000F for ; Wed, 16 Oct 2024 14:23:11 +0000 (UTC) Authentication-Results: imf26.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=QmOYTc3L; spf=pass (imf26.hostedemail.com: domain of llong@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=llong@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1729088453; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=R/KW/XoaEe9YPin6L1l6dmEf2XR0ptPwsvepwftJ/SU=; b=W+ipi9c9X775FFiRXHgkcLCbWx7D2XXJ21H/o2rt0EPC2l1QKU9UFnNUT6mhVJwPi55DP8 wyqyLkRuhEGrh/ZfbGu1L5CnWGLv72QihymVXzUeA9oOLioQWohrFSME0HeAa+XGRQ86k5 VP8gJVozUZ72qJCazI0+lwURE+I+UiY= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1729088453; a=rsa-sha256; cv=none; b=eAaRW3atZpW7J0SUsSfXch0HHUU909j1qq/W07cmEXjfszRoslnIedrmm7N8+h1rO59MKB ZrT7suGucvvwhKqsIqyDzd4HVQs8gnPPN/9SsWmo3gbNtdj9wE2phmHua2qW9oO/UuEVLI SUwso8O4Iazku459lXWsLGScpPdwJw0= ARC-Authentication-Results: i=1; imf26.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=QmOYTc3L; spf=pass (imf26.hostedemail.com: domain of llong@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=llong@redhat.com; dmarc=pass (policy=none) header.from=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1729088597; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=R/KW/XoaEe9YPin6L1l6dmEf2XR0ptPwsvepwftJ/SU=; b=QmOYTc3LZO7pRqZXcdRQUZK3rAq6lkmiD1WjDUF+vsbkmX7Ix2JoVWH1BHBrY3vqGO1Fnw CefBmLgEkjUUMNaAPAON7E+H8l8JDN+dr+n5hB0RFsRQJcDhJ1/I425aIZQtys36Y6mRRg gQ6Naep53wx3b1ZhzBgDhq5PIBeJPbQ= Received: from mail-qv1-f69.google.com (mail-qv1-f69.google.com [209.85.219.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-607-qA1BOVRvOJePuWFLeuYY3A-1; Wed, 16 Oct 2024 10:23:16 -0400 X-MC-Unique: qA1BOVRvOJePuWFLeuYY3A-1 Received: by mail-qv1-f69.google.com with SMTP id 6a1803df08f44-6cbe50b415eso105597196d6.3 for ; Wed, 16 Oct 2024 07:23:16 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1729088596; x=1729693396; h=content-transfer-encoding:in-reply-to:content-language:references :cc:to:subject:user-agent:mime-version:date:message-id:from :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=R/KW/XoaEe9YPin6L1l6dmEf2XR0ptPwsvepwftJ/SU=; b=rgj0XIspfPIC51K8Z8t3f66mVJmzmduuF2dFk12sEhlcwVY/UvufRPNqqfCzVM/fjn JJXO86eNsB3JovEnFIIR+7p00iUca/lCmXCvdffJ0vXP7hifmUltZBET8FeeknO1b+R5 9wSB9uV1I/HnxVSmTv1d198CaZpWlYN7VzcuVcR9Psc8q6xFbw+1j/MggMzWIvlKvfAj 6JfPx+iKKLYnm76LjkERhaJ2y7K5bQALmtH3gBxB5HU0FyFMQnEyr/YS80zw5kSRfC2W XEzGHr/X5EAs1t5ruYgzbjFzmD/+/OMZQ87JY+Jc7rxRnyH+bxNw/rpym75tmnmzI33H 88tw== X-Forwarded-Encrypted: i=1; AJvYcCWx/jMWdWzwyfJ45jL+9QnX/WJvBISVx2+Sueva2mWBOOHC34vHJEK3ZzpkEYWehr1RKYUUar7ojQ==@kvack.org X-Gm-Message-State: AOJu0YzuB3NOQX7/OvXXs7sZtL0S64zBR94Ny4bTCW3GSq5qWr+2xi+5 rX43+Z4csNyaEMSKk/woErXMncDA+q54TYMBAldsi94ZBen6XY6oALxSshdX77yfWK4hO+vX/bj yPhyVoaqVIOsJHJLxzjG6Q69AmTrnnw+RFuc/Ucl9SJxEKtp9 X-Received: by 2002:a05:6214:2f10:b0:6c3:5496:3e06 with SMTP id 6a1803df08f44-6cbefffdfe6mr302613926d6.10.1729088595814; Wed, 16 Oct 2024 07:23:15 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGlmSxZTKIDajxln309baSOeFhzJ8LjphoJlbw+lLiJL6M7eRNZKov97jwPhJoCscR7o086vw== X-Received: by 2002:a05:6214:2f10:b0:6c3:5496:3e06 with SMTP id 6a1803df08f44-6cbefffdfe6mr302613556d6.10.1729088595418; Wed, 16 Oct 2024 07:23:15 -0700 (PDT) Received: from ?IPV6:2601:188:ca00:a00:f844:fad5:7984:7bd7? ([2601:188:ca00:a00:f844:fad5:7984:7bd7]) by smtp.gmail.com with ESMTPSA id 6a1803df08f44-6cc2295b8fbsm18467386d6.75.2024.10.16.07.23.14 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 16 Oct 2024 07:23:14 -0700 (PDT) From: Waiman Long X-Google-Original-From: Waiman Long Message-ID: <7f7b277a-7019-4bf4-b100-0505c6ce9737@redhat.com> Date: Wed, 16 Oct 2024 10:23:14 -0400 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC 1/2] rwsem: introduce upgrade_read interface To: lizhe.67@bytedance.com, peterz@infradead.org, mingo@redhat.com, will@kernel.org, boqun.feng@gmail.com, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20241016043600.35139-1-lizhe.67@bytedance.com> <20241016043600.35139-2-lizhe.67@bytedance.com> In-Reply-To: <20241016043600.35139-2-lizhe.67@bytedance.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Stat-Signature: 6n4sxg1uq8j9s1cir1dtrz54t7uakqhp X-Rspamd-Queue-Id: A590014000F X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1729088591-598984 X-HE-Meta: U2FsdGVkX1+CyGKBJrpz4jQihTKxz//59rUYSo4rzpGSpFUEbsemkLX1e87K1We0nwtTbQb1FoU+93y2eKno5wQp0prD7aEGHIIqh43qKlElLEMsGS1dHPJlYvcV99L+vMrua1Snemah6rlQz6bHZjmCt5y/93NDURAdKVRl7xP89NSY3G+W863vY+6Zu9HIK14O2HdCZ0iWADQgAWkwZRsFTZxiI1iBOX2QNlkKebbtMIaokCOZlIJt4SvBy0jEpFo/gfi81m5GV5+8p6QlrloFfFkdy3iLNyjq+k58GGOq3+kVJ5ngKg/oIPDBDBmdOVcSxicWJ5O4zqmvIpilZ42z7CGFKZeCVOEWB/A36SOkTNUoksiI12RwrPxyGhO5DsF31leEiefDMDa6DLsBIINQq3J9f28+7IhKU8qCOcblrORTY9QjnbisTAFP0wrolf9q6jixWJnYzKR2QZayWJT7XU2bsc7qFT8EJgbfE6IRBbaEq7F0/z56l6w2SecApeAinL/N7ht5LUzqgnf2y3cEglbXjUxvVzXhwR7Cg19eIKu0pjmGn5dAlDHWZXcnjod1NRmZk4Fy91h9LniByGzzIHsO+VBm2WV9CGZee5Fkxgmh/HEf1MzF89dnBfPPQD898KiEPMOSVKPdhignuuMvoF/QX6wrEBSfpqjYBoR/RcA8t48fmORd0NzpWI5AYYHCPqEn8bhPR5rilOWPXTznriBcl6QdR771r2SjKWo50g2hhpsMeLGEYCXLoxgJ+2VXCLuXh98m4T++mgAoIJVlmVpXBhG0aLTzoX+dKwpC83h17tmXe+Nf+MEP69HDlt0Guq7weW/wy4+ehmPhXCQQmsFK1s8Tscq9WHvK8o6ncVs5Imx5ytd97cJaykpojnFJv4kVMTo+zvQN0wncb3WHncZvH+AS3jgbSEzXnCW9+xkGDvFSUoGY8BPg9ppYDuuEyFq7kUWcWdIxx/0 3k2JDNVw DFBUSk3mLJJWRWp8AoHZ+/z4h3bVDnws+kxJsnK7vcpQpL9Gt5u4EJeO3ZkoIcc57TxtwgI3IDKn0orMpg2+a/dFrBSOJIHqIvA8KYgQozt6QG9rdgmg6+8yzYHtY/dON2/ltryUj+/ed5ALOzjMBWfEVPJE/slPFnK/3wVz/2v/lhiBY4iEPQpe009V6prUVNdRmWiC6FiOr3myS+MYeqSPEBPRDO6AiRp4g5whBFJkavBxidPGk8HTc4yV3Tn63+WFZtJuRDUbdJZ8wUhI4Fi/cDFYQ/RaWYCGZwEVBXnE/dgy7sTvTCHUfQuyItk338vNlBe8fduxxrCFLZjzloNg6E5/3WGnvC8B8KAwGctvWnv2vUFVRKmk5PwTUOtbkY6+cJ8SF65Wt/TTUeI3xLzQM73X3TxHnkw1qcQutU2Qmk4sb0p5Q1s8bCt78an20HF3yQjMoOYMdvjBWEMNbDo5LyutEA22l4t1zA7TWXJZcVL51ZXsJ00ayZfofMEtDVMqRW+eQC6gS/BG4WsWgao2B/CobQXj4MWpUhj416LMbPF9qSuXgM0B13T2p3hCmpwQrTtC9uL7MjZE= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 10/16/24 12:35 AM, lizhe.67@bytedance.com wrote: > From: Li Zhe > > Introduce a new rwsem interface upgrade_read(). We can call it > to upgrade the lock into write rwsem lock after we get read lock. > This interface will wait for all readers to exit before obtaining > the write lock. In addition, this interface has a higher priority > than any process waiting for the write lock and subsequent threads > that want to obtain the read lock. > > Signed-off-by: Li Zhe > --- > include/linux/rwsem.h | 1 + > kernel/locking/rwsem.c | 87 ++++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 85 insertions(+), 3 deletions(-) > > diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h > index c8b543d428b0..90183ab5ea79 100644 > --- a/include/linux/rwsem.h > +++ b/include/linux/rwsem.h > @@ -249,6 +249,7 @@ DEFINE_GUARD_COND(rwsem_write, _try, down_write_trylock(_T)) > * downgrade write lock to read lock > */ > extern void downgrade_write(struct rw_semaphore *sem); > +extern int upgrade_read(struct rw_semaphore *sem); > > #ifdef CONFIG_DEBUG_LOCK_ALLOC > /* > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c > index 2bbb6eca5144..0583e1be3dbf 100644 > --- a/kernel/locking/rwsem.c > +++ b/kernel/locking/rwsem.c > @@ -37,6 +37,7 @@ > * meanings when set. > * - Bit 0: RWSEM_READER_OWNED - rwsem may be owned by readers (just a hint) > * - Bit 1: RWSEM_NONSPINNABLE - Cannot spin on a reader-owned lock > + * - Bit 2: RWSEM_UPGRADING - doing upgrade read process > * > * When the rwsem is reader-owned and a spinning writer has timed out, > * the nonspinnable bit will be set to disable optimistic spinning. > @@ -62,7 +63,8 @@ > */ > #define RWSEM_READER_OWNED (1UL << 0) > #define RWSEM_NONSPINNABLE (1UL << 1) > -#define RWSEM_OWNER_FLAGS_MASK (RWSEM_READER_OWNED | RWSEM_NONSPINNABLE) > +#define RWSEM_UPGRADING (1UL << 2) > +#define RWSEM_OWNER_FLAGS_MASK (RWSEM_READER_OWNED | RWSEM_NONSPINNABLE | RWSEM_UPGRADING) > > #ifdef CONFIG_DEBUG_RWSEMS > # define DEBUG_RWSEMS_WARN_ON(c, sem) do { \ > @@ -93,7 +95,8 @@ > * Bit 0 - writer locked bit > * Bit 1 - waiters present bit > * Bit 2 - lock handoff bit > - * Bits 3-7 - reserved > + * Bit 3 - upgrade read bit > + * Bits 4-7 - reserved > * Bits 8-30 - 23-bit reader count > * Bit 31 - read fail bit > * > @@ -117,6 +120,7 @@ > #define RWSEM_WRITER_LOCKED (1UL << 0) > #define RWSEM_FLAG_WAITERS (1UL << 1) > #define RWSEM_FLAG_HANDOFF (1UL << 2) > +#define RWSEM_FLAG_UPGRADE_READ (1UL << 3) > #define RWSEM_FLAG_READFAIL (1UL << (BITS_PER_LONG - 1)) > > #define RWSEM_READER_SHIFT 8 > @@ -143,6 +147,13 @@ static inline void rwsem_set_owner(struct rw_semaphore *sem) > atomic_long_set(&sem->owner, (long)current); > } > > +static inline void rwsem_set_owner_upgrade(struct rw_semaphore *sem) > +{ > + lockdep_assert_preemption_disabled(); > + atomic_long_set(&sem->owner, (long)current | RWSEM_UPGRADING | > + RWSEM_READER_OWNED | RWSEM_NONSPINNABLE); > +} Because of possible  racing between 2 competing upgraders, read lock owner setting has to be atomic to avoid one overwriting the others. > + > static inline void rwsem_clear_owner(struct rw_semaphore *sem) > { > lockdep_assert_preemption_disabled(); > @@ -201,7 +212,7 @@ static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem) > */ > long count = atomic_long_read(&sem->count); > > - if (count & RWSEM_WRITER_MASK) > + if ((count & RWSEM_WRITER_MASK) && !(count & RWSEM_FLAG_UPGRADE_READ)) > return false; > return rwsem_test_oflags(sem, RWSEM_READER_OWNED); > } > @@ -1336,6 +1347,8 @@ static inline int __down_write_trylock(struct rw_semaphore *sem) > static inline void __up_read(struct rw_semaphore *sem) > { > long tmp; > + unsigned long flags; > + struct task_struct *owner; > > DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem); > DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem); > @@ -1349,6 +1362,9 @@ static inline void __up_read(struct rw_semaphore *sem) > clear_nonspinnable(sem); > rwsem_wake(sem); > } > + owner = rwsem_owner_flags(sem, &flags); > + if (unlikely(!(tmp & RWSEM_READER_MASK) && (flags & RWSEM_UPGRADING))) > + wake_up_process(owner); > preempt_enable(); > } > > @@ -1641,6 +1657,71 @@ void downgrade_write(struct rw_semaphore *sem) > } > EXPORT_SYMBOL(downgrade_write); > > +static inline void rwsem_clear_upgrade_flag(struct rw_semaphore *sem) > +{ > + atomic_long_andnot(RWSEM_FLAG_UPGRADE_READ, &sem->count); > +} > + > +/* > + * upgrade read lock to write lock > + */ > +static inline int __upgrade_read(struct rw_semaphore *sem) > +{ > + long tmp; > + > + preempt_disable(); > + > + tmp = atomic_long_read(&sem->count); > + do { > + if (tmp & (RWSEM_WRITER_MASK | RWSEM_FLAG_UPGRADE_READ)) { > + preempt_enable(); > + return -EBUSY; > + } > + } while (!atomic_long_try_cmpxchg(&sem->count, &tmp, > + tmp + RWSEM_FLAG_UPGRADE_READ + RWSEM_WRITER_LOCKED - RWSEM_READER_BIAS)); > + > + if ((tmp & RWSEM_READER_MASK) == RWSEM_READER_BIAS) { > + /* fast path */ > + DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem); > + rwsem_clear_upgrade_flag(sem); > + rwsem_set_owner(sem); > + preempt_enable(); > + return 0; > + } > + /* slow path */ > + raw_spin_lock_irq(&sem->wait_lock); > + rwsem_set_owner_upgrade(sem); > + > + set_current_state(TASK_UNINTERRUPTIBLE); > + > + for (;;) { > + if (!(atomic_long_read(&sem->count) & RWSEM_READER_MASK)) > + break; > + raw_spin_unlock_irq(&sem->wait_lock); > + schedule_preempt_disabled(); > + set_current_state(TASK_UNINTERRUPTIBLE); > + raw_spin_lock_irq(&sem->wait_lock); > + } > + > + rwsem_clear_upgrade_flag(sem); > + rwsem_set_owner(sem); > + __set_current_state(TASK_RUNNING); > + raw_spin_unlock_irq(&sem->wait_lock); > + preempt_enable(); > + return 0; > +} > + > +/* > + * upgrade read lock to write lock > + * > + * Return: 0 on success, error code on failure > + */ > +int upgrade_read(struct rw_semaphore *sem) > +{ > + return __upgrade_read(sem); > +} > +EXPORT_SYMBOL(upgrade_read); This new interface should have an API similar to a trylock. True if successful, false otherwise. I like the read_try_upgrade() name. Another alternative that I have been thinking about is a down_read() variant with intention to upgrade later. This will ensure that only one active reader is allowed to upgrade later. With this, upgrade_read() will always succeed, maybe with some sleeping, as long as the correct down_read() is used. Cheers, Longman