Date: Thu, 6 Feb 2025 16:19:36 +0000
From: Yosry Ahmed
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
References: <20250131090658.3386285-1-senozhatsky@chromium.org>
 <20250131090658.3386285-15-senozhatsky@chromium.org>
 <6vtpamir4bvn3snlj36tfmnmpcbd6ks6m3sdn7ewmoles7jhau@nbezqbnoukzv>
 <6uhsj4bckhursiblkxe54azfgyqal6tq2de3lpkxw6omkised6@uylodcjruuei>
In-Reply-To: <6uhsj4bckhursiblkxe54azfgyqal6tq2de3lpkxw6omkised6@uylodcjruuei>

On Thu, Feb 06, 2025 at 12:05:55PM +0900, Sergey Senozhatsky wrote:
> On (25/02/05 19:06), Yosry Ahmed wrote:
> > > > For example, the compaction/migration code could be sleeping holding the
> > > > write lock, and a map() call would spin waiting for that sleeping task.
> > >
> > > write-lock holders cannot sleep, that's the key part.
> > >
> > > So the rules are:
> > >
> > > 1) writer cannot sleep
> > >    - migration/compaction runs in atomic context and grabs the
> > >      write-lock only from atomic context
> > >    - the write-locking function disables preemption before lock(), just
> > >      to be safe, and enables it after unlock()
> > >
> > > 2) writer does not spin waiting
> > >    - that's why there is only a write_try_lock function
> > >    - compaction and migration bail out when they cannot lock the
> > >      zspage
> > >
> > > 3) readers can sleep and can spin waiting for a lock
> > >    - other (even preempted) readers don't block new readers
> > >    - writers don't sleep, they always unlock
> >
> > That's useful, thanks. If we go with custom locking we need to document
> > this clearly and add debug checks where possible.
>
> Sure. That's what it currently looks like (can always improve)
>
> ---
> /*
>  * The zspage lock permits preemption on the reader side (there can be
>  * multiple readers). Writers (exclusive zspage ownership), on the other
>  * hand, always run in atomic context and cannot spin waiting for a
>  * (potentially preempted) reader to unlock the zspage. This, basically,
>  * means that writers can only call write-try-lock and must bail out if
>  * it didn't succeed.
>  *
>  * At the same time, writers cannot reschedule under the zspage write-lock,
>  * so readers can spin waiting for the writer to unlock the zspage.
>  */
> static void zspage_read_lock(struct zspage *zspage)
> {
>         atomic_t *lock = &zspage->lock;
>         int old = atomic_read_acquire(lock);
>
>         do {
>                 if (old == ZS_PAGE_WRLOCKED) {
>                         cpu_relax();
>                         old = atomic_read_acquire(lock);
>                         continue;
>                 }
>         } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> #endif
> }
>
> static void zspage_read_unlock(struct zspage *zspage)
> {
>         atomic_dec_return_release(&zspage->lock);
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         rwsem_release(&zspage->lockdep_map, _RET_IP_);
> #endif
> }
>
> static bool zspage_try_write_lock(struct zspage *zspage)
> {
>         atomic_t *lock = &zspage->lock;
>         int old = ZS_PAGE_UNLOCKED;
>
>         preempt_disable();
>         if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>                 rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
> #endif
>                 return true;
>         }
>
>         preempt_enable();
>         return false;
> }
>
> static void zspage_write_unlock(struct zspage *zspage)
> {
>         atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>         rwsem_release(&zspage->lockdep_map, _RET_IP_);
> #endif
>         preempt_enable();
> }
> ---
>
> Maybe I'll just copy-paste the locking rules list, a list is always cleaner.

Thanks. I think it would be nice if we could also get someone with locking
expertise to take a look at this.
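Also, to make sure I'm reading rule 2 correctly: the writer side would
always look roughly like the sketch below, right? (The function name and
the -EAGAIN convention here are just my assumptions, not the actual
migration code.)

---
/*
 * Hypothetical writer-side caller, illustrating rule 2: the writer
 * never spins on the zspage lock, it try-locks and bails out, so it
 * cannot be stalled by a (possibly preempted) reader.
 */
static int zs_migrate_zspage(struct zspage *zspage)
{
        /* zspage_try_write_lock() disables preemption on success */
        if (!zspage_try_write_lock(zspage))
                return -EAGAIN; /* bail out; the caller retries or skips */

        /* ... move objects and remap pages, all in atomic context ... */

        zspage_write_unlock(zspage); /* re-enables preemption */
        return 0;
}
---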
> > > > I wonder if there's a way to rework the locking instead to avoid the
> > > > nesting. It seems like sometimes we lock the zspage with the pool lock
> > > > held, sometimes with the class lock held, and sometimes with no lock
> > > > held.
> > > >
> > > > What are the rules here for acquiring the zspage lock?
> > >
> > > Most of that code is not written by me, but I think the rule is to disable
> > > "migration", be it via the pool lock or the class lock.
> >
> > It seems like we're not holding either of these locks in
> > async_free_zspage() when we call lock_zspage(). Is it safe for a
> > different reason?
>
> I think we hold the size class lock there. async-free is only for zspages
> that reached a 0 usage ratio (empty fullness group), so they don't hold
> any objects any more, and from here such zspages either get freed or
> find_get_zspage() recovers them from fullness 0 and allocates an object.
> Both are synchronized by the size class lock.
>
> > > Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> > > patterns the clients have. I suspect we'd need to synchronize RCU every
> > > time a zspage is freed: zs_free() [this one is complicated], or migration,
> > > or compaction? Sounds like an anti-pattern for RCU?
> >
> > Can't we use kfree_rcu() instead of synchronizing? Not sure if this
> > would still be an antipattern tbh.
>
> Yeah, I don't know. The last time I wrongly used kfree_rcu() it caused a
> 27% performance drop (some internal code). This zspage thingy maybe will
> be better, but it still has the potential to generate high numbers of RCU
> calls, depending on the clients. Probably the chances are too high. Apart
> from that, kvfree_rcu() can sleep, as far as I understand, so zram might
> have some extra things to deal with, namely slot-free notifications, which
> can be called from softirq, and are always called under a spinlock:
>
> mm slot-free -> zram slot-free -> zs_free -> empty zspage -> kfree_rcu
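FWIW, what I had in mind was the two-argument kfree_rcu(ptr, field) form,
which queues the object on an rcu_head embedded in the zspage and, as far
as I understand, does not sleep (only the headless kvfree_rcu(ptr) form
can). Roughly the sketch below; the rcu field and the helper name are
hypothetical, this is not the actual code:

---
struct zspage {
        atomic_t lock;
        /* ... existing fields ... */
        struct rcu_head rcu;    /* hypothetical new field */
};

static void free_zspage_rcu(struct zspage *zspage)
{
        /*
         * Defer the actual kfree() until after a grace period, so
         * lockless readers can finish with the zspage before it goes.
         */
        kfree_rcu(zspage, rcu);
}
---

That said, I take your point about the potential volume of RCU callbacks.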
> > It just seems like the current locking scheme is really complicated :/
>
> That's very true.

Seems like we have to compromise either way: custom locking, or we enter a
new complexity realm with RCU freeing.
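One last thought on the debug checks mentioned above: I assume the
lockdep_map used in your snippets gets registered once per zspage at
creation time, roughly like the sketch below (the init helper and the key
name are made up):

---
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/* one class for all zspage locks, so lockdep can report on them */
static struct lock_class_key zspage_lock_class;
#endif

static void zspage_lock_init(struct zspage *zspage)
{
        atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        lockdep_init_map(&zspage->lockdep_map, "zspage->lock",
                         &zspage_lock_class, 0);
#endif
}
---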