From: Shakeel Butt <shakeel.butt@linux.dev>
To: Alexei Starovoitov
Cc: bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko,
 Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
 Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
 Matthew Wilcox
Subject: Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Mon, 5 May 2025 18:24:56 -0700
References: <20250501032718.65476-1-alexei.starovoitov@gmail.com>
 <20250501032718.65476-7-alexei.starovoitov@gmail.com>

On Mon, May 05, 2025 at 05:49:47PM -0700, Alexei Starovoitov wrote:
> On Mon, May 5, 2025 at 11:46 AM Shakeel Butt wrote:
> >
> > On Wed, Apr 30, 2025 at 08:27:18PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov
> > >
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -595,7 +595,13 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > >  	if (!val)
> > >  		return;
> > >
> > > -	cgroup_rstat_updated(memcg->css.cgroup, cpu);
> > > +	/*
> > > +	 * If called from NMI via kmalloc_nolock -> memcg_slab_post_alloc_hook
> > > +	 * -> obj_cgroup_charge -> mod_memcg_state,
> > > +	 * then delay the update.
> > > +	 */
> > > +	if (!in_nmi())
> > > +		cgroup_rstat_updated(memcg->css.cgroup, cpu);
> >
> > I don't think we can just ignore cgroup_rstat_updated() for nmi as there
> > is a chance (though very small) that we will lose these stats updates.
>
> I'm failing to understand why it's an issue.
> Not doing cgroup_rstat_updated() can only cause the updated_next link
> to stay NULL when it should be set, but that should be harmless, and
> no different from the racy check the code is already doing:
>
>	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
>		return;
>
> Imagine it was !NULL: the code would return, but then after preemption
> something clears it to NULL, and here we're skipping a set of
> updated_next.

cgroup_rstat_updated() puts the given cgroup, whose stats were modified,
into the per-cpu update tree which the read side later flushes to get
up-to-date stats. Not putting it in the update tree will cause the read
side to not flush the stats cached on that cpu. There is a possibility
that someone else in non-nmi context may put that cgroup on that cpu's
update tree, but there is no guarantee.

> > In addition, memcg_rstat_updated() itself is not reentrant safe, along
> > with a couple of functions leading to it like __mod_memcg_lruvec_state().
>
> Sure. __mod_memcg_lruvec_state() is not reentrant,
> but it's not an issue for kmalloc_nolock(), since objcg/memcg
> charge/uncharge from slub is not calling it (as far as I can tell).

Without this patch:

  __memcg_slab_post_alloc_hook() -> obj_cgroup_charge_account() ->
  consume_obj_stock() -> __account_obj_stock() ->
  __mod_objcg_mlstate() -> __mod_memcg_lruvec_state()

With this patch:

  __memcg_slab_post_alloc_hook() -> obj_cgroup_charge_atomic() ->
  obj_cgroup_charge_pages() -> mod_memcg_state() -> __mod_memcg_state()

Other than __mod_memcg_state() being not reentrant safe, we will be
missing the NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B updates
after the patch.
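To make the reentrancy issue concrete, here is a minimal sketch
(hypothetical standalone C, not the actual __mod_memcg_state()
implementation) of why a plain read-modify-write on a per-cpu counter
is not NMI safe:

/*
 * Hypothetical sketch: a non-atomic counter update standing in for a
 * this-cpu stat update.  The load and the store are separate steps.
 */
static void nonreentrant_add(long *counter, long val)
{
	long old = *counter;	/* an NMI can fire here...        */
	*counter = old + val;	/* ...and its update is lost here */
}

If the NMI handler performs the same update between the load and the
store, the interrupted context's store overwrites it and the NMI's
contribution silently disappears.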
> > >  	statc = this_cpu_ptr(memcg->vmstats_percpu);
> > >  	for (; statc; statc = statc->parent) {
> > >  		/*
> > > @@ -2895,7 +2901,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > >  	unsigned long flags;
> > >  	bool ret = false;
> > >
> > > -	local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > +	local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
> > >
> > >  	stock = this_cpu_ptr(&memcg_stock);
> > >  	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
> > > @@ -2995,7 +3001,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > >  	unsigned long flags;
> > >  	unsigned int nr_pages = 0;
> > >
> > > -	local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > +	local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
> > >
> > >  	stock = this_cpu_ptr(&memcg_stock);
> > >  	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > > @@ -3088,6 +3094,27 @@ static inline size_t obj_full_size(struct kmem_cache *s)
> > >  	return s->size + sizeof(struct obj_cgroup *);
> > >  }
> > >
> > > +/*
> > > + * Try subtract from nr_charged_bytes without making it negative
> > > + */
> > > +static bool obj_cgroup_charge_atomic(struct obj_cgroup *objcg, gfp_t flags, size_t sz)
> > > +{
> > > +	size_t old = atomic_read(&objcg->nr_charged_bytes);
> > > +	u32 nr_pages = sz >> PAGE_SHIFT;
> > > +	u32 nr_bytes = sz & (PAGE_SIZE - 1);
> > > +
> > > +	if ((ssize_t)(old - sz) >= 0 &&
> > > +	    atomic_cmpxchg(&objcg->nr_charged_bytes, old, old - sz) == old)
> > > +		return true;
> > > +
> > > +	nr_pages++;
> > > +	if (obj_cgroup_charge_pages(objcg, flags, nr_pages))
> > > +		return false;
> > > +
> > > +	atomic_add(PAGE_SIZE - nr_bytes, &objcg->nr_charged_bytes);
> > > +	return true;
> > > +}
> > > +
> > >  bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >  				  gfp_t flags, size_t size, void **p)
> > >  {
> > > @@ -3128,6 +3155,21 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >  		return false;
> > >  	}
> > >
> > > +	if (!gfpflags_allow_spinning(flags)) {
> > > +		if (local_lock_is_locked(&memcg_stock.stock_lock)) {
> > > +			/*
> > > +			 * Cannot use
> > > +			 * lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +			 * since lockdep might not have been informed yet
> > > +			 * of lock acquisition.
> > > +			 */
> > > +			return obj_cgroup_charge_atomic(objcg, flags,
> > > +							size * obj_full_size(s));
> >
> > We can not just ignore the stat updates here.
> >
> > > +		} else {
> > > +			lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +		}
> > > +	}
> > > +
> > >  	for (i = 0; i < size; i++) {
> > >  		slab = virt_to_slab(p[i]);
> > >
> > > @@ -3162,8 +3204,12 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >  void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > >  			    void **p, int objects, struct slabobj_ext *obj_exts)
> > >  {
> > > +	bool lock_held = local_lock_is_locked(&memcg_stock.stock_lock);
> > >  	size_t obj_size = obj_full_size(s);
> > >
> > > +	if (likely(!lock_held))
> > > +		lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +
> > >  	for (int i = 0; i < objects; i++) {
> > >  		struct obj_cgroup *objcg;
> > >  		unsigned int off;
> > > @@ -3174,8 +3220,12 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > >  			continue;
> > >
> > >  		obj_exts[off].objcg = NULL;
> > > -		refill_obj_stock(objcg, obj_size, true, -obj_size,
> > > -				 slab_pgdat(slab), cache_vmstat_idx(s));
> > > +		if (unlikely(lock_held)) {
> > > +			atomic_add(obj_size, &objcg->nr_charged_bytes);
> >
> > objcg->nr_charged_bytes is stats ignorant and the relevant stats need to
> > be updated before putting stuff into it.
>
> I'm not following.
> It's functionally equivalent to refill_obj_stock() without
> __account_obj_stock().
> And the stats are not ignored.
> The next __memcg_slab_free_hook() from good context will update
> them. It's only a tiny delay in update.
> I don't see why it's an issue.

For the slab object of size obj_size which is being freed here, we need
to update the NR_SLAB_RECLAIMABLE_B or NR_SLAB_UNRECLAIMABLE_B stat for
the corresponding objcg by the amount of obj_size. If we don't call
__account_obj_stock() here, we lose the context and information needed
to update these stats later.
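As a toy illustration (hypothetical types, not the kernel's structures)
of what each free path records:

/* Toy model: stats are keyed by (node, item); the fallback keeps a size. */
enum toy_item { TOY_RECLAIMABLE_B, TOY_UNRECLAIMABLE_B };

struct toy_objcg {
	long nr_charged_bytes;	/* fallback path: a bare byte count  */
	long vmstat[2][2];	/* stat deltas keyed by [node][item] */
};

/* stock path: node and item are derived from the slab being freed */
static void toy_free(struct toy_objcg *o, int node, enum toy_item item,
		     long obj_size)
{
	o->vmstat[node][item] -= obj_size;
}

/* fallback path: only the size survives; (node, item) is not recorded */
static void toy_free_nolock(struct toy_objcg *o, long obj_size)
{
	o->nr_charged_bytes += obj_size;
}

Once only nr_charged_bytes holds the freed bytes, the key derived from
slab_pgdat(slab) and cache_vmstat_idx(s) is gone, so a later free from
a good context cannot replay this object's stat update.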
> > > +		} else {
> > > +			refill_obj_stock(objcg, obj_size, true, -obj_size,
> > > +					 slab_pgdat(slab), cache_vmstat_idx(s));
> > > +		}
> > >  		obj_cgroup_put(objcg);
> > >  	}
> > >  }
> >
> > I am actually working on making this whole call chain (i.e.
> > kmalloc/kmem_cache_alloc to memcg [un]charging) reentrant/nmi safe.
>
> Thank you for working on it!
> You mean this set:
> https://lore.kernel.org/all/20250429061211.1295443-1-shakeel.butt@linux.dev/
> ?
> It's making css_rstat_updated() re-entrant,
> which is a renamed/reworked version of memcg_rstat_updated().
> That's good, but not enough from the slub pov.
> It removes the need for the first hunk in this patch from mm/memcontrol.c:
>
> +	if (!in_nmi())
> +		cgroup_rstat_updated(...);
>
> but the hunks in __memcg_slab_post_alloc_hook() and
> __memcg_slab_free_hook() are still needed.
> And I think the obj_cgroup_charge_atomic() approach in this patch is
> correct. The delay in rstat update seems fine.
> Please help me understand what I'm missing.

The css_rstat_updated() is the new name of cgroup_rstat_updated() and
it is only a piece of the puzzle. My plan is to make memcg stats
reentrant, which would allow calling __account_obj_stock() (or whatever
its new name will be) in nmi context; consume_obj_stock() and
refill_obj_stock() would then work very similarly to consume_stock()
and refill_stock(). Please give me a couple of days and I can share the
full RFC of the memcg side.
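In the meantime, for a rough idea of the direction (a generic sketch of
the well-known cmpxchg retry pattern in standalone C, not the actual
RFC code):

#include <stdatomic.h>

/*
 * Generic sketch: a compare-and-swap retry loop detects that an NMI
 * raced with the read-modify-write and redoes the update instead of
 * overwriting the NMI's contribution.
 */
static void reentrant_add(_Atomic long *counter, long val)
{
	long old = atomic_load_explicit(counter, memory_order_relaxed);

	/* on failure 'old' is refreshed; retry with the new value */
	while (!atomic_compare_exchange_weak_explicit(counter, &old,
						      old + val,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}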