From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 5 Dec 2025 10:51:51 -0800
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Tejun Heo
Cc: Johannes Weiner, Michal Koutný, "Paul E. McKenney", JP Kobryn,
	Yosry Ahmed, linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Meta kernel team
Subject: Re: [PATCH] cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
Message-ID:
References: <20251205022437.1743547-1-shakeel.butt@linux.dev>
	<5i4ei2rbszdwlezpi63h5ksmckry27ffx6kfcg74qbvgjk22ao@2y2jeadg43y3>
In-Reply-To: <5i4ei2rbszdwlezpi63h5ksmckry27ffx6kfcg74qbvgjk22ao@2y2jeadg43y3>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
On Fri, Dec 05, 2025 at 09:59:39AM -0800, Shakeel Butt wrote:
> On Fri, Dec 05, 2025 at 07:35:30AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, Dec 04, 2025 at 06:24:37PM -0800, Shakeel Butt wrote:
> > ...
> > > In Meta's fleet running the kernel with commit 36df6e3dbd7e, we are
> > > observing on some machines that the memcg stats get skewed by more
> > > than the actual memory on the system. On close inspection, we noticed
> > > that the lockless node of a workload for a specific CPU was in a bad
> > > state, and thus all the updates on that CPU for that cgroup were
> > > being lost. At the moment, we are not sure if this CMPXCHG without
> > > LOCK is the cause of that, but it needs to be fixed irrespective.
> >
> > Is there a plausible theory of events that can explain the skew with
> > the use of this_cpu_cmpxchg()? lnode.next being set to self but
> > this_cpu_cmpxchg() returning something else? It may be useful to write
> > a targeted repro for the particular combination - this_cpu_cmpxchg()
> > vs. remote NULL clearing - and see whether this_cpu_cmpxchg() can
> > return a value that doesn't agree with what gets written in the memory.
>
> Yes, I am working on creating a repro for this and will share the
> results.
I think I found it. The repro is pasted below and can be built with the
following command:

$ gcc -O2 -pthread -o cmpxchg_race cmpxchg_race.c

$ cat cmpxchg_race.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <pthread.h>
#include <unistd.h>

// Kernel-style macros adapted for user-space
#define READ_ONCE(x)		(*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)	do { *(volatile typeof(x) *)&(x) = (val); } while (0)

// Simple llist implementation - EXACT KERNEL DEFINITIONS
struct llist_node {
	struct llist_node *next;
};

struct llist_head {
	struct llist_node *first;
};

// smp_load_acquire - load with acquire semantics
#define smp_load_acquire(p)				\
({							\
	typeof(*p) ___p1 = READ_ONCE(*p);		\
	__atomic_thread_fence(__ATOMIC_ACQUIRE);	\
	___p1;						\
})

// try_cmpxchg - compare and exchange, updates old value on failure
static inline bool try_cmpxchg(struct llist_node **ptr,
			       struct llist_node **old,
			       struct llist_node *new)
{
	typeof(*ptr) __old = *old;
	typeof(*ptr) __ret;

	asm volatile("lock cmpxchgq %2, %1"
		     : "=a" (__ret), "+m" (*ptr)
		     : "r" (new), "0" (__old)
		     : "memory");
	if (__ret == __old)
		return true;
	*old = __ret;
	return false;
}

/**
 * init_llist_node - initialize lock-less list node
 * From: include/linux/llist.h:84-87
 */
static inline void init_llist_node(struct llist_node *node)
{
	node->next = node;
}

/**
 * llist_on_list - test if a lock-less list node is on a list
 * From: include/linux/llist.h:98-101
 */
static inline bool llist_on_list(const struct llist_node *node)
{
	return node->next != node;
}

/**
 * llist_add_batch - add several linked entries in batch
 * From: include/linux/llist.h:234-245
 */
static inline bool llist_add_batch(struct llist_node *new_first,
				   struct llist_node *new_last,
				   struct llist_head *head)
{
	struct llist_node *first = READ_ONCE(head->first);

	do {
		new_last->next = first;
	} while (!try_cmpxchg(&head->first, &first, new_first));

	return !first;
}

/**
 * llist_add - add a new entry
 * From: include/linux/llist.h:263-266
 */
static inline bool llist_add(struct llist_node *new, struct llist_head *head)
{
	return llist_add_batch(new, new, head);
}

/**
 * llist_del_first - delete the first entry of lock-less list
 * From: lib/llist.c:31-43
 */
struct llist_node *llist_del_first(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (entry == NULL)
			return NULL;
		next = READ_ONCE(entry->next);
	} while (!try_cmpxchg(&head->first, &entry, next));

	return entry;
}

/**
 * llist_del_first_init - delete first entry and mark as off-list
 * From: include/linux/llist.h:303-310
 */
static inline struct llist_node *llist_del_first_init(struct llist_head *head)
{
	struct llist_node *n = llist_del_first(head);

	if (n)
		init_llist_node(n);
	return n;
}

// Global list and node
struct llist_head global_list = { .first = NULL };
struct llist_node node_self;
volatile bool stop = false;
volatile uint64_t success_count = 0;
volatile uint64_t already_count = 0;
volatile uint64_t del_count = 0;
volatile uint64_t empty_count = 0;
bool use_locked_cmpxchg = false;

// CMPXCHG without LOCK
static inline bool cmpxchg_unlocked(struct llist_node **ptr,
				    struct llist_node *old,
				    struct llist_node *new_val)
{
	struct llist_node *prev = old;

	asm volatile(
		"cmpxchgq %2, %1"	// No lock prefix!
		: "+a" (prev), "+m" (*ptr)
		: "r" (new_val)
		: "memory"
	);
	return prev == old;
}

// CMPXCHG with LOCK
static inline bool cmpxchg_locked(struct llist_node **ptr,
				  struct llist_node *old,
				  struct llist_node *new_val)
{
	struct llist_node *prev = old;

	asm volatile(
		"lock cmpxchgq %2, %1"	// WITH lock prefix!
		: "+a" (prev), "+m" (*ptr)
		: "r" (new_val)
		: "memory"
	);
	return prev == old;
}

// Check if node is in the list
bool is_node_in_list(struct llist_head *head, struct llist_node *node)
{
	struct llist_node *curr = READ_ONCE(head->first);

	while (curr) {
		if (curr == node)
			return true;
		curr = READ_ONCE(curr->next);
	}
	return false;
}

// Thread 1: Simulates css_rstat_updated()
void *thread_cmpxchg(void *arg)
{
	printf("Thread 1 (UPDATER): using %s CMPXCHG\n",
	       use_locked_cmpxchg ? "LOCKED" : "UNLOCKED");

	while (!stop) {
		// Try to atomically change from self to NULL (win the race)
		bool success;

		if (use_locked_cmpxchg) {
			success = cmpxchg_locked(&node_self.next, &node_self, NULL);
		} else {
			success = cmpxchg_unlocked(&node_self.next, &node_self, NULL);
		}

		if (success) {
			// We won! Add to the global list
			llist_add(&node_self, &global_list);
			success_count++;
		} else {
			already_count++;
		}
	}
	return NULL;
}

// Thread 2: Simulates css_process_update_tree() -> llist_del_first_init()
void *thread_write(void *arg)
{
	printf("Thread 2 (FLUSHER): doing llist_del_first_init\n");

	while (!stop) {
		// Remove first node from list and reinitialize it (sets next = self)
		struct llist_node *node = llist_del_first_init(&global_list);

		if (node)
			del_count++;
		else
			empty_count++;
	}
	return NULL;
}

void run_test(bool use_lock, int duration)
{
	pthread_t t1, t2;

	use_locked_cmpxchg = use_lock;
	stop = false;
	success_count = 0;
	del_count = 0;
	empty_count = 0;
	already_count = 0;

	// Initialize: node_self.next = self (not on list)
	init_llist_node(&node_self);
	global_list.first = NULL;

	printf("\n=== Running test with %s CMPXCHG for %d seconds ===\n",
	       use_lock ? "LOCKED" : "UNLOCKED", duration);

	pthread_create(&t1, NULL, thread_cmpxchg, NULL);
	pthread_create(&t2, NULL, thread_write, NULL);

	sleep(duration);
	stop = true;

	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	// Check final state
	struct llist_node *next = READ_ONCE(node_self.next);
	bool on_list = is_node_in_list(&global_list, &node_self);

	printf("\n=== Results (%s CMPXCHG) ===\n", use_lock ? "LOCKED" : "UNLOCKED");
	printf("Successful cmpxchg: %lu\n", success_count);
	printf("Already on list: %lu\n", already_count);
	printf("Deletions: %lu\n", del_count);
	printf("Empty list: %lu\n", empty_count);
	printf("\nFinal state:\n");
	printf("  node_self.next: %s\n",
	       next == NULL ? "NULL" : (next == &node_self ? "self" : "OTHER"));
	printf("  On global list: %s\n", on_list ? "YES" : "NO");

	// Check for failure condition
	bool failed = false;

	if (next == NULL && !on_list) {
		printf("\n*** FAILURE DETECTED! ***\n");
		printf("node_self.next is NULL but node is NOT on the list!\n");
		printf("This means we 'won' the cmpxchg race but lost the update.\n");
		failed = true;
	} else if (next == &node_self && !on_list) {
		printf("\n✓ OK: node_self.next is self and not on list (expected state)\n");
	} else if (next == NULL && on_list) {
		printf("\n✓ OK: node_self.next is NULL and on list (expected state)\n");
	} else if (on_list) {
		printf("\n✓ OK: node is on list\n");
	} else {
		printf("\n✓ OK: consistent state\n");
	}

	if (failed) {
		printf("\nThis demonstrates the race condition where:\n");
		printf("1. Thread 1 does unlocked cmpxchg(node_self.next, self, NULL) → succeeds\n");
		printf("2. Thread 2 does init_llist_node() → writes node_self.next = self\n");
		printf("3. Thread 1 thinks it won but the write from Thread 2 was lost\n");
		printf("4. Thread 1 adds node to list\n");
		printf("5. Thread 2 removes node and does init_llist_node() again\n");
		printf("6. Final state: next=NULL but not on list (INCONSISTENT!)\n");
	}
}

int main(int argc, char *argv[])
{
	int duration = argc > 1 ? atoi(argv[1]) : 3;

	printf("=== Simulating css_rstat_updated() Race Condition ===\n");
	printf("Using EXACT kernel llist implementations from:\n");
	printf("  - include/linux/llist.h (init_llist_node, llist_on_list, llist_add)\n");
	printf("  - lib/llist.c (llist_del_first)\n");
	printf("\n");
	printf("This emulates the exact kernel scenario:\n");
	printf("  Thread 1: css_rstat_updated() - cmpxchg + llist_add\n");
	printf("  Thread 2: css_process_update_tree() - llist_del_first_init\n");
	printf("\n");

	// Run with unlocked CMPXCHG (the bug)
	run_test(false, duration);

	printf("\n");
	printf("========================================\n");

	// Run with locked CMPXCHG (the fix)
	run_test(true, duration);

	return 0;
}