From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CD7FFC433FE for ; Mon, 10 Jan 2022 17:10:01 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 483266B0071; Mon, 10 Jan 2022 12:10:01 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 4310C6B0072; Mon, 10 Jan 2022 12:10:01 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2851D6B0074; Mon, 10 Jan 2022 12:10:01 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0068.hostedemail.com [216.40.44.68]) by kanga.kvack.org (Postfix) with ESMTP id 167386B0071 for ; Mon, 10 Jan 2022 12:10:01 -0500 (EST) Received: from smtpin10.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id BF3CE94FBD for ; Mon, 10 Jan 2022 17:10:00 +0000 (UTC) X-FDA: 79015014960.10.BB7C7B5 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf29.hostedemail.com (Postfix) with ESMTP id 1BBFD12000B for ; Mon, 10 Jan 2022 17:09:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1641834597; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=tBR8KnvX63pvs+Mv7Kb2j3m653wYDjd5gSjSGP6dBRs=; b=THOtJVuBSQgwy8HUoph4Aymft8xrFgbvuuD3gtEDFKDs+0ebbNX98KeQJDbjgWhXVSIqiU PgWWzRAp6aKnS8SYhnrMzGiK/xRaX9w2j8AdqobcI/ms85WLJGVnbGGABPuodyDFox7XSG kIk9QSGNlyRO7RFaQvqM3hmkgWdK7/8= Received: from mail-oo1-f72.google.com (mail-oo1-f72.google.com [209.85.161.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-571-hXsLfylYMGqire3Vuy3JRg-1; Mon, 10 Jan 2022 12:09:56 -0500 X-MC-Unique: hXsLfylYMGqire3Vuy3JRg-1 Received: by mail-oo1-f72.google.com with SMTP id l6-20020a4abe06000000b002dab9d33f2eso8877365oop.0 for ; Mon, 10 Jan 2022 09:09:56 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=tBR8KnvX63pvs+Mv7Kb2j3m653wYDjd5gSjSGP6dBRs=; b=RsRl4lrh/QTG45tfbzXdSzt6Sfg2LRQ4oDxrbDWR5/CB4MxUc972mJW2asJhQ4G6zM 2SZ3QEPQSNjhn3h0JNP466/Uq2k+DGoVwKpgduD17/grZgjtEtbb7fQcvv21RGkx7Tyt Zz/K65aRuaH6izjmcjMED8UECOmFy24lB/SY3TAoAhWEbUbMBup7poMEPnK9Cr7Z/Yz7 c9ohC2aGLJw5sodg168plH/uMhHEGHJTfg1/U+Jc+ombmL1pZiQA4NvgI+6hE8hfRv/P fbt/r71PHBzND9oIN7J8pLUNlRNHd4cebkHSWXiR7r07q6mSkxDHMFBkFr7AeW8/hr9q aCQA== X-Gm-Message-State: AOAM5316l2C6p/DQhAB/KymNpWP1yrf4Jf+FDx+SEbWz79y9ugmSbgQ5 tkKG0tCg5WPyjceOrTLQfGDoOlmycJqXFgOQIIMBwJc8J2RJm8l2uGUeQT7w8SZQd1AMtupm+Og RIWs3DihOaks= X-Received: by 2002:a9d:7d99:: with SMTP id j25mr565813otn.252.1641834594153; Mon, 10 Jan 2022 09:09:54 -0800 (PST) X-Google-Smtp-Source: ABdhPJwmX+24PBb8HyPbtQL8beByYwrkmR0B9Wexnr9flf86/F/kvxiYcT4yZ3RwPWsBHgabHSSqzQ== X-Received: by 2002:a9d:7d99:: with SMTP id j25mr565787otn.252.1641834593854; Mon, 10 Jan 2022 09:09:53 -0800 (PST) Received: from optiplex-fbsd (c-73-182-255-193.hsd1.nh.comcast.net. [73.182.255.193]) by smtp.gmail.com with ESMTPSA id f9sm1536135oto.56.2022.01.10.09.09.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Jan 2022 09:09:53 -0800 (PST) Date: Mon, 10 Jan 2022 12:09:50 -0500 From: Rafael Aquini To: Yang Shi Cc: Andrew Morton , Nico Pache , Linux Kernel Mailing List , Linux MM , Shakeel Butt , Kirill Tkhai , Roman Gushchin , Vlastimil Babka , Vladimir Davydov , raquini@redhat.com, Michal Hocko , David Hildenbrand Subject: Re: [PATCH v2 1/1] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes Message-ID: References: <20211207224013.880775-1-npache@redhat.com> <20211207224013.880775-2-npache@redhat.com> <20211207154438.c1e49a3f0b5ebc9245aac61b@linux-foundation.org> MIME-Version: 1.0 In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 1BBFD12000B X-Stat-Signature: 9g4ierdoij5q3pa3pc8r9smxao5cft5a Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=THOtJVuB; spf=none (imf29.hostedemail.com: domain of aquini@redhat.com has no SPF policy when checking 170.10.129.124) smtp.mailfrom=aquini@redhat.com; dmarc=pass (policy=none) header.from=redhat.com X-HE-Tag: 1641834599-718047 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Dec 07, 2021 at 04:26:28PM -0800, Yang Shi wrote: > On Tue, Dec 7, 2021 at 3:44 PM Andrew Morton wrote: > > > > On Tue, 7 Dec 2021 17:40:13 -0500 Nico Pache wrote: > > > > > We have run into a panic caused by a shrinker allocation being attempted > > > on an offlined node. > > > > > > Our crash analysis has determined that the issue originates from trying > > > to allocate pages on an offlined node in expand_one_shrinker_info. This > > > function makes the incorrect assumption that we can allocate on any node. > > > To correct this we make sure the node is online before tempting an > > > allocation. If it is not online choose the closest node. > > > > This isn't fully accurate, is it? We could allocate on a node which is > > presently offline but which was previously onlined, by testing > > NODE_DATA(nid). > > > > It isn't entirely clear to me from the v1 discussion why this approach > > isn't being taken? > > > > AFAICT the proposed patch is *already* taking this approach, by having > > no protection against a concurrent or subsequent node offlining? > > AFAICT, we have not reached agreement on how to fix it yet. I saw 3 > proposals at least: > > 1. From Michal, allocate node data for all possible nodes. > https://lore.kernel.org/all/Ya89aqij6nMwJrIZ@dhcp22.suse.cz/T/#u > > 2. What this patch does. Proposed originally from > https://lore.kernel.org/all/20211108202325.20304-1-amakhalov@vmware.com/T/#u > This patch doesn't cut it, because it only fixes up one caller. The issue, however, boils down to any iteration like ... for_each_possible_cpu(cpu) { struct page *page = alloc_pages_node(cpu_to_node(cpu), gfp, 0); ... where topology allows for possible cpus in yet-to-be-online node to cause node_zonelist() to produce a bogus zonelist pointer when populating ac->zonelist within prepare_alloc_pages(). I just bisected the following commit causing a PPC host to crash at boot, when initializing the cgoup subsystem: commit bd0e7491a931f5a2960555b10b9551464ff8cc8e Author: Vlastimil Babka Date: Sat May 22 01:59:38 2021 +0200 mm, slub: convert kmem_cpu_slab protection to local_lock The commit, obviously is not responsible for the crash, and the only "sin" committed by commit bd0e7491a93 as far as the crash issue is concerned is to increase the size of struct kmem_cache_cpu size with one local_lock_t, which in the worst case scenario -- !CONFIG_PREEMPT_RT && CONFIG_DEBUG_LOCK_ALLOC -- will end up creating an overhead of 56 bytes for each kmem_cache_cpu created, burning faster through the bootstrap pre-allocated embedded per-cpu pool. ---8<--- [ 0.001888] BUG: Kernel NULL pointer dereference on read at 0x00001508 [ 0.001900] Faulting instruction address: 0xc0000000005277fc [ 0.001905] Oops: Kernel access of bad area, sig: 11 [#1] [ 0.001909] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries [ 0.001915] Modules linked in: [ 0.001919] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.14.0+ #24 [ 0.001925] NIP: c0000000005277fc LR: c0000000005276d8 CTR: 0000000000000000 [ 0.001930] REGS: c000000002c936d0 TRAP: 0380 Not tainted (5.14.0+) [ 0.001935] MSR: 8000000002009033 CR: 24024222 XER: 00000002 [ 0.001948] CFAR: c0000000005276e0 IRQMASK: 0 [ 0.001948] GPR00: c0000000005276d8 c000000002c93970 c000000002c95500 0000000000001500 [ 0.001948] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 0.001948] GPR08: 0000000000000081 0000000000000001 0000000000001500 0000000044024222 [ 0.001948] GPR12: 00000003fa6a0000 c0000000049d0000 0000000000000000 c0000000053f0780 [ 0.001948] GPR16: 0000000000000000 0000000000000000 0000000000000008 c000000002cec160 [ 0.001948] GPR20: 0000000000000000 0000000000000cc2 c000000002cf0a30 c000000002cf02c0 [ 0.001948] GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000000 [ 0.001948] GPR28: 0000000000000000 c000000002c939e8 c000000002c939f0 0000000000000000 [ 0.002007] NIP [c0000000005277fc] prepare_alloc_pages.constprop.0+0x20c/0x280 [ 0.002014] LR [c0000000005276d8] prepare_alloc_pages.constprop.0+0xe8/0x280 [ 0.002020] Call Trace: [ 0.002022] [c000000002c93970] [c0000000005276c4] prepare_alloc_pages.constprop.0+0xd4/0x280 (unreliable) [ 0.002030] [c000000002c939c0] [c000000000531270] __alloc_pages+0xb0/0x3b0 [ 0.002036] [c000000002c93a50] [c0000000004cd4d4] pcpu_alloc_pages.constprop.0+0x144/0x2a0 [ 0.002044] [c000000002c93ae0] [c0000000004cd714] pcpu_populate_chunk+0x64/0x2b0 [ 0.002050] [c000000002c93b90] [c0000000004d1a14] pcpu_alloc+0x944/0xd80 [ 0.002056] [c000000002c93cb0] [c0000000005abcd4] mem_cgroup_alloc+0x144/0x470 [ 0.002063] [c000000002c93d30] [c0000000005bcedc] mem_cgroup_css_alloc+0xac/0x390 [ 0.002070] [c000000002c93d80] [c00000000203181c] cgroup_init_subsys+0xf4/0x260 [ 0.002078] [c000000002c93e30] [c000000002031d40] cgroup_init+0x1f4/0x528 [ 0.002085] [c000000002c93f00] [c0000000020051ec] start_kernel+0x64c/0x6b0 [ 0.002091] [c000000002c93f90] [c00000000000d39c] start_here_common+0x1c/0x600 [ 0.002098] Instruction dump: [ 0.002101] 61280080 2c0a0001 7d28489e 913d0000 57ffa7fe 9bfe0020 809e001c e8be0008 [ 0.002112] e87e0000 2c250000 7c6a1b78 40820058 <81230008> 7c044840 4180004c f95e0010 [ 0.002124] ---[ end trace 7392155448beabaa ]--- [ 0.002127] [ 1.002133] Kernel panic - not syncing: Attempted to kill the idle task! [ 1.002195] ------------[ cut here ]------------ --->8--- > 3. From David, fix in node_zonelist(). > https://lore.kernel.org/all/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com/T/#u It seems to me that (3) is the simplest and effective way to cope with this case Cheers, -- Rafael > > > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -222,13 +222,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg, > > > int size = map_size + defer_size; > > > > > > for_each_node(nid) { > > > + int tmp = nid; > > > > Not `tmp', please. Better to use an identifier which explains the > > variable's use. target_nid? > > > > And a newline after defining locals, please. > > > > > pn = memcg->nodeinfo[nid]; > > > old = shrinker_info_protected(memcg, nid); > > > /* Not yet online memcg */ > > > if (!old) > > > return 0; > > > > > > - new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid); > > > + if(!node_online(nid)) > > > > s/if(/if (/ > > > > > + tmp = numa_mem_id(); > > > + new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp); > > > if (!new) > > > return -ENOMEM; > > > > > > > And a code comment fully explaining what's going on here? >