From: Yang Shi
Date: Mon, 6 Dec 2021 10:26:32 -0800
Subject: Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes
To: Michal Hocko
Cc: Vlastimil Babka, David Hildenbrand, Nico Pache, Linux Kernel Mailing List, Linux MM, Andrew Morton, Shakeel Butt, Kirill Tkhai, Roman Gushchin, Vladimir Davydov, raquini@redhat.com
References: <840cb3d0-61fe-b6cb-9918-69146ba06cf7@redhat.com> <51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com> <05157de4-e5df-11fc-fc46-8a9f79d0ddb4@redhat.com>

On Mon, Dec 6, 2021 at 6:53 AM Michal Hocko wrote:
>
> On Mon 06-12-21 15:30:37, Vlastimil Babka wrote:
> > On 12/6/21 15:21, Michal Hocko wrote:
> > > On Mon 06-12-21 15:08:10, David Hildenbrand wrote:
> > >>
> > >> >> But there might be more missing. Onlining a new zone will get more
> > >> >> expensive in setups with a lot of possible nodes (x86-64 shouldn't
> > >> >> really be an issue in that regard).
> > >> >
> > >> > Honestly, I am not really concerned by platforms with too many nodes
> > >> > without any memory. If they want to shoot their feet then that's their
> > >> > choice. We can optimize for those if they ever prove to be standard.
> > >> >
> > >> >> If we want stable backports, we'll want something simple upfront.
> > >> >
> > >> > For stable backports I would be fine by doing your NODE_DATA check in
> > >> > the allocator. In upstream I think we should be aiming for a more robust
> > >> > solution that is also easier to maintain further down the line. Even if
> > >> > that is an investment at this moment because the initialization code is
> > >> > a mess.
> > >> >
> > >>
> > >> Agreed. I would be curious *why* we decided to dynamically allocate the
> > >> pgdat. Is this just a historical coincidence or was there a real reason to
> > >> not allocate it for all possible nodes during boot?
> > >
> > > I don't know but if I was to guess the most likely explanation would be
> > > that the numa init code was in a similar order as now and it was easier
> > > to simply allocate a pgdat when a new one was onlined.
> > > 9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)")
> > > doesn't really tell much.
> >
> > I don't know if that's true for pgdat specifically, but generally IMHO the
> > advantages of allocating during/after online instead of for each possible
> > node are
> > - memory savings when some possible node is actually never online
> > - at least in some cases, the allocations can be local to the node in
> >   question, where the advantage is
> >   - faster access
> >   - less memory occupied on nodes that are earlier online, especially node 0
> >
> > So while the approach of allocating on boot for all possible nodes instead of
> > just online nodes has advantages of being generally safer and simpler (no
> > memory hotplug callbacks etc), we should also be careful not to overdo this
> > approach so we don't end up with Node 0 memory filled with structures used
> > for nodes 1-X that are just onlined later. I imagine that could be a problem
> > even for "sane" archs that don't have tons of possible, but offline nodes.
>
> Yes this can indeed turn out to be a problem as the memory allocation
> scales not only with numa nodes but memcgs as well. The latter one being
> a more visible one.
>
> > Concretely, pgdat should probably be fine, but things like all shrinkers?
> > Maybe less so.
>
> Yeah, right. But for that purpose the concept of online_node is just
> misleading. You would need a check whether the node is populated with
> memory and implement hotplug notifiers.

Yes, the downside is memory waste. I think it is a known problem, since
memcg has per-node data (a.k.a. mem_cgroup_per_node_info) which holds
the lruvec and shrinker info. And the comment in the code of
alloc_mem_cgroup_per_node_info() does say:

"TODO: this routine can waste much memory for nodes which will never
be onlined. It's better to use memory hotplug callback function."

But IMHO the memory usage should actually not be that bad for
memcg-heavy use cases, since there should not be too many "never
onlined" nodes for such workloads?

>
> --
> Michal Hocko
> SUSE Labs
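
To put rough numbers on the trade-off discussed above, here is a small
userspace model (not kernel code) of how per-node memcg data scales when it
is allocated for every possible node versus only for nodes that are populated
with memory. The struct, the function names and the sizes are all made up for
illustration; nothing here corresponds to an actual kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of per-node bookkeeping, for illustration only. */
struct pernode_model {
	size_t possible_nodes;   /* nodes that could ever come online */
	size_t populated_nodes;  /* nodes that actually have memory */
	size_t bytes_per_node;   /* assumed size of one per-node structure */
};

/* Bytes used when every memcg allocates for all possible nodes. */
static size_t bytes_all_possible(const struct pernode_model *m,
				 size_t nr_memcgs)
{
	return nr_memcgs * m->possible_nodes * m->bytes_per_node;
}

/* Bytes used when every memcg allocates only for populated nodes. */
static size_t bytes_populated_only(const struct pernode_model *m,
				   size_t nr_memcgs)
{
	return nr_memcgs * m->populated_nodes * m->bytes_per_node;
}

/* Difference between the two strategies: memory spent on empty nodes. */
static size_t bytes_wasted(const struct pernode_model *m, size_t nr_memcgs)
{
	/* A populated node is by definition also a possible node. */
	assert(m->populated_nodes <= m->possible_nodes);
	return bytes_all_possible(m, nr_memcgs) -
	       bytes_populated_only(m, nr_memcgs);
}
```

With these made-up inputs (8 possible nodes, 2 populated, 1024 bytes per
node, 100 memcgs) the model reports 614400 bytes spent on nodes without
memory. The waste grows linearly with both the number of memcgs and the
number of never-populated nodes, which is why it only matters when both are
large at the same time.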