From: Zhongkun He
Date: Thu, 22 Aug 2024 17:11:03 +0800
Subject: Re: [External] Re: [PATCH V1] mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()
To: Michal Hocko
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, hannes@cmpxchg.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, lizefan.x@bytedance.com
References: <20240822083842.3167137-1-hezhongkun.hzk@bytedance.com>

On Thu, Aug 22, 2024 at 5:01 PM Michal Hocko wrote:
>
> On Thu 22-08-24 16:38:42, Zhongkun He wrote:
> > I found a problem in my test machine that should_reclaim_retry() do
> > not get the right node if i set the cpuset.mems.
> > The should_reclaim_retry() and try_to_compact_pages() are iterating
> > nodes which are not allowed by cpusets and that makes the retry loop
> > happening more than unnecessary.
>
> I would update the problem description because from the above it is not
> really clear what the actual problem is.
>
> should_reclaim_retry is not ALLOC_CPUSET aware and that means that it
> considers reclaimability of NUMA nodes which are outside of the cpuset.
> If other nodes have a lot of reclaimable memory then should_reclaim_retry
> would instruct page allocator to retry even though there is no memory
> reclaimable on the cpuset nodemask. This is not really a huge problem
> because the number of retries without any reclaim progress is bound but
> it could be certainly improved. This is a cold path so this shouldn't
> really have a measurable impact on performance on most workloads.
>

Thanks for your description about this case.

> >
> > 1.Test step and the machines.
> > ------------
> > root@vm:/sys/fs/cgroup/test# numactl -H | grep size
> > node 0 size: 9477 MB
> > node 1 size: 10079 MB
> > node 2 size: 10079 MB
> > node 3 size: 10078 MB
> >
> > root@vm:/sys/fs/cgroup/test# cat cpuset.mems
> > 2
> >
> > root@vm:/sys/fs/cgroup/test# stress --vm 1 --vm-bytes 12g --vm-keep
> > stress: info: [33430] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
> > stress: FAIL: [33430] (425) <-- worker 33431 got signal 9
> > stress: WARN: [33430] (427) now reaping child worker processes
> > stress: FAIL: [33430] (461) failed run completed in 2s
> >
> > 2. reclaim_retry_zone info:
> >
> > We can only alloc pages from node=2, but the reclaim_retry_zone is
> > node=0 and return true.
> >
> > root@vm:/sys/kernel/debug/tracing# cat trace
> > stress-33431 [001] ..... 13223.617311: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=1 wmark_check=1
> > stress-33431 [001] ..... 13223.617682: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=2 wmark_check=1
> > stress-33431 [001] ..... 13223.618103: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=3 wmark_check=1
> > stress-33431 [001] ..... 13223.618454: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=4 wmark_check=1
> > stress-33431 [001] ..... 13223.618770: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=5 wmark_check=1
> > stress-33431 [001] ..... 13223.619150: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=6 wmark_check=1
> > stress-33431 [001] ..... 13223.619510: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=7 wmark_check=1
> > stress-33431 [001] ..... 13223.619850: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=8 wmark_check=1
> > stress-33431 [001] ..... 13223.620171: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=9 wmark_check=1
> > stress-33431 [001] ..... 13223.620533: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=10 wmark_check=1
> > stress-33431 [001] ..... 13223.620894: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=11 wmark_check=1
> > stress-33431 [001] ..... 13223.621224: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=12 wmark_check=1
> > stress-33431 [001] ..... 13223.621551: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=13 wmark_check=1
> > stress-33431 [001] ..... 13223.621847: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=14 wmark_check=1
> > stress-33431 [001] ..... 13223.622200: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=15 wmark_check=1
> > stress-33431 [001] ..... 13223.622580: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=16 wmark_check=1
> >
>
> You can drop the following
>

OK.

> > 3. Root cause:
> > Nodemask usually comes from mempolicy in policy_nodemask(), which
> > is always NULL unless the memory policy is bind or prefer_many.
> >
> > nodemask = NULL
> > __alloc_pages_noprof()
> >	prepare_alloc_pages
> >		ac->nodemask = &cpuset_current_mems_allowed;
> >
> >	get_page_from_freelist()
> >
> >	ac.nodemask = nodemask; /*set NULL*/
> >
> >	__alloc_pages_slowpath() {
> >		if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
> >			ac->nodemask = NULL;
> >			ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> >					ac->highest_zoneidx, ac->nodemask);
> >
> >		/* so ac.nodemask = NULL */
> >	}
> >
> > According to the function flow above, we do not have the memory limit to
> > follow cpuset.mems, so we need to add it.
> >
> > Test result:
> > Try 3 times with different cpuset.mems and alloc large memorys than that numa size.
> > echo 1 > cpuset.mems
> > stress --vm 1 --vm-bytes 12g --vm-hang 0
> > ---------------
> > echo 2 > cpuset.mems
> > stress --vm 1 --vm-bytes 12g --vm-hang 0
> > ---------------
> > echo 3 > cpuset.mems
> > stress --vm 1 --vm-bytes 12g --vm-hang 0
> >
> > The retry trace look like:
> > stress-2139 [003] ..... 666.934104: reclaim_retry_zone: node=1 zone=Normal order=0 reclaimable=7 available=7355 min_wmark=8598 no_progress_loops=1 wmark_check=0
> > stress-2204 [010] ..... 695.447393: reclaim_retry_zone: node=2 zone=Normal order=0 reclaimable=2 available=6916 min_wmark=8598 no_progress_loops=1 wmark_check=0
> > stress-2271 [008] ..... 725.683058: reclaim_retry_zone: node=3 zone=Normal order=0 reclaimable=17 available=8079 min_wmark=8597 no_progress_loops=1 wmark_check=0
> >
>
> And only keep this
>

OK.

> > With this patch, we can check the right node and get less retry in __alloc_pages_slowpath()
> > because there is nothing to do.
> >
> > V1:
> > Do the same with the page allocator using __cpuset_zone_allowed().
> >
> > Suggested-by: Michal Hocko
> > Signed-off-by: Zhongkun He
>
> With those changes you can add
> Acked-by: Michal Hocko
>

Thanks!

> > ---
> >  mm/compaction.c | 6 ++++++
> >  mm/page_alloc.c | 5 +++++
> >  2 files changed, 11 insertions(+)
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index d1041fbce679..a2b16b08cbbf 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -23,6 +23,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include "internal.h"
> >
> >  #ifdef CONFIG_COMPACTION
> > @@ -2822,6 +2823,11 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> >					ac->highest_zoneidx, ac->nodemask) {
> >		enum compact_result status;
> >
> > +		if (cpusets_enabled() &&
> > +			(alloc_flags & ALLOC_CPUSET) &&
> > +			!__cpuset_zone_allowed(zone, gfp_mask))
> > +				continue;
> > +
> >		if (prio > MIN_COMPACT_PRIORITY
> >					&& compaction_deferred(zone, order)) {
> >			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 29608ca294cf..8a67d760b71a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4128,6 +4128,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> >		unsigned long min_wmark = min_wmark_pages(zone);
> >		bool wmark;
> >
> > +		if (cpusets_enabled() &&
> > +			(alloc_flags & ALLOC_CPUSET) &&
> > +			!__cpuset_zone_allowed(zone, gfp_mask))
> > +				continue;
> > +
> >		available = reclaimable = zone_reclaimable_pages(zone);
> >		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> >
> > --
> > 2.20.1
>
> --
> Michal Hocko
> SUSE Labs
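
For readers following the thread, the effect of the cpuset check can be sketched in user space. This is only a toy model, not kernel code: struct toy_zone, toy_node_allowed() and toy_should_retry() are hypothetical names, the watermark test is reduced to a plain comparison of available pages against min_wmark, and the node 0 and node 2 figures are borrowed from two different traces in this mail purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a zone; not the kernel's struct zone. */
struct toy_zone {
	int node;                  /* NUMA node the zone belongs to */
	unsigned long reclaimable; /* pages reclaim could still free */
	unsigned long free_pages;  /* pages free right now */
	unsigned long min_wmark;   /* min watermark, in pages */
};

/* Figures from the traces in this mail (free = available - reclaimable). */
static const struct toy_zone toy_zones[] = {
	{ .node = 0, .reclaimable = 4260, .free_pages = 1767759, .min_wmark = 5962 },
	{ .node = 2, .reclaimable = 2,    .free_pages = 6914,    .min_wmark = 8598 },
};

/* Bit n set means node n is allowed, loosely mirroring cpuset.mems. */
static bool toy_node_allowed(unsigned int mems_allowed, int node)
{
	assert(node >= 0 && node < 32);
	return (mems_allowed >> node) & 1u;
}

/*
 * Toy retry decision: retry if any considered zone would exceed its min
 * watermark once everything reclaimable had been reclaimed. When
 * respect_cpuset is true, zones on nodes outside mems_allowed are skipped,
 * which is the behaviour the patch adds (assumption: greatly simplified).
 */
static bool toy_should_retry(const struct toy_zone *zones, int nr,
			     unsigned int mems_allowed, bool respect_cpuset)
{
	for (int i = 0; i < nr; i++) {
		unsigned long available;

		if (respect_cpuset &&
		    !toy_node_allowed(mems_allowed, zones[i].node))
			continue;

		available = zones[i].reclaimable + zones[i].free_pages;
		if (available > zones[i].min_wmark)
			return true;
	}
	return false;
}

/* Print the decision for cpuset.mems = 2, without and with the check. */
void toy_report(void)
{
	unsigned int mems = 1u << 2;

	printf("without cpuset check: retry=%d\n",
	       toy_should_retry(toy_zones, 2, mems, false));
	printf("with cpuset check:    retry=%d\n",
	       toy_should_retry(toy_zones, 2, mems, true));
}
```

Calling toy_report() from a main() shows the difference: without the check, node 0's large available count keeps the allocator retrying even though only node 2 is allowed; with the check, node 2 alone is considered and it cannot meet its watermark, so the retry loop stops early.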