From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 52E2FC4167B for ; Tue, 12 Dec 2023 01:07:12 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8FB006B026A; Mon, 11 Dec 2023 20:07:11 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 8AB9C6B026B; Mon, 11 Dec 2023 20:07:11 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 77EAC6B026C; Mon, 11 Dec 2023 20:07:11 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 6567E6B026A for ; Mon, 11 Dec 2023 20:07:11 -0500 (EST) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 40388A1C8D for ; Tue, 12 Dec 2023 01:07:11 +0000 (UTC) X-FDA: 81556377462.14.EC24F4F Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf03.hostedemail.com (Postfix) with ESMTP id 5610C2000D for ; Tue, 12 Dec 2023 01:07:09 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=EP1BbKtW; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf03.hostedemail.com: domain of chrisl@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=chrisl@kernel.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1702343229; a=rsa-sha256; cv=none; b=IPqxFt++RnlU2i44R/iw2DdhW5JS0fo7vin7JAKgLketnoIVV9eH3PJkT9k11y2hbZssCR i/RlW7M/JTu6R2vdc1AvdfxJETO/LTKOibvVnzqUw88DBUj5PM/I73UMbos0bTeeoHKaPR e5qL2oRrBVTWneWzSsm6I8PyRT/Copk= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=EP1BbKtW; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf03.hostedemail.com: domain of chrisl@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=chrisl@kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1702343229; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=d/cWCTzQ+1u8sf594BtBwr5TiC/bLaU/S8GAKRjknjo=; b=DOM6Eo7yYCxQWGlLQAzdawAJcJqrG3M2kzAB6e3qvJI0Hwqsh2aBi4NCqmv+sKQ5Co7ZO5 GzUckz4GZ+apuxx/zuXdc54Z4Z6sP5ww5JU97LNPGRZuVIioaPZ/Wq+EFkyx206yDrAscK GHfX0a+xsBahQlXZQH0A2h41s23UkXk= Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 0CBE2615E2 for ; Tue, 12 Dec 2023 01:07:08 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id B084DC433C7 for ; Tue, 12 Dec 2023 01:07:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1702343227; bh=MDets05mPCDoqLxyZIXukdeciXvfqDz5mf6mMUdN2sM=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=EP1BbKtWWaBQvGIDk2Hvi9L+UYeZNhtroplbMAv4+6AOEMlK9FiFAJu3sLFOtrRmp rCm3llUhRUAm1nytlNBhtwfWBccJSw6d8IXmR0eL+JmSmtvwk4o4Gmij8s1wCglQV4 9vgx8+yHxEk2+VLCs5cgT5IYk3GhA6ci0ezuKIOUpEzUpE+skYvIaqOmuXr3ypnnUn n58VZNGX97+59IlwUSj5TonA5d0jTtWG2xDRA3dysBuYNy7PzpxWEdb0gBoG60+8R7 mUkQb1FfCcFa8FV4E/I9e6TVTGpjWYpj0SRveBKPqwdkfkja7BJItoB51Ae1PP0Cjl Al12d+wQu0OpQ== Received: by mail-ot1-f44.google.com with SMTP id 46e09a7af769-6d9e179b217so3549074a34.0 for ; Mon, 11 Dec 2023 17:07:07 -0800 (PST) X-Gm-Message-State: AOJu0Yxn9ijV9XdODzWeBiMflU4FId5ttJEnHYi2ED9zejoNGLpwlrJ/ h/0dbZTnp8qafCM0iytP1ORGQu2d+i3BMa/cXr0izw== X-Google-Smtp-Source: AGHT+IFnU6Zg6gtbcKGpVVyMUgGn/eZhHtC+eRX3XCuRhFv0V1qwZo+7UWlo2LwQsjS0Um0ppKxvMMc5FhFchTfzT50= X-Received: by 2002:a05:6358:d598:b0:170:2abc:6e34 with SMTP id ms24-20020a056358d59800b001702abc6e34mr5702692rwb.19.1702343226888; Mon, 11 Dec 2023 17:07:06 -0800 (PST) MIME-Version: 1.0 References: <20231211140419.1298178-1-schatzberg.dan@gmail.com> <20231211140419.1298178-2-schatzberg.dan@gmail.com> In-Reply-To: <20231211140419.1298178-2-schatzberg.dan@gmail.com> From: Chris Li Date: Mon, 11 Dec 2023 17:06:54 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH V3 1/1] mm: add swapiness= arg to memory.reclaim To: Dan Schatzberg Cc: Johannes Weiner , Roman Gushchin , Yosry Ahmed , Huan Yang , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Tejun Heo , Zefan Li , Jonathan Corbet , Michal Hocko , Shakeel Butt , Muchun Song , Andrew Morton , David Hildenbrand , Matthew Wilcox , Kefeng Wang , "Vishal Moola (Oracle)" , Yue Zhao , Hugh Dickins Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 5610C2000D X-Stat-Signature: 7kosh4ohaa5cihxaqh6wjkq1s3e3uog5 X-HE-Tag: 1702343229-300849 X-HE-Meta: U2FsdGVkX1+82J6RrZ7rWl7ImcQMTL9gm+FpOZxyi9PQa+saIvpjwp36V5r11JEKL474r7nu2ZADReoafP2ZJ1NG2O4rugT8/ETm13PoamFESne10ceJ6Mxz33FacS2ycwXwhfcT7/4NKDHHtYe8Jn/r+8OaHP4ocMp1h4GocLvPn78zBD1Ryni0VmfgrwOtoFiLOHR2Zen5e5I3nH7nLKaUwq1aARpVvlwpHDDsWj2aT363sAZfLZ2lQ2Eu7JIVeQjfIDsB5X1cSdHh1Xxe2vMyHESMZCgMKVLjPC2do7OTLK0D4Typ/+XWz7Ikf529WKycBJE636PhU+OVycjLP4mBrTkxQ56Lwfc58juBk0ogr+bqyF1uPjFUxv5omqtRg0I0v2MIEya6LQRljoBEIoa+5iCiufsfgwo8yAu6HPFLbFwCdMhbCVvdHino3hiE4XrN9+aluo0pSg9I33ABi6cLbnyvxjVYr5hYmEkNIS/Sd7Ld789XXXMiJguucVrr5s+sYyYmFIsDsBZEj8xPLmS21jNCfZecCYtSWjvP19Rpdt6XVyZvqXw2VOiLQw+T9C7BBOEd8ZEgC9aSPqb81DTOYbIp5MKhOWESI3pMI5xFErA8+w5JYQ2P2cr26mDzu7TpbTqG64kWE0Usp4zRw1hgM3MeuXn5feJcKAgTFzzBkicUh0OekqFZ94h/goKdbZiRb+scux5Vjk3yfdcaZi6+dj9GiaeH8nC9LH9CSGAuUJc6MBslthWh/D8XE2M2k0+angYPhEVTiLX2DZ/aVnd4cho3FpNm4JZqZCMkTs3RfjpN3bmDM3zWJFWnlM54gNoHqjMPh7wRP4+QeI7V179+bl1VXs9wsXtt5AtQEO/CX9WwqTEveFb8FF1mGoQlykV+wxPDzgXTd19fWU2QVmE7Y0EUwosWBj9lSo7UVsN+5n4F0NwRWsXk4ytIMEotChJ8+UNTr4sN25qAMiv Fnp9oFcc 912kM8nLuFBOgRSxiR85QhYlvoUY4Mt4/ynH3oq/ohaB72gxwLfaYnRhc8mFULVtWTKaXJgteMoSqXFghvRqLCI1d8WqeHdAfL3DNtHNt8XwgFMFkxFJGwUVyUYhVusaBREIMdwgPboIhSsw7HsDGkN69wa5birTlIlsO9X7M9KYTX/BdeNcGS0an8yfwB/usvWo1GOlKKrrlh6qgFZuYFXSAlER4qGyXX/I0BDliEq5xmOykeoQn1gm/mIEsNAktl2lYxc4RmFOzCfz9nV9pK5vrVyIeLUMpBZlW7ha4XmdwMxPuFI3JtWX4JZLxCTea+e6/ X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Hi Dan, Thank you for the patch. On Mon, Dec 11, 2023 at 6:04=E2=80=AFAM Dan Schatzberg wrote: > > Allow proactive reclaimers to submit an additional swappiness=3D > argument to memory.reclaim. This overrides the global or per-memcg > swappiness setting for that reclaim attempt. I am curious what prompted you to develop this patch. I understand what this patch does, just want to know more of your background story why this is needed. > > For example: > > echo "2M swappiness=3D0" > /sys/fs/cgroup/memory.reclaim > > will perform reclaim on the rootcg with a swappiness setting of 0 (no > swap) regardless of the vm.swappiness sysctl setting. > > Signed-off-by: Dan Schatzberg > --- > Documentation/admin-guide/cgroup-v2.rst | 15 ++++++- > include/linux/swap.h | 3 +- > mm/memcontrol.c | 55 ++++++++++++++++++++----- > mm/vmscan.c | 13 +++++- > 4 files changed, 70 insertions(+), 16 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admi= n-guide/cgroup-v2.rst > index 3f85254f3cef..fc2b379dbd0f 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1282,8 +1282,8 @@ PAGE_SIZE multiple when read back. > This is a simple interface to trigger memory reclaim in the > target cgroup. > > - This file accepts a single key, the number of bytes to reclaim. > - No nested keys are currently supported. > + This file accepts a string which containers thhe number of bytes Just as Yosry points out, there is some typo here. "contains the" > + to reclaim. > > Example:: > > @@ -1304,6 +1304,17 @@ PAGE_SIZE multiple when read back. > This means that the networking layer will not adapt based on > reclaim induced by memory.reclaim. > > + This file also allows the user to specify the swappiness value > + to be used for the reclaim. For example: > + > + echo "1G swappiness=3D60" > memory.reclaim > + > + The above instructs the kernel to perform the reclaim with > + a swappiness value of 60. Note that this has the same semantics > + as the vm.swappiness sysctl - it sets the relative IO cost of > + reclaiming anon vs file memory but does not allow for reclaiming > + specific amounts of anon or file memory. > + > memory.peak > A read-only single value file which exists on non-root > cgroups. > diff --git a/include/linux/swap.h b/include/linux/swap.h > index f6dd6575b905..ebc20d094609 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -410,7 +410,8 @@ extern unsigned long try_to_free_pages(struct zonelis= t *zonelist, int order, > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem= cg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_op= tions); > + unsigned int reclaim_op= tions, > + int swappiness); > extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, > gfp_t gfp_mask, bool nosw= ap, > pg_data_t *pgdat, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 1c1061df9cd1..74598c17d3cc 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -63,6 +63,7 @@ > #include > #include > #include > +#include > #include > #include "internal.h" > #include > @@ -2449,7 +2450,7 @@ static unsigned long reclaim_high(struct mem_cgroup= *memcg, > psi_memstall_enter(&pflags); > nr_reclaimed +=3D try_to_free_mem_cgroup_pages(memcg, nr_= pages, > gfp_mask, > - MEMCG_RECLAIM_MAY= _SWAP); > + MEMCG_RECLAIM_MAY= _SWAP, -1); Instead of passing -1, maybe we can use mem_cgroup_swappiness(memcg); And maybe remove the -1 test from the get_scan_count(). "" static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, unsigned long *nr) { struct pglist_data *pgdat =3D lruvec_pgdat(lruvec); struct mem_cgroup *memcg =3D lruvec_memcg(lruvec); unsigned long anon_cost, file_cost, total_cost; int swappiness =3D sc->swappiness !=3D -1 ? sc->swappiness : mem_cgroup_swappiness(memcg); "" In other words, I feel it is cleaner if try_to_free_mem_cgroup_pages accept the swappiness as it is. There is no second guessing the value if it is -1, then the internal function gets the swappiness from mem_cgroup_swappiness(). Maybe we can completely remove the -1 special default value? > psi_memstall_leave(&pflags); > } while ((memcg =3D parent_mem_cgroup(memcg)) && > !mem_cgroup_is_root(memcg)); > @@ -2740,7 +2741,7 @@ static int try_charge_memcg(struct mem_cgroup *memc= g, gfp_t gfp_mask, > > psi_memstall_enter(&pflags); > nr_reclaimed =3D try_to_free_mem_cgroup_pages(mem_over_limit, nr_= pages, > - gfp_mask, reclaim_opt= ions); > + gfp_mask, reclaim_opt= ions, -1); Same here. > psi_memstall_leave(&pflags); > > if (mem_cgroup_margin(mem_over_limit) >=3D nr_pages) > @@ -3660,7 +3661,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup = *memcg, > } > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - memsw ? 0 : MEMCG_RECLAIM_MAY_SWA= P)) { > + memsw ? 0 : MEMCG_RECLAIM_MAY_SWA= P, -1)) { Same here. > ret =3D -EBUSY; > break; > } > @@ -3774,7 +3775,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup= *memcg) > return -EINTR; > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - MEMCG_RECLAIM_MAY_SWAP)= ) > + MEMCG_RECLAIM_MAY_SWAP,= -1)) Here too. > nr_retries--; > } > > @@ -6720,7 +6721,7 @@ static ssize_t memory_high_write(struct kernfs_open= _file *of, > } > > reclaimed =3D try_to_free_mem_cgroup_pages(memcg, nr_page= s - high, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWA= P); > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWA= P, -1); Here too. > > if (!reclaimed && !nr_retries--) > break; > @@ -6769,7 +6770,7 @@ static ssize_t memory_max_write(struct kernfs_open_= file *of, > > if (nr_reclaims) { > if (!try_to_free_mem_cgroup_pages(memcg, nr_pages= - max, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWA= P)) > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWA= P, -1)) Here too. > nr_reclaims--; > continue; > } > @@ -6895,6 +6896,16 @@ static ssize_t memory_oom_group_write(struct kernf= s_open_file *of, > return nbytes; > } > > +enum { > + MEMORY_RECLAIM_SWAPPINESS =3D 0, > + MEMORY_RECLAIM_NULL, > +}; > + > +static const match_table_t if_tokens =3D { What this is called "if_tokens"? I am trying to figure out what "if" refers= to. > + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=3D%d"}, > + { MEMORY_RECLAIM_NULL, NULL }, > +}; > + Do we foresee a lot of tunable for the try to free page? I see. You want to use match_token() to do the keyword parsing. > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > size_t nbytes, loff_t off) > { > @@ -6902,12 +6913,33 @@ static ssize_t memory_reclaim(struct kernfs_open_= file *of, char *buf, > unsigned int nr_retries =3D MAX_RECLAIM_RETRIES; > unsigned long nr_to_reclaim, nr_reclaimed =3D 0; > unsigned int reclaim_options; > - int err; > + char *old_buf, *start; > + substring_t args[MAX_OPT_ARGS]; > + int swappiness =3D -1; > > buf =3D strstrip(buf); > - err =3D page_counter_memparse(buf, "", &nr_to_reclaim); > - if (err) > - return err; > + > + old_buf =3D buf; > + nr_to_reclaim =3D memparse(buf, &buf) / PAGE_SIZE; > + if (buf =3D=3D old_buf) > + return -EINVAL; > + > + buf =3D strstrip(buf); > + > + while ((start =3D strsep(&buf, " ")) !=3D NULL) { > + if (!strlen(start)) > + continue; > + switch (match_token(start, if_tokens, args)) { > + case MEMORY_RECLAIM_SWAPPINESS: > + if (match_int(&args[0], &swappiness)) > + return -EINVAL; > + if (swappiness < 0 || swappiness > 200) Agree with Yosry on the 200 magic value. I am also wondering if there is an easier way to just parse one keyword. Will using strcmp("swappiness=3D") be a bad idea? I haven't tried it myself though. > + return -EINVAL; > + break; > + default: > + return -EINVAL; > + } > + } > > reclaim_options =3D MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACT= IVE; > while (nr_reclaimed < nr_to_reclaim) { > @@ -6926,7 +6958,8 @@ static ssize_t memory_reclaim(struct kernfs_open_fi= le *of, char *buf, > > reclaimed =3D try_to_free_mem_cgroup_pages(memcg, > min(nr_to_reclaim - nr_reclaimed,= SWAP_CLUSTER_MAX), > - GFP_KERNEL, reclaim_options); > + GFP_KERNEL, reclaim_options, > + swappiness); > > if (!reclaimed && !nr_retries--) > return -EAGAIN; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 506f8220c5fe..a20965b20177 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -136,6 +136,9 @@ struct scan_control { > /* Always discard instead of demoting to lower tier memory */ > unsigned int no_demotion:1; > > + /* Swappiness value for reclaim, if -1 use memcg/global value */ > + int swappiness; > + It would be great if we can get rid of the -1 case. > /* Allocation order */ > s8 order; > > @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, s= truct scan_control *sc, > struct pglist_data *pgdat =3D lruvec_pgdat(lruvec); > struct mem_cgroup *memcg =3D lruvec_memcg(lruvec); > unsigned long anon_cost, file_cost, total_cost; > - int swappiness =3D mem_cgroup_swappiness(memcg); > + int swappiness =3D sc->swappiness !=3D -1 ? > + sc->swappiness : mem_cgroup_swappiness(memcg); Let's see if we can get rid of the -1. Now I see the -1 is actually introduced by this patch, should be doable. > u64 fraction[ANON_AND_FILE]; > u64 denominator =3D 0; /* gcc */ > enum scan_balance scan_balance; > @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, st= ruct scan_control *sc) > mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) > return 0; > > + if (sc->swappiness !=3D -1) > + return sc->swappiness; > + Get rid of that too. Thank you. Chris > return mem_cgroup_swappiness(memcg); > } > > @@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgr= oup *memcg, > unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options) > + unsigned int reclaim_options, > + int swappiness) > { > unsigned long nr_reclaimed; > unsigned int noreclaim_flag; > @@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct m= em_cgroup *memcg, > .may_unmap =3D 1, > .may_swap =3D !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP= ), > .proactive =3D !!(reclaim_options & MEMCG_RECLAIM_PROACTI= VE), > + .swappiness =3D swappiness, > }; > /* > * Traverse the ZONELIST_FALLBACK zonelist of the current node to= put > -- > 2.34.1 > >