From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20210406065944.08d8aa76@mail.inbox.lv>
In-Reply-To: <20210406065944.08d8aa76@mail.inbox.lv>
From: Stillinux
Date: Wed, 7 Apr 2021 06:49:17 +0800
Subject: Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop
To: Alexey Avramov
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 liuzhengyuan@kylinos.cn, liuyun01@kylinos.cn
Content-Type: multipart/alternative; boundary="0000000000001e7b9605bf55a0a4"
Sender: owner-linux-mm@kvack.org
Precedence: bulk

--0000000000001e7b9605bf55a0a4
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Alexey,

Thank you for the patch! It looks cool. We will try it to cut down I/O
operations during our high memory pressure test.

After checking our vmcore, we can see that the I/O pressure on our system
comes from swap_writepage and swap_readpage under the shrink list operations.

On Tue, Apr 6, 2021 at 5:59 AM Alexey Avramov <hakavlad@inbox.lv> wrote:
> In the case of high system memory and load pressure, we ran ltp test
> and found that the system was stuck, the direct memory reclaim was
> all stuck in io_schedule

> For the first time involving the swap part, there is no good way to fix
> the problem

The solution is protecting the clean file pages.

Look at this:

> On ChromiumOS, we do not use swap. When memory is low, the only
> way to free memory is to reclaim pages from the file list. This
> results in a lot of thrashing under low memory conditions. We see
> the system become unresponsive for minutes before it eventually OOMs.
> We also see very slow browser tab switching under low memory. Instead
> of an unresponsive system, we'd really like the kernel to OOM as soon
> as it starts to thrash. If it can't keep the working set in memory,
> then OOM. Losing one of many tabs is a better behaviour for the user
> than an unresponsive system.

> This patch creates a new sysctl, min_filelist_kbytes, which disables
> reclaim of file-backed pages when there are less than min_filelist_kbytes
> worth of such pages in the cache. This tunable is handy for low memory
> systems using solid-state storage where interactive response is more important
> than not OOMing.

> With this patch and min_filelist_kbytes set to 50000, I see very little block
> layer activity during low memory. The system stays responsive under low
> memory and browser tab switching is fast. Eventually, a process gets killed
> by OOM. Without this patch, the system gets wedged for minutes before it
> eventually OOMs.

Source: https://lore.kernel.org/patchwork/patch/222042/

This patch can almost completely eliminate thrashing under memory pressure.

Effects
- Improving system responsiveness under low-memory conditions;
- Improving performance in I/O bound tasks under memory pressure;
- OOM killer comes faster (with hard protection);
- Fast system reclaiming after OOM.

Read more: https://github.com/hakavlad/le9-patch
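
The floor described in the ChromiumOS text quoted above can be illustrated
with a small user-space sketch. This is only a reading of that description,
not code from either patch: the helper name file_reclaim_allowed() and its
parameters are hypothetical, and the real kernel logic lives in the reclaim
path rather than in a standalone function.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (an assumption for illustration, not kernel code).
 *
 * Returns true when file-backed pages may still be reclaimed, that is,
 * when the file cache holds at least min_filelist_kbytes. Below the
 * floor, reclaim of file pages is refused, so the system is pushed
 * toward OOM instead of thrashing. A floor of 0 disables the protection.
 */
static bool file_reclaim_allowed(unsigned long file_cache_kbytes,
                                 unsigned long min_filelist_kbytes)
{
    return file_cache_kbytes >= min_filelist_kbytes;
}
```

With the value of 50000 mentioned in the quoted text, a cache of 60000 KiB
would still be eligible for reclaim, while a cache of 40000 KiB would not.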

The patch:

>From 371e3e5290652e97d5279d8cd215cd356c1fb47b Mon Sep 17 00:00:00 2001
From: Alexey Avramov <hakavlad@inbox.lv>
Date: Mon, 5 Apr 2021 01:53:26 +0900
Subject: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
 amount of clean file cache

The kernel does not have a mechanism for targeted protection of clean
file pages (CFP). A certain amount of the CFP is required by the userspace
for normal operation. First of all, you need a cache of shared libraries
and executable files. If the volume of the CFP cache falls below a certain
level, thrashing and even livelock occurs.

Protection of CFP may be used to prevent thrashing and reduce I/O under
memory pressure. Hard protection of CFP may be used to avoid high latency
and prevent livelock in near-OOM conditions. The patch provides sysctl
knobs for protecting the specified amount of clean file cache under memory
pressure.

The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
CFP. The CFP on the current node won't be reclaimed under memory pressure
when their volume is below vm.clean_low_kbytes *unless* we threaten to OOM
or have no swap space or vm.swappiness=0. Setting it to a high value may
result in an early eviction of anonymous pages into the swap space by
attempting to hold the protected amount of clean file pages in memory. The
default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
Kconfig).

The vm.clean_min_kbytes sysctl knob provides *hard* protection of CFP. The
CFP on the current node won't be reclaimed under memory pressure when their
volume is below vm.clean_min_kbytes. Setting it to a high value may result
in an early out-of-memory condition due to the inability to reclaim the
protected amount of CFP when other types of pages cannot be reclaimed. The
default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
Kconfig).

Reported-by: Artem S. Tashkinov <aros@gmx.com>
Signed-off-by: Alexey Avramov <hakavlad@inbox.lv>
---
 Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++++++++++
 include/linux/mm.h                      |  3 ++
 kernel/sysctl.c                         | 14 ++++++++
 mm/Kconfig                              | 35 +++++++++++++++++++
 mm/vmscan.c                             | 59 +++++++++++++++++++++++++++++++++
 5 files changed, 148 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f455fa00c..5d5ddfc85 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,8 @@ Currently, these files are in /proc/sys/vm:

 - admin_reserve_kbytes
 - block_dump
+- clean_low_kbytes
+- clean_min_kbytes
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
@@ -113,6 +115,41 @@ block_dump enables block I/O debugging when set to a nonzero value. More
 information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.


+clean_low_kbytes
+=====================
+
+This knob provides *best-effort* protection of clean file pages. The clean file
+pages on the current node won't be reclaimed uder memory pressure when their
+volume is below vm.clean_low_kbytes *unless* we threaten to OOM or have no
+swap space or vm.swappiness=0.
+
+Protection of clean file pages may be used to prevent thrashing and
+reducing I/O under low-memory conditions.
+
+Setting it to a high value may result in a early eviction of anonymous pages
+into the swap space by attempting to hold the protected amount of clean file
+pages in memory.
+
+The default value is defined by CONFIG_CLEAN_LOW_KBYTES.
+
+
+clean_min_kbytes
+=====================
+
+This knob provides *hard* protection of clean file pages. The clean file pages
+on the current node won't be reclaimed under memory pressure when their volume
+is below vm.clean_min_kbytes.
+
+Hard protection of clean file pages may be used to avoid high latency and
+prevent livelock in near-OOM conditions.
+
+Setting it to a high value may result in a early out-of-memory condition due to
+the inability to reclaim the protected amount of clean file pages when other
+types of pages cannot be reclaimed.
+
+The default value is defined by CONFIG_CLEAN_MIN_KBYTES.
+
+
 compact_memory
 ==============

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3f..7799f1555 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -202,6 +202,9 @@ static inline void __mm_zero_struct_page(struct page *page)

 extern int sysctl_max_map_count;

+extern unsigned long sysctl_clean_low_kbytes;
+extern unsigned long sysctl_clean_min_kbytes;
+
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afad08596..854b311cd 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3083,6 +3083,20 @@ static struct ctl_table vm_table[] = {
         },
 #endif
         {
+               .procname       = "clean_low_kbytes",
+               .data           = &sysctl_clean_low_kbytes,
+               .maxlen         = sizeof(sysctl_clean_low_kbytes),
+               .mode           = 0644,
+               .proc_handler   = proc_doulongvec_minmax,
+       },
+       {
+               .procname       = "clean_min_kbytes",
+               .data           = &sysctl_clean_min_kbytes,
+               .maxlen         = sizeof(sysctl_clean_min_kbytes),
+               .mode           = 0644,
+               .proc_handler   = proc_doulongvec_minmax,
+       },
+       {
                 .procname       = "user_reserve_kbytes",
                 .data           = &sysctl_user_reserve_kbytes,
                 .maxlen         = sizeof(sysctl_user_reserve_kbytes),
diff --git a/mm/Kconfig b/mm/Kconfig
index 390165ffb..3915c71e1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -122,6 +122,41 @@ config SPARSEMEM_VMEMMAP
          pfn_to_page and page_to_pfn operations.  This is the most
          efficient option when sufficient kernel resources are available.

+config CLEAN_LOW_KBYTES
+       int "Default value for vm.clean_low_kbytes"
+       depends on SYSCTL
+       default "0"
+       help
+         The vm.clean_file_low_kbytes sysctl knob provides *best-effort*
+         protection of clean file pages. The clean file pages on the current
+         node won't be reclaimed uder memory pressure when their volume is
+         below vm.clean_low_kbytes *unless* we threaten to OOM or have
+         no swap space or vm.swappiness=0.
+
+         Protection of clean file pages may be used to prevent thrashing and
+         reducing I/O under low-memory conditions.
+
+         Setting it to a high value may result in a early eviction of anonymous
+         pages into the swap space by attempting to hold the protected amount of
+         clean file pages in memory.
+
+config CLEAN_MIN_KBYTES
+       int "Default value for vm.clean_min_kbytes"
+       depends on SYSCTL
+       default "0"
+       help
+         The vm.clean_file_min_kbytes sysctl knob provides *hard* protection
+         of clean file pages. The clean file pages on the current node won't be
+         reclaimed under memory pressure when their volume is below
+         vm.clean_min_kbytes.
+
+         Hard protection of clean file pages may be used to avoid high latency and
+         prevent livelock in near-OOM conditions.
+
+         Setting it to a high value may result in a early out-of-memory condition
+         due to the inability to reclaim the protected amount of clean file pages
+         when other types of pages cannot be reclaimed.
+
 config HAVE_MEMBLOCK_PHYS_MAP
         bool

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b4e31eac..77e98c43e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -120,6 +120,19 @@ struct scan_control {
         /* The file pages on the current node are dangerously low */
         unsigned int file_is_tiny:1;

+       /*
+        * The clean file pages on the current node won't be reclaimed when
+        * their volume is below vm.clean_low_kbytes *unless* we threaten
+        * to OOM or have no swap space or vm.swappiness=0.
+        */
+       unsigned int clean_below_low:1;
+
+       /*
+        * The clean file pages on the current node won't be reclaimed when
+        * their volume is below vm.clean_min_kbytes.
+        */
+       unsigned int clean_below_min:1;
+
         /* Allocation order */
         s8 order;

@@ -166,6 +179,17 @@ struct scan_control {
 #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
 #endif

+#if CONFIG_CLEAN_LOW_KBYTES < 0
+#error "CONFIG_CLEAN_LOW_KBYTES must be >= 0"
+#endif
+
+#if CONFIG_CLEAN_MIN_KBYTES < 0
+#error "CONFIG_CLEAN_MIN_KBYTES must be >= 0"
+#endif
+
+unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
+unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
+
 /*
  * From 0 .. 200.  Higher means more swappy.
  */
@@ -2283,6 +2307,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
         }

         /*
+        * Force-scan anon if clean file pages is under vm.clean_min_kbytes
+        * or vm.clean_low_kbytes (unless the swappiness setting
+        * disagrees with swapping).
+        */
+       if ((sc->clean_below_low || sc->clean_below_min) && swappiness) {
+               scan_balance = SCAN_ANON;
+               goto out;
+       }
+
+       /*
          * If there is enough inactive page cache, we do not reclaim
          * anything from the anonymous working right now.
          */
@@ -2418,6 +2452,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
                         BUG();
                 }

+               /*
+                * Don't reclaim clean file pages when their volume is below
+                * vm.clean_min_kbytes.
+                */
+               if (file && sc->clean_below_min)
+                       scan = 0;
+
                 nr[lru] = scan;
         }
 }
@@ -2768,6 +2809,24 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
                         anon >> sc->priority;
         }

+       if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
+               unsigned long reclaimable_file, dirty, clean;
+
+               reclaimable_file =
+                       node_page_state(pgdat, NR_ACTIVE_FILE) +
+                       node_page_state(pgdat, NR_INACTIVE_FILE) +
+                       node_page_state(pgdat, NR_ISOLATED_FILE);
+               dirty = node_page_state(pgdat, NR_FILE_DIRTY);
+               if (reclaimable_file > dirty)
+                       clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);
+
+               sc->clean_below_low = clean < sysctl_clean_low_kbytes;
+               sc->clean_below_min = clean < sysctl_clean_min_kbytes;
+       } else {
+               sc->clean_below_low = false;
+               sc->clean_below_min = false;
+       }
+
         shrink_node_memcgs(pgdat, sc);

         if (reclaim_state) {
--
2.11.0
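
For readers who want to follow the decision logic of the mm/vmscan.c hunks
without a kernel tree, here is a minimal user-space sketch of it. It is a
paraphrase under stated assumptions, not the patch itself: the names
clean_file_kbytes(), classify_clean(), choose_scan(), struct clean_state and
enum scan_choice are hypothetical, the page counters are passed in as plain
numbers instead of being read with node_page_state(), and the earlier checks
in get_scan_count() (no swap space, priority, file_is_tiny) are left out.
The sketch also assumes that the clean amount counts as zero when the dirty
count is not smaller than the reclaimable file count; the quoted hunk does
not assign `clean` in that case.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; the kernel uses scan_balance and nr[] instead. */
enum scan_choice {
    SCAN_DEFAULT,   /* fall through to the regular balancing logic */
    SCAN_ANON_ONLY, /* corresponds to scan_balance = SCAN_ANON in the patch */
    SCAN_NO_FILE    /* file LRU scan counts are forced to 0 */
};

struct clean_state {
    bool below_low; /* mirrors sc->clean_below_low */
    bool below_min; /* mirrors sc->clean_below_min */
};

/*
 * Clean file cache in KiB: (active + inactive + isolated - dirty) pages,
 * shifted by (page_shift - 10). page_shift is 12 for 4 KiB pages.
 * Assumption: returns 0 when dirty >= reclaimable.
 */
static unsigned long clean_file_kbytes(unsigned long active_file,
                                       unsigned long inactive_file,
                                       unsigned long isolated_file,
                                       unsigned long dirty,
                                       unsigned int page_shift)
{
    unsigned long reclaimable = active_file + inactive_file + isolated_file;

    if (reclaimable <= dirty)
        return 0;
    return (reclaimable - dirty) << (page_shift - 10);
}

/* Compare the clean amount against both knobs; 0 and 0 disables the feature. */
static struct clean_state classify_clean(unsigned long clean_kb,
                                         unsigned long low_kb,
                                         unsigned long min_kb)
{
    struct clean_state st = { false, false };

    if (low_kb || min_kb) {
        st.below_low = clean_kb < low_kb;
        st.below_min = clean_kb < min_kb;
    }
    return st;
}

/* Order of checks follows the two get_scan_count() hunks. */
static enum scan_choice choose_scan(struct clean_state st, int swappiness)
{
    if ((st.below_low || st.below_min) && swappiness)
        return SCAN_ANON_ONLY;
    if (st.below_min)
        return SCAN_NO_FILE;
    return SCAN_DEFAULT;
}
```

Read this way, the low knob only changes behaviour while swappiness is
nonzero, whereas the min knob keeps file pages off limits even with
swappiness set to 0, which matches the best-effort versus hard wording in
the commit message.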

--0000000000001e7b9605bf55a0a4--