From: Yu Zhao <yuzhao@google.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andi Kleen <ak@linux.intel.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>,
	Jesse Barnes <jsbarnes@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Michael Larabel <Michael@michaellarabel.com>,
	Rik van Riel <riel@surriel.com>, Vlastimil Babka <vbabka@suse.cz>,
	Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	page-reclaim@google.com, x86@kernel.org,
	Konstantin Kharlamov <Hi-Angel@yandex.ru>
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging
Date: Fri, 7 Jan 2022 14:12:45 -0700
Message-ID: <Ydisze9ZR/QBtjpX@google.com>
In-Reply-To: <Ydf9RXPch5ddg/WC@dhcp22.suse.cz>

On Fri, Jan 07, 2022 at 09:43:49AM +0100, Michal Hocko wrote:
> On Thu 06-01-22 14:27:52, Yu Zhao wrote:
> > On Thu, Jan 06, 2022 at 05:06:42PM +0100, Michal Hocko wrote:
> [...]
> > > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > > index 2db9a1432511..9c7a4fae0661 100644
> > > > --- a/include/linux/oom.h
> > > > +++ b/include/linux/oom.h
> > > > @@ -57,6 +57,22 @@ struct oom_control {
> > > >  extern struct mutex oom_lock;
> > > >  extern struct mutex oom_adj_mutex;
> > > >  
> > > > +#ifdef CONFIG_MMU
> > > > +extern struct task_struct *oom_reaper_list;
> > > > +extern struct wait_queue_head oom_reaper_wait;
> > > > +
> > > > +static inline bool oom_reaping_in_progress(void)
> > > > +{
> > > > +	/* a racy check can be used to reduce the chance of overkilling */
> > > > +	return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
> > > > +}
> > > > +#else
> > > > +static inline bool oom_reaping_in_progress(void)
> > > > +{
> > > > +	return false;
> > > > +}
> > > > +#endif
> > > 
> > > I do not like this. These are oom reaper internals and no code should
> > > really make any decisions based on that. oom_reaping_in_progress is not
> > > telling much anyway.
> > 
> > There is a perfectly legitimate reason for this.
> > 
> > If there is already an oom kill victim and the oom reaper is making
> > progress, the system may still be under memory pressure until the oom
> > reaping is done. The page reclaim has two choices in this transient
> > state: kill more processes or keep reclaiming (a few more) hot pages.
> > 
> > The first choice, AKA overkilling, is generally a bad one. The oom
> > reaper is single threaded and it can't go faster with additional
> > victims. Additional processes are sacrificed for nothing -- this is
> > an overcorrection of a system that tries to strike a balance between
> > the tendencies to release memory pressure and to improve memory
> > utilization.
> > 
> > > This is a global queue for oom reaper that can
> > > contain oom victims from different oom scopes (e.g. global OOM, memcg
> > > OOM or memory policy OOM).
> > 
> > True, but this is a wrong reason to make the conclusion below. Oom
> > kill scopes do NOT matter; only the pool the freed memory goes into
> > does. And there is only one global pool of free pages.
> > 
> > > Your lru_gen_age_node uses this to decide whether to trigger
> > > out_of_memory and that is clearly wrong for the above reasons.
> > 
> > I hope my explanation above is clear enough. There is nothing wrong
> > with the purpose and the usage of oom_reaping_in_progress(), and it
> > has been well tested in the Arch Linux Zen kernel.
> 
> I disagree. An ongoing oom kill in one domain (say memcg A) shouldn't be
> a basis for any decisions in reclaim in another domain (say memcg B or
> even a global reclaim). Those are fundamentally different conditions.

I agree for the memcg A oom and memcg B reclaim case, because memory
freed from A doesn't go to B.

I still think for the memcg A and the global reclaim case, memory
freed from A can be considered when deciding whether to make more
kills during global reclaim.

But this is something really minor, and I'll go with your suggestion,
i.e., getting rid of oom_reaping_in_progress().

> > Without it, overkills can be easily reproduced by the following simple
> > script. That is, additional oom kills happen to processes other than
> > "tail".
> > 
> >   # enable zram
> >   while true;
> >   do
> >       tail /dev/zero
> >   done
> 
> I would be interested to hear more (care to send oom reports?).

I agree with what you said below. I think those additional ooms might have
been from different oom domains. I plan to leave this for now and go
with your suggestion as mentioned above.

> > > out_of_memory is designed to skip over any action if there is an oom
> > > victim pending from the oom domain (have a look at oom_evaluate_task).
> > 
> > Where exactly? Point me to the code please.
> > 
> > I don't see such a logic inside out_of_memory() or
> > oom_evaluate_task(). Currently the only thing that could remotely
> > prevent overkills is oom_lock. But it's inadequate.
> 
> OK, let me try to explain. The protocol is rather convoluted. Once the
> oom killer is invoked it chooses a victim to kill. oom_evaluate_task will
> evaluate _all_ tasks from the respective oom domain (select_bad_process
> which distinguishes memcg vs global oom kill and oom_cpuset_eligible for
> the cpuset domains). If there is any pre-existing oom victim
> (tsk_is_oom_victim) then the scan is aborted and the oom killer bails
> out. An OOM victim stops being considered as relevant once the oom reaper
> manages to release its address space (or gives up on the mmap_sem
> contention) and sets the MMF_OOM_SKIP flag for the mm.
> 
> That being said, out_of_memory automatically backs off and relies on
> the oom reaper to process its queue.
> 
> Does that make it clearer for you now?

Yes, you are right, thanks.
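
To restate the protocol described above in a form that can be checked,
here is a minimal userspace sketch of the back-off. It is a model under
stated assumptions, not the kernel code: the struct, the "model_" names,
the integer domain id, and the return codes are all hypothetical, and
the badness value is just a number supplied by the caller.

```c
/* Userspace model only; every name here is hypothetical, not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_task {
	int domain;		/* stand-in for the oom domain (memcg, cpuset, global) */
	long badness;		/* stand-in for the oom score; higher is worse */
	bool oom_victim;	/* models tsk_is_oom_victim() */
	bool mm_oom_skip;	/* models MMF_OOM_SKIP being set on the victim's mm */
};

enum {
	MODEL_NO_TASK = -1,	/* nothing eligible in this domain */
	MODEL_ABORT = -2,	/* a victim is still pending; back off */
};

/*
 * Scan all tasks of one domain. Returns the index of the task to kill,
 * MODEL_NO_TASK, or MODEL_ABORT when an earlier victim from the same
 * domain has not been reaped yet.
 */
static int model_select_victim(const struct model_task *tasks, size_t n,
			       int domain)
{
	int chosen = MODEL_NO_TASK;
	long best = 0;

	assert(tasks || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct model_task *t = &tasks[i];

		if (t->domain != domain)
			continue;

		if (t->oom_victim) {
			/* already reaped (or reaper gave up): no longer relevant */
			if (t->mm_oom_skip)
				continue;
			/* pending victim: abort the scan, kill nothing more */
			return MODEL_ABORT;
		}

		if (t->badness > best) {
			best = t->badness;
			chosen = (int)i;
		}
	}

	return chosen;
}
```

In this model a pending victim in domain 0 makes a scan of domain 0
return MODEL_ABORT, while a scan of domain 1 is unaffected, which is the
per-domain behavior the explanation relies on.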

> > This is the entire pipeline:
> > low on memory -> out_of_memory() -> oom_reaper() -> free memory
> > 
> > To avoid overkills, we need to consider the latter half of it too.
> > oom_reaping_in_progress() is exactly for this purpose.
> > 
> > > > +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> > > > +		       unsigned long min_ttl)
> > > > +{
> > > > +	bool need_aging;
> > > > +	long nr_to_scan;
> > > > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > > +	int swappiness = get_swappiness(memcg);
> > > > +	DEFINE_MAX_SEQ(lruvec);
> > > > +	DEFINE_MIN_SEQ(lruvec);
> > > > +
> > > > +	if (mem_cgroup_below_min(memcg))
> > > > +		return false;
> > > 
> > > mem_cgroup_below_min requires effective values to be calculated for the
> > > reclaimed hierarchy. Have a look at mem_cgroup_calculate_protection
> > 
> > I always keep that in mind, and age_lruvec() is called *after*
> > mem_cgroup_calculate_protection():
> 
> >   balance_pgdat()
> >     memcgs_need_aging = 0
> >     do {
> >       lru_gen_age_node()
> >         if (!memcgs_need_aging) {
> >             memcgs_need_aging = 1
> >             return
> >         }
> >         age_lruvec()
> > 
> >       shrink_node_memcgs()
> >         mem_cgroup_calculate_protection()
> >         lru_gen_shrink_lruvec()
> >           if ...
> >             memcgs_need_aging = 0
> >     } while ...
> 
> Uff, this is really subtle. I really think you should be following the
> existing pattern when the effective values are calculated right in the
> same context as they are evaluated.

Consider it done.
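
A rough userspace sketch of the pattern being asked for, i.e.,
refreshing the effective protection right before it is checked rather
than relying on an earlier pass having done so. All "model_" names and
the struct are hypothetical, and the calculation is a gross
simplification (effective min capped by what the parent grants); the
real mem_cgroup_calculate_protection() distributes protection across
siblings.

```c
/* Userspace model only; every name here is hypothetical, not kernel code. */
#include <stdbool.h>

struct model_memcg {
	unsigned long min;	/* configured protection, in pages */
	unsigned long usage;	/* current usage, in pages */
	unsigned long emin;	/* effective protection; stale until refreshed */
};

/* Stand-in for mem_cgroup_calculate_protection(): refresh the effective value. */
static void model_calculate_protection(struct model_memcg *memcg,
				       unsigned long parent_emin)
{
	memcg->emin = memcg->min < parent_emin ? memcg->min : parent_emin;
}

/* Stand-in for mem_cgroup_below_min(): only meaningful if emin is fresh. */
static bool model_below_min(const struct model_memcg *memcg)
{
	return memcg->usage <= memcg->emin;
}

/* Calculate and evaluate in the same context, then decide whether to age. */
static bool model_should_age(struct model_memcg *memcg,
			     unsigned long parent_emin)
{
	model_calculate_protection(memcg, parent_emin);
	return !model_below_min(memcg);
}
```

With a stale emin of 0, model_below_min() misses the protection for a
memcg with min=100 and usage=80; model_should_age() refreshes emin first
and therefore skips aging for it.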

