Re: [PATCH] mm: deduct the number of pages reclaimed by madvise from workingset

linux-mm.kvack.org archive mirror

From: Zhaoyang Huang <huangzhaoyang@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: "zhaoyang.huang" <zhaoyang.huang@unisoc.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Suren Baghdasaryan <surenb@google.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 ke.wang@unisoc.com
Subject: Re: [PATCH] mm: deduct the number of pages reclaimed by madvise from workingset
Date: Fri, 26 May 2023 14:38:38 +0800
Message-ID: <CAGWkznE0bNS6bZE99s1PkWdC9UkTQCC0aWo0pS94n8_nkQv7Rg@mail.gmail.com>
In-Reply-To: <20230525135407.GA31865@cmpxchg.org>

On Thu, May 25, 2023 at 9:54 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, May 24, 2023 at 05:12:54PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > The pages reclaimed by madvise_pageout are made inactive and forcefully
> > dropped from the LRU, which leads to subsequent refaulting pages having a
> > larger refault distance than they should. This can affect the accuracy of
> > thrashing detection when madvise_pageout is used as a common way of
> > reclaiming memory, as Android does now.
>
> This alludes to, but doesn't explain, a real world usecase.
We observe more block IO (wait_on_page_bit_common) during app start on
the latest Android version, where user-space memory reclaim changed
from in-kernel PPR to madvise_pageout. We believe this could be related
to inaccuracy in the workingset.
>
> Yes, madvise_pageout() will record non-resident entries today. This
> means refault and thrash detection is on for user-driven reclaim.
>
> So why is that undesirable?
Consider an extreme scenario: a page at the tail of the LRU could
accumulate a given refault distance without any in-kernel reclaim
taking place, be wrongly deemed inactive, and get less protection.
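
To make the scenario above concrete, here is a minimal userspace sketch
of the refault-distance check. It is a simplified model, not kernel
code: struct lru_model, model_evict() and model_refault_activates() are
hypothetical names, and the single counter stands in for the per-lruvec
non-resident age.

```c
#include <stdbool.h>

/* Simplified userspace model; all names here are illustrative. */
struct lru_model {
	unsigned long nonresident_age;	/* advances on every eviction */
	unsigned long workingset_size;	/* pages that count as "in the workingset" */
};

/* Evict nr_pages and return the clock snapshot a shadow entry would store. */
static unsigned long model_evict(struct lru_model *lru, unsigned long nr_pages)
{
	lru->nonresident_age += nr_pages;
	return lru->nonresident_age;
}

/*
 * On refault, the page is activated only if the clock advanced by no
 * more than the workingset size while the page was non-resident.
 */
static bool model_refault_activates(const struct lru_model *lru,
				    unsigned long shadow)
{
	unsigned long refault_distance = lru->nonresident_age - shadow;

	return refault_distance <= lru->workingset_size;
}
```

In this model, a page evicted just before a large user-driven pageout
sees the clock jump by the size of that pageout, so its refault distance
exceeds workingset_size even though no in-kernel reclaim happened.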
>
> Today we measure and report the cost of reclaim and memory pressure
> for physical memory shortages, cgroup limits, and user-driven cgroup
> reclaim. Why should we not do the same for madv_pageout()? If the
> userspace code that drives pageout has a bug and the result is extreme
> thrashing, wouldn't you want to know that?
Actually, the pages evicted by madv_cold/pageout from active_lru are
not marked as WORKINGSET, so they bypass the thrashing accounting
when they fault back and get stuck on IO. I think they should be
treated in the same way in terms of SetPageWorkingset and
lruvec->non-resident. Please refer to my previous patch "mm: mark
folio as workingset in lru_deactivate_fn" (index 70e2063..4d1c14f
100644).
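
As a rough illustration of the point above, the following userspace
sketch models how the workingset flag travels through eviction and
refault. It is an assumption-laden model, not kernel code: every type
and function name in it is hypothetical.

```c
#include <stdbool.h>

/* Simplified userspace model; all names here are illustrative. */
struct page_model {
	bool active;
	bool workingset;
};

/* What survives in the page cache slot after the page is gone. */
struct shadow_model {
	bool workingset;
};

/* Deactivation by in-kernel reclaim: the page is flagged as workingset. */
static void model_deactivate_reclaim(struct page_model *page)
{
	page->active = false;
	page->workingset = true;
}

/* Deactivation via madvise, as described in this message: no flag is set. */
static void model_deactivate_madvise(struct page_model *page)
{
	page->active = false;
}

/* Eviction copies the workingset flag into the shadow entry. */
static struct shadow_model model_evict_page(const struct page_model *page)
{
	struct shadow_model shadow = { .workingset = page->workingset };

	return shadow;
}

/* A refault is counted as thrashing only if the shadow carries the flag. */
static bool model_refault_is_thrashing(struct shadow_model shadow)
{
	return shadow.workingset;
}
```

In this model, an active page that goes through the madvise path loses
its thrashing accounting on refault, while the same page deactivated by
in-kernel reclaim keeps it.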


>
> Please explain the idea here better.
>
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> >  include/linux/swap.h | 2 +-
> >  mm/madvise.c         | 4 ++--
> >  mm/vmscan.c          | 8 +++++++-
> >  3 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2787b84..0312142 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> >  extern int vm_swappiness;
> >  long remove_mapping(struct address_space *mapping, struct folio *folio);
> >
> > -extern unsigned long reclaim_pages(struct list_head *page_list);
> > +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
> >  #ifdef CONFIG_NUMA
> >  extern int node_reclaim_mode;
> >  extern int sysctl_min_unmapped_ratio;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index b6ea204..61c8d7b 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  huge_unlock:
> >               spin_unlock(ptl);
> >               if (pageout)
> > -                     reclaim_pages(&page_list);
> > +                     reclaim_pages(mm, &page_list);
> >               return 0;
> >       }
> >
> > @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >       arch_leave_lazy_mmu_mode();
> >       pte_unmap_unlock(orig_pte, ptl);
> >       if (pageout)
> > -             reclaim_pages(&page_list);
> > +             reclaim_pages(mm, &page_list);
> >       cond_resched();
> >
> >       return 0;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 20facec..048c10b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
> >       return nr_reclaimed;
> >  }
> >
> > -unsigned long reclaim_pages(struct list_head *folio_list)
> > +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
> >  {
> >       int nid;
> >       unsigned int nr_reclaimed = 0;
> >       LIST_HEAD(node_folio_list);
> >       unsigned int noreclaim_flag;
> > +     struct lruvec *lruvec;
> > +     struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> >
> >       if (list_empty(folio_list))
> >               return nr_reclaimed;
> > @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
> >               }
> >
> >               nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > +             lruvec = &memcg->nodeinfo[nid]->lruvec;
> > +             workingset_age_nonresident(lruvec, -nr_reclaimed);
> >               nid = folio_nid(lru_to_folio(folio_list));
> >       } while (!list_empty(folio_list));
> >
> >       nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > +     lruvec = &memcg->nodeinfo[nid]->lruvec;
> > +     workingset_age_nonresident(lruvec, -nr_reclaimed);
>
> The task might have moved cgroups in between, who knows what kind of
> artifacts it will introduce if you wind back the wrong clock.
>
> If there are reclaim passes that shouldn't participate in non-resident
> tracking, that should be plumbed through the stack to __remove_mapping
> (which already has that bool reclaimed param to not record entries).
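
For reference, the shape of that suggestion can be sketched in userspace
as follows. This only models the idea of a flag that decides whether an
eviction records a non-resident entry and ages the clock; the names
clock_model, cache_slot and model_remove_mapping() are hypothetical and
do not reflect the real __remove_mapping() signature.

```c
#include <stdbool.h>

/* Simplified userspace model; all names here are illustrative. */
struct clock_model {
	unsigned long nonresident_age;	/* eviction clock */
};

struct cache_slot {
	bool has_shadow;		/* true if a non-resident entry was stored */
	unsigned long shadow;		/* clock snapshot taken at eviction */
};

/*
 * Remove a page from its cache slot. With reclaimed == false the page
 * is dropped silently: no shadow entry is stored and the clock is not
 * advanced, so nothing needs to be wound back afterwards.
 */
static void model_remove_mapping(struct clock_model *clk,
				 struct cache_slot *slot,
				 unsigned long nr_pages, bool reclaimed)
{
	slot->has_shadow = false;
	slot->shadow = 0;

	if (!reclaimed)
		return;

	clk->nonresident_age += nr_pages;
	slot->has_shadow = true;
	slot->shadow = clk->nonresident_age;
}
```

The design point is that the decision is made once, at removal time,
instead of adjusting the clock after the fact for a cgroup the task may
no longer belong to.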

