From: Zhaoyang Huang <huangzhaoyang@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: "zhaoyang.huang" <zhaoyang.huang@unisoc.com>,
Andrew Morton <akpm@linux-foundation.org>,
Suren Baghdasaryan <surenb@google.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
ke.wang@unisoc.com
Subject: Re: [PATCH] mm: deduct the number of pages reclaimed by madvise from workingset
Date: Fri, 26 May 2023 14:38:38 +0800 [thread overview]
Message-ID: <CAGWkznE0bNS6bZE99s1PkWdC9UkTQCC0aWo0pS94n8_nkQv7Rg@mail.gmail.com> (raw)
In-Reply-To: <20230525135407.GA31865@cmpxchg.org>
On Thu, May 25, 2023 at 9:54 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, May 24, 2023 at 05:12:54PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > The pages reclaimed by madvise_pageout are made of inactive and dropped from LRU
> > forcefully, which lead to the coming up refault pages possess a large refault
> > distance than it should be. These could affect the accuracy of thrashing when
> > madvise_pageout is used as a common way of memory reclaiming as ANDROID does now.
>
> This alludes to, but doesn't explain, a real world usecase.
More block io(wait_on_page_bit_common) observed during APP start in
latest android version where user space memory reclaiming changes from
in-kernel PPR to madvise_pageout. We believe that it could be related
with inaccuracy of workingset.
>
> Yes, madvise_pageout() will record non-resident entries today. This
> means refault and thrash detection is on for user-driven reclaim.
>
> So why is that undesirable?
Let's raise an extreme scenario, that is, the tail page of LRU could
experience a given refault distance without any in-kernel reclaiming
and be wrongly deemed as inactive and get less protection.
>
> Today we measure and report the cost of reclaim and memory pressure
> for physical memory shortages, cgroup limits, and user-driven cgroup
> reclaim. Why should we not do the same for madv_pageout()? If the
> userspace code that drives pageout has a bug and the result is extreme
> thrashing, wouldn't you want to know that?
Actually, the pages evicted by madv_cold/pageout from active_lru are
not marked as WORKINGSET, which will surpass the thrashing account
when it faults back and gets struck by IO. I think they should be
treated in the same way in terms of SetPageWorkingset and
lruvec->non-resident. Please refer to my previous patch "mm: mark
folio as workingset in lru_deactivate_fn index 70e2063..4d1c14f
100644"
>
> Please explain the idea here better.
>
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> > include/linux/swap.h | 2 +-
> > mm/madvise.c | 4 ++--
> > mm/vmscan.c | 8 +++++++-
> > 3 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2787b84..0312142 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> > extern int vm_swappiness;
> > long remove_mapping(struct address_space *mapping, struct folio *folio);
> >
> > -extern unsigned long reclaim_pages(struct list_head *page_list);
> > +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
> > #ifdef CONFIG_NUMA
> > extern int node_reclaim_mode;
> > extern int sysctl_min_unmapped_ratio;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index b6ea204..61c8d7b 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > huge_unlock:
> > spin_unlock(ptl);
> > if (pageout)
> > - reclaim_pages(&page_list);
> > + reclaim_pages(mm, &page_list);
> > return 0;
> > }
> >
> > @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > arch_leave_lazy_mmu_mode();
> > pte_unmap_unlock(orig_pte, ptl);
> > if (pageout)
> > - reclaim_pages(&page_list);
> > + reclaim_pages(mm, &page_list);
> > cond_resched();
> >
> > return 0;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 20facec..048c10b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
> > return nr_reclaimed;
> > }
> >
> > -unsigned long reclaim_pages(struct list_head *folio_list)
> > +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
> > {
> > int nid;
> > unsigned int nr_reclaimed = 0;
> > LIST_HEAD(node_folio_list);
> > unsigned int noreclaim_flag;
> > + struct lruvec *lruvec;
> > + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> >
> > if (list_empty(folio_list))
> > return nr_reclaimed;
> > @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
> > }
> >
> > nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > + lruvec = &memcg->nodeinfo[nid]->lruvec;
> > + workingset_age_nonresident(lruvec, -nr_reclaimed);
> > nid = folio_nid(lru_to_folio(folio_list));
> > } while (!list_empty(folio_list));
> >
> > nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > + lruvec = &memcg->nodeinfo[nid]->lruvec;
> > + workingset_age_nonresident(lruvec, -nr_reclaimed);
>
> The task might have moved cgroups in between, who knows what kind of
> artifacts it will introduce if you wind back the wrong clock.
>
> If there are reclaim passes that shouldn't participate in non-resident
> tracking, that should be plumbed through the stack to __remove_mapping
> (which already has that bool reclaimed param to not record entries).
next prev parent reply other threads:[~2023-05-26 6:39 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-24 9:12 zhaoyang.huang
2023-05-24 20:40 ` Suren Baghdasaryan
2023-05-25 1:23 ` Zhaoyang Huang
2023-05-25 13:54 ` Johannes Weiner
2023-05-26 6:38 ` Zhaoyang Huang [this message]
2023-05-26 17:31 ` Suren Baghdasaryan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAGWkznE0bNS6bZE99s1PkWdC9UkTQCC0aWo0pS94n8_nkQv7Rg@mail.gmail.com \
--to=huangzhaoyang@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=ke.wang@unisoc.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=surenb@google.com \
--cc=zhaoyang.huang@unisoc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox