From: Suren Baghdasaryan <surenb@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Tejun Heo <tj@kernel.org>, Daniel Drake <drake@endlessm.com>,
Vinayak Menon <vinmenon@codeaurora.org>,
Christopher Lameter <cl@linux.com>,
Mike Galbraith <efault@gmx.de>,
Shakeel Butt <shakeelb@google.com>,
Peter Enderborg <peter.enderborg@sony.com>,
linux-mm <linux-mm@kvack.org>,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
kernel-team@fb.com
Subject: Re: [PATCH 2/9] mm: workingset: tell cache transitions from workingset thrashing
Date: Wed, 1 Aug 2018 14:56:27 -0700
Message-ID: <CAJuCfpGZGPD+k+jHDowWyvZPnUXzQ9n98wBycDZLAWOn=vV6Ew@mail.gmail.com>
In-Reply-To: <20180801151958.32590-3-hannes@cmpxchg.org>
Hi Johannes,

Thanks for the new patchset! A small nit below, spotted while
reviewing and backporting the patches:

On Wed, Aug 1, 2018 at 8:19 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Refaults happen during transitions between workingsets as well as
> in-place thrashing. Knowing the difference between the two has a range
> of applications, including measuring the impact of memory shortage on
> the system performance, as well as the ability to smarter balance
> pressure between the filesystem cache and the swap-backed workingset.
>
> During workingset transitions, inactive cache refaults and pushes out
> established active cache. When that active cache isn't stale, however,
> and also ends up refaulting, that's bonafide thrashing.
>
> Introduce a new page flag that tells on eviction whether the page has
> been active or not in its lifetime. This bit is then stored in the
> shadow entry, to classify refaults as transitioning or thrashing.
>
> How many page->flags does this leave us with on 32-bit?
>
> 20 bits are always page flags
>
> 21 if you have an MMU
>
> 23 with the zone bits for DMA, Normal, HighMem, Movable
>
> 29 with the sparsemem section bits
>
> 30 if PAE is enabled
>
> 31 with this patch.
>
> So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
> nodes. If that's not enough, the system can switch to discontigmem and
> re-gain the 6 or 7 sparsemem section bits.
>
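
For anyone else double-checking the budget while backporting, here is
my own back-of-the-envelope version of the accounting above - the
enum and its names are mine, not from the patch:

    /* Rough 32-bit PAE page->flags budget, per the commit message */
    enum {
            BASE_FLAGS     = 20, /* always-present page flags       */
            MMU_FLAG       = 1,  /* +1 with an MMU                  */
            ZONE_BITS      = 2,  /* DMA, Normal, HighMem, Movable   */
            SECTION_BITS   = 6,  /* sparsemem section index         */
            PAE_BIT        = 1,  /* extra bit when PAE is enabled   */
            WORKINGSET_BIT = 1,  /* added by this patch             */
            USED_BITS      = BASE_FLAGS + MMU_FLAG + ZONE_BITS +
                             SECTION_BITS + PAE_BIT + WORKINGSET_BIT,
            /* 32 - USED_BITS == 1 bit left for the NUMA node */
    };

which indeed leaves a single bit for the node on 32-bit PAE, matching
the numbers above.
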
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> include/linux/mmzone.h | 1 +
> include/linux/page-flags.h | 5 +-
> include/linux/swap.h | 2 +-
> include/trace/events/mmflags.h | 1 +
> mm/filemap.c | 9 ++--
> mm/huge_memory.c | 1 +
> mm/memcontrol.c | 2 +
> mm/migrate.c | 2 +
> mm/swap_state.c | 1 +
> mm/vmscan.c | 1 +
> mm/vmstat.c | 1 +
> mm/workingset.c | 95 ++++++++++++++++++++++------------
> 12 files changed, 79 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 32699b2dc52a..6af87946d241 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -163,6 +163,7 @@ enum node_stat_item {
> NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
> WORKINGSET_REFAULT,
> WORKINGSET_ACTIVATE,
> + WORKINGSET_RESTORE,
> WORKINGSET_NODERECLAIM,
> NR_ANON_MAPPED, /* Mapped anonymous pages */
> NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e34a27727b9a..7af1c3c15d8e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -69,13 +69,14 @@
> */
> enum pageflags {
> PG_locked, /* Page is locked. Don't touch. */
> - PG_error,
> PG_referenced,
> PG_uptodate,
> PG_dirty,
> PG_lru,
> PG_active,
> + PG_workingset,
> PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
> + PG_error,
> PG_slab,
> PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
> PG_arch_1,
> @@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
> PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
> PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
> TESTCLEARFLAG(Active, active, PF_HEAD)
> +PAGEFLAG(Workingset, workingset, PF_HEAD)
> + TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
> __PAGEFLAG(Slab, slab, PF_NO_TAIL)
> __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
> PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
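
As a note for other backporters: if I'm reading the page-flags macros
right, the two new lines above should generate roughly this accessor
API (paraphrased by me, not the literal macro expansion):

    int  PageWorkingset(struct page *page);          /* test         */
    void SetPageWorkingset(struct page *page);       /* set          */
    void ClearPageWorkingset(struct page *page);     /* clear        */
    int  TestClearPageWorkingset(struct page *page); /* test + clear */

which matches the PageWorkingset()/SetPageWorkingset() call sites
further down in the patch.
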
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2417d288e016..d8c47dcdec6f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -296,7 +296,7 @@ struct vma_swap_readahead {
>
> /* linux/mm/workingset.c */
> void *workingset_eviction(struct address_space *mapping, struct page *page);
> -bool workingset_refault(void *shadow);
> +void workingset_refault(struct page *page, void *shadow);
> void workingset_activation(struct page *page);
>
> /* Do not use directly, use workingset_lookup_update */
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index a81cffb76d89..a1675d43777e 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -88,6 +88,7 @@
> {1UL << PG_dirty, "dirty" }, \
> {1UL << PG_lru, "lru" }, \
> {1UL << PG_active, "active" }, \
> + {1UL << PG_workingset, "workingset" }, \
> {1UL << PG_slab, "slab" }, \
> {1UL << PG_owner_priv_1, "owner_priv_1" }, \
> {1UL << PG_arch_1, "arch_1" }, \
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0604cb02e6f3..bd36b7226cf4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -915,12 +915,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
> * data from the working set, only to cache data that will
> * get overwritten with something else, is a waste of memory.
> */
> - if (!(gfp_mask & __GFP_WRITE) &&
> - shadow && workingset_refault(shadow)) {
> - SetPageActive(page);
> - workingset_activation(page);
> - } else
> - ClearPageActive(page);
> + WARN_ON_ONCE(PageActive(page));
> + if (!(gfp_mask & __GFP_WRITE) && shadow)
> + workingset_refault(page, shadow);
> lru_cache_add(page);
> }
> return ret;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b9f3dbd885bd..c67ecf77ea8b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2370,6 +2370,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
> (1L << PG_mlocked) |
> (1L << PG_uptodate) |
> (1L << PG_active) |
> + (1L << PG_workingset) |
> (1L << PG_locked) |
> (1L << PG_unevictable) |
> (1L << PG_dirty)));
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2bd3df3d101a..c59519d600ea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5283,6 +5283,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
> stat[WORKINGSET_REFAULT]);
> seq_printf(m, "workingset_activate %lu\n",
> stat[WORKINGSET_ACTIVATE]);
> + seq_printf(m, "workingset_restore %lu\n",
> + stat[WORKINGSET_RESTORE]);
> seq_printf(m, "workingset_nodereclaim %lu\n",
> stat[WORKINGSET_NODERECLAIM]);
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8c0af0f7cab1..a6a9114e62dc 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -682,6 +682,8 @@ void migrate_page_states(struct page *newpage, struct page *page)
> SetPageActive(newpage);
> } else if (TestClearPageUnevictable(page))
> SetPageUnevictable(newpage);
> + if (PageWorkingset(page))
> + SetPageWorkingset(newpage);
> if (PageChecked(page))
> SetPageChecked(newpage);
> if (PageMappedToDisk(page))
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 07f9aa2340c3..2721ef8862d1 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -451,6 +451,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> /*
> * Initiate read into locked page and return.
> */
> + SetPageWorkingset(new_page);
> lru_cache_add_anon(new_page);
> *new_page_allocated = true;
> return new_page;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9270a4370d54..8d1ad48ffbcd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1976,6 +1976,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
> }
>
> ClearPageActive(page); /* we are de-activating */
> + SetPageWorkingset(page);
> list_add(&page->lru, &l_inactive);
> }
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index a2b9518980ce..507dc9c01b88 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1145,6 +1145,7 @@ const char * const vmstat_text[] = {
> "nr_isolated_file",
> "workingset_refault",
> "workingset_activate",
> + "workingset_restore",
> "workingset_nodereclaim",
> "nr_anon_pages",
> "nr_mapped",
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 53759a3cf99a..ef6be3d92116 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -121,7 +121,7 @@
> * the only thing eating into inactive list space is active pages.
> *
> *
> - * Activating refaulting pages
> + * Refaulting inactive pages
> *
> * All that is known about the active list is that the pages have been
> * accessed more than once in the past. This means that at any given
> @@ -134,6 +134,10 @@
> * used less frequently than the refaulting page - or even not used at
> * all anymore.
> *
> + * That means if inactive cache is refaulting with a suitable refault
> + * distance, we assume the cache workingset is transitioning and put
> + * pressure on the current active list.
> + *
> * If this is wrong and demotion kicks in, the pages which are truly
> * used more frequently will be reactivated while the less frequently
> * used once will be evicted from memory.
> @@ -141,6 +145,14 @@
> * But if this is right, the stale pages will be pushed out of memory
> * and the used pages get to stay in cache.
> *
> + * Refaulting active pages
> + *
> + * If on the other hand the refaulting pages have recently been
> + * deactivated, it means that the active list is no longer protecting
> + * actively used cache from reclaim. The cache is NOT transitioning to
> + * a different workingset; the existing workingset is thrashing in the
> + * space allocated to the page cache.
> + *
> *
> * Implementation
> *
> @@ -156,8 +168,7 @@
> */
>
> #define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \
> - NODES_SHIFT + \
> - MEM_CGROUP_ID_SHIFT)
> + 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
> #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
>
> /*
> @@ -170,23 +181,28 @@
> */
> static unsigned int bucket_order __read_mostly;
>
> -static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
> +static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
> + bool workingset)
> {
> eviction >>= bucket_order;
> eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
> eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> + eviction = (eviction << 1) | workingset;
> eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
>
> return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
> }
>
> static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> - unsigned long *evictionp)
> + unsigned long *evictionp, bool *workingsetp)
> {
> unsigned long entry = (unsigned long)shadow;
> int memcgid, nid;
> + bool workingset;
>
> entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> + workingset = entry & 1;
> + entry >>= 1;
> nid = entry & ((1UL << NODES_SHIFT) - 1);
> entry >>= NODES_SHIFT;
> memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
> @@ -195,6 +211,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> *memcgidp = memcgid;
> *pgdat = NODE_DATA(nid);
> *evictionp = entry << bucket_order;
> + *workingsetp = workingset;
> }
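
While backporting I wrote a tiny userspace model of the new shadow
entry layout to convince myself that the extra workingset bit (which
the EVICTION_SHIFT change above pays for) round-trips cleanly;
sharing it in case it's useful. The shift constants are stand-ins for
the config-dependent kernel values, and I've left out the
bucket_order rounding for simplicity:

    #include <assert.h>
    #include <stdbool.h>

    #define EXCEPTIONAL_SHIFT 2     /* stand-in values */
    #define EXCEPTIONAL_ENTRY 0x2UL
    #define NODES_SHIFT       6
    #define MEMCG_ID_SHIFT    16

    static unsigned long pack(int memcgid, int nid,
                              unsigned long eviction, bool workingset)
    {
            eviction = (eviction << MEMCG_ID_SHIFT) | memcgid;
            eviction = (eviction << NODES_SHIFT) | nid;
            eviction = (eviction << 1) | workingset;
            return (eviction << EXCEPTIONAL_SHIFT) | EXCEPTIONAL_ENTRY;
    }

    static void unpack(unsigned long entry, int *memcgid, int *nid,
                       unsigned long *eviction, bool *workingset)
    {
            entry >>= EXCEPTIONAL_SHIFT;
            *workingset = entry & 1;
            entry >>= 1;
            *nid = entry & ((1UL << NODES_SHIFT) - 1);
            entry >>= NODES_SHIFT;
            *memcgid = entry & ((1UL << MEMCG_ID_SHIFT) - 1);
            *eviction = entry >> MEMCG_ID_SHIFT;
    }

    int main(void)
    {
            unsigned long ev;
            int memcgid, nid;
            bool ws;

            unpack(pack(42, 1, 12345, true), &memcgid, &nid, &ev, &ws);
            assert(memcgid == 42 && nid == 1 && ev == 12345 && ws);
            return 0;
    }
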
>
> /**
> @@ -207,8 +224,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> */
> void *workingset_eviction(struct address_space *mapping, struct page *page)
> {
> - struct mem_cgroup *memcg = page_memcg(page);
> struct pglist_data *pgdat = page_pgdat(page);
> + struct mem_cgroup *memcg = page_memcg(page);
> int memcgid = mem_cgroup_id(memcg);
> unsigned long eviction;
> struct lruvec *lruvec;
> @@ -220,30 +237,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
>
> lruvec = mem_cgroup_lruvec(pgdat, memcg);
> eviction = atomic_long_inc_return(&lruvec->inactive_age);
> - return pack_shadow(memcgid, pgdat, eviction);
> + return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
> }
>
> /**
> * workingset_refault - evaluate the refault of a previously evicted page
> + * @page: the freshly allocated replacement page
> * @shadow: shadow entry of the evicted page
> *
> * Calculates and evaluates the refault distance of the previously
> * evicted page in the context of the node it was allocated in.
> - *
> - * Returns %true if the page should be activated, %false otherwise.
> */
> -bool workingset_refault(void *shadow)
> +void workingset_refault(struct page *page, void *shadow)
> {
> unsigned long refault_distance;
> + struct pglist_data *pgdat;
> unsigned long active_file;
> struct mem_cgroup *memcg;
> unsigned long eviction;
> struct lruvec *lruvec;
> unsigned long refault;
> - struct pglist_data *pgdat;
> + bool workingset;
> int memcgid;
>
> - unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
> + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
>
> rcu_read_lock();
> /*
> @@ -263,41 +280,51 @@ bool workingset_refault(void *shadow)
> * configurations instead.
> */
> memcg = mem_cgroup_from_id(memcgid);
> - if (!mem_cgroup_disabled() && !memcg) {
> - rcu_read_unlock();
> - return false;
> - }
> + if (!mem_cgroup_disabled() && !memcg)
> + goto out;
> lruvec = mem_cgroup_lruvec(pgdat, memcg);
> refault = atomic_long_read(&lruvec->inactive_age);
> active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
>
> /*
> - * The unsigned subtraction here gives an accurate distance
> - * across inactive_age overflows in most cases.
> + * Calculate the refault distance
> *
> - * There is a special case: usually, shadow entries have a
> - * short lifetime and are either refaulted or reclaimed along
> - * with the inode before they get too old. But it is not
> - * impossible for the inactive_age to lap a shadow entry in
> - * the field, which can then can result in a false small
> - * refault distance, leading to a false activation should this
> - * old entry actually refault again. However, earlier kernels
> - * used to deactivate unconditionally with *every* reclaim
> - * invocation for the longest time, so the occasional
> - * inappropriate activation leading to pressure on the active
> - * list is not a problem.
> + * The unsigned subtraction here gives an accurate distance
> + * across inactive_age overflows in most cases. There is a
> + * special case: usually, shadow entries have a short lifetime
> + * and are either refaulted or reclaimed along with the inode
> + * before they get too old. But it is not impossible for the
> + * inactive_age to lap a shadow entry in the field, which can
> + * then can result in a false small refault distance, leading
"which can then can" - please remove one of the "can".
> + * to a false activation should this old entry actually
> + * refault again. However, earlier kernels used to deactivate
> + * unconditionally with *every* reclaim invocation for the
> + * longest time, so the occasional inappropriate activation
> + * leading to pressure on the active list is not a problem.
> */
> refault_distance = (refault - eviction) & EVICTION_MASK;
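
The unsigned wraparound trick took me a second; a toy example with an
8-bit counter (i.e. EVICTION_MASK == 0xff - my numbers, just for
illustration):

    eviction = 250  (inactive_age when the page was evicted)
    refault  =   4  (inactive_age by the time it refaults, post-wrap)

    (4 - 250) & 0xff == -246 & 0xff == 10

so the distance comes out as the 10 evictions that actually happened
in between, despite the counter overflow.
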
>
> inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
>
> - if (refault_distance <= active_file) {
> - inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> - rcu_read_unlock();
> - return true;
> + /*
> + * Compare the distance to the existing workingset size. We
> + * don't act on pages that couldn't stay resident even if all
> + * the memory was available to the page cache.
> + */
> + if (refault_distance > active_file)
> + goto out;
> +
> + SetPageActive(page);
> + atomic_long_inc(&lruvec->inactive_age);
> + inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +
> + /* Page was active prior to eviction */
> + if (workingset) {
> + SetPageWorkingset(page);
> + inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> }
> +out:
> rcu_read_unlock();
> - return false;
> }
>
> /**
> --
> 2.18.0
>