linux-mm.kvack.org archive mirror
* [PATCH 0/3] memdelay: memory health metric for systems and workloads
@ 2017-07-27 15:30 Johannes Weiner
  2017-07-27 15:30 ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros Johannes Weiner
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Johannes Weiner @ 2017-07-27 15:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

This patch series implements a fine-grained metric for memory
health. It builds on top of the refault detection code to quantify the
time lost on VM events that occur exclusively due to a lack of memory,
and maps it into a percentage of lost walltime for the system and for
cgroups.
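
To make the headline number concrete, here is a minimal userspace
sketch of that mapping. It is purely illustrative; the function and
the sample numbers are made up and are not the interface this series
adds:

  #include <stdio.h>

  /* Share of a sampling window that tasks spent stalled on memory.
   * Illustrative sketch, not the kernel implementation. */
  static double memdelay_pct(double delayed_us, double window_us)
  {
      return window_us > 0 ? delayed_us * 100.0 / window_us : 0.0;
  }

  int main(void)
  {
      /* e.g. 150ms of refault/reclaim stalls in a 1s window -> 15.0% */
      printf("%.1f%% of walltime lost to memory delays\n",
             memdelay_pct(150000, 1000000));
      return 0;
  }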

Rationale

When presented with a Linux system or container executing a workload,
it's hard to judge the health of its memory situation.

The statistics exported by the memory management subsystem can reveal
smoking guns: page reclaim activity, major faults and refaults can be
indicative of an unhealthy memory situation. But they don't actually
quantify the cost a memory shortage imposes on the system or workload.

How bad is it when 2000 pages are refaulting each second? If the data
is stored contiguously on a fast flash drive, it might be okay. If the
data is spread out all over a rotating disk, it could be a problem -
unless the CPUs are still fully utilized, in which case adding memory
wouldn't make things go any faster; the workload would simply end up
waiting on CPU time instead.
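
A rough back-of-the-envelope calculation illustrates the spread (the
device latencies below are assumptions for illustration, not
measurements):

  #include <stdio.h>

  /* Assumed average per-refault read latencies, purely illustrative. */
  #define FLASH_LATENCY_S  0.0001  /* ~100us per 4k read from flash */
  #define DISK_LATENCY_S   0.008   /* ~8ms per random read from rotating disk */

  int main(void)
  {
      double refaults = 2000.0;  /* refaults per second */

      /* 0.20s of aggregate IO wait per second: likely tolerable */
      printf("flash: %.2fs waited per second\n", refaults * FLASH_LATENCY_S);

      /* 16.00s of aggregate IO wait per second: many tasks stalled
       * concurrently - severe thrashing */
      printf("disk:  %.2fs waited per second\n", refaults * DISK_LATENCY_S);
      return 0;
  }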

A previous attempt to provide a health signal from the VM was the
vmpressure interface, 70ddf637eebe ("memcg: add memory.pressure_level
events"). It derives its pressure levels from recently observed
reclaim efficiency: the ratio of pages scanned to pages reclaimed is
translated into low, medium, and critical pressure levels.
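
In simplified form, the calculation looks roughly like this (a
simplified model patterned after mm/vmpressure.c, not the exact kernel
code; the 60 and 95 thresholds follow the upstream defaults):

  enum vmpressure_level {
      VMPRESSURE_LOW,
      VMPRESSURE_MEDIUM,
      VMPRESSURE_CRITICAL,
  };

  /* Pressure is the share of scanned pages that failed to be reclaimed. */
  static enum vmpressure_level vmpressure_calc(unsigned long scanned,
                                               unsigned long reclaimed)
  {
      unsigned long pressure;

      if (!scanned)
          return VMPRESSURE_LOW;
      if (reclaimed > scanned)  /* clamp in case reclaimed over-reports */
          reclaimed = scanned;
      pressure = 100 - reclaimed * 100 / scanned;
      if (pressure >= 95)
          return VMPRESSURE_CRITICAL;
      if (pressure >= 60)
          return VMPRESSURE_MEDIUM;
      return VMPRESSURE_LOW;
  }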

However, the vmpressure scale is too coarse for today's systems. Its
accuracy relies on storage being slow relative to how fast the CPU can
sweep the LRUs, so that when LRU scan cycles outstrip IO completion
rates the reclaim code runs into pages that are still being read from
disk. But as solid-state devices close this speed gap and memory sizes
grow into the hundreds of gigabytes, this effect has almost completely
disappeared: by the time the reclaim scanner runs into in-flight
pages, the tasks in the system are already spending a significant part
of their runtime waiting for refaulting pages. The vmpressure range is
thus compressed into the split second before OOM and misses large,
practically relevant parts of the pressure spectrum.

Knowing the exact time penalty that the kernel's paging activity
imposes on a workload is a powerful tool. It allows users to fine-tune
a workload to the available memory, but also to detect and quantify
minute regressions and improvements in the reclaim and caching
algorithms.

Structure

The first patch cleans up the various loadavg callsites and macros,
since the memdelay averages are going to be tracked using the same
fixed-point code.
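
For reference, these are the duplicated fixed-point helpers as they
appear at the callsites today, plus a small usage example:

  #include <stdio.h>

  #define FSHIFT   11                /* nr of bits of precision */
  #define FIXED_1  (1 << FSHIFT)     /* 1.0 in fixed-point */
  #define LOAD_INT(x)   ((x) >> FSHIFT)
  #define LOAD_FRAC(x)  LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

  int main(void)
  {
      unsigned long avg = 3 * FIXED_1 / 2;  /* 1.50 in fixed-point */

      printf("%lu.%02lu\n", LOAD_INT(avg), LOAD_FRAC(avg));  /* "1.50" */
      return 0;
  }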

The second patch adds a distinction between page cache transitions
(inactive list refaults) and page cache thrashing (active list
refaults), since only the latter are unproductive refaults.
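
The idea, in sketch form (identifiers here are illustrative, not the
actual kernel symbols): the eviction shadow entry remembers whether
the page was on the active list, and a refault only counts as
thrashing if it was:

  #include <stdbool.h>
  #include <stdio.h>

  struct shadow_entry {
      unsigned long eviction;  /* LRU "clock" reading at eviction time */
      bool workingset;         /* page was on the active list when evicted */
  };

  static void count_refault(const struct shadow_entry *shadow)
  {
      if (shadow->workingset)
          puts("thrashing refault (unproductive)");
      else
          puts("cache transition refault (potentially productive)");
  }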

The third patch finally adds the memdelay accounting and interface:
its scheduler side identifies productive and unproductive task states,
and the VM side aggregates them into system and cgroup domain states
and calculates moving averages of the time spent in each state.
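
Roughly, the shape of the accounting looks like this (a sketch with
made-up names, not the patch itself): per-domain task time is
classified by state, and the share of time spent in each state feeds
an exponential moving average patterned after the loadavg code
consolidated in patch 1:

  #define MD_FSHIFT   11
  #define MD_FIXED_1  (1UL << MD_FSHIFT)  /* 1.0 == 100% of walltime */

  enum md_domain_state {
      MD_NONE,  /* no task is delayed by memory */
      MD_SOME,  /* some tasks delayed while others still make progress */
      MD_FULL,  /* all non-idle tasks delayed: a pure memory stall */
  };

  /* One update step of the moving average over the fraction of time
   * (fixed-point) the domain spent in a given state: */
  static unsigned long md_calc_avg(unsigned long avg, unsigned long exp,
                                   unsigned long fraction)
  {
      return (avg * exp + fraction * (MD_FIXED_1 - exp)) >> MD_FSHIFT;
  }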

 arch/powerpc/platforms/cell/spufs/sched.c |   3 -
 arch/s390/appldata/appldata_os.c          |   4 -
 drivers/cpuidle/governors/menu.c          |   4 -
 fs/proc/array.c                           |   8 +
 fs/proc/base.c                            |   2 +
 fs/proc/internal.h                        |   2 +
 fs/proc/loadavg.c                         |   3 -
 include/linux/cgroup.h                    |  14 ++
 include/linux/memcontrol.h                |  14 ++
 include/linux/memdelay.h                  | 174 +++++++++++++++++
 include/linux/mmzone.h                    |   1 +
 include/linux/page-flags.h                |   5 +-
 include/linux/sched.h                     |  10 +-
 include/linux/sched/loadavg.h             |   3 +
 include/linux/swap.h                      |   2 +-
 include/trace/events/mmflags.h            |   1 +
 kernel/cgroup/cgroup.c                    |   4 +-
 kernel/debug/kdb/kdb_main.c               |   7 +-
 kernel/fork.c                             |   4 +
 kernel/sched/Makefile                     |   2 +-
 kernel/sched/core.c                       |  20 ++
 kernel/sched/memdelay.c                   | 112 +++++++++++
 mm/Makefile                               |   2 +-
 mm/compaction.c                           |   4 +
 mm/filemap.c                              |  18 +-
 mm/huge_memory.c                          |   1 +
 mm/memcontrol.c                           |  25 +++
 mm/memdelay.c                             | 289 ++++++++++++++++++++++++++++
 mm/migrate.c                              |   2 +
 mm/page_alloc.c                           |  11 +-
 mm/swap_state.c                           |   1 +
 mm/vmscan.c                               |  10 +
 mm/vmstat.c                               |   1 +
 mm/workingset.c                           |  98 ++++++----
 34 files changed, 792 insertions(+), 69 deletions(-)


* Re: Detecting page cache thrashing state
@ 2017-09-18 16:34 Johannes Weiner
  2017-09-19 10:55 ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
  2017-09-19 11:02 ` kbuild test robot
  0 siblings, 2 replies; 24+ messages in thread
From: Johannes Weiner @ 2017-09-18 16:34 UTC (permalink / raw)
  To: Taras Kondratiuk
  Cc: Michal Hocko, linux-mm, xe-linux-external, Ruslan Ruslichenko,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1421 bytes --]

Hi Taras,

On Fri, Sep 15, 2017 at 10:28:30AM -0700, Taras Kondratiuk wrote:
> Quoting Michal Hocko (2017-09-15 07:36:19)
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Has somebody faced similar issue? How are you solving it?
> > 
> > Yes this is a pain point for a _long_ time. And we still do not have a
> > good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the ability
> > to reclaim memory to allocate new memory. And that is pretty much true
> > for the pagecache when you are thrashing. So we do not know that
> > basically the whole time is spent refaulting memory back and forth.
> > We do have some refault stats for the page cache, but those are not
> > integrated into the OOM detection logic, because this is really a
> > non-trivial problem to solve without triggering early OOM killer
> > invocations.
> > 
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
> 
> Thanks Michal. memdelay looks promising. We will check it.

Great, I'm obviously interested in more users of it :) Please find
attached the latest version of the patch series based on v4.13.

It needs a bit more refactoring in the scheduler bits before
resubmission, but it already contains a couple of fixes and
improvements since the first version I sent out.

Let me know if you need help rebasing to a different kernel version.

[-- Attachment #2: 0001-sched-loadavg-consolidate-LOAD_INT-LOAD_FRAC-macros.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]




Thread overview: 24+ messages
2017-07-27 15:30 [PATCH 0/3] memdelay: memory health metric for systems and workloads Johannes Weiner
2017-07-27 15:30 ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros Johannes Weiner
2017-07-27 15:30 ` [PATCH 2/3] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
2017-07-27 15:30 ` [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads Johannes Weiner
2017-07-27 15:56   ` Johannes Weiner
2017-07-29  9:10   ` Peter Zijlstra
2017-07-30 15:28     ` Johannes Weiner
2017-07-31  8:31       ` Peter Zijlstra
2017-07-31 18:41         ` Johannes Weiner
2017-07-31 19:49           ` Mike Galbraith
2017-07-31 20:38             ` Johannes Weiner
2017-08-01  2:23               ` Mike Galbraith
2017-08-01  7:57           ` Peter Zijlstra
2017-08-01 12:26             ` Johannes Weiner
2017-08-13 14:52               ` Peter Zijlstra
2017-07-29 13:31   ` kbuild test robot
2017-07-27 20:43 ` [PATCH 0/3] memdelay: memory health metric " Andrew Morton
2017-07-28 19:43   ` Johannes Weiner
2017-08-02  8:11     ` Michal Hocko
2017-07-29  2:48 ` Mike Galbraith
2017-07-29  3:21   ` Mike Galbraith
2017-07-29  6:38   ` Mike Galbraith
2017-09-18 16:34 Detecting page cache thrashing state Johannes Weiner
2017-09-19 10:55 ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
2017-09-19 11:02 ` kbuild test robot
