From: Shakeel Butt
Date: Tue, 21 Apr 2020 15:39:01 -0700
Subject: Re: [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted
To: Johannes Weiner
Cc: Michal Hocko, Tejun Heo, Jakub Kicinski, Andrew Morton, Linux MM,
    Kernel Team, Chris Down, Cgroups
In-Reply-To: <20200421215946.GA347151@cmpxchg.org>

On Tue, Apr 21, 2020 at 2:59 PM Johannes Weiner wrote:
> [snip]
>
> We do control very aggressive batch jobs to the extent where they have
> negligible latency impact on interactive services running on the same
> hosts. All the tools to do that are upstream and/or public, but it's
> still pretty new stuff (memory.low, io.cost, cpu headroom control,
> freezer) and they need to be put together just right.
>
> We're working on a demo application that showcases how it all fits
> together and hope to be ready to publish it soon.
>

That would be awesome.

> [snip]
>
> > What do you mean by not interchangeable? If I keep the hot memory (or
> > workingset) of a job in DRAM and the cold memory in swap, and control
> > the rate of refaults by controlling the definition of cold memory,
> > then I am using DRAM and swap interchangeably and transparently to
> > the job (that is what we actually do).
>
> Right, that's a more precise definition than my randomly chosen "80%"
> number above. There are parts of a workload's memory access curve
> (where x is the distinct data accessed and y is the access frequency)
> that don't need to stay in RAM permanently and can be fetched on
> demand from secondary storage without violating the workload's
> throughput/latency requirements. For that part, RAM, swap and disk can
> be interchangeable.
>
> I was specifically talking about the other half of that curve, and
> meant to imply that that's usually bigger than 20%. Usually ;-)
>
> I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't
> matter whether it gets it in ram or in swap. There is a line somewhere
> in between, and it'll vary with workload requirements, access patterns
> and IO speed. But no workload can actually run with 10G of swap and 0
> bytes worth of direct access memory, right?

Yes.

>
> Since you said before that you're using combined memory+swap limits,
> I'm assuming that you configure the resource as interchangeable, but
> still have some way of determining where that cutoff line between them
> is - either by tuning proactive reclaim toward that line, or by having
> OOM kill policies for when the line is crossed and latencies are
> violated?
>

Yes, more specifically by tuning proactive reclaim towards that line.
We define that line in terms of an acceptable refault rate for the job.
The acceptable refault rate is measured through re-use and idle page
histograms (these histograms are collected through our internal
implementation of Page Idle Tracking). I am planning to upstream and
open-source these.
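I can't share the histogram collection itself yet, but the underlying
mechanism is essentially what the upstream idle page tracking interface
(/sys/kernel/mm/page_idle/bitmap, CONFIG_IDLE_PAGE_TRACKING) already
provides. Below is only a rough sketch of sampling it: a real collector
would typically walk /proc/<pid>/pagemap to attribute frames to jobs
and bucket them by idle age, while this sketch just marks an arbitrary
raw PFN range idle and counts what is still idle after an interval.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Rough sketch only: mark a raw PFN range idle through the upstream
 * /sys/kernel/mm/page_idle/bitmap interface and count how much of it
 * is still idle after a sampling interval.  Each 64-bit word in the
 * bitmap covers 64 page frames; frames that are not on an LRU list
 * never report as idle.  Needs CONFIG_IDLE_PAGE_TRACKING and root.
 * The PFN range below is an arbitrary example.
 */

#define BITMAP_PATH "/sys/kernel/mm/page_idle/bitmap"

static int mark_idle(int fd, uint64_t start_pfn, uint64_t nr_pfns)
{
        uint64_t all_ones = ~0ULL;

        /* one 8-byte word per 64 PFNs; writes must be 8-byte aligned */
        for (uint64_t pfn = start_pfn; pfn < start_pfn + nr_pfns; pfn += 64)
                if (pwrite(fd, &all_ones, 8, pfn / 64 * 8) != 8)
                        return -1;
        return 0;
}

static uint64_t count_still_idle(int fd, uint64_t start_pfn, uint64_t nr_pfns)
{
        uint64_t idle = 0, word;

        /* a bit still set means the frame was not referenced since marking */
        for (uint64_t pfn = start_pfn; pfn < start_pfn + nr_pfns; pfn += 64) {
                if (pread(fd, &word, 8, pfn / 64 * 8) != 8)
                        break;
                idle += __builtin_popcountll(word);
        }
        return idle;
}

int main(void)
{
        uint64_t start_pfn = 1ULL << 20;  /* arbitrary: PFN 1M, i.e. 4GB */
        uint64_t nr_pfns = 1ULL << 20;    /* arbitrary: 4GB worth of frames */
        int fd = open(BITMAP_PATH, O_RDWR);

        if (fd < 0) {
                perror(BITMAP_PATH);
                return 1;
        }
        if (mark_idle(fd, start_pfn, nr_pfns)) {
                perror("pwrite");
                return 1;
        }
        sleep(60);      /* sampling interval, pick to taste */
        printf("still idle after 60s: %llu of %llu pages\n",
               (unsigned long long)count_still_idle(fd, start_pfn, nr_pfns),
               (unsigned long long)nr_pfns);
        close(fd);
        return 0;
}

Sampling something like this periodically, per job, and splitting the
still-idle counts into age buckets is what gives the histogram shape.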
> > I am also wondering if you guys explored the in-memory compression
> > based swap medium and if there are any reasons not to follow that
> > route.
>
> We played around with it, but I'm ambivalent about it.
>
> You need to identify that perfect "warm" middle section of the
> workingset curve that is 1) cold enough to not need permanent direct
> access memory, yet 2) warm enough to justify allocating RAM to it.
>
> A lot of our workloads have a distinguishable hot set and various
> amounts of fairly cold data during stable states, with not too much
> middle ground in between where compressed swap would really shine.
>
> Do you use compressed swap fairly universally, or more specifically
> for certain workloads?
>

Yes, we are using it fairly universally. There are a few exceptions,
like user-space net and storage drivers.

> > Oh, you mentioned DAX; that brings to mind a very interesting topic.
> > Are you guys exploring the idea of using PMEM as cheap slow memory?
> > It is byte-addressable, so, regarding memcg accounting, will you
> > treat it as memory or as a separate resource like swap in v2? How
> > does your memory overcommit model work with such a type of memory?
>
> I think we (the kernel MM community, not we as in FB) are still some
> ways away from having dynamic/transparent data placement for pmem the
> same way we have it for RAM. But I expect the kernel's high-level
> default strategy to be similar: order virtual memory (the data) by
> access frequency and distribute it across physical memory/storage
> accordingly.
>
> (With pmem being divided into volatile space and filesystem space,
> where the volatile space holds colder anon pages (and, if there is
> still a disk, disk cache), and the sizing decisions between them
> being similar to the ones we use for swap and filesystems today.)
>
> I expect cgroup policy to be separate, because the performance
> difference matters to users. We won't want greedy batch applications
> displacing latency-sensitive ones from RAM into pmem, just like we
> don't want this displacement into secondary storage today. Other than
> that, there isn't too much difference to users, because paging is
> already transparent - an mmap()ed file looks the same whether it's
> backed by RAM, by disk or by pmem. The difference is the access
> latencies and the aggregate throughput loss they add up to. So I
> could see pmem cgroup limits and protections (for the volatile space
> portion) working the same way as the RAM limits and protections we
> have today.
>
> But yeah, I think this is going a bit off topic ;-)

That's really interesting. Thanks for satisfying my curiosity.

thanks,
Shakeel
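P.S. For anyone who wants a quick feel for how well their warm data
compresses before committing to compressed swap: the stock zswap
debugfs counters are enough for a rough ratio. This is just the
upstream interface, not our internal instrumentation; it assumes 4K
pages, and zram users can get the equivalent numbers from
/sys/block/zram0/mm_stat instead.

#include <stdio.h>

/*
 * Quick sketch: read the stock zswap debugfs counters (CONFIG_ZSWAP,
 * debugfs mounted, run as root) and print a rough compression ratio.
 * Assumes a 4K page size.
 */

static unsigned long long read_counter(const char *path)
{
        unsigned long long val = 0;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%llu", &val) != 1)
                        val = 0;
                fclose(f);
        }
        return val;
}

int main(void)
{
        unsigned long long stored =
                read_counter("/sys/kernel/debug/zswap/stored_pages");
        unsigned long long pool =
                read_counter("/sys/kernel/debug/zswap/pool_total_size");

        if (!stored || !pool) {
                fprintf(stderr, "zswap counters unavailable (zswap off, or not root?)\n");
                return 1;
        }
        printf("stored: %llu pages (%llu MB uncompressed)\n",
               stored, stored * 4096 / (1 << 20));
        printf("pool:   %llu MB compressed, ratio ~%.2fx\n",
               pool / (1 << 20), (double)(stored * 4096) / pool);
        return 0;
}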