From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20220407031525.2368067-1-yuzhao@google.com> <20220407031525.2368067-14-yuzhao@google.com>
 <20220411191639.52c62959489a6c27cb7d251e@linux-foundation.org>
In-Reply-To: <20220411191639.52c62959489a6c27cb7d251e@linux-foundation.org>
From: Yu Zhao <yuzhao@google.com>
Date: Fri, 15 Apr 2022 20:22:42 -0600
Message-ID: <CAOUHufacnY6zMzkMvgHD9_DAwDcnpq7a9YdYT3SKUV8dAi=Fmw@mail.gmail.com>
Subject: Re: [PATCH v10 13/14] mm: multi-gen LRU: admin guide
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@rothwell.id.au>, Linux-MM <linux-mm@kvack.org>, 
	Andi Kleen <ak@linux.intel.com>, Aneesh Kumar <aneesh.kumar@linux.ibm.com>, 
	Barry Song <21cnbao@gmail.com>, Catalin Marinas <catalin.marinas@arm.com>, 
	Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>, 
	Jens Axboe <axboe@kernel.dk>, Jesse Barnes <jsbarnes@google.com>, 
	Johannes Weiner <hannes@cmpxchg.org>, Jonathan Corbet <corbet@lwn.net>, 
	Linus Torvalds <torvalds@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, 
	Mel Gorman <mgorman@suse.de>, Michael Larabel <Michael@michaellarabel.com>, 
	Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>, Rik van Riel <riel@surriel.com>, 
	Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>, Ying Huang <ying.huang@intel.com>, 
	Linux ARM <linux-arm-kernel@lists.infradead.org>, 
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>, linux-kernel <linux-kernel@vger.kernel.org>, 
	Kernel Page Reclaim v2 <page-reclaim@google.com>, "the arch/x86 maintainers" <x86@kernel.org>, 
	Brian Geffon <bgeffon@google.com>, Jan Alexander Steffens <heftig@archlinux.org>, 
	Oleksandr Natalenko <oleksandr@natalenko.name>, Steven Barrett <steven@liquorix.net>, 
	Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>, Donald Carr <d@chaos-reins.com>, 
	=?UTF-8?Q?Holger_Hoffst=C3=A4tte?= <holger@applied-asynchrony.com>, 
	Konstantin Kharlamov <Hi-Angel@yandex.ru>, Shuang Zhai <szhai2@cs.rochester.edu>, 
	Sofia Trinh <sofia.trinh@edi.works>, Vaibhav Jain <vaibhav@linux.ibm.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Mon, Apr 11, 2022 at 8:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  6 Apr 2022 21:15:25 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > +Kill switch
> > +-----------
> > +``enable`` accepts different values to enable or disable the following
>
> It's actually called "enabled".

Good catch. Thanks!

> And I suggest that the file name be
> included right there in the title.  ie.
>
> "enabled": Kill Switch
> ======================

Will do.

> > +Experimental features
> > +=====================
> > +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> > +following subsections. Multiple command lines are supported, so does
> > +concatenation with delimiters ``,`` and ``;``.
> > +
> > +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> > +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> > +evicted generations in this file.
> > +
> > +Working set estimation
> > +----------------------
> > +Working set estimation measures how much memory an application
> > +requires in a given time interval, and it is usually done with little
> > +impact on the performance of the application. E.g., data centers want
> > +to optimize job scheduling (bin packing) to improve memory
> > +utilizations. When a new job comes in, the job scheduler needs to find
> > +out whether each server it manages can allocate a certain amount of
> > +memory for this new job before it can pick a candidate. To do so, this
> > +job scheduler needs to estimate the working sets of the existing jobs.
>
> These various sysfs interfaces are a big deal.  Because they are so
> hard to change once released.

Debugfs, not sysfs. The title is "Experimental features" :)

> btw, what is this "job scheduler" of which you speak?

Basically it's part of cluster management software. Many jobs
(programs + data) can run concurrently in the same cluster and the job
scheduler of this cluster does the bin packing. To improve resource
utilization, the job scheduler needs to know the (memory) size of each
job it packs, hence the working set estimation (how much memory a job
uses within a given time interval). The job scheduler also takes
memory from some jobs so that those jobs can better fit into a single
machine (proactive reclaim).
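
To make the bin packing step concrete, here is a minimal sketch of how
a scheduler might rank servers once it has working set estimates. It is
an illustration only: the Server record, the MiB units and the best-fit
ordering are all assumptions made for this example, not something the
patchset or any particular job scheduler prescribes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    """Hypothetical server record; all sizes are in MiB."""
    name: str
    capacity_mib: int
    # Estimated working set of each job already placed on this server.
    working_sets_mib: List[int] = field(default_factory=list)

    def free_mib(self) -> int:
        # Memory not covered by any existing job's estimated working set.
        return self.capacity_mib - sum(self.working_sets_mib)


def rank_candidates(servers: List[Server], job_mib: int) -> List[Server]:
    """Return the servers that can fit the job, tightest fit first."""
    fits = [s for s in servers if s.free_mib() >= job_mib]
    # Best-fit ordering: the smallest leftover space comes first.
    return sorted(fits, key=lambda s: s.free_mib() - job_mib)
```

The scheduler would try the first server in the returned list and fall
back to the next one if landing the job there does not work out.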

> Is there an open
> source implementation upon which we hope the world will converge?

There are many [1], e.g., Kubernetes (k8s). Personally, I don't think
they'll ever converge.

At the moment, all open source implementations I know of rely on users
manually specifying the size of each job (job spec), e.g., [2]. Users
overprovision memory to avoid OOM kills. As a result, average memory
utilization is generally surprisingly low. What we can hope for is
that eventually some of the open source implementations will use the
working set estimation and proactive reclaim features provided here.

[1] https://en.wikipedia.org/wiki/List_of_cluster_management_software
[2] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

> > +Proactive reclaim
> > +-----------------
> > +Proactive reclaim induces memory reclaim when there is no memory
> > +pressure and usually targets cold memory only. E.g., when a new job
> > +comes in, the job scheduler wants to proactively reclaim memory on the
> > +server it has selected to improve the chance of successfully landing
> > +this new job.
> > +
> > +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> > +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
> > +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> > +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> > +aged and therefore cannot be evicted. ``swappiness`` overrides the
> > +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> > +the number of pages to evict.
> > +
> > +A typical use case is that a job scheduler writes to ``lru_gen``
> > +before it tries to land a new job on a server, and if it fails to
> > +materialize the cold memory without impacting the existing jobs on
> > +this server, it retries on the next server according to the ranking
> > +result obtained from the working set estimation step described
> > +earlier.
>
> It sounds to me that these interfaces were developed in response to
> ongoing development and use of a particular job scheduler.

I did borrow from my previous experience with Google's data centers.
But I'm a Chrome OS developer now, so I designed them to be job
scheduler agnostic :)
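
For reference, the proactive reclaim command quoted above could be put
together along these lines. This is a rough sketch, not part of the
patchset: the function names are made up for the example, and the
caller is assumed to already know max_gen_nr for the target memcg and
node. Only the command layout, the min_gen_nr constraint and the
debugfs path come from the quoted text.

```python
from typing import Optional

# Path taken from the quoted admin guide text.
LRU_GEN_PATH = "/sys/kernel/debug/lru_gen"


def build_reclaim_command(memcg_id: int, node_id: int, min_gen_nr: int,
                          max_gen_nr: int,
                          swappiness: Optional[int] = None,
                          nr_to_reclaim: Optional[int] = None) -> str:
    """Format '- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]'."""
    # max_gen_nr and max_gen_nr-1 are not fully aged, so they cannot be
    # evicted: min_gen_nr has to be less than max_gen_nr-1.
    if min_gen_nr >= max_gen_nr - 1:
        raise ValueError("min_gen_nr must be less than max_gen_nr-1")
    # nr_to_reclaim is nested inside the optional swappiness field.
    if nr_to_reclaim is not None and swappiness is None:
        raise ValueError("nr_to_reclaim requires swappiness")
    fields = ["-", memcg_id, node_id, min_gen_nr]
    if swappiness is not None:
        fields.append(swappiness)
        if nr_to_reclaim is not None:
            fields.append(nr_to_reclaim)
    return " ".join(str(f) for f in fields)


def write_command(command: str, path: str = LRU_GEN_PATH) -> None:
    """Hypothetical helper: needs root and a mounted debugfs."""
    with open(path, "w") as f:
        f.write(command)
```

E.g., build_reclaim_command(1, 0, 2, 5, swappiness=60) produces
"- 1 0 2 60", which write_command() would then hand to lru_gen.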

> This is a very good thing, but has thought been given to the potential
> needs of other job schedulers?

Yes, basically I'm trying to help everybody replicate the success
stories at Google and Meta [3][4].

[3] https://dl.acm.org/doi/10.1145/3297858.3304053
[4] https://dl.acm.org/doi/10.1145/3503222.3507731