From: Yu Zhao
Date: Fri, 15 Apr 2022 20:22:42 -0600
Subject: Re: [PATCH v10 13/14] mm: multi-gen LRU: admin guide
To: Andrew Morton
References: <20220407031525.2368067-1-yuzhao@google.com>
    <20220407031525.2368067-14-yuzhao@google.com>
    <20220411191639.52c62959489a6c27cb7d251e@linux-foundation.org>
In-Reply-To: <20220411191639.52c62959489a6c27cb7d251e@linux-foundation.org>
Cc: Stephen Rothwell, Linux-MM, Andi Kleen, Aneesh Kumar,
    Barry Song <21cnbao@gmail.com>, Catalin Marinas, Dave Hansen,
    Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
    Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Mel Gorman,
    Michael Larabel, Michal Hocko, Mike Rapoport, Rik van Riel,
    Vlastimil Babka, Will Deacon, Ying Huang, Linux ARM,
    "open list:DOCUMENTATION", linux-kernel, Kernel Page Reclaim v2,
    "the arch/x86 maintainers", Brian Geffon, Jan Alexander Steffens,
    Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal, Daniel Byrne,
    Donald Carr, Holger Hoffstätte,
    Konstantin Kharlamov, Shuang Zhai, Sofia Trinh, Vaibhav Jain

On Mon, Apr 11, 2022 at 8:16 PM Andrew Morton wrote:
>
> On Wed, 6 Apr 2022 21:15:25 -0600 Yu Zhao wrote:
>
> > +Kill switch
> > +-----------
> > +``enable`` accepts different values to enable or disable the following
>
> It's actually called "enabled".

Good catch. Thanks!

> And I suggest that the file name be
> included right there in the title. ie.
>
> "enabled": Kill Switch
> ======================

Will do.

> > +Experimental features
> > +=====================
> > +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> > +following subsections. Multiple command lines are supported, so does
> > +concatenation with delimiters ``,`` and ``;``.
> > +
> > +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> > +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> > +evicted generations in this file.
> > +
> > +Working set estimation
> > +----------------------
> > +Working set estimation measures how much memory an application
> > +requires in a given time interval, and it is usually done with little
> > +impact on the performance of the application. E.g., data centers want
> > +to optimize job scheduling (bin packing) to improve memory
> > +utilizations.
> > +When a new job comes in, the job scheduler needs to find
> > +out whether each server it manages can allocate a certain amount of
> > +memory for this new job before it can pick a candidate. To do so, this
> > +job scheduler needs to estimate the working sets of the existing jobs.
>
> These various sysfs interfaces are a big deal. Because they are so
> hard to change once released.

Debugfs, not sysfs. The title is "Experimental features" :)

> btw, what is this "job scheduler" of which you speak?

Basically it's part of cluster management software. Many jobs
(programs + data) can run concurrently in the same cluster and the job
scheduler of this cluster does the bin packing. To improve resource
utilization, the job scheduler needs to know the (memory) size of each
job it packs, hence the working set estimation (how much memory a job
uses within a given time interval). The job scheduler also takes
memory from some jobs so that those jobs can better fit into a single
machine (proactive reclaim).

> Is there an open
> source implementation upon which we hope the world will converge?

There are many [1], e.g., Kubernetes (k8s). Personally, I don't think
they'll ever converge.

At the moment, all open source implementations I know of rely on users
manually specifying the size of each job (job spec), e.g., [2]. Users
overprovision memory to avoid OOM kills. The average memory
utilization generally is surprisingly low. What we can hope for is
that eventually some of the open source implementations will use the
working set estimation and proactive reclaim features provided here.

[1] https://en.wikipedia.org/wiki/List_of_cluster_management_software
[2] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

> > +Proactive reclaim
> > +-----------------
> > +Proactive reclaim induces memory reclaim when there is no memory
> > +pressure and usually targets cold memory only.
> > +E.g., when a new job
> > +comes in, the job scheduler wants to proactively reclaim memory on the
> > +server it has selected to improve the chance of successfully landing
> > +this new job.
> > +
> > +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> > +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
> > +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> > +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> > +aged and therefore cannot be evicted. ``swappiness`` overrides the
> > +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> > +the number of pages to evict.
> > +
> > +A typical use case is that a job scheduler writes to ``lru_gen``
> > +before it tries to land a new job on a server, and if it fails to
> > +materialize the cold memory without impacting the existing jobs on
> > +this server, it retries on the next server according to the ranking
> > +result obtained from the working set estimation step described
> > +earlier.
>
> It sounds to me that these interfaces were developed in response to
> ongoing development and use of a particular job scheduler.

I did borrow some of my previous experience with Google's data
centers. But I'm a Chrome OS developer now, so I designed them to be
job scheduler agnostic :)

> This is a very good thing, but has thought been given to the potential
> needs of other job schedulers?

Yes, basically I'm trying to help everybody replicate the success
stories at Google and Meta [3][4].

[3] https://dl.acm.org/doi/10.1145/3297858.3304053
[4] https://dl.acm.org/doi/10.1145/3503222.3507731
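
For illustration only, here is a minimal Python sketch of how a job
scheduler might format the proactive reclaim command quoted above. The
helper names (build_evict_command, write_command) are hypothetical and
not part of the patch. The caller is assumed to already know
max_gen_nr; how to obtain it is not covered in this message. Only the
command syntax and the min_gen_nr < max_gen_nr-1 constraint come from
the quoted documentation.

```python
from typing import Optional

# Path as quoted in the patch text; requires debugfs and root privileges.
LRU_GEN_PATH = "/sys/kernel/debug/lru_gen"


def build_evict_command(memcg_id: int, node_id: int, min_gen_nr: int,
                        max_gen_nr: int,
                        swappiness: Optional[int] = None,
                        nr_to_reclaim: Optional[int] = None) -> str:
    """Format '- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]'.

    max_gen_nr is supplied by the caller (assumption) and is used only
    to check the constraint stated in the quoted documentation.
    """
    # max_gen_nr and max_gen_nr-1 are not fully aged and cannot be evicted.
    if min_gen_nr >= max_gen_nr - 1:
        raise ValueError("min_gen_nr must be less than max_gen_nr-1")
    # The optional fields are positional, so nr_to_reclaim needs swappiness.
    if nr_to_reclaim is not None and swappiness is None:
        raise ValueError("nr_to_reclaim requires swappiness to be given")
    fields = ["-", memcg_id, node_id, min_gen_nr]
    if swappiness is not None:
        fields.append(swappiness)
    if nr_to_reclaim is not None:
        fields.append(nr_to_reclaim)
    return " ".join(str(field) for field in fields)


def write_command(command: str, path: str = LRU_GEN_PATH) -> None:
    """Write one command line to the debugfs file (hypothetical helper)."""
    with open(path, "w") as f:
        f.write(command)


if __name__ == "__main__":
    # Example values are made up; nothing is written to debugfs here.
    print(build_evict_command(1, 0, 2, max_gen_nr=5,
                              swappiness=0, nr_to_reclaim=1024))
```

Keeping the formatting separate from the write lets the command string
be checked without touching debugfs.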