Date: Tue, 17 May 2022 16:06:45 -0400
From: Johannes Weiner
To: Yosry Ahmed
Cc: Michal Hocko, Shakeel Butt, Andrew Morton, David Rientjes,
    Roman Gushchin, cgroups@vger.kernel.org, Tejun Heo, Linux-MM,
    Yu Zhao, Wei Xu, Greg Thelen, Chen Wandun
Subject: Re: [RFC] Add swappiness argument to memory.reclaim

Hi Yosry,

On Tue, May 17, 2022 at 11:06:36AM -0700, Yosry Ahmed wrote:
> On Mon, May 16, 2022 at 11:56 PM Michal Hocko wrote:
> >
> > On Mon 16-05-22 15:29:42, Yosry Ahmed wrote:
> > > The discussions on the patch series [1] to add memory.reclaim has
> > > shown that it is desirable to add an argument to control the type of
> > > memory being reclaimed by invoked proactive reclaim using
> > > memory.reclaim.
> > >
> > > I am proposing adding a swappiness optional argument to the interface.
> > > If set, it overwrites vm.swappiness and per-memcg swappiness. This
> > > provides a way to enforce user policy on a stateless per-reclaim
> > > basis. We can make policy decisions to perform reclaim differently for
> > > tasks of different app classes based on their individual QoS needs. It
> > > also helps for use cases when particularly page cache is high and we
> > > want to mainly hit that without swapping out.
> >
> > Can you be more specific about the usecase please? Also how do you
>
> For example for a class of applications it may be known that
> reclaiming one type of pages anon/file is more profitable or will
> incur an overhead, based on userspace knowledge of the nature of the
> app.

I want to make sure I understand what you're trying to correct for
with this bias. Could you expand some on what you mean by profitable?

The way the kernel thinks today is that importance of any given page
is its access frequency times the cost of paging it. swappiness exists
to recognize differences in the second part: the cost involved in
swapping a page vs the cost of a file cache miss.
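
A toy sketch of that product, purely as an illustration and not kernel code. The function name, parameter names and units are made up for the example:

```python
# Toy model, not kernel code: a page's importance is approximated as
# its access frequency multiplied by the cost of bringing it back.
def page_value(access_freq, refault_cost):
    """Return the modeled importance of a page (hypothetical units)."""
    return access_freq * refault_cost


# A frequently accessed page that is cheap to refault ...
hot_cheap = page_value(access_freq=10, refault_cost=1)
# ... and a rarely accessed page that is expensive to refault.
cold_expensive = page_value(access_freq=1, refault_cost=10)

# In this model the two pages are equally good reclaim candidates.
print(hot_cheap == cold_expensive)
```

In this simplified picture, swappiness would only adjust the refault_cost side of the product, not the access frequency.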

For example, page A is accessed 10 times more frequently than B, but B
is 10 times more expensive to refault/swapin. Combining that, they
should be roughly equal reclaim candidates.

This is the same with the seek parameter of slab shrinkers: some
objects are more expensive to recreate than others. Once corrected for
that, presence of reference bits can be interpreted on an even level.

While access frequency is clearly a workload property, the cost of
refaulting is conventionally not - let alone a per-reclaim property!

If I understand you correctly, you're saying that the backing type of
a piece of memory can say something about the importance of the data
within. Something that goes beyond the work of recreating it.

Is that true or am I misreading this?

If that's your claim, isn't that, if it happens, mostly incidental?

For example, in our fleet we used to copy executable text into
anonymous memory to get THP backing. With file THP support in the
kernel, the text is back in cache. The importance of the memory
*contents* stayed the same. The backing storage changed, but beyond
that the anon/file distinction doesn't mean anything.

Another example. Probably one of the most common workload structures
is text, heap, logging/startup/error handling: hot file, warm anon,
cold file. How does prioritizing either file or anon apply to this?

Maybe I'm misunderstanding and this IS about per-workload backing
types? Maybe the per-cgroup swapfiles that you guys are using?

> If most of what an app use for example is anon/tmpfs then it might
> be better to explicitly ask the kernel to reclaim anon, and to avoid
> reclaiming file pages in order not to hurt the file cache
> performance.

Hm. Reclaim ages those pools based on their size, so a dominant anon
set should receive more pressure than a small file set.
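
A toy sketch of size-proportional pressure, purely as an illustration and not the kernel's actual scan logic. It assumes a fixed scan budget split by pool size alone and ignores swappiness and any other weighting; all names and numbers are hypothetical:

```python
# Toy model, not kernel code: split a scan budget between the anon and
# file pools strictly in proportion to their sizes.
def split_scan_budget(anon_pages, file_pages, budget):
    """Return (anon_scan, file_scan) for a hypothetical scan budget."""
    total = anon_pages + file_pages
    if total == 0:
        return 0, 0
    anon_scan = budget * anon_pages // total
    # Whatever is not spent on anon goes to file.
    return anon_scan, budget - anon_scan


# A workload dominated by anon memory with a small file set.
anon_scan, file_scan = split_scan_budget(
    anon_pages=900_000, file_pages=100_000, budget=1_000
)
print(anon_scan, file_scan)
```

With these made-up sizes, nine tenths of the budget lands on the anon pool without any explicit anon/file bias.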

I can see two options why this doesn't produce the desired results:

1) Reclaim is broken and doesn't allocate scan rates right, or
2) Access frequency x refault cost alone is not a satisfactory
   predictor for the value of any given page.

Can you see another?

I can sort of see the argument for 2), because it can be workload
dependent: a 50ms refault in a single-threaded part of the program is
likely more disruptive than the same refault in an asynchronous worker
thread. This is a factor we're not really taking into account today.
But I don't think an anon/file bias will capture this coefficient?