From: Mina Almasry
Date: Mon, 12 Dec 2022 16:54:27 -0800
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
To: Michal Hocko
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Roman Gushchin,
    Shakeel Butt, Muchun Song, Andrew Morton, Huang Ying, Yang Shi,
    Yosry Ahmed, weixugc@google.com, fvdl@google.com, bagasdotme@gmail.com,
    cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20221202223533.1785418-1-almasrymina@google.com>

On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko wrote:
>
> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> > The nodes= arg instructs the kernel to only scan the given nodes for
> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >
> > nodes 0,1 -> top tier
> > nodes 2,3 -> second tier
> >
> > $ echo "1m nodes=0" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> > Since node 0 is a top tier node, demotion will be attempted first. This
> > is useful to direct proactive reclaim to specific nodes that are under
> > pressure.
> >
> > $ echo "1m nodes=2,3" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> > since this tier of memory has no demotion targets the memory will be
> > reclaimed.
> >
> > $ echo "1m nodes=0,1" > memory.reclaim
> >
> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> > be desirable according to the userspace policy if there is pressure on
> > the top tiers. Since these nodes have demotion targets, the kernel will
> > attempt demotion first.
> >
> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim""), the proactive reclaim interface memory.reclaim does both
> > reclaim and demotion. Reclaim and demotion incur different latency costs
> > to the jobs in the cgroup. Demoted memory would still be addressable
> > by the userspace at a higher latency, but reclaimed memory would need to
> > incur a pagefault.
> >
> > The 'nodes' arg is useful to allow the userspace to control demotion
> > and reclaim independently according to its policy: if the memory.reclaim
> > is called on a node with demotion targets, it will attempt demotion first;
> > if it is called on a node without demotion targets, it will only attempt
> > reclaim.
> >
> > Acked-by: Michal Hocko
> > Signed-off-by: Mina Almasry
>
> After discussion in [1] I have realized that I haven't really thought
> through all the consequences of this patch and therefore I am retracting
> my ack here. I am not nacking the patch at this stage but I also think
> this shouldn't be merged now and we should really consider all the
> consequences.
>
> Let me summarize my main concerns here as well. The proposed
> implementation doesn't apply the provided nodemask to the whole reclaim
> process. This means that demotion can happen outside of the mask so the
> user request cannot really control demotion targets and that limits
> the interface should there be any need for a finer grained control in
> the future (see an example in [2]).
> Another problem is that this can limit future reclaim extensions because
> of existing assumptions of the interface [3] - specify only top-tier
> node to force the aging without actually reclaiming any charges and
> (ab)use the interface only for aging on multi-tier system. A change to
> the reclaim to not demote in some cases could break this usecase.
>

I think this is correct. My use case is to ask the kernel to do
demotion without reclaim in the cgroup, and the reason for that is
stated in the commit message:

"Reclaim and demotion incur different latency costs to the jobs in the
cgroup. Demoted memory would still be addressable by the userspace at a
higher latency, but reclaimed memory would need to incur a
pagefault."

For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault). I had initially
proposed a separate interface for this, but Johannes directed me to
this interface instead in [1]. In the same email Johannes also tells
me that Meta's reclaim stack relies on memory.reclaim triggering
demotion, so it seems that I'm not the first to take a dependency on
this. Additionally, in [2] Johannes also says it would be great if in
the long term reclaim policy and demotion policy do not diverge.

[1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
[2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/

> My counter proposal would be to define the nodemask for memory.reclaim
> as a domain to constrain the charge reclaim. That means both aging and
> reclaim including demotion which is a part of aging. This will allow
> to control where to demote for balancing purposes (e.g. demote to node 2
> rather than 3) which is impossible with the proposed scheme.
>

My understanding is that with this interface, in order to trigger
demotion I would want to list both the top tier nodes and the bottom
tier nodes in the nodemask, and since the bottom tier nodes are in the
nodemask the kernel will not just trigger demotion, but will also
trigger reclaim. This is very specifically not our use case and not
the goal of this patch.

I had also suggested adding a demotion= arg to memory.reclaim so the
userspace may customize this behavior, but Johannes rejected this in
[3] to adhere to the aging pipeline.

All in all I like Johannes's model in [3] describing the aging
pipeline and the relationship between demotion and reclaim. The nodes=
arg is just a hint to the kernel that the userspace is looking for
reclaim from a top tier node (which would be done by demotion
according to the aging pipeline) or a bottom tier node (which would be
done by reclaim according to the aging pipeline). I think this
interface is aligned with this model.

[3] https://lore.kernel.org/linux-mm/Y36XchdgTCsMP4jT@cmpxchg.org/

> [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
> [2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD@dhcp22.suse.cz
> [3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg@mail.gmail.com
> --
> Michal Hocko
> SUSE Labs
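
For concreteness, the requests discussed in this thread could be issued from
userspace with a small helper like the sketch below. This is only an
illustration, assuming a kernel that carries the proposed nodes= patch; the
function names and the cgroup name "job" in the usage comment are
hypothetical and do not come from the patch itself.

```shell
#!/bin/sh
# Sketch only: assumes a kernel with the proposed nodes= argument applied.
# build_reclaim_request and reclaim_from_nodes are hypothetical helper names.

# Compose the request string, e.g. "1m nodes=0,1".
# With no node list, fall back to the plain "<size>" form.
build_reclaim_request() {
    size=$1
    nodes=$2
    if [ -z "$size" ]; then
        echo "usage: build_reclaim_request SIZE [NODES]" >&2
        return 1
    fi
    if [ -n "$nodes" ]; then
        printf '%s nodes=%s\n' "$size" "$nodes"
    else
        printf '%s\n' "$size"
    fi
}

# Write the request to the given memory.reclaim file.
reclaim_from_nodes() {
    file=$1
    req=$(build_reclaim_request "$2" "$3") || return 1
    printf '%s\n' "$req" > "$file"
}

# Usage on the 2 tier example system ("job" is a hypothetical cgroup):
#   reclaim_from_nodes /sys/fs/cgroup/job/memory.reclaim 1m 0,1   # top tier
#   reclaim_from_nodes /sys/fs/cgroup/job/memory.reclaim 1m 2,3   # second tier
```

With the patch as posted, the first call would be expected to attempt demotion
first, since nodes 0,1 have demotion targets, while the second would only
reclaim. Under the counter proposal the same strings would instead constrain
the whole reclaim process, including where demotion may land.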