From: Wei Xu
Date: Mon, 12 Dec 2022 23:48:57 -0800
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
To: "Huang, Ying"
Cc: Mina Almasry, Michal Hocko, Tejun Heo, Zefan Li, Johannes Weiner,
 Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
 Yang Shi, Yosry Ahmed, fvdl@google.com, bagasdotme@gmail.com,
 cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20221202223533.1785418-1-almasrymina@google.com>
 <87k02volwe.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87k02volwe.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, Dec 12, 2022 at 10:32 PM Huang, Ying wrote:
>
> Mina Almasry writes:
>
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko
> >> > Signed-off-by: Mina Almasry
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this stage but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process.
> >> This means that demotion can happen outside of the mask so the
> >> user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on a multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that Meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
>
> After these discussions, I think the solution may be to use different
> interfaces for "proactive demote" and "proactive reclaim".
> That is, reconsider "memory.demote". In this way, we will always
> uncharge the cgroup for "memory.reclaim". This avoids the possible
> confusion there. And, because demotion is considered aging, we don't
> need to disable demotion for "memory.reclaim", just don't count it.

+1 on memory.demote.

> Best Regards,
> Huang, Ying
>