Message-ID: <554cd1a7-83a3-ae49-0770-2321c79472a1@bytedance.com>
Date: Tue, 13 Jun 2023 14:46:55 +0800
Subject: Re: Re: [RFC PATCH net-next] sock: Propose socket.urgent for sockmem isolation
From: Abel Wu
To: Shakeel Butt, Eric Dumazet
Cc: Tejun Heo, Christian Warloe, Wei Wang, "David S. Miller",
 Jakub Kicinski, Paolo Abeni, Johannes Weiner, Michal Hocko,
 Roman Gushchin, Muchun Song, Andrew Morton, David Ahern, Yosry Ahmed,
 "Matthew Wilcox (Oracle)", Yu Zhao, Vasily Averin, Kuniyuki Iwashima,
 Martin KaFai Lau, Xin Long, Jason Xing, Alexei Starovoitov, open list,
 "open list:NETWORKING [GENERAL]",
 "open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)"
References: <20230609082712.34889-1-wuyun.abel@bytedance.com>

On 6/10/23 1:53 AM, Shakeel Butt wrote:
> On Fri, Jun 9, 2023 at 2:07 PM Eric Dumazet wrote:
>>
>> On Fri, Jun 9, 2023 at 10:28 AM Abel Wu wrote:
>>>
>>> This is just a PoC patch intended to resume the discussion about
>>> tcpmem isolation opened by Google in LPC'22 [1].
>>>
>>> We are facing the same problem that the global shared threshold can
>>> cause isolation issues. Low priority jobs can hog TCP memory and
>>> adversely impact higher priority jobs. What's worse is that these
>>> low priority jobs usually have smaller cpu weights leading to poor
>>> ability to consume rx data.
>>>
>>> To tackle this problem, an interface for non-root cgroup memory
>>> controller named 'socket.urgent' is proposed. It determines whether
>>> the sockets of this cgroup and its descendants can escape from the
>>> constraints or not under global socket memory pressure.
>>>
>>> The 'urgent' semantics will not take effect under memcg pressure in
>>> order to protect against worse memstalls, thus will be the same as
>>> before without this patch.
>>>
>>> This proposal doesn't remove protocol's threshold as we found it
>>> useful in restraining memory defragment.
>>> As aforementioned the low
>>> priority jobs can hog lots of memory, which is unreclaimable and
>>> unmovable, for some time due to small cpu weight.
>>>
>>> So in practice we allow high priority jobs with net-memcg accounting
>>> enabled to escape the global constraints if the net-memcg itself is
>>> not under pressure. While for lower priority jobs, the budget will
>>> be tightened as the memory usage of 'urgent' jobs increases. In this
>>> way we can finally achieve:
>>>
>>>   - Important jobs won't be priority inversed by the background
>>>     jobs in terms of socket memory pressure/limit.
>>>
>>>   - Global constraints are still effective, but only on non-urgent
>>>     jobs, useful for admins on policy decision on defrag.
>>>
>>> Comments/Ideas are welcomed, thanks!
>>>
>>
>> This seems to go in a complete opposite direction than memcg promises.
>>
>> Can we fix memcg, so that :
>>
>> Each group can use the memory it was provisioned (this includes TCP buffers)
>>
>> Global tcp_memory can disappear (set tcp_mem to infinity)
>
> I agree with Eric and this is exactly how we at Google overcome the
> isolation issue. We have set tcp_mem to unlimited and enabled memcg
> accounting of network memory (by surgically incorporating v2 semantics
> of network memory accounting in our v1 environment).
>
> I do have one question though:
>
>> This proposal doesn't remove protocol's threshold as we found it
>> useful in restraining memory defragment.
>
> Can you explain how you find the global tcp limit useful? What does
> memory defragment mean?

We co-locate different kinds of jobs with different priorities in
cgroups, among which are some background jobs that can have lots of
net data to process, e.g. training jobs. These background jobs usually
don't have enough cpu bandwidth to consume the rx data in time if more
important jobs are running simultaneously. The data can accumulate and
eat up some or all of the provisioned memory.
This unreclaimable memory could gradually fragment the whole memory.
We have already found many such cases in our production environment.
Maybe it's not proper to use the word 'defragment', as what we do is
try to prevent fragmentation rather than defrag like compaction does.
With the global tcp_mem pressure/limit and socket.urgent, we are able
to achieve this goal, at least to some extent.

And it is not only the global tcp limit: the pressure threshold can
also make something like priority inversion happen. We monitored the
top-20 priority jobs and found their performance reduced by 2~9% under
global tcp memory pressure (and sometimes the majority of
sk_memory_allocated() is contributed by the low priority jobs). This
has nothing to do with 'memory defrag', though.
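
To make the semantics in the quoted proposal easier to follow, here is
a rough userspace model in Python. It is only a sketch, not the patch:
the names GlobalLimits, Cgroup and charge_verdict are hypothetical, and
the two thresholds are assumed to play the role of the pressure and max
fields of tcp_mem (counted in pages). 'allocated' stands for the global
socket memory in use, including what urgent cgroups have charged, which
is why the budget left for non-urgent cgroups shrinks as urgent usage
grows.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    OK = "ok"              # charge proceeds without restriction
    PRESSURE = "pressure"  # charge proceeds, socket is under global pressure
    REJECT = "reject"      # charge exceeds the global hard limit


@dataclass
class GlobalLimits:
    # Assumed to mirror the pressure and max fields of tcp_mem, in pages.
    pressure: int
    limit: int


@dataclass
class Cgroup:
    urgent: bool = False          # models the proposed socket.urgent knob
    memcg_pressure: bool = False  # the cgroup's own memcg is under pressure


def charge_verdict(limits: GlobalLimits, cgroup: Cgroup, allocated: int) -> Verdict:
    """Classify a socket memory charge given the global allocated page count."""
    # Urgent cgroups escape the global constraints, but only while their
    # own memcg is not under pressure.
    if cgroup.urgent and not cgroup.memcg_pressure:
        return Verdict.OK
    # Everyone else is still subject to the global thresholds.
    if allocated > limits.limit:
        return Verdict.REJECT
    if allocated > limits.pressure:
        return Verdict.PRESSURE
    return Verdict.OK


if __name__ == "__main__":
    limits = GlobalLimits(pressure=100, limit=200)
    background = Cgroup(urgent=False)
    important = Cgroup(urgent=True)
    for name, cg in (("background", background), ("important", important)):
        print(name, charge_verdict(limits, cg, allocated=150).value)
```

With allocated above the pressure threshold, the non-urgent cgroup is
treated as under pressure while the urgent one is not, unless the
urgent cgroup's own memcg is under pressure, in which case it falls
back to the same global checks as everyone else.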