Date: Thu, 30 Jan 2020 18:00:20 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
Message-ID: <20200130170020.GZ24244@dhcp22.suse.cz>
References: <20191219200718.15696-1-hannes@cmpxchg.org> <20191219200718.15696-4-hannes@cmpxchg.org>
In-Reply-To: <20191219200718.15696-4-hannes@cmpxchg.org>

On Thu 19-12-19 15:07:18, Johannes Weiner wrote:
> Right now, the effective protection of any given cgroup is capped by
> its own explicit memory.low setting, regardless of what the parent
> says. The reasons for this are mostly historical and ease of
> implementation: to make delegation of memory.low safe, effective
> protection is the min() of all memory.low up the tree.
>
> Unfortunately, this limitation makes it impossible to protect an
> entire subtree from another without forcing the user to make explicit
> protection allocations all the way to the leaf cgroups - something
> that is highly undesirable in real life scenarios.
>
> Consider memory in a data center host. At the cgroup top level, we
> have a distinction between system management software and the actual
> workload the system is executing. Both branches are further subdivided
> into individual services, job components etc.
>
> We want to protect the workload as a whole from the system management
> software, but that doesn't mean we want to protect and prioritize
> individual workload wrt each other. Their memory demand can vary over
> time, and we'd want the VM to simply cache the hottest data within the
> workload subtree. Yet, the current memory.low limitations force us to
> allocate a fixed amount of protection to each workload component in
> order to get protection from system management software in
> general. This results in very inefficient resource distribution.

I do agree that configuring the reclaim protection is not an easy task,
especially in a deeper reclaim hierarchy. systemd tends to create deep
and commonly shared subtrees, so in practice a protected workload really
has to be put directly into a new first-level cgroup AFAICT. That is a
simpler example though. Just imagine you want to protect a certain user
slice.

You seem to be facing a different problem though IIUC. You know how much
memory you want to protect and you do not have to care about the cgroup
hierarchy above, but you do not know/care how to distribute that
protection among the workloads running under it. I agree that this is a
reasonable usecase.

Both problems, however, show that we have a more general configurability
problem for both leaf and intermediate nodes. They are both a result of
the strong requirements imposed by delegation, as you have noted above.
I am wondering whether we just went too rigid here.

Delegation points are certainly a security boundary and they should be
treated like that, but do we really need strong containment when the
reclaim protection is under the admin's full control? Does the admin
really have to reconfigure a large part of the hierarchy to protect a
particular subtree?

I do not have a great answer on how to implement this, unfortunately.
The best I could come up with was to add a "$inherited_protection" magic
value to distinguish it from an explicit >=0 protection.
What's the difference? $inherited_protection would be the default and it
would always refer to the closest explicit protection up the hierarchy
(with 0 as the default if none is defined).

        A
       / \
      B   C (low=10G)
         / \
        D   E (low=5G)

A and B don't get any protection (low=0). C gets protection (10G) and
distributes the pressure to D and E when in excess. D inherits (low=10G)
and E overrides the protection to 5G.

That would help both usecases AFAICS, while delegation should still be
possible (configure the delegation point with an explicit value).

I have very likely not thought that through completely. Does that sound
like a completely insane idea? Or do you think that the two usecases are
simply impossible to handle at the same time?
[...]
-- 
Michal Hocko
SUSE Labs
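
For illustration only, the "$inherited_protection" lookup rule described
above can be modelled in a few lines of userspace Python. This is a
rough sketch and not kernel code: the Cgroup class, the INHERITED
sentinel and resolved_low() are hypothetical names invented for this
example, and the sketch covers only which value each cgroup resolves to,
not how C's 10G would be distributed between D and E under pressure.

```python
# Hypothetical userspace model of the proposed lookup rule; not kernel code.
from dataclasses import dataclass
from typing import Optional

G = 1 << 30
INHERITED = None  # stands in for the proposed "$inherited_protection" value


@dataclass
class Cgroup:
    name: str
    parent: Optional["Cgroup"] = None
    # Explicit protection in bytes (>= 0), or INHERITED when nothing is set.
    low: Optional[int] = INHERITED


def resolved_low(cg: Cgroup) -> int:
    """Return the closest explicit protection up the hierarchy, else 0."""
    node: Optional[Cgroup] = cg
    while node is not None:
        if node.low is not INHERITED:
            return node.low
        node = node.parent
    return 0


# The hierarchy from the example: C and E carry explicit values.
a = Cgroup("A")
b = Cgroup("B", parent=a)
c = Cgroup("C", parent=a, low=10 * G)
d = Cgroup("D", parent=c)
e = Cgroup("E", parent=c, low=5 * G)

for cg in (a, b, c, d, e):
    print(f"{cg.name}: low={resolved_low(cg) // G}G")
```

In this model a delegation point would simply be a cgroup configured
with an explicit value (possibly 0), which stops the upward walk there.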