From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mina Almasry
Date: Mon, 18 Oct 2021 07:31:58 -0700
Subject: Re: [RFC Proposal] Deterministic memcg charging for shared memory
To: Michal Hocko
Cc: Roman Gushchin, Shakeel Butt, Greg Thelen, Johannes Weiner,
    Hugh Dickins, Tejun Heo, Linux-MM,
    "open list:FILESYSTEMS (VFS and infrastructure)",
    cgroups@vger.kernel.org, riel@surriel.com

On Mon, Oct 18, 2021 at 6:33 AM Michal Hocko wrote:
>
> On Wed 13-10-21 12:23:19, Mina Almasry wrote:
> > Below is a proposal for deterministic charging of shared memory.
> > Please take a look and let me know if there are any major concerns:
> >
> > Problem:
> > Currently, shared memory is charged to the memcg of the allocating
> > process. This makes the memory usage of processes accessing shared
> > memory unpredictable, since whichever process touches the memory
> > first gets charged for it. We have a number of use cases where our
> > userspace would like deterministic charging of shared memory:
> >
> > 1. System services allocating memory for client jobs:
> > We have services (namely a network access service[1]) that provide
> > functionality for clients running on the machine and allocate memory
> > to carry out these services. The memory usage of these services
> > depends on the number of jobs running on the machine and the nature
> > of the requests made to the service, which makes their memory usage
> > hard to predict and thus hard to limit via memory.max. These system
> > services would like a way to allocate memory and instruct the kernel
> > to charge it to the client's memcg.
> >
> > 2. Shared filesystem between subtasks of a large job:
> > Our infrastructure has large meta jobs, such as kubernetes, which
> > spawn multiple subtasks that share a tmpfs mount. These jobs and
> > their subtasks use the tmpfs mount for purposes such as data sharing
> > and persisting data across subtask restarts. In kubernetes
> > terminology, the meta job is similar to a pod and the subtasks are
> > containers under the pod. We want the shared memory to be
> > deterministically charged to the kubernetes pod, independent of the
> > lifetime of the containers under the pod.
> >
> > 3. Shared libraries and language runtimes shared between independent
> > jobs:
> > We'd like to optimize memory usage on the machine by sharing
> > libraries and language runtimes among the many processes running on
> > our machines in separate memcgs. A side effect of this is that one
> > job may be unlucky enough to be the first to access many of the
> > libraries and may get oom killed as all the cached files get charged
> > to it.
> >
> > Design:
> > My rough proposal to solve this problem is to simply add a
> > 'memcg=/path/to/memcg' mount option for filesystems (namely tmpfs),
> > directing all the memory of the file system to be 'remote charged'
> > to the cgroup provided by that memcg= option.
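
As an aside, to make the proposal concrete: from userspace the option
would be used roughly as in the sketch below. This is illustrative
only; the mount point, the cgroup path, and the exact 'memcg=' syntax
are placeholders rather than anything final from the patches, and it
obviously only works on a kernel carrying them.

/*
 * Illustrative sketch only: assumes a kernel with the proposed
 * 'memcg=' tmpfs mount option.  Paths and sizes are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
        /* Mount a tmpfs whose pages are remote charged to the client's
         * memcg, no matter which task touches them first. */
        if (mount("tmpfs", "/mnt/client-shm", "tmpfs", 0,
                  "size=1G,memcg=/sys/fs/cgroup/client-job") != 0) {
                perror("mount");
                return 1;
        }

        /* Memory the service now allocates here on behalf of the
         * client is charged to the client's memcg, not the service's. */
        int fd = open("/mnt/client-shm/scratch", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, 1 << 20) != 0) {
                perror("scratch file");
                return 1;
        }

        char *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Faulting these pages in charges the client's memcg. */
        memset(buf, 0, 1 << 20);

        munmap(buf, 1 << 20);
        close(fd);
        return 0;
}
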
> Could you be more specific about how this matches the above mentioned
> usecases?

For the use cases I've listed, respectively:

1. Our network service would mount a tmpfs with 'memcg=<path to the
client's memcg>'. Any memory the service allocates on behalf of the
client, it allocates inside this tmpfs mount, thus charging it to the
client's memcg without risk of hitting the service's own limit.

2. The large job (kubernetes pod) would mount a tmpfs with
'memcg=<path to the pod's memcg>' and then share this tmpfs mount with
its subtasks (the containers in the pod). The subtasks can then
allocate memory in the tmpfs and have it charged to the kubernetes
pod, without risk of hitting the containers' limits.

3. We would extend this functionality to other file systems backed by
persistent disk, then mount such a file system with 'memcg=<path to a
dedicated memcg>'. Jobs can then use the shared libraries, and any
memory allocated by loading them is charged to that dedicated memcg
rather than to the job using the libraries.

> What would/should happen if the target memcg doesn't exist or stops
> existing under the remote charger's feet?

My thinking is that the tmpfs mount acts as a charge target for the
memcg and blocks the memcg from being removed until the tmpfs is
unmounted, similar to how rmdir of a memcg fails while processes are
still attached to it. But I don't feel strongly about this, and I'm
happy to go with another approach if you have a strong opinion here.

> > Caveats:
> > 1. One complication to address is the behavior when the target
> > memcg hits its memory.max limit because of remote charging. In this
> > case the oom-killer will be invoked, but it may not find anything to
> > kill in the target memcg being charged. In that case, I propose
> > simply failing the remote charge, which will cause the process
> > executing the remote charge to get an ENOMEM. This will be
> > documented behavior of remote charging.
>
> Say you are in a page fault (#PF) path. If you just return ENOMEM
> then you will get a system wide OOM killer via
> pagefault_out_of_memory. This is very likely not something you want,
> right? Even if we remove this behavior, which is another story, then
> the #PF path wouldn't have any option other than to keep retrying,
> which doesn't really look great either.
>
> The only "reasonable" way I can see right now is to kill the remote
> charging task. That might result in some other problems though.

Yes! That's exactly what I was thinking, and from discussions with the
userspace folks interested in this it doesn't seem like a problem. We'd
kill the remote charging task and make it clear in the documentation
that this is the behavior, and that userspace is responsible for
working around it.

Worth mentioning: if processes A and B are sharing memory via a tmpfs,
they can set memcg=<a common ancestor of A's and B's memcgs>. The
memory is then charged to that common ancestor, and if the common
ancestor hits its limit, the oom-killer is invoked and should always
find something to kill. This will also be documented, and userspace can
choose this route if it doesn't want to risk being killed on page
fault.

> > 2. I would like to provide an initial implementation that adds this
> > support for tmpfs, while leaving the implementation generic enough
> > for myself or others to extend to more filesystems where they find
> > the feature useful.
>
> How do you envision other filesystems would implement that? Should
> the information be persisted in some way?
Yes, my initial implementation has a struct memcg * hanging off the
super block that is the memcg to charge (a rough sketch of the shape I
mean is in the P.S. below), but I can move it if there is somewhere
else you feel is more appropriate once I send out the patches.

> I didn't have time to give this a lot of thought and more questions
> will likely come. My initial reaction is that this will open a lot of
> interesting corner cases which will be hard to deal with.

Thank you very much for your review so far, and please let me know if
you think of any more issues. My feeling is that hitting the remote
memcg's limit, and the oom-killing behavior surrounding that, is by far
the most contentious issue. You don't seem completely revolted by what
I'm proposing there, so I'm somewhat optimistic we can deal with the
rest of the corner cases :-)

> --
> Michal Hocko
> SUSE Labs
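
P.S. To make "a struct memcg * hanging off the super block" a bit more
concrete, below is a rough, non-authoritative model of the shape I
mean. None of the names are real kernel symbols and this is not the
actual patch; it only illustrates that the charge path would prefer the
mount's configured memcg and fall back to today's behavior (charge the
allocating task) when no memcg= option was given.

/* Toy model only -- none of these names are real kernel symbols. */

struct memcg;                         /* stand-in for the kernel's memcg */

struct sb_charge_info {
        struct memcg *charge_target;  /* from 'memcg=', NULL if not set */
};

struct task_ctx {
        struct memcg *own_memcg;      /* memcg of the allocating task */
};

/* Which memcg should a new page of this filesystem be charged to?
 * The mount-wide target if one was configured, otherwise the usual
 * "charge the allocating task" behavior. */
static struct memcg *pick_charge_target(const struct sb_charge_info *sbi,
                                        const struct task_ctx *task)
{
        return sbi->charge_target ? sbi->charge_target : task->own_memcg;
}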