From: Mina Almasry
Date: Wed, 13 Oct 2021 12:23:19 -0700
Subject: [RFC Proposal] Deterministic memcg charging for shared memory
To: Roman Gushchin, Shakeel Butt, Greg Thelen, Michal Hocko, Johannes Weiner, Hugh Dickins, Tejun Heo, Linux-MM, "open list:FILESYSTEMS (VFS and infrastructure)", cgroups@vger.kernel.org, riel@surriel.com
Below is a proposal for deterministic charging of shared memory. Please take a look and let me know if there are any major concerns:

Problem:
Currently, shared memory is charged to the memcg of the allocating process. This makes the memory usage of processes accessing shared memory unpredictable, since whichever process touches the memory first gets charged. We have a number of use cases where our userspace would like deterministic charging of shared memory:

1. System services allocating memory for client jobs:
We have services (namely a network access service [1]) that provide functionality for clients running on the machine and allocate memory to carry out these services. The memory usage of these services depends on the number of jobs running on the machine and on the nature of the requests made to the service, which makes their memory usage hard to predict and thus hard to limit via memory.max. These system services would like a way to allocate memory and instruct the kernel to charge that memory to the client's memcg.

2. Shared filesystem between subtasks of a large job:
Our infrastructure has large meta jobs, such as kubernetes, which spawn multiple subtasks that share a tmpfs mount. These jobs and their subtasks use the tmpfs mount for various purposes, such as sharing data or persisting data across subtask restarts. In kubernetes terminology, the meta job is similar to a pod and the subtasks are containers under the pod.
We want the shared memory to be deterministically charged to the kubernetes pod and to be independent of the lifetime of the containers under the pod.

3. Shared libraries and language runtimes shared between independent jobs:
We'd like to optimize memory usage on the machine by sharing the libraries and language runtimes of many of the processes running on our machines in separate memcgs. This has the side effect that one job may be unlucky enough to be the first to access many of the libraries, and may get oom killed as all the cached files get charged to it.

Design:
My rough proposal to solve this problem is to simply add a 'memcg=/path/to/memcg' mount option for filesystems (namely tmpfs), directing all the memory of the filesystem to be 'remote charged' to the cgroup provided by that memcg= option.

Caveats:
1. One complication to address is the behavior when the target memcg hits its memory.max limit because of remote charging. In this case the oom-killer will be invoked, but it may not find anything to kill in the target memcg being charged. In this case, I propose simply failing the remote charge, which will cause the process executing the remote charge to get an ENOMEM. This will be documented behavior of remote charging.
2. I would like to provide an initial implementation that adds this support for tmpfs, while leaving the implementation generic enough for myself or others to extend to more filesystems where they find the feature useful.
3. I would like to implement this for both cgroups v2 _and_ cgroups v1, as we still have cgroup v1 users. If that is unacceptable, I can provide the v2 implementation only and maintain a local patch for the v1 support.

If this proposal sounds good in principle, I have an experimental implementation that I can make ready for review. Please let me know of any concerns you may have.

Thank you very much in advance!

Mina Almasry

[1] https://research.google/pubs/pub48630/
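P.S. For illustration only, here is a rough sketch of how the proposed option might be used. The memcg= mount option does not exist in any released kernel; the cgroup path, mount point, and limit below are made-up examples, and the commands assume root on a kernel carrying the proposed patch with a cgroup v2 hierarchy:

```shell
# Create a memcg for the pod and give it a limit (paths are illustrative).
mkdir /sys/fs/cgroup/pod_abc
echo 10G > /sys/fs/cgroup/pod_abc/memory.max

# Mount a tmpfs whose pages are always charged to pod_abc's memcg,
# regardless of which process (in whichever memcg) touches a page first.
mount -t tmpfs -o memcg=/sys/fs/cgroup/pod_abc tmpfs /mnt/shared

# Subtasks/containers under the pod can now share files in /mnt/shared.
# Per caveat 1 above, if pod_abc is at memory.max and nothing in it is
# killable, a write to /mnt/shared would fail with ENOMEM rather than
# oom-killing a task in an unrelated memcg.
```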