Date: Tue, 21 Feb 2023 08:07:13 -1000
From: Tejun Heo
To: Jason Gunthorpe
Cc: Michal Hocko, Yosry Ahmed, Alistair Popple, linux-mm@kvack.org,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    jhubbard@nvidia.com, tjmercier@google.com, hannes@cmpxchg.org,
    surenb@google.com, mkoutny@suse.com, daniel@ffwll.ch,
    "Daniel P. Berrange", Alex Williamson, Zefan Li, Andrew Morton
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

Hello,

On Tue, Feb 21, 2023 at 01:51:12PM -0400, Jason Gunthorpe wrote:
> > Yeah, so, what I'm trying to say is that that might be the source of the
> > problem. Is the current page ownership attribution correct
>
> It should be correct.
>
> This mechanism is driven by pin_user_page(), (as it is the only API
> that can actually create a pin) so the cgroup owner of the page is
> broadly related to the "owner" of the VMA's inode.
>
> The owner of the pin is the caller of pin_user_page(), which is
> initiated by some FD/process that is not necessarily related to the
> VMA's inode.
>
> Eg concretely, something like io_uring will do something like:
>   buffer = mmap()  <- Charge memcg for the pages
>   fd = io_uring_setup(..)
>   io_uring_register(fd,xx,buffer,..);  <- Charge the pincg for the pin
>
> If mmap is a private anonymous VMA created by the same process then it
> is likely the pages will have the same cgroup as io_uring_register and
> the FD.
>
> Otherwise the page cgroup is unconstrained. MAP_SHARED mappings will
> have the page cgroup point at whatever cgroup was first to allocate
> the page for the VMA's inode.
>
> AFAIK there are few real use cases to establish a pin on MAP_SHARED
> mappings outside your cgroup. However, it is possible, the APIs allow
> it, and for security sandbox purposes we can't allow a process inside
> a cgroup to trigger a charge on a different cgroup. That breaks the
> sandbox goal.

It seems broken anyway. Please consider the following scenario:

1. A is a tiny cgroup which only does streaming IOs and has memory.high of
   128M, which is more than sufficient for its IO window. The last file it
   streamed happened to be F, which was about 256M.

2. B is a much larger cgroup w/ a pin limit way above 256M. B pins the
   entirety of F.

3. A now tries to stream another file, but F is almost fully occupying its
   memory allowance and can't be evicted. A keeps thrashing due to lack of
   memory and isolation is completely broken.

This stems directly from the discrepancy between page ownership and pin
accounting.

> If memcg could support multiple owners then it would be logical that
> the pinner would be one of the memcg owners.
>
> > for whatever reason is determining the pinning ownership or should the page
> > ownership be attributed the same way too? If they indeed need to differ,
> > that probably would need pretty strong justifications.
>
> It is inherent to how pin_user_pages() works. It is an API that
> establishes pins on existing pages. There is nothing about it that says
> who the page's memcg owner is.
>
> I don't think we can do anything about this without breaking things.

That's a discrepancy in an internal interface and we don't wanna codify
something like that into a userspace interface. Semantically, it seems like
if pin_user_pages() wants to charge pinning to the cgroup associated with an
fd (or whatever), it should also claim the ownership of the pages
themselves. I have no idea how feasible that'd be from the memcg POV tho.
Given that this would be a fairly cold path (in most cases, the ownership
should already match), maybe it won't be too bad?

Thanks.

--
tejun
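
The A/B/F scenario in the message above can be walked through with a small toy model. This is only an illustrative sketch, not kernel code: the `Cgroup` and `Page` classes and the `stream`, `reclaim_one` and `pin_all` helpers are hypothetical names, memory is counted in whole megabytes, and "reclaim" just drops the first unpinned page. It assumes pages stay charged to the cgroup that first read them while the pin is charged to the pinner, which is the split the thread is discussing.

```python
# Toy model only: NOT kernel code. Every class and function name here is
# hypothetical and exists purely to illustrate the scenario in the email.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    owner: str                       # memcg owner: the cgroup that read it first
    pinned_by: Optional[str] = None  # cgroup holding a pin, if any


@dataclass
class Cgroup:
    name: str
    mem_high: int                    # memory.high-style limit, in MB
    resident: List[Page] = field(default_factory=list)  # pages charged here
    pin_charged: int = 0             # MB of pins charged to this cgroup


def reclaim_one(cg: Cgroup) -> bool:
    """Evict one unpinned page charged to cg; return False if none exists."""
    for i, page in enumerate(cg.resident):
        if page.pinned_by is None:
            del cg.resident[i]
            return True
    return False


def stream(cg: Cgroup, size_mb: int) -> int:
    """Stream size_mb through cg; return how many MB could not get memory."""
    stalled = 0
    for _ in range(size_mb):
        if len(cg.resident) >= cg.mem_high and not reclaim_one(cg):
            stalled += 1             # at the limit and nothing is evictable
            continue
        cg.resident.append(Page(owner=cg.name))
    return stalled


def pin_all(pinner: Cgroup, pages: List[Page]) -> None:
    """Pin pages on behalf of pinner: the pin is charged to the pinner,
    but the pages stay charged to their original owner."""
    for page in pages:
        page.pinned_by = pinner.name
        pinner.pin_charged += 1


a = Cgroup("A", mem_high=128)
b = Cgroup("B", mem_high=4096)

stalled_before = stream(a, 256)      # step 1: A streams F (about 256M)
pin_all(b, a.resident)               # step 2: B pins F's pages charged to A
stalled_after = stream(a, 64)        # step 3: A tries to stream another file

print(stalled_before, stalled_after, len(a.resident), b.pin_charged)
```

In this model A streams F without stalling, but once B pins the pages, every later allocation in A stalls: A's allowance is fully held by pages it cannot evict, while B carries only the pin charge and none of the memory charge.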