From: David Finkel
Date: Tue, 16 Jul 2024 16:18:23 -0400
Subject: Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
To: Tejun Heo
Cc: Michal Hocko, Muchun Song, Andrew Morton, core-services@vimeo.com,
 Jonathan Corbet, Roman Gushchin, Shakeel Butt, Shuah Khan,
 Johannes Weiner, Zefan Li, cgroups@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org
References: <20240715203625.1462309-1-davidf@vimeo.com>
 <20240715203625.1462309-2-davidf@vimeo.com>

On Tue, Jul 16, 2024 at 3:48 PM Tejun Heo wrote:
>
> Hello,
>
> On Tue, Jul 16, 2024 at 01:10:14PM -0400, David Finkel wrote:
> > > Swap still has bad reps but there's nothing drastically worse about it than
> > > page cache. ie. If you're under memory pressure, you get thrashing one way
> > > or another.
> > > If there's no swap, the system is just memlocking anon memory
> > > even when they are a lot colder than page cache, so I'm skeptical that no
> > > swap + mostly anon + kernel OOM kills is a good strategy in general
> > > especially given that the system behavior is not very predictable under OOM
> > > conditions.
> >
> > The reason we need peak memory information is to let us schedule work in a
> > way that we generally avoid OOM conditions. For the workloads I work on,
> > we generally have very little in the page-cache, since the data isn't
> > stored locally most of the time, but streamed from other storage/database
> > systems. For those cases, demand-paging will cause large variations in
> > servicing time, and we'd rather restart the process than have
> > unpredictable latency. The same is true for the batch/queue-work system I
> > wrote this patch to support. We keep very little data on the local disk,
> > so the page cache is relatively small.
>
> You can detect these conditions more reliably and *earlier* using PSI
> triggers with swap enabled than hard allocations and OOM kills. Then, you
> can take whatever decision you want to take including killing the job
> without worrying about the whole system severely suffering. You can even do
> things like freezing the cgroup and taking backtraces and collecting other
> debug info to better understand why the memory usage is blowing up.
>
> There are of course multiple ways to go about things but I think it's useful
> to note that hard alloc based on peak usage + OOM kills likely isn't the
> best way here.

To be clear, my goal with peak memory tracking is to bin-pack in a way
that I don't encounter OOMs. I'd prefer to have a bit of headroom and avoid
OOMs if I can.
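
A rough sketch of the kind of bin-packing described above, for illustration
only. It is not code from the patch or from any real scheduler. It assumes
that each job's observed peak is read from the cgroup v2 memory.peak file
(in bytes) and that jobs are placed first-fit-decreasing onto identical
hosts. The names Host, read_peak, pack and headroom_pct are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A machine with a fixed memory capacity (bytes) shared by packed jobs."""
    capacity: int
    jobs: list = field(default_factory=list)
    used: int = 0


def read_peak(cgroup_dir: str) -> int:
    """Read the recorded peak usage (bytes) from a cgroup's memory.peak file."""
    with open(f"{cgroup_dir}/memory.peak") as f:
        return int(f.read().strip())


def pack(peaks: dict, capacity: int, headroom_pct: int = 10) -> list:
    """First-fit-decreasing packing of jobs by observed peak memory.

    headroom_pct percent of every host's capacity is left unallocated,
    so a job that slightly exceeds its previous peak does not
    immediately push the host into an OOM condition.
    """
    budget = capacity * (100 - headroom_pct) // 100
    hosts = []
    # Place the largest jobs first; each goes on the first host with room.
    for name, peak in sorted(peaks.items(), key=lambda kv: kv[1], reverse=True):
        if peak > budget:
            raise ValueError(f"{name}: peak {peak} exceeds host budget {budget}")
        for host in hosts:
            if host.used + peak <= budget:
                break
        else:
            # No existing host has room, so open a new one.
            host = Host(capacity=capacity)
            hosts.append(host)
        host.jobs.append(name)
        host.used += peak
    return hosts
```

With peaks gathered via read_peak() for each job's cgroup, pack() returns one
Host per machine needed. The headroom percentage is the knob that trades
density against the risk of hitting OOM conditions.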

PSI does seem like a wonderful tool, and I do intend to use it, but since it's
a reactive signal and doesn't provide absolute values for the total memory usage
that we'd need to figure out in our central scheduler which work can cohabitate
(and how many instances), it complements memory.peak rather than replacing
my need for it.

FWIW, at the moment, we have some (partially broken) OOM-detection, which
does make sense to swap out for PSI tracking/trigger-watching that takes care of
scaling down workers when there's resource-pressure.
(Thanks for pointing out that PSI is generally a better signal than
OOMs for memory pressure)

Thanks again,

>
> ...
> > I appreciate the ownership issues with the current resetting interface in
> > the other locations. However, this peak RSS data is not used by all that
> > many applications (as evidenced by the fact that the memory.peak file was
> > only added a bit over a year ago). I think there are enough cases where
> > ownership is enforced externally that mirroring the existing interface to
> > cgroup2 is sufficient.
>
> It's fairly new addition and its utility is limited, so it's not that widely
> used. Adding reset makes it more useful but in a way which can be
> detrimental in the long term.
>
> > I do think a more stateful interface would be nice, but I don't know
> > whether I have enough knowledge of memcg to implement that in a reasonable
> > amount of time.
>
> Right, this probably isn't trivial.
>
> > Ownership aside, I think being able to reset the high watermark of a
> > process makes it significantly more useful. Creating new cgroups and
> > moving processes around is significantly heavier-weight.
>
> Yeah, the setup / teardown cost can be non-trivial for short lived cgroups.
> I agree that having some way of measuring peak in different time intervals
> can be useful.
>
> Thanks.
>
> --
> tejun

-- 
David Finkel
Senior Principal Software Engineer, Core Services