From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 7 Feb 2023 11:21:15 -0800
Subject: Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
To: Johannes Weiner
Cc: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Christian Brauner
References: <20230202155626.1829121-1-hannes@cmpxchg.org>

On Tue, Feb 7, 2023 at 11:09 AM Johannes Weiner wrote:
>
> On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed wrote:
> > On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner wrote:
> > >
> > > Christian reports the following situation in a cgroup that doesn't
> > > have memory.swap.max configured:
> > >
> > > $ cat memory.swap.events
> > > high 0
> > > max 0
> > > fail 6218
> > >
> > > Upon closer examination, this is an ARM64 machine that doesn't
> > > support swapping out THPs.
> > > In that case, the first get_swap_page() fails, and
> > > the kernel falls back to splitting the THP and swapping the 4k
> > > constituents one by one. /proc/vmstat confirms this with a high
> > > rate of thp_swpout_fallback events.
> > >
> > > While the behavior can ultimately be explained, it's unexpected
> > > and confusing. I see three choices how to address this:
> > >
> > > a) Specifically exclude THP fallbacks from being counted, as the
> > > failure is transient and the memory is ultimately swapped.
> > >
> > > Arguably, though, the user would like to know if their cgroup's
> > > swap limit is causing high rates of THP splitting during swapout.
> >
> > We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
> > completeness) to memcg events for this if/when a use case arises,
> > right?
>
> Yes, we can add that to memory.stat.
>
> > > b) Only count cgroup swap events when they are actually due to a
> > > cgroup's own limit. Exclude failures that are due to physical swap
> > > shortage or other system-level conditions (like !THP_SWAP). Also
> > > count them at the level where the limit is configured, which may
> > > be above the local cgroup that holds the page-to-be-swapped.
> > >
> > > This is in line with how memory.swap.high, memory.high and
> > > memory.max events are counted.
> > >
> > > However, it's a change in documented behavior.
> >
> > This option makes sense to me, but I can't speak to the change of
> > documented behavior. However, looking at the code, it seems like if
> > we do this the "max" & "fail" counters become effectively the same.
> > "fail" would not provide much value then.
>
> Right.
>
> > I wonder if it makes sense to have both, and clarify that "fail" -
> > "max" would be non-limit based failures (e.g. ran out of swap
> > space), or would this cause confusion as to whether those non-limit
> > failures were transient (THP fallback) or eventual?
>
> If we add the fallback events, the user can calculate it.
> I wouldn't split the fail counter itself. There are too many reasons
> why swap can fail, half of them implementation-defined (as in the ARM
> example).
>
> So I think I'll send patches either way to:
>
> 1. Fix the hierarchical accounting of the events to make it
> consistent with other counters.
>
> 2. Add THP swap/fallback counts to memory.stat

Sounds good, thanks!

> We could consider excluding THP fallbacks from the fail count. But it
> seems really wrong for the cgroup controller to start classifying
> individual types of failures in the swap layer and make decisions on
> how to report them to the user. Cgroups really shouldn't be in the
> business of making up their own MM events. They should provide
> per-cgroup accounting of native MM events. And nobody has felt the
> need to add native swap failure counts yet.
>
> So I'd argue we should either remove the swap fail count altogether
> for all the reasons mentioned, or just leave it as is and as a
> documented interface that is unfortunately out the door.
>
> Thoughts?

Agreed. We should either report all failures, or only the failures
specific to cgroup limits (which the max count already tracks).

As for leaving the fail count as-is versus removing it completely, I
don't feel strongly either way. If we add THP swap fallbacks to
memory.stat, then the fail counter becomes more understandable: you
can break it down into (max, THP swap fallback, others), with "others"
most likely meaning swap is full. Arguably, the interesting (or
cgroup-relevant) components are already there regardless.

The counter might still be useful from a monitoring perspective: if a
memcg OOMs, for example, a high fail count shows that it tried to swap
multiple times but failed; we can then look at the max count and the
THP fallbacks to understand where the failures came from.

I would say we can leave it as-is, but whatever you prefer.
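
To make the breakdown above concrete, here is a small shell sketch of
how a monitoring script could split a cgroup's fail count into its
components. It assumes cgroup v2, and that the thp_swpout_fallback
entry proposed in this thread actually lands in memory.stat; the
function name and the cgroup path are illustrative only:

```shell
# swap_fail_breakdown DIR: split DIR's memory.swap.events "fail" count
# into limit-induced failures ("max"), THP swapout fallbacks, and the
# remainder (most likely swap space exhaustion).
swap_fail_breakdown() {
    dir=$1
    fail=$(awk '$1 == "fail" { print $2 }' "$dir/memory.swap.events")
    max=$(awk '$1 == "max" { print $2 }' "$dir/memory.swap.events")
    # thp_swpout_fallback in memory.stat is the entry proposed in this
    # thread; default to 0 on kernels that don't expose it.
    thp=$(awk '$1 == "thp_swpout_fallback" { print $2 }' "$dir/memory.stat" 2>/dev/null)
    thp=${thp:-0}
    echo "max=$max thp_fallback=$thp other=$((fail - max - thp))"
}
```

On the counters from Christian's machine (fail 6218, max 0, and no
fallback entry in memory.stat yet), everything would land in "other",
which is exactly the confusion being discussed.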