From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6967CC46CD3 for ; Wed, 27 Dec 2023 00:53:37 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id BF6026B007D; Tue, 26 Dec 2023 19:53:36 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id BA68B6B007E; Tue, 26 Dec 2023 19:53:36 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A6D0D6B0080; Tue, 26 Dec 2023 19:53:36 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 96B3C6B007D for ; Tue, 26 Dec 2023 19:53:36 -0500 (EST) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 6A5B4C01BF for ; Wed, 27 Dec 2023 00:53:36 +0000 (UTC) X-FDA: 81610775232.13.7EA2034 Received: from mail-pl1-f175.google.com (mail-pl1-f175.google.com [209.85.214.175]) by imf04.hostedemail.com (Postfix) with ESMTP id 9AE0B40008 for ; Wed, 27 Dec 2023 00:53:34 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=pbabR5Bs; spf=pass (imf04.hostedemail.com: domain of rientjes@google.com designates 209.85.214.175 as permitted sender) smtp.mailfrom=rientjes@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1703638414; a=rsa-sha256; cv=none; b=aqznlvNz2zBJH/t6Vrsh2ZhuxmeVBbG7KHmeGrQGJ7MNYQEN1Uij0QJrGMgFqbQ7jB+UFN JUQdsk0Fj4fUp3dHL62/STfVDjCHjsiIK0m0FXHhoyw4HAHh6IEqgZgpqscmBDwvozNk7q QfLGjrBF0864m+LhAZvkgxengza66a8= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=pbabR5Bs; spf=pass (imf04.hostedemail.com: domain of rientjes@google.com designates 209.85.214.175 as permitted sender) smtp.mailfrom=rientjes@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1703638414; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=sehOb+VMc78wKPL1TpSEJ0N7yS0xhk1UrWSEsSbLGEk=; b=yjwL+nPoS/5+Wot5Pk0TmuvjfWbZTbA9SNzRsxxAkz5134e1b4kNo2A3F9CJp4qWKAOL6x zsOJOk1EOC2pXcCrqfNkhPETpKLXxCekxspVd5k/8qwIy6GKLUxL/1ORz7auYcXLT+v3BX 48r5lwunBY1Koeq+17o1eLA2NVndvwc= Received: by mail-pl1-f175.google.com with SMTP id d9443c01a7336-1d3fde109f2so470795ad.1 for ; Tue, 26 Dec 2023 16:53:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1703638413; x=1704243213; darn=kvack.org; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=sehOb+VMc78wKPL1TpSEJ0N7yS0xhk1UrWSEsSbLGEk=; b=pbabR5Bsp21in4S6Lbh146prAcGzvLwTLSXN3Dqyo7MnUCLsLvsKwuAB2/wufu8Nu9 9srHUE59aynEsFzkR9agOG1ALtQvZQ9fwD9KKHMDURyFxGDfuv0iWqgRS93erhnIuM1z DJ6kYtSkXFWUqHT8hXtmgM7Va38pQbzLmIxSr0Fl9QdlsNGnh1GU8GztURoojU2OLnBT J8MhmP1l37ZSwADvw+sOGGljUldqhMLfQNxBN31ORbT421+mBIY8tW4j3OQ+Qus0ebVX HT3WGNyJCNhbdxjuoKFZR+YdTzrFLdNUpsAMMJ9EWoxyLswBpWpEfrHArKCxf+aHUGhs 6U0g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1703638413; x=1704243213; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=sehOb+VMc78wKPL1TpSEJ0N7yS0xhk1UrWSEsSbLGEk=; b=VlVxEBfcQmOpfctW9s+Ln72ZN9WwpMH3eM60gHU1sHZUGIjYUO7+IAsybwclYcZKqM Wymv/X0MAQp7S7fsGrprfjT1H+B+IRFvoWHqjhsdajZPhnuN/8VSjdFLAG5As0f+Hu0y 1GCo+/ImOmXMlqrNdU9djNd8J+dvJZyx6BnBidxlxk3GYCctyRoiM1ZTfPDut04uRM+t mgvqw2DBzXvCxbubZEp1glRXHqOOaA4hCP1a+NNiQsJAbDc2K9dLh7/v3VZkWfmOZI/g OAtCec1VR1Xp7g40/FegCxjLuqgv64CNwOc8V8hPy9TJVvCNdUz47+CY1hZRu6uV4i8y Tzcg== X-Gm-Message-State: AOJu0Yz/Qhzar7ZEuwFhwFQlsNJmwXsF1rS2YxsWDwXIRMLyvzSQcmcH Ktdhc6N2jhFVHQySig7TRxaF2EWsJOZi X-Google-Smtp-Source: AGHT+IFFtPjyAMjq4eKscHzLFbwxPV9QJHI7QFXIhLqeyUYv9QY4jeEdcH6tYGkSv2Fz/JNeL/S6jA== X-Received: by 2002:a17:902:bb84:b0:1d4:55b3:45d9 with SMTP id m4-20020a170902bb8400b001d455b345d9mr233058pls.6.1703638413187; Tue, 26 Dec 2023 16:53:33 -0800 (PST) Received: from [2620:0:1008:15:5a8c:89e6:ca2f:ea30] ([2620:0:1008:15:5a8c:89e6:ca2f:ea30]) by smtp.gmail.com with ESMTPSA id 17-20020a17090a195100b0028b1fce7c01sm16313726pjh.37.2023.12.26.16.53.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 26 Dec 2023 16:53:32 -0800 (PST) Date: Tue, 26 Dec 2023 16:53:31 -0800 (PST) From: David Rientjes To: Pasha Tatashin , Linus Torvalds cc: Greg Kroah-Hartman , rafael@kernel.org, Andrew Morton , surenb@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, souravpanda@google.com Subject: Sysfs one-value-per-file (was Re: [PATCH] vmstat: don't auto expand the sysfs files) In-Reply-To: Message-ID: <13e5fbd4-d84d-faba-47f1-d0024d2c572d@google.com> References: <20231211154644.4103495-1-pasha.tatashin@soleen.com> <3d415ab4-e8c7-7e72-0379-952370612bdd@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 9AE0B40008 X-Stat-Signature: iuyqs745aarki6w9crrf5t1bfkwm1jut X-Rspam-User: X-HE-Tag: 1703638414-141188 X-HE-Meta: U2FsdGVkX1++XAlNxfWt9j4wSjK9DMJPxrEdVWZM1u/yUC/pNEKbqAwygOoEVz/2gfcobHGpNuAs2oeoaB+euRjWt5WKIp8T4nPgkOHxbFfW/N12EYQZ8gbYcHLEh3vfsGeMdl6KeYDowTRwX0Dbe4latEPHIgFPNV/kldfYVRUhVLzTzecZjDgZ3PrhT0orqmqQe/Xoghg2EHeJylyQKaVjlNxjDoWLcukvPoxV2xNkUc7V5/sIhLOUz/PvP9zHpnOHW1WSOLKCE1ltQxA9XYg0jmXc1UtHGtYKM7cSl+7h/OnaKb1K/PIfeDmrjcrmGIgT+7mtgbvc0cpZ1sR5v1hUBwMMZxcGij/YXuF1ExQ9iBJEXYiicYz5dbLyESKIOEE15SQ3PYRxH/0AHvpu7eN5Q/WbG347ao3lDbI3Bx3ccOlO5zH45+gGsiLhOjWjXSxZV4e4GhRcYD31lk0eAr4AFqiNWWUMk1pEJ7YwmbasszzDrzazjnEbsIC6rSLhPJOR3Qyq1G8OZu4frF4+IQUDxxKzp7VhciSJflwoBNVLH+hYPkU77aO0q+NDCcXrrhhdU2fpcPjA0msHMmL546pP9vw50jLAipO1lcT7MwhFJZU2tpmP5Z5Cb8d3ipJBNGh92YjRekvEADvX3G6GOgF/CqxqCSzsnhCsd4DOHY8eh7Na4xmBCtO/2pmhmWW2gp5oWHDXRNfQIVYP/uQHXwHrha5/Mubx7/NN6GKao0P5OLG1BIr/Qj08qazmWYMXzK6R56S3MYZoN4cibM3HTBZ3gHCo3646/+2hMPc8AHpP1IerjUGsl+Uv1uUNNYL2c4neN8h8tMR5za9upp7Y3uRN7ABnECfpkb+7ntJ9UZHEq2oIjWmie4E9FdwAsXT5LfyoB5m0ekCu1EdWz/ZuZwfF+Rp+WenXi/Cp06yfxExLOTNLiFdCejHZzgBU4MZAFXE/jWtkxRLApK/WrOq LrMve/8a DMhY6mbJtNAuvnDBW1aQ10gwfyjxzYFTBcTvIDS5rdCgrIEBw0xvsSX/NfhyJUQHgtz7MsMKRCzNYFPr4+lvYKo+qTm9dkg6iP25qRouWbitqr1/rZCq3KdBld+RZGso7HuVP85QqmY/a+kjwFA4BXk5xLapiQBXb2U32ui+9KvMP31N90tqPTicevCY/qeZE7bGN26rsgLBI5gXJRWBi7g+wrHZg2xW6j57Qdvsrj/rojg/+qocN1G5tSaaucYVP7I4bmBEDqP/Bzq5Ny2SqWJJgI7vMmUDnKQ5hwMvdZVqqT5MR4/uLlyK1q3saS2N/nIHpTgM/IRBsl/L5bMnPf2K6IF3aXNWuMhvYXMzt197281GFQ500KAaMHj/og8F99B26x9kYkLzWlTU+Dw9ggHKNwA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000001, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, 26 Dec 2023, Pasha Tatashin wrote: > > > > > Whenever a new fields are added one of the following: node_stat_item > > > > > numa_stat_item zone_stat_item, the /sys/devices/system/node/nodeX/vmstat > > > > > files are auto expanded. > > > > > > > > > > This is a problem, as sysfs files should be only one value per file. > > > > > > > > Does this patch address the one-value-per-file issue? (I think that ship > > > > has sailed for vmstat.) > > > > > > That ship has sailed for vmstat, this patch addresses what was asked > > > by GregKH: not to add new values to vmstat, as not to make the > > > existing problem even worse. The sysfs file system has a one page > > > limit per file. The developers will decide how to export the new items > > > added to node_stat, numa_stat, zone_stat individually. Each new item > > > can be exported in its own files, and must have its own documentation > > > about interface stability, value meaning, and expectations when the > > > stat file is absent. > > > > > > > As of at least 6.5, /proc/vmstat is a strict superset of the per-node > > vmstat. Why is that a problem? > > The intent of this series is to stop auto expanding > /sys/devices/system/node/nodeX/vmstat as sysfs should only be one > value per file, and the task is not to make things worse. /proc/vmstat > is mostly ok, however we might not need to auto expand it as well, to > avoid situations where removing a field becomes a problem, and we have > to keep it in the file forever, like what we do with nr_unstable. > > > There's great benefit to being able to use the sample implementations to > > parse either /proc/vmstat *or* the per-node vmstat and without needing to > > read the per-node vmstat plus some new set of sysfs files that are > > one-value-per-file. The per-node vmstat will always be multiple values, > > in fact it's a key value pair. > > Yes, but that file is already large, and soon can overflow a page > size, instead of converting it to a binary format, let's add new items > as one item per-file. > > > I have to think that doing anything else for vmstat is just adding > > complexity (like this patch) and actually making it *harder* on userspace > > to read the data it needs. > > > > Yes, the per-node vmstat likely shouldn't be in sysfs at all but it > > appears to have been added there 13+ years ago because it was a convenient > > place to add a per-node variant. That's not ideal, but owell. > > It is up-to GregKH who requested this change. Greg, specifically > requested not to add new fields into per-node vmstat, and we are > adding new fields with per-page metadata series, and IOMMU accounting > series as these files are auto-expanded without this series. > Thanks, let's get clarification on this once and for all from Andrew and Greg. I'd argue that the ship on the "sysfs one-value-per-file rule" has sailed for long-standing use cases where either (1) switching is just not possible or (2) switching would be an undue burden to the user. An example of (1) would be THP enablement and defrag options: $ grep . /sys/kernel/mm/transparent_hugepage/{defrag,enabled,shmem_enabled} /sys/kernel/mm/transparent_hugepage/defrag:always defer defer+madvise [madvise] never /sys/kernel/mm/transparent_hugepage/enabled:[always] madvise never /sys/kernel/mm/transparent_hugepage/shmem_enabled:always within_size advise [never] deny force This convention isn't going to change. We're not going to suddenly add a new enablement or defrag option that can only be set in a newly added file that is one-value-per-file. THP was obviously introduced before any sysfs "one-value-per-file rule" and, in retrospect, I think most people would agree that these files would be much better off implemented returning a single word. But, choices where made in the "before times", and it was implemented in a way that shows all the possible choices and which one is in effect. Owell. We deal with it, and we move on. Imagine if I add a new choice of "lightweight" to the "defrag" file. The only rational way to do that would be to extend the current interface, not create a new /sys/kernel/mm/transparent_hugepage/defrag/lightweight file that is one-value-per-file that unsets all the other options in "defrag." Please. I use this an an extreme example, but a very real one: we can't break userspace and that will always supercede the desire to maintain one-value-per-file. Don't get me wrong, I am all *for* one-value-per-file, it has many benefits. But I disagree we should use it a a northstar for *existing* use cases if it actually means we'll be breaking userspace or making it much harder on userspace. Under discussion in this thread would be an example of (2). /proc/vmstat is a strict superset of its per-node /sys/devices/system/node/node*/vmstat counterparts. For ZVCs, we *want* to ensure new stats are added to both. For the same reason that we want to ensure old stats are removed from both. Starting to maintain completely different implementations in the kernel comes with complexity, but also with risk of bugs. Especially for something that works exceedigly well for the user today. Saying that we can extend /proc/vmstat for more entries but then *not* extend /sys/devices/system/node/node*/vmstat for the exact same entries, and forcing those a new files with a single value, is doing a disservice to the user who wants to collect the data comprehensively with the same implementation for the entire system and each node. The *only* time I think that makes sense is if all new vmstats must be added in their own files in both procfs (system-wide) and sysfs (per-node). That very well may be the path forward, but get ready for an explosion in new top-level /proc files. (It'll also be a shame for anybody who just dumps /proc/meminfo and /proc/vmstat in testing for capturing the state of the system.) The discussion about needing to make vmstat be a binary file because it doesn't fit into a single page is not very realistic. It will result in userspace breakage for everything parsing it as it is today, so let's just set that aside. If a sysfs file needs to support returning more than PAGE_SIZE of contents, which is just an implementation detail, so be it. For brand new interfaces, I'm all for it: add one-value-per-file and be done with it. But for existing files and conventions, I think we should settle it as "keep doing what you've been doing for 13+ years" and don't force this argument every time a kernel developer wants to just add one more stat. Over to Andrew and Greg.