From: Nhat Pham
Date: Fri, 24 Mar 2023 14:59:26 -0700
Subject: Re: [PATCH v11 0/3] cachestat: a new syscall for page cache state of files
To: Andres Freund
Cc: Johannes Weiner, Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org, bfoster@redhat.com, willy@infradead.org, arnd@arndb.de, linux-api@vger.kernel.org, kernel-team@meta.com
References: <20230308032748.609510-1-nphamcs@gmail.com>
 <20230314160041.960ede03d5f5ff3dbb3e3fd0@linux-foundation.org>
 <20230315170934.GA97793@cmpxchg.org>
 <20230315191459.f3z3gahxdew4dwrv@awork3.anarazel.de>
In-Reply-To: <20230315191459.f3z3gahxdew4dwrv@awork3.anarazel.de>

On Wed, Mar 15, 2023 at 12:15 PM Andres Freund wrote:
>
> Hi,
>
> On 2023-03-15 13:09:34 -0400, Johannes Weiner wrote:
> > On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> > > A while ago I asked about the security implications - could cachestat()
> > > be used to figure out what parts of a file another user is reading.
> > > This also applies to mincore(), but cachestat() newly permits user A to
> > > work out which parts of a file user B has *written* to.
> >
> > The caller of cachestat() must have the file open for reading. If they
> > can read the contents that B has written, is the fact that they can
> > see dirty state really a concern?
>
> Random idea: Only fill ->dirty/writeback if the fd is open for writing.
>
>
> > > Secondly, I'm not seeing description of any use cases. OK, it's faster
> > > and better than mincore(), but who cares? In other words, what
> > > end-user value compels us to add this feature to Linux?
> >
> > Years ago there was a thread about adding dirty bits to mincore(), I
> > don't know if you remember this:
> >
> > https://lkml.org/lkml/2013/2/10/162
> >
> > In that thread, Rusty described a usecase of maintaining a journaling
> > file alongside a main file. The idea for testing the dirty state isn't
> > to call sync but to see whether the journal needs to be updated.
> >
> > The efficiency of mincore() was touched on too. Andres Freund (CC'd,
> > hopefully I got the email address right) mentioned that Postgres has a
> > usecase for deciding whether to do an index scan or query tables
> > directly, based on whether the index is cached. Postgres works with
> > files rather than memory regions, and Andres mentioned that the index
> > could be quite large.
>
> This is still relevant, FWIW. And not just for deciding on the optimal query
> plan, but also for reporting purposes. We can show the user what part of the
> query has done how much IO, but that can end up being quite confusing because
> we're not aware of how much IO was fulfilled by the page cache.
>
>
> > Most recently, the database team at Meta reached out to us and asked
> > about the ability to query dirty state again. The motivation for this
> > was twofold. One was simply visibility into the writeback algorithm,
> > i.e. trying to figure out what it's doing when investigating
> > performance problems.
> >
> > The second usecase they brought up was to advise writeback from
> > userspace to manage the tradeoff between integrity and IO utilization:
> > if IO capacity is available, sync more frequently; if not, let the
> > work batch up. Blindly syncing through the file in chunks doesn't work
> > because you don't know in advance how much IO they'll end up doing (or
> > how much they've done, afterwards.) So it's difficult to build an
> > algorithm that will reasonably pace through sparsely dirtied regions
> > without the risk of overwhelming the IO device on dense ones. And it's
> > not straight-forward to do this from the kernel, since it doesn't know
> > the IO headroom the application needs for reading (which is dynamic).
>
> We ended up building something very roughly like that in userspace - each
> backend tracks the last N writes, and once the number reaches a certain
> limit, we sort and collapse the outstanding ranges and issue
> sync_file_range(SYNC_FILE_RANGE_WRITE) for them. Different types of tasks have
> different limits. Without that latency in write heavy workloads is ... not
> good (to this day, but to a lesser degree than 5-10 years ago).
>
>
> > Another query we get almost monthly is service owners trying to
> > understand where their memory is going and what's causing unexpected
> > pressure on a host. They see the cache in vmstat, but between a
> > complex application, shared libraries or a runtime (jvm, hhvm etc.)
> > and a myriad of host management agents, there is so much going on on
> > the machine that it's hard to find out who is touching which
> > files. When it comes to disk usage, the kernel provides the ability to
> > quickly stat entire filesystem subtrees and drill down with tools like
> > du. It sure would be useful to have the same for memory usage.
>
> +1
>
> Greetings,
>
> Andres Freund

Thanks for the suggestion/discussion regarding cachestat's use cases,
Johannes and Andres! I'll put a summary of these points (along with a link to
the original discussion thread) in the cover letter and commit message of the
new version of the patch set.

In the meantime, feel free to let me know if there is something else cachestat
could help with (along with any improvements that could facilitate such use
cases).

Best,
Nhat