From: Nhat Pham
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    bfoster@redhat.com, willy@infradead.org, linux-api@vger.kernel.org,
    kernel-team@meta.com
Subject: [PATCH v12 0/3] cachestat: a new syscall for page cache state of files
Date: Fri, 21 Apr 2023 16:14:18 -0700
Message-Id: <20230421231421.2401346-1-nphamcs@gmail.com>

Changelog:
v12:
  * Update the cover letter with more cachestat use cases and security
    concerns (suggested by Johannes Weiner and Andres Freund).
  * Fix a bug that crashes cachestat when processing recently evicted
    pages.
  * Add a new benchmark for cachestat vs. mincore.
v11:
  * Clean up code and comments/documentation. (patch 1 and 2)
    (suggested by Matthew Wilcox)
  * Drop support for hugetlbfs (patch 2)
    (from discussion with Johannes Weiner and Matthew Wilcox).
v10:
  * Reorder the arguments for archs with alignment requirements.
    (patch 2) (suggested by Arnd Bergmann)
v9:
  * Remove syscall from all the architectures syscall table except x86
    (patch 2)
  * API change: handle different cases for offset and add compat syscall.
    (patch 2) (suggested by Johannes Weiner and Arnd Bergmann)
v8:
  * Add syscall to mips syscall tables (detected by kernel test robot)
    (patch 2)
  * Add a missing return (suggested by Yu Zhao) (patch 2)
v7:
  * Fix and use lru_gen_test_recent (suggested by Brian Foster) (patch 2)
  * Small formatting and organizational fixes
v6:
  * Add a missing fdput() (suggested by Brian Foster) (patch 2)
  * Replace cstat_size with cstat_version (suggested by Brian Foster)
    (patch 2)
  * Add conditional resched to the xas walk. (suggested by Hillf Danton)
    (patch 2)
v5:
  * Separate first patch into its own series.
    (suggested by Andrew Morton)
  * Expose filemap_cachestat() to non-syscall usage (patch 2)
    (suggested by Brian Foster).
  * Fix some build errors from last version. (patch 2)
  * Explain eviction and recent eviction in the draft man page and
    documentation (suggested by Andrew Morton). (patch 2)
v4:
  * Refactor cachestat and move it to mm/filemap.c (patch 3)
    (suggested by Brian Foster)
  * Remove redundant checks (!folio, access_ok) (patch 3)
    (suggested by Matthew Wilcox and Al Viro)
  * Fix a bug in handling multipages folio. (patch 3)
    (suggested by Matthew Wilcox)
  * Add a selftest for shmem files, which can be used to test huge pages
    (patch 4) (suggested by Johannes Weiner)
v3:
  * Fix some minor formatting issues and build errors.
  * Add the new syscall entry to missing architecture syscall tables.
    (patch 3).
  * Add flags argument for the syscall. (patch 3).
  * Clean up the recency refactoring (patch 2) (suggested by Yu Zhao)
  * Add the new Kconfig (CONFIG_CACHESTAT) to disable the syscall.
    (patch 3) (suggested by Josh Triplett)
v2:
  * len == 0 means query to EOF. len < 0 is invalid.
    (patch 3) (suggested by Brian Foster)
  * Make cachestat extensible by adding the `cstat_size` argument in the
    syscall (patch 3)

There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.

Some use cases where this information could come in handy:
  * Allowing a database to decide whether to perform an index scan or
    direct table queries based on the in-memory cache state of the index.
  * Visibility into the writeback algorithm, for diagnosing performance
    issues.
  * Workload-aware writeback pacing: estimating IO fulfilled by page cache
    (and IO to be done) within a range of a file, allowing for more
    frequent syncing when and where there is IO capacity, and batching
    when there is not.
  * Computing memory usage of large files/directory trees, analogous to
    the du tool for disk usage.

More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/

This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired up for the x86 architecture.

This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, similar issues) as mincore.
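A summary like the one described above could be consumed from userspace
roughly as follows. This is only a sketch under stated assumptions: the
argument layout and the syscall number (451 on x86) are those of the
interface as it later landed in mainline Linux 6.5, which may differ from
this revision of the series. The sketch_* names are hypothetical and
defined locally instead of being taken from any header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* ASSUMPTION: x86 syscall number of cachestat as merged in Linux 6.5. */
#define SKETCH_NR_CACHESTAT 451

/* ASSUMPTION: mirrors the mainline range argument; len == 0 means "to EOF". */
struct sketch_cachestat_range {
	uint64_t off;
	uint64_t len;
};

/* ASSUMPTION: mirrors the mainline result layout, all counts in pages. */
struct sketch_cachestat {
	uint64_t nr_cache;            /* pages present in the page cache */
	uint64_t nr_dirty;            /* dirty pages */
	uint64_t nr_writeback;        /* pages marked for writeback */
	uint64_t nr_evicted;          /* pages evicted from the cache */
	uint64_t nr_recently_evicted; /* evicted recently enough to matter */
};

/* Query [off, off + len) of an open file. Returns 0, or -1 with errno set. */
int sketch_cachestat_query(int fd, uint64_t off, uint64_t len,
			   struct sketch_cachestat *out)
{
	struct sketch_cachestat_range range = { .off = off, .len = len };

	assert(out != NULL);
	return (int)syscall(SKETCH_NR_CACHESTAT, (unsigned int)fd,
			    &range, out, 0U);
}

/* Convenience wrapper: summarize a whole file by path, no mmap needed. */
int sketch_cachestat_path(const char *path, struct sketch_cachestat *out)
{
	int fd = open(path, O_RDONLY);
	int rc, saved_errno;

	if (fd < 0)
		return -1;
	rc = sketch_cachestat_query(fd, 0, 0, out);
	saved_errno = errno;	/* close() may clobber errno */
	close(fd);
	errno = saved_errno;
	return rc;
}
```

On a kernel without the syscall the call fails with ENOSYS, so a caller
would want to fall back to mincore() or report the feature as unavailable.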
Relevant links to the earlier fincore discussion:

https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html

I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. The user can choose between
mincore and cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, five runs each:

Using cachestat:
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689

Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046

I also ran both syscalls on a 2TB sparse file:

Using cachestat:
real    0m0.009s
user    0m0.000s
sys     0m0.009s

Using mincore:
real    0m37.510s
user    0m2.934s
sys     0m34.558s

Very large files like this are the pathological case for mincore. In fact,
to compute the stats for a single 2TB file, mincore takes about as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. mincore is
clearly not suitable for a general-purpose command line tool.

Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
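For reference, the mincore() side of the comparison can be sketched as
below. This is not the benchmarked tool, only a minimal illustration of the
steps userspace has to take: map the file, ask the kernel for one byte per
page, then aggregate the vector itself. The function names are hypothetical.
Assuming 4 KiB pages and a 2 TiB file, the residency vector alone is 2^29
bytes (512 MiB), which is why huge files are the pathological case.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Size of the mincore() residency vector: one byte per page, rounded up. */
uint64_t mincore_vec_bytes(uint64_t file_size, uint64_t page_size)
{
	assert(page_size > 0);
	return (file_size + page_size - 1) / page_size;
}

/*
 * Count the pages of `path` that mincore() reports as resident.
 * Returns the page count, or -1 on error. An empty file yields 0.
 */
long count_resident_pages_mincore(const char *path)
{
	long resident = -1;
	struct stat st;
	unsigned char *vec = NULL;
	void *map;
	size_t len, nr_pages, i;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)
		goto out_close;
	if (st.st_size == 0) {
		resident = 0;	/* nothing to map */
		goto out_close;
	}

	len = (size_t)st.st_size;
	nr_pages = (size_t)mincore_vec_bytes(len,
					     (uint64_t)sysconf(_SC_PAGESIZE));

	/* The file has to be mapped before mincore() can inspect it. */
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	vec = malloc(nr_pages);
	if (vec && mincore(map, len, vec) == 0) {
		resident = 0;
		for (i = 0; i < nr_pages; i++)
			resident += vec[i] & 1;	/* bit 0: page is resident */
	}

	free(vec);
	munmap(map, len);
out_close:
	close(fd);
	return resident;
}
```

A directory walker would call count_resident_pages_mincore() once per
regular file, paying the mmap/munmap cost and the per-page copy every time.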
This series consists of 3 patches:

Nhat Pham (3):
  workingset: refactor LRU refault to expose refault recency check
  cachestat: implement cachestat syscall
  selftests: Add selftests for cachestat

 MAINTAINERS                                  |   7 +
 arch/x86/entry/syscalls/syscall_32.tbl       |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl       |   1 +
 include/linux/compat.h                       |   4 +
 include/linux/swap.h                         |   1 +
 include/linux/syscalls.h                     |   3 +
 include/uapi/asm-generic/unistd.h            |   5 +-
 include/uapi/linux/mman.h                    |   9 +
 init/Kconfig                                 |  10 +
 kernel/sys_ni.c                              |   1 +
 mm/filemap.c                                 | 179 ++++++++++++
 mm/workingset.c                              | 150 ++++++----
 tools/testing/selftests/Makefile             |   1 +
 tools/testing/selftests/cachestat/.gitignore |   2 +
 tools/testing/selftests/cachestat/Makefile   |   8 +
 .../selftests/cachestat/test_cachestat.c     | 257 ++++++++++++++++++
 16 files changed, 590 insertions(+), 49 deletions(-)
 create mode 100644 tools/testing/selftests/cachestat/.gitignore
 create mode 100644 tools/testing/selftests/cachestat/Makefile
 create mode 100644 tools/testing/selftests/cachestat/test_cachestat.c

-- 
2.34.1