From: Daniel Dao <dqminh@cloudflare.com>
Date: Wed, 23 Feb 2022 13:51:18 +0000
Subject: Regression in workingset_refault latency on 5.15
To: shakeelb@google.com
Cc: kernel-team, linux-mm@kvack.org, hannes@cmpxchg.org, guro@fb.com,
    feng.tang@intel.com, mhocko@kernel.org, hdanton@sina.com,
    mkoutny@suse.com, akpm@linux-foundation.org, torvalds@linux-foundation.org
Hi all,

We are observing regressions in workingset_refault on our newly upgraded
5.15.19 nodes, which use zram as swap. This manifests in several ways:

1) Regression of workingset_refault duration observed in flamegraphs

We regularly collect flamegraphs for services running on the node. Since
the upgrade to 5.15.19, workingset_refault occupies a more significant
part of the service flamegraph (13%) with the following call trace:

  workingset_refault+0x128
  add_to_page_cache_lru+0x9f
  page_cache_ra_unbounded+0x154
  force_page_cache_ra+0xe2
  filemap_get_pages+0xe9
  filemap_read+0xa4
  xfs_file_buffered_read+0x98
  xfs_file_read_iter+0x6a
  new_sync_read+0x118
  vfs_read+0xf2
  __x64_sys_pread64+0x89
  do_syscall_64+0x3b
  entry_SYSCALL_64_after_hwframe+0x44

2) Regression of performance-sensitive userspace code

We have performance-sensitive code running in userspace whose runtime is
measured with CLOCK_THREAD_CPUTIME_ID. It looks roughly like:

  now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
  func()
  elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now

Since the 5.15 upgrade, we observe long `elapsed` values in the range of
4-10ms much more frequently than before. This went away after we disabled
swap for the service via the `memory.swap.max=0` memcg setting.

The new thing in workingset_refault on 5.15 seems to be the introduction of
mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
("memcg: flush lruvec stats in the refault"). Given that
mem_cgroup_flush_stats() can take quite a long time for us on the standard
systemd cgroupv2 hierarchy (root / system.slice / workload.service):

  sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
  Tracing function calls slower than 10 ms... Ctrl+C to quit.
  TIME       COMM         PID      LAT(ms)       RVAL FUNC
  0.000000                804776     11.50        200 mem_cgroup_flush_stats
  0.343383                647496     10.58        200 mem_cgroup_flush_stats
  0.604309                804776     10.50        200 mem_cgroup_flush_stats
  1.230416                803293     10.01        200 mem_cgroup_flush_stats
  1.248442                646400     11.02        200 mem_cgroup_flush_stats

could it be that workingset_refault in some unfortunate cases now takes
much longer than before, such that it increases the time observed by
CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
workingset_refault observed by perf?
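For context, the userspace measurement described above is roughly
equivalent to the following minimal sketch; do_work() and the 10 ms
threshold are illustrative placeholders, not our actual service code:

  #include <stdio.h>
  #include <time.h>

  /* Illustrative stand-in for the performance-sensitive work being timed. */
  static void do_work(void)
  {
          /* e.g. pread() from an XFS file, which may refault pages */
  }

  static double thread_cpu_ms(void)
  {
          struct timespec ts;

          /*
           * CPU time charged to the calling thread only. Kernel work done
           * in this thread's context (such as workingset_refault on the
           * read path) is included in this clock.
           */
          clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
          return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
  }

  int main(void)
  {
          double start = thread_cpu_ms();

          do_work();

          double elapsed = thread_cpu_ms() - start;
          if (elapsed > 10.0)   /* threshold is illustrative */
                  fprintf(stderr, "slow call: %.2f ms thread CPU time\n",
                          elapsed);
          return 0;
  }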