From: Yu Zhao
Date: Fri, 15 Dec 2023 00:23:52 -0700
Subject: Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap
To: Henry Huang
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, 谈鉴锋, 朱辉(茶水), akpm@linux-foundation.org
In-Reply-To: <951fb7edab535cf522def4f5f2613947ed7b7d28.1701853894.git.henry.hj@antgroup.com>

On Wed, Dec 6, 2023 at 5:51 AM Henry Huang wrote:
>
> Multi-Gen LRU page-table walker clears pte young flag, but it doesn't
> clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check
> whether one page is accessed, it would tell us this page is idle,
> but actually this page has been accessed.
>
> For those unmapped filecache pages, page idle flag would not be
> cleared in folio_mark_accessed if Multi-Gen LRU is enabled.
> So we couldn't use /sys/kernel/mm/page_idle/bitmap to check whether
> a filecache page is read or written.
>
> What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag.
> If one page is accessed, it would set page young flag. Multi-Gen LRU
> page-table walker should check both page&pte young flags.
>
> how-to-reproduce-problem
>
> idle_page_track
>    a tool to track process accessed memory during a specific time
> usage
>    idle_page_track $pid $time
> how-it-works
>    1. scan process vma from /proc/$pid/maps
>    2. vfn --> pfn from /proc/$pid/pagemap
>    3. write /sys/kernel/mm/page_idle/bitmap to
>       mark phy page idle flag and clear pte young flag
>    4. sleep $time
>    5. read /sys/kernel/mm/page_idle/bitmap to
>       test_and_clear pte young flag and
>       return whether phy page is accessed
>
> test ---- test program
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, const char *argv[])
> {
>     char *buf = NULL;
>     char pipe_info[4096];
>     int n;
>     int fd = -1;
>
>     buf = malloc(1024*1024*1024UL);
>     memset(buf, 0, 1024*1024*1024UL);
>     fd = open("access.pipe", O_RDONLY);
>     if (fd < 0)
>         goto out;
>     while (1) {
>         n = read(fd, pipe_info, sizeof(pipe_info));
>         if (!n) {
>             sleep(1);
>             continue;
>         } else if (n < 0) {
>             break;
>         }
>         memset(buf, 0, 1024*1024*1024UL);
>         puts("finish access");
>     }
> out:
>     if (fd >= 0)
>         close(fd);
>     if (buf)
>         free(buf);
>
>     return 0;
> }
>
> prepare:
> mkfifo access.pipe
> ./test
> ps -ef | grep test
> root 4106 3148 8 06:47 pts/0 00:00:01 ./test
>
> We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan.
>
> case 1: mglru walker break page_idle
> ./idle_page_track 4106 60 &
> sleep 5; echo 1 > access.pipe
> sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen
>
> the output of idle_page_track is:
> Est(s)  Ref(MB)
> 64.822  1.00
> only found 1MB were accessed during 64.822s, but actually 1024MB were
> accessed.
>
> case 2: page_idle break mglru walker
> echo 1 > access.pipe
> ./idle_page_track 4106 10
> echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen
> lru gen status:
> memcg 8 /user.slice
> node 0
>   5  772458    1065  9735
>   6  737435  262244    72
>   7  538053    1184   632
>   8   59404    6422     0
> almost all pages should be in max_seq-1 queue, but actually not.
>
> Signed-off-by: Henry Huang

Regarding the change itself, it'd cause a slight regression to other
use cases (details below).

> @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                 unsigned long pfn;
>                 struct folio *folio;
>                 pte_t ptent = ptep_get(pte + i);
> +               bool is_pte_young;
>
>                 total++;
>                 walk->mm_stats[MM_LEAF_TOTAL]++;
> @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                 if (pfn == -1)
>                         continue;
>
> -               if (!pte_young(ptent)) {
> -                       walk->mm_stats[MM_LEAF_OLD]++;

Most overhead from page table scanning normally comes from
get_pfn_folio() because it almost always causes a cache miss. That
lookup is like a pointer dereference, whereas scanning PTEs is like
streaming an array (bad vs good cache performance). pte_young() is
here to avoid an unnecessary cache miss from get_pfn_folio(). Also see
the first comment in get_pfn_folio().

It should be easy to verify the regression: a FlameGraph from the
memcached benchmark in the original commit message should do it.

Would a tracepoint here work for you?

> +               is_pte_young = !!pte_young(ptent);
> +               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young);
> +               if (!folio) {
> +                       if (!is_pte_young)
> +                               walk->mm_stats[MM_LEAF_OLD]++;
>                         continue;
>                 }
>
> -               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> -               if (!folio)
> +               if (!folio_test_clear_young(folio) && !is_pte_young) {
> +                       walk->mm_stats[MM_LEAF_OLD]++;
>                         continue;
> +               }
>
> -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> +               if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i))
>                         VM_WARN_ON_ONCE(true);
>
>                 young++;