References: <20231215105324.41241-1-henry.hj@antgroup.com>
In-Reply-To: <20231215105324.41241-1-henry.hj@antgroup.com>
From: Yu Zhao <yuzhao@google.com>
Date: Sat, 16 Dec 2023 14:06:59 -0700
Subject: Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap
To: Henry Huang
Cc: akpm@linux-foundation.org, 谈鉴锋, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, 朱辉(茶水)

On Fri, Dec 15, 2023 at 3:53 AM Henry Huang wrote:
>
> On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao wrote:
> > >
> > > Thanks for replying to this RFC.
> > >
> > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google
> > > > proposed the idea [1], but we don't really use it anymore because of
> > > > its poor scalability.
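For context, the interface under discussion, /sys/kernel/mm/page_idle/bitmap, exposes one idle bit per PFN, packed into little-endian 64-bit words (see Documentation/admin-guide/mm/idle_page_tracking.rst). A minimal userspace decoding sketch follows; it is illustrative only (the helper name and chunking are not from this thread), and only the sysfs path is the real ABI:

```python
import struct

# Real sysfs path (root-only, must be read/written in 8-byte units);
# the decoding below is pure and host-testable.
PAGE_IDLE_BITMAP = "/sys/kernel/mm/page_idle/bitmap"

def idle_pfns(buf, start_pfn=0):
    """Decode a chunk read from page_idle/bitmap into a list of idle PFNs.

    The file is an array of little-endian u64 words; bit i of word w
    covers PFN start_pfn + w * 64 + i, and a set bit means "idle".
    """
    pfns = []
    words = struct.unpack("<%dQ" % (len(buf) // 8), buf)
    for w, word in enumerate(words):
        while word:
            low = word & -word                      # isolate lowest set bit
            pfns.append(start_pfn + w * 64 + low.bit_length() - 1)
            word ^= low                             # clear it and continue
    return pfns
```

A scanner like the one Henry describes would write all-ones words to mark pages idle, sleep for the scan interval, then re-read and decode: bits still set identify pages that were not accessed in between.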
> > >
> > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check
> > > whether pages were accessed during a period of time.
> >
> > Is it a production environment? If so, what's your
> > 1. scan interval
> > 2. memory size
> >
> > I'm trying to understand why scalability isn't a problem for you. On
> > an average server, there are hundreds of millions of PFNs, so it'd be
> > very expensive to use that ABI even for a time interval of minutes.
>
> Thanks for replying.
>
> Our scan interval is 10 minutes and total memory size is 512GB.
> We prefer to reclaim pages whose idle age is > 1 hour at least.

Yes, that makes sense. We have similar use cases, details below.

> > > We manage all pages'
> > > idle time in userspace. Then we use a prediction algorithm to select
> > > pages to reclaim. These pages would more likely stay idle for a long
> > > time.
> >
> > "There is a system in place now that is based on a user-space process
> > that reads a bitmap stored in sysfs, but it has a high CPU and memory
> > overhead, so a new approach is being tried."
> > https://lwn.net/Articles/787611/
> >
> > Could you elaborate how you solved this problem?
>
> In our environment, we found that we take on average 0.4 core and 300MB
> memory to scan, do basic analysis, and reclaim idle pages.
>
> To reduce cpu & memory usage, we do:
> 1. We implement a ratelimiter to control the rate of scan and reclaim.
> 2. All pages' info & idle age are stored in a local DB file. Our
> prediction algorithm doesn't need all pages' info in memory at the
> same time.
>
> In our environment, about 1/3 of memory is attempted to be allocated as
> THP, which may save some cpu usage of the scan.
>
> > > We only need the kernel to tell us whether a page is accessed; a
> > > boolean value in the kernel is enough for our case.
> >
> > How do you define "accessed"? I.e., through page tables or file
> > descriptors or both?
>
> both
>
> > > > 2. PG_idle/young, being a boolean value, has poor granularity.
> > > > If anyone must use page_idle/bitmap for some specific reason, I'd
> > > > recommend exporting generation numbers instead.
> > >
> > > Yes, at first we tried using multi-gen LRU proactive scans and
> > > exporting the generation & refs numbers to do the same thing.
> > >
> > > But there are several problems:
> > >
> > > 1. multi-gen LRU only cares about self-memcg pages. In our
> > > environment, it's common to see processes in different memcgs share
> > > pages.
> >
> > This is related to my question above: are those pages mapped into
> > different memcgs or not?
>
> There is a case:
> There are two cgroups A, B (B is a child cgroup of A).
> A process in A creates a file and uses mmap to read/write this file.
> A process in B mmaps this file and usually reads this file.

Yes, actually we have a private patch to solve a similar problem.
Basically it finds VMAs from other processes in different memcgs that
share a mapping and jumps to those VMAs to scan them. We can upstream
it for you if you find it useful too.

> > > We still have no idea how to solve this problem.
> > >
> > > 2. We set swappiness to 0, and use proactive scans to select cold
> > > pages & proactive reclaim to swap anon pages. But we can't control
> > > the passive scan (can_swap = false), which causes a cold/hot
> > > inversion of anon pages in inc_min_seq.
> >
> > There is an option to prevent the inversion; IIUC, the force_scan
> > option is what you are looking for.
>
> It seems that doesn't work now.
>
> static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
> {
>         ......
>         for (type = ANON_AND_FILE - 1; type >= 0; type--) {
>                 if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
>                         continue;
>
>                 VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
>
>                 if (inc_min_seq(lruvec, type, can_swap))
>                         continue;
>
>                 spin_unlock_irq(&lruvec->lru_lock);
>                 cond_resched();
>                 goto restart;
>         }
>         ......
> }
>
> force_scan is not a parameter of inc_min_seq.
> In our environment, swappiness is 0, so can_swap would be false.
>
> static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> {
>         int zone;
>         int remaining = MAX_LRU_BATCH;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
>
>         if (type == LRU_GEN_ANON && !can_swap)
>                 goto done;
>         ......
> }
>
> If can_swap is false, it would skip the anon lru list.
>
> What's more, in the passive scan, force_scan is also false.

Ok, I see what you mean. (I thought "passive" meant proactive scans
triggered by the debugfs interface, but it actually means "reactive"
scans triggered by memory pressure.)

We actually have a private patch too to solve this. But there is a
corner case here: that private change, which is essentially the same
as what you suggested, can stall direct reclaim when there is tons of
cold anon memory. E.g., if there is 300GB of anon memory in the oldest
generation which can't be swapped, calling inc_min_seq() with can_swap
being true would stall the direct reclaim. Does that make sense?

Let me check the state of those private patches and get back to you in
a couple of days. Thanks!
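The inversion being discussed can be made concrete with a toy model of the generation ring. This is illustrative Python, not kernel code: it models only the seq % MAX_NR_GENS tagging (MAX_NR_GENS = 4 as in the kernel) and tracks folio tags in a plain dict, so every name below is an assumption of the sketch:

```python
MAX_NR_GENS = 4  # as in the kernel

class LruGenModel:
    """Toy model of MGLRU's generation ring: folios are tagged with
    seq % MAX_NR_GENS, so a stale tag can alias a younger generation."""

    def __init__(self):
        self.min_seq = 0
        self.max_seq = MAX_NR_GENS - 1
        self.folio_gen = {}                       # folio name -> gen tag

    @staticmethod
    def gen(seq):
        return seq % MAX_NR_GENS

    def add(self, folio, seq):
        self.folio_gen[folio] = self.gen(seq)

    def inc_min_seq(self, can_swap):
        if not can_swap:
            # models the `goto done` for LRU_GEN_ANON: folios keep their
            # stale tag while min_seq moves past them
            self.min_seq += 1
            return
        old_gen, new_gen = self.gen(self.min_seq), self.gen(self.min_seq + 1)
        for folio, g in self.folio_gen.items():
            if g == old_gen:
                self.folio_gen[folio] = new_gen   # drain oldest into next gen
        self.min_seq += 1

    def inc_max_seq(self, can_swap):
        if self.max_seq - self.min_seq + 1 == MAX_NR_GENS:
            self.inc_min_seq(can_swap)            # ring full: retire oldest
        self.max_seq += 1

# Demonstrate the inversion: a folio in the oldest generation, with the
# anon list skipped (swappiness 0) when the ring is forced to advance.
lru = LruGenModel()
lru.add("cold_anon", seq=0)
lru.inc_max_seq(can_swap=False)
print(lru.folio_gen["cold_anon"] == LruGenModel.gen(lru.max_seq))  # True
```

With can_swap true, the oldest anon folios are drained into the next generation instead, preserving age order; draining a very large cold anon generation inline is exactly the direct-reclaim stall described above.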