From: Barry Song <21cnbao@gmail.com>
Date: Sun, 17 Nov 2024 16:26:30 +1300
Subject: Re: [RFC PATCH v2 1/1] mm/vmscan: move the written-back folios to the tail of LRU after shrinking
To: Chen Ridong
Cc: akpm@linux-foundation.org, mhocko@suse.com, hannes@cmpxchg.org, yosryahmed@google.com, yuzhao@google.com, david@redhat.com, willy@infradead.org, ryan.roberts@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, chenridong@huawei.com, wangweiyang2@huawei.com, xieym_ict@hotmail.com
In-Reply-To: <20241116091658.1983491-2-chenridong@huaweicloud.com>
References: <20241116091658.1983491-1-chenridong@huaweicloud.com> <20241116091658.1983491-2-chenridong@huaweicloud.com>
On Sat, Nov 16, 2024 at 10:26 PM Chen Ridong wrote:
>
> From: Chen Ridong
>
> An issue was found with the following testing steps:
> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> 2. Mount memcg v1, create a memcg named test_memcg, and set
>    usage_in_bytes=2.1G and memsw.usage_in_bytes=3G.
> 3. Create a 1G swap file, and allocate 2.2G of anon memory in
>    test_memcg.
>
> It was found that:
>
> cat memory.usage_in_bytes
> 2144940032
> cat memory.memsw.usage_in_bytes
> 2255056896
>
> free -h
>              total   used    free
> Mem:          31Gi  2.1Gi    27Gi
> Swap:        1.0Gi  618Mi   405Mi
>
> As shown above, test_memcg used only about 100M of swap, yet 600M+ of
> swap was in use system-wide, which means roughly 500M may be wasted
> because other memcgs cannot use that swap space.
>
> It can be explained as follows:
> 1. When entering shrink_inactive_list, folios are isolated from the LRU
>    from tail to head. Suppose it takes just folioN from the LRU (to
>    keep it simple).
>
>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
>    isolated list: folioN
>
> 2. In the shrink_folio_list function, if folioN is a THP (2M), it may
>    be split and added to the swap cache folio by folio. After a folio
>    is added to the swap cache, IO is submitted to write it back to
>    swap, which is asynchronous. When shrink_folio_list finishes, the
>    isolated folios are moved back to the head of the inactive LRU. The
>    inactive LRU may then look like this, with 512 folios having been
>    moved to the head:
>
>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>
>    IO was committed from folioN1 to folioN512; folios committed later
>    were added to the head of 'ret_folios' in shrink_folio_list. As a
>    result, the order is folioN512->folioN511->...->folioN1.
>
> 3. When a folio's writeback IO completes, the folio may be rotated to
>    the tail of the LRU one by one. Assume folioN1, folioN2, ...,
>    folioN512 complete in order (the order their IO was committed) and
>    are rotated to the tail of the LRU in that order
>    (folioN1<->...folioN511<->folioN512). The folios at the tail of the
>    LRU will therefore be reclaimed as soon as possible.
>
>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> 4. However, shrink_folio_list and folio writeback are asynchronous.
>    If a THP is split, shrink_folio_list loops at least 512 times, which
>    means shrink_folio_list may not yet have finished while some folios'
>    writeback has already completed, and this may lead to failure to
>    rotate those folios to the tail of the LRU. The LRU may look as
>    below:
>
>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>    folioN51<->folioN52<->...folioN511<->folioN512
>
>    Although those folios (N1-N50) have finished writing back, they are
>    still at the head of the LRU. This is because their writeback ended
>    while shrink_folio_list() was still looping, causing
>    folio_end_writeback()'s folio_rotate_reclaimable() to fail to move
>    these folios, which are not on the LRU but still on 'folio_list', to
>    the tail of the LRU. When isolating folios from the LRU, the scan
>    goes from tail to head, so it is difficult to scan those folios
>    again.
>
> What is mentioned above may leave a large number of folios that have
> been added to the swap cache but cannot be reclaimed in time, which may
> reduce reclaim efficiency and prevent other memcgs from using this swap
> space even if they trigger OOM.
>
> To fix this issue, the folios whose writeback has been completed should
> be moved to the tail of the LRU instead of always being placed at the
> head of the LRU when shrink_folio_list is finished. It can be realized
> by the following steps:
> 1. In the shrink_folio_list function, the folios whose are committed to

It seems like there's a grammatical error here: "whose" what?

> are added to the head of 'folio_list', which will be returned to the
> caller.
> 2. When shrink_folio_list finishes, it is known how many folios have
>    been paged out, and they are all at the head of 'folio_list', ready
>    to be moved back to the LRU. So, in the move_folios_to_lru()
>    function, if the first 'nr_io' folios (which have been paged out)
>    have been written back completely, move them to the tail of the LRU.
>    Otherwise, move them to the head of the LRU.
>
> Signed-off-by: Chen Ridong
> ---
>  mm/vmscan.c | 37 +++++++++++++++++++++++++++++--------
>  1 file changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 76378bc257e3..04f7eab9d818 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1046,6 +1046,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	struct folio_batch free_folios;
>  	LIST_HEAD(ret_folios);
>  	LIST_HEAD(demote_folios);
> +	LIST_HEAD(pageout_folios);
>  	unsigned int nr_reclaimed = 0;
>  	unsigned int pgactivate = 0;
>  	bool do_demote_pass;
> @@ -1061,7 +1062,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		struct address_space *mapping;
>  		struct folio *folio;
>  		enum folio_references references = FOLIOREF_RECLAIM;
> -		bool dirty, writeback;
> +		bool dirty, writeback, is_pageout = false;
>  		unsigned int nr_pages;
>
>  		cond_resched();
> @@ -1384,6 +1385,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  				nr_pages = 1;
>  			}
>  			stat->nr_pageout += nr_pages;
> +			is_pageout = true;
>
>  			if (folio_test_writeback(folio))
>  				goto keep;
> @@ -1508,7 +1510,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  keep_locked:
>  		folio_unlock(folio);
>  keep:
> -		list_add(&folio->lru, &ret_folios);
> +		if (is_pageout)
> +			list_add(&folio->lru, &pageout_folios);
> +		else
> +			list_add(&folio->lru, &ret_folios);
>  		VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
>  				folio_test_unevictable(folio), folio);
>  	}
> @@ -1551,6 +1556,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		free_unref_folios(&free_folios);
>
>  	list_splice(&ret_folios, folio_list);
> +	list_splice(&pageout_folios, folio_list);

Do we really need this pageout_folios list? Is the below not sufficient?
+		if (nr_io > 0 &&
+		    !folio_test_reclaim(folio) &&
+		    !folio_test_writeback(folio))
+			lruvec_add_folio_tail(lruvec, folio);
+		else
+			lruvec_add_folio(lruvec, folio);

>  	count_vm_events(PGACTIVATE, pgactivate);
>
>  	if (plug)
> @@ -1826,11 +1832,14 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
>
>  /*
>   * move_folios_to_lru() moves folios from private @list to appropriate LRU list.
> + * @lruvec: The LRU vector the list is moved to.
> + * @list: The folio list are moved to lruvec
> + * @nr_io: The first nr folios of the list that have been committed io.
>   *
>   * Returns the number of pages moved to the given lruvec.
>   */
>  static unsigned int move_folios_to_lru(struct lruvec *lruvec,
> -				       struct list_head *list)
> +				       struct list_head *list, unsigned int nr_io)
>  {
>  	int nr_pages, nr_moved = 0;
>  	struct folio_batch free_folios;
> @@ -1880,9 +1889,21 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>  		 * inhibits memcg migration).
>  		 */
>  		VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
> -		lruvec_add_folio(lruvec, folio);
> +		/*
> +		 * If the folio have been committed io and writed back completely,
> +		 * it should be added to the tailed to the lru, so it can
> +		 * be relaimed as soon as possible.
> +		 */
> +		if (nr_io > 0 &&
> +		    !folio_test_reclaim(folio) &&
> +		    !folio_test_writeback(folio))
> +			lruvec_add_folio_tail(lruvec, folio);
> +		else
> +			lruvec_add_folio(lruvec, folio);
> +
>  		nr_pages = folio_nr_pages(folio);
>  		nr_moved += nr_pages;
> +		nr_io = nr_io > nr_pages ?
> +			(nr_io - nr_pages) : 0;
>  		if (folio_test_active(folio))
>  			workingset_age_nonresident(lruvec, nr_pages);
>  	}
> @@ -1960,7 +1981,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>  	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>
>  	spin_lock_irq(&lruvec->lru_lock);
> -	move_folios_to_lru(lruvec, &folio_list);
> +	move_folios_to_lru(lruvec, &folio_list, stat.nr_pageout);
>
>  	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>  			   stat.nr_demoted);
> @@ -2111,8 +2132,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	 */
>  	spin_lock_irq(&lruvec->lru_lock);
>
> -	nr_activate = move_folios_to_lru(lruvec, &l_active);
> -	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
> +	nr_activate = move_folios_to_lru(lruvec, &l_active, 0);
> +	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive, 0);
>
>  	__count_vm_events(PGDEACTIVATE, nr_deactivate);
>  	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
> @@ -4627,7 +4648,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>
>  	spin_lock_irq(&lruvec->lru_lock);
>
> -	move_folios_to_lru(lruvec, &list);
> +	move_folios_to_lru(lruvec, &list, 0);

I'm not entirely convinced about using the 'nr' argument here. Is the goal
to differentiate between two cases?
1. we need to take care of written-back folios;
2. we don't need to take care of written-back folios?
Would a bool work, or would it be better to provide separate helpers?

>
>  	walk = current->reclaim_state->mm_walk;
>  	if (walk && walk->batched) {
> --
> 2.34.1
>

Thanks
Barry