From: "Alex Zhu (Kernel)"
To: "Huang, Ying"
Cc: linux-mm@kvack.org, Kernel Team, willy@infradead.org, hannes@cmpxchg.org, riel@surriel.com
Subject: Re: [PATCH v3 3/3] mm: THP low utilization shrinker
Date: Wed, 19 Oct 2022 19:08:48 +0000
Message-ID: <5D3EC059-F7E6-4943-AD16-CBE73FCA0357@fb.com>
In-Reply-To: <87zgdsnvka.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <87zgdsnvka.fsf@yhuang6-desk2.ccr.corp.intel.com>
> On Oct 19, 2022, at 12:04 AM, Huang, Ying wrote:
>
> writes:
>
>> From: Alexander Zhu
>>
>> This patch introduces a shrinker that will remove THPs in the lowest
>> utilization bucket. As previously mentioned, we have observed that
>> almost all of the memory waste when THPs are always enabled
>> is contained in the lowest utilization bucket. The shrinker will
>> add these THPs to a list_lru and split anonymous THPs based off
>> information from kswapd. It requires the changes from
>> thp_utilization to identify the least utilized THPs, and the
>> changes to split_huge_page to identify and free zero pages
>> within THPs.
>>
>> Signed-off-by: Alexander Zhu
>> ---
>> v2 to v3
>> -put_page() after trylock_page in low_util_free_page. put() to be called after the get() call.
>> -removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.
>> -moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
>> -moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru.
>> -removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed.
>>
>> v1 to v2
>> -Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
>> -Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page(), and the THP shrinker.
>> -Changed low_util_free_page() to unlock lru_lock, call split_huge_page, then take lru_lock again. This way split_huge_page is not called with the lru_lock held; that leads to deadlock because split_huge_page calls on_each_cpu_mask.
>> -Changed list_lru_shrink_walk to list_lru_shrink_walk_irq.
>>
>> RFC to v1
>> -Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing; there are an almost trivial number of THPs in the middle range of buckets, and almost all of the memory waste is in the lowest bucket.
>> -Added check for THP utilization prior to split_huge_page for the THP shrinker.
>> This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru.
>> -Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC.
>>
>>  include/linux/huge_mm.h  |   7 +++
>>  include/linux/list_lru.h |  24 +++++++++
>>  include/linux/mm_types.h |   5 ++
>>  mm/huge_memory.c         | 114 ++++++++++++++++++++++++++++++++++++++-
>>  mm/list_lru.c            |  49 +++++++++++++++++
>>  mm/page_alloc.c          |   6 +++
>>  6 files changed, 203 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 13ac7b2f29ae..75e4080256be 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -192,6 +192,8 @@ static inline int split_huge_page(struct page *page)
>>  }
>>  void deferred_split_huge_page(struct page *page);
>>
>> +void add_underutilized_thp(struct page *page);
>> +
>>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  		unsigned long address, bool freeze, struct folio *folio);
>>
>> @@ -305,6 +307,11 @@ static inline struct list_head *page_deferred_list(struct page *page)
>>  	return &page[2].deferred_list;
>>  }
>>
>> +static inline struct list_head *page_underutilized_thp_list(struct page *page)
>> +{
>> +	return &page[3].underutilized_thp_list;
>> +}
>> +
>>  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>  #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index b35968ee9fb5..c2cf146ea880 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
>>   */
>>  bool list_lru_add(struct list_lru *lru, struct list_head *item);
>>
>> +/**
>> + * list_lru_add_page: add an element to the lru list's tail
>> + * @list_lru: the lru pointer
>> + * @page: the page containing the item
>> + * @item: the item to be deleted.
>> + *
>> + * This function works the same as list_lru_add in terms of list
>> + * manipulation. Used for non slab objects contained in the page.
>> + *
>> + * Return value: true if the list was updated, false otherwise
>> + */
>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
>>  /**
>>   * list_lru_del: delete an element to the lru list
>>   * @list_lru: the lru pointer
>> @@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
>>   */
>>  bool list_lru_del(struct list_lru *lru, struct list_head *item);
>>
>> +/**
>> + * list_lru_del_page: delete an element to the lru list
>> + * @list_lru: the lru pointer
>> + * @page: the page containing the item
>> + * @item: the item to be deleted.
>> + *
>> + * This function works the same as list_lru_del in terms of list
>> + * manipulation. Used for non slab objects contained in the page.
>> + *
>> + * Return value: true if the list was updated, false otherwise
>> + */
>> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
>>  /**
>>   * list_lru_count_one: return the number of objects currently held by @lru
>>   * @lru: the lru pointer.
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 500e536796ca..da1d1cf42158 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -152,6 +152,11 @@ struct page {
>>  				/* For both global and memcg */
>>  				struct list_head deferred_list;
>>  			};
>> +			struct {	/* Third tail page of compound page */
>> +				unsigned long _compound_pad_3; /* compound_head */
>> +				unsigned long _compound_pad_4;
>> +				struct list_head underutilized_thp_list;
>> +			};
>>  			struct {	/* Page table pages */
>>  				unsigned long _pt_pad_1;	/* compound_head */
>>  				pgtable_t pmd_huge_pte; /* protected by page->ptl */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index a08885228cb2..362df977cc73 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -81,6 +81,8 @@ static atomic_t huge_zero_refcount;
>>  struct page *huge_zero_page __read_mostly;
>>  unsigned long huge_zero_pfn __read_mostly = ~0UL;
>>
>> +static struct list_lru huge_low_util_page_lru;
>> +
>>  static void thp_utilization_workfn(struct work_struct *work);
>>  static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
>>
>> @@ -263,6 +265,57 @@ static struct shrinker huge_zero_page_shrinker = {
>>  	.seeks = DEFAULT_SEEKS,
>>  };
>>
>> +static enum lru_status low_util_free_page(struct list_head *item,
>> +					  struct list_lru_one *lru,
>> +					  spinlock_t *lock,
>> +					  void *cb_arg)
>> +{
>> +	int bucket, num_utilized_pages;
>> +	struct page *head = compound_head(list_entry(item,
>> +						      struct page,
>> +						      underutilized_thp_list));
>> +
>> +	if (get_page_unless_zero(head)) {
>> +		if (!trylock_page(head)) {
>> +			put_page(head);
>> +			return LRU_SKIP;
>> +		}
>> +		list_lru_isolate(lru, item);
>> +		spin_unlock_irq(lock);
>> +		num_utilized_pages = thp_number_utilized_pages(head);
>> +		bucket = thp_utilization_bucket(num_utilized_pages);
>> +		if (bucket < THP_UTIL_BUCKET_NR - 1) {
>
> If my understanding is correct, the THP will be considered under
> utilized if its utilization percentage is < 90%. Right? If so, I thought
> something like utilization percentage < 50% would be more appropriate.

Yes, < 90%. That is just the best number we have found so far from our experiments, as it seems almost all THPs are <10% utilized or >90% utilized. There are an almost trivial number of pages in between.
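To make the bucketing concrete, here is a rough sketch of the mapping. The real thp_utilization_bucket() and THP_UTIL_BUCKET_NR come from the thp_utilization patch earlier in this series; the helper below is only an illustration of evenly sized buckets over the HPAGE_PMD_NR subpages of a THP, not the actual implementation:

static int thp_utilization_bucket_sketch(int num_utilized_pages)
{
	/* Reject impossible counts. */
	if (num_utilized_pages < 0 || num_utilized_pages > HPAGE_PMD_NR)
		return -1;
	/* A fully utilized THP lands in the top bucket... */
	if (num_utilized_pages == HPAGE_PMD_NR)
		return THP_UTIL_BUCKET_NR - 1;
	/* ...otherwise scale the utilized-subpage count into a bucket index. */
	return num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR;
}

With ten evenly sized buckets (which is what the <10%/>90% numbers above suggest), anything below the top bucket corresponds to roughly <90% utilization, which is the threshold the shrinker expresses as bucket < THP_UTIL_BUCKET_NR - 1.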
>
>> +			split_huge_page(head);
>> +			spin_lock_irq(lock);
>> +		}
>> +		unlock_page(head);
>> +		put_page(head);
>> +	}
>> +
>> +	return LRU_REMOVED_RETRY;
>> +}
>> +
>> +static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
>> +						      struct shrink_control *sc)
>> +{
>> +	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
>> +}
>> +
>> +static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
>> +						     struct shrink_control *sc)
>> +{
>> +	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
>> +			sc, low_util_free_page, NULL);
>> +}
>> +
>> +static struct shrinker huge_low_util_page_shrinker = {
>> +	.count_objects = shrink_huge_low_util_page_count,
>> +	.scan_objects = shrink_huge_low_util_page_scan,
>> +	.seeks = DEFAULT_SEEKS,
>> +	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
>> +		SHRINKER_NONSLAB,
>> +};
>> +
>>  #ifdef CONFIG_SYSFS
>>  static ssize_t enabled_show(struct kobject *kobj,
>>  			    struct kobj_attribute *attr, char *buf)
>> @@ -515,6 +568,9 @@ static int __init hugepage_init(void)
>>  		goto err_slab;
>>
>>  	schedule_delayed_work(&thp_utilization_work, HZ);
>> +	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
>> +	if (err)
>> +		goto err_low_util_shrinker;
>>  	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
>>  	if (err)
>>  		goto err_hzp_shrinker;
>> @@ -522,6 +578,9 @@ static int __init hugepage_init(void)
>>  	if (err)
>>  		goto err_split_shrinker;
>>
>> +	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
>> +	if (err)
>> +		goto err_low_util_list_lru;
>>  	/*
>>  	 * By default disable transparent hugepages on smaller systems,
>>  	 * where the extra memory used could hurt more than TLB overhead
>> @@ -538,10 +597,14 @@ static int __init hugepage_init(void)
>>
>>  	return 0;
>>  err_khugepaged:
>> +	list_lru_destroy(&huge_low_util_page_lru);
>> +err_low_util_list_lru:
>>  	unregister_shrinker(&deferred_split_shrinker);
>>  err_split_shrinker:
>>  	unregister_shrinker(&huge_zero_page_shrinker);
>>  err_hzp_shrinker:
>> +	unregister_shrinker(&huge_low_util_page_shrinker);
>> +err_low_util_shrinker:
>>  	khugepaged_destroy();
>>  err_slab:
>>  	hugepage_exit_sysfs(hugepage_kobj);
>> @@ -616,6 +679,7 @@ void prep_transhuge_page(struct page *page)
>>  	 */
>>
>>  	INIT_LIST_HEAD(page_deferred_list(page));
>> +	INIT_LIST_HEAD(page_underutilized_thp_list(page));
>>  	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>>  }
>>
>> @@ -2529,8 +2593,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>  			 LRU_GEN_MASK | LRU_REFS_MASK));
>>
>>  	/* ->mapping in first tail page is compound_mapcount */
>> -	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
>> -			page_tail);
>> +	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
>>  	page_tail->mapping = head->mapping;
>>  	page_tail->index = head->index + tail;
>>  	page_tail->private = 0;
>> @@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  	struct folio *folio = page_folio(page);
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
>>  	struct anon_vma *anon_vma = NULL;
>>  	struct address_space *mapping = NULL;
>>  	int extra_pins, ret;
>> @@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  			list_del(page_deferred_list(&folio->page));
>>  		}
>>  		spin_unlock(&ds_queue->split_queue_lock);
>> +		if (!list_empty(underutilized_thp_list))
>> +			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
>> +					  underutilized_thp_list);
>>  		if (mapping) {
>>  			int nr = folio_nr_pages(folio);
>>
>> @@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  void free_transhuge_page(struct page *page)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
>>  	unsigned long flags;
>>
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> @@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
>>  		list_del(page_deferred_list(page));
>>  	}
>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> +	if (!list_empty(underutilized_thp_list))
>> +		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
>> +
>> +	if (PageLRU(page))
>> +		__folio_clear_lru_flags(page_folio(page));
>> +
>>  	free_compound_page(page);
>>  }
>>
>> @@ -2934,6 +3008,39 @@ void deferred_split_huge_page(struct page *page)
>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  }
>>
>> +void add_underutilized_thp(struct page *page)
>> +{
>> +	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>
> Because we haven't taken a reference on the page, the page may be split or
> freed under us, so VM_BUG_ON_PAGE() here may be triggered wrongly?
>
>> +
>> +	if (PageSwapCache(page))
>> +		return;
>> +
>> +	/*
>> +	 * Need to take a reference on the page to prevent the page from getting free'd from
>> +	 * under us while we are adding the THP to the shrinker.
>> +	 */
>> +	if (!get_page_unless_zero(page))
>> +		return;
>> +
>> +	if (!is_anon_transparent_hugepage(page))
>> +		goto out_put;
>> +
>> +	if (is_huge_zero_page(page))
>> +		goto out_put;
>
> is_huge_zero_page() check can be done in thp_util_scan() too?

Sounds good.
>
>> +
>> +	lock_page(page);
>> +
>> +	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
>> +		goto out_unlock;
>> +
>> +	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
>> +
>> +out_unlock:
>> +	unlock_page(page);
>> +out_put:
>> +	put_page(page);
>> +}
>> +
>>  static unsigned long deferred_split_count(struct shrinker *shrink,
>>  		struct shrink_control *sc)
>>  {
>> @@ -3478,6 +3585,9 @@ static void thp_util_scan(unsigned long pfn_end)
>>  		if (bucket < 0)
>>  			continue;
>>
>> +		if (bucket < THP_UTIL_BUCKET_NR - 1)
>> +			add_underutilized_thp(page);
>> +
>>  		thp_scan.buckets[bucket].nr_thps++;
>>  		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
>>  	}
>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>> index a05e5bef3b40..8cc56a84b554 100644
>> --- a/mm/list_lru.c
>> +++ b/mm/list_lru.c
>> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>>  }
>>  EXPORT_SYMBOL_GPL(list_lru_add);
>>
>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
>> +{
>> +	int nid = page_to_nid(page);
>> +	struct list_lru_node *nlru = &lru->node[nid];
>> +	struct list_lru_one *l;
>> +	struct mem_cgroup *memcg;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&nlru->lock, flags);
>> +	if (list_empty(item)) {
>> +		memcg = page_memcg(page);
>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>> +		list_add_tail(item, &l->list);
>> +		/* Set shrinker bit if the first element was added */
>> +		if (!l->nr_items++)
>> +			set_shrinker_bit(memcg, nid,
>> +					 lru_shrinker_id(lru));
>> +		nlru->nr_items++;
>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>> +		return true;
>> +	}
>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>> +	return false;
>> +}
>> +EXPORT_SYMBOL_GPL(list_lru_add_page);
>> +
>
> It appears that only 2 lines are different from list_lru_add(). Is it
> possible for us to share code? For example, add another flag for the
> page_memcg() case?

I believe there are 4 lines: the page_to_nid(page) and the spin_lock_irqsave/restore. It was implemented this way because we need to take a page as a parameter and obtain the node id from the page, since the THP is not necessarily a slab object, as the list_lru_add/delete code assumes.
Also, there is a potential deadlock when split_huge_page is called from reclaim and when split_huge_page is called by the THP shrinker, which is why we need irqsave/restore.
I thought this would be cleaner than attempting to share code with list_lru_add/delete. Only the shrinker makes use of this; all other use cases assume slab objects.
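For what it's worth, a shared helper along the lines you suggest might look roughly like the sketch below. This is purely hypothetical and not part of this series; it only shows where the extra page parameter and the conditional IRQ handling would end up:

static bool __list_lru_add_page_common(struct list_lru *lru, struct page *page,
				       struct list_head *item, bool irq_safe)
{
	int nid = page_to_nid(page);
	struct list_lru_node *nlru = &lru->node[nid];
	struct list_lru_one *l;
	struct mem_cgroup *memcg;
	unsigned long flags = 0;
	bool added = false;

	/* The shrinker path needs the lru lock to be IRQ safe; slab callers do not. */
	if (irq_safe)
		spin_lock_irqsave(&nlru->lock, flags);
	else
		spin_lock(&nlru->lock);

	if (list_empty(item)) {
		memcg = page_memcg(page);
		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
		list_add_tail(item, &l->list);
		/* Set shrinker bit if the first element was added */
		if (!l->nr_items++)
			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
		nlru->nr_items++;
		added = true;
	}

	if (irq_safe)
		spin_unlock_irqrestore(&nlru->lock, flags);
	else
		spin_unlock(&nlru->lock);

	return added;
}

Even then it would not be a drop-in replacement, since (if I remember correctly) the existing slab path derives the memcg from the kmem object rather than from page_memcg(), so the separate page-based helpers still seemed simpler to me.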
>
>>  bool list_lru_del(struct list_lru *lru, struct list_head *item)
>>  {
>>  	int nid = page_to_nid(virt_to_page(item));
>> @@ -160,6 +186,29 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
>>  }
>>  EXPORT_SYMBOL_GPL(list_lru_del);
>>
>> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
>> +{
>> +	int nid = page_to_nid(page);
>> +	struct list_lru_node *nlru = &lru->node[nid];
>> +	struct list_lru_one *l;
>> +	struct mem_cgroup *memcg;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&nlru->lock, flags);
>> +	if (!list_empty(item)) {
>> +		memcg = page_memcg(page);
>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>> +		list_del_init(item);
>> +		l->nr_items--;
>> +		nlru->nr_items--;
>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>> +		return true;
>> +	}
>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>> +	return false;
>> +}
>> +EXPORT_SYMBOL_GPL(list_lru_del_page);
>> +
>>  void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
>>  {
>>  	list_del_init(item);
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ac2c9f12a7b2..468eaaade7fe 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
>>  		 * deferred_list.next -- ignore value.
>>  		 */
>>  		break;
>> +	case 3:
>> +		/*
>> +		 * the third tail page: ->mapping is
>> +		 * underutilized_thp_list.next -- ignore value.
>> +		 */
>> +		break;
>>  	default:
>>  		if (page->mapping != TAIL_MAPPING) {
>>  			bad_page(page, "corrupted mapping in tail page");
>
> Best Regards,
> Huang, Ying