From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 17 Apr 2023 16:04:57 +0800
From: "Yin, Fengwei" <fengwei.yin@intel.com>
Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
To: Ryan Roberts, Andrew Morton, "Matthew Wilcox (Oracle)", Yu Zhao
References: <20230414130303.2345383-1-ryan.roberts@arm.com>
In-Reply-To: <20230414130303.2345383-1-ryan.roberts@arm.com>
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0

On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> Hi All,
>
> This is a second RFC and my first proper attempt at implementing variable
> order, large folios for anonymous memory. The first RFC [1] was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
>
> - Since SW (the kernel) is dealing with larger chunks of memory than base
>   pages, there are efficiency savings to be had; fewer page faults, batched
>   PTE and RMAP manipulation, fewer items on lists, etc. In short, we reduce
>   kernel overhead. This should benefit all architectures.
> - Since we are now mapping physically contiguous chunks of memory, we can take
>   advantage of HW TLB compression techniques. A reduction in TLB pressure
>   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>   TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>
> This patch set deals with the SW side of things only but sets us up nicely for
> taking advantage of the HW improvements in the near future.
>
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> it's measured. More on perf and memory below.
>
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one
> minor conflict resolution). I have a tree at [4].
>
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>
> Approach
> ========
>
> There are 4 fault paths that have been modified:
> - write fault on unallocated address: do_anonymous_page()
> - write fault on zero page: wp_page_copy()
> - write fault on non-exclusive CoW page: wp_page_copy()
> - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> the state of neighboring PTEs. In the 3rd case, we aim to allocate the same
> order folio as the source, subject to constraints that may arise if the source
> has been mremapped or partially munmapped. And in the 4th case, we reuse as
> much of the folio as we can, subject to the same mremap/munmap constraints.
>
> If allocation of our preferred folio order fails, we gracefully fall back to
> lower orders all the way to 0.
>
> Note that none of this affects the behavior of traditional PMD-sized THP. If
> we take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.
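As an illustration of the per-order fallback described above, the allocation
could look roughly like the sketch below. anon_folio_order_ok() is a
hypothetical check and the real series structures this differently across the
fault paths; this is only the shape of the idea, not the posted code.

        static struct folio *try_alloc_anon_folio(struct vm_area_struct *vma,
                                                  unsigned long addr, int max_order)
        {
                int order;

                for (order = max_order; order > 0; order--) {
                        unsigned long haddr = addr & ~((PAGE_SIZE << order) - 1);
                        struct folio *folio;

                        /*
                         * Hypothetical helper: checks VMA/PMD bounds and the
                         * state of neighbouring PTEs for this order.
                         */
                        if (!anon_folio_order_ok(vma, haddr, order))
                                continue;

                        folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order,
                                                vma, haddr, true);
                        if (folio)
                                return folio;
                }

                /* Graceful fallback all the way to a single order-0 page. */
                return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
        }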
>
> Open Questions
> ==============
>
> How to Move Forwards
> --------------------
>
> While the series is a small-ish code change, it represents a big shift in the
> way things are done. So I'd appreciate any help in scaling up performance
> testing, review and general advice on how best to guide a change like this
> into the kernel.
>
> Folio Allocation Order Policy
> -----------------------------
>
> The current code is hardcoded to use a maximum order of 4. This was chosen for
> a couple of reasons:
> - From the SW performance perspective, I see a knee around here where
>   increasing it doesn't lead to much more performance gain.
> - Intuitively I assume that higher orders become increasingly difficult to
>   allocate.
> - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>   "the contiguous bit" works on order-4 for 4KB base pages (although it's
>   order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>   any higher.
>
> I suggest that ultimately setting the max order should be left to the
> architecture. arm64 would take advantage of this and set it to the order
> required for the contiguous bit for the configured base page size.
>
> However, I also have a (mild) concern about increased memory consumption. If
> an app has a pathological fault pattern (e.g. sparsely touches memory every
> 64KB) we would end up allocating 16x as much memory as we used to. One
> potential approach I see here is to track fault addresses per-VMA, and
> increase a per-VMA max allocation order for consecutive faults that extend a
> contiguous range, and decrement when discontiguous. Alternatively/additionally,
> we could use the VMA size as an indicator. I'd be interested in your
> thoughts/opinions.
>
> Deferred Split Queue Lock Contention
> ------------------------------------
>
> The results below show that we are spending a much greater proportion of time
> in the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
>
> I think this is (at least partially) related to contention on the deferred
> split queue lock. This is a per-memcg spinlock, which means a single spinlock
> is shared among all 160 CPUs. I've solved part of the problem with the last
> patch in the series (which cuts down the need to take the lock), but at folio
> free time (free_transhuge_page()), the lock is still taken and I think this
> could be a problem. Now that most anonymous pages are large folios, this lock
> is taken a lot more.
>
> I think we could probably avoid taking the lock unless !list_empty(), but I
> haven't convinced myself it's definitely safe, so haven't applied it yet.

Yes. It's safe.

We also identified other lock contention with large folios for anonymous
mappings, like the lru lock and the zone lock. My understanding is that
anonymous pages have a much higher alloc/free frequency than the page cache,
so this contention was not exposed by large folios for the page cache.

I posted the related patch to:
https://lore.kernel.org/linux-mm/20230417075643.3287513-1-fengwei.yin@intel.com/T/#t
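To make the idea concrete, the check being discussed would look roughly like
the sketch below at folio free time. This is based loosely on the 6.3-era
free_transhuge_page(), not on the posted patch, and field/function names may
differ slightly from any given tree.

        void free_transhuge_page(struct page *page)
        {
                struct folio *folio = (struct folio *)page;
                struct deferred_split *ds_queue = get_deferred_split_queue(folio);
                unsigned long flags;

                /*
                 * Most large anon folios are never added to the deferred
                 * split queue, so only take the per-memcg lock when the folio
                 * appears to be queued; recheck under the lock before
                 * deleting, since a racing split may have removed it.
                 */
                if (!list_empty(&folio->_deferred_list)) {
                        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
                        if (!list_empty(&folio->_deferred_list)) {
                                ds_queue->split_queue_len--;
                                list_del(&folio->_deferred_list);
                        }
                        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
                }

                free_compound_page(page);
        }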
Regards
Yin, Fengwei

>
> Roadmap
> =======
>
> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
> bit" on arm64 to validate predictions about HW speedups.
>
> I also think there are some opportunities with madvise to split folios to
> non-0 orders, which might improve performance in some cases. madvise is also
> mistaking exclusive large folios for non-exclusive ones at the moment (due to
> the "small pages" mapcount scheme), so that needs to be fixed so that
> MADV_FREE correctly frees the folio.
>
> Results
> =======
>
> Performance
> -----------
>
> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
> before each run.
>
> make defconfig && time make -jN Image
>
> First with -j8:
>
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |          373.0 |          342.8 |          -8.1% |
> | user-time |         2333.9 |         2275.3 |          -2.5% |
> | sys-time  |          510.7 |          340.9 |         -33.3% |
>
> The above shows an 8.1% improvement in real-time execution and a 33.3% saving
> in kernel execution. The next 2 tables show a breakdown of the cycles spent in
> the kernel for the 8-job config:
>
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | data abort           |     683B |      316B |         -53.8% |
> | instruction abort    |      93B |       76B |         -18.4% |
> | syscall              |     887B |      767B |         -13.6% |
>
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | arm64_sys_openat     |     194B |      188B |          -3.3% |
> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
> | arm64_sys_read       |     124B |      108B |         -12.7% |
> | arm64_sys_execve     |      75B |       67B |         -11.0% |
> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
> | arm64_sys_write      |      43B |       42B |          -2.9% |
> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
> | arm64_sys_clone      |      26B |       24B |         -10.0% |
>
> And now with -j160:
>
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |           53.7 |           48.2 |         -10.2% |
> | user-time |         2705.8 |         2842.1 |           5.0% |
> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>
> The above shows a 10.2% improvement in real-time execution, but ~3x more time
> is spent in the kernel than for the -j8 config. I think this is related to the
> lock contention issue I highlighted above, but haven't bottomed it out yet.
> It's also not yet clear to me why user-time increases by 5%.
>
> I've also run all the will-it-scale microbenchmarks for a single task, using
> the process mode. Results for multiple runs on the same kernel are noisy - I
> see ~5% fluctuation. So I'm just calling out tests whose results show more
> than a 5% improvement or worse than a -5% regression. Results are the average
> of 3 runs. Only 2 tests regressed:
>
> | benchmark            | baseline | anonfolio | percent change |
> |                      | ops/s    | ops/s     | BIGGER=better  |
> |----------------------|---------:|----------:|---------------:|
> | context_switch1.csv  |   328744 |    351150 |           6.8% |
> | malloc1.csv          |    96214 |     50890 |         -47.1% |
> | mmap1.csv            |   410253 |    375746 |          -8.4% |
> | page_fault1.csv      |   624061 |   3185678 |         410.5% |
> | page_fault2.csv      |   416483 |    557448 |          33.8% |
> | page_fault3.csv      |   724566 |   1152726 |          59.1% |
> | read1.csv            |  1806908 |   1905752 |           5.5% |
> | read2.csv            |   587722 |   1942062 |         230.4% |
> | tlb_flush1.csv       |   143910 |    152097 |           5.7% |
> | tlb_flush2.csv       |   266763 |    322320 |          20.8% |
>
> I believe malloc1 is an unrealistic test, since it does malloc/free for a 128M
> object in a loop and never touches the allocated memory. I think the malloc
> implementation is maintaining a header just before the allocated object, which
> causes a single page fault. Previously that page fault allocated 1 page. Now
> it is allocating 16 pages. This cost would be repaid if the test code wrote to
> the allocated object. Alternatively, the folio allocation order policy
> described above would also solve this.
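For context, the malloc1 inner loop has roughly the following shape
(paraphrased, not the exact will-it-scale source; see that project for the
real test case):

        #include <assert.h>
        #include <stdlib.h>

        /* Roughly the shape of will-it-scale's malloc1 test case. */
        static void malloc1_like_loop(unsigned long long *iterations)
        {
                while (1) {
                        /*
                         * A 128M object that is never written to; only the
                         * allocator's bookkeeping just below the object ends
                         * up faulting a page in.
                         */
                        void *p = malloc(128 * 1024 * 1024);

                        assert(p != NULL);
                        free(p);
                        (*iterations)++;
                }
        }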
>
> It is not clear to me why mmap1 has slowed down. This remains a todo.
>
> Memory
> ------
>
> I measured memory consumption while doing a kernel compile with 8 jobs on a
> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
> workload, then calculated "memory used" high and low watermarks using both
> MemFree and MemAvailable. If there is a better way of measuring system memory
> consumption, please let me know!
>
> mem-used = 4GB - /proc/meminfo:MemFree
>
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      825 |       842 |           2.1% |
> | mem-used-high        |     2697 |      2672 |          -0.9% |
>
> mem-used = 4GB - /proc/meminfo:MemAvailable
>
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      518 |       530 |           2.3% |
> | mem-used-high        |     1522 |      1537 |           1.0% |
>
> For the high watermark, the methods disagree; we are either saving 1% or using
> 1% more. For the low watermark, both methods agree that we are using about 2%
> more. I plan to investigate whether the proposed folio allocation order policy
> can reduce this to zero.
>
> Thanks for making it this far!
> Ryan
>
>
> Ryan Roberts (17):
>   mm: Expose clear_huge_page() unconditionally
>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>   mm: Introduce try_vma_alloc_movable_folio()
>   mm: Implement folio_add_new_anon_rmap_range()
>   mm: Routines to determine max anon folio allocation order
>   mm: Allocate large folios for anonymous memory
>   mm: Allow deferred splitting of arbitrary large anon folios
>   mm: Implement folio_move_anon_rmap_range()
>   mm: Update wp_page_reuse() to operate on range of pages
>   mm: Reuse large folios for anonymous memory
>   mm: Split __wp_page_copy_user() into 2 variants
>   mm: ptep_clear_flush_range_notify() macro for batch operation
>   mm: Implement folio_remove_rmap_range()
>   mm: Copy large folios for anonymous memory
>   mm: Convert zero page to large folios on write
>   mm: mmap: Align unhinted maps to highest anon folio order
>   mm: Batch-zap large anonymous folio PTE mappings
>
>  arch/alpha/include/asm/page.h   |   5 +-
>  arch/arm64/include/asm/page.h   |   3 +-
>  arch/arm64/mm/fault.c           |   7 +-
>  arch/ia64/include/asm/page.h    |   5 +-
>  arch/m68k/include/asm/page_no.h |   7 +-
>  arch/s390/include/asm/page.h    |   5 +-
>  arch/x86/include/asm/page.h     |   5 +-
>  include/linux/highmem.h         |  23 +-
>  include/linux/mm.h              |   8 +-
>  include/linux/mmu_notifier.h    |  31 ++
>  include/linux/rmap.h            |   6 +
>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>  mm/mmap.c                       |   4 +-
>  mm/rmap.c                       | 147 +++++-
>  14 files changed, 1000 insertions(+), 133 deletions(-)
>
> --
> 2.25.1
>