From: Song Liu
To: Luis Chamberlain
CC: Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, Masami Hiramatsu, "Naveen N. Rao", "David S. Miller",
    Anil S Keshavamurthy, Kees Cook, Song Liu, bpf, Christoph Hellwig,
    Davidlohr Bueso, lkml, Linux-MM, Daniel Borkmann, Kernel Team,
    "x86@kernel.org", "dave.hansen@linux.intel.com",
    "rick.p.edgecombe@intel.com", "linux-modules@vger.kernel.org"
Subject: Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
Date: Tue, 12 Jul 2022 23:12:22 +0000
Message-ID: <6CB56563-29E2-4CE0-BF7B-360979E42429@fb.com>
References: <863A2D5B-976D-4724-AEB1-B2A494AD2BDB@fb.com>
    <6214B9C9-557B-4DC0-BFDE-77EAC425E577@fb.com>

> On Jul 12, 2022, at 12:04 PM, Luis Chamberlain wrote:
>
> On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
>>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain wrote:
>>
>>> I believe you are mentioning requiring text_poke() because the way
>>> eBPF code uses the module_alloc() is different. Correct me if I'm
>>> wrong, but from what I gather is you use the text_poke_copy() as the data
>>> is already RO+X, contrary module_alloc() use cases. You do this since your
>>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
>>> module_alloc() and before you can use this memory. This is a different type
>>> of allocator. And, again please correct me if I'm wrong but now you want to
>>> share *one* 2 MiB huge-page for multiple BPF programs to help with the
>>> impact of TLB misses.
>>
>> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke.
>> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
>> and ftrace only uses a fraction of a 4kB page. Most BPF programs and
>> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
>> much value on top of current module_alloc().
>
> Thanks for the clarification.
>
>>> A vmalloc_ro_exec() by definition would imply a text_poke().
>>>
>>> Can kprobes, ftrace and modules use it too? It would be nice
>>> so to not have to deal with the loose semantics on the user to
>>> have to use set_vm_flush_reset_perms() on ro+x later, but
>>> I think this can be addressed separately on a case by case basis.
>>
>> I am pretty confident that kprobe and ftrace can share huge pages with
>> BPF programs.
>
> Then wonderful, we know where to go in terms of a new API then as it
> can be shared in the future for sure and there are gains.
>
>> I haven't looked into all the details with modules, but
>> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also
>> possible.
>
> Sure.
>
>> Once this is done, a regular system (without huge BPF program or huge
>> modules) will just use 1x 2MB page for text from module, ftrace, kprobe,
>> and bpf programs.
>
> That would be nice, if possible, however modules will require likely its
> own thing, on my system I see about 57 MiB used on coresize alone.
>
> lsmod | grep -v Module | cut -f1 -d ' ' | \
>   xargs sudo modinfo | grep filename | \
>   grep -o '/.*' | xargs stat -c "%s - %n" | \
>   awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
> 60001272
>
> And so perhaps we need such a pool size to be configurable.
>
>>> But a vmalloc_ro_exec() with a respective free can remove the
>>> requirement to do set_vm_flush_reset_perms().
>>
>> Removing the requirement to set_vm_flush_reset_perms() is the other
>> reason to go directly to vmalloc_ro_exec().
>
> Yes fantastic.
>
>> My current version looks like this:
>>
>> void *vmalloc_exec(unsigned long size);
>> void vfree_exec(void *ptr, unsigned int size);
>>
>> ro is eliminated as there is no rw version of the API.
>
> Alright.
>
> I am not sure if 2 MiB will suffice given what I mentioned above, and
> what to do to ensure this grows at a reasonable pace.
> Then, at least for usage for all architectures since not all will support
> text_poke() we will want to consider a way to make it easy to users to use
> non huge page fallbacks, but that would be up to those users, so we can
> wait for that.

We are not limited to 2MiB total. The logic is like:

1. Anything bigger than 2MiB gets its own allocation.
2. We maintain a list of 2MiB pages, and bitmaps showing which parts of
   these pages are in use.
3. For objects smaller than 2MiB, we will try to fit them in one of these
   pages.
   a) If there isn't a page with big enough contiguous free space, we
      will allocate a new 2MiB page.

(For a system with n NUMA nodes, multiply the 2MiB above by n.)

So, if we have 100 kernel modules using 1MiB each, they will share 50x
2MiB pages.

Thanks,
Song
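
To make the allocation logic in this message concrete, here is a minimal
userspace sketch of the bookkeeping only. It is not the vmalloc_exec()
patch: struct pool, struct pack, struct slot, pool_alloc(), pool_free()
and the 4kB CHUNK_SIZE granularity are all hypothetical names and values
chosen for illustration. It does no page-permission handling and no
text_poke(), and it ignores the per-NUMA-node multiplication.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define PACK_SIZE  (2UL << 20)   /* one 2MiB huge page */
#define CHUNK_SIZE 4096UL        /* assumed bitmap granularity (hypothetical) */
#define CHUNKS     (PACK_SIZE / CHUNK_SIZE)

/* One shared 2MiB page plus its in-use map (one flag per chunk). */
struct pack {
	struct pack *next;
	bool used[CHUNKS];
};

struct pool {
	struct pack *packs;  /* step 2: list of shared 2MiB pages */
	size_t npacks;       /* number of shared pages allocated so far */
	size_t nlarge;       /* number of dedicated (> 2MiB) allocations */
};

/* Where an object landed; pack == NULL means a dedicated allocation. */
struct slot {
	struct pack *pack;
	size_t first_chunk;
	size_t nchunks;
};

/* First-fit search for n contiguous free chunks; returns start index or -1. */
static long find_run(const struct pack *pk, size_t n)
{
	size_t run = 0;

	for (size_t i = 0; i < CHUNKS; i++) {
		run = pk->used[i] ? 0 : run + 1;
		if (run == n)
			return (long)(i + 1 - n);
	}
	return -1;
}

/* Returns 0 on success, -1 on a zero-size request or out of memory. */
static int pool_alloc(struct pool *p, size_t size, struct slot *out)
{
	struct pack *pk;
	long start = -1;
	size_t n;

	if (size == 0)
		return -1;

	/* Step 1: anything bigger than 2MiB gets its own allocation. */
	if (size > PACK_SIZE) {
		p->nlarge++;
		out->pack = NULL;
		out->first_chunk = 0;
		out->nchunks = 0;
		return 0;
	}

	/* Step 3: try to fit the object into an existing shared page. */
	n = (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
	for (pk = p->packs; pk; pk = pk->next) {
		start = find_run(pk, n);
		if (start >= 0)
			break;
	}

	/* Step 3a: no page has enough contiguous space, so add a new one. */
	if (!pk) {
		pk = calloc(1, sizeof(*pk));
		if (!pk)
			return -1;
		pk->next = p->packs;
		p->packs = pk;
		p->npacks++;
		start = 0;
	}

	for (size_t i = 0; i < n; i++)
		pk->used[(size_t)start + i] = true;

	out->pack = pk;
	out->first_chunk = (size_t)start;
	out->nchunks = n;
	return 0;
}

/* Mark the chunks free again; empty pages are kept in this sketch. */
static void pool_free(struct pool *p, const struct slot *s)
{
	if (!s->pack) {
		p->nlarge--;
		return;
	}
	for (size_t i = 0; i < s->nchunks; i++)
		s->pack->used[s->first_chunk + i] = false;
}
```

With a zero-initialized struct pool, 100 calls to pool_alloc() for 1MiB
each fill two objects per page and leave npacks at 50, which matches the
100-module example above. A request larger than 2MiB only bumps nlarge
and never touches the shared pages.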