Message-ID: <95dab49e-52fa-4aa7-a668-5fb95c69d0a1@suse.cz>
Date: Thu, 12 Dec 2024 09:54:19 +0100
From: Vlastimil Babka
Subject: Re: [PATCH bpf-next v2 1/6] mm, bpf: Introduce __GFP_TRYLOCK for opportunistic page allocation
To: Alexei Starovoitov
Cc: Sebastian Andrzej Siewior, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra, Steven Rostedt, Hou Tao, Johannes Weiner, shakeel.butt@linux.dev, Michal Hocko, Matthew Wilcox, Thomas Gleixner, Tejun Heo, linux-mm, Kernel Team
References: <20241210023936.46871-1-alexei.starovoitov@gmail.com> <20241210023936.46871-2-alexei.starovoitov@gmail.com> <20241210090136.DGfYLmeo@linutronix.de>

On 12/12/24 03:14, Alexei Starovoitov wrote:
> On Wed, Dec 11, 2024 at 12:39 AM Vlastimil Babka wrote:
>>
>> On 12/10/24 22:53, Alexei Starovoitov wrote:
>> > On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior
>> > wrote:
>> >>
>> >> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
>> >> > From: Alexei Starovoitov
>> >> >
>> >> > Tracing BPF programs execute from tracepoints and kprobes where running
>> >> > context is unknown, but they need to request additional memory.
>> >> > The prior workarounds were using pre-allocated memory and BPF specific
>> >> > freelists to satisfy such allocation requests. Instead, introduce
>> >> > __GFP_TRYLOCK flag that makes page allocator accessible from any context.
>> >> > It relies on percpu free list of pages that rmqueue_pcplist() should be
>> >> > able to pop the page from. If it fails (due to IRQ re-entrancy or list
>> >> > being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
>> >> > and refill percpu freelist as normal.
>> >> > BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
>> >> > so trylock is the only option.
>> >>
>> >> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF
>> >> where it is not known how much memory will be needed and what the
>> >> calling context is.
>> >
>> > Exactly.
>> >
>> >> I hope it does not spread across the kernel where
>> >> people do ATOMIC in preempt/ IRQ-off on PREEMPT_RT and then once they
>> >> learn that this does not work, add this flag to the mix to make it work
>> >> without spending some time on reworking it.
>> >
>> > We can call it __GFP_BPF to discourage any other usage,
>> > but that seems like an odd "solution" to code review problem.
>>
>> Could we perhaps not expose the flag to public headers at all, and keep it
>> only as an internal detail of try_alloc_pages_noprof()?
>
> public headers?

I mean it could be (with some work) defined only in e.g. mm/internal.h,
which the flag printing code would then need to include.

> To pass additional bit via gfp flags into alloc_pages
> gfp_types.h has to be touched.

Ah right, try_alloc_pages_noprof() would need to move to page_alloc.c
instead of being static inline in the header.
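
For readers following along, the fallback order in the quoted patch description (pop from the per-CPU list, otherwise trylock zone->lock and refill, otherwise fail) can be modelled in userspace. This is only a rough sketch, not the kernel code: zone_model, pcp_model, try_alloc_page_model() and the PCP_BATCH value are invented for illustration, and a C11 atomic_flag stands in for the spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PCP_BATCH 4              /* pages moved per refill (made-up value) */

struct zone_model {
    atomic_flag lock;            /* stands in for zone->lock */
    int free_pages;              /* pages still held by the zone */
};

struct pcp_model {
    int count;                   /* pages cached on the per-CPU list */
};

static bool zone_trylock(struct zone_model *z)
{
    /* test_and_set returns the old value: false means we took the lock */
    return !atomic_flag_test_and_set_explicit(&z->lock, memory_order_acquire);
}

static void zone_unlock(struct zone_model *z)
{
    atomic_flag_clear_explicit(&z->lock, memory_order_release);
}

/* Returns true if a page was handed out, false if the caller must cope. */
static bool try_alloc_page_model(struct zone_model *z, struct pcp_model *pcp)
{
    if (pcp->count == 0) {
        /* Fast path failed: refill, but never wait for the lock. */
        if (!zone_trylock(z))
            return false;
        assert(z->free_pages >= 0);
        int batch = z->free_pages < PCP_BATCH ? z->free_pages : PCP_BATCH;
        z->free_pages -= batch;
        pcp->count += batch;
        zone_unlock(z);
        if (pcp->count == 0)
            return false;        /* the zone itself was empty */
    }
    pcp->count--;
    return true;
}
```

In this model the only way to fail is an empty per-CPU list combined with either a contended lock or an empty zone; the caller never spins or sleeps, which is the property the patch is after.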
> If you mean moving try_alloc_pages() into mm/page_alloc.c and
> adding another argument to __alloc_pages_noprof then it's not pretty.
> It has 'gfp_t gfp' argument. It should to be used to pass the intent.

__GFP_TRYLOCK could be visible in page_alloc.c to do this, but not outside mm
code.

> We don't have to add GFP_TRYLOCK at all if we go with
> memalloc_nolock_save() approach.

I have doubts about that idea. We recently rejected PF_MEMALLOC_NORECLAIM
because it could lead to allocations nested in that scope failing and they
might not expect it. Scoped trylock would have even higher chance of failing.

I think here we need to pass the flag as part of gfp flags only within
nested allocations (for metadata or debugging) within the slab/page
allocator itself, which we already mostly do. The harder problem is not
missing any place where it should affect taking a lock, and a PF_ flag won't
help with that (as we can't expect all locking functions to look at it).

Maybe it could help with lockdep helping us find locks that we missed, but
I'm sure lockdep could be made to track the trylock scope even without a PF
flag?

> So I started looking at it,
> but immediately hit trouble with bits.
> There are 5 bits left in PF_ and 3 already used for mm needs.
> That doesn't look sustainable long term.
> How about we alias nolock concept with PF_MEMALLOC_PIN ?
>
> As far as I could trace PF_MEMALLOC_PIN clears GFP_MOVABLE and nothing else.
>
> The same bit plus lack of __GFP_KSWAPD_RECLAIM in gfp flags
> would mean nolock mode in alloc_pages,
> while PF_MEMALLOC_PIN alone would mean nolock in free_pages
> and deeper inside memcg paths and such.
>
> thoughts? too hacky?
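
The doubt about a scoped memalloc_nolock_save() raised above can be illustrated with a small userspace model. It is a sketch under stated assumptions, not kernel code: GFP_MODEL_TRYLOCK, PF_MODEL_NOLOCK, the *_model() helpers and legacy_helper() are all hypothetical names, and lock_contended simply pretends that zone->lock is held elsewhere.

```c
#include <assert.h>
#include <stdbool.h>

#define GFP_MODEL_TRYLOCK 0x1u   /* hypothetical explicit per-call flag */
#define PF_MODEL_NOLOCK   0x1u   /* hypothetical task-wide scope flag */

static unsigned int task_flags;  /* stands in for current->flags */
static bool lock_contended;      /* pretend zone->lock is held elsewhere */

/* Enter the scope and return the previous state so scopes can nest. */
static unsigned int memalloc_nolock_save_model(void)
{
    unsigned int old = task_flags & PF_MODEL_NOLOCK;
    task_flags |= PF_MODEL_NOLOCK;
    return old;
}

static void memalloc_nolock_restore_model(unsigned int old)
{
    task_flags = (task_flags & ~PF_MODEL_NOLOCK) | old;
}

/* Succeeds unless the call is in trylock mode and the lock is contended. */
static bool alloc_model(unsigned int gfp)
{
    bool nolock = (gfp & GFP_MODEL_TRYLOCK) || (task_flags & PF_MODEL_NOLOCK);

    if (nolock && lock_contended)
        return false;            /* cannot wait, so give up */
    return true;                 /* a normal allocation would wait here */
}

/* A helper written on the assumption that its allocation can wait. */
static bool legacy_helper(void)
{
    return alloc_model(0);
}
```

With the scope flag set, legacy_helper() inherits trylock behaviour it never asked for and can fail under contention, whereas an explicit gfp bit changes only the call that passes it.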