From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A82C0E6FE47 for ; Sat, 7 Sep 2024 04:03:58 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 90D2A6B007B; Sat, 7 Sep 2024 00:03:57 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8BD2E6B0083; Sat, 7 Sep 2024 00:03:57 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7848D6B0085; Sat, 7 Sep 2024 00:03:57 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 6382C6B007B for ; Sat, 7 Sep 2024 00:03:57 -0400 (EDT) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id B01C21214E2 for ; Sat, 7 Sep 2024 04:03:56 +0000 (UTC) X-FDA: 82536598872.15.D278451 Received: from out-178.mta1.migadu.com (out-178.mta1.migadu.com [95.215.58.178]) by imf19.hostedemail.com (Postfix) with ESMTP id C26531A0004 for ; Sat, 7 Sep 2024 04:03:54 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=iHU0WDR1; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf19.hostedemail.com: domain of martin.lau@linux.dev designates 95.215.58.178 as permitted sender) smtp.mailfrom=martin.lau@linux.dev ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1725681809; a=rsa-sha256; cv=none; b=k8beNRP5GzvgkUgPXbfQTmGmkh3+1xmt/oHRkB9Si/SZ1ACKHNTVRoKggZvdOLvCg8IL2z 8Le1XMEfBj0p1eFJjwyqWErJNFNZ8DQ6yx1KM46dN9+TjaQvy251RBnjgnTdr+bqyRPq8V 4dsOT3ZmkdNjXTQQK0dKUOfXo220bYM= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=iHU0WDR1; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf19.hostedemail.com: domain of martin.lau@linux.dev designates 95.215.58.178 as permitted sender) smtp.mailfrom=martin.lau@linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1725681809; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Kk7p9YBa55BsktNkHzf4/b7BnD7sUCF6HUOCyDf0jYU=; b=OeH3vosEUmjao1TxXeP71sRCcT2rnYSvNV2U6tYkb05OyL8/f1RUEe2MN0jIwamHgaX5J0 ZVStXZyiSbQEEoS8375sMuPyRjDnzM0wZ20wlDmm6x0dtvmJxioCQKfQG7Pmo7rZoyJLGO +hm/kAL9QHZlYP8K4ykdmMZ71I9ENqc= Message-ID: <769541a2-4009-4035-a327-838ebdfbf258@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1725681832; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Kk7p9YBa55BsktNkHzf4/b7BnD7sUCF6HUOCyDf0jYU=; b=iHU0WDR1dOZNoZZkfm5XAv7h8fI4ccTHq9QizVq3xSFD5oWtPp1tZ9hhrKFdg91eaFnPQH 6jeAlFkRbXhPdflaOOHjaPkVnVmBHZ8nlide2ImMuAE3tWZYxWfKkukFCW4gTpegjsoROQ 7lz6SzUvQkaXnxBN++JQy17I8OKf0bc= Date: Fri, 6 Sep 2024 21:03:44 -0700 MIME-Version: 1.0 X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Martin KaFai Lau Subject: Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls. To: Alexei Starovoitov Cc: Kui-Feng Lee , bpf , Alexei Starovoitov , Andrii Nakryiko , Kui-Feng Lee , linux-mm References: <20240816191213.35573-1-thinker.li@gmail.com> <20240816191213.35573-5-thinker.li@gmail.com> <70a1b24f-84cd-464c-8fb6-a2c52fd3d703@linux.dev> <8b61f093-a6a6-4f99-91f8-20f2a7235d76@linux.dev> Content-Language: en-US In-Reply-To: <8b61f093-a6a6-4f99-91f8-20f2a7235d76@linux.dev> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT X-Rspam-User: X-Rspamd-Queue-Id: C26531A0004 X-Rspamd-Server: rspam01 X-Stat-Signature: scpjnt9hasrxb3d16oiiqn1agpwpxia5 X-HE-Tag: 1725681834-34259 X-HE-Meta: U2FsdGVkX18k+xMQpJkhBD1mf+hor1JSaaaWW7nT6h/jeXZ3iGzTgyMkcINCn5i7hT8FQ/JLSPWytwj1p9i6COHdazNRfB2v+his39sJGH7obD6SIKtfD6ougEwuDHnJCNOwjLnIC6XAXSG9j8rTciGly1DtWvpM6FCJaES3rS4PBBg/bzhoisfDZAbmK0gT5QzpogXqciyymDXroTyKrQOpVqPYvS5aTTozk9Ky3qCpDRbKR6OOsNuiZorngDXtC0BkvQpiIouVyDcgRMLkj2eJ04fX2KF2d8Nl1AmaTUIJXG7ZQQFK5NyzNvGJ2A7wf6YMpC9V70Nhg4o+0goOJzcsaXQEpiX5qP0WTFFGpSQOlqE9/jkc9nAZltiqyJhpfVic/ovoXIoZ7rjzabevcb90vDTcVDjSoiGyHZIeZx75In26JXgJOBtz0j7zKppm1Sk8HtSOtZVGzfamRXzWH5pRmtOiqxV6rTnkromz5P2b9QgAMfAn5U9Dsm7rQPALS1JSb01VircaxTxSwex/tXWkOAus+5yeDyrSnAYZOBzocOuv6xEogM+V7qFP3+peVcSj81KlWKFLSIwibg8EmtMED0ZqscFXEqmFW87GbiWMBJixQ4kMtpIR/p54AY+K4tGR5UiGakDbPQ1DxbdXtyQ2v+kZc2OzfZTjIoMtCaFkK7KFKroGD9JJQBzQ7KYAyy++5bXzWaJ92bk5dKywgNMOxii8pjhjFdKNDx8t/SnC8ipZ7dP7fEdspuJeRoSMLvH7ZS2Ei8tmJWSH85LrHAcnTskz/WHIONFOwrerZFW0f0YrFqirO7Dt4jSIfM3/S1oBNRUwWtLTPjeyth/JVJHsyWWVIoR+rAXV2hORGNQ3+qeVDwIzDHMMOOcqbWByFT7dTTPM6jyB3Ub5Musp3H+3Ar1NqWe8gLGXswSe+yF4VEX8+cfkcrXKJ5KtatgjWALz/GrhIJUYmaTn+cl QxHEet6M P3PK32YVTETACpDMCm4SXoe5ZapwIZinYjP8ZWkntuDQVy7cmoeUoZ854r9BGUECPLkftD6YLMK5ROGpWP21J2s89kMVG6bm0ftFoCdTm+gACuqEi+YSaSaDDb6Ge9ALwOiY6iDwZ42pMERG38fdqMbdEVo9HxfCdTAKyQ4HBxYGximigIQ7N5NJBD6oITblV4KHmoCbOnv+CGFy1zA4+sBOj1mOndFX0PnryQ1i8gOmp85ZI+6MdXKJ1epj+gE0feIL6CdOHCg9w7pRpvx8Yl22dREMCCG3z1vSh X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 9/6/24 6:32 PM, Martin KaFai Lau wrote: > On 9/6/24 4:44 PM, Alexei Starovoitov wrote: >> On Fri, Sep 6, 2024 at 1:11 PM Martin KaFai Lau wrote: >>> >>> On 9/4/24 3:21 PM, Martin KaFai Lau wrote: >>>> On 8/28/24 4:24 PM, Alexei Starovoitov wrote: >>>>>> @@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, >>>>>> void *obj) >>>>>>                                   field->kptr.dtor(xchgd_field); >>>>>>                           } >>>>>>                           break; >>>>>> +               case BPF_UPTR: >>>>>> +                       if (*(void **)field_ptr) >>>>>> +                               bpf_obj_unpin_uptr(field, *(void >>>>>> **)field_ptr); >>>>>> +                       *(void **)field_ptr = NULL; >>>>> This one will be called from >>>>>    task_storage_delete->bpf_selem_free->bpf_obj_free_fields >>>>> >>>>> and even if upin was safe to do from that context >>>>> we cannot just do: >>>>> *(void **)field_ptr = NULL; >>>>> >>>>> since bpf prog might be running in parallel, >>>>> it could have just read that addr and now is using it. >>>>> >>>>> The first thought of a way to fix this was to split >>>>> bpf_obj_free_fields() into the current one plus >>>>> bpf_obj_free_fields_after_gp() >>>>> that will do the above unpin bit. >>>>> and call the later one from bpf_selem_free_rcu() >>>>> while bpf_obj_free_fields() from bpf_selem_free() >>>>> will not touch uptr. >>>>> >>>>> But after digging further I realized that task_storage >>>>> already switched to use bpf_ma, so the above won't work. >>>>> >>>>> So we need something similar to BPF_KPTR_REF logic: >>>>> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0); >>>>> and then delay of uptr unpin for that address into call_rcu. >>>>> >>>>> Any better ideas? >>>> >>> >>> I think the existing reuse_now arg in the bpf_selem_free can be used. reuse_now >>> (renamed from the earlier use_trace_rcu) was added to avoid call_rcu_tasks_trace >>> for the common case. >>> >>> selem (in type "struct bpf_local_storage_elem") is the one exposed to the bpf >>> prog. >>> >>> bpf_selem_free knows whether a selem can be reused immediately based on the >>> caller. It is currently flagged in the reuse_now arg: "bpf_selem_free(...., bool >>> reuse_now)". >>> >>> If a selem cannot be reuse_now (i.e. == false), it is currently going through >>> "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu)". We can do >>> unpin_user_page() in the rcu callback. >>> >>> A selem can be reuse_now (i.e. == true) if the selem is no longer needed because >>> either its owner (i.e. the task_struct here) is going away in free_task() or the >>> bpf map is being destructed in bpf_local_storage_map_free(). No bpf prog should >>> have a hold on the selem at this point. I think for these two cases, the >>> unpin_user_page() can be directly called in bpf_selem_free(). >> >> but there is also this path: >> bpf_task_storage_delete -> task_storage_delete -> bpf_selem_free >>   -> bpf_obj_free_fields >> >> In this case bpf prog may still be looking at uptr address >> and we cannot do unpin right away in bpf_obj_free_fields. > > cannot unpin immediately in the bpf_task_storage_delete() path is understood. > task_storage can be used in sleepable. It needs to wait for the tasks_trace and > the regular rcu gp before unpin. > > I forgot to mention earlier that bpf_task_storage_delete() will have the > bpf_selem_free(..., reuse_now == false). It will then do the > "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu);". The unpin could > happen in bpf_selem_free_trace_rcu() in this case. I am suggesting to unpin in > bpf_selem_free_trace_rcu together with the selem free. > > I just noticed the map and its btf_record are gone in > bpf_selem_free_trace_rcu()... so won't work. :( Thought about it more. Adding a rcu_barrier() to bpf_local_storage_map_free() may be enough. Then bpf_selem_free_rcu() will have the map->record to unpin. will need to think about it more. > >> All other special fields in map value are ok, >> since they are either relying on bpf_mem_alloc and >> have rcu/rcu_tasks_trace gp >> or extra indirection like timer/wq. >> >>> One existing bug is, from looking at patch 6, I don't think the free_task() case >>> can be "reuse_now == true" anymore because of the bpf_task_release kfunc did not >>> mark the previously obtained task_storage to be invalid: >>> >>> data_task = bpf_task_from_pid(parent_pid); >>> ptr = bpf_task_storage_get(&datamap, data_task, 0, ...); >>> bpf_task_release(data_task); >>> if (!ptr) >>>          return 0; >>> /* The prog still holds a valid task storage ptr. */ >>> udata = ptr->udata; >>> >>> It can be fixed by marking the ref_obj_id of the "ptr". Although it is more >>> correct to make the task storage "ptr" invalid after task_release, it may break >>> the existing progs. >> >> Are you suggesting that bpf_task_release should invalidate all pointers >> fetched from map value? > > I was thinking at least the map value ptr itself needs to be invalidated. > >> That will work, but it's not an issue for other special fields in there >> like kptr. >> So this invalidation would be need only for uptr which feels >> weird to special case it and probably will be confusing to users writing >> such programs. > > hmm... I haven't thought about the other pointer fields that read before the > task_release(). > > Agreed, it is hard to use if only marks uptr invalid. Thinking about it. Even > marking the map value ptr invalid while other previously read fields keep > working is also the same weirdness. > >> Above bpf prog example should be ok to use. >> We only need to delay unpin after rcu/rcu_task_trace gp. >> Hence my proposal in bpf_obj_free_fields() do: >>   case UPTR: >>     xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0); >>     call_rcu(...) to unpin. > > Agree that call_rcu() here is the only option. It probably needs to go through > the tasks_trace gp also. > > Can the page->rcu_head be used here? >