From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5376DC77B73 for ; Wed, 24 May 2023 12:00:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CCA7B900004; Wed, 24 May 2023 08:00:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C7B13900003; Wed, 24 May 2023 08:00:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B4265900004; Wed, 24 May 2023 08:00:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id A31E2900003 for ; Wed, 24 May 2023 08:00:26 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 0486CA0984 for ; Wed, 24 May 2023 12:00:25 +0000 (UTC) X-FDA: 80825006052.01.2CD45DC Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [45.249.212.189]) by imf04.hostedemail.com (Postfix) with ESMTP id B3B144001A for ; Wed, 24 May 2023 12:00:20 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf04.hostedemail.com: domain of linyunsheng@huawei.com designates 45.249.212.189 as permitted sender) smtp.mailfrom=linyunsheng@huawei.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1684929623; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xV1jccLpGfjGWTFV0NV+dDkXjB67HHgQDZ01uKDEH1c=; b=IyBHBNcyL94vwskdf/SRPXn17185nkgkGcpdl2eMbx2NepeQLDWri2xpVaOAA6q4iU/MoX gWgJ+NHm6FuDnHtOuIft3pvqJJKENPsKTZBVxzVPIp2D/aGRydjDiOeypNuMThd3iFVIsH LrC6cK2OfwkARJ1E3giEg4JsBvA6IO0= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf04.hostedemail.com: domain of linyunsheng@huawei.com designates 45.249.212.189 as permitted sender) smtp.mailfrom=linyunsheng@huawei.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1684929623; a=rsa-sha256; cv=none; b=zZ4Qtght4FmX+oHyu1zQNYUNhxGwqJy+LcByNmhCjTvyD+dTW19iYD5ZsIEaZ95DYYhdUU hTBTNk7fgwBvQ6E6UyghckE8Abcseyl8CbEOvdWIB+R3G0D/DSKCJF+sOepuB+MTxDESeq vPfX/K3gwk96LZvGWVlkTh0yGSwRZW4= Received: from dggpemm500005.china.huawei.com (unknown [172.30.72.56]) by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4QR8pt0Nl8zLmHP; Wed, 24 May 2023 19:58:46 +0800 (CST) Received: from [10.69.30.204] (10.69.30.204) by dggpemm500005.china.huawei.com (7.185.36.74) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.23; Wed, 24 May 2023 20:00:14 +0800 Subject: Re: [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme To: =?UTF-8?Q?Toke_H=c3=b8iland-J=c3=b8rgensen?= , Jesper Dangaard Brouer , Ilias Apalodimas , , Eric Dumazet , , Mel Gorman CC: , , "David S. Miller" , Jakub Kicinski , Paolo Abeni , Andrew Morton , References: <168485351546.2849279.13771638045665633339.stgit@firesoul> <168485357834.2849279.8073426325295894331.stgit@firesoul> <87h6s3nhv4.fsf@toke.dk> From: Yunsheng Lin Message-ID: <1d4d9c47-c236-b661-4ac7-788102af8bed@huawei.com> Date: Wed, 24 May 2023 20:00:13 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.2.0 MIME-Version: 1.0 In-Reply-To: <87h6s3nhv4.fsf@toke.dk> Content-Type: text/plain; charset="utf-8" Content-Language: en-US Content-Transfer-Encoding: 8bit X-Originating-IP: [10.69.30.204] X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggpemm500005.china.huawei.com (7.185.36.74) X-CFilter-Loop: Reflected X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: B3B144001A X-Stat-Signature: 6mqrak1ax5s51wt71kqb348choyyi8gf X-HE-Tag: 1684929620-695580 X-HE-Meta: U2FsdGVkX19UmKOjLWqWD0b4F8qqOhBa1zfhdlJETu15eIHQ9XhRXmSv0Mkr5SMX8ORe2tQFe9imw1/Q7EVKkgupTHQgbT0Fr2h1ORoO8pudZCShR+66YkHm1Ym7AtQ0kpoGe+us53udauGXBjk18oRWNnSd5cuLhuyxZoFJRbtev0JlkxMMcm6EXtST3yBgev5nvcGgwHYELdzyjETwqN/MSKwO7fVzWh8Pw8eLdK8YcBR8T9uiTrL56aZ3Gn3kJG3QOSMkWG3QKC1wygFX3XCT9YtI7Mw9bYqp3oR1XA9g6cJ1BhE8pArpNmXJWtnvMzVT0bmzvg/Ssgy6ltC9Obvm8EAjko6ia1NUCPqydsoFMFL+An61llqQYxWuGVWsVbXrJs7Tsr28Opm2GPeYIgL26Gbuzk1tzlxRszqd+R28q1PHBnbwUGB4fOd2dbN60dgQcWG7Iekl7SiDM09HHOqGepXVjvblO6CiNZ41z0klHikbNDkVet/SgQ12oPrtk/Et9QFQSaKmF8hX2b3VdCZheBSt+8P7KNAIdn/epcdqjVoTckRh+nLiAe1Vro1WzAigGsWGxhJm9HcpBktF59KnHlQVbDZ5K4sRuTckcmasGsHy8nCwSpL4+bMBlXW5//el+hShvSM3LS2DbScQu1igfDVtBdpeCWViqDYJK/2+cEXtmCtPlMDDqGgVUR0ExiMCFG2K5hNVW9r5J0dj8mQ0PUH/tAuVErXcevo6H2VrJwycCEv0q2TJLM1i7iIJtibQ4shDJDgqbaRm6PgpfidVXz/EFXcse4/QkHXSmIt8GB8y5KzMS9bvLS3y8UyLqs4J2zCt5KUlHOD9PAn3+E99pMdsdAzUtwD0yUIX9Ki3vnkwOX24YJgzPSvDpZPjxq7DlVOkIn7EIXqzy5USg59d7Mhnlx2Fe3rHUIdYjUUUm3oAI9XCaA7lNBs2hwrG541kPx60ygf2lPBKVHU q5DyR8hU AnNcpAsjs5iXPcwvOEDZZ4eF3mxKBU0qYqxyqnG1+3jeTyhfuNHyOQgDLGfD6UmQYE58DADkYY6Gdd/sEz+dJvmayOBIHswHAXXtEN5h4YERh7kUOY1SYEek0P1urVw+AIC9nzIJUqI1jEoVaSMGtAz8mxSDlFDn6EDcrFTfzSLGdzw/N+l77Aw1Ld3Y3B0EXkRyrND/d6nak+pCSyNt+P0FVpRuTE3EW2RoCYmCJlNJE3usjmmnYZZJBXE0SUoB4UZ0P X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote: >> void page_pool_destroy(struct page_pool *pool) >> { >> + unsigned int flags; >> + u32 release_cnt; >> + u32 hold_cnt; >> + >> if (!pool) >> return; >> >> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool) >> if (!page_pool_release(pool)) >> return; >> >> - pool->defer_start = jiffies; >> - pool->defer_warn = jiffies + DEFER_WARN_INTERVAL; >> + /* PP have pages inflight, thus cannot immediately release memory. >> + * Enter into shutdown phase, depending on remaining in-flight PP >> + * pages to trigger shutdown process (on concurrent CPUs) and last >> + * page will free pool instance. >> + * >> + * There exist two race conditions here, we need to take into >> + * account in the following code. >> + * >> + * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last >> + * pages into the ptr_ring. Thus, it missed triggering shutdown >> + * process, which can then be stalled forever. >> + * >> + * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last >> + * page, which triggered shutdown process and freed pool >> + * instance. Thus, its not safe to dereference *pool afterwards. >> + * >> + * Handling races by holding a fake in-flight count, via artificially >> + * bumping pages_state_hold_cnt, which assures pool isn't freed under >> + * us. Use RCU Grace-Periods to guarantee concurrent CPUs will >> + * transition safely into the shutdown phase. >> + * >> + * After safely transition into this state the races are resolved. For >> + * race(1) its safe to recheck and empty ptr_ring (it will not free >> + * pool). Race(2) cannot happen, and we can release fake in-flight count >> + * as last step. >> + */ >> + hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1; >> + WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt); >> + synchronize_rcu(); >> + >> + flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN; >> + WRITE_ONCE(pool->p.flags, flags); >> + synchronize_rcu(); > > Hmm, synchronize_rcu() can be quite expensive; why do we need two of > them? Should be fine to just do one after those two writes, as long as > the order of those writes is correct (which WRITE_ONCE should ensure)? I am not sure rcu is the right scheme to fix the problem, as rcu is usually for one doing freeing/updating and many doing reading, while the case we try to fix here is all doing the reading and trying to do the freeing. And there might still be data race here as below: cpu0 calling page_pool_destroy() cpu1 caling page_pool_release_page() WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt); WRITE_ONCE(pool->p.flags, flags); synchronize_rcu(); atomic_inc_return() release_cnt = atomic_inc_return(); page_pool_free_attempt(pool, release_cnt); rcu call page_pool_free_rcu() if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN) page_pool_free_attempt() As the rcu_read_[un]lock are only in page_pool_free_attempt(), cpu0 will see the inflight being zero and triger the rcu to free the pp, and cpu1 see the pool->p.flags with PP_FLAG_SHUTDOWN set, it will access pool->pages_state_hold_cnt in __page_pool_inflight(), causing a use-after-free problem? > > Also, if we're adding this (blocking) operation in the teardown path we > risk adding latency to that path (network interface removal, > BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an > improvement anymore, as opposed to just keeping the workqueue but > dropping the warning? we might be able to remove the workqueue from the destroy path, a workqueue might be still needed to be trigered to call page_pool_free() in non-atomic context instead of calling page_pool_free() directly in page_pool_release_page(), as page_pool_release_page() might be called in atomic context and page_pool_free() requires a non-atomic context for put_device() and pool->disconnect using the mutex_lock() in mem_allocator_disconnect().