From: Jesper Dangaard Brouer <jbrouer@redhat.com>
Date: Wed, 24 May 2023 18:42:35 +0200
Message-ID: <13917453-bca3-82aa-e265-b652bda0d29d@redhat.com>
Subject: Re: [PATCH RFC net-next/mm V4 2/2] page_pool: Remove workqueue in new shutdown scheme
To: Yunsheng Lin, Toke Høiland-Jørgensen, Ilias Apalodimas, netdev@vger.kernel.org, Eric Dumazet, linux-mm@kvack.org, Mel Gorman
Cc: brouer@redhat.com, lorenzo@kernel.org, bpf@vger.kernel.org, "David S. Miller", Jakub Kicinski, Paolo Abeni, Andrew Morton, willy@infradead.org
References: <168485351546.2849279.13771638045665633339.stgit@firesoul> <168485357834.2849279.8073426325295894331.stgit@firesoul> <87h6s3nhv4.fsf@toke.dk> <1d4d9c47-c236-b661-4ac7-788102af8bed@huawei.com>
In-Reply-To: <1d4d9c47-c236-b661-4ac7-788102af8bed@huawei.com>

On 24/05/2023 14.00, Yunsheng Lin wrote:
> On 2023/5/24 0:16, Toke Høiland-Jørgensen wrote:
>>>  void page_pool_destroy(struct page_pool *pool)
>>>  {
>>> +        unsigned int flags;
>>> +        u32 release_cnt;
>>> +        u32 hold_cnt;
>>> +
>>>          if (!pool)
>>>                  return;
>>>
>>> @@ -868,11 +894,45 @@ void page_pool_destroy(struct page_pool *pool)
>>>          if (!page_pool_release(pool))
>>>                  return;
>>>
>>> -        pool->defer_start = jiffies;
>>> -        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>>> +        /* PP have pages inflight, thus cannot immediately release memory.
>>> +         * Enter into shutdown phase, depending on remaining in-flight PP
>>> +         * pages to trigger shutdown process (on concurrent CPUs) and last
>>> +         * page will free pool instance.
>>> +         *
>>> +         * There exist two race conditions here, we need to take into
>>> +         * account in the following code.
>>> +         *
>>> +         * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +         *    pages into the ptr_ring.
>>> +         *    Thus, it missed triggering shutdown process, which can
>>> +         *    then be stalled forever.
>>> +         *
>>> +         * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +         *    page, which triggered shutdown process and freed pool
>>> +         *    instance. Thus, its not safe to dereference *pool afterwards.
>>> +         *
>>> +         * Handling races by holding a fake in-flight count, via artificially
>>> +         * bumping pages_state_hold_cnt, which assures pool isn't freed under
>>> +         * us. Use RCU Grace-Periods to guarantee concurrent CPUs will
>>> +         * transition safely into the shutdown phase.
>>> +         *
>>> +         * After safely transition into this state the races are resolved. For
>>> +         * race(1) its safe to recheck and empty ptr_ring (it will not free
>>> +         * pool). Race(2) cannot happen, and we can release fake in-flight count
>>> +         * as last step.
>>> +         */
>>> +        hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>>> +        WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>>> +        synchronize_rcu();
>>> +
>>> +        flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>>> +        WRITE_ONCE(pool->p.flags, flags);
>>> +        synchronize_rcu();
>>
>> Hmm, synchronize_rcu() can be quite expensive; why do we need two of
>> them? Should be fine to just do one after those two writes, as long as
>> the order of those writes is correct (which WRITE_ONCE should ensure)?
>
> I am not sure rcu is the right scheme to fix the problem, as rcu is
> usually for one doing the freeing/updating and many doing the reading,
> while in the case we try to fix here everyone is doing the reading and
> trying to do the freeing.
>
> And there might still be a data race here, as below:
>
>   cpu0 calling page_pool_destroy()               cpu1 calling page_pool_release_page()
>
>   WRITE_ONCE(pool->pages_state_hold_cnt, hold_cnt);
>   WRITE_ONCE(pool->p.flags, flags);
>   synchronize_rcu();
>                                                  atomic_inc_return()
>   release_cnt = atomic_inc_return();
>   page_pool_free_attempt(pool, release_cnt);
>     -> rcu call page_pool_free_rcu()
>                                                  if (READ_ONCE(pool->p.flags) & PP_FLAG_SHUTDOWN)
>                                                          page_pool_free_attempt()
>
> As the rcu_read_[un]lock are only in page_pool_free_attempt(), cpu0 will
> see the inflight count being zero and trigger the rcu callback to free
> the pp, while cpu1 sees pool->p.flags with PP_FLAG_SHUTDOWN set and then
> accesses pool->pages_state_hold_cnt in __page_pool_inflight(), causing a
> use-after-free problem?
>
>> Also, if we're adding this (blocking) operation in the teardown path we
>> risk adding latency to that path (network interface removal,
>> BPF_PROG_RUN syscall etc), so not sure if this actually ends up being an
>> improvement anymore, as opposed to just keeping the workqueue but
>> dropping the warning?
>
> We might be able to remove the workqueue from the destroy path, but a
> workqueue might still be needed to call page_pool_free() in non-atomic
> context, instead of calling page_pool_free() directly from
> page_pool_release_page(): page_pool_release_page() might be called in
> atomic context, while page_pool_free() requires a non-atomic context for
> put_device() and for pool->disconnect taking the mutex_lock() in
> mem_allocator_disconnect().

I thought the call_rcu() callback provided the right context, but skimming
call_rcu() I think it doesn't.

Argh, I think you are right: we cannot avoid the workqueue, as we need the
non-atomic context. Thanks for catching and pointing this out :-)

--Jesper
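
For illustration, a minimal sketch of the pattern the thread converges on,
under the assumption that the final free still has to be bounced out of
atomic context: a call_rcu() callback runs in softirq context, so it only
schedules a work item, and the sleepable page_pool_free() runs from the
workqueue in process context. The helper and struct names (pp_shutdown_ctx,
pp_free_workfn, pp_free_rcu) are hypothetical and not from the patch.

/* Sketch only: defer sleepable teardown from an RCU callback to a workqueue. */
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct page_pool;                             /* opaque for this sketch */
void page_pool_free(struct page_pool *pool);  /* may sleep: put_device(), mutex_lock() */

struct pp_shutdown_ctx {
        struct rcu_head rcu;
        struct work_struct work;
        struct page_pool *pool;
};

static void pp_free_workfn(struct work_struct *work)
{
        struct pp_shutdown_ctx *ctx = container_of(work, struct pp_shutdown_ctx, work);

        page_pool_free(ctx->pool);   /* process context: sleeping is allowed here */
        kfree(ctx);
}

static void pp_free_rcu(struct rcu_head *rcu)
{
        struct pp_shutdown_ctx *ctx = container_of(rcu, struct pp_shutdown_ctx, rcu);

        /* Runs in softirq (atomic) context: cannot take a mutex or call
         * put_device() here, so hand the actual free off to a workqueue.
         */
        INIT_WORK(&ctx->work, pp_free_workfn);
        schedule_work(&ctx->work);
}

/* Whoever releases the last in-flight page would then do something like
 *         call_rcu(&ctx->rcu, pp_free_rcu);
 * after which ctx->pool must not be dereferenced on that path anymore.
 */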