From: Barry Song <21cnbao@gmail.com>
To: ryan.roberts@arm.com
Cc: akpm@linux-foundation.org, andreyknvl@gmail.com, anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com, david@redhat.com, dvyukov@google.com, glider@google.com, james.morse@arm.com, jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com, maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com, suzuki.poulose@arm.com, vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
Date: Tue, 28 Nov 2023 21:17:42 +1300
Message-Id: <20231128081742.39204-1-v-songbaohua@oppo.com>
In-Reply-To: <20231115163018.1303287-15-ryan.roberts@arm.com>
References: <20231115163018.1303287-15-ryan.roberts@arm.com>
> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * When doing a full address space teardown, we can avoid unfolding the
> +	 * contiguous range, and therefore avoid the associated tlbi. Instead,
> +	 * just get and clear the pte. The caller is promising to call us for
> +	 * every pte, so every pte in the range will be cleared by the time the
> +	 * tlbi is issued.
> +	 *
> +	 * This approach is not perfect though, as for the duration between
> +	 * returning from the first call to ptep_get_and_clear_full() and making
> +	 * the final call, the contpte block is in an intermediate state, where
> +	 * some ptes are cleared and others are still set with the PTE_CONT bit.
> +	 * If any other APIs are called for the ptes in the contpte block during
> +	 * that time, we have to be very careful. The core code currently
> +	 * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> +	 * ptep_get() must be careful to ignore the cleared entries when
> +	 * accumulating the access and dirty bits - the same goes for
> +	 * ptep_get_lockless(). The only other calls we might reasonably expect
> +	 * are to set markers in the previously cleared ptes. (We shouldn't see
> +	 * valid entries being set until after the tlbi, at which point we are
> +	 * no longer in the intermediate state). Since markers are not valid,
> +	 * this is safe; set_ptes() will see the old, invalid entry and will not
> +	 * attempt to unfold. And the new pte is also invalid so it won't
> +	 * attempt to fold. We shouldn't see this for the 'full' case anyway.
> +	 *
> +	 * The last remaining issue is returning the access/dirty bits. That
> +	 * info could be present in any of the ptes in the contpte block.
> +	 * ptep_get() will gather those bits from across the contpte block. We
> +	 * don't bother doing that here, because we know that the information is
> +	 * used by the core-mm to mark the underlying folio as accessed/dirty.
> +	 * And since the same folio must be underpinning the whole block (that
> +	 * was a requirement for folding in the first place), that information
> +	 * will make it to the folio eventually once all the ptes have been
> +	 * cleared. This approach means we don't have to play games with
> +	 * accumulating and storing the bits. It does mean that any interleaved
> +	 * calls to ptep_get() may lack correct access/dirty information if we
> +	 * have already cleared the pte that happened to store it. The core code
> +	 * does not rely on this though.

Even without any other threads running and touching those PTEs, this won't
survive on some hardware. We expose inconsistent contpte mappings to the
hardware, and this can crash firmware, even in TrustZone: we have seen
strange, unexplained TrustZone faults on Qualcomm platforms, although MTK
seems fine. When you issue a tlbi for some PTEs whose CONT bit has been
dropped while other PTEs in the block still have CONT set, the hardware
gets thoroughly confused.

zap_pte_range() does a force_flush when the tlb batch is full:

		if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
			force_flush = 1;
			addr += PAGE_SIZE;
			break;
		}

This means a partial tlbi/flush can be exposed directly to the hardware
while some other PTEs in the block are still CONT.

On the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
on fullmm: as long as the zap range covers a large folio, we can flush the
tlb for all of its contptes together in contpte_ptep_get_and_clear_full()
rather than clearing one PTE at a time.
Our approach in [1] is to do one flush for all the contptes and jump
directly to the end of the large folio:

#ifdef CONFIG_CONT_PTE_HUGEPAGE
	if (pte_cont(ptent)) {
		unsigned long next = pte_cont_addr_end(addr, end);

		if (next - addr != HPAGE_CONT_PTE_SIZE) {
			__split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
			/*
			 * After splitting cont-pte
			 * we need to process pte again.
			 */
			goto again_pte;
		} else {
			cont_pte_huge_ptep_get_and_clear(mm, addr, pte);

			tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
			if (unlikely(!page))
				continue;

			if (is_huge_zero_page(page)) {
				tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
				goto cont_next;
			}

			rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
			page_remove_rmap(page, true);
			if (unlikely(page_mapcount(page) < 0))
				print_bad_pte(vma, addr, ptent, page);

			tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
		}
cont_next:
		/* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
		pte += (next - PAGE_SIZE - (addr & PAGE_MASK)) / PAGE_SIZE;
		addr = next - PAGE_SIZE;
		continue;
	}
#endif

This is our "full" counterpart: it clear-flushes CONT_PTES pages at once,
and it never requires tlb->fullmm at all.

static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
							     unsigned long addr,
							     pte_t *ptep,
							     bool flush)
{
	pte_t orig_pte = ptep_get(ptep);

	CHP_BUG_ON(!pte_cont(orig_pte));
	CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
	CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));

	return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539

> + */
> +
> +	return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +

Thanks
Barry