From: Peter Xu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Miaohe Lin, David Hildenbrand, Nadav Amit,
    peterx@redhat.com, Andrea Arcangeli, Jann Horn, John Hubbard,
    Mike Kravetz, James Houghton, Rik van Riel, Muchun Song
Subject: [PATCH v3 3/9] mm/hugetlb: Document huge_pte_offset usage
Date: Fri, 9 Dec 2022 12:00:54 -0500
Message-Id: <20221209170100.973970-4-peterx@redhat.com>
X-Mailer: git-send-email 2.37.3
In-Reply-To: <20221209170100.973970-1-peterx@redhat.com>
References: <20221209170100.973970-1-peterx@redhat.com>
MIME-Version: 1.0
Content-type: text/plain
Content-Transfer-Encoding: 8bit

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk a generic pgtable as long as we hold the
mmap lock for either read or write, because that guarantees the pgtable
pages will always be valid during the walk.

But that's not true for hugetlbfs, especially for shared mappings:
hugetlbfs can have its pgtable freed by pmd unsharing. That means even with
the mmap lock held for the current mm, the PMD pgtable page can still go
away from under us if pmd unsharing is possible during the walk.

So we have two ways to make it safe even for a shared mapping:

  (1) If we hold the hugetlb vma lock for either read or write, it's okay
      because pmd unshare cannot happen at all.

  (2) If we hold the i_mmap_rwsem for either read or write, it's okay
      because even if pmd unshare can happen, the pgtable page cannot be
      freed from under us.

Document it.

Reviewed-by: John Hubbard
Reviewed-by: David Hildenbrand
Signed-off-by: Peter Xu
---
 include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 551834cd5299..d755e2a7c0db 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * Since this function will walk all the pgtable pages (including not only
+ * high-level pgtable page, but also PUD entry that can be unshared
+ * concurrently for VM_SHARED), the caller of this function should be
+ * responsible for its thread safety.  One can follow this rule:
+ *
+ * (1) For private mappings: pmd unsharing is not possible, so holding the
+ *     mmap_lock for either read or write is sufficient. Most callers
+ *     already hold the mmap_lock, so normally, no special action is
+ *     required.
+ *
+ * (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ *     pgtable page can go away from under us!  It can be done by a pmd
+ *     unshare with a follow up munmap() on the other process), then we
+ *     need either:
+ *
+ *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ *           won't happen upon the range (it also makes sure the pte_t we
+ *           read is the right and stable one), or,
+ *
+ *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
+ *           sure even if unshare happened the racy unmap() will wait until
+ *           i_mmap_rwsem is released.
+ *
+ * Option (2.1) is the safest, which guarantees pte stability from pmd
+ * sharing pov, until the vma lock is released.  Option (2.2) doesn't
+ * protect against a concurrent pmd unshare, but it makes sure the pgtable
+ * page is safe to access.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
-- 
2.37.3
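
For readers who want the rule from the new comment in a checkable form, here is a small userspace model. It is only a sketch, not kernel code: struct held_locks, walk_is_safe() and pmd_unshare_excluded() are hypothetical names made up for illustration, and the model covers only the cases the comment spells out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the documented rule (not kernel code).
 * Each flag means "held for either read or write".
 */
struct held_locks {
	bool mmap_lock;		/* mm's mmap_lock */
	bool vma_lock;		/* hugetlb vma lock */
	bool i_mmap_rwsem;	/* hugetlb mapping's i_mmap_rwsem */
};

/*
 * True when the documented rule says the pgtable pages stay valid for a
 * huge_pte_offset() walk.  Lock combinations the comment does not mention
 * are reported as not safe, which is an assumption of this sketch.
 */
bool walk_is_safe(bool vm_shared, struct held_locks l)
{
	if (!vm_shared)
		return l.mmap_lock;		/* rule (1) */
	return l.vma_lock || l.i_mmap_rwsem;	/* rules (2.1) and (2.2) */
}

/*
 * True when a concurrent pmd unshare is also ruled out, i.e. the pte_t
 * that was looked up is stable from the pmd sharing point of view.
 */
bool pmd_unshare_excluded(bool vm_shared, struct held_locks l)
{
	if (!vm_shared)
		return true;	/* private: pmd unsharing is not possible */
	return l.vma_lock;	/* only option (2.1) gives this */
}
```

With only mmap_lock held, walk_is_safe() returns true for a private mapping and false for a shared one. Holding i_mmap_rwsem on a shared mapping makes the walk safe, but pmd_unshare_excluded() stays false until the vma lock is held as well, which matches the distinction the comment draws between (2.1) and (2.2).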