From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B8C82C4167B for ; Wed, 6 Dec 2023 10:24:43 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 48F076B0085; Wed, 6 Dec 2023 05:24:43 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 43EA06B0089; Wed, 6 Dec 2023 05:24:43 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 32FCF6B0092; Wed, 6 Dec 2023 05:24:43 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 237B46B0085 for ; Wed, 6 Dec 2023 05:24:43 -0500 (EST) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 002551A010B for ; Wed, 6 Dec 2023 10:24:42 +0000 (UTC) X-FDA: 81536009646.12.5AB574B Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf04.hostedemail.com (Postfix) with ESMTP id 250B44001A for ; Wed, 6 Dec 2023 10:24:40 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=none; spf=pass (imf04.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com; dmarc=pass (policy=none) header.from=arm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1701858281; a=rsa-sha256; cv=none; b=01NnReYD5f8CVGW0yYJnli6AVARNQABBAqEw/1ji8BzJaQTUcVnRNHUZBWvYAL/XiTFtyQ o2yTJmI9Zor1ElRQGOR2HsFAMCaUSK8Jd446xslexLZPQHKwC9s95/6VDq9cmbIIAGX2+O qCLyclcLJzhQYTQYhJVKdCg13mIbUtI= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=none; spf=pass (imf04.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com; dmarc=pass (policy=none) header.from=arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1701858281; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=vjZ/rF+hFw417CDNYTlvYp9NbKyV3oQ5/B8nA/lJZfk=; b=hUuwJGBneLTmR+d/V+MDbmRaSLtjDCKnrWO07PQyQAu3nIFvWTjzqqVYsSdBfdnuacMybN J2O3jmKejZqmIvNd5BvJsbn7gEd4PcF7Y3yHQAoOcPdjZ+psrs48h/IM0EArdfkzxATlRA 8WlGrX3thPWPD6GcvJyRSp1hYdhtanY= Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 701761474; Wed, 6 Dec 2023 02:25:26 -0800 (PST) Received: from [10.57.73.130] (unknown [10.57.73.130]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id BEFEF3F762; Wed, 6 Dec 2023 02:24:38 -0800 (PST) Message-ID: <54be0bd1-9397-4b7c-9b3c-6680c5d4c248@arm.com> Date: Wed, 6 Dec 2023 10:24:37 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH 0/2] arm64: hugetlb: Fix page fault loop for sw-dirty/hw-clean contiguous PTEs Content-Language: en-GB To: James Houghton Cc: Steve Capper , Will Deacon , Andrew Morton , Mike Kravetz , Muchun Song , Anshuman Khandual , Catalin Marinas , linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20231204172646.2541916-1-jthoughton@google.com> <70dcdad7-5952-48ce-a9b9-042cfea59a5d@arm.com> From: Ryan Roberts In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 250B44001A X-Stat-Signature: p3wmny3bspoc8oa83tsckg3rcuu1s9th X-HE-Tag: 1701858280-238816 X-HE-Meta: U2FsdGVkX19NL15CasVwfgyq0SCg1yYa6+aPMs0q7esfGO61orckNQwBXIQfvBzYSO14QGiMvxcDITk5x3MYkV7rNP8+IYsBaiqpMTdyYXIA2zldv6YX1IcUSCi9ziRKA/d0PjEtLIhaWl2dFB/4x1L5gsWAYGuZxEwhltH80VYCkN78eTLUhxs9/OzaoNazjfNPsxeiAgpABXT4rKot+YbD+YJmr9bYytqLccsGtwVtkiEvUiGxwOsLUK4ljsKMYUXKEYbLY/mzhZw8EP7nI+GXJYRTk85VipCBFDRnxA615aHZK5174zOQj5ewQmDt7X8s08+e1wAmoTS/0jhNTC3JAi0Nqs9mTvpEr8Mxpp9yWoz5dFFwTR4Cxfk7RJWk/4RsXRhIUSyqPKtI+JS16rK1aXjfcNRbmA5i72oj5MTjvKZj2digRzCbVlK4XrDWsIUbya6S9CKaeSBjx59hfzOR34VJaF9HpOQyuQtS5lxG3kgua7/bFQ/KYY5nqK+5vNwDAKslUBLlBl8HJ4Cw5FiaDElRJod/zLTFRVtk9MzH2EeJgo4AYMxBIKy+SiIK5JSl3bFLwsVTz5VbaaMED6WurarwzvljV2R+ZUteFo8TKgJqDr4R3OEao40cYr6xQijMmXUq5aZbhRwcA+uYUdDNb421q1RihdZKwen/FEa0BGjNOhZrXqtx/I1+73dNt93+574NbU90sDYYWAEYqErRzGjhuUiNcaEbWFW7erujJZnWwzgRb/lwnr0koRZb+wNXpmtL0SBxcuXSxgrPha3GmC0CgHG7CuoAfn3bjabfmi3a4322CRpakUu/wFcvZ7B/Te2VMmcVdmVMxdRMEuHU8O3fpfpHyxRDtEn8wgd42j0872JuPQylQ2XsFvDcJerfzmbJSjvGlvHk01SumhapZMPM2kY/ERgYc25rUcaYKd1G35cl97bTI9uKKZ/tTPnEGdr/LlftGMLlyj8 4Nn1bM6I M+Qxmnt43WCbXfjnXbKrKug2jBkKAG7IIU6GUzR1MdAjHUAs13bv91NWl8zYyNqRDH4swCqSm0UHUqtIVAkL5UDP7m5yLsWWuk60E/UT7MsJyJo1ENux2UYYFLvZj68a62uE4vxXv4ufK86PPixVx2lEduzHO/R7G8MeCrVZEWLuFYWz8uRUZbXFfu9FJLeY28+xjSaFeyhd+fpzDu/RTHTbC4mH6PKIfmQsy X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 05/12/2023 17:54, James Houghton wrote: > On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts wrote: >> >> On 04/12/2023 17:26, James Houghton wrote: >>> It is currently possible for a userspace application to enter a page >>> fault loop when using HugeTLB pages implemented with contiguous PTEs >>> when HAFDBS is not available. This happens because: >>> 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean >>> (PTE_DIRTY | PTE_RDONLY | PTE_WRITE). >> >> Hi James, >> >> Do you know how this happens? > > Hi Ryan, > > Thanks for taking a look! I do understand why this is happening. There > is an explanation in the reproducer[1] and also in this cover letter > (though I realize I could have been a little clearer). See below. Sigh... sorry! I totally missed your (excellent) explanation. > >> AFAIK, this is the set of valid bit combinations, and >> PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is >> to understand how this is happening and prevent it? >> >> /* >> * PTE bits configuration in the presence of hardware Dirty Bit Management >> * (PTE_WRITE == PTE_DBM): >> * >> * Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw) >> * 0 0 | 1 0 0 >> * 0 1 | 1 1 0 >> * 1 0 | 1 0 1 >> * 1 1 | 0 1 x >> * >> * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via >> * the page fault mechanism. Checking the dirty status of a pte becomes: >> * >> * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) >> */ > > Thanks for pointing this out. So (1) is definitely a bug. The second > patch in this series makes it impossible to create such a PTE via > pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well). Yes; I think the second patch should be sufficient; I took a quick look at the other helpers and I don't see anything else that could get the PTE to the invalid state. I have a series that starts using the contpte bit for (multi-size) THP opportunistically. This bug will affect that too I think. Your patch #2 will fix for both hugetlb and my series. I'd rather not apply an equivalent to your patch #1 because its not quite as straightforward in my code path. But I'm pretty confident that patch # is all that's needed here. Thanks, Ryan > >>> The second patch in this series makes step (1) less likely to occur. > > It makes it impossible to create this invalid set of bits via > pte_modify(). Assuming all PTE pgprot updates are done via the proper > interfaces, patch #2 might actually make this invalid bit combination > impossible to produce (that's certainly the goal). So perhaps language > stronger than "less likely" is appropriate. > > Here's the sequence of events to trigger this bug, via mprotect(): > >>> Without this patch, we can get the kernel to write a sw-dirty, hw-clean >>> PTE with the following steps (showing the relevant VMA flags and pgprot >>> bits): >>> i. Create a valid, writable contiguous PTE. >>> VMA vmflags: VM_SHARED | VM_READ | VM_WRITE >>> VMA pgprot bits: PTE_RDONLY | PTE_WRITE >>> PTE pgprot bits: PTE_DIRTY | PTE_WRITE >>> ii. mprotect the VMA to PROT_NONE. >>> VMA vmflags: VM_SHARED >>> VMA pgprot bits: PTE_RDONLY >>> PTE pgprot bits: PTE_DIRTY | PTE_RDONLY >>> iii. mprotect the VMA back to PROT_READ | PROT_WRITE. >>> VMA vmflags: VM_SHARED | VM_READ | VM_WRITE >>> VMA pgprot bits: PTE_RDONLY | PTE_WRITE >>> PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY > > With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY | > PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty). > > Thanks! > >>> [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0