Message-ID: <215d225585ff3c5ea90c64e6c9bdff04ab548156.camel@surriel.com>
Subject: [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS
From: Rik van Riel
To: Mike Kravetz
Cc: Chris Mason, David Hildenbrand, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, Andrew Morton
Date: Tue, 25 Oct 2022 16:37:51 -0400

Hi Mike,

After getting promising results initially, we discovered there is yet another bug left with hugetlbfs MADV_DONTNEED.

This one involves a page fault on a hugetlbfs address, while another thread in the same process is in the middle of MADV_DONTNEED on that same memory address.

The code in __unmap_hugepage_range() will clear the page table entry, and then at some point later the lazy TLB code will actually free the huge page back into the hugetlbfs free page pool.

Meanwhile, hugetlb_no_page() will call alloc_huge_page(), and that will fail because the code calling __unmap_hugepage_range() has not actually returned the page to the free list yet.

The result is that the process gets killed with SIGBUS.

I have thought of a few different solutions to this problem, but none of them look good:

- Make MADV_DONTNEED take a write lock on mmap_sem, to exclude page faults. This could make MADV_DONTNEED on VMAs with 4kB pages unacceptably slow.
- Some sort of atomic counter, kept by __unmap_hugepage_range(), indicating that huge pages may have been placed in the tlb gather, to be freed later by tlb_finish_mmu(). This would involve changes to the MMU gather code, outside of hugetlbfs.
- Some sort of generation counter that tracks tlb_gather_mmu cycles in progress, with the alloc_huge_page() failure path waiting for all mmu gather operations that started before it to finish, before retrying the allocation. This requires changes to the generic code, outside of hugetlbfs.

What are the reasonable alternatives here?

Should we see if anybody can come up with a simple solution to the problem, or would it be better to just disable MADV_DONTNEED on hugetlbfs for now?

-- 
All Rights Reversed.
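
To make the interleaving described above concrete, here is a small userspace toy model of the race. It is only a sketch: every name in it (toy_pool, toy_unmap, toy_finish_gather, toy_fault, the check_pending flag) is hypothetical and none of it is kernel code or a kernel API. It assumes a pool with no spare huge pages, and it models the "atomic counter" idea from the list as a plain pending_frees field that lets the allocation failure path tell "frees still in flight" apart from "really out of pages".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types and functions are illustrative, not kernel APIs. */
struct toy_pool {
    int free_pages;    /* huge pages the fault path can allocate right now */
    int pending_frees; /* pages unmapped but still held by the gather batch */
};

enum toy_fault_result { TOY_FAULT_OK, TOY_FAULT_SIGBUS, TOY_FAULT_RETRY };

/*
 * Stands in for __unmap_hugepage_range(): the page table entry is cleared
 * immediately, but the page is only queued for a later free.
 */
static void toy_unmap(struct toy_pool *pool, bool *pte_present)
{
    assert(*pte_present);
    *pte_present = false;
    pool->pending_frees++;
}

/* Stands in for tlb_finish_mmu(): queued pages finally return to the pool. */
static void toy_finish_gather(struct toy_pool *pool)
{
    pool->free_pages += pool->pending_frees;
    pool->pending_frees = 0;
}

/*
 * Stands in for the hugetlb_no_page() allocation step. With check_pending
 * false, an empty pool is always fatal (the reported behaviour). With
 * check_pending true, the failure path asks for a retry while frees are
 * still in flight (a sketch of the counter idea, not a proposed patch).
 */
static enum toy_fault_result toy_fault(struct toy_pool *pool,
                                       bool *pte_present,
                                       bool check_pending)
{
    if (*pte_present)
        return TOY_FAULT_OK;

    if (pool->free_pages > 0) {
        pool->free_pages--;
        *pte_present = true;
        return TOY_FAULT_OK;
    }

    if (check_pending && pool->pending_frees > 0)
        return TOY_FAULT_RETRY;

    return TOY_FAULT_SIGBUS;
}
```

Starting from one mapped page and an empty pool, calling toy_unmap() and then toy_fault() before toy_finish_gather() reproduces the failure in the model: the fault sees no free pages even though one is about to come back. With check_pending set, the same fault reports TOY_FAULT_RETRY instead, and succeeds once toy_finish_gather() has run. The model deliberately ignores locking, reservations, and real concurrency.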