From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C0EF4C00140 for ; Fri, 5 Aug 2022 16:29:25 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 22D828E0003; Fri, 5 Aug 2022 12:29:25 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1DD618E0001; Fri, 5 Aug 2022 12:29:25 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 057BA8E0003; Fri, 5 Aug 2022 12:29:25 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id E877F8E0001 for ; Fri, 5 Aug 2022 12:29:24 -0400 (EDT) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 969C6AC48C for ; Fri, 5 Aug 2022 16:29:24 +0000 (UTC) X-FDA: 79766074248.11.1453159 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf11.hostedemail.com (Postfix) with ESMTP id 6ED6340019 for ; Fri, 5 Aug 2022 16:29:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1659716962; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=O1CTGsepbWky42F1RTyXU8gWqoR/QXpIkPCqa2oyjX8=; b=S5VPn2mNOoohOahFEjrb0Sss8ytA/2LyQ9cQXfKlQKzvWNEO6HASJ7B4ppUAZypsoCrIZW Ya2oUDrMD2J5iFipV3pwgANUjdC51CRhB/8mF4jSIi8RS69fzp1SjrvcEV2+SQjCn6NrzS Dr3RJdNjPgyKH6fh4A+JkO3/iit8QuQ= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-550-ywzkb-Q4NvmfxmziEcoaRQ-1; Fri, 05 Aug 2022 12:29:17 -0400 X-MC-Unique: ywzkb-Q4NvmfxmziEcoaRQ-1 Received: by mail-wm1-f71.google.com with SMTP id q10-20020a1ce90a000000b003a4f6e08166so567281wmc.5 for ; Fri, 05 Aug 2022 09:29:16 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:subject:organization:from :references:cc:to:content-language:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc; bh=O1CTGsepbWky42F1RTyXU8gWqoR/QXpIkPCqa2oyjX8=; b=uHTAFkry/pdNtA+8O39e1RdeTCJm+Fr0VufNs0JtJSH2YxVeFGItlYTD9W1qIrcwo/ kmldZPV2rKyfB2wsf68Fbnxh0kzGmeLm65XV6Z1uKiZB9B1BHaVBqIy32n6XDBI18Fdf hfLcGLKv8Rj9KTl/PixsyScOlvBe7U+mcN+eeLZEPNzdWfHJL1MqOKE2zs+OMUo2H2pu 66kuPzaHA2NkwaHGk/QzdhlTZgddtXN0beRwI+iTVQRVcTTI+AyrnaFQBpbhU8WOVyy4 wYAVNtCyXPMKHagx4VdZiuI+vhfzUqphTRxAJmFvMqErbd2rSibR9ZUDxSVvUdndABiW +KyQ== X-Gm-Message-State: ACgBeo1Cq5ITbsVIFPXbF30PDbc6Jp+qp1I2hOGxwOh5lHltDrl9Lg5j Y7NNU40tT1BXwv4Aat27mXELFrHlN+Dw9tO/qyDK8YzWHazexveQLxJizguYs9aN+sziGdFly3P mZrmO6kTgyHo= X-Received: by 2002:a5d:6b10:0:b0:21e:4bbd:e893 with SMTP id v16-20020a5d6b10000000b0021e4bbde893mr4876966wrw.613.1659716955714; Fri, 05 Aug 2022 09:29:15 -0700 (PDT) X-Google-Smtp-Source: AA6agR6gSfiuOmIzGGnakz5kqQKANQjqtdgDGvfuY8GUxyjWL663RBOrC690GnTBbkDorGJ4vk59Ig== X-Received: by 2002:a5d:6b10:0:b0:21e:4bbd:e893 with SMTP id v16-20020a5d6b10000000b0021e4bbde893mr4876944wrw.613.1659716955386; Fri, 05 Aug 2022 09:29:15 -0700 (PDT) Received: from ?IPV6:2003:cb:c706:fb00:f5c3:24b2:3d03:9d52? (p200300cbc706fb00f5c324b23d039d52.dip0.t-ipconnect.de. [2003:cb:c706:fb00:f5c3:24b2:3d03:9d52]) by smtp.gmail.com with ESMTPSA id j20-20020a05600c485400b003a3561d4f3fsm4498609wmo.43.2022.08.05.09.28.55 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 05 Aug 2022 09:29:09 -0700 (PDT) Message-ID: Date: Fri, 5 Aug 2022 18:28:41 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0 To: Mike Kravetz , Miaohe Lin Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Muchun Song , Michal Hocko , Peter Xu , Naoya Horiguchi , "Aneesh Kumar K . V" , Andrea Arcangeli , "Kirill A . Shutemov" , Davidlohr Bueso , Prakash Sangappa , James Houghton , Mina Almasry , Pasha Tatashin , Axel Rasmussen , Ray Fucillo , Andrew Morton References: <20220706202347.95150-1-mike.kravetz@oracle.com> <20220706202347.95150-5-mike.kravetz@oracle.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [RFC PATCH v4 4/8] hugetlbfs: catch and handle truncate racing with page faults In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1659716964; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=O1CTGsepbWky42F1RTyXU8gWqoR/QXpIkPCqa2oyjX8=; b=HzfTTp51MwxiWRZkZjhzXq5VFNdrWSMH7AoZ1/qdbNO2aNP6SHYr13n0347TvHb7FqRtOC aOYfPCYSsN0vIrU9o/zILyrbBuErgrm5SYlL8qgkH/dq56Zth6Ch1GZ7j9xPDS63Ysbjb/ +C/XPGIAUU61a/LzWbJbVximCXlmGmQ= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=S5VPn2mN; spf=pass (imf11.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1659716964; a=rsa-sha256; cv=none; b=v/4MqSxh3sa97pJuvJ/YbMAIwYI6qxDo3cKeMQ80KN+6BtGdqpx8PsxtJ3jFg9EGI2liMZ OZgxvQhqSq1sonyTqv1kjgF55nAMUXYVw1E+jDG6u7XHYqm6U4XnSPK/tYxCniitN6n/x1 UnIs8sae+Uksxm9cajyql1QqD7y4Iy4= Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=S5VPn2mN; spf=pass (imf11.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 6ED6340019 X-Stat-Signature: 8aojtz9gxwnumbonju7ef4tsyu4uqd6m X-Rspam-User: X-HE-Tag: 1659716963-771185 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 28.07.22 18:45, Mike Kravetz wrote: > On 07/28/22 10:02, Miaohe Lin wrote: >> On 2022/7/28 3:00, Mike Kravetz wrote: >>> On 07/27/22 17:20, Miaohe Lin wrote: >>>> On 2022/7/7 4:23, Mike Kravetz wrote: >>>>> Most hugetlb fault handling code checks for faults beyond i_size. >>>>> While there are early checks in the code paths, the most difficult >>>>> to handle are those discovered after taking the page table lock. >>>>> At this point, we have possibly allocated a page and consumed >>>>> associated reservations and possibly added the page to the page cache. >>>>> >>>>> When discovering a fault beyond i_size, be sure to: >>>>> - Remove the page from page cache, else it will sit there until the >>>>> file is removed. >>>>> - Do not restore any reservation for the page consumed. Otherwise >>>>> there will be an outstanding reservation for an offset beyond the >>>>> end of file. >>>>> >>>>> The 'truncation' code in remove_inode_hugepages must deal with fault >>>>> code potentially removing a page/folio from the cache after the page was >>>>> returned by filemap_get_folios and before locking the page. This can be >>>>> discovered by a change in folio_mapping() after taking folio lock. In >>>>> addition, this code must deal with fault code potentially consuming >>>>> and returning reservations. To synchronize this, remove_inode_hugepages >>>>> will now take the fault mutex for ALL indices in the hole or truncated >>>>> range. In this way, it KNOWS fault code has finished with the page/index >>>>> OR fault code will see the updated file size. >>>>> >>>>> Signed-off-by: Mike Kravetz >>>>> --- >>>> >>>> >>>> >>>>> @@ -5606,8 +5610,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, >>>>> >>>>> ptl = huge_pte_lock(h, mm, ptep); >>>>> size = i_size_read(mapping->host) >> huge_page_shift(h); >>>>> - if (idx >= size) >>>>> + if (idx >= size) { >>>>> + beyond_i_size = true; >>>> >>>> Thanks for your patch. There is one question: >>>> >>>> Since races between hugetlb pagefault and truncate is guarded by hugetlb_fault_mutex, >>>> do we really need to check it again after taking the page table lock? >>>> >>> >>> Well, the fault mutex can only guard a single hugetlb page. The fault mutex >>> is actually an array/table of mutexes hashed by mapping address and file index. >>> So, during truncation we take take the mutex for each page as they are >>> unmapped and removed. So, the fault mutex only synchronizes operations >>> on one specific page. The idea with this patch is to coordinate the fault >>> code and truncate code when operating on the same page. >>> >>> In addition, changing the file size happens early in the truncate process >>> before taking any locks/mutexes. >> >> I wonder whether we can somewhat live with it to make code simpler. When changing the file size happens >> after checking i_size but before taking the page table lock in hugetlb_fault, the truncate code would >> remove the hugetlb page from the page cache for us after hugetlb_fault finishes if we don't roll back >> when checking i_size again under the page table lock? >> >> In a word, if hugetlb_fault see a truncated inode, back out early. If not, let truncate code does its >> work. So we don't need to complicate the already complicated error path. Or am I miss something? >> > > Thank you! I believe your observations and suggestions are correct. > > We can just let the fault code proceed after the early "idx >= size", > and let the truncation code remove the page. This also eliminates the > need for patch 3 (hugetlbfs: move routine remove_huge_page to hugetlb.c). At least remaining the functions would be very welcome nonetheless :) > > I will make these changes in the next version. Just so I understand correctly, we want to let fault handling code back out early if we find any incompatible change, and simply retry the fault? I'm thinking about some kind of a high-level seqcount. -- Thanks, David / dhildenb