From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 29FD3C433E7 for ; Wed, 14 Oct 2020 15:38:47 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 2CD782222C for ; Wed, 14 Oct 2020 15:38:46 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="a0nr9Jo2" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2CD782222C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 8611B900002; Wed, 14 Oct 2020 11:38:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7E9DF6B0070; Wed, 14 Oct 2020 11:38:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6B16A900002; Wed, 14 Oct 2020 11:38:45 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0173.hostedemail.com [216.40.44.173]) by kanga.kvack.org (Postfix) with ESMTP id 3292C6B006E for ; Wed, 14 Oct 2020 11:38:45 -0400 (EDT) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id B3DBB824999B for ; Wed, 14 Oct 2020 15:38:44 +0000 (UTC) X-FDA: 77370938568.24.duck06_0b111bf2720c Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin24.hostedemail.com (Postfix) with ESMTP id 919C51A4A7 for ; Wed, 14 Oct 2020 15:38:44 +0000 (UTC) X-HE-Tag: duck06_0b111bf2720c X-Filterd-Recvd-Size: 9089 Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) by imf08.hostedemail.com (Postfix) with ESMTP for ; Wed, 14 Oct 2020 15:38:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Transfer-Encoding: Content-Type:MIME-Version:References:Message-ID:Subject:Cc:To:From:Date: Sender:Reply-To:Content-ID:Content-Description; bh=gCQenQmonwfZIEocLMbRlLubh9CqyAW01QUy3LdwUVk=; b=a0nr9Jo2mkDYlUfy/x6loKnsUx wavuwzHSlSherzjmTVTOS1Ry8BxZFllJ6TUgEWDMFp8TBsAxH0vuDS9iE1nanvNvA/EMvQ/327INC ZkeONnkRD+A6HvJXgAXUN78c+pc+rhWoPGGVpnL/gkBKE6GUvYjADrtjmQIKxYU25sYrdK9dl0+Cy xbzptPxy7sogf/dIn00HLHhpFuyFl9FPNN8PkvTHHn+Nuxt1G44eqVtyhvVXMmiixIe/WITX0KdC0 A4mSpwJ4m25riQTOdyIfKccOZQrMKzFFOwDPg/dIjcCLFckcEzdqBl+f4O2R5TBP9MmgCzbuo/MJs j5K34sIA==; Received: from willy by casper.infradead.org with local (Exim 4.92.3 #3 (Red Hat Linux)) id 1kSirM-0006HA-H3; Wed, 14 Oct 2020 15:38:36 +0000 Date: Wed, 14 Oct 2020 16:38:36 +0100 From: Matthew Wilcox To: Chris Mason Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Guoqing Jiang , Miaohe Lin , David Howells Subject: Re: PagePrivate handling Message-ID: <20201014153836.GM20115@casper.infradead.org> References: <20201014134909.GL20115@casper.infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Oct 14, 2020 at 10:50:51AM -0400, Chris Mason wrote: > On 14 Oct 2020, at 9:49, Matthew Wilcox wrote: > > Our handling of PagePrivate, page->private and PagePrivate2 is a gian= t > > mess. Let's recap. > >=20 > > Filesystems which use bufferheads (ie most of them) set page->private > > to point to a buffer_head, set the PagePrivate bit and increment the > > refcount on the page. > >=20 > > The vmscan pageout code (*) needs to know whether a page is freeable: > > if (!is_page_cache_freeable(page)) > > return PAGE_KEEP; > > ... where is_page_cache_freeable() contains ... > > return page_count(page) - page_has_private(page) =3D=3D 1 + > > page_cache_pins; > >=20 > > That's a little inscrutable, but the important thing is that if > > page_has_private() is true, then the page's reference count is suppos= ed > > to be one higher than it would be otherwise. And that makes sense gi= ven > > how "having bufferheads" means "page refcount ges incremented". > >=20 > > But page_has_private() doesn't actually mean "PagePrivate is set". > > It means "PagePrivate or PagePrivate2 is set". And I don't understan= d > > how filesystems are supposed to keep that straight -- if we're settin= g > > PagePrivate2, and PagePrivate is clear, increment the refcount? > > If we're clearing PagePrivate, decrement the refcount if PagePrivate2 > > is also clear? >=20 > At least for btrfs, only PagePrivate elevates the refcount on the page. > PagePrivate2 means: >=20 > This page has been properly setup for COW=E2=80=99d IO, and it went thr= ough the > normal path of page_mkwrite() or file_write() instead of being silently > dirtied by a deep corner of the MM. What's not clear to me is whether btrfs can be in the situation where PagePrivate2 is set and PagePrivate is clear. If so, then we have a bug to fix. > > We introduced attach_page_private() and detach_page_private() earlier > > this year to help filesystems get the refcount right. But we still > > have a few filesystems using PagePrivate themselves (afs, btrfs, ceph= , > > crypto, erofs, f2fs, jfs, nfs, orangefs & ubifs) and I'm not convince= d > > they're all getting it right. > >=20 > > Here's a bug I happened on while looking into this: > >=20 > > if (page_has_private(page)) > > attach_page_private(newpage, detach_page_private(page= )); > >=20 > > if (PagePrivate2(page)) { > > ClearPagePrivate2(page); > > SetPagePrivate2(newpage); > > } > >=20 > > The aggravating thing is that this doesn't even look like a bug. > > You have to be in the kind of mood where you're thinking "What if pag= e > > has Private2 set and Private clear?" and the answer is that newpage > > ends up with PagePrivate set, but page->private set to NULL. >=20 > Do you mean PagePrivate2 set but page->private NULL? Sorry, I skipped a step of the explanation. page_has_private returns true if Private or Private2 is set. So if page has PagePrivate clear and PagePrivate2 set, newpage will end up with both PagePrivate and PagePrivate2 set -- attach_page_private() doesn't check whether the pointer is NULL (and IMO, it shouldn't). Given our current macros, what was _meant_ here was: if (PagePrivate(page)) attach_page_private(newpage, detach_page_private(page)); but that's not obviously right. > Btrfs should only hage > PagePrivate2 set on pages that are formally in our writeback state mach= ine, > so it=E2=80=99ll get cleared as we unwind through normal IO or truncate= etc. For > data pages, btrfs page->private is simply set to 1 so the MM will kindl= y > call releasepage for us. That's not what I'm seeing here: static void attach_extent_buffer_page(struct extent_buffer *eb, struct page *page) { if (!PagePrivate(page)) attach_page_private(page, eb); else WARN_ON(page->private !=3D (unsigned long)eb); } Or is that not a data page? > > So what shold we do about all this? First, I want to make the code > > snippet above correct, because it looks right. So page_has_private() > > needs to test just PagePrivate and not PagePrivate2. Now we need a > > new function to call to determine whether the filesystem needs its > > invalidatepage callback invoked. Not sure what that should be named. >=20 > I haven=E2=80=99t checked all the page_has_private() callers, but maybe > page_has_private() should stay the same and add page_private_count() fo= r > times where we need to get out our fingers and toes for the refcount ma= th. I was thinking about page_expected_count() which returns the number of references from the page cache plus the number of references from the various page privates. So is_page_cache_freeable() becomes: return page_count(page) =3D=3D page_expected_count(page) + 1; can_split_huge_page() becomes: if (page_has_private(page)) return false; return page_count(page) =3D=3D page_expected_count(page) + total_mapcount(page) + 1; > > I think I also want to rename PG_private_2 to PG_owner_priv_2. > > There's a clear relationship between PG_private and page->private. > > There is no relationship between PG_private_2 and page->private, so i= t's > > a misleading name. Or maybe it should just be PG_fscache and btrfs c= an > > find some other way to mark the pages? >=20 > Btrfs should be able to flip bits in page->private to cover our current > usage of PG_private_2. If we ever attach something real to page->priva= te, > we can flip bits in that instead. It=E2=80=99s kinda messy though and = we=E2=80=99d have to > change attach_page_private a little to reflect its new life as a bit se= tting > machine. It's not great, but with David wanting to change how PageFsCache is used, it may be unavoidable (I'm not sure if he's discussed that with you yet) https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/com= mit/?h=3Dfscache-iter&id=3D6f10fd7766ed6d87c3f696bb7931281557b389f5 shows= part of it -- essentially he wants to make PagePrivate2 mean that I/O is currently ongoing to an fscache, and so truncate needs to wait on it being finished= . > >=20 > > Also ... do we really need to increment the page refcount if we have > > PagePrivate set? I'm not awfully familiar with the buffercache -- is > > it possible we end up in a situation where a buffer, perhaps under I/= O, > > has the last reference to a struct page? It seems like that referenc= e > > is > > always put from drop_buffers() which is called from > > try_to_free_buffers() > > which is always called by someone who has a reference to a struct pag= e > > that they got from the pagecache. So what is this reference count fo= r? >=20 > I=E2=80=99m not sure what we gain by avoiding the refcount bump? Many = filesystems > use the pattern of: =E2=80=9Cput something in page->private, free that = thing in > releasepage.=E2=80=9D Without the refcount bump it feels like we=E2=80= =99d have more magic > to avoid freeing the page without leaking things in page->private. I t= hink > the extra ref lets the FS crowd keep our noses out of the MM more often= , so > it seems like a net positive to me. The question is whether the "thing" in page->private can ever have the last reference on a struct page. Gao says erofs can be in that situation= , so never mind this change.