linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Yang Shi <yang.shi@linux.alibaba.com>
Cc: hughd@google.com, kirill.shutemov@linux.intel.com,
	aarcange@redhat.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
Date: Mon, 25 Nov 2019 21:33:50 +0300	[thread overview]
Message-ID: <20191125183350.5gmcln6t3ofszbsy@box> (raw)
In-Reply-To: <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com>

On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
> 
> 
> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
> > > Currently when truncating shmem file, if the range is partial of THP
> > > (start or end is in the middle of THP), the pages actually will just get
> > > cleared rather than being freed unless the range cover the whole THP.
> > > Even though all the subpages are truncated (randomly or sequentially),
> > > the THP may still be kept in page cache.  This might be fine for some
> > > usecases which prefer preserving THP.
> > > 
> > > But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> > > or MADV_DONTNEED in base page size granulairty if hugetlbfs is not used.
> > > So, when using shmem THP as memory backend QEMU inflation actually doesn't
> > > work as expected since it doesn't free memory.  But, the inflation
> > > usecase really needs get the memory freed.  Anonymous THP will not get
> > > freed right away too but it will be freed eventually when all subpages are
> > > unmapped, but shmem THP would still stay in page cache.
> > > 
> > > To protect the usecases which may prefer preserving THP, introduce a
> > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means spltting THP is
> > > preferred behavior if truncating partial THP.  This mode just makes
> > > sense to tmpfs for the time being.
> > We need to clarify interaction with khugepaged. This implementation
> > doesn't do anything to prevent khugepaged from collapsing the range back
> > to THP just after the split.
> 
> Yes, it doesn't. Will clarify this in the commit log.

Okay, but I'm not sure that documention alone will be enough. We need
proper design.

> > > @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> > >   			}
> > >   			unlock_page(page);
> > >   		}
> > > +rescan_split:
> > >   		pagevec_remove_exceptionals(&pvec);
> > >   		pagevec_release(&pvec);
> > > +
> > > +		if (split && PageTransCompound(page)) {
> > > +			/* The THP may get freed under us */
> > > +			if (!get_page_unless_zero(compound_head(page)))
> > > +				goto rescan_out;
> > > +
> > > +			lock_page(page);
> > > +
> > > +			/*
> > > +			 * The extra pins from page cache lookup have been
> > > +			 * released by pagevec_release().
> > > +			 */
> > > +			if (!split_huge_page(page)) {
> > > +				unlock_page(page);
> > > +				put_page(page);
> > > +				/* Re-look up page cache from current index */
> > > +				goto again;
> > > +			}
> > > +			unlock_page(page);
> > > +			put_page(page);
> > > +		}
> > > +rescan_out:
> > >   		index++;
> > >   	}
> > Doing get_page_unless_zero() just after you've dropped the pin for the
> > page looks very suboptimal.
> 
> If I don't drop the pins the THP can't be split. And, there might be more
> than one pins from find_get_entries() if I read the code correctly. For
> example, truncate 8K length in the middle of THP, the THP's refcount would
> get bumpped twice since  two sub pages would be returned.

Pin the page before pagevec_release() and avoid get_page_unless_zero().

Current code is buggy. You need to check that the page is still belong to
the file after speculative lookup.

-- 
 Kirill A. Shutemov


  reply	other threads:[~2019-11-25 18:33 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-23  1:05 Yang Shi
2019-11-25  9:36 ` Kirill A. Shutemov
2019-11-25 18:24   ` Yang Shi
2019-11-25 18:33     ` Kirill A. Shutemov [this message]
2019-11-25 19:33       ` Yang Shi
2019-11-26 23:34         ` Yang Shi
2019-11-28  3:06           ` Hugh Dickins
2019-11-28 11:34             ` Kirill A. Shutemov
2019-12-02 23:14               ` Yang Shi
2019-12-02 23:07             ` Yang Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191125183350.5gmcln6t3ofszbsy@box \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox