Re: [RFC v3 PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate()

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	David Rientjes <rientjes@google.com>,
	Hugh Dickins <hughd@google.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>,
	Hillf Danton <hillf.zj@alibaba-inc.com>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [RFC v3 PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate()
Date: Tue, 26 May 2015 10:53:18 -0700
Message-ID: <5564B30E.8040809@oracle.com>
In-Reply-To: <20150526065453.GB17652@hori1.linux.bs1.fc.nec.co.jp>

On 05/25/2015 11:54 PM, Naoya Horiguchi wrote:
> On Thu, May 21, 2015 at 08:47:43AM -0700, Mike Kravetz wrote:
>> This is based on the shmem version, but it has diverged quite
>> a bit.  We have no swap to worry about, nor the new file sealing.
>> Add synchronization via the fault mutex table to coordinate
>> page faults, fallocate allocation and fallocate hole punch.
>>
>> What this allows us to do is move physical memory in and out of
>> a hugetlbfs file without having it mapped.  This also gives us
>> the ability to support MADV_REMOVE since it is currently
>> implemented using fallocate().  MADV_REMOVE lets madvise() remove
>> pages from the middle of a hugetlbfs file, which wasn't possible
>> before.
>>
>> hugetlbfs fallocate only operates on whole huge pages.
>>
>> Based-on code-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> This patch changes the behavior of the user API, so please update the
> fallocate(2) man page.

Will do.

Unfortunately, I believe hugetlbfs does not follow the man page
for ftruncate.  So, I will look to get that updated as well.
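
For reference, the user-visible interface under discussion could be
exercised roughly as follows. This is a minimal user-space sketch, not
part of the patch: the helper names hugetlb_prealloc() and
hugetlb_punch_hole() are hypothetical, and fd is assumed to refer to an
open file on a hugetlbfs mount. Per fallocate(2), FALLOC_FL_PUNCH_HOLE
must be ORed with FALLOC_FL_KEEP_SIZE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>		/* fallocate() */
#include <linux/falloc.h>	/* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>

/*
 * Hypothetical helper: ask the filesystem to back [offset, offset + len)
 * with pages.  Returns 0 on success, -1 with errno set on failure.
 */
static int hugetlb_prealloc(int fd, off_t offset, off_t len)
{
	assert(offset >= 0 && len > 0);
	return fallocate(fd, 0, offset, len);
}

/*
 * Hypothetical helper: release the pages backing [offset, offset + len)
 * without changing the file size.  Returns 0 on success, -1 with errno
 * set on failure.
 */
static int hugetlb_punch_hole(int fd, off_t offset, off_t len)
{
	assert(offset >= 0 && len > 0);
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}
```

Since hugetlbfs fallocate only operates on whole huge pages, callers
would normally pass offsets and lengths that are multiples of the huge
page size of the mount.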

>> ---
>>   fs/hugetlbfs/inode.c    | 169 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   include/linux/hugetlb.h |   3 +
>>   mm/hugetlb.c            |   2 +-
>>   3 files changed, 172 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index dfa88a5..4b1535f 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -12,6 +12,7 @@
>>   #include <linux/thread_info.h>
>>   #include <asm/current.h>
>>   #include <linux/sched.h>		/* remove ASAP */
>> +#include <linux/falloc.h>
>>   #include <linux/fs.h>
>>   #include <linux/mount.h>
>>   #include <linux/file.h>
>> @@ -493,6 +494,171 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>>   	return 0;
>>   }
>>   
>> +static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>> +{
>> +	struct hstate *h = hstate_inode(inode);
>> +	unsigned long hpage_size = huge_page_size(h);
>> +	loff_t hole_start, hole_end;
>> +
>> +	/*
>> +	 * For hole punch round up the beginning offset of the hole and
>> +	 * round down the end.
>> +	 */
>> +	hole_start = (offset + hpage_size - 1) & huge_page_mask(h);
>> +	hole_end = (offset + len) & huge_page_mask(h);
> 
> We have the round_up/round_down macros, so please use them here.
> The code is then self-descriptive, so you don't need the comment.
> 
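
The two rounding rules in the patch (hole punch shrinks the range to
whole huge pages, preallocation grows it) can be modelled in user space
as below. This is only a sketch: round_up_pow2() and round_down_pow2()
are stand-ins for the kernel's round_up()/round_down() macros, the
struct and function names are hypothetical, and the huge page size is
assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's round_down(); align must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Stand-in for the kernel's round_up(); align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return round_down_pow2(x + align - 1, align);
}

struct range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/*
 * Hole punch: round the start up and the end down, so only huge pages
 * that lie entirely inside [offset, offset + len) are removed.  The
 * result is in bytes; end <= start means there is nothing to punch.
 */
static struct range punch_range(uint64_t offset, uint64_t len,
				uint64_t hpage_size)
{
	struct range r = {
		.start = round_up_pow2(offset, hpage_size),
		.end = round_down_pow2(offset + len, hpage_size),
	};
	return r;
}

/*
 * Preallocation: round the start down and the end up, so every huge
 * page touched by [offset, offset + len) is allocated.  The result is
 * in huge page indices, as in the patch's start/end variables.
 */
static struct range prealloc_index_range(uint64_t offset, uint64_t len,
					 uint64_t hpage_size)
{
	struct range r = {
		.start = round_down_pow2(offset, hpage_size) / hpage_size,
		.end = round_up_pow2(offset + len, hpage_size) / hpage_size,
	};
	return r;
}
```

With 2 MB huge pages, punching 2 MB starting at offset 4096 covers no
complete huge page, so start equals end and the hole punch is a no-op.
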
>> +
>> +	if ((u64)hole_end > (u64)hole_start) {
> 
> Why is this casting to u64 necessary?

It is not necessary.  I will remove it.

>> +		struct address_space *mapping = inode->i_mapping;
>> +
>> +		mutex_lock(&inode->i_mutex);
>> +		i_mmap_lock_write(mapping);
>> +		if (!RB_EMPTY_ROOT(&mapping->i_mmap))
>> +			hugetlb_vmdelete_list(&mapping->i_mmap,
>> +						hole_start >> PAGE_SHIFT,
>> +						hole_end  >> PAGE_SHIFT);
>> +		i_mmap_unlock_write(mapping);
>> +		remove_inode_hugepages(inode, hole_start, hole_end);
>> +		mutex_unlock(&inode->i_mutex);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>> +				loff_t len)
>> +{
>> +	struct inode *inode = file_inode(file);
>> +	struct address_space *mapping = inode->i_mapping;
>> +	struct hstate *h = hstate_inode(inode);
>> +	struct vm_area_struct pseudo_vma;
>> +	unsigned long hpage_size = huge_page_size(h);
>> +	unsigned long hpage_shift = huge_page_shift(h);
>> +	pgoff_t start, index, end;
>> +	int error;
>> +	u32 hash;
>> +
>> +	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (mode & FALLOC_FL_PUNCH_HOLE)
>> +		return hugetlbfs_punch_hole(inode, offset, len);
>> +
>> +	/*
>> +	 * Default preallocate case.
>> +	 * For this range, start is rounded down and end is rounded up.
>> +	 */
>> +	start = offset >> hpage_shift;
>> +	end = (offset + len + hpage_size - 1) >> hpage_shift;
>> +
>> +	mutex_lock(&inode->i_mutex);
>> +
>> +	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
>> +	error = inode_newsize_ok(inode, offset + len);
>> +	if (error)
>> +		goto out;
>> +
>> +	/*
>> +	 * Initialize a pseudo vma that just contains the policy used
>> +	 * when allocating the huge pages.  The actual policy field
>> +	 * (vm_policy) is determined based on the index in the loop below.
>> +	 */
>> +	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
>> +	pseudo_vma.vm_start = 0;
>> +	pseudo_vma.vm_flags |= (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
> 
> Maybe '|' isn't necessary.

No, it is not necessary.  I will remove it.

>> +	pseudo_vma.vm_file = file;
>> +
>> +	for (index = start; index < end; index++) {
>> +		/*
>> +		 * This is supposed to be the vaddr where the page is being
>> +		 * faulted in, but we have no vaddr here.
>> +		 */
>> +		struct page *page;
>> +		unsigned long addr;
>> +		int avoid_reserve = 0;
> 
> avoid_reserve is referenced only once and never changed, so is the
> variable needed at all?

It is not necessary.  I will remove it.

>> +
>> +		cond_resched();
>> +
>> +		/*
>> +		 * fallocate(2) manpage permits EINTR; we may have been
>> +		 * interrupted because we are using up too much memory.
>> +		 */
>> +		if (signal_pending(current)) {
>> +			error = -EINTR;
>> +			break;
>> +		}
>> +
>> +		/* Get policy based on index */
>> +		pseudo_vma.vm_policy =
>> +			mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
>> +							index);
>> +
>> +		/* addr is the offset within the file (zero based) */
>> +		addr = index * hpage_size;
>> +
>> +		/* mutex taken here, fault path and hole punch */
>> +		hash = hugetlb_fault_mutex_shared_hash(mapping, index);
>> +		hugetlb_fault_mutex_lock(hash);
>> +
>> +		/* see if page already exists to avoid alloc/free */
>> +		page = find_get_page(mapping, index);
>> +		if (page) {
>> +			put_page(page);
>> +			hugetlb_fault_mutex_unlock(hash);
> 
> Don't you need mpol_cond_put() here?

Yes.  Thank you, I will add it.

>> +			continue;
>> +		}
>> +
>> +		page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
>> +		mpol_cond_put(pseudo_vma.vm_policy);
>> +		if (IS_ERR(page)) {
>> +			hugetlb_fault_mutex_unlock(hash);
>> +			error = PTR_ERR(page);
>> +			goto out;
>> +		}
>> +		clear_huge_page(page, addr, pages_per_huge_page(h));
>> +		__SetPageUptodate(page);
> 
> Note that I recently added page_huge_active() to track whether a hugepage
> is active, so when you rebase onto v4.1-rc1 or later, please insert
> set_page_huge_active(page) here.
> 

Yes, I noticed your change.

>> +		error = huge_add_to_page_cache(page, mapping, index);
>> +		if (error) {
>> +			/*
>> +			 * An entry already exists in the cache.  This implies
>> +			 * a region also existed in the reserve map at the time
>> +			 * the page was allocated above.  Therefore, no use
>> +			 * count was added to the subpool for the page.  Before
>> +			 * freeing the page, clear the subpool reference so
>> +			 * that the count is not decremented.
>> +			 */
>> +			set_page_private(page, 0);/* clear spool reference */
> 
> This looks unclear to me. Which "count" do you refer to in the comment
> "no use count was added to the subpool" or "the count is not decremented"?
> I guess spool->used_hpages or spool->rsv_hpages, but alloc_huge_page() above
> should call hugepage_subpool_get_pages(), so it's accounted, right?
> Could you write comments more specifically?

Yes, this is confusing.  As I am reexamining the code, I see that
it is incorrect.  This code may not be necessary.  It was there to
handle a race with page faults.  The code now uses the hugetlb fault
mutex table to synchronize with page faults.  I will do some more
work here and expect this confusing code to go away.
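
As a rough illustration of the idea behind the fault mutex table, the
sketch below is a user-space model using pthreads. It is not the kernel
implementation: the table size, the mixing function and all names here
are hypothetical stand-ins for the hugetlb_fault_mutex_shared_hash() and
hugetlb_fault_mutex_lock()/unlock() helpers used in the patch. The point
is only that every path touching a given (mapping, index) pair takes the
same mutex, so fault, allocation and hole punch cannot race on one page.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_FAULT_MUTEXES 64	/* hypothetical size; must be a power of two */

static pthread_mutex_t fault_mutex_table[NUM_FAULT_MUTEXES];

/* Initialize every mutex in the table; call once before any locking. */
static void fault_mutex_table_init(void)
{
	for (unsigned int i = 0; i < NUM_FAULT_MUTEXES; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);
}

/*
 * Map a (mapping, huge page index) pair to a table slot.  The mixing
 * step is an arbitrary placeholder, not the hash the kernel uses.
 */
static uint32_t fault_mutex_hash(const void *mapping, uint64_t index)
{
	uint64_t key = (uint64_t)(uintptr_t)mapping ^
		       (index * 0x9E3779B97F4A7C15ULL);

	key ^= key >> 32;
	return (uint32_t)key & (NUM_FAULT_MUTEXES - 1);
}

static void fault_mutex_lock(uint32_t hash)
{
	assert(hash < NUM_FAULT_MUTEXES);
	pthread_mutex_lock(&fault_mutex_table[hash]);
}

static void fault_mutex_unlock(uint32_t hash)
{
	assert(hash < NUM_FAULT_MUTEXES);
	pthread_mutex_unlock(&fault_mutex_table[hash]);
}

/*
 * Run op() on one page while holding the mutex for that page.  A fault
 * handler, a preallocation loop and a hole punch would all go through
 * this same pattern in the model.
 */
static int with_page_locked(const void *mapping, uint64_t index,
			    int (*op)(const void *mapping, uint64_t index))
{
	uint32_t hash = fault_mutex_hash(mapping, index);
	int ret;

	fault_mutex_lock(hash);
	ret = op(mapping, index);
	fault_mutex_unlock(hash);
	return ret;
}
```

Two different pages may hash to the same slot, which only costs some
extra serialization; correctness depends solely on the same page always
mapping to the same mutex.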

Thank you for your comments,
-- 
Mike Kravetz

> 
> Thanks,
> Naoya Horiguchi
> 
>> +			put_page(page);
>> +
>> +			hugetlb_fault_mutex_unlock(hash);
>> +			/* Keep going if we see an -EEXIST */
>> +			if (error == -EEXIST) {
>> +				error = 0;	/* do not return to user */
>> +				continue;
>> +			} else
>> +				goto out;
>> +		}
>> +
>> +		hugetlb_fault_mutex_unlock(hash);
>> +
>> +		/*
>> +		 * page_put due to reference from alloc_huge_page()
>> +		 * unlock_page because locked by add_to_page_cache()
>> +		 */
>> +		put_page(page);
>> +		unlock_page(page);
>> +	}
>> +
>> +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>> +		i_size_write(inode, offset + len);
>> +	inode->i_ctime = CURRENT_TIME;
>> +	spin_lock(&inode->i_lock);
>> +	inode->i_private = NULL;
>> +	spin_unlock(&inode->i_lock);
>> +out:
>> +	mutex_unlock(&inode->i_mutex);
>> +	return error;
>> +}
>> +
>>   static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
>>   {
>>   	struct inode *inode = dentry->d_inode;
>> @@ -804,7 +970,8 @@ const struct file_operations hugetlbfs_file_operations = {
>>   	.mmap			= hugetlbfs_file_mmap,
>>   	.fsync			= noop_fsync,
>>   	.get_unmapped_area	= hugetlb_get_unmapped_area,
>> -	.llseek		= default_llseek,
>> +	.llseek			= default_llseek,
>> +	.fallocate		= hugetlbfs_fallocate,
>>   };
>>   
>>   static const struct inode_operations hugetlbfs_dir_inode_operations = {
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 934f339..fa36b7a 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -327,6 +327,8 @@ struct huge_bootmem_page {
>>   #endif
>>   };
>>   
>> +struct page *alloc_huge_page(struct vm_area_struct *vma,
>> +				unsigned long addr, int avoid_reserve);
>>   struct page *alloc_huge_page_node(struct hstate *h, int nid);
>>   struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
>>   				unsigned long addr, int avoid_reserve);
>> @@ -481,6 +483,7 @@ static inline bool hugepages_supported(void)
>>   
>>   #else	/* CONFIG_HUGETLB_PAGE */
>>   struct hstate {};
>> +#define alloc_huge_page(v, a, r) NULL
>>   #define alloc_huge_page_node(h, nid) NULL
>>   #define alloc_huge_page_noerr(v, a, r) NULL
>>   #define alloc_bootmem_huge_page(h) NULL
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 94c6154..1e95038 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1444,7 +1444,7 @@ static long vma_commit_reservation(struct hstate *h,
>>   /* Forward declaration */
>>   static int hugetlb_acct_memory(struct hstate *h, long delta);
>>   
>> -static struct page *alloc_huge_page(struct vm_area_struct *vma,
>> +struct page *alloc_huge_page(struct vm_area_struct *vma,
>>   				    unsigned long addr, int avoid_reserve)
>>   {
>>   	struct hugepage_subpool *spool = subpool_vma(vma);
>> -- 
>> 2.1.0
