[RFC] per thread page reservation patch

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC] per thread page reservation patch
       [not found]       ` <20050107144644.GA9606@infradead.org>
@ 2005-01-07 17:16         ` Vladimir Saveliev
  2005-01-07 18:48           ` Andrew Morton
                             ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Vladimir Saveliev @ 2005-01-07 17:16 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 900 bytes --]

Hello

On Fri, 2005-01-07 at 17:46, Christoph Hellwig wrote:
>  
> > reiser4-perthread-pages.patch
> 
> this one I don't particularly object, but I'm not sure it's really
> the right thing to do.  Can you post it with a detailed description
> to linux-mm so we can kick off discussion?
> 

This patch adds an API to reserve some number of pages for exclusive use
of calling thread.  This is necessary to support operations that cannot
tolerate memory allocation failures in the middle (reiser4 uses this API
to not have to undo complex uncompleted (due to ENOMEM) filesystem tree
modifications).

Implementation is simple: reserved pages are stored in a list hanging
off calling task_struct. Caller is to manually release unused pages.

This API is supposed to be used for small reservation (no more than few
dozens of pages).
 
Looking forward for advices on how to do that better/properly
Thanks



[-- Attachment #2: reiser4-perthread-pages.patch --]
[-- Type: text/plain, Size: 6484 bytes --]



This patch adds an API to reserve some number of pages for exclusive use of
calling thread.  This is necessary to support operations that cannot tolerate
memory allocation failures in the middle.  Implementation is very simplistic
and would only work efficiently for small reservations (no more than few
dozens of pages).

Reserved pages are stored in a list hanging off calling task_struct.  This
list is only accessed by the current thread, hence, no locking is required. 
Note that this makes use of reservation from interrupt handlers impossible
without some external synchronization.  Caller is responsible for manually
releasing unused pages by calling perthread_pages_release(int nrpages) with
proper argument.

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

diff -puN include/linux/gfp.h~reiser4-perthread-pages include/linux/gfp.h


 include/linux/gfp.h       |    4 +
 include/linux/init_task.h |    4 -
 include/linux/sched.h     |    3 +
 kernel/fork.c             |    4 +
 mm/page_alloc.c           |   97 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 110 insertions(+), 2 deletions(-)

diff -puN include/linux/gfp.h~reiser4-perthread-pages include/linux/gfp.h
--- linux-2.6.10-rc3/include/linux/gfp.h~reiser4-perthread-pages	2004-12-22 20:09:44.153164276 +0300
+++ linux-2.6.10-rc3-vs/include/linux/gfp.h	2004-12-22 20:09:44.199167794 +0300
@@ -134,4 +134,8 @@ extern void FASTCALL(free_cold_page(stru
 
 void page_alloc_init(void);
 
+int  perthread_pages_reserve(int nrpages, int gfp);
+void perthread_pages_release(int nrpages);
+int  perthread_pages_count(void);
+
 #endif /* __LINUX_GFP_H */
diff -puN include/linux/init_task.h~reiser4-perthread-pages include/linux/init_task.h
--- linux-2.6.10-rc3/include/linux/init_task.h~reiser4-perthread-pages	2004-12-22 20:09:44.160164812 +0300
+++ linux-2.6.10-rc3-vs/include/linux/init_task.h	2004-12-22 20:09:44.201167947 +0300
@@ -112,8 +112,8 @@ extern struct group_info init_groups;
 	.proc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
+	.private_pages	= LIST_HEAD_INIT(tsk.private_pages),		\
+	.private_pages_count = 0,					\
 }
 
-
-
 #endif
diff -puN include/linux/sched.h~reiser4-perthread-pages include/linux/sched.h
--- linux-2.6.10-rc3/include/linux/sched.h~reiser4-perthread-pages	2004-12-22 20:09:44.170165576 +0300
+++ linux-2.6.10-rc3-vs/include/linux/sched.h	2004-12-22 20:09:44.204168176 +0300
@@ -691,6 +691,9 @@ struct task_struct {
 	nodemask_t mems_allowed;
 	int cpuset_mems_generation;
 #endif
+
+	struct list_head private_pages;	/* per-process private pages */
+	int private_pages_count;
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -puN kernel/fork.c~reiser4-perthread-pages kernel/fork.c
--- linux-2.6.10-rc3/kernel/fork.c~reiser4-perthread-pages	2004-12-22 20:09:44.179166264 +0300
+++ linux-2.6.10-rc3-vs/kernel/fork.c	2004-12-22 20:09:44.215169017 +0300
@@ -154,6 +154,10 @@ static struct task_struct *dup_task_stru
 	tsk->thread_info = ti;
 	ti->task = tsk;
 
+	/* initialize list of pages privately reserved by process */
+	INIT_LIST_HEAD(&tsk->private_pages);
+	tsk->private_pages_count = 0;
+
 	/* One for us, one for whoever does the "release_task()" (usually parent) */
 	atomic_set(&tsk->usage,2);
 	return tsk;
diff -puN mm/page_alloc.c~reiser4-perthread-pages mm/page_alloc.c
--- linux-2.6.10-rc3/mm/page_alloc.c~reiser4-perthread-pages	2004-12-22 20:09:44.187166876 +0300
+++ linux-2.6.10-rc3-vs/mm/page_alloc.c	2004-12-22 20:09:44.219169323 +0300
@@ -551,6 +551,97 @@ void fastcall free_cold_page(struct page
 	free_hot_cold_page(page, 1);
 }
 
+static inline struct list_head *get_per_thread_pages(void)
+{
+	return &current->private_pages;
+}
+
+int perthread_pages_reserve(int nrpages, int gfp)
+{
+	int i;
+	struct list_head  accumulator;
+	struct list_head *per_thread;
+
+	per_thread = get_per_thread_pages();
+	INIT_LIST_HEAD(&accumulator);
+	list_splice_init(per_thread, &accumulator);
+	for (i = 0; i < nrpages; ++i) {
+		struct page *page;
+
+		page = alloc_page(gfp);
+		if (page != NULL)
+			list_add(&page->lru, &accumulator);
+		else {
+			for (; i > 0; --i) {
+				page = list_entry(accumulator.next, struct page, lru);
+				list_del(&page->lru);
+				page_cache_release(page);
+			}
+			return -ENOMEM;
+		}
+	}
+	/*
+	 * Q: why @accumulator is used, instead of directly adding pages to
+	 * the get_per_thread_pages()?
+	 *
+	 * A: because after first page is added to the get_per_thread_pages(),
+	 * next call to the alloc_page() (on the next loop iteration), will
+	 * re-use it.
+	 */
+	list_splice(&accumulator, per_thread);
+	current->private_pages_count += nrpages;
+	return 0;
+}
+EXPORT_SYMBOL(perthread_pages_reserve);
+
+void perthread_pages_release(int nrpages)
+{
+	struct list_head *per_thread;
+
+	current->private_pages_count -= nrpages;
+	per_thread = get_per_thread_pages();
+	for (; nrpages != 0; --nrpages) {
+		struct page *page;
+
+		BUG_ON(list_empty(per_thread));
+		page = list_entry(per_thread->next, struct page, lru);
+		list_del(&page->lru);
+		page_cache_release(page);
+	}
+}
+EXPORT_SYMBOL(perthread_pages_release);
+
+int perthread_pages_count(void)
+{
+	return current->private_pages_count;
+}
+EXPORT_SYMBOL(perthread_pages_count);
+
+static inline struct page *
+perthread_pages_alloc(void)
+{
+	struct list_head *perthread_pages;
+
+	/*
+	 * try to allocate pages from the per-thread private_pages pool. No
+	 * locking is needed: this list can only be modified by the thread
+	 * itself, and not by interrupts or other threads.
+	 */
+	perthread_pages = get_per_thread_pages();
+	if (!in_interrupt() && !list_empty(perthread_pages)) {
+		struct page *page;
+
+		page = list_entry(perthread_pages->next, struct page, lru);
+		list_del(&page->lru);
+		current->private_pages_count--;
+		/*
+		 * per-thread page is already initialized, just return it.
+		 */
+		return page;
+	} else
+		return NULL;
+}
+
 /*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
@@ -667,6 +758,12 @@ __alloc_pages(unsigned int gfp_mask, uns
 	 */
 	can_try_harder = (unlikely(rt_task(p)) && !in_interrupt()) || !wait;
 
+	if (order == 0) {
+		page = perthread_pages_alloc();
+		if (page != NULL)
+			return page;
+	}
+
 	zones = zonelist->zones;  /* the list of zones suitable for gfp_mask */
 
 	if (unlikely(zones[0] == NULL)) {

_

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 17:16         ` [RFC] per thread page reservation patch Vladimir Saveliev
@ 2005-01-07 18:48           ` Andrew Morton
  2005-01-07 20:21             ` Nikita Danilov
  2005-01-07 19:05           ` Christoph Hellwig
  2005-01-07 19:14           ` Paulo Marques
  2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2005-01-07 18:48 UTC (permalink / raw)
  To: Vladimir Saveliev; +Cc: linux-mm, linux-kernel, hch

Vladimir Saveliev <vs@namesys.com> wrote:
>
> +int perthread_pages_reserve(int nrpages, int gfp)
>  +{
>  +	int i;
>  +	struct list_head  accumulator;
>  +	struct list_head *per_thread;
>  +
>  +	per_thread = get_per_thread_pages();
>  +	INIT_LIST_HEAD(&accumulator);
>  +	list_splice_init(per_thread, &accumulator);
>  +	for (i = 0; i < nrpages; ++i) {

This will end up reserving more pages than were asked for, if
current->private_pages_count is non-zero.  Deliberate?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 18:48           ` Andrew Morton
@ 2005-01-07 20:21             ` Nikita Danilov
  0 siblings, 0 replies; 27+ messages in thread
From: Nikita Danilov @ 2005-01-07 20:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, hch

Andrew Morton <akpm@osdl.org> writes:

> Vladimir Saveliev <vs@namesys.com> wrote:
>>
>> +int perthread_pages_reserve(int nrpages, int gfp)
>>  +{
>>  +	int i;
>>  +	struct list_head  accumulator;
>>  +	struct list_head *per_thread;
>>  +
>>  +	per_thread = get_per_thread_pages();
>>  +	INIT_LIST_HEAD(&accumulator);
>>  +	list_splice_init(per_thread, &accumulator);
>>  +	for (i = 0; i < nrpages; ++i) {
>
> This will end up reserving more pages than were asked for, if
> current->private_pages_count is non-zero.  Deliberate?

Yes. This is to make modular usage possible, so that


        perthread_pages_reserve(nrpages, gfp_mask);

        /* call some other code... */

        perthread_pages_release(unused_pages);

works correctly if "some other code" does per-thread reservations
too.

Nikita.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 17:16         ` [RFC] per thread page reservation patch Vladimir Saveliev
  2005-01-07 18:48           ` Andrew Morton
@ 2005-01-07 19:05           ` Christoph Hellwig
  2005-01-07 19:12             ` Christoph Hellwig
                               ` (2 more replies)
  2005-01-07 19:14           ` Paulo Marques
  2 siblings, 3 replies; 27+ messages in thread
From: Christoph Hellwig @ 2005-01-07 19:05 UTC (permalink / raw)
  To: Vladimir Saveliev; +Cc: linux-mm, Andrew Morton, linux-kernel

> diff -puN include/linux/gfp.h~reiser4-perthread-pages include/linux/gfp.h
> --- linux-2.6.10-rc3/include/linux/gfp.h~reiser4-perthread-pages	2004-12-22 20:09:44.153164276 +0300
> +++ linux-2.6.10-rc3-vs/include/linux/gfp.h	2004-12-22 20:09:44.199167794 +0300
> @@ -134,4 +134,8 @@ extern void FASTCALL(free_cold_page(stru
>  
>  void page_alloc_init(void);
>  
> +int  perthread_pages_reserve(int nrpages, int gfp);
> +void perthread_pages_release(int nrpages);
> +int  perthread_pages_count(void);
> +
>  #endif /* __LINUX_GFP_H */
> diff -puN include/linux/init_task.h~reiser4-perthread-pages include/linux/init_task.h
> --- linux-2.6.10-rc3/include/linux/init_task.h~reiser4-perthread-pages	2004-12-22 20:09:44.160164812 +0300
> +++ linux-2.6.10-rc3-vs/include/linux/init_task.h	2004-12-22 20:09:44.201167947 +0300
> @@ -112,8 +112,8 @@ extern struct group_info init_groups;
>  	.proc_lock	= SPIN_LOCK_UNLOCKED,				\
>  	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
>  	.journal_info	= NULL,						\
> +	.private_pages	= LIST_HEAD_INIT(tsk.private_pages),		\
> +	.private_pages_count = 0,					\
>  }
>  
> -
> -
>  #endif
> diff -puN include/linux/sched.h~reiser4-perthread-pages include/linux/sched.h
> --- linux-2.6.10-rc3/include/linux/sched.h~reiser4-perthread-pages	2004-12-22 20:09:44.170165576 +0300
> +++ linux-2.6.10-rc3-vs/include/linux/sched.h	2004-12-22 20:09:44.204168176 +0300
> @@ -691,6 +691,9 @@ struct task_struct {
>  	nodemask_t mems_allowed;
>  	int cpuset_mems_generation;
>  #endif
> +
> +	struct list_head private_pages;	/* per-process private pages */
> +	int private_pages_count;
>  };
>  
>  static inline pid_t process_group(struct task_struct *tsk)
> diff -puN kernel/fork.c~reiser4-perthread-pages kernel/fork.c
> --- linux-2.6.10-rc3/kernel/fork.c~reiser4-perthread-pages	2004-12-22 20:09:44.179166264 +0300
> +++ linux-2.6.10-rc3-vs/kernel/fork.c	2004-12-22 20:09:44.215169017 +0300
> @@ -154,6 +154,10 @@ static struct task_struct *dup_task_stru
>  	tsk->thread_info = ti;
>  	ti->task = tsk;
>  
> +	/* initialize list of pages privately reserved by process */
> +	INIT_LIST_HEAD(&tsk->private_pages);
> +	tsk->private_pages_count = 0;
> +
>  	/* One for us, one for whoever does the "release_task()" (usually parent) */
>  	atomic_set(&tsk->usage,2);
>  	return tsk;
> diff -puN mm/page_alloc.c~reiser4-perthread-pages mm/page_alloc.c
> --- linux-2.6.10-rc3/mm/page_alloc.c~reiser4-perthread-pages	2004-12-22 20:09:44.187166876 +0300
> +++ linux-2.6.10-rc3-vs/mm/page_alloc.c	2004-12-22 20:09:44.219169323 +0300
> @@ -551,6 +551,97 @@ void fastcall free_cold_page(struct page
>  	free_hot_cold_page(page, 1);
>  }
>  
> +static inline struct list_head *get_per_thread_pages(void)
> +{
> +	return &current->private_pages;
> +}

this is a completely useless wrapper that doesn't buy anything.

> +int perthread_pages_reserve(int nrpages, int gfp)
> +{
> +	int i;
> +	struct list_head  accumulator;
> +	struct list_head *per_thread;
> +
> +	per_thread = get_per_thread_pages();
> +	INIT_LIST_HEAD(&accumulator);
> +	list_splice_init(per_thread, &accumulator);

Can probably be written much simpler as:

int perthread_pages_reserve(int nrpages, int gfp)
{
	LIST_HEAD(accumulator);
	int i;

	list_splice_init(&current->private_pages, &accumulator);

Now the big question is, what's synchronizing access to
current->private_pages?

> +	for (i = 0; i < nrpages; ++i) {
> +		struct page *page;
> +
> +		page = alloc_page(gfp);
> +		if (page != NULL)
> +			list_add(&page->lru, &accumulator);
> +		else {
> +			for (; i > 0; --i) {
> +				page = list_entry(accumulator.next, struct page, lru);
> +				list_del(&page->lru);
> +				page_cache_release(page);
> +			}
> +			return -ENOMEM;
> +		}
> +	}

Would be much more readable with the error condition handling at the
end of the function and a goto.

> +int perthread_pages_count(void)
> +{
> +	return current->private_pages_count;
> +}
> +EXPORT_SYMBOL(perthread_pages_count);

Again a completely useless wrapper.

> +static inline struct page *
> +perthread_pages_alloc(void)
> +{
> +	struct list_head *perthread_pages;
> +
> +	/*
> +	 * try to allocate pages from the per-thread private_pages pool. No
> +	 * locking is needed: this list can only be modified by the thread
> +	 * itself, and not by interrupts or other threads.
> +	 */
> +	perthread_pages = get_per_thread_pages();
> +	if (!in_interrupt() && !list_empty(perthread_pages)) {
> +		struct page *page;
> +
> +		page = list_entry(perthread_pages->next, struct page, lru);
> +		list_del(&page->lru);
> +		current->private_pages_count--;
> +		/*
> +		 * per-thread page is already initialized, just return it.
> +		 */
> +		return page;
> +	} else
> +		return NULL;
> +}

Would be written much cleaner with a single return statement and the
comment above the function, ala:

/*
 * try to allocate pages from the per-thread private_pages pool. No
 * locking is needed: this list can only be modified by the thread
 * itself, and not by interrupts or other threads.
 */
static inline struct page *perthread_pages_alloc(void)
{
	struct list_head *perthread_pages = &current->private_pages;
	struct page *page = NULL;

	if (!in_interrupt() && !list_empty(perthread_pages)) {
		page = list_entry(perthread_pages->next, struct page, lru);
		list_del(&page->lru);
		current->private_pages_count--;
	}

	return page;
}

> +	if (order == 0) {
> +		page = perthread_pages_alloc();
> +		if (page != NULL)
> +			return page;
> +	}
> +

You should at least move the !in_interrupt() condition out here.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 19:05           ` Christoph Hellwig
@ 2005-01-07 19:12             ` Christoph Hellwig
  2005-01-07 19:21             ` Robert Love
  2005-01-07 20:48             ` Nikita Danilov
  2 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2005-01-07 19:12 UTC (permalink / raw)
  To: Christoph Hellwig, Vladimir Saveliev, linux-mm, Andrew Morton,
	linux-kernel

> Now the big question is, what's synchronizing access to
> current->private_pages?

Umm, it's obviously correct as we're thread-local.  Objection taken back :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 19:05           ` Christoph Hellwig
  2005-01-07 19:12             ` Christoph Hellwig
@ 2005-01-07 19:21             ` Robert Love
  2005-01-07 20:48             ` Nikita Danilov
  2 siblings, 0 replies; 27+ messages in thread
From: Robert Love @ 2005-01-07 19:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vladimir Saveliev, linux-mm, Andrew Morton, linux-kernel

On Fri, 2005-01-07 at 19:05 +0000, Christoph Hellwig wrote:

> int perthread_pages_reserve(int nrpages, int gfp)
> {
> 	LIST_HEAD(accumulator);
> 	int i;
> 
> 	list_splice_init(&current->private_pages, &accumulator);
> 
> Now the big question is, what's synchronizing access to
> current->private_pages?

Safe without locks so long as there is no other way to get at another
process's private_pages. ;)

Best,

	Robert Love


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 19:05           ` Christoph Hellwig
  2005-01-07 19:12             ` Christoph Hellwig
  2005-01-07 19:21             ` Robert Love
@ 2005-01-07 20:48             ` Nikita Danilov
  2005-01-07 20:54               ` Christoph Hellwig
  2 siblings, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2005-01-07 20:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vladimir Saveliev, linux-mm, Andrew Morton, linux-kernel

Christoph Hellwig <hch@infradead.org> writes:

>> diff -puN include/linux/gfp.h~reiser4-perthread-pages include/linux/gfp.h
>> --- linux-2.6.10-rc3/include/linux/gfp.h~reiser4-perthread-pages	2004-12-22 20:09:44.153164276 +0300

[...]

>
>> +int perthread_pages_count(void)
>> +{
>> +	return current->private_pages_count;
>> +}
>> +EXPORT_SYMBOL(perthread_pages_count);
>
> Again a completely useless wrapper.

I disagree. Patch introduces explicit API

int  perthread_pages_reserve(int nrpages, int gfp);
void perthread_pages_release(int nrpages);
int  perthread_pages_count(void);

sufficient to create and use per-thread reservations. Using
current->private_pages_count directly

 - makes API less uniform, not contained within single namespace
   (perthread_pages_*), and worse,

 - exhibits internal implementation detail to the user.

>

[...]

Nikita.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 20:48             ` Nikita Danilov
@ 2005-01-07 20:54               ` Christoph Hellwig
  2005-01-07 21:00                 ` Nikita Danilov
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2005-01-07 20:54 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Vladimir Saveliev, Andrew Morton, linux-kernel, linux-mm

On Fri, Jan 07, 2005 at 11:48:58PM +0300, Nikita Danilov wrote:
> sufficient to create and use per-thread reservations. Using
> current->private_pages_count directly
> 
>  - makes API less uniform, not contained within single namespace
>    (perthread_pages_*), and worse,
> 
>  - exhibits internal implementation detail to the user.

Completely disagreed, hiding all the details doesn't help for such
trivial access, it's just obsfucating things.

But looking at it the API doesn't make sense at all.  Number of pages
in the thread-pool is an internal implementation detail and no caller
must look at it - think about two callers, e.g. filesystem and iscsi
initiator using it in the same thread.

Here's an updated patch with my suggestions implemented and other goodies
such a kerneldoc comments (but completely untested so far):


--- 1.20/include/linux/gfp.h	2005-01-05 18:30:39 +01:00
+++ edited/include/linux/gfp.h	2005-01-07 20:30:20 +01:00
@@ -130,6 +130,9 @@
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr),0)
 
+extern int perthread_pages_reserve(unsigned int nrpages, unsigned int gfp_mask);
+extern void perthread_pages_release(unsigned int nrpages);
+
 void page_alloc_init(void);
 
 #endif /* __LINUX_GFP_H */
--- 1.33/include/linux/init_task.h	2005-01-05 03:48:20 +01:00
+++ edited/include/linux/init_task.h	2005-01-07 20:33:25 +01:00
@@ -112,6 +112,7 @@
 	.proc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
+	.private_pages  = LIST_HEAD_INIT(tsk.private_pages),		\
 }
 
 
--- 1.280/include/linux/sched.h	2005-01-05 03:48:20 +01:00
+++ edited/include/linux/sched.h	2005-01-07 20:32:44 +01:00
@@ -658,6 +658,7 @@
 
 /* VM state */
 	struct reclaim_state *reclaim_state;
+	struct list_head private_pages; /* per-process private pages */
 
 	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;
--- 1.229/kernel/fork.c	2005-01-05 03:48:20 +01:00
+++ edited/kernel/fork.c	2005-01-07 20:34:04 +01:00
@@ -153,6 +153,9 @@
 	tsk->thread_info = ti;
 	ti->task = tsk;
 
+	/* initialize list of pages privately reserved by process */
+	INIT_LIST_HEAD(&tsk->private_pages);
+
 	/* One for us, one for whoever does the "release_task()" (usually parent) */
 	atomic_set(&tsk->usage,2);
 	return tsk;
--- 1.249/mm/page_alloc.c	2005-01-05 18:32:52 +01:00
+++ edited/mm/page_alloc.c	2005-01-07 20:36:34 +01:00
@@ -641,6 +641,86 @@
 	return 1;
 }
 
+/**
+ * perthread_pages_reserve  --  reserve some pages for this thread
+ * @nrpages:		number of pages to reserve
+ * @gfp_mask:		constaints passed to alloc_page()
+ *
+ * This functions allocates @nrpages into a thread-local pool that
+ * __alloc_pages() will use.  Used to support operations that cannot
+ * tolerate memory allocation failures.
+ */
+int perthread_pages_reserve(unsigned int nrpages, unsigned int gfp_mask)
+{
+	LIST_HEAD(accumulator);
+	int i;
+
+	list_splice_init(&current->private_pages, &accumulator);
+
+	for (i = 0; i < nrpages; ++i) {
+		struct page *page = alloc_page(gfp_mask);
+
+		if (!page)
+			goto fail_free_pages;
+		list_add(&page->lru, &accumulator);
+	}
+
+	/*
+	 * Q: why @accumulator is used, instead of directly adding pages to
+	 * current->private_pages?
+	 * 
+	 * A: because after first page is added to the current->private_pages
+	 * next call to the alloc_page() (on the next loop iteration), will
+	 * re-use it.
+	 */
+	list_splice(&accumulator, &current->private_pages);
+	return 0;
+ fail_free_pages:
+	perthread_pages_release(i);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(perthread_pages_reserve);
+
+/**
+ * perthread_pages_release  --  release pages from thread-local pool
+ * @nrpages:		number of pages to release
+ *
+ * This functions release @nrpages that we allocated to a thread-local
+ * pool using perthread_pages_reserve().
+ */
+void perthread_pages_release(unsigned int nrpages)
+{
+	struct list_head *perthread_pages = &current->private_pages;
+
+	for (; nrpages; --nrpages) {
+		struct page *page;
+	
+		BUG_ON(list_empty(perthread_pages));
+		page = list_entry(perthread_pages->next, struct page, lru);
+		list_del(&page->lru);
+		page_cache_release(page);
+	}
+}
+EXPORT_SYMBOL_GPL(perthread_pages_release);
+
+/*
+ * Try to allocate pages from the per-thread private_pages pool. No
+ * locking is needed: this list can only be modified by the thread
+ * itself, and not by interrupts or other threads.
+ */
+static inline struct page *__alloc_perthread_page(void)
+{
+	struct list_head *perthread_pages = &current->private_pages;
+	struct page *page = NULL;
+
+	if (!list_empty(perthread_pages)) {
+		page = list_entry(perthread_pages->next, struct page, lru);
+		list_del(&page->lru);
+	}
+
+	return page;
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  *
@@ -672,6 +752,12 @@
 	int can_try_harder;
 
 	might_sleep_if(wait);
+
+	if (order == 0 && !in_interrupt()) {
+		page = __alloc_perthread_page();
+		if (page)
+			return page;
+	}
 
 	/*
 	 * The caller may dip into page reserves a bit more if the caller
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 20:54               ` Christoph Hellwig
@ 2005-01-07 21:00                 ` Nikita Danilov
  2005-01-07 21:07                   ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2005-01-07 21:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vladimir Saveliev, Andrew Morton, linux-kernel, linux-mm

Christoph Hellwig <hch@infradead.org> writes:

> On Fri, Jan 07, 2005 at 11:48:58PM +0300, Nikita Danilov wrote:
>> sufficient to create and use per-thread reservations. Using
>> current->private_pages_count directly
>> 
>>  - makes API less uniform, not contained within single namespace
>>    (perthread_pages_*), and worse,
>> 
>>  - exhibits internal implementation detail to the user.
>
> Completely disagreed, hiding all the details doesn't help for such
> trivial access, it's just obsfucating things.
>
> But looking at it the API doesn't make sense at all.  Number of pages
> in the thread-pool is an internal implementation detail and no caller
> must look at it - think about two callers, e.g. filesystem and iscsi
> initiator using it in the same thread.
>
> Here's an updated patch with my suggestions implemented and other goodies
> such a kerneldoc comments (but completely untested so far):
>
>
> --- 1.20/include/linux/gfp.h	2005-01-05 18:30:39 +01:00
> +++ edited/include/linux/gfp.h	2005-01-07 20:30:20 +01:00
> @@ -130,6 +130,9 @@
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr),0)
>  
> +extern int perthread_pages_reserve(unsigned int nrpages, unsigned int gfp_mask);
> +extern void perthread_pages_release(unsigned int nrpages);

I don't see how this can work.

        /* make conservative estimation... */
        perthread_pages_reserve(100, GFP_KERNEL);

        /* actually, use only 10 pages during atomic operation */

        /* want to release remaining 90 pages... */
        perthread_pages_release(???);

Can you elaborate how to calculate number of pages that has to be
released?

Nikita.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 21:00                 ` Nikita Danilov
@ 2005-01-07 21:07                   ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2005-01-07 21:07 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Christoph Hellwig, Vladimir Saveliev, Andrew Morton,
	linux-kernel, linux-mm

On Sat, Jan 08, 2005 at 12:00:00AM +0300, Nikita Danilov wrote:
> I don't see how this can work.
> 
>         /* make conservative estimation... */
>         perthread_pages_reserve(100, GFP_KERNEL);
> 
>         /* actually, use only 10 pages during atomic operation */
> 
>         /* want to release remaining 90 pages... */
>         perthread_pages_release(???);
> 
> Can you elaborate how to calculate number of pages that has to be
> released?

That's a good question actually, and it's true for the original
version aswell if we want (and to make the API useful we must) make
it useable for multiple callers in the same thread.

Assuming the callers strictly nest we can indeed just go back to the
orininal count, maybe with an API like:

int perthread_pages_reserve(unsigned int nrpages, unsigned int gfp_mask);
void perthread_pages_release(unsigned int nrpages_goal);


	unsigned int old_nr_pages;

	...

	old_nr_pages = perthread_pages_reserve(100, GFP_KERNEL);

	...

	perthread_pages_release(old_nr_pages);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 17:16         ` [RFC] per thread page reservation patch Vladimir Saveliev
  2005-01-07 18:48           ` Andrew Morton
  2005-01-07 19:05           ` Christoph Hellwig
@ 2005-01-07 19:14           ` Paulo Marques
  2005-01-07 19:32             ` Christoph Hellwig
  2005-01-07 20:55             ` Nikita Danilov
  2 siblings, 2 replies; 27+ messages in thread
From: Paulo Marques @ 2005-01-07 19:14 UTC (permalink / raw)
  To: Vladimir Saveliev
  Cc: linux-mm, Andrew Morton, linux-kernel, Christoph Hellwig

Vladimir Saveliev wrote:
> [...]
> +	if (order == 0) {
> +		page = perthread_pages_alloc();
> +		if (page != NULL)
> +			return page;
> +	}

I hope this has not been extensively discussed yet, and I missed the 
thread but, does everybody think this is a good thing?

This seems like a very asymmetrical behavior. If the code explicitly 
reserves pages, it should explicitly use them, or it will become 
impossible to track down who is using what (not to mention that this 
will slow down every regular user of __alloc_pages, even if it is just 
for a quick test).

Why are there specialized functions to reserve the pages, but then they 
are used through the standard __alloc_pages interface?

At the very least this test should be moved to the very beginning of the 
function. It is of no use to calculate "can_try_harder" before running 
this code if it will use a reserved page.

Having a specialized function to get the reserved pages, would also make 
the logic in "perthread_pages_reserve" more clear (i.e., that comment 
would become unnecessary), and lose the test to "in_interrupt()" in 
"perthread_pages_alloc", if I'm reading this correctly.

-- 
Paulo Marques - www.grupopie.com

"A journey of a thousand miles begins with a single step."
Lao-tzu, The Way of Lao-tzu
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 19:14           ` Paulo Marques
@ 2005-01-07 19:32             ` Christoph Hellwig
  2005-01-07 19:42               ` Andi Kleen
  2005-01-07 20:55             ` Nikita Danilov
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2005-01-07 19:32 UTC (permalink / raw)
  To: Paulo Marques
  Cc: Vladimir Saveliev, linux-mm, Andrew Morton, linux-kernel,
	Christoph Hellwig

On Fri, Jan 07, 2005 at 07:14:15PM +0000, Paulo Marques wrote:
> This seems like a very asymmetrical behavior. If the code explicitly 
> reserves pages, it should explicitly use them, or it will become 
> impossible to track down who is using what (not to mention that this 
> will slow down every regular user of __alloc_pages, even if it is just 
> for a quick test).
> 
> Why are there specialized functions to reserve the pages, but then they 
> are used through the standard __alloc_pages interface?

That seems to be the whole point of the patch, as that way it'll serve
all sub-allocators or kernel function called by the user.  Without this
behaviour the caller could have simply used a mempool.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 19:32             ` Christoph Hellwig
@ 2005-01-07 19:42               ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2005-01-07 19:42 UTC (permalink / raw)
  To: Christoph Hellwig, Paulo Marques, Vladimir Saveliev, linux-mm,
	Andrew Morton, linux-kernel

On Fri, Jan 07, 2005 at 07:32:40PM +0000, Christoph Hellwig wrote:
> On Fri, Jan 07, 2005 at 07:14:15PM +0000, Paulo Marques wrote:
> > This seems like a very asymmetrical behavior. If the code explicitly 
> > reserves pages, it should explicitly use them, or it will become 
> > impossible to track down who is using what (not to mention that this 
> > will slow down every regular user of __alloc_pages, even if it is just 
> > for a quick test).
> > 
> > Why are there specialized functions to reserve the pages, but then they 
> > are used through the standard __alloc_pages interface?
> 
> That seems to be the whole point of the patch, as that way it'll serve
> all sub-allocators or kernel function called by the user.  Without this
> behaviour the caller could have simply used a mempool.

I still don't get it why reiserfs can't put the preallocated 
pool somewhere private like all other users who do similar things
do too (and yes there are quite a lot of subsystems who do 
preallocation) 

Why pollute task_struct for this? 

-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 19:14           ` Paulo Marques
  2005-01-07 19:32             ` Christoph Hellwig
@ 2005-01-07 20:55             ` Nikita Danilov
  2005-01-07 21:24               ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2005-01-07 20:55 UTC (permalink / raw)
  To: Paulo Marques; +Cc: linux-mm, Andrew Morton, linux-kernel, Christoph Hellwig

Paulo Marques <pmarques@grupopie.com> writes:

[...]

>
> This seems like a very asymmetrical behavior. If the code explicitly
> reserves pages, it should explicitly use them, or it will become
> impossible to track down who is using what (not to mention that this
> will slow down every regular user of __alloc_pages, even if it is just
> for a quick test).
>
> Why are there specialized functions to reserve the pages, but then they
> are used through the standard __alloc_pages interface?

That's the whole idea behind this patch: at the beginning of "atomic"
operation, some number of pages is reserved. As these pages are
available through page allocator, _all_ allocations done by atomic
operation will use reserved pages transparently. For example:

        perthread_pages_reserve(nr, GFP_KERNEL);

        foo()->

            bar()->

                page = find_or_create_page(some_mapping, ...);

        perthread_pages_release(unused_pages);

find_or_create() pages will use pages reserved for this thread and,
hence, is guaranteed to succeed (given correct reservation).

Alternative is to pass some sort of handle all the way down to actual
calls to allocator, and to modify all generic code to use reservations.

Nikita.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 20:55             ` Nikita Danilov
@ 2005-01-07 21:24               ` Andrew Morton
  2005-01-07 21:24                 ` Andi Kleen
                                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Andrew Morton @ 2005-01-07 21:24 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: pmarques, linux-mm, linux-kernel, hch

Nikita Danilov <nikita@clusterfs.com> wrote:
>
> That's the whole idea behind this patch: at the beginning of "atomic"
> operation, some number of pages is reserved. As these pages are
> available through page allocator, _all_ allocations done by atomic
> operation will use reserved pages transparently. For example:
> 
>         perthread_pages_reserve(nr, GFP_KERNEL);
> 
>         foo()->
> 
>             bar()->
> 
>                 page = find_or_create_page(some_mapping, ...);
> 
>         perthread_pages_release(unused_pages);
> 
> find_or_create() pages will use pages reserved for this thread and,
> hence, is guaranteed to succeed (given correct reservation).
> 
> Alternative is to pass some sort of handle all the way down to actual
> calls to allocator, and to modify all generic code to use reservations.

Maybe I'm being thick, but I don't see how you can protect the reservation
of an outer reserver in the above way:

	perthread_pages_reserve(10);
	...				/* current->private_pages_count = 10 */
		perthread_pages_reserve(10)	/* private_pages_count = 20 */
		use 5 pages			/* private_pages_count = 15 */
		perthread_pages_release(5);

But how does the caller come up with the final "5"?

Seems better to me if prethread_pages_reserve() were to return the initial
value of private_pages_count, so the caller can do:

	old = perthread_pages_reserve(10);
		use 5 pages
	perthread_pages_release(old);

or whatever.

That kinda stinks too in a way, because both the outer and the inner
callers need to overallocate pages on behalf of the worst case user in some
deep call stack.

And the whole idea is pretty flaky really - how can one precalculate how
much memory an arbitrary md-on-dm-on-loop-on-md-on-NBD stack will want to
use?  It really would be better if we could drop the whole patch and make
reiser4 behave more sanely when its writepage is called with for_reclaim=1.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 21:24               ` Andrew Morton
@ 2005-01-07 21:24                 ` Andi Kleen
  2005-01-07 22:12                 ` Nikita Danilov
  2005-01-25 16:39                 ` reiser4 core patches: [Was: [RFC] per thread page reservation patch] Vladimir Saveliev
  2 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2005-01-07 21:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nikita Danilov, pmarques, linux-mm, linux-kernel, hch

> And the whole idea is pretty flaky really - how can one precalculate how
> much memory an arbitrary md-on-dm-on-loop-on-md-on-NBD stack will want to
> use?  It really would be better if we could drop the whole patch and make
> reiser4 behave more sanely when its writepage is called with for_reclaim=1.

Or just preallocate into a private array. 

-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 21:24               ` Andrew Morton
  2005-01-07 21:24                 ` Andi Kleen
@ 2005-01-07 22:12                 ` Nikita Danilov
  2005-01-07 23:03                   ` Andrew Morton
  2005-01-25 16:39                 ` reiser4 core patches: [Was: [RFC] per thread page reservation patch] Vladimir Saveliev
  2 siblings, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2005-01-07 22:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: pmarques, linux-mm, linux-kernel, hch

Andrew Morton <akpm@osdl.org> writes:

[...]

>
> Maybe I'm being thick, but I don't see how you can protect the reservation
> of an outer reserver in the above way:
>
> 	perthread_pages_reserve(10);
> 	...				/* current->private_pages_count = 10 */
> 		perthread_pages_reserve(10)	/* private_pages_count = 20 */
> 		use 5 pages			/* private_pages_count = 15 */
> 		perthread_pages_release(5);
>
> But how does the caller come up with the final "5"?

	wasreserved = perthread_pages_count();
	result = perthread_pages_reserve(estimate_reserve(), GFP_KERNEL);
	if (result != 0)
		return result;

    /* do something that consumes reservation */

	perthread_pages_release(perthread_pages_count() - wasreserved);

>
> Seems better to me if prethread_pages_reserve() were to return the initial
> value of private_pages_count, so the caller can do:
>
> 	old = perthread_pages_reserve(10);
> 		use 5 pages
> 	perthread_pages_release(old);
>
> or whatever.
>
> That kinda stinks too in a way, because both the outer and the inner
> callers need to overallocate pages on behalf of the worst case user in some
> deep call stack.
>
> And the whole idea is pretty flaky really - how can one precalculate how
> much memory an arbitrary md-on-dm-on-loop-on-md-on-NBD stack will want to
> use?  It really would be better if we could drop the whole patch and make
> reiser4 behave more sanely when its writepage is called with for_reclaim=1.

Reiser4 doesn't use this for ->writepage(), by the way. This is used by
tree balancing code to assure that balancing cannot get -ENOMEM in the
middle of tree modification, because undo is _so_ very complicated.

Nikita.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 22:12                 ` Nikita Danilov
@ 2005-01-07 23:03                   ` Andrew Morton
  2005-01-07 23:17                     ` Nikita Danilov
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2005-01-07 23:03 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: pmarques, linux-mm, linux-kernel, hch

Nikita Danilov <nikita@clusterfs.com> wrote:
>
> > And the whole idea is pretty flaky really - how can one precalculate how
> > much memory an arbitrary md-on-dm-on-loop-on-md-on-NBD stack will want to
> > use?  It really would be better if we could drop the whole patch and make
> > reiser4 behave more sanely when its writepage is called with for_reclaim=1.
> 
> Reiser4 doesn't use this for ->writepage(), by the way. This is used by
> tree balancing code to assure that balancing cannot get -ENOMEM in the
> middle of tree modification, because undo is _so_ very complicated.

Oh.  And that involves performing I/O, yes?

Why does the filesystem risk going oom during the rebalance anyway?  Is it
doing atomic allocations?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 23:03                   ` Andrew Morton
@ 2005-01-07 23:17                     ` Nikita Danilov
  2005-01-07 23:43                       ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2005-01-07 23:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: pmarques, linux-mm, linux-kernel, hch

Andrew Morton <akpm@osdl.org> writes:

> Nikita Danilov <nikita@clusterfs.com> wrote:
>>
>> > And the whole idea is pretty flaky really - how can one precalculate how
>> > much memory an arbitrary md-on-dm-on-loop-on-md-on-NBD stack will want to
>> > use?  It really would be better if we could drop the whole patch and make
>> > reiser4 behave more sanely when its writepage is called with for_reclaim=1.
>> 
>> Reiser4 doesn't use this for ->writepage(), by the way. This is used by
>> tree balancing code to assure that balancing cannot get -ENOMEM in the
>> middle of tree modification, because undo is _so_ very complicated.
>
> Oh.  And that involves performing I/O, yes?

Yes, balancing may read tree or bitmap nodes from the disk.

>
> Why does the filesystem risk going oom during the rebalance anyway?  Is it
> doing atomic allocations?

No, just __alloc_pages(GFP_KERNEL, 0, ...) returns NULL. When this
happens, the only thing balancing can do is to panic.

Nikita.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 23:17                     ` Nikita Danilov
@ 2005-01-07 23:43                       ` Andrew Morton
  2005-01-08 12:44                         ` Nikita Danilov
  2005-01-09 11:35                         ` Marcelo Tosatti
  0 siblings, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2005-01-07 23:43 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: pmarques, linux-mm, linux-kernel, hch

Nikita Danilov <nikita@clusterfs.com> wrote:
>
> >
> > Why does the filesystem risk going oom during the rebalance anyway?  Is it
> > doing atomic allocations?
> 
> No, just __alloc_pages(GFP_KERNEL, 0, ...) returns NULL. When this
> happens, the only thing balancing can do is to panic.

__alloc_pages(GFP_KERNEL, ...) doesn't return NULL.  It'll either succeed
or never return ;) That behaviour may change at any time of course, but it
does make me wonder why we're bothering with this at all.  Maybe it's
because of the possibility of a GFP_IO failure under your feet or
something?

What happens if reiser4 simply doesn't use this code?

If we introduce this mechanism, people will end up using it all over the
place.  Probably we could remove radix_tree_preload(), which is the only
similar code I can I can immediately think of.

Page reservation is not a bad thing per-se, but it does need serious
thought.

How does reiser4 end up deciding how many pages to reserve?  Gross
overkill?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 23:43                       ` Andrew Morton
@ 2005-01-08 12:44                         ` Nikita Danilov
  2005-01-08 13:43                           ` Hugh Dickins
  2005-01-09 11:35                         ` Marcelo Tosatti
  1 sibling, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2005-01-08 12:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: pmarques, linux-mm, linux-kernel, hch

Andrew Morton <akpm@osdl.org> writes:

> Nikita Danilov <nikita@clusterfs.com> wrote:
>>
>> >
>> > Why does the filesystem risk going oom during the rebalance anyway?  Is it
>> > doing atomic allocations?
>> 
>> No, just __alloc_pages(GFP_KERNEL, 0, ...) returns NULL. When this
>> happens, the only thing balancing can do is to panic.
>
> __alloc_pages(GFP_KERNEL, ...) doesn't return NULL.  It'll either succeed
> or never return ;) That behaviour may change at any time of course, but it

Hmm... it used to, when I wrote that code.

> does make me wonder why we're bothering with this at all.  Maybe it's
> because of the possibility of a GFP_IO failure under your feet or
> something?

This is what happens:

 - we start inserting new item into balanced tree,

 - lock nodes on the leaf level and modify them

 - go to the parent level

 - lock nodes on the parent level and modify them. This may require
   allocating new nodes. If allocation fails---we have to panic, because
   tree is in inconsistent state and there is no roll-back; if allocation
   hangs forever---deadlock is on its way, because we are still keeping
   locks on nodes on the leaf level.

>
> What happens if reiser4 simply doesn't use this code?

At the time I tested it, it panicked after getting NULL from
__alloc_pages(). With current `do_retry' logic in __alloc_pages() it
will deadlock, I guess.

>
>
> If we introduce this mechanism, people will end up using it all over the
> place.  Probably we could remove radix_tree_preload(), which is the only
> similar code I can I can immediately think of.
>
> Page reservation is not a bad thing per-se, but it does need serious
> thought.
>
> How does reiser4 end up deciding how many pages to reserve?  Gross
> overkill?

Worst-case behavior of tree algorithms is well-known. Yes it's overkill.

Nikita.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-08 12:44                         ` Nikita Danilov
@ 2005-01-08 13:43                           ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2005-01-08 13:43 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Andrew Morton, pmarques, linux-mm, linux-kernel, hch

On Sat, 8 Jan 2005, Nikita Danilov wrote:
> Andrew Morton <akpm@osdl.org> writes:
> >
> > __alloc_pages(GFP_KERNEL, ...) doesn't return NULL.  It'll either succeed
> > or never return ;) That behaviour may change at any time of course, but it
> 
> Hmm... it used to, when I wrote that code.

And still does, if OOM decides to kill _your_ task: OOM sets PF_MEMDIE,
and then you don't get to go the retry route at all:

	/* This allocation should allow future memory freeing. */
	if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
		/* go through the zonelist yet again, ignoring mins */
		for (i = 0; (z = zones[i]) != NULL; i++) {
			page = buffered_rmqueue(z, order, gfp_mask);
			if (page)
				goto got_pg;
		}
		goto nopage;
	}
....
nopage:
	....
	return NULL;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-07 23:43                       ` Andrew Morton
  2005-01-08 12:44                         ` Nikita Danilov
@ 2005-01-09 11:35                         ` Marcelo Tosatti
  2005-01-09 18:16                           ` Nikita Danilov
  1 sibling, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2005-01-09 11:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nikita Danilov, pmarques, linux-mm, linux-kernel, hch

On Fri, Jan 07, 2005 at 03:43:05PM -0800, Andrew Morton wrote:
> Nikita Danilov <nikita@clusterfs.com> wrote:
> >
> > >
> > > Why does the filesystem risk going oom during the rebalance anyway?  Is it
> > > doing atomic allocations?
> > 
> > No, just __alloc_pages(GFP_KERNEL, 0, ...) returns NULL. When this
> > happens, the only thing balancing can do is to panic.
> 
> __alloc_pages(GFP_KERNEL, ...) doesn't return NULL.  It'll either succeed
> or never return ;) That behaviour may change at any time of course, but it
> does make me wonder why we're bothering with this at all.  Maybe it's
> because of the possibility of a GFP_IO failure under your feet or
> something?
> 
> What happens if reiser4 simply doesn't use this code?
> 
> 
> If we introduce this mechanism, people will end up using it all over the
> place.  Probably we could remove radix_tree_preload(), which is the only
> similar code I can I can immediately think of.
> 
> Page reservation is not a bad thing per-se, but it does need serious
> thought.

Whenever scheme comes up I dont think the current check in __alloc_pages() is 
any good:

        if (order == 0) {
                page = perthread_pages_alloc();
                if (page != NULL)
                        return page;
        }

Two things:

- all instances of an allocator from the current thread will eat from the perthread
  reserves, you probably want only a few special allocations to eat from the reserves?
  Thing is its not really a reservation intended for emergency situations,
  rather a "generic per-thread pool" the way things are now.

- its a real fast path, we're adding quite some instructions there which are only
  used by reiserfs now.

I think in a "final" implementation emergency allocations should be explicitly stated 
as such by the callers ?

> How does reiser4 end up deciding how many pages to reserve?  Gross
> overkill?
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] per thread page reservation patch
  2005-01-09 11:35                         ` Marcelo Tosatti
@ 2005-01-09 18:16                           ` Nikita Danilov
  0 siblings, 0 replies; 27+ messages in thread
From: Nikita Danilov @ 2005-01-09 18:16 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, pmarques, linux-mm, linux-kernel, hch

Marcelo Tosatti <marcelo.tosatti@cyclades.com> writes:

[...]

>
> - all instances of an allocator from the current thread will eat from the perthread
>   reserves, you probably want only a few special allocations to eat from the reserves?
>   Thing is its not really a reservation intended for emergency situations,
>   rather a "generic per-thread pool" the way things are now.

Per-thread reservations are only useful when they are transparent. If
users have to call special function, or to pass special flag to
__alloc_pages() they might just use mempool as well. Idea is to reserve
some number of pages before starting complex operation so that generic
function (like find_get_page()) called as part of this operation are
guaranteed to succeed.

>
> - its a real fast path, we're adding quite some instructions there which are only
>   used by reiserfs now.

Yes, this worries me too. Possibly we should move this check below in
__alloc_pages(), so that per-thread reservation is only checked if
fast-path allocation failed.

Nikita.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* reiser4 core patches: [Was: [RFC] per thread page reservation patch]
  2005-01-07 21:24               ` Andrew Morton
  2005-01-07 21:24                 ` Andi Kleen
  2005-01-07 22:12                 ` Nikita Danilov
@ 2005-01-25 16:39                 ` Vladimir Saveliev
  2005-01-27 10:37                   ` Adrian Bunk
  2005-01-27 11:01                   ` Christoph Hellwig
  2 siblings, 2 replies; 27+ messages in thread
From: Vladimir Saveliev @ 2005-01-25 16:39 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig; +Cc: Nikita Danilov, linux-mm, linux-kernel

Hello

[per thread page reservation discussion is snipped]

> And the whole idea is pretty flaky really - how can one precalculate how
> much memory an arbitrary md-on-dm-on-loop-on-md-on-NBD stack will want to
> use?  It really would be better if we could drop the whole patch and make
> reiser4 behave more sanely when its writepage is called with for_reclaim=1.

ok, we will change reiser4 to keep its pool of preallocated pages
privately as Andi suggested. This will require to have
reiser4_find_or_create_page which will do what find_or_create_page does
plus get a page from the list of preallocated pages if alloc_page
returns NULL.

So, currently, reiser4 depends on the core patches listed below. Would
you please look over them and let us know which look reasonable and
which are to be eliminated.

reiser4-sb_sync_inodes.patch
This patch adds new operation (sync_inodes) to struct super_operations.
This operation allows a filesystem to writeout dirty pages not
necessarily on per-inode basis. Default implementation of this operation
is sync_sb_inodes.

reiser4-allow-drop_inode-implementation.patch
This EXPORT_SYMBOL-s inodes_stat, generic_forget_inode, destroy_inode
and wake_up_inode which are needed to implement drop_inode.
reiser4 implements function similar to generic_delete_inode to be able
to truncate inode pages together with metadata destroying in
reiser4_delete_inode whereas generic_delete_inode first truncates pages
and then calls foofs_delete_inode.

reiser4-truncate_inode_pages_range.patch
This patch makes truncate_inode_pages_range from truncate_inode_pages.
truncate_inode_pages_range can truncate only pages which fall into
specified range. truncate_inode_pages which trucates all pages starting
from specified offset is made a one liner which calls
truncate_inode_pages_range.

reiser4-rcu-barrier.patch
This patch introduces a new interface - rcu_barrier() which waits until
all the RCUs queued until this call have been completed.
This patch is by Dipankar Sarma <dipankar@in.ibm.com>

reiser4-reget-page-mapping.patch
This patch allows to remove page from page cache in foofs_releasepage.

reiser4-radix_tree_lookup_slot.patch
This patch extents radxi tree API with a function which returns pointer
to found item within the tree.

reiser4-export-remove_from_page_cache.patch
reiser4-export-page_cache_readahead.patch
reiser4-export-pagevec-funcs.patch
reiser4-export-radix_tree_preload.patch
reiser4-export-find_get_pages.patch
reiser4-export-generic_sync_sb_inodes.patch
The above patches EXPORT_SYMBOL several functions. Others may find it
useful if they were exported. 

reiser4-export-inode_lock.patch
Reiser4 used to manipulate with super block inode lists so it needs
inode_lock exported.
We are working now to not need this. But quite many things are based on
it. Is there any chance to have it included?

Thanks

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: reiser4 core patches: [Was: [RFC] per thread page reservation patch]
  2005-01-25 16:39                 ` reiser4 core patches: [Was: [RFC] per thread page reservation patch] Vladimir Saveliev
@ 2005-01-27 10:37                   ` Adrian Bunk
  2005-01-27 11:01                   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Adrian Bunk @ 2005-01-27 10:37 UTC (permalink / raw)
  To: Vladimir Saveliev
  Cc: Andrew Morton, Christoph Hellwig, Nikita Danilov, linux-mm, linux-kernel

Hi Vladimir,

a general request:

If you add new EXPORT_SYMBOL's, I'd strongly prefer them being 
EXPORT_SYMBOL_GPL's unless there's a very good reason for an 
EXPORT_SYMBOL.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: reiser4 core patches: [Was: [RFC] per thread page reservation patch]
  2005-01-25 16:39                 ` reiser4 core patches: [Was: [RFC] per thread page reservation patch] Vladimir Saveliev
  2005-01-27 10:37                   ` Adrian Bunk
@ 2005-01-27 11:01                   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2005-01-27 11:01 UTC (permalink / raw)
  To: Vladimir Saveliev; +Cc: Andrew Morton, Nikita Danilov, linux-mm, linux-kernel

> So, currently, reiser4 depends on the core patches listed below. Would
> you please look over them and let us know which look reasonable and
> which are to be eliminated.

> reiser4-sb_sync_inodes.patch
> This patch adds new operation (sync_inodes) to struct super_operations.
> This operation allows a filesystem to writeout dirty pages not
> necessarily on per-inode basis. Default implementation of this operation
> is sync_sb_inodes.

ok.

> reiser4-allow-drop_inode-implementation.patch
> This EXPORT_SYMBOL-s inodes_stat, generic_forget_inode, destroy_inode
> and wake_up_inode which are needed to implement drop_inode.
> reiser4 implements function similar to generic_delete_inode to be able
> to truncate inode pages together with metadata destroying in
> reiser4_delete_inode whereas generic_delete_inode first truncates pages
> and then calls foofs_delete_inode.

Not okay.  I though I explained you how to do it instead already?

> reiser4-truncate_inode_pages_range.patch
> This patch makes truncate_inode_pages_range from truncate_inode_pages.
> truncate_inode_pages_range can truncate only pages which fall into
> specified range. truncate_inode_pages which trucates all pages starting
> from specified offset is made a one liner which calls
> truncate_inode_pages_range.

Ok.

> reiser4-rcu-barrier.patch
> This patch introduces a new interface - rcu_barrier() which waits until
> all the RCUs queued until this call have been completed.
> This patch is by Dipankar Sarma <dipankar@in.ibm.com>

No idea, but if Dipankar things it's okay it probably is.

> reiser4-reget-page-mapping.patch
> This patch allows to remove page from page cache in foofs_releasepage.

probably ok, Andrew?

> reiser4-radix_tree_lookup_slot.patch
> This patch extents radxi tree API with a function which returns pointer
> to found item within the tree.

Idea is okay, implementation needs work:

__lookup_slot should be called radix_tree_lookup_slot and exported direcly.

radix_tree_lookup should be an inlined wrapper in the header, and kill
the != NULL comparims, it's superflous:

static inline void *radix_tree_lookup(struct radix_tree_root *root,
		unsigned long index)
{
	void **slot = radix_tree_lookup_slot(root, index);
	return slot ? *slot : NULL;
}

> reiser4-export-remove_from_page_cache.patch

Probably okay, but why do you need both remove_from_page_cache
and __remove_from_page_cache?

> reiser4-export-page_cache_readahead.patch

ok

> reiser4-export-pagevec-funcs.patch

ok

> reiser4-export-radix_tree_preload.patch

this one is nasty.  not your fault though, but why do we even
have global radix-tree preloads instead of pools.  What do you
need it for?

> reiser4-export-find_get_pages.patch

Why don't you use pagevecs instead?

> reiser4-export-generic_sync_sb_inodes.patch

Can't find this one in current -mm

> reiser4-export-inode_lock.patch
> Reiser4 used to manipulate with super block inode lists so it needs
> inode_lock exported.
> We are working now to not need this. But quite many things are based on
> it. Is there any chance to have it included?

No.  Modules must not look at inode_lock, it's far too much of an
implementation detail.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2005-01-27 11:01 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20050103011113.6f6c8f44.akpm@osdl.org>
     [not found] ` <20050103114854.GA18408@infradead.org>
     [not found]   ` <41DC2386.9010701@namesys.com>
     [not found]     ` <1105019521.7074.79.camel@tribesman.namesys.com>
     [not found]       ` <20050107144644.GA9606@infradead.org>
2005-01-07 17:16         ` [RFC] per thread page reservation patch Vladimir Saveliev
2005-01-07 18:48           ` Andrew Morton
2005-01-07 20:21             ` Nikita Danilov
2005-01-07 19:05           ` Christoph Hellwig
2005-01-07 19:12             ` Christoph Hellwig
2005-01-07 19:21             ` Robert Love
2005-01-07 20:48             ` Nikita Danilov
2005-01-07 20:54               ` Christoph Hellwig
2005-01-07 21:00                 ` Nikita Danilov
2005-01-07 21:07                   ` Christoph Hellwig
2005-01-07 19:14           ` Paulo Marques
2005-01-07 19:32             ` Christoph Hellwig
2005-01-07 19:42               ` Andi Kleen
2005-01-07 20:55             ` Nikita Danilov
2005-01-07 21:24               ` Andrew Morton
2005-01-07 21:24                 ` Andi Kleen
2005-01-07 22:12                 ` Nikita Danilov
2005-01-07 23:03                   ` Andrew Morton
2005-01-07 23:17                     ` Nikita Danilov
2005-01-07 23:43                       ` Andrew Morton
2005-01-08 12:44                         ` Nikita Danilov
2005-01-08 13:43                           ` Hugh Dickins
2005-01-09 11:35                         ` Marcelo Tosatti
2005-01-09 18:16                           ` Nikita Danilov
2005-01-25 16:39                 ` reiser4 core patches: [Was: [RFC] per thread page reservation patch] Vladimir Saveliev
2005-01-27 10:37                   ` Adrian Bunk
2005-01-27 11:01                   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox