* [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio
[not found] ` <48ECB215.4040409@linux.vnet.ibm.com>
@ 2008-10-09 15:29 ` Andrea Righi
2008-10-10 0:41 ` KAMEZAWA Hiroyuki
2008-11-10 20:58 ` [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2) Andrea Righi
0 siblings, 2 replies; 9+ messages in thread
From: Andrea Righi @ 2008-10-09 15:29 UTC
To: balbir, KAMEZAWA Hiroyuki, Michael Rubin
Cc: Andrew Morton, menage, dave, chlunde, dpshah, eric.rannaud,
fernando, agk, m.innocenti, s-uchida, ryov, matt, dradford,
KOSAKI Motohiro, linux-mm, LKML
The current granularity of 5% of dirtyable memory for dirty page writeback is
too coarse for large-memory machines, and this will get worse as the
memory-size/disk-speed ratio continues to increase.
These large writebacks can be unpleasant for desktop or latency-sensitive
environments, where the time to complete a writeback can be perceived as a
lack of responsiveness by the whole system.
So a way to define fine-grained settings is needed.
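
To give a rough sense of scale, here is a minimal sketch (not part of the patch) of how much dirtyable memory one step of a ratio covers. The helper name bytes_per_step and the 64 GiB figure used below are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of dirtyable memory covered by one step of a
 * ratio whose full scale is `full_scale` units
 * (100 for a percentage, 100000 for percent mille).
 */
static uint64_t bytes_per_step(uint64_t dirtyable_bytes, uint64_t full_scale)
{
	assert(full_scale != 0);	/* avoid dividing by zero */
	return dirtyable_bytes / full_scale;
}
```

With 64 GiB of dirtyable memory, bytes_per_step(64ULL << 30, 100) comes to roughly 655 MiB per percent step, while bytes_per_step(64ULL << 30, 100000) comes to roughly 671 KiB per pcm step.
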
What follows is a solution similar to the one discussed in [1], but I tried to
simplify things a little, in order to provide the same functionality (in
particular, to avoid backward compatibility problems) and reduce the amount
of code needed to implement an in-kernel parser that handles percentages
with decimal digits.
The kernel provides the following parameters:
- dirty_ratio, dirty_background_ratio as a percentage
(1 ... 100)
- dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille
(1 ... 100,000)
Both dirty_ratio and dirty_ratio_pcm refer to the same vm_dirty_ratio variable;
only the interface used to read/write the value differs. The same holds for
dirty_background_ratio and dirty_background_ratio_pcm.
This makes it possible to provide a fine-grained interface to configure the
writeback policy while preserving compatibility with users of the old
coarse-grained dirty_ratio / dirty_background_ratio.
Examples:
# echo 5 > /proc/sys/vm/dirty_ratio
# cat /proc/sys/vm/dirty_ratio
5
# cat /proc/sys/vm/dirty_ratio_pcm
5000
# echo 500 > /proc/sys/vm/dirty_ratio_pcm
# cat /proc/sys/vm/dirty_ratio
0
# cat /proc/sys/vm/dirty_ratio_pcm
500
# echo 5500 > /proc/sys/vm/dirty_ratio_pcm
# cat /proc/sys/vm/dirty_ratio
5
# cat /proc/sys/vm/dirty_ratio_pcm
5500
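
The behaviour in the examples above can be modelled with a short sketch. This is only an illustration of the two views onto one stored value, not the handler code from the patch; the helper names percent_to_pcm and pcm_to_percent are hypothetical, while PERCENT_PCM and ONE_HUNDRED_PCM mirror the constants the patch defines.

```c
#include <assert.h>

#define PERCENT_PCM	1000			/* pcm units per percent */
#define ONE_HUNDRED_PCM	(100 * PERCENT_PCM)	/* 100% expressed in pcm */

/* Writing N to dirty_ratio stores N * PERCENT_PCM in the shared variable. */
static int percent_to_pcm(int percent)
{
	assert(percent >= 1 && percent <= 100);
	return percent * PERCENT_PCM;
}

/* Reading dirty_ratio divides the stored value; the remainder is dropped. */
static int pcm_to_percent(int pcm)
{
	assert(pcm >= 0 && pcm <= ONE_HUNDRED_PCM);
	return pcm / PERCENT_PCM;
}
```

The integer division is why a stored value of 500 pcm reads back as 0 through dirty_ratio, and 5500 pcm reads back as 5.
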
[1] http://lkml.org/lkml/2008/10/7/230
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
Documentation/filesystems/proc.txt | 20 +++++++++
include/linux/sysctl.h | 7 +++
kernel/sysctl.c | 80 +++++++++++++++++++++++++++++++++--
kernel/sysctl_check.c | 3 +
mm/page-writeback.c | 29 ++++++++++---
5 files changed, 128 insertions(+), 11 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 394eb2c..95f31f5 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1383,6 +1383,16 @@ dirty_background_ratio
Contains, as a percentage of total system memory, the number of pages at which
the pdflush background writeback daemon will start writing out dirty data.
+dirty_background_ratio_pcm
+--------------------------
+
+A fine-grained interface to configure dirty_background_ratio.
+
+Contains, as a percentage in units of pcm (percent mille) of the dirtyable
+system memory (free pages + mapped pages + file cache, not including locked
+pages and HugePages), the number of pages at which the pdflush background
+writeback daemon will start writing out dirty data.
+
dirty_ratio
-----------------
@@ -1390,6 +1400,16 @@ Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.
+dirty_ratio_pcm
+---------------
+
+A fine-grained interface to configure dirty_ratio.
+
+Contains, as a percentage in units of pcm (percent mille) of the dirtyable
+system memory (free pages + mapped pages + file cache, not including locked
+pages and HugePages), the number of pages at which a process which is
+generating disk writes will itself start writing out dirty data.
+
dirty_writeback_centisecs
-------------------------
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 39d471d..799594b 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -32,6 +32,9 @@
struct file;
struct completion;
+#define PERCENT_PCM 1000
+#define ONE_HUNDRED_PCM (100 * PERCENT_PCM)
+
#define CTL_MAXNAME 10 /* how many path components do we allow in a
call to sysctl? In other words, what is
the largest acceptable value for the nlen
@@ -205,6 +208,8 @@ enum
VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
+ VM_DIRTY_BACKGROUND_PCM = 36, /* fine-grained dirty_background_ratio */
+ VM_DIRTY_RATIO_PCM = 37, /* fine-grained dirty_ratio */
};
@@ -991,6 +996,8 @@ extern int proc_dointvec_userhz_jiffies(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
extern int proc_dointvec_ms_jiffies(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
+extern int proc_dointvec_pcm_minmax(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
extern int proc_doulongvec_minmax(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index fcd66f1..e22ab48 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -89,9 +89,7 @@ extern int rcutorture_runnable;
#endif /* #ifdef CONFIG_RCU_TORTURE_TEST */
/* Constants used for minimum and maximum */
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_DETECT_SOFTLOCKUP)
static int one = 1;
-#endif
#ifdef CONFIG_DETECT_SOFTLOCKUP
static int sixty = 60;
@@ -104,6 +102,7 @@ static int two = 2;
static int zero;
static int one_hundred = 100;
+static int one_hundred_pcm = ONE_HUNDRED_PCM;
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
@@ -910,12 +909,23 @@ static struct ctl_table vm_table[] = {
.data = &dirty_background_ratio,
.maxlen = sizeof(dirty_background_ratio),
.mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
+ .proc_handler = &proc_dointvec_pcm_minmax,
.strategy = &sysctl_intvec,
- .extra1 = &zero,
+ .extra1 = &one,
.extra2 = &one_hundred,
},
{
+ .ctl_name = VM_DIRTY_BACKGROUND_PCM,
+ .procname = "dirty_background_ratio_pcm",
+ .data = &dirty_background_ratio,
+ .maxlen = sizeof(dirty_background_ratio),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one,
+ .extra2 = &one_hundred_pcm,
+ },
+ {
.ctl_name = VM_DIRTY_RATIO,
.procname = "dirty_ratio",
.data = &vm_dirty_ratio,
@@ -923,10 +933,21 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = &dirty_ratio_handler,
.strategy = &sysctl_intvec,
- .extra1 = &zero,
+ .extra1 = &one,
.extra2 = &one_hundred,
},
{
+ .ctl_name = VM_DIRTY_RATIO_PCM,
+ .procname = "dirty_ratio_pcm",
+ .data = &vm_dirty_ratio,
+ .maxlen = sizeof(vm_dirty_ratio),
+ .mode = 0644,
+ .proc_handler = &dirty_ratio_handler,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one,
+ .extra2 = &one_hundred_pcm,
+ },
+ {
.procname = "dirty_writeback_centisecs",
.data = &dirty_writeback_interval,
.maxlen = sizeof(dirty_writeback_interval),
@@ -2539,6 +2560,35 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
lenp, ppos, HZ, 1000l);
}
+static int do_proc_dointvec_pcm_minmax_conv(int *negp, unsigned long *lvalp,
+ int *valp, int write, void *data)
+{
+ struct do_proc_dointvec_minmax_conv_param *param = data;
+ int val;
+
+ if (write) {
+ if (*lvalp > LONG_MAX / PERCENT_PCM)
+ return -EINVAL;
+ val = *negp ? -*lvalp : *lvalp;
+ if ((param->min && *param->min > val) ||
+ (param->max && *param->max < val))
+ return -EINVAL;
+ *valp = val * PERCENT_PCM;
+ } else {
+ unsigned long lval;
+
+ val = *valp;
+ if (val < 0) {
+ *negp = -1;
+ lval = (unsigned long)-val;
+ } else {
+ *negp = 0;
+ lval = (unsigned long)val;
+ }
+ *lvalp = lval / PERCENT_PCM;
+ }
+ return 0;
+}
static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp,
int *valp,
@@ -2677,6 +2727,19 @@ int proc_dointvec_ms_jiffies(struct ctl_table *table, int write, struct file *fi
do_proc_dointvec_ms_jiffies_conv, NULL);
}
+int proc_dointvec_pcm_minmax(struct ctl_table *table, int write,
+ struct file *filp, void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ struct do_proc_dointvec_minmax_conv_param param = {
+ .min = (int *)table->extra1,
+ .max = (int *)table->extra2,
+ };
+
+ return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+ do_proc_dointvec_pcm_minmax_conv, &param);
+}
+
static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
@@ -2725,6 +2788,13 @@ int proc_dointvec_jiffies(struct ctl_table *table, int write, struct file *filp,
return -ENOSYS;
}
+int proc_dointvec_pcm_minmax(struct ctl_table *table, int write,
+ struct file *filp, void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ return -ENOSYS;
+}
+
int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index c35da23..83934a8 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -111,7 +111,9 @@ static const struct trans_ctl_table trans_vm_table[] = {
{ VM_OVERCOMMIT_MEMORY, "overcommit_memory" },
{ VM_PAGE_CLUSTER, "page-cluster" },
{ VM_DIRTY_BACKGROUND, "dirty_background_ratio" },
+ { VM_DIRTY_BACKGROUND_PCM, "dirty_background_ratio_pcm" },
{ VM_DIRTY_RATIO, "dirty_ratio" },
+ { VM_DIRTY_RATIO_PCM, "dirty_ratio_pcm" },
{ VM_DIRTY_WB_CS, "dirty_writeback_centisecs" },
{ VM_DIRTY_EXPIRE_CS, "dirty_expire_centisecs" },
{ VM_NR_PDFLUSH_THREADS, "nr_pdflush_threads" },
@@ -1494,6 +1496,7 @@ int sysctl_check_table(struct nsproxy *namespaces, struct ctl_table *table)
(table->proc_handler == proc_dostring) ||
(table->proc_handler == proc_dointvec) ||
(table->proc_handler == proc_dointvec_minmax) ||
+ (table->proc_handler == proc_dointvec_pcm_minmax) ||
(table->proc_handler == proc_dointvec_jiffies) ||
(table->proc_handler == proc_dointvec_userhz_jiffies) ||
(table->proc_handler == proc_dointvec_ms_jiffies) ||
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c6d6088..6bc8c9b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -66,7 +66,7 @@ static inline long sync_writeback_pages(void)
/*
* Start background writeback (via pdflush) at this percentage
*/
-int dirty_background_ratio = 5;
+int dirty_background_ratio = 5 * PERCENT_PCM;
/*
* free highmem will not be subtracted from the total free memory
@@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
/*
* The generator of dirty data starts writeback at this percentage
*/
-int vm_dirty_ratio = 10;
+int vm_dirty_ratio = 10 * PERCENT_PCM;
/*
* The interval between `kupdate'-style writebacks, in jiffies
@@ -135,7 +135,8 @@ static int calc_period_shift(void)
{
unsigned long dirty_total;
- dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
+ dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
+ / ONE_HUNDRED_PCM;
return 2 + ilog2(dirty_total - 1);
}
@@ -147,7 +148,23 @@ int dirty_ratio_handler(struct ctl_table *table, int write,
loff_t *ppos)
{
int old_ratio = vm_dirty_ratio;
- int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+ int ret;
+
+ switch (table->ctl_name) {
+ case VM_DIRTY_RATIO:
+ ret = proc_dointvec_pcm_minmax(table, write, filp, buffer,
+ lenp, ppos);
+ break;
+ case VM_DIRTY_RATIO_PCM:
+ ret = proc_dointvec_minmax(table, write, filp, buffer,
+ lenp, ppos);
+ break;
+ default:
+ ret = -EINVAL;
+ WARN_ON(1);
+ break;
+ }
+
if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
int shift = calc_period_shift();
prop_change_shift(&vm_completions, shift);
@@ -380,8 +397,8 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
if (background_ratio >= dirty_ratio)
background_ratio = dirty_ratio / 2;
- background = (background_ratio * available_memory) / 100;
- dirty = (dirty_ratio * available_memory) / 100;
+ background = (background_ratio * available_memory) / ONE_HUNDRED_PCM;
+ dirty = (dirty_ratio * available_memory) / ONE_HUNDRED_PCM;
tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
background += background / 4;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio
2008-10-09 15:29 ` [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio Andrea Righi
@ 2008-10-10 0:41 ` KAMEZAWA Hiroyuki
2008-10-10 9:32 ` Andrea Righi
2008-11-10 20:58 ` [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2) Andrea Righi
1 sibling, 1 reply; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-10 0:41 UTC
To: righi.andrea
Cc: balbir, Michael Rubin, Andrew Morton, menage, dave, chlunde,
dpshah, eric.rannaud, fernando, agk, m.innocenti, s-uchida, ryov,
matt, dradford, KOSAKI Motohiro, linux-mm, LKML
On Thu, 09 Oct 2008 17:29:46 +0200
Andrea Righi <righi.andrea@gmail.com> wrote:
> The current granularity of 5% of dirtyable memory for dirty pages writeback is
> too coarse for large memory machines and this will get worse as
> memory-size/disk-speed ratio continues to increase.
>
> These large writebacks can be unpleasant for desktop or latency-sensitive
> environments, where the time to complete a writeback can be perceived as a
> lack of responsiveness by the whole system.
>
> So, something to define fine grained settings is needed.
>
> Following there's a similar solution as discussed in [1], but I tried to
> simplify the things a little bit, in order to provide the same functionality
> (in particular try to avoid backward compatibility problems) and reduce the
> amount of code needed to implement an in-kernel parser to handle percentages
> with decimals digits.
>
> The kernel provides the following parameters:
> - dirty_ratio, dirty_background_ratio in percentage
> (1 ... 100)
> - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille
> (1 ... 100,000)
>
> Both dirty_ratio and dirty_ratio_pcm refer to the same vm_dirty_ratio variable,
> only the interface to read/write this value is different. The same is valid for
> dirty_background_ratio and dirty_background_ratio_pcm.
>
> In this way it's possible to provide a fine grained interface to configure the
> writeback policy and at the same time preserve the compatibility with the old
> coarse grained dirty_ratio / dirty_background_ratio users.
>
> Examples:
> # echo 5 > /proc/sys/vm/dirty_ratio
> # cat /proc/sys/vm/dirty_ratio
> 5
> # cat /proc/sys/vm/dirty_ratio_pcm
> 5000
>
> # echo 500 > /proc/sys/vm/dirty_ratio_pcm
> # cat /proc/sys/vm/dirty_ratio
> 0
> # cat /proc/sys/vm/dirty_ratio_pcm
> 500
>
> # echo 5500 > /proc/sys/vm/dirty_ratio_pcm
> # cat /proc/sys/vm/dirty_ratio
> 5
> # cat /proc/sys/vm/dirty_ratio_pcm
> 5500
>
I like this. Thanks.
<snip>
> -int dirty_background_ratio = 5;
> +int dirty_background_ratio = 5 * PERCENT_PCM;
>
> /*
> * free highmem will not be subtracted from the total free memory
> @@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
> /*
> * The generator of dirty data starts writeback at this percentage
> */
> -int vm_dirty_ratio = 10;
> +int vm_dirty_ratio = 10 * PERCENT_PCM;
>
> /*
> * The interval between `kupdate'-style writebacks, in jiffies
> @@ -135,7 +135,8 @@ static int calc_period_shift(void)
> {
> unsigned long dirty_total;
>
> - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
> + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
> + / ONE_HUNDRED_PCM;
> return 2 + ilog2(dirty_total - 1);
> }
>
I wonder... doesn't this overflow on 32-bit systems?
Thanks,
-Kame
* Re: [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio
2008-10-10 0:41 ` KAMEZAWA Hiroyuki
@ 2008-10-10 9:32 ` Andrea Righi
2008-10-10 13:13 ` Andrea Righi
0 siblings, 1 reply; 9+ messages in thread
From: Andrea Righi @ 2008-10-10 9:32 UTC
To: KAMEZAWA Hiroyuki
Cc: balbir, Michael Rubin, Andrew Morton, menage, dave, chlunde,
dpshah, eric.rannaud, fernando, agk, m.innocenti, s-uchida, ryov,
matt, dradford, KOSAKI Motohiro, linux-mm, LKML
KAMEZAWA Hiroyuki wrote:
> <snip>
>
>> -int dirty_background_ratio = 5;
>> +int dirty_background_ratio = 5 * PERCENT_PCM;
>>
>> /*
>> * free highmem will not be subtracted from the total free memory
>> @@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
>> /*
>> * The generator of dirty data starts writeback at this percentage
>> */
>> -int vm_dirty_ratio = 10;
>> +int vm_dirty_ratio = 10 * PERCENT_PCM;
>>
>> /*
>> * The interval between `kupdate'-style writebacks, in jiffies
>> @@ -135,7 +135,8 @@ static int calc_period_shift(void)
>> {
>> unsigned long dirty_total;
>>
>> - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
>> + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
>> + / ONE_HUNDRED_PCM;
>> return 2 + ilog2(dirty_total - 1);
>> }
>>
> I wonder...isn't this overflow in 32bit system ?
Correct! The worst case is (in pages):
4GB = 100,000 * determine_dirtyable_memory()
which means 42950 pages (~168MB) of dirtyable memory is enough to overflow :(.
Using a u64 for dirty_total should resolve this.
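
The arithmetic can be checked with a small standalone sketch. It is not kernel code: a 32-bit unsigned long is modelled with uint32_t, and the function names are hypothetical. Note that in C the width of a multiplication is decided by the operand types rather than by the type of the variable receiving the result, so whether a wider dirty_total is enough depends on how the product is formed.

```c
#include <assert.h>
#include <stdint.h>

#define ONE_HUNDRED_PCM 100000u

/*
 * Models a 32-bit unsigned long: the product wraps modulo 2^32 before the
 * division, regardless of how wide the returned value is.
 */
static uint64_t dirty_total_32bit_mul(uint32_t ratio_pcm, uint32_t pages)
{
	assert(ratio_pcm <= ONE_HUNDRED_PCM);
	uint32_t product = ratio_pcm * pages;	/* may wrap around */
	return product / ONE_HUNDRED_PCM;
}

/* Widening one operand first makes the multiplication itself 64-bit. */
static uint64_t dirty_total_64bit_mul(uint32_t ratio_pcm, uint32_t pages)
{
	assert(ratio_pcm <= ONE_HUNDRED_PCM);
	return ((uint64_t)ratio_pcm * pages) / ONE_HUNDRED_PCM;
}
```

With ratio_pcm at its maximum of 100,000, the 32-bit product still fits for 42949 pages and wraps at 42950 pages, which matches the figure above.
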
Delta patch is below.
Unfortunately I only have 64-bit machines right now. Maybe tomorrow I'll
be able to get a 32-bit box, if nobody tests this before then.
Thanks!
-Andrea
---
Subject: fix overflow in 32-bit systems using fine-grained dirty_ratio
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/page-writeback.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6bc8c9b..29913e5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -133,7 +133,7 @@ static struct prop_descriptor vm_dirties;
*/
static int calc_period_shift(void)
{
- unsigned long dirty_total;
+ u64 dirty_total;
dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
/ ONE_HUNDRED_PCM;
* Re: [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio
2008-10-10 9:32 ` Andrea Righi
@ 2008-10-10 13:13 ` Andrea Righi
0 siblings, 0 replies; 9+ messages in thread
From: Andrea Righi @ 2008-10-10 13:13 UTC
To: KAMEZAWA Hiroyuki
Cc: balbir, Michael Rubin, Andrew Morton, menage, dave, chlunde,
dpshah, eric.rannaud, fernando, agk, m.innocenti, s-uchida, ryov,
matt, dradford, KOSAKI Motohiro, linux-mm, LKML
Andrea Righi wrote:
> KAMEZAWA Hiroyuki wrote:
>> <snip>
>>
>>> -int dirty_background_ratio = 5;
>>> +int dirty_background_ratio = 5 * PERCENT_PCM;
>>>
>>> /*
>>> * free highmem will not be subtracted from the total free memory
>>> @@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
>>> /*
>>> * The generator of dirty data starts writeback at this percentage
>>> */
>>> -int vm_dirty_ratio = 10;
>>> +int vm_dirty_ratio = 10 * PERCENT_PCM;
>>>
>>> /*
>>> * The interval between `kupdate'-style writebacks, in jiffies
>>> @@ -135,7 +135,8 @@ static int calc_period_shift(void)
>>> {
>>> unsigned long dirty_total;
>>>
>>> - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
>>> + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
>>> + / ONE_HUNDRED_PCM;
>>> return 2 + ilog2(dirty_total - 1);
>>> }
>>>
>> I wonder...isn't this overflow in 32bit system ?
>
> Correct! the worst case is (in pages):
>
> 4GB = 100,000 * determine_dirtyable_memory()
>
> that means 42950 pages (~168MB) of dirtyable memory is enough to overflow :(.
> Using an u64 for dirty_total should resolve.
>
> Delta patch is below.
>
> Unfortunately I have all 64-bit machines right now. Maybe tomorrow I'll
> be able to get a 32-bit box, if someone doesn't test this before.
>
> Thanks!
> -Andrea
I was able to sort this out quickly by creating an i386 VM with 1GB of memory under kvm. :)
Everything seems to work fine, and with the following fix it doesn't overflow.
-Andrea
>
> ---
> Subject: fix overflow in 32-bit systems using fine-grained dirty_ratio
>
> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/page-writeback.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 6bc8c9b..29913e5 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -133,7 +133,7 @@ static struct prop_descriptor vm_dirties;
> */
> static int calc_period_shift(void)
> {
> - unsigned long dirty_total;
> + u64 dirty_total;
>
> dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
> / ONE_HUNDRED_PCM;
* [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2)
2008-10-09 15:29 ` [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio Andrea Righi
2008-10-10 0:41 ` KAMEZAWA Hiroyuki
@ 2008-11-10 20:58 ` Andrea Righi
2008-11-10 21:12 ` Andrew Morton
1 sibling, 1 reply; 9+ messages in thread
From: Andrea Righi @ 2008-11-10 20:58 UTC
To: KAMEZAWA Hiroyuki, Andrew Morton, rientjes
Cc: balbir, Michael Rubin, menage, dave, chlunde, dpshah,
eric.rannaud, fernando, agk, m.innocenti, s-uchida, ryov, matt,
dradford, KOSAKI Motohiro, linux-mm, LKML, containers
The current granularity of 5% of dirtyable memory for dirty page writeback is
too coarse for large-memory machines, and this will get worse as the
memory-size/disk-speed ratio continues to increase.
These large writebacks can be unpleasant for desktop or latency-sensitive
environments, where the time to complete each writeback can be perceived as a
lack of responsiveness by the whole system.
What follows is a solution similar to the one discussed in [1], but slightly
simplified in order to provide the same functionality (in particular,
to avoid backward compatibility problems) and reduce the amount of code
needed to implement an in-kernel parser that handles percentages with
decimal digits.
The kernel provides the following parameters:
- dirty_ratio, dirty_background_ratio in percentage (1 ... 100)
- dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille (1 ... 100,000)
Both dirty_ratio and dirty_ratio_pcm refer to the same vm_dirty_ratio variable;
only the interface used to read/write the value differs. The same holds for
dirty_background_ratio and dirty_background_ratio_pcm.
This makes it possible to provide a fine-grained interface to configure the
writeback policy while preserving compatibility with users of the old
dirty_ratio / dirty_background_ratio.
Examples:
# echo 5 > /proc/sys/vm/dirty_ratio
# cat /proc/sys/vm/dirty_ratio
5
# cat /proc/sys/vm/dirty_ratio_pcm
5000
# echo 500 > /proc/sys/vm/dirty_ratio_pcm
# cat /proc/sys/vm/dirty_ratio
0
# cat /proc/sys/vm/dirty_ratio_pcm
500
# echo 5500 > /proc/sys/vm/dirty_ratio_pcm
# cat /proc/sys/vm/dirty_ratio
5
# cat /proc/sys/vm/dirty_ratio_pcm
5500
Changelog: (v1 -> v2)
* fix overflow on 32-bit systems (calc_period_shift needs a u64)
* rebased to (and tested on) 2.6.28-rc2-mm1
[1] http://lkml.org/lkml/2008/10/7/230
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
Documentation/filesystems/proc.txt | 20 +++++++++
include/linux/sysctl.h | 7 +++
kernel/sysctl.c | 80 +++++++++++++++++++++++++++++++++--
kernel/sysctl_check.c | 3 +
mm/page-writeback.c | 31 +++++++++++---
5 files changed, 129 insertions(+), 12 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index bcceb99..38ed5bf 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1389,6 +1389,16 @@ pages + file cache, not including locked pages and HugePages), the number of
pages at which the pdflush background writeback daemon will start writing out
dirty data.
+dirty_background_ratio_pcm
+--------------------------
+
+A fine-grained interface to configure dirty_background_ratio.
+
+Contains, as a percentage in units of pcm (percent mille) of the dirtyable
+system memory (free pages + mapped pages + file cache, not including locked
+pages and HugePages), the number of pages at which the pdflush background
+writeback daemon will start writing out dirty data.
+
dirty_ratio
-----------------
@@ -1397,6 +1407,16 @@ pages + file cache, not including locked pages and HugePages), the number of
pages at which a process which is generating disk writes will itself start
writing out dirty data.
+dirty_ratio_pcm
+---------------
+
+A fine-grained interface to configure dirty_ratio.
+
+Contains, as a percentage in units of pcm (percent mille) of the dirtyable
+system memory (free pages + mapped pages + file cache, not including locked
+pages and HugePages), the number of pages at which a process which is
+generating disk writes will itself start writing out dirty data.
+
dirty_writeback_centisecs
-------------------------
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 39d471d..799594b 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -32,6 +32,9 @@
struct file;
struct completion;
+#define PERCENT_PCM 1000
+#define ONE_HUNDRED_PCM (100 * PERCENT_PCM)
+
#define CTL_MAXNAME 10 /* how many path components do we allow in a
call to sysctl? In other words, what is
the largest acceptable value for the nlen
@@ -205,6 +208,8 @@ enum
VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
+ VM_DIRTY_BACKGROUND_PCM = 36, /* fine-grained dirty_background_ratio */
+ VM_DIRTY_RATIO_PCM = 37, /* fine-grained dirty_ratio */
};
@@ -991,6 +996,8 @@ extern int proc_dointvec_userhz_jiffies(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
extern int proc_dointvec_ms_jiffies(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
+extern int proc_dointvec_pcm_minmax(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
extern int proc_doulongvec_minmax(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d14953a..06ba902 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -88,9 +88,7 @@ extern int rcutorture_runnable;
#endif /* #ifdef CONFIG_RCU_TORTURE_TEST */
/* Constants used for minimum and maximum */
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_DETECT_SOFTLOCKUP)
static int one = 1;
-#endif
#ifdef CONFIG_DETECT_SOFTLOCKUP
static int sixty = 60;
@@ -103,6 +101,7 @@ static int two = 2;
static int zero;
static int one_hundred = 100;
+static int one_hundred_pcm = ONE_HUNDRED_PCM;
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
@@ -926,12 +925,23 @@ static struct ctl_table vm_table[] = {
.data = &dirty_background_ratio,
.maxlen = sizeof(dirty_background_ratio),
.mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
+ .proc_handler = &proc_dointvec_pcm_minmax,
.strategy = &sysctl_intvec,
- .extra1 = &zero,
+ .extra1 = &one,
.extra2 = &one_hundred,
},
{
+ .ctl_name = VM_DIRTY_BACKGROUND_PCM,
+ .procname = "dirty_background_ratio_pcm",
+ .data = &dirty_background_ratio,
+ .maxlen = sizeof(dirty_background_ratio),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one,
+ .extra2 = &one_hundred_pcm,
+ },
+ {
.ctl_name = VM_DIRTY_RATIO,
.procname = "dirty_ratio",
.data = &vm_dirty_ratio,
@@ -939,10 +949,21 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = &dirty_ratio_handler,
.strategy = &sysctl_intvec,
- .extra1 = &zero,
+ .extra1 = &one,
.extra2 = &one_hundred,
},
{
+ .ctl_name = VM_DIRTY_RATIO_PCM,
+ .procname = "dirty_ratio_pcm",
+ .data = &vm_dirty_ratio,
+ .maxlen = sizeof(vm_dirty_ratio),
+ .mode = 0644,
+ .proc_handler = &dirty_ratio_handler,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one,
+ .extra2 = &one_hundred_pcm,
+ },
+ {
.procname = "dirty_writeback_centisecs",
.data = &dirty_writeback_interval,
.maxlen = sizeof(dirty_writeback_interval),
@@ -2525,6 +2546,35 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
lenp, ppos, HZ, 1000l);
}
+static int do_proc_dointvec_pcm_minmax_conv(int *negp, unsigned long *lvalp,
+ int *valp, int write, void *data)
+{
+ struct do_proc_dointvec_minmax_conv_param *param = data;
+ int val;
+
+ if (write) {
+ if (*lvalp > LONG_MAX / PERCENT_PCM)
+ return -EINVAL;
+ val = *negp ? -*lvalp : *lvalp;
+ if ((param->min && *param->min > val) ||
+ (param->max && *param->max < val))
+ return -EINVAL;
+ *valp = val * PERCENT_PCM;
+ } else {
+ unsigned long lval;
+
+ val = *valp;
+ if (val < 0) {
+ *negp = -1;
+ lval = (unsigned long)-val;
+ } else {
+ *negp = 0;
+ lval = (unsigned long)val;
+ }
+ *lvalp = lval / PERCENT_PCM;
+ }
+ return 0;
+}
static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp,
int *valp,
@@ -2663,6 +2713,19 @@ int proc_dointvec_ms_jiffies(struct ctl_table *table, int write, struct file *fi
do_proc_dointvec_ms_jiffies_conv, NULL);
}
+int proc_dointvec_pcm_minmax(struct ctl_table *table, int write,
+ struct file *filp, void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ struct do_proc_dointvec_minmax_conv_param param = {
+ .min = (int *)table->extra1,
+ .max = (int *)table->extra2,
+ };
+
+ return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+ do_proc_dointvec_pcm_minmax_conv, &param);
+}
+
static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
@@ -2711,6 +2774,13 @@ int proc_dointvec_jiffies(struct ctl_table *table, int write, struct file *filp,
return -ENOSYS;
}
+int proc_dointvec_pcm_minmax(struct ctl_table *table, int write,
+ struct file *filp, void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ return -ENOSYS;
+}
+
int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index c35da23..83934a8 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -111,7 +111,9 @@ static const struct trans_ctl_table trans_vm_table[] = {
{ VM_OVERCOMMIT_MEMORY, "overcommit_memory" },
{ VM_PAGE_CLUSTER, "page-cluster" },
{ VM_DIRTY_BACKGROUND, "dirty_background_ratio" },
+ { VM_DIRTY_BACKGROUND_PCM, "dirty_background_ratio_pcm" },
{ VM_DIRTY_RATIO, "dirty_ratio" },
+ { VM_DIRTY_RATIO_PCM, "dirty_ratio_pcm" },
{ VM_DIRTY_WB_CS, "dirty_writeback_centisecs" },
{ VM_DIRTY_EXPIRE_CS, "dirty_expire_centisecs" },
{ VM_NR_PDFLUSH_THREADS, "nr_pdflush_threads" },
@@ -1494,6 +1496,7 @@ int sysctl_check_table(struct nsproxy *namespaces, struct ctl_table *table)
(table->proc_handler == proc_dostring) ||
(table->proc_handler == proc_dointvec) ||
(table->proc_handler == proc_dointvec_minmax) ||
+ (table->proc_handler == proc_dointvec_pcm_minmax) ||
(table->proc_handler == proc_dointvec_jiffies) ||
(table->proc_handler == proc_dointvec_userhz_jiffies) ||
(table->proc_handler == proc_dointvec_ms_jiffies) ||
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3584bf..e010a39 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -66,7 +66,7 @@ static inline long sync_writeback_pages(void)
/*
* Start background writeback (via pdflush) at this percentage
*/
-int dirty_background_ratio = 5;
+int dirty_background_ratio = 5 * PERCENT_PCM;
/*
* free highmem will not be subtracted from the total free memory
@@ -77,7 +77,7 @@ int vm_highmem_is_dirtyable;
/*
* The generator of dirty data starts writeback at this percentage
*/
-int vm_dirty_ratio = 10;
+int vm_dirty_ratio = 10 * PERCENT_PCM;
/*
* The interval between `kupdate'-style writebacks, in jiffies
@@ -133,9 +133,10 @@ static struct prop_descriptor vm_dirties;
*/
static int calc_period_shift(void)
{
- unsigned long dirty_total;
+ u64 dirty_total;
- dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
+ dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
+ / ONE_HUNDRED_PCM;
return 2 + ilog2(dirty_total - 1);
}
@@ -147,7 +148,23 @@ int dirty_ratio_handler(struct ctl_table *table, int write,
loff_t *ppos)
{
int old_ratio = vm_dirty_ratio;
- int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+ int ret;
+
+ switch (table->ctl_name) {
+ case VM_DIRTY_RATIO:
+ ret = proc_dointvec_pcm_minmax(table, write, filp, buffer,
+ lenp, ppos);
+ break;
+ case VM_DIRTY_RATIO_PCM:
+ ret = proc_dointvec_minmax(table, write, filp, buffer,
+ lenp, ppos);
+ break;
+ default:
+ ret = -EINVAL;
+ WARN_ON(1);
+ break;
+ }
+
if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
int shift = calc_period_shift();
prop_change_shift(&vm_completions, shift);
@@ -380,8 +397,8 @@ get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
if (background_ratio >= dirty_ratio)
background_ratio = dirty_ratio / 2;
- background = (background_ratio * available_memory) / 100;
- dirty = (dirty_ratio * available_memory) / 100;
+ background = (background_ratio * available_memory) / ONE_HUNDRED_PCM;
+ dirty = (dirty_ratio * available_memory) / ONE_HUNDRED_PCM;
tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
background += background / 4;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
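The unit conversion done by the patch above can be sketched in user space as
follows. This is only an illustration, not the kernel code: the values of
PERCENT_PCM and ONE_HUNDRED_PCM are assumptions inferred from the documented
1 ... 100,000 range (their definitions are not part of the hunks shown here),
and the helper names are hypothetical. The limit calculation uses a 64-bit
intermediate product, which is a choice made for this sketch.

```c
#include <stdint.h>

/* Assumed values, inferred from the 1 ... 100,000 pcm range. */
#define PERCENT_PCM      1000     /* 1% expressed in percent mille */
#define ONE_HUNDRED_PCM  100000   /* 100% expressed in percent mille */

/* Write through the coarse interface: percent -> stored pcm value. */
static int percent_to_pcm(int percent)
{
    return percent * PERCENT_PCM;
}

/* Read through the coarse interface: stored pcm -> percent (truncates). */
static int pcm_to_percent(int pcm)
{
    return pcm / PERCENT_PCM;
}

/* Dirty limit in pages for a ratio given in pcm (hypothetical helper). */
static uint64_t dirty_limit_pages(int ratio_pcm, uint64_t dirtyable_pages)
{
    /* Widen before multiplying so the product cannot wrap at 32 bits. */
    return (uint64_t)ratio_pcm * dirtyable_pages / ONE_HUNDRED_PCM;
}
```

Note that a value such as 5500 pcm written through the fine-grained file
reads back as 5 through the coarse one, because the division truncates.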
* Re: [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2)
2008-11-10 20:58 ` [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2) Andrea Righi
@ 2008-11-10 21:12 ` Andrew Morton
2008-11-10 22:03 ` Andrea Righi
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2008-11-10 21:12 UTC (permalink / raw)
To: righi.andrea
Cc: kamezawa.hiroyu, rientjes, balbir, mrubin, menage, dave, chlunde,
dpshah, eric.rannaud, fernando, agk, m.innocenti, s-uchida, ryov,
matt, dradford, kosaki.motohiro, linux-mm, linux-kernel,
containers
On Mon, 10 Nov 2008 21:58:28 +0100
Andrea Righi <righi.andrea@gmail.com> wrote:
> The current granularity of 5% of dirtyable memory for dirty pages writeback is
> too coarse for large memory machines and this will get worse as
> memory-size/disk-speed ratio continues to increase.
>
> These large writebacks can be unpleasant for desktop or latency-sensitive
> environments, where the time to complete each writeback can be perceived as a
> lack of responsiveness by the whole system.
>
> Following there's a similar solution as discussed in [1], but a little
> bit simplified in order to provide the same functionality (in particular
> to avoid backward compatibility problems) and reduce the amount of code
> needed to implement an in-kernel parser to handle percentages with
> decimals digits.
>
> The kernel provides the following parameters:
> - dirty_ratio, dirty_background_ratio in percentage (1 ... 100)
> - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille (1 ... 100,000)
hm, so how long until dirty_ratio_pcm becomes too coarse...
What happened to the idea of specifying these in units of kilobytes?
* Re: [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2)
2008-11-10 21:12 ` Andrew Morton
@ 2008-11-10 22:03 ` Andrea Righi
2008-11-10 22:12 ` Andrew Morton
2008-11-10 22:15 ` David Rientjes
0 siblings, 2 replies; 9+ messages in thread
From: Andrea Righi @ 2008-11-10 22:03 UTC (permalink / raw)
To: Andrew Morton
Cc: kamezawa.hiroyu, rientjes, balbir, mrubin, menage, dave, chlunde,
dpshah, eric.rannaud, fernando, agk, m.innocenti, s-uchida, ryov,
matt, dradford, kosaki.motohiro, linux-mm, linux-kernel,
containers
On 2008-11-10 22:12, Andrew Morton wrote:
> On Mon, 10 Nov 2008 21:58:28 +0100
> Andrea Righi <righi.andrea@gmail.com> wrote:
>
>> The current granularity of 5% of dirtyable memory for dirty pages writeback is
>> too coarse for large memory machines and this will get worse as
>> memory-size/disk-speed ratio continues to increase.
>>
>> These large writebacks can be unpleasant for desktop or latency-sensitive
>> environments, where the time to complete each writeback can be perceived as a
>> lack of responsiveness by the whole system.
>>
>> Following there's a similar solution as discussed in [1], but a little
>> bit simplified in order to provide the same functionality (in particular
>> to avoid backward compatibility problems) and reduce the amount of code
>> needed to implement an in-kernel parser to handle percentages with
>> decimals digits.
>>
>> The kernel provides the following parameters:
>> - dirty_ratio, dirty_background_ratio in percentage (1 ... 100)
>> - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille (1 ... 100,000)
>
> hm, so how long until dirty_ratio_pcm becomes too coarse...
>
> What happened to the idea of specifying these in units of kilobytes?
The conclusion was that using units of KB would require much more
complexity to keep the old dirty_ratio (and dirty_background_ratio)
interface in sync with the new one.
The KB limit is a static value, while the other depends on the dirtyable
memory. If we want to preserve the same behaviour we should do the
following:
- when dirty_ratio changes to x:
dirty_amount_in_bytes = x * dirtyable_memory / 100.
- when dirty_amount_in_bytes changes to x:
dirty_ratio = x / dirtyable_memory * 100
But any time the dirtyable memory changes (or the total memory in the
system) we would have to update both values accordingly to keep them
coherent.
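A minimal user-space sketch of the bookkeeping described above, under stated
assumptions: the struct and function names are hypothetical, the ratio is
kept in whole percent, and the ratio is treated as the authoritative value
when dirtyable memory changes (the other choice is equally possible). In
integer arithmetic the multiplication has to come before the division,
otherwise x / dirtyable_memory * 100 always evaluates to 0.

```c
#include <stdint.h>

/* Hypothetical pair of tunables that must describe the same limit. */
struct dirty_limits {
    unsigned int ratio_percent;  /* old interface */
    uint64_t amount_bytes;       /* byte-based interface */
};

/* dirty_ratio written: derive the byte amount. */
static void set_ratio(struct dirty_limits *l, unsigned int percent,
                      uint64_t dirtyable_bytes)
{
    l->ratio_percent = percent;
    l->amount_bytes = (uint64_t)percent * dirtyable_bytes / 100;
}

/* Byte amount written: derive the ratio (multiply first, then divide). */
static void set_bytes(struct dirty_limits *l, uint64_t bytes,
                      uint64_t dirtyable_bytes)
{
    l->amount_bytes = bytes;
    l->ratio_percent = dirtyable_bytes ?
        (unsigned int)(bytes * 100 / dirtyable_bytes) : 0;
}

/* Dirtyable memory changed: recompute the dependent value. */
static void resync(struct dirty_limits *l, uint64_t dirtyable_bytes)
{
    set_ratio(l, l->ratio_percent, dirtyable_bytes);
}
```

Every path that changes the amount of dirtyable memory would need to call
something like resync(), which is the extra complexity mentioned above.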
I wonder if also making PERCENT_PCM (that is, 1% expressed in
fine-grained units) a parameter could be a better long-term solution.
It would also need another name, because in that case it would no
longer be a milli-percent value.
-Andrea
* Re: [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2)
2008-11-10 22:03 ` Andrea Righi
@ 2008-11-10 22:12 ` Andrew Morton
2008-11-10 22:15 ` David Rientjes
1 sibling, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2008-11-10 22:12 UTC (permalink / raw)
To: righi.andrea
Cc: kamezawa.hiroyu, rientjes, balbir, mrubin, menage, dave, chlunde,
dpshah, eric.rannaud, fernando, agk, m.innocenti, s-uchida, ryov,
matt, dradford, kosaki.motohiro, linux-mm, linux-kernel,
containers
On Mon, 10 Nov 2008 23:03:13 +0100
Andrea Righi <righi.andrea@gmail.com> wrote:
> On 2008-11-10 22:12, Andrew Morton wrote:
> > On Mon, 10 Nov 2008 21:58:28 +0100
> > Andrea Righi <righi.andrea@gmail.com> wrote:
> >
> >> The current granularity of 5% of dirtyable memory for dirty pages writeback is
> >> too coarse for large memory machines and this will get worse as
> >> memory-size/disk-speed ratio continues to increase.
> >>
> >> These large writebacks can be unpleasant for desktop or latency-sensitive
> >> environments, where the time to complete each writeback can be perceived as a
> >> lack of responsiveness by the whole system.
> >>
> >> Following there's a similar solution as discussed in [1], but a little
> >> bit simplified in order to provide the same functionality (in particular
> >> to avoid backward compatibility problems) and reduce the amount of code
> >> needed to implement an in-kernel parser to handle percentages with
> >> decimals digits.
> >>
> >> The kernel provides the following parameters:
> >> - dirty_ratio, dirty_background_ratio in percentage (1 ... 100)
> >> - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille (1 ... 100,000)
> >
> > hm, so how long until dirty_ratio_pcm becomes too coarse...
> >
> > What happened to the idea of specifying these in units of kilobytes?
>
> The conclusion was that with units in KB requires much more complexity
> to keep in sync the old dirty_ratio (and dirty_background_ratio)
> interface with the new one.
>
> The KB limit is a static value, the other depends on the dirtyable
> memory. If we want to preserve the same behaviour we should do the
> following:
>
> - when dirty_ratio changes to x:
> dirty_amount_in_bytes = x * dirtyable_memory / 100.
>
> - when dirty_amount_in_bytes changes to x:
> dirty_ratio = x / dirtyable_memory * 100
>
> But anytime the dirtyable memory changes (as well as the total memory in
> the system) we should update both values accordingly to preserve the
> coherency between them.
OK.
> I wonder if setting also PERCENT_PCM (that is 1% expressed in
> fine-grained units) as a parameter could be a better long-term solution.
> And also use another name for it, because in this case this would be not
> a milli-percent value anymore.
How about we forget the percentage thing and create
/proc/sys/vm/dirty_ratio_millionths? That will give us a few more years
of moores_law(memory size)/moores_law(disk speed) too.
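To put rough numbers on the granularity question, the smallest adjustable
step of each unit can be computed as below. This is only illustrative
arithmetic: the helper name is hypothetical and the 64 GiB figure used in
the note is an arbitrary example, not something taken from this thread.

```c
#include <stdint.h>

/*
 * Size in bytes of one step of a ratio tunable, where units_per_whole is
 * 100 for percent, 100000 for percent mille and 1000000 for millionths.
 */
static uint64_t step_bytes(uint64_t dirtyable_bytes, uint64_t units_per_whole)
{
    return dirtyable_bytes / units_per_whole;
}
```

With 64 GiB of dirtyable memory, one step is about 655 MiB in percent,
about 671 KiB in percent mille and about 67 KiB in millionths.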
* Re: [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2)
2008-11-10 22:03 ` Andrea Righi
2008-11-10 22:12 ` Andrew Morton
@ 2008-11-10 22:15 ` David Rientjes
1 sibling, 0 replies; 9+ messages in thread
From: David Rientjes @ 2008-11-10 22:15 UTC (permalink / raw)
To: Andrea Righi
Cc: Andrew Morton, kamezawa.hiroyu, balbir, mrubin, menage, dave,
chlunde, dpshah, eric.rannaud, fernando, agk, m.innocenti,
s-uchida, ryov, matt, dradford, kosaki.motohiro, linux-mm,
linux-kernel, containers
On Mon, 10 Nov 2008, Andrea Righi wrote:
> The KB limit is a static value, the other depends on the dirtyable
> memory. If we want to preserve the same behaviour we should do the
> following:
>
> - when dirty_ratio changes to x:
> dirty_amount_in_bytes = x * dirtyable_memory / 100.
>
> - when dirty_amount_in_bytes changes to x:
> dirty_ratio = x / dirtyable_memory * 100
>
I think the idea is for a dynamic dirty_ratio based on a static value
dirty_amount_in_bytes:
dirtyable_memory = determine_dirtyable_memory() * PAGE_SIZE;
dirty_ratio = dirty_amount_in_bytes / dirtyable_memory;
> But anytime the dirtyable memory changes (as well as the total memory in
> the system) we should update both values accordingly to preserve the
> coherency between them.
>
Only dirty_ratio is actually updated if dirty_amount_in_bytes is static.
This allows you to control how many pages are NR_FILE_DIRTY or
NR_UNSTABLE_NFS and gives you the granularity that you want with
dirty_ratio_pcm, but on a byte scale instead of percent.
It's also a clean interface:
echo 200M > /proc/sys/vm/dirty_ratio_bytes
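A user-space sketch of that idea, assuming a 4 KiB page size and using
hypothetical helper names: the byte value is the only thing stored, and the
page limit and the effective ratio are derived from it whenever they are
needed, so nothing has to be kept in sync when dirtyable memory changes.

```c
#include <stdint.h>

#define EXAMPLE_PAGE_SIZE 4096ULL  /* assumption: 4 KiB pages */

/* Dirty limit in pages derived from the static byte value (rounds down). */
static uint64_t dirty_limit_pages_from_bytes(uint64_t dirty_bytes)
{
    return dirty_bytes / EXAMPLE_PAGE_SIZE;
}

/* Effective ratio in percent mille, recomputed on demand. */
static uint64_t effective_ratio_pcm(uint64_t dirty_bytes,
                                    uint64_t dirtyable_pages)
{
    uint64_t dirtyable_bytes = dirtyable_pages * EXAMPLE_PAGE_SIZE;

    if (dirtyable_bytes == 0)
        return 0;
    return dirty_bytes * 100000 / dirtyable_bytes;
}
```

For a 200 MiB limit on a machine with 4 GiB of dirtyable memory this gives
51200 pages and an effective ratio of 4882 pcm (roughly 4.9%).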
end of thread, other threads:[~2008-11-10 22:15 UTC | newest]
Thread overview: 9+ messages
[not found] <1221232192-13553-1-git-send-email-righi.andrea@gmail.com>
[not found] ` <20080912131816.e0cfac7a.akpm@linux-foundation.org>
[not found] ` <532480950809221641y3471267esff82a14be8056586@mail.gmail.com>
[not found] ` <48EB4236.1060100@linux.vnet.ibm.com>
[not found] ` <48EB851D.2030300@gmail.com>
[not found] ` <20081008101642.fcfb9186.kamezawa.hiroyu@jp.fujitsu.com>
[not found] ` <48ECB215.4040409@linux.vnet.ibm.com>
2008-10-09 15:29 ` [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio Andrea Righi
2008-10-10 0:41 ` KAMEZAWA Hiroyuki
2008-10-10 9:32 ` Andrea Righi
2008-10-10 13:13 ` Andrea Righi
2008-11-10 20:58 ` [PATCH -mm] mm: fine-grained dirty_ratio_pcm and dirty_background_ratio_pcm (v2) Andrea Righi
2008-11-10 21:12 ` Andrew Morton
2008-11-10 22:03 ` Andrea Righi
2008-11-10 22:12 ` Andrew Morton
2008-11-10 22:15 ` David Rientjes