linux-mm.kvack.org archive mirror
* Re: fuse uring / wake_up on the same core
       [not found]     ` <3c0facd0-e3c7-0aa1-8b2e-961120d4f43d@ddn.com>
@ 2023-04-28  1:44       ` Hillf Danton
  2023-04-28 21:54         ` Bernd Schubert
  0 siblings, 1 reply; 7+ messages in thread
From: Hillf Danton @ 2023-04-28  1:44 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <bschubert@ddn.com>
> Btw, a very hackish way to 'solve' the issue is this
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index cd7aa679c3ee..dd32effb5010 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
>          int err;
>          int prev_cpu = task_cpu(current);
>   
> +       /* When running over uring with core-affined userspace threads, we
> +        * do not want the request-submitting process to be migrated away.
> +        * The issue is that even after waking up on the right core,
> +        * processes that have submitted requests might get migrated away,
> +        * because the ring thread is still doing a bit of work or is in
> +        * the process of going to sleep. The assumption here is that
> +        * processes are started on the right (i.e. idle) cores and can
> +        * then stay on that core when they come in and do file system
> +        * requests.
> +        * An alternative would be to set SCHED_IDLE for the ring threads,
> +        * but that would be a problem if other processes keep the cpu busy.
> +        * SCHED_IDLE or this hack here result in about a factor of 3.5
> +        * in max meta request performance.
> +        *
> +        * Ideal would be to tell the scheduler that ring threads are not
> +        * disturbing, so that migration away from them happens very rarely.
> +        */
> +       if (fc->ring.ready)
> +               migrate_disable();
> +
>          if (!fc->no_interrupt) {
>                  /* Any signal may interrupt this */
>                  err = wait_event_interruptible(req->waitq,
> 
If I understand it correctly, the seesaw workload hint to the scheduler
looks like the diff below, leaving the scheduler free to pull the two
players apart across CPUs and to migrate either one.

--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -421,6 +421,7 @@ static void __fuse_request_send(struct f
 		/* acquire extra reference, since request is still needed
 		   after fuse_request_end() */
 		__fuse_get_request(req);
+		current->seesaw = 1;
 		queue_request_and_unlock(fiq, req);
 
 		request_wait_answer(req);
@@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f
 			   fc->max_write))
 		return -EINVAL;
 
+	current->seesaw = 1;
  restart:
 	for (;;) {
 		spin_lock(&fiq->lock);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,7 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned                        in_thrashing:1;
 #endif
+	unsigned 			seesaw:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if (p->seesaw && current->seesaw)
+			return cpu;
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)



* Re: fuse uring / wake_up on the same core
  2023-04-28  1:44       ` fuse uring / wake_up on the same core Hillf Danton
@ 2023-04-28 21:54         ` Bernd Schubert
  2023-04-28 23:37           ` Hillf Danton
                             ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Bernd Schubert @ 2023-04-28 21:54 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 4/28/23 03:44, Hillf Danton wrote:
> On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <bschubert@ddn.com>
>> Btw, a very hackish way to 'solve' the issue is this
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index cd7aa679c3ee..dd32effb5010 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
>>           int err;
>>           int prev_cpu = task_cpu(current);
>>    
>> +       /* When running over uring with core-affined userspace threads, we
>> +        * do not want the request-submitting process to be migrated away.
>> +        * The issue is that even after waking up on the right core,
>> +        * processes that have submitted requests might get migrated away,
>> +        * because the ring thread is still doing a bit of work or is in
>> +        * the process of going to sleep. The assumption here is that
>> +        * processes are started on the right (i.e. idle) cores and can
>> +        * then stay on that core when they come in and do file system
>> +        * requests.
>> +        * An alternative would be to set SCHED_IDLE for the ring threads,
>> +        * but that would be a problem if other processes keep the cpu busy.
>> +        * SCHED_IDLE or this hack here result in about a factor of 3.5
>> +        * in max meta request performance.
>> +        *
>> +        * Ideal would be to tell the scheduler that ring threads are not
>> +        * disturbing, so that migration away from them happens very rarely.
>> +        */
>> +       if (fc->ring.ready)
>> +               migrate_disable();
>> +
>>           if (!fc->no_interrupt) {
>>                   /* Any signal may interrupt this */
>>                   err = wait_event_interruptible(req->waitq,
>>
> If I understand it correctly, the seesaw workload hint to the scheduler
> looks like the diff below, leaving the scheduler free to pull the two
> players apart across CPUs and to migrate either one.

Thanks a lot Hillf! I had a day off / family day today; the kernel is
now finally compiling.

> 
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -421,6 +421,7 @@ static void __fuse_request_send(struct f
>   		/* acquire extra reference, since request is still needed
>   		   after fuse_request_end() */
>   		__fuse_get_request(req);
> +		current->seesaw = 1;
>   		queue_request_and_unlock(fiq, req);
>   
>   		request_wait_answer(req);
> @@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f
>   			   fc->max_write))
>   		return -EINVAL;
>   
> +	current->seesaw = 1;

fuse_dev_do_read is the plain /dev/fuse path (read/write), where we
don't know on which core the IO threads are running, nor which of them
to wake up when an application comes in with a request.
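
All IO threads sleep on the one shared fiq->waitq - roughly like this
(trimmed from fuse_dev_do_read in fs/fuse/dev.c) - so a wakeup picks
whichever exclusive waiter queued first, regardless of which core it
runs on:

	for (;;) {
		spin_lock(&fiq->lock);
		if (!fiq->connected || request_pending(fiq))
			break;
		spin_unlock(&fiq->lock);

		/* every daemon IO thread queues here as exclusive waiter */
		err = wait_event_interruptible_exclusive(fiq->waitq,
				!fiq->connected || request_pending(fiq));
		if (err)
			return err;
	}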

There is a patch to use __wake_up_sync to wake the IO thread, with
reports that it helps performance, but I don't see that effect and I
think Miklos doesn't either. For direct-io reads I had also already
tested disabling migration - it didn't show any effect - so we better
don't set current->seesaw = 1 in fuse_dev_do_read for now.
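
For reference, that sync-wakeup idea is basically a one-line change on
the queuing side - a sketch only, assuming the wakeup still happens in
fuse_dev_wake_and_unlock():

--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ static void fuse_dev_wake_and_unlock(struct fuse_iqueue *fiq)
-	wake_up(&fiq->waitq);
+	/* sync hint: the waker is about to sleep, so the scheduler may
+	 * place the woken IO thread on the waker's CPU */
+	wake_up_interruptible_sync(&fiq->waitq);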

With my fuse-uring patches (https://lwn.net/Articles/926773/) things
are clearer: there is one IO thread per core, and the libfuse side
binds each of these threads to a single core.

nproc    /dev/fuse     /dev/fuse     fuse uring    fuse uring
         migrate on    migrate off   migrate on    migrate off
 1         2023          1652          1151          3998
 2         3375          2805          2221          7950
 4         3823          4193          4540         15022
 8         7796          8161          7846         22591
16         8520          8518         12235         27864
24         8361          8084          9415         27864
32         8361          8084          9124         12971

(in MiB/s)

So core affinity really matters, and with core affinity fuse-uring is
always faster than the existing code.

For single-threaded metadata workloads (file creates/stat/unlink) the
difference between migrate on/off is rather similar. I'm going to run
with multiple processes during the next days.

For paged (async) IO it behaves a bit differently, as there uring can
show its strength and multiple requests can be combined during CQE
processing - it is better to choose an idle ring thread on another
core. I actually have a question about that as well - later.


>    restart:
>   	for (;;) {
>   		spin_lock(&fiq->lock);
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -953,6 +953,7 @@ struct task_struct {
>   	/* delay due to memory thrashing */
>   	unsigned                        in_thrashing:1;
>   #endif
> +	unsigned 			seesaw:1;
>   
>   	unsigned long			atomic_flags; /* Flags requiring atomic access. */
>   
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
>   	if (wake_flags & WF_TTWU) {
>   		record_wakee(p);
>   
> +		if (p->seesaw && current->seesaw)
> +			return cpu;
>   		if (sched_energy_enabled()) {
>   			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>   			if (new_cpu >= 0)


Hmm, WF_CURRENT_CPU works rather similarly, except that it also tests
whether cpu is in cpus_ptr?  The combination of both patches results in

		if (p->seesaw && current->seesaw)
			return cpu;

		if ((wake_flags & WF_CURRENT_CPU) &&
		    cpumask_test_cpu(cpu, p->cpus_ptr))
			return cpu;



While writing this mail the kernel compilation finished, but it got
late; I will test in the morning.


Thanks again,
Bernd


* Re: fuse uring / wake_up on the same core
  2023-04-28 21:54         ` Bernd Schubert
@ 2023-04-28 23:37           ` Hillf Danton
  2023-05-01 21:44           ` Bernd Schubert
       [not found]           ` <20230502003335.3253-1-hdanton@sina.com>
  2 siblings, 0 replies; 7+ messages in thread
From: Hillf Danton @ 2023-04-28 23:37 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 28 Apr 2023 21:54:51 +0000 Bernd Schubert <bschubert@ddn.com>
> > @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
> >   	if (wake_flags & WF_TTWU) {
> >   		record_wakee(p);
> >   
> > +		if (p->seesaw && current->seesaw)
> > +			return cpu;
> >   		if (sched_energy_enabled()) {
> >   			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >   			if (new_cpu >= 0)
> 
> Hmm, WF_CURRENT_CPU works rather similarly, except that it also tests
> whether cpu is in cpus_ptr?  The combination of both patches results in

I missed checking cpu against p's cpumask. Good catch.



* Re: fuse uring / wake_up on the same core
  2023-04-28 21:54         ` Bernd Schubert
  2023-04-28 23:37           ` Hillf Danton
@ 2023-05-01 21:44           ` Bernd Schubert
       [not found]           ` <20230502003335.3253-1-hdanton@sina.com>
  2 siblings, 0 replies; 7+ messages in thread
From: Bernd Schubert @ 2023-05-01 21:44 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 4/28/23 23:54, Bernd Schubert wrote:
> On 4/28/23 03:44, Hillf Danton wrote:
>>    restart:
>>       for (;;) {
>>           spin_lock(&fiq->lock);
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -953,6 +953,7 @@ struct task_struct {
>>       /* delay due to memory thrashing */
>>       unsigned                        in_thrashing:1;
>>   #endif
>> +    unsigned             seesaw:1;
>>       unsigned long            atomic_flags; /* Flags requiring atomic access. */
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
>>       if (wake_flags & WF_TTWU) {
>>           record_wakee(p);
>> +        if (p->seesaw && current->seesaw)
>> +            return cpu;
>>           if (sched_energy_enabled()) {
>>               new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>               if (new_cpu >= 0)
> 
> 
> Hmm, WF_CURRENT_CPU works rather similarly, except that it also tests
> whether cpu is in cpus_ptr?  The combination of both patches results in
> 
>          if (p->seesaw && current->seesaw)
>              return cpu;
> 
>          if ((wake_flags & WF_CURRENT_CPU) &&
>              cpumask_test_cpu(cpu, p->cpus_ptr))
>              return cpu;
> 
> 
> 
> While writing this mail the kernel compilation finished, but it got
> late; I will test in the morning.

This works wonders!  The fuse-uring part is this

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..ec5853ca9646 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -373,6 +373,9 @@ static void request_wait_answer(struct fuse_req *req)
         int err;
         int prev_cpu = task_cpu(current);
  
+       if (fc->ring.per_core_queue)
+               current->seesaw = 1;
+
         if (!fc->no_interrupt) {
                 /* Any signal may interrupt this */
                 err = wait_event_interruptible(req->waitq,
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7d327699b4c5..715741ed58bf 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
                         /* XXX error injection or test with malicious daemon */
                 }
  
+               /* In combination with the seesaw setting of the requesting
+                * process (application, see request_wait_answer), the
+                * application will stay on the same core.
+                */
+               if (fc->ring.per_core_queue)
+                       current->seesaw = 1;
+
                 ret = fuse_uring_fetch(ring_ent, cmd);
                 break;
         case FUSE_URING_REQ_COMMIT_AND_FETCH:




I'm not familiar at all with the scheduler code;
given that this works perfectly, it suggests the same function is also
called without an explicit waitq, when the scheduler preempts a task?

I think there might be side effects - what if multiple
applications are on one core and another core becomes available?
With this flag they would stay on the same core? Maybe two flags would be better?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..07783ddaec5c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,8 @@ struct task_struct {
         /* delay due to memory thrashing */
         unsigned                        in_thrashing:1;
  #endif
+       unsigned                        seesaw_req:1;
+       unsigned                        seesaw_io:1;
  
         unsigned long                   atomic_flags; /* Flags requiring atomic access. */
  
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9d6ed7585c6..474bf3657ef0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
         if (wake_flags & WF_TTWU) {
                 record_wakee(p);
  
+               /* current is handling requests on behalf of the waking process;
+                * both want to run on the same core in a seesaw manner.
+                */
+               if (p->seesaw_req && current->seesaw_io &&
+                   cpumask_test_cpu(cpu, p->cpus_ptr))
+                       return cpu;
+
                 if ((wake_flags & WF_CURRENT_CPU) &&
                     cpumask_test_cpu(cpu, p->cpus_ptr))
                         return cpu;

(not tested yet)


Thanks,
Bernd


* Re: fuse uring / wake_up on the same core
       [not found]           ` <20230502003335.3253-1-hdanton@sina.com>
@ 2023-05-03 17:04             ` Bernd Schubert
  2023-05-04  2:16               ` Hillf Danton
  0 siblings, 1 reply; 7+ messages in thread
From: Bernd Schubert @ 2023-05-03 17:04 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 5/2/23 02:33, Hillf Danton wrote:
> On 1 May 2023 21:44:48 +0000 Bernd Schubert <bschubert@ddn.com>
>> I'm not familiar at all with the scheduler code;
>> given that this works perfectly, it suggests the same function is also
>> called without an explicit waitq, when the scheduler preempts a task?
> 
> Please see comment below in the select_task function.
>>
>> I think there might be side effects - what if multiple
>> applications are on one core and another core becomes available?
>> With this flag they would stay on the same core? Maybe two flags would be better?
> 
> Gooddd question. Given multiple seesaws,
> 
> 	Player i1	Player j1
> 	|		|
> 	Seesaw i	Seesaw j
> 	|		|
> 	P i2		P j2
> 
> what is ideal is to run Seesaws i and j on different CPU cores.
> 
> We can do that by replacing the seesaw flag in the task struct with a
> seesaw id for instance. But I have no idea how complex it will become
> to set up multiple seesaw workloads on the fuse side, by grouping and
> binding players to different seesaws,
> 
> -	unsigned			seesaw:1;
> +	unsigned int			seesaw;
>
> while the corresponding change to scheduler looks like.
> 
> +               if (p->seesaw && p->seesaw == current->seesaw &&
> +                   cpumask_test_cpu(cpu, p->cpus_ptr))
> +                       return cpu;


Hmm, how is the seesaw id assigned, and assuming two processes landed
on the same core but later another core becomes available, how does it
dynamically change the ID?
My idea with two bits was that there is a fuse ring thread bound to a
specific core - it is the IO processor and gets the seesaw_proc bit.
The application submits requests and gets the seesaw_req bit set.
Two applications running on the same core won't disturb each other that way.

In addition, if the application is not submitting requests anymore but
is, let's say, busy doing computations, we want a timeout to let it
move away if another core is more suitable. What do you think about
the new patch version at the end of the mail? It uses two bits and jiffies.
I just tested it and it works fine. The exact timeout period is
certainly debatable. I also feel a bit bad about taking so many bits
in struct task. If this approach is acceptable, the jiffies parameter
could probably be a u16.
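
A u16 would be enough because only a short window matters; it wraps
after 65536 jiffies (about 65 seconds at HZ=1000), so the comparison
has to be wrap-safe. A sketch, with seesaw_stamp being the hypothetical
u16 replacement for seesaw_jiffies:

	/* seesaw_stamp is set to (u16)jiffies in request_wait_answer();
	 * unsigned subtraction makes the window check wrap-safe */
	u16 delta = (u16)jiffies - p->seesaw_stamp;

	if (p->seesaw_req && current->seesaw_proc &&
	    delta < SEESAW_TIMEOUT &&		/* hypothetical constant */
	    cpumask_test_cpu(cpu, p->cpus_ptr))
		return cpu;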

> 
> Even after the job is done for fuse, the emerging question is how to set
> up seesaw workloads for crypto for example, if no more than the seesaw id
> hint to the scheduler is preferred.
> 
> And it makes me wonder why Prateek's work stalled for quite a while,
> as more is needed for userspace hints to the scheduler to work for
> workloads other than seesaw.

I just quickly went over these a bit; assuming seesaw doesn't get accepted
and we need them, I think they would need a bit of modification for fuse


> +		/*
> +		 * Handle the case where a hint is set and the current CPU
> +		 * and the previous CPU where task ran don't share caches.
> +		 */
> +		if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {

I'm testing on an older Xeon system (E5-2650) and tried different settings
with numa binding of the application (the fuse ring thread is bound anyway)

                                        
governor                                  conservative    performance
                                           (default)
application cpu 0, ring cpu 0   creates/s    ~9200          ~9500
application cpu 0, ring cpu 16  creates/s    ~4200          ~8000
application cpu 0, ring cpu 1   creates/s    ~4200          ~8500


No idea why cpu 1 gives better results in performance mode than cpu 16 -
it might be within measurement accuracy. CPU frequency definitely has the
largest effect here - the cpus_share_cache() condition is not ideal for
fuse. And I guess asking users to use the cpu performance governor for
fuse is also too much to ask - other file systems don't have that
requirement... So far your seesaw idea works best (the modified version
in combination with wake-on-same-core).

>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 63d242164b1a..07783ddaec5c 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -953,6 +953,8 @@ struct task_struct {
>>           /* delay due to memory thrashing */
>>           unsigned                        in_thrashing:1;
>>    #endif
>> +       unsigned                        seesaw_req:1;
>> +       unsigned                        seesaw_io:1;
>>    
>>           unsigned long                   atomic_flags; /* Flags requiring atomic access. */
>>    
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b9d6ed7585c6..474bf3657ef0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>           if (wake_flags & WF_TTWU) {
>>                   record_wakee(p);
> 
> Seesaw does not work without WF_TTWU as per define.

What does WF_TTWU actually mean? Something like work flow time to wake unit?
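
For reference, judging from kernel/sched/sched.h, TTWU seems to stand
for try_to_wake_up(), i.e. the flag marks a regular wakeup; the values
below are from my tree and may differ elsewhere:

/* wake flags handed to select_task_rq() */
#define WF_EXEC		0x02	/* wakeup after exec; maps to SD_BALANCE_EXEC */
#define WF_FORK		0x04	/* wakeup after fork; maps to SD_BALANCE_FORK */
#define WF_TTWU		0x08	/* regular wakeup via try_to_wake_up() */
#define WF_SYNC		0x10	/* the waker goes to sleep after the wakeup */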

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..6da0de4ae9ca 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -411,6 +411,20 @@ static void request_wait_answer(struct fuse_req *req)
  	 * Wait it out.
  	 */
  	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+
+	/*
+	 * __wake_up_on_current_cpu ensures we wake up on the right core;
+	 * after that we still want to stay on the same core, shared with
+	 * a ring thread, to submit the next request to it. The issue
+	 * without seesaw is that while the ring thread is on its way to
+	 * wait, it disturbs the application, which might get migrated away.
+	 */
+	if (fc->ring.per_core_queue) {
+		current->seesaw_req = 1;
+		current->seesaw_jiffies = jiffies;
+	}
+
+
  out:
  	if (prev_cpu != task_cpu(current))
  		pr_devel("%s cpu switch from=%d to=%d\n",
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7d327699b4c5..73adc2b16778 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
  			/* XXX error injection or test with malicious daemon */
  		}
  
+		/* In combination with the seesaw setting of the requesting
+		 * process (application, see request_wait_answer), the
+		 * application will stay on the same core.
+		 */
+		if (fc->ring.per_core_queue)
+			current->seesaw_proc = 1;
+
  		ret = fuse_uring_fetch(ring_ent, cmd);
  		break;
  	case FUSE_URING_REQ_COMMIT_AND_FETCH:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..53d9c77672b7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,13 @@ struct task_struct {
  	/* delay due to memory thrashing */
  	unsigned                        in_thrashing:1;
  #endif
+	/* requesting task */
+	unsigned 			seesaw_req:1;
+	/* request processing task */
+	unsigned			seesaw_proc:1;
+
+	/* limit seesaw time slot */
+	unsigned long			seesaw_jiffies;
  
  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
  
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9d6ed7585c6..a14161e6e456 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7605,6 +7605,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
  	if (wake_flags & WF_TTWU) {
  		record_wakee(p);
  
+		/*
+		 * Current is handling requests on behalf of the waking process;
+		 * both want to run on the same core in a seesaw manner.
+		 * Typically current is bound to one core and only p is allowed
+		 * to move freely.
+		 */
+		if (p->seesaw_req && current->seesaw_proc &&
+		    time_after(jiffies, p->seesaw_jiffies + 10),
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
  		if ((wake_flags & WF_CURRENT_CPU) &&
  		    cpumask_test_cpu(cpu, p->cpus_ptr))
  			return cpu;


Thanks,
Bernd


* Re: fuse uring / wake_up on the same core
  2023-05-03 17:04             ` Bernd Schubert
@ 2023-05-04  2:16               ` Hillf Danton
  2023-05-05 13:10                 ` Bernd Schubert
  0 siblings, 1 reply; 7+ messages in thread
From: Hillf Danton @ 2023-05-04  2:16 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 3 May 2023 17:04:25 +0000 Bernd Schubert <bschubert@ddn.com>
> On 5/2/23 02:33, Hillf Danton wrote:
> > On 1 May 2023 21:44:48 +0000 Bernd Schubert <bschubert@ddn.com>
> >> I'm not familiar at all with the scheduler code;
> >> given that this works perfectly, it suggests the same function is also
> >> called without an explicit waitq, when the scheduler preempts a task?
> > 
> > Please see comment below in the select_task function.
> >>
> >> I think there might be side effects - what if multiple
> >> applications are on one core and another core becomes available?
> >> With this flag they would stay on the same core? Maybe two flags would be better?
> > 
> > Gooddd question. Given multiple seesaws,
> > 
> > 	Player i1	Player j1
> > 	|		|
> > 	Seesaw i	Seesaw j
> > 	|		|
> > 	P i2		P j2
> > 
> > what is ideal is to run Seesaws i and j on different CPU cores.
> > 
> > We can do that by replacing the seesaw flag in the task struct with a
> > seesaw id for instance. But I have no idea how complex it will become
> > to set up multiple seesaw workloads on the fuse side, by grouping and
> > binding players to different seesaws,
> > 
> > -	unsigned			seesaw:1;
> > +	unsigned int			seesaw;
> > while the corresponding change to scheduler looks like.
> > 
> > +               if (p->seesaw && p->seesaw == current->seesaw &&
> > +                   cpumask_test_cpu(cpu, p->cpus_ptr))
> > +                       return cpu;
> 
> 
> Hmm, how is the seesaw id assigned, and assuming two processes landed
> on the same core but later another core becomes available, how does it
> dynamically change the ID?

Gooood question again. The seesaw id is freely assigned in the space
unsigned int can offer, 1 <= seesaw id < UINT_MAX, independent of
anything else, and once assigned it will not change throughout its
lifespan, because there is no link between the seesaw id and a CPU
core id.

The main point of the seesaw id is to help group/bind players on the
fuse side, and to help select the current CPU on the scheduler side.
It also prevents player i2 from making a phone call to player j1.

Beyond that there is nothing more it can do, so nothing changes in how
the scheduler handles an idle core.
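
On the fuse side, assignment could be as simple as a global counter,
one id per connection/ring - a sketch with hypothetical field names:

	/* one seesaw id per fuse connection, assigned once at setup */
	static atomic_t seesaw_next_id = ATOMIC_INIT(0);

	fc->ring.seesaw_id = atomic_inc_return(&seesaw_next_id);
	if (!fc->ring.seesaw_id)	/* 0 means "no seesaw", skip on wrap */
		fc->ring.seesaw_id = atomic_inc_return(&seesaw_next_id);

	/* both players then tag themselves with the connection's id */
	current->seesaw = fc->ring.seesaw_id;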

> My idea with two bits was that there is a fuse ring thread bound to a
> specific core - it is the IO processor and gets the seesaw_proc bit.

Setting CPU affinity is another userspace hint, independent of seesaw.

> The application submits requests and gets the seesaw_req bit set.
> Two applications running on the same core won't disturb each other that way.

Yeah, the scheduler is free to migrate any seesaw_req task away if
needed, but seesaw will pull it back at the next wakeup.

	req I1 ... In
	|
	seesaw-I on core-K 
	|
	proc I

After binding proc-I to core-W for instance, if req-I2 is migrated to
core-X then it makes no sense to pull proc-I to core-X upon wakeup.
Actually this will not happen, because of the cpumask test for proc-I.

		    cpumask_test_cpu(cpu, p->cpus_ptr))
> 
> In addition, if the application is not submitting requests anymore but
> is, let's say, busy doing computations, we want a timeout to let it
> move away if another core is more suitable.

If I understand fuse correctly, in request_wait_answer() seesaw helps
only if the application sleeps on req->waitq (in other words, seesaw
does not work without WF_TTWU), and the scheduler does migration for free.

Note migration is the scheduler's job, having nothing to do with
seesaw; otherwise PeterZ will roar like a lion.

> What do you think about
> the new patch version at the end of the mail? It uses two bits and jiffies.

See comment below in select_task_rq_fair().

What we are doing is just showing PeterZ that a userspace hint for the
fuse seesaw workload works; feel free to do whatever you like, because
you know fuse much better than I do.

> I just tested it and it works fine. The exact timeout period is
> certainly debatable. I also feel a bit bad about taking so many bits
> in struct task. If this approach is acceptable, the jiffies parameter
> could probably be a u16.
> 
> > 
> > Even after the job is done for fuse, the emerging question is how to set
> > up seesaw workloads for crypto for example, if no more than the seesaw id
> > hint to the scheduler is preferred.
> > 
> > And it makes me wonder why Prateek's work stalled for quite a while,
> > as more is needed for userspace hints to the scheduler to work for
> > workloads other than seesaw.
> 
> I just quickly went over these a bit; assuming seesaw doesn't get accepted
> and we need them, I think they would need a bit of modification for fuse
> 
> 
> > +		/*
> > +		 * Handle the case where a hint is set and the current CPU
> > +		 * and the previous CPU where task ran don't share caches.
> > +		 */
> > +		if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {
> 
> I'm testing on an older Xeon system (E5-2650) and tried different settings
> with numa binding of the application (the fuse ring thread is bound anyway)
> 
>                                         
> governor                                  conservative    performance
>                                            (default)
> application cpu 0, ring cpu 0   creates/s    ~9200          ~9500
> application cpu 0, ring cpu 16  creates/s    ~4200          ~8000
> application cpu 0, ring cpu 1   creates/s    ~4200          ~8500
> 
> 
> No idea why cpu 1 gives better results in performance mode than cpu 16 -
> it might be within measurement accuracy. CPU frequency definitely has the
> largest effect here - the cpus_share_cache() condition is not ideal for
> fuse. And I guess asking users to use the cpu performance governor for
> fuse is also too much to ask - other file systems don't have that
> requirement... So far your seesaw idea works best (the modified version
> in combination with wake-on-same-core).
> 
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 63d242164b1a..07783ddaec5c 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -953,6 +953,8 @@ struct task_struct {
> >>           /* delay due to memory thrashing */
> >>           unsigned                        in_thrashing:1;
> >>    #endif
> >> +       unsigned                        seesaw_req:1;
> >> +       unsigned                        seesaw_io:1;
> >>    
> >>           unsigned long                   atomic_flags; /* Flags requiring atomic access. */
> >>    
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b9d6ed7585c6..474bf3657ef0 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>           if (wake_flags & WF_TTWU) {
> >>                   record_wakee(p);
> > 
> > Seesaw does not work without WF_TTWU as per define.
> 
> What does WF_TTWU actually mean? Something like work flow time to wake unit?
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index cd7aa679c3ee..6da0de4ae9ca 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -411,6 +411,20 @@ static void request_wait_answer(struct fuse_req *req)
>   	 * Wait it out.
>   	 */
>   	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> +
> +	/*
> +	 * __wake_up_on_current_cpu ensures we wake up on the right core;
> +	 * after that we still want to stay on the same core, shared with
> +	 * a ring thread, to submit the next request to it. The issue
> +	 * without seesaw is that while the ring thread is on its way to
> +	 * wait, it disturbs the application, which might get migrated away.
> +	 */
> +	if (fc->ring.per_core_queue) {
> +		current->seesaw_req = 1;
> +		current->seesaw_jiffies = jiffies;
> +	}
> +
> +
>   out:
>   	if (prev_cpu != task_cpu(current))
>   		pr_devel("%s cpu switch from=%d to=%d\n",
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 7d327699b4c5..73adc2b16778 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>   			/* XXX error injection or test with malicious daemon */
>   		}
>   
> +		/* In combination with the seesaw setting of the requesting
> +		 * process (application, see request_wait_answer), the
> +		 * application will stay on the same core.
> +		 */
> +		if (fc->ring.per_core_queue)
> +			current->seesaw_proc = 1;
> +
>   		ret = fuse_uring_fetch(ring_ent, cmd);
>   		break;
>   	case FUSE_URING_REQ_COMMIT_AND_FETCH:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b1a..53d9c77672b7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -953,6 +953,13 @@ struct task_struct {
>   	/* delay due to memory thrashing */
>   	unsigned                        in_thrashing:1;
>   #endif
> +	/* requesting task */
> +	unsigned 			seesaw_req:1;
> +	/* request processing task */
> +	unsigned			seesaw_proc:1;
> +
> +	/* limit seesaw time slot */
> +	unsigned long			seesaw_jiffies;
>   
>   	unsigned long			atomic_flags; /* Flags requiring atomic access. */
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9d6ed7585c6..a14161e6e456 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7605,6 +7605,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   	if (wake_flags & WF_TTWU) {
>   		record_wakee(p);
>   
> +		/*
> +		 * Current is handling requests on behalf of the waking process;
> +		 * both want to run on the same core in a seesaw manner.
> +		 * Typically current is bound to one core and only p is allowed
> +		 * to move freely.
> +		 */
> +		if (p->seesaw_req && current->seesaw_proc &&

What is missing is: are p and current assigned the same seesaw id,
given multiple seesaw instances?

> +		    time_after(jiffies, p->seesaw_jiffies + 10),
> +		    cpumask_test_cpu(cpu, p->cpus_ptr))
> +			return cpu;
> +
>   		if ((wake_flags & WF_CURRENT_CPU) &&
>   		    cpumask_test_cpu(cpu, p->cpus_ptr))
>   			return cpu;



* Re: fuse uring / wake_up on the same core
  2023-05-04  2:16               ` Hillf Danton
@ 2023-05-05 13:10                 ` Bernd Schubert
  0 siblings, 0 replies; 7+ messages in thread
From: Bernd Schubert @ 2023-05-05 13:10 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 5/4/23 04:16, Hillf Danton wrote:
> 
>> +		    time_after(jiffies, p->seesaw_jiffies + 10),
>> +		    cpumask_test_cpu(cpu, p->cpus_ptr))
>> +			return cpu;

The above is a big typo; I don't even see at first glance how this
compiled at all.

This was supposed to be

if (p->seesaw_req && current->seesaw_proc &&
     time_after(jiffies, p->seesaw_jiffies + 10) &&
     cpumask_test_cpu(cpu, p->cpus_ptr))


Anyway, I now understand that the WF_TTWU flag is tied to waitq
wakeups - we don't need the timeout at all. But then, if the main issue
is about waitq migration, I don't understand yet why Andrei's
WF_CURRENT_CPU is not sufficient. I'm going to investigate that next.
It is probably much easier to get that accepted than a rather
fuse-specific seesaw.
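
If WF_CURRENT_CPU turns out to be sufficient, the fuse side would only
need to mark its wakeups - a sketch, assuming Andrei's series keeps a
__wake_up_on_current_cpu() helper with a __wake_up()-style signature:

	/* ring thread, request completion: wake the submitter here */
	__wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL);

	/* submitter, request queuing: wake the ring thread here */
	__wake_up_on_current_cpu(&fiq->waitq, TASK_INTERRUPTIBLE, NULL);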


Thanks,
Bernd



