* Re: fuse uring / wake_up on the same core [not found] ` <3c0facd0-e3c7-0aa1-8b2e-961120d4f43d@ddn.com> @ 2023-04-28 1:44 ` Hillf Danton 2023-04-28 21:54 ` Bernd Schubert 0 siblings, 1 reply; 7+ messages in thread From: Hillf Danton @ 2023-04-28 1:44 UTC (permalink / raw) To: Bernd Schubert Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin, linux-mm, linux-kernel On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <bschubert@ddn.com> > Btw, a very hackish way to 'solve' the issue is this > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > index cd7aa679c3ee..dd32effb5010 100644 > --- a/fs/fuse/dev.c > +++ b/fs/fuse/dev.c > @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req) > int err; > int prev_cpu = task_cpu(current); > > + /* When running over uring and core affined userspace threads, we > + * do not want to let migrate away the request submitting process. > + * Issue is that even after waking up on the right core, processes > + * that have submitted requests might get migrated away, because > + * the ring thread is still doing a bit of work or is in the process > + * to go to sleep. Assumption here is that processes are started on > + * the right core (i.e. idle cores) and can then stay on that core > + * when they come and do file system requests. > + * Another alternative way is to set SCHED_IDLE for ring threads, > + * but that would have an issue if there are other processes keeping > + * the cpu busy. > + * SCHED_IDLE or this hack here result in about factor 3.5 for > + * max meta request performance. > + * > + * Ideal would to tell the scheduler that ring threads are not disturbing > + * that migration away from it should very very rarely happen. 
> + */ > + if (fc->ring.ready) > + migrate_disable(); > + > if (!fc->no_interrupt) { > /* Any signal may interrupt this */ > err = wait_event_interruptible(req->waitq, > If I understand it correctly, the seesaw workload hint to scheduler looks like the diff below, leaving scheduler free to pull the two players apart across CPU and to migrate anyone. --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -421,6 +421,7 @@ static void __fuse_request_send(struct f /* acquire extra reference, since request is still needed after fuse_request_end() */ __fuse_get_request(req); + current->seesaw = 1; queue_request_and_unlock(fiq, req); request_wait_answer(req); @@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f fc->max_write)) return -EINVAL; + current->seesaw = 1; restart: for (;;) { spin_lock(&fiq->lock); --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -953,6 +953,7 @@ struct task_struct { /* delay due to memory thrashing */ unsigned in_thrashing:1; #endif + unsigned seesaw:1; unsigned long atomic_flags; /* Flags requiring atomic access. */ --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct * if (wake_flags & WF_TTWU) { record_wakee(p); + if (p->seesaw && current->seesaw) + return cpu; if (sched_energy_enabled()) { new_cpu = find_energy_efficient_cpu(p, prev_cpu); if (new_cpu >= 0) ^ permalink raw reply [flat|nested] 7+ messages in thread
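The seesaw hint above boils down to a short-circuit in the wakeup path. As a minimal userspace sketch of that selection logic, assuming hypothetical task descriptors (this is not the kernel's struct task_struct, and the fallback stands in for the rest of select_task_rq_fair()):

```c
#include <assert.h>

/* Illustrative stand-in for the two fields the diff relies on. */
struct task {
	unsigned seesaw;	/* userspace hint: one half of a request/serve pair */
	int cpu;		/* CPU the task last ran on */
};

/* Mirror of the added check: when both waker and wakee carry the
 * seesaw hint, wake the wakee on the waker's CPU so the pair keeps
 * alternating on one core; otherwise fall back to the wakee's
 * previous CPU as a placeholder for the normal selection path. */
static int pick_wake_cpu(const struct task *waker, const struct task *wakee)
{
	if (waker->seesaw && wakee->seesaw)
		return waker->cpu;
	return wakee->cpu;
}
```

The point of the model is only that the hint never forces anything: without both flags set, selection proceeds as before.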
* Re: fuse uring / wake_up on the same core 2023-04-28 1:44 ` fuse uring / wake_up on the same core Hillf Danton @ 2023-04-28 21:54 ` Bernd Schubert 2023-04-28 23:37 ` Hillf Danton ` (2 more replies) 0 siblings, 3 replies; 7+ messages in thread From: Bernd Schubert @ 2023-04-28 21:54 UTC (permalink / raw) To: Hillf Danton Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin, linux-mm, linux-kernel On 4/28/23 03:44, Hillf Danton wrote: > On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <bschubert@ddn.com> >> Btw, a very hackish way to 'solve' the issue is this >> >> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c >> index cd7aa679c3ee..dd32effb5010 100644 >> --- a/fs/fuse/dev.c >> +++ b/fs/fuse/dev.c >> @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req) >> int err; >> int prev_cpu = task_cpu(current); >> >> + /* When running over uring and core affined userspace threads, we >> + * do not want to let migrate away the request submitting process. >> + * Issue is that even after waking up on the right core, processes >> + * that have submitted requests might get migrated away, because >> + * the ring thread is still doing a bit of work or is in the process >> + * to go to sleep. Assumption here is that processes are started on >> + * the right core (i.e. idle cores) and can then stay on that core >> + * when they come and do file system requests. >> + * Another alternative way is to set SCHED_IDLE for ring threads, >> + * but that would have an issue if there are other processes keeping >> + * the cpu busy. >> + * SCHED_IDLE or this hack here result in about factor 3.5 for >> + * max meta request performance. >> + * >> + * Ideal would to tell the scheduler that ring threads are not disturbing >> + * that migration away from it should very very rarely happen. 
>> + */ >> + if (fc->ring.ready) >> + migrate_disable(); >> + >> if (!fc->no_interrupt) { >> /* Any signal may interrupt this */ >> err = wait_event_interruptible(req->waitq, >> > If I understand it correctly, the seesaw workload hint to scheduler looks > like the diff below, leaving scheduler free to pull the two players apart > across CPU and to migrate anyone. Thanks a lot, Hillf! I had a day off / family day today, the kernel is now eventually compiling. > > --- a/fs/fuse/dev.c > +++ b/fs/fuse/dev.c > @@ -421,6 +421,7 @@ static void __fuse_request_send(struct f > /* acquire extra reference, since request is still needed > after fuse_request_end() */ > __fuse_get_request(req); > + current->seesaw = 1; > queue_request_and_unlock(fiq, req); > > request_wait_answer(req); > @@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f > fc->max_write)) > return -EINVAL; > > + current->seesaw = 1;

fuse_dev_do_read is plain /dev/fuse (with read/write) and we don't know on which core these IO threads are running and which of them to wake up when an application comes with a request. There is a patch to use __wake_up_sync to wake the IO thread and reports that it helps performance, but I don't see it and I think Miklos doesn't either. For direct-io read I had also already tested disabling migration - it didn't show any effect - we better don't set current->seesaw = 1 in fuse_dev_do_read for now. With my fuse-uring patches things are clearer (https://lwn.net/Articles/926773/): there is one IO thread per core and the libfuse side binds these threads to a single core only.

nproc   /dev/fuse    /dev/fuse     fuse uring   fuse uring
        migrate on   migrate off   migrate on   migrate off
 1        2023         1652          1151          3998
 2        3375         2805          2221          7950
 4        3823         4193          4540         15022
 8        7796         8161          7846         22591
16        8520         8518         12235         27864
24        8361         8084          9415         27864
32        8361         8084          9124         12971

(in MiB/s)

So core affinity really matters, and with core affinity fuse-uring is always faster than the existing code.
For single-threaded metadata (file creates/stat/unlink) the difference between migrate on/off is rather similar. Going to run with multiple processes during the next days. For paged (async) IO it behaves a bit differently, as here uring can show its strength and multiple requests can be combined on CQE processing - better to choose an idle ring thread on another core. I actually have a question for that as well - later. > restart: > for (;;) { > spin_lock(&fiq->lock); > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -953,6 +953,7 @@ struct task_struct { > /* delay due to memory thrashing */ > unsigned in_thrashing:1; > #endif > + unsigned seesaw:1; > > unsigned long atomic_flags; /* Flags requiring atomic access. */ > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct * > if (wake_flags & WF_TTWU) { > record_wakee(p); > > + if (p->seesaw && current->seesaw) > + return cpu; > if (sched_energy_enabled()) { > new_cpu = find_energy_efficient_cpu(p, prev_cpu); > if (new_cpu >= 0)

Hmm, WF_CURRENT_CPU works rather similarly, except that it tests whether cpu is in cpus_ptr? The combination of both patches results in

	if (p->seesaw && current->seesaw)
		return cpu;

	if ((wake_flags & WF_CURRENT_CPU) &&
	    cpumask_test_cpu(cpu, p->cpus_ptr))
		return cpu;

While writing the mail the kernel compilation finished, but it got late - will test in the morning. Thanks again, Bernd ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fuse uring / wake_up on the same core 2023-04-28 21:54 ` Bernd Schubert @ 2023-04-28 23:37 ` Hillf Danton 2023-05-01 21:44 ` Bernd Schubert [not found] ` <20230502003335.3253-1-hdanton@sina.com> 2 siblings, 0 replies; 7+ messages in thread From: Hillf Danton @ 2023-04-28 23:37 UTC (permalink / raw) To: Bernd Schubert Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin, linux-mm, linux-kernel On 28 Apr 2023 21:54:51 +0000 Bernd Schubert <bschubert@ddn.com> > > @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct * > > if (wake_flags & WF_TTWU) { > > record_wakee(p); > > > > + if (p->seesaw && current->seesaw) > > + return cpu; > > if (sched_energy_enabled()) { > > new_cpu = find_energy_efficient_cpu(p, prev_cpu); > > if (new_cpu >= 0) > > Hmm, WF_CURRENT_CPU works rather similar, except that it tests if cpu is > in cpus_ptr? The combination of both patches results in I missed checking cpu against p. Good catch. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fuse uring / wake_up on the same core 2023-04-28 21:54 ` Bernd Schubert 2023-04-28 23:37 ` Hillf Danton @ 2023-05-01 21:44 ` Bernd Schubert [not found] ` <20230502003335.3253-1-hdanton@sina.com> 2 siblings, 0 replies; 7+ messages in thread From: Bernd Schubert @ 2023-05-01 21:44 UTC (permalink / raw) To: Hillf Danton Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin, linux-mm, linux-kernel On 4/28/23 23:54, Bernd Schubert wrote: > On 4/28/23 03:44, Hillf Danton wrote: >> restart: >> for (;;) { >> spin_lock(&fiq->lock); >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -953,6 +953,7 @@ struct task_struct { >> /* delay due to memory thrashing */ >> unsigned in_thrashing:1; >> #endif >> + unsigned seesaw:1; >> unsigned long atomic_flags; /* Flags requiring atomic >> access. */ >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct * >> if (wake_flags & WF_TTWU) { >> record_wakee(p); >> + if (p->seesaw && current->seesaw) >> + return cpu; >> if (sched_energy_enabled()) { >> new_cpu = find_energy_efficient_cpu(p, prev_cpu); >> if (new_cpu >= 0) > > > Hmm, WF_CURRENT_CPU works rather similar, except that it tests if cpu is > in cpus_ptr? The combination of both patches results in > > if (p->seesaw && current->seesaw) > return cpu; > > if ((wake_flags & WF_CURRENT_CPU) && > cpumask_test_cpu(cpu, p->cpus_ptr)) > return cpu; > > > > While writing the mail kernel compilation is ready, but it got late, > will test in the morning. This works wonders! 
The fuse-uring part is this diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index cd7aa679c3ee..ec5853ca9646 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -373,6 +373,9 @@ static void request_wait_answer(struct fuse_req *req) int err; int prev_cpu = task_cpu(current); + if (fc->ring.per_core_queue) + current->seesaw = 1; + if (!fc->no_interrupt) { /* Any signal may interrupt this */ err = wait_event_interruptible(req->waitq, diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c index 7d327699b4c5..715741ed58bf 100644 --- a/fs/fuse/dev_uring.c +++ b/fs/fuse/dev_uring.c @@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) /* XXX error injection or test with malicious daemon */ } + /* In combination with requesting process (application) seesaw + * setting (see request_wait_answer), the application will + * stay on the same core. + */ + if (fc->ring.per_core_queue) + current->seesaw = 1; + ret = fuse_uring_fetch(ring_ent, cmd); break; case FUSE_URING_REQ_COMMIT_AND_FETCH: I'm not familiar at all with scheduler code; given this works perfectly, this suggests the same function is also called without an explicit waitq, when the scheduler preempts a task? I think there might be side effects - what if multiple applications are on one core and another core would be available? With this flag they would stay on the same core? Maybe better two flags? diff --git a/include/linux/sched.h b/include/linux/sched.h index 63d242164b1a..07783ddaec5c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -953,6 +953,8 @@ struct task_struct { /* delay due to memory thrashing */ unsigned in_thrashing:1; #endif + unsigned seesaw_req:1; + unsigned seesaw_io:1; unsigned long atomic_flags; /* Flags requiring atomic access. 
*/ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b9d6ed7585c6..474bf3657ef0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) if (wake_flags & WF_TTWU) { record_wakee(p); + /* current is handling requests on behalf of the waking process, + * both want to run on the same core in seeswaw manner. + */ + if (p->seesaw_req && current->seesaw_io && + cpumask_test_cpu(cpu, p->cpus_ptr)) + return cpu; + if ((wake_flags & WF_CURRENT_CPU) && cpumask_test_cpu(cpu, p->cpus_ptr)) return cpu; (not tested yet) Thanks, Bernd ^ permalink raw reply [flat|nested] 7+ messages in thread
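To see why two flags avoid pairing two applications with each other, here is a toy model of the proposed check, with the cpumask reduced to a plain bitmask; the field names mirror the patch, everything else is made up for illustration:

```c
#include <assert.h>

struct task {
	unsigned seesaw_req:1;	 /* set by the request submitter (application) */
	unsigned seesaw_proc:1;	 /* set by the request processor (ring thread) */
	unsigned long cpus_mask; /* bit n set => CPU n allowed */
	int prev_cpu;
};

/* The wakee p is pulled onto the waker's CPU only when p is a
 * submitter, the waker is a processor, and that CPU is allowed for p.
 * Two applications never pin each other, since an application does
 * not set seesaw_proc. */
static int select_wake_cpu(int cpu, const struct task *p,
			   const struct task *curr)
{
	if (p->seesaw_req && curr->seesaw_proc &&
	    (p->cpus_mask & (1UL << cpu)))
		return cpu;
	return p->prev_cpu;
}
```

The asymmetry of the two bits is the whole design point: the hint only fires for a submitter woken by a processor, never for two submitters sharing a core.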
[parent not found: <20230502003335.3253-1-hdanton@sina.com>]
* Re: fuse uring / wake_up on the same core [not found] ` <20230502003335.3253-1-hdanton@sina.com> @ 2023-05-03 17:04 ` Bernd Schubert 2023-05-04 2:16 ` Hillf Danton 0 siblings, 1 reply; 7+ messages in thread From: Bernd Schubert @ 2023-05-03 17:04 UTC (permalink / raw) To: Hillf Danton Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin, linux-mm, linux-kernel On 5/2/23 02:33, Hillf Danton wrote: > On 1 May 2023 21:44:48 +0000 Bernd Schubert <bschubert@ddn.com> >> I'm not familiar at all with scheduler code, >> given this works perfectly this suggests the same function is also >> called without explicit waitq, when the scheduler preempts a task? > > Please see comment below in the select_task function. >> >> I think there might be side effects - what is if multiple >> applications are on one core and another core would be available? >> With this flag they would stay on the same core? Maybe better two flags? > > Gooddd question. Given multiple seesaws, > > Player i1 Player j1 > | | > Seesaw i Seesaw j > | | > P i2 P j2 > > what is ideal is run Seesaws i and j on different CPU cores. > > We can do that by replacing the seesaw flag in the task struct with > seesaw id for instance. But I have no idea how complex it will become > to set up multiple seesaw workloads on the fuse side, by grouping and > binding players to different seesaws, > > - unsigned seesaw:1; > + unsigned int seesaw;> > while the corresponding change to scheduler looks like. > > + if (p->seesaw && p->seesaw == current->seesaw && > + cpumask_test_cpu(cpu, p->cpus_ptr)) > + return cpu; Hmm, how is the seesaw id assigned and assuming two processes landed on the same core but later on another core is available, how does it dynamically change the ID? My idea with two bits was that there is a fuse ring thread bound to a specific core - it is the IO processor and gets the seesaw_proc bit. Application is submitting requests and get the seesaw_req bit set. 
Two applications running on the same core won't disturb each other that way. In addition, if the application is not submitting requests anymore, but let's say is busy doing computations, we want to have a timeout to let it move away if another core is more suitable. What do you think about the new patch version at the end of the mail? It uses two bits and jiffies. Just tested it and it works fine. The exact timeout period is certainly debatable. I also feel a bit bad to take so many bits in struct task. If this approach is acceptable, the jiffies parameter could probably be a u16. > > Even after job done for fuse, the emerging question is, how to set up > seesaw workloads for crypto for example, if no more than the seesaw id > hint to scheduler is preferred. > > And it makes me wonder why Prateek's work stalled for quite a while, > as more is needed for userspace hint to scheduler to work for > workloads other than seesaw. Just quickly went over these a bit; assuming seesaw doesn't get accepted and we need these, I think it would need a bit of modification for fuse > + /* > + * Handle the case where a hint is set and the current CPU > + * and the previous CPU where task ran don't share caches. > + */ > + if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) { I'm testing on an older Xeon system (E5-2650) and tried different settings with numa binding the application (the fuse ring thread is bound anyway)

                                             governor conservative   performance
                                             (default)
application cpu 0, ring cpu 0    creates/s   ~9200                   ~9500
application cpu 0, ring cpu 16   creates/s   ~4200                   ~8000
application cpu 0, ring cpu 1    creates/s   ~4200                   ~8500

No idea why cpu 1 gives better results in performance mode than cpu 16, might be within accuracy. CPU frequency definitely has the largest effect here - the cpus_share_cache() condition is not ideal for fuse. And I guess asking users to use cpu performance mode for fuse is also too much to ask - other file systems don't have that requirement... 
So far your seesaw idea works best (the modified version in combination with wake-on-same-core). >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 63d242164b1a..07783ddaec5c 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -953,6 +953,8 @@ struct task_struct { >> /* delay due to memory thrashing */ >> unsigned in_thrashing:1; >> #endif >> + unsigned seesaw_req:1; >> + unsigned seesaw_io:1; >> >> unsigned long atomic_flags; /* Flags requiring atomic access. */ >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index b9d6ed7585c6..474bf3657ef0 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) >> if (wake_flags & WF_TTWU) { >> record_wakee(p); > > Seesaw does not work without WF_TTWU as per define. What does WF_TTWU actually mean? Something like work flow time to wake unit? diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index cd7aa679c3ee..6da0de4ae9ca 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -411,6 +411,20 @@ static void request_wait_answer(struct fuse_req *req) * Wait it out. */ wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags)); + + /* + * __wake_up_on_current_cpu ensures we wake up on the right core, + * after that we still want to stay on the same core, shared with + * a ring thread to submit next request to it. 
Issue without seesaw + * is that the while the ring thread is on its way to wait, it disturbs + * the application and application might get migrated away + */ + if (fc->ring.per_core_queue) { + current->seesaw_req = 1; + current->seesaw_jiffies = jiffies; + } + + out: if (prev_cpu != task_cpu(current)) pr_devel("%s cpu switch from=%d to=%d\n", diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c index 7d327699b4c5..73adc2b16778 100644 --- a/fs/fuse/dev_uring.c +++ b/fs/fuse/dev_uring.c @@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) /* XXX error injection or test with malicious daemon */ } + /* In combination with requesting process (application) seesaw + * setting (see request_wait_answer), the application will + * stay on the same core. + */ + if (fc->ring.per_core_queue) + current->seesaw_proc = 1; + ret = fuse_uring_fetch(ring_ent, cmd); break; case FUSE_URING_REQ_COMMIT_AND_FETCH: diff --git a/include/linux/sched.h b/include/linux/sched.h index 63d242164b1a..53d9c77672b7 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -953,6 +953,13 @@ struct task_struct { /* delay due to memory thrashing */ unsigned in_thrashing:1; #endif + /* requesting task */ + unsigned seesaw_req:1; + /* request processing task */ + unsigned seesaw_proc:1; + + /* limit seesaw time slot */ + unsigned long seesaw_jiffies; unsigned long atomic_flags; /* Flags requiring atomic access. */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b9d6ed7585c6..a14161e6e456 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7605,6 +7605,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) if (wake_flags & WF_TTWU) { record_wakee(p); + /* + * Current is handling requests on behalf of the waking process, + * both want to run on the same core in seeswaw manner. + * Typically current is bound to one core.'and only p is allowed + * to freely move. 
+ */ + if (p->seesaw_req && current->seesaw_proc && + time_after(jiffies, p->seesaw_jiffies + 10), + cpumask_test_cpu(cpu, p->cpus_ptr)) + return cpu; + if ((wake_flags & WF_CURRENT_CPU) && cpumask_test_cpu(cpu, p->cpus_ptr)) return cpu; Thanks, Bernd ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fuse uring / wake_up on the same core 2023-05-03 17:04 ` Bernd Schubert @ 2023-05-04 2:16 ` Hillf Danton 2023-05-05 13:10 ` Bernd Schubert 0 siblings, 1 reply; 7+ messages in thread From: Hillf Danton @ 2023-05-04 2:16 UTC (permalink / raw) To: Bernd Schubert Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin, linux-mm, linux-kernel On 3 May 2023 17:04:25 +0000 Bernd Schubert <bschubert@ddn.com> > On 5/2/23 02:33, Hillf Danton wrote: > > On 1 May 2023 21:44:48 +0000 Bernd Schubert <bschubert@ddn.com> > >> I'm not familiar at all with scheduler code, > >> given this works perfectly this suggests the same function is also > >> called without explicit waitq, when the scheduler preempts a task? > > > > Please see comment below in the select_task function. > >> > >> I think there might be side effects - what is if multiple > >> applications are on one core and another core would be available? > >> With this flag they would stay on the same core? Maybe better two flags? > > > > Gooddd question. Given multiple seesaws, > > > > Player i1 Player j1 > > | | > > Seesaw i Seesaw j > > | | > > P i2 P j2 > > > > what is ideal is run Seesaws i and j on different CPU cores. > > > > We can do that by replacing the seesaw flag in the task struct with > > seesaw id for instance. But I have no idea how complex it will become > > to set up multiple seesaw workloads on the fuse side, by grouping and > > binding players to different seesaws, > > > > - unsigned seesaw:1; > > + unsigned int seesaw; > > while the corresponding change to scheduler looks like. > > > > + if (p->seesaw && p->seesaw == current->seesaw && > > + cpumask_test_cpu(cpu, p->cpus_ptr)) > > + return cpu; > > > Hmm, how is the seesaw id assigned and assuming two processes landed > on the same core but later on another core is available, how does it > dynamically change the ID? Gooood question again. 
The seesaw id is freely assigned in the space unsigned int can offer, 1 <= seesaw id < UINT_MAX, independent of anything else, and once assigned it will not change throughout the lifespan, because there is no link between seesaw id and CPU core id. The main point of the seesaw id is to help group/bind players on the fuse side, and to help select the current CPU on the scheduler side. It also prevents player i2 from making a phone call to player j1. There is no more it can do, so no change in how the scheduler handles an idle core. > My idea with two bits was that there is a fuse ring thread bound to a > specific core - it is the IO processor and gets the seesaw_proc bit. Setting CPU affinity is another userspace hint independent of seesaw. > Application is submitting requests and get the seesaw_req bit set. > Two applications running on the same core won't disturb each other that way. Yeah, the scheduler is free to migrate any seesaw_req away if needed, but seesaw will pull it back at the next wakeup.

	req I1 ... In
	      |
	seesaw-I on core-K
	      |
	   proc I

After binding proc-I to core-W for instance, if req-I2 is migrated to core-X then it makes no sense to pull proc-I to core-X upon wakeup. Actually this will not happen because of the cpumask test for proc-I.

	cpumask_test_cpu(cpu, p->cpus_ptr))

> > As addition, if the application is not submitting requests anymore, but > let's say is busy doing computations, we want to have a timeout to let > it move away if another core is more suitable. If I understand fuse correctly, in request_wait_answer(), seesaw helps only if the application sleeps on req->waitq (in other words, seesaw does not work without WF_TTWU), and the scheduler does migration for free. Note migration is the job of the scheduler side and has nothing to do with seesaw, otherwise PeterZ will lion roar. > What do you think about > the new patch version at the end of the mail? It uses two bits and jiffies. See comment below in select_task_rq_fair(). 
What we are doing is just show PeterZ that userspace hint for the fuse seesaw workload works, and feel free to do whatever you like because you know much better about fuse than I do. > Just tested it and it works fine. The exact timeout period is > certainly debatable. I also feel a bit bad > to take so many bits in struct task. If this approach is acceptable, > the jiffies parameter could be probably an u16. > > > > > Even after job done for fuse, the emerging question is, how to set up > > seesaw workloads for crypto for example, if no more than the seesaw id > > hint to scheduler is preferred. > > > > And it makes me wonder why Prateek's work stalled for quite a while, > > as more is needed for userspace hint to scheduler to work for > > workloads other than seesaw. > > Just quickly went over these a bit, assuming seesaw doesn't get accepted > and we need these, I think it would need a bit modification for fuse > > > > + /* > > + * Handle the case where a hint is set and the current CPU > > + * and the previous CPU where task ran don't share caches. > > + */ > > + if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) { > > I'm testing on and older Xeon system (E5-2650) and tried different settings > with numa binding the application (fuse ring thread is bound anyway) > > > governor conservative performance > (default) > application cpu 0, ring cpu 0 creates/s ~9200 ~9500 > application cpu 0. ring cpu 16 creates/s ~4200 ~8000 > application cpu 0. ring cpu 1 creates/s ~4200 ~8500 > > > No idea why cpu 1 gives better results in performance mode than cpu 16, might be within > accuracy. CPU frequency definitely has the largest effect here - the cpus_share_cache() > condition is not ideal for fuse. And I guess asking users to use cpu performance mode > for fuse is also too much asked - other file systems don't have that requirement... > So far your seesaw idea works best (the modified version in combination with > wake-on-same-core). 
> > >> > >> diff --git a/include/linux/sched.h b/include/linux/sched.h > >> index 63d242164b1a..07783ddaec5c 100644 > >> --- a/include/linux/sched.h > >> +++ b/include/linux/sched.h > >> @@ -953,6 +953,8 @@ struct task_struct { > >> /* delay due to memory thrashing */ > >> unsigned in_thrashing:1; > >> #endif > >> + unsigned seesaw_req:1; > >> + unsigned seesaw_io:1; > >> > >> unsigned long atomic_flags; /* Flags requiring atomic access. */ > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index b9d6ed7585c6..474bf3657ef0 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) > >> if (wake_flags & WF_TTWU) { > >> record_wakee(p); > > > > Seesaw does not work without WF_TTWU as per define. > > What does WF_TTWU actually mean? Something like work flow time to wake unit? > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > index cd7aa679c3ee..6da0de4ae9ca 100644 > --- a/fs/fuse/dev.c > +++ b/fs/fuse/dev.c > @@ -411,6 +411,20 @@ static void request_wait_answer(struct fuse_req *req) > * Wait it out. > */ > wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags)); > + > + /* > + * __wake_up_on_current_cpu ensures we wake up on the right core, > + * after that we still want to stay on the same core, shared with > + * a ring thread to submit next request to it. 
Issue without seesaw > + * is that the while the ring thread is on its way to wait, it disturbs > + * the application and application might get migrated away > + */ > + if (fc->ring.per_core_queue) { > + current->seesaw_req = 1; > + current->seesaw_jiffies = jiffies; > + } > + > + > out: > if (prev_cpu != task_cpu(current)) > pr_devel("%s cpu switch from=%d to=%d\n", > diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c > index 7d327699b4c5..73adc2b16778 100644 > --- a/fs/fuse/dev_uring.c > +++ b/fs/fuse/dev_uring.c > @@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) > /* XXX error injection or test with malicious daemon */ > } > > + /* In combination with requesting process (application) seesaw > + * setting (see request_wait_answer), the application will > + * stay on the same core. > + */ > + if (fc->ring.per_core_queue) > + current->seesaw_proc = 1; > + > ret = fuse_uring_fetch(ring_ent, cmd); > break; > case FUSE_URING_REQ_COMMIT_AND_FETCH: > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 63d242164b1a..53d9c77672b7 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -953,6 +953,13 @@ struct task_struct { > /* delay due to memory thrashing */ > unsigned in_thrashing:1; > #endif > + /* requesting task */ > + unsigned seesaw_req:1; > + /* request processing task */ > + unsigned seesaw_proc:1; > + > + /* limit seesaw time slot */ > + unsigned long seesaw_jiffies; > > unsigned long atomic_flags; /* Flags requiring atomic access. */ > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index b9d6ed7585c6..a14161e6e456 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7605,6 +7605,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) > if (wake_flags & WF_TTWU) { > record_wakee(p); > > + /* > + * Current is handling requests on behalf of the waking process, > + * both want to run on the same core in seeswaw manner. 
> + * Typically current is bound to one core.'and only p is allowed > + * to freely move. > + */ > + if (p->seesaw_req && current->seesaw_proc && What is missed is, are p and current assigned the same seesaw id given multiple cases of seesaw? > + time_after(jiffies, p->seesaw_jiffies + 10), > + cpumask_test_cpu(cpu, p->cpus_ptr)) > + return cpu; > + > if ((wake_flags & WF_CURRENT_CPU) && > cpumask_test_cpu(cpu, p->cpus_ptr)) > return cpu; ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fuse uring / wake_up on the same core 2023-05-04 2:16 ` Hillf Danton @ 2023-05-05 13:10 ` Bernd Schubert 0 siblings, 0 replies; 7+ messages in thread From: Bernd Schubert @ 2023-05-05 13:10 UTC (permalink / raw) To: Hillf Danton Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin, linux-mm, linux-kernel On 5/4/23 04:16, Hillf Danton wrote: > >> + time_after(jiffies, p->seesaw_jiffies + 10), >> + cpumask_test_cpu(cpu, p->cpus_ptr)) >> + return cpu; Above is a big typo, I don't even see at first glance how this compiled at all. This was supposed to be

	if (p->seesaw_req && current->seesaw_proc &&
	    time_after(jiffies, p->seesaw_jiffies + 10) &&
	    cpumask_test_cpu(cpu, p->cpus_ptr))

Anyway, I now understand that the WF_TTWU flag is related to the waitq - we don't need the timeout at all. But then, if the main issue is about waitq migration, I don't understand yet why Andrei's WF_CURRENT_CPU is not sufficient. I'm going to investigate that next. Probably much easier to get that accepted than a rather fuse-specific seesaw. Thanks, Bernd ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-05-05 13:10 UTC | newest]
Thread overview: 7+ messages
[not found] <d0ed1dbd-1b7e-bf98-65c0-7f61dd1a3228@ddn.com>
[not found] ` <20230327102845.GB7701@hirez.programming.kicks-ass.net>
[not found] ` <20230427122417.2452-1-hdanton@sina.com>
[not found] ` <3c0facd0-e3c7-0aa1-8b2e-961120d4f43d@ddn.com>
2023-04-28 1:44 ` fuse uring / wake_up on the same core Hillf Danton
2023-04-28 21:54 ` Bernd Schubert
2023-04-28 23:37 ` Hillf Danton
2023-05-01 21:44 ` Bernd Schubert
[not found] ` <20230502003335.3253-1-hdanton@sina.com>
2023-05-03 17:04 ` Bernd Schubert
2023-05-04 2:16 ` Hillf Danton
2023-05-05 13:10 ` Bernd Schubert