Date: Thu, 19 Nov 2020 13:24:49 -0500
From: Johannes Weiner
To: Peter Zijlstra
Cc: Zhaoyang Huang, Zhaoyang Huang, Ingo Molnar, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: bail out from psi memstall when cond_resched
Message-ID: <20201119182449.GA1144379@cmpxchg.org>
References: <1605669776-24242-1-git-send-email-zhaoyang.huang@unisoc.com> <20201119125557.GM3121392@hirez.programming.kicks-ass.net>
In-Reply-To: <20201119125557.GM3121392@hirez.programming.kicks-ass.net>

On Thu, Nov 19, 2020 at 01:55:57PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 18, 2020 at 11:22:56AM +0800, Zhaoyang Huang wrote:
> > Memory reclaiming will run as several seconds in memory constraint system, which
> > will be deemed as heavy memstall. Have the memory reclaim be more presiced by
> > bailing out when cond_resched
> 
> How is this supposed to work on PREEMPT=y where cond_resched() is a NOP
> and you can get preempted at any random point?
> 
> (leaving the rest for Johannes)
> 
> > Signed-off-by: Zhaoyang Huang
> > ---
> >  mm/vmscan.c | 23 ++++++++++++++++-------
> >  1 file changed, 16 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a815f73..a083c85 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -316,6 +316,15 @@ static inline bool memcg_congested(struct pglist_data *pgdat,
> >  }
> >  #endif
> > 
> > +static inline void psi_cond_resched(void)
> > +{
> > +        unsigned long *flags;
> > +
> > +        if (current->flags & PF_MEMSTALL)
> > +                psi_memstall_leave(&flags);
> > +        cond_resched();
> > +        psi_memstall_enter(&flags);
> > +}
> >  /*
> >   * This misses isolated pages which are not accounted for to save counters.
> >   * As the data only determines if reclaim or compaction continues, it is
> > @@ -557,7 +566,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  		total_scan -= shrinkctl->nr_scanned;
> >  		scanned += shrinkctl->nr_scanned;
> > 
> > -		cond_resched();
> > +		psi_cond_resched();
> >  	}

I guess there is some merit to discounting that time: if you're out of
time slice, you couldn't run the workload either at this time, so the
scheduling delay is not exactly productivity lost due to memory.

But as Peter says, you'd have to handle random preemption at any time,
not just cond_resched().

I think it should be doable. The below stops the clock on any memstalls
when the task is queued but not scheduled. This includes preemption
periods, but also the schedule delay after a wakeup. Lightly tested.
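For context, the annotation API under discussion is the
psi_memstall_enter()/psi_memstall_leave() pair. A minimal sketch of the
usual calling convention on the reclaim side (do_reclaim_work() is a
hypothetical placeholder, not a real function):

        static void reclaim_with_memstall(void)
        {
                unsigned long pflags;

                /* Start the memory stall clock; pflags remembers PF_MEMSTALL nesting */
                psi_memstall_enter(&pflags);

                do_reclaim_work();      /* the work that is stalled on memory */

                /* Stop the clock and restore the previous PF_MEMSTALL state */
                psi_memstall_leave(&pflags);
        }

Note that the caller owns the flags word on its own stack and passes its
address to both calls.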
It would rely on p->psi_flags for control flow, which were previously
only used for debugging. But that simplifies cgroup moving a bit.

Thoughts?

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c0766c..f53823b6852b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -811,26 +811,38 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
 	struct psi_group *group, *common = NULL;
+	int set = TSK_ONCPU, clear = TSK_ONCPU;
 	int cpu = task_cpu(prev);
 	void *iter;
 
+	/*
+	 * Preemption periods and scheduling delays after wakeup are
+	 * due to a lack of CPU, not memory. We stop the clock on the
+	 * memory stall in such situations (see also psi_enqueue());
+	 * resume the stall when the task is actually scheduled again.
+	 */
+	if (!sleep && prev->in_memstall)
+		clear |= TSK_MEMSTALL;
+	if (next->in_memstall && !(next->psi_flags & TSK_MEMSTALL))
+		set |= TSK_MEMSTALL;
+
 	if (next->pid) {
-		psi_flags_change(next, 0, TSK_ONCPU);
-		/*
-		 * When moving state between tasks, the group that
-		 * contains them both does not change: we can stop
-		 * updating the tree once we reach the first common
-		 * ancestor. Iterate @next's ancestors until we
-		 * encounter @prev's state.
-		 */
+		psi_flags_change(next, 0, set);
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			/*
+			 * When moving ONCPU between tasks, the group
+			 * that contains them both does not change: we
+			 * can stop updating the tree once we reach
+			 * the first common ancestor. Iterate @next's
+			 * ancestors until we encounter @prev's state.
+			 */
+			if ((set|clear) == TSK_ONCPU &&
+			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
-
-			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
+			psi_group_change(group, cpu, 0, set, true);
 		}
 	}
 
@@ -843,11 +855,10 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		return;
 
 	if (prev->pid) {
-		psi_flags_change(prev, TSK_ONCPU, 0);
-
+		psi_flags_change(prev, clear, 0);
 		iter = NULL;
 		while ((group = iterate_groups(prev, &iter)) && group != common)
-			psi_group_change(group, cpu, TSK_ONCPU, 0, true);
+			psi_group_change(group, cpu, clear, 0, true);
 	}
 }
 
@@ -964,7 +975,7 @@ void psi_cgroup_free(struct cgroup *cgroup)
  */
 void cgroup_move_task(struct task_struct *task, struct css_set *to)
 {
-	unsigned int task_flags = 0;
+	unsigned int task_flags;
 	struct rq_flags rf;
 	struct rq *rq;
 
@@ -978,23 +989,10 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	}
 
 	rq = task_rq_lock(task, &rf);
-
-	if (task_on_rq_queued(task)) {
-		task_flags = TSK_RUNNING;
-		if (task_current(rq, task))
-			task_flags |= TSK_ONCPU;
-	} else if (task->in_iowait)
-		task_flags = TSK_IOWAIT;
-
-	if (task->in_memstall)
-		task_flags |= TSK_MEMSTALL;
-
+	task_flags = task->psi_flags;
 	if (task_flags)
 		psi_task_change(task, task_flags, 0);
-
-	/* See comment above */
 	rcu_assign_pointer(task->cgroups, to);
-
 	if (task_flags)
 		psi_task_change(task, 0, task_flags);
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 33d0daf83842..cb36e2e2b6c7 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -69,14 +69,20 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (!wakeup || p->sched_psi_wake_requeue) {
-		if (p->in_memstall)
-			set |= TSK_MEMSTALL;
-		if (p->sched_psi_wake_requeue)
-			p->sched_psi_wake_requeue = 0;
-	} else {
+	if (p->sched_psi_wake_requeue) {
+		/*
+		 * A migration wakeup has nothing to do here but set
+		 * TSK_RUNNING: TSK_IOWAIT has already been cleared on
+		 * the old queue in ttwu(); TSK_MEMSTALL will be
+		 * restored when the task resumes in schedule().
+		 */
+		p->sched_psi_wake_requeue = 0;
+	} else if (wakeup) {
 		if (p->in_iowait)
 			clear |= TSK_IOWAIT;
+		/* Don't count queue time as memory pressure */
+		if (p->in_memstall)
+			clear |= TSK_MEMSTALL;
 	}
 
 	psi_task_change(p, clear, set);
@@ -89,10 +95,7 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (!sleep) {
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
-	} else {
+	if (sleep) {
 		/*
 		 * When a task sleeps, schedule() dequeues it before
 		 * switching to the next one. Merge the clearing of
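Whichever variant ends up being taken, the stall clock discussed here
feeds the PSI interface exported to userspace. A minimal sketch
(illustrative only, not part of the patch) for inspecting the
system-wide memory pressure totals:

        #include <stdio.h>

        int main(void)
        {
                char line[256];
                FILE *f = fopen("/proc/pressure/memory", "r");

                if (!f) {
                        perror("/proc/pressure/memory");
                        return 1;
                }
                /* Prints the "some" and "full" lines; total= is stall time in us */
                while (fgets(line, sizeof(line), f))
                        fputs(line, stdout);
                fclose(f);
                return 0;
        }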