From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.5 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_PASS,USER_AGENT_MUTT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 77C4CC169C4 for ; Tue, 29 Jan 2019 18:35:41 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 0AC6720856 for ; Tue, 29 Jan 2019 18:35:41 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=cmpxchg-org.20150623.gappssmtp.com header.i=@cmpxchg-org.20150623.gappssmtp.com header.b="BbU4HeLE" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0AC6720856 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=cmpxchg.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id ABB9A8E0002; Tue, 29 Jan 2019 13:35:40 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id A1CB78E0001; Tue, 29 Jan 2019 13:35:40 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 898EB8E0002; Tue, 29 Jan 2019 13:35:40 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from mail-yb1-f198.google.com (mail-yb1-f198.google.com [209.85.219.198]) by kanga.kvack.org (Postfix) with ESMTP id 52D648E0001 for ; Tue, 29 Jan 2019 13:35:40 -0500 (EST) Received: by mail-yb1-f198.google.com with SMTP id o200so10678373ybc.1 for ; Tue, 29 Jan 2019 10:35:40 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:date:from:to:cc:subject :message-id:references:mime-version:content-disposition:in-reply-to :user-agent; bh=FTv/ayI6eNRu9V59TgMNsy0y0Pdy4VkeVoplB/wraOE=; b=jF2r6XhtF+xEul6nefdOU5Rv4L48daiZW50QL/0CMk+8y89tNHjMUGa74QVQS7hcat oU8ZJUqL3eQ7lnrN95Rg+a/ihJgzjKuv6/HlUyxlW6gHprWAGLfFjJqkxQjK2wxBLPsX yUyS+GeWUBvtSOWRn8MWNt32qTsWdKZ9EmgVTnyRUQs0lCKHKwLD2HGCsnlJ2iB4nIeg 3qD9Agc92VoH0gcZPV2rFAVoF87BL3rqtQUUfWajreQx8dPHrJVjNlG/Co8jSCT15kpA GjyA8VBlCajMneuTdEY0JXomO9WlFN1tCdwwl+HmQPKQxBcoX/s9EzDhVjpTKoJ5OY8v PSDw== X-Gm-Message-State: AHQUAua1kh1v9NHp5g6ssWByfSx4K8NJPS3rrXvrOne2ZAfA3goSJWH8 0WfNJir0sXSIwiMFy/FKuQfJQ8TDvQvdxjJOKEtb+oujER4to/06J7ZmdPRYVgMgmrjMZhnZGC7 R5y+fmXP2XpGBScchUahR5LS1AmO4JalDPryEsB7Vf+8chpVj91S5RrhOB5OYVJJqKiiA10EuFY wU7KMIaN2wKTpIRJlnAb9hr4z9N7G4x/j9TI2LcvnV+MVtWJgC2ulA1m8LZsQqH3RUZyUAvUdQ2 ve8zx/gQ+MyzY5SboYtrcSkqMmVtB3EN+DfCXsEBbxCRyaF5FcN1+pxnLCJAxWKh/78Tru3yayW 1PbcQPw1JMsnX58agyl7XE+JwThtsZBsjeSm/llPMg+1QVlts7hI0TyXc/N92GCTp5G6z8PQSno 3 X-Received: by 2002:a25:9944:: with SMTP id n4mr5162506ybo.84.1548786939862; Tue, 29 Jan 2019 10:35:39 -0800 (PST) X-Received: by 2002:a25:9944:: with SMTP id n4mr5162429ybo.84.1548786938374; Tue, 29 Jan 2019 10:35:38 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1548786938; cv=none; d=google.com; s=arc-20160816; b=juLSCMDJJZ0W4VWZl1b0FcFuDS2SNfZPFr79LPBLFMS9RDc1a39D1V1YrG1WpT7caH zUsAASwUAAu61RX/uXloUBb93GPLrUILFXvkuCrrhqTFpeCNBw+X/Rlrv59YeK6rMq2X R5gytbLTIYEcDxNEkvCqP1L8HkWEnnYpsMmno3n/MYEZw4ai4IOaESjDOV8mNH9oFmdG iSukgsP+vIKo2lNkb0uOUjK7VpIjVPcsW7Q8atWOEzWOQ8P3x6EgdpYx7ClQdPkTPedq QbCmDHFI6O52qEZKpkp2SUPhsjGdaU4i3k0gPrp4nRNAhCnyvyNWRp//CA2MELh4mtfD gq2Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=user-agent:in-reply-to:content-disposition:mime-version:references :message-id:subject:cc:to:from:date:dkim-signature; bh=FTv/ayI6eNRu9V59TgMNsy0y0Pdy4VkeVoplB/wraOE=; b=owLl1CVrWXP56S/xciSK9Wc821X6BJe1VAdFJEerzWD3AP4XL5w5cgbjzAKRr+iK28 PsJ6DjLoRZbL5F3ElYx5zcIFZ9J3ZCRkA8Z4WeRQ7VT/ohJ6MkJo9SKRIJPtSSui0SHw t/yPBEIBfOaqekZKwuzoneDWKlAg+nykM+/Amnl7vkTuZwxsMHct3W9hJwEDkmpQjzcE j7WqTlDX/DI9XaN/q256YAp6R0RfvZu+Hg5gXYMh7HBhrTbo0R3csHgdxA9NknOmNz6R CsZLNVHKy3xaaAKOSvYQoZySohQ6KZ1jxRL5TC6rH3Bf3Vw9rXLbfpSzaoMGYMQa/9w1 7U6g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cmpxchg-org.20150623.gappssmtp.com header.s=20150623 header.b=BbU4HeLE; spf=pass (google.com: domain of hannes@cmpxchg.org designates 209.85.220.65 as permitted sender) smtp.mailfrom=hannes@cmpxchg.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. [209.85.220.65]) by mx.google.com with SMTPS id z1sor16072594ybg.187.2019.01.29.10.35.37 for (Google Transport Security); Tue, 29 Jan 2019 10:35:38 -0800 (PST) Received-SPF: pass (google.com: domain of hannes@cmpxchg.org designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; Authentication-Results: mx.google.com; dkim=pass header.i=@cmpxchg-org.20150623.gappssmtp.com header.s=20150623 header.b=BbU4HeLE; spf=pass (google.com: domain of hannes@cmpxchg.org designates 209.85.220.65 as permitted sender) smtp.mailfrom=hannes@cmpxchg.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=FTv/ayI6eNRu9V59TgMNsy0y0Pdy4VkeVoplB/wraOE=; b=BbU4HeLEGzf5EgjgZ3+Iqx4Yxa8riC1V+8GhlnLnbuEjbEBSmUHXArX0d5/+nSF3we 8J+kefO1e3dc4M64/cXl90OL9BjQzjAtd0bgLVnxtwIq/mFZ0SimTJsykuIKKlsc8Cyu 1rGT9omA2Wl3AgBiS2mWLZnkxL1bSCZm/oW+v6iLx1biir6x7NPhFG1DLFRo5DI3bH72 4wav23ri+61Ie7zxGHWhxSCJyeuNSyY6SzShzEGFWnerkgrToUI3JL7ZQWBHXBoPyyuA JLx8Hv7LFNSJkxCKgqDdNPd4rrGown3ozJCI0WdLNLAnhYM4tsVeFXb2nO5Tops8jiwJ Rw4A== X-Google-Smtp-Source: ALg8bN4wF13z1kBa+gPhI4fAKj4abbx6h9bme5Z9tivunJ4hJTnfr16r4q0FPv8qyNxhjhw+J8X9qw== X-Received: by 2002:a25:3292:: with SMTP id y140mr16143456yby.45.1548786937499; Tue, 29 Jan 2019 10:35:37 -0800 (PST) Received: from localhost ([2620:10d:c091:200::4:1d25]) by smtp.gmail.com with ESMTPSA id x133sm18913944ywg.57.2019.01.29.10.35.36 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 29 Jan 2019 10:35:36 -0800 (PST) Date: Tue, 29 Jan 2019 13:35:35 -0500 From: Johannes Weiner To: Minchan Kim Cc: Suren Baghdasaryan , gregkh@linuxfoundation.org, tj@kernel.org, lizefan@huawei.com, axboe@kernel.dk, dennis@kernel.org, dennisszhou@gmail.com, mingo@redhat.com, peterz@infradead.org, akpm@linux-foundation.org, corbet@lwn.net, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@android.com Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor Message-ID: <20190129183535.GB7871@cmpxchg.org> References: <20190124211518.244221-1-surenb@google.com> <20190124211518.244221-6-surenb@google.com> <20190128235358.GA211479@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190128235358.GA211479@google.com> User-Agent: Mutt/1.11.2 (2019-01-07) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hi Minchan, good to see your name on the lists again :) On Tue, Jan 29, 2019 at 08:53:58AM +0900, Minchan Kim wrote: > On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote: > > @@ -68,6 +69,50 @@ struct psi_group_cpu { > > u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp; > > }; > > > > +/* PSI growth tracking window */ > > +struct psi_window { > > + /* Window size in ns */ > > + u64 size; > > As rest of field are time, how about "interval" instead of size? If it were "size" on its own, I would agree, but "window size" is an existing term that works pretty well here. "window interval" wouldn't. > > + /* Start time of the current window in ns */ > > + u64 start_time; > > + > > + /* Value at the start of the window */ > > + u64 start_value; > > "value" is rather vague. starting_stall? I'm not a fan of using stall here, because it reads like an event, when it's really a time interval we're interested in. For an abstraction that samples time intervals, value seems like a pretty good, straight-forward name... > > + > > + /* Value growth per previous window(s) */ > > + u64 per_win_growth; > > Rather than per, prev would be more meaninful, I think. > How about prev_win_stall? Agreed on the "per", but still not loving the stall. How about prev_delta? prev_growth? > > +struct psi_trigger { > > + /* PSI state being monitored by the trigger */ > > + enum psi_states state; > > + > > + /* User-spacified threshold in ns */ > > + u64 threshold; > > + > > + /* List node inside triggers list */ > > + struct list_head node; > > + > > + /* Backpointer needed during trigger destruction */ > > + struct psi_group *group; > > + > > + /* Wait queue for polling */ > > + wait_queue_head_t event_wait; > > + > > + /* Pending event flag */ > > + int event; > > + > > + /* Tracking window */ > > + struct psi_window win; > > + > > + /* > > + * Time last event was generated. Used for rate-limiting > > + * events to one per window > > + */ > > + u64 last_event_time; > > +}; > > + > > struct psi_group { > > /* Protects data used by the aggregator */ > > struct mutex update_lock; > > @@ -75,6 +120,8 @@ struct psi_group { > > /* Per-cpu task state & time tracking */ > > struct psi_group_cpu __percpu *pcpu; > > > > + /* Periodic work control */ > > + atomic_t polling; > > struct delayed_work clock_work; > > > > /* Total stall times observed */ > > @@ -85,6 +132,18 @@ struct psi_group { > > u64 avg_last_update; > > u64 avg_next_update; > > unsigned long avg[NR_PSI_STATES - 1][3]; > > + > > + /* Configured polling triggers */ > > + struct list_head triggers; > > + u32 nr_triggers[NR_PSI_STATES - 1]; > > + u32 trigger_mask; > > This is a state we have an interest. > How about trigger_states? Sounds good to me, I'd also switch change_mask below to changed_states: if (changed_states & trigger_states) /* engage! */ [ After reading the rest, I see Minchan proposed the same. ] > > + u64 trigger_min_period; > > + > > + /* Polling state */ > > + /* Total stall times at the start of monitor activation */ > > + u64 polling_total[NR_PSI_STATES - 1]; > > + u64 polling_next_update; > > + u64 polling_until; > > }; > > > > #else /* CONFIG_PSI */ > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > index e8cd12c6a553..de3ac22a5e23 100644 > > --- a/kernel/cgroup/cgroup.c > > +++ b/kernel/cgroup/cgroup.c > > @@ -3464,7 +3464,101 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) > > { > > return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU); > > } > > -#endif > > + > > +static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf, > > + size_t nbytes, enum psi_res res) > > +{ > > + enum psi_states state; > > + struct psi_trigger *old; > > + struct psi_trigger *new; > > + struct cgroup *cgrp; > > + u32 threshold_us; > > + u32 win_sz_us; > > window_us? We don't really encode units in variables in the rest of the code, maybe we can drop it here as well. Btw, it looks like the original reason for splitting up trigger_parse and trigger_create seems gone from the code. Can we merge them again and keep all those details out of the filesystem ->write methods? new = psi_trigger_create(group, buf, nbytes, res); > > + ssize_t ret; > > + > > + cgrp = cgroup_kn_lock_live(of->kn, false); > > + if (!cgrp) > > + return -ENODEV; > > + > > + cgroup_get(cgrp); > > + cgroup_kn_unlock(of->kn); > > + > > + ret = psi_trigger_parse(buf, nbytes, res, > > + &state, &threshold_us, &win_sz_us); > > + if (ret) { > > + cgroup_put(cgrp); > > + return ret; > > + } > > + > > + new = psi_trigger_create(&cgrp->psi, > > + state, threshold_us, win_sz_us); > > + if (IS_ERR(new)) { > > + cgroup_put(cgrp); > > + return PTR_ERR(new); > > + } > > + > > + old = of->priv; > > + rcu_assign_pointer(of->priv, new); > > + if (old) { > > + synchronize_rcu(); > > + psi_trigger_destroy(old); > > + } > > + > > + cgroup_put(cgrp); > > + > > + return nbytes; > > +} > > + > > +static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of, > > + char *buf, size_t nbytes, > > + loff_t off) > > +{ > > + return cgroup_pressure_write(of, buf, nbytes, PSI_IO); > > +} > > + > > +static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of, > > + char *buf, size_t nbytes, > > + loff_t off) > > +{ > > + return cgroup_pressure_write(of, buf, nbytes, PSI_MEM); > > +} > > + > > +static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of, > > + char *buf, size_t nbytes, > > + loff_t off) > > +{ > > + return cgroup_pressure_write(of, buf, nbytes, PSI_CPU); > > +} > > + > > +static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, > > + poll_table *pt) > > +{ > > + struct psi_trigger *t; > > + __poll_t ret; > > + > > + rcu_read_lock(); > > + t = rcu_dereference(of->priv); > > + if (t) > > + ret = psi_trigger_poll(t, of->file, pt); > > + else > > + ret = DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; > > + rcu_read_unlock(); > > + > > + return ret; > > +} > > + > > +static void cgroup_pressure_release(struct kernfs_open_file *of) > > +{ > > + struct psi_trigger *t = of->priv; > > + > > + if (!t) > > + return; > > + > > + rcu_assign_pointer(of->priv, NULL); > > + synchronize_rcu(); > > + psi_trigger_destroy(t); > > +} > > +#endif /* CONFIG_PSI */ > > > > static int cgroup_file_open(struct kernfs_open_file *of) > > { > > @@ -4619,18 +4713,27 @@ static struct cftype cgroup_base_files[] = { > > .name = "io.pressure", > > .flags = CFTYPE_NOT_ON_ROOT, > > .seq_show = cgroup_io_pressure_show, > > + .write = cgroup_io_pressure_write, > > + .poll = cgroup_pressure_poll, > > + .release = cgroup_pressure_release, > > }, > > { > > .name = "memory.pressure", > > .flags = CFTYPE_NOT_ON_ROOT, > > .seq_show = cgroup_memory_pressure_show, > > + .write = cgroup_memory_pressure_write, > > + .poll = cgroup_pressure_poll, > > + .release = cgroup_pressure_release, > > }, > > { > > .name = "cpu.pressure", > > .flags = CFTYPE_NOT_ON_ROOT, > > .seq_show = cgroup_cpu_pressure_show, > > + .write = cgroup_cpu_pressure_write, > > + .poll = cgroup_pressure_poll, > > + .release = cgroup_pressure_release, > > }, > > -#endif > > +#endif /* CONFIG_PSI */ > > { } /* terminate */ > > }; > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > > index c366503ba135..fefb98f19a80 100644 > > --- a/kernel/sched/psi.c > > +++ b/kernel/sched/psi.c > > @@ -4,6 +4,9 @@ > > * Copyright (c) 2018 Facebook, Inc. > > * Author: Johannes Weiner > > * > > + * Polling support by Suren Baghdasaryan > > + * Copyright (c) 2018 Google, Inc. > > + * > > * When CPU, memory and IO are contended, tasks experience delays that > > * reduce throughput and introduce latencies into the workload. Memory > > * and IO contention, in addition, can cause a full loss of forward > > @@ -126,11 +129,16 @@ > > > > #include > > #include > > +#include > > #include > > #include > > +#include > > #include > > #include > > #include > > +#include > > +#include > > +#include > > #include > > #include "sched.h" > > > > @@ -150,11 +158,16 @@ static int __init setup_psi(char *str) > > __setup("psi=", setup_psi); > > > > /* Running averages - we need to be higher-res than loadavg */ > > -#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ > > +#define PSI_FREQ (2*HZ+1UL) /* 2 sec intervals */ > > #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */ > > #define EXP_60s 1981 /* 1/exp(2s/60s) */ > > #define EXP_300s 2034 /* 1/exp(2s/300s) */ > > > > +/* PSI trigger definitions */ > > +#define PSI_TRIG_MIN_WIN_US 500000 /* Min window size is 500ms */ > > +#define PSI_TRIG_MAX_WIN_US 10000000 /* Max window size is 10s */ > > +#define PSI_TRIG_UPDATES_PER_WIN 10 /* 10 updates per window */ > > To me, it's rather long. > How about WINDOW_MIN_US, WINDOW_MAX_US, UPDATES_PER_WINDOW? Sounds good to me too. I'm just ambivalent on the _US suffix. Dealer's choice, though. > > + > > /* Sampling frequency in nanoseconds */ > > static u64 psi_period __read_mostly; > > > > @@ -173,8 +186,17 @@ static void group_init(struct psi_group *group) > > for_each_possible_cpu(cpu) > > seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq); > > group->avg_next_update = sched_clock() + psi_period; > > + atomic_set(&group->polling, 0); > > INIT_DELAYED_WORK(&group->clock_work, psi_update_work); > > mutex_init(&group->update_lock); > > + /* Init trigger-related members */ > > + INIT_LIST_HEAD(&group->triggers); > > + memset(group->nr_triggers, 0, sizeof(group->nr_triggers)); > > + group->trigger_mask = 0; > > + group->trigger_min_period = U32_MAX; > > + memset(group->polling_total, 0, sizeof(group->polling_total)); > > + group->polling_next_update = ULLONG_MAX; > > + group->polling_until = 0; > > } > > > > void __init psi_init(void) > > @@ -209,10 +231,11 @@ static bool test_state(unsigned int *tasks, enum psi_states state) > > } > > } > > > > -static void get_recent_times(struct psi_group *group, int cpu, u32 *times) > > +static u32 get_recent_times(struct psi_group *group, int cpu, u32 *times) > > Rather awkward to return change_mask when we consider function name as > get_recent_times It would be better to add additional parameter > instead of return value. Good suggestion, I have to agree this would be nicer. > > { > > struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); > > u64 now, state_start; > > + u32 change_mask = 0; > > enum psi_states s; > > unsigned int seq; > > u32 state_mask; > > @@ -245,7 +268,11 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times) > > groupc->times_prev[s] = times[s]; > > > > times[s] = delta; > > + if (delta) > > + change_mask |= (1 << s); > > } > > + > > + return change_mask; > > } > > > > static void calc_avgs(unsigned long avg[3], int missed_periods, > > @@ -268,17 +295,14 @@ static void calc_avgs(unsigned long avg[3], int missed_periods, > > avg[2] = calc_load(avg[2], EXP_300s, pct); > > } > > > > -static bool update_stats(struct psi_group *group) > > +static u32 collect_percpu_times(struct psi_group *group) > > Not sure it's a good idea to add "implementation facility" in here. > How about update_stall_time with additional parameter of > "[changed|updated]_states? > > IOW, > static void update_stall_times(struct psi_group *group, u32 *changed_states) I disagree on this one. collect_percpu_times() isn't too detailed of a name, but it does reflect the complexity/cost of the function and the structure that is being aggregated, which is a good thing. But the return-by-parameter is a good idea. > > u64 deltas[NR_PSI_STATES - 1] = { 0, }; > > - unsigned long missed_periods = 0; > > unsigned long nonidle_total = 0; > > - u64 now, expires, period; > > + u32 change_mask = 0; > > int cpu; > > int s; > > > > - mutex_lock(&group->update_lock); > > - > > /* > > * Collect the per-cpu time buckets and average them into a > > * single time sample that is normalized to wallclock time. > > @@ -291,7 +315,7 @@ static bool update_stats(struct psi_group *group) > > u32 times[NR_PSI_STATES]; > > u32 nonidle; > > > > - get_recent_times(group, cpu, times); > > + change_mask |= get_recent_times(group, cpu, times); > > > > nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]); > > nonidle_total += nonidle; > > @@ -316,11 +340,18 @@ static bool update_stats(struct psi_group *group) > > for (s = 0; s < NR_PSI_STATES - 1; s++) > > group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL)); > > > > + return change_mask; > > +} > > + > > +static u64 calculate_averages(struct psi_group *group, u64 now) > > time? > > The function name is still awkward to me. > If someone see this function, he will expect return value as "average", not next_update. > If we want to have next_update as return value, it would better to rename *update_avgs*. update_averages() would be nice, agreed. But I disagree on the now -> time. time is really vague - could be a random timestamp or a period. We use "now" everywhere in this code to mean the current time (cpu clock in cpu-local paths, sched clock for global stuff), so let's keep it here as well. > > +{ > > + unsigned long missed_periods = 0; > > + u64 expires, period; > > + u64 avg_next_update; > > + int s; > > + > > /* avgX= */ > > - now = sched_clock(); > > expires = group->avg_next_update; > > - if (now < expires) > > - goto out; > > if (now - expires > psi_period) > > missed_periods = div_u64(now - expires, psi_period); > > > > @@ -331,7 +362,7 @@ static bool update_stats(struct psi_group *group) > > * But the deltas we sample out of the per-cpu buckets above > > * are based on the actual time elapsing between clock ticks. > > */ > > - group->avg_next_update = expires + ((1 + missed_periods) * psi_period); > > + avg_next_update = expires + ((1 + missed_periods) * psi_period); > > period = now - (group->avg_last_update + (missed_periods * psi_period)); > > group->avg_last_update = now; > > > > @@ -361,20 +392,237 @@ static bool update_stats(struct psi_group *group) > > group->avg_total[s] += sample; > > calc_avgs(group->avg[s], missed_periods, sample, period); > > } > > -out: > > - mutex_unlock(&group->update_lock); > > - return nonidle_total; > > + > > + return avg_next_update; > > +} > > + > > +/* Trigger tracking window manupulations */ > > +static void window_init(struct psi_window *win, u64 now, u64 value) > > +{ > > + win->start_value = value; > > + win->start_time = now; > > + win->per_win_growth = 0; > > +} > > + > > +/* > > + * PSI growth tracking window update and growth calculation routine. > > Let's add empty line here. Agreed. > > + * This approximates a sliding tracking window by interpolating > > + * partially elapsed windows using historical growth data from the > > + * previous intervals. This minimizes memory requirements (by not storing > > + * all the intermediate values in the previous window) and simplifies > > + * the calculations. It works well because PSI signal changes only in > > + * positive direction and over relatively small window sizes the growth > > + * is close to linear. > > + */ > > +static u64 window_update(struct psi_window *win, u64 now, u64 value) > > Hope to change now as just time for function. > > Insetad of value, couldn't we use more concrete naming? > Maybe stall_time or just stall? Disagreed on both :) > > +{ > > + u64 interval; > > elapsed? Hm, elapsed is a bit better, but how about period? We use that in the averages code for the same functionality. > > + u64 growth; > > + > > + interval = now - win->start_time; > > + growth = value - win->start_value; > > + /* > > + * After each tracking window passes win->start_value and > > + * win->start_time get reset and win->per_win_growth stores > > + * the average per-window growth of the previous window. > > + * win->per_win_growth is then used to interpolate additional > > + * growth from the previous window assuming it was linear. > > + */ > > + if (interval > win->size) { > > + win->per_win_growth = growth; > > + win->start_value = value; > > + win->start_time = now; > > We can use window_init via adding per_win_growth in the function > parameter. Maybe, window_reset would be better function name. > > > + } else { > > + u32 unelapsed; > > remaining? remained? Yup, much better. > > + > > + unelapsed = win->size - interval; > > + growth += div_u64(win->per_win_growth * unelapsed, win->size); > > + } > > + > > + return growth; > > +} > > + > > +static void init_triggers(struct psi_group *group, u64 now) > > +{ > > + struct psi_trigger *t; > > + > > + list_for_each_entry(t, &group->triggers, node) > > + window_init(&t->win, now, group->total[t->state]); > > + memcpy(group->polling_total, group->total, > > + sizeof(group->polling_total)); > > + group->polling_next_update = now + group->trigger_min_period; > > +} > > + > > +static u64 poll_triggers(struct psi_group *group, u64 now) > > How about update_[poll|trigger]_stat? update_triggers()? The signature already matches the update_averages() one, so we might as well do the same thing there I guess. > > +{ > > + struct psi_trigger *t; > > + bool new_stall = false; > > + > > + /* > > + * On subsequent updates, calculate growth deltas and let > > + * watchers know when their specified thresholds are exceeded. > > + */ > > + list_for_each_entry(t, &group->triggers, node) { > > + u64 growth; > > + > > + /* Check for stall activity */ > > + if (group->polling_total[t->state] == group->total[t->state]) > > + continue; > > + > > + /* > > + * Multiple triggers might be looking at the same state, > > + * remember to update group->polling_total[] once we've > > + * been through all of them. Also remember to extend the > > + * polling time if we see new stall activity. > > + */ > > + new_stall = true; > > + > > + /* Calculate growth since last update */ > > + growth = window_update(&t->win, now, group->total[t->state]); > > + if (growth < t->threshold) > > + continue; > > + > > + /* Limit event signaling to once per window */ > > + if (now < t->last_event_time + t->win.size) > > + continue; > > + > > + /* Generate an event */ > > + if (cmpxchg(&t->event, 0, 1) == 0) > > + wake_up_interruptible(&t->event_wait); > > + t->last_event_time = now; > > + } > > + > > + if (new_stall) { > > + memcpy(group->polling_total, group->total, > > + sizeof(group->polling_total)); > > + } > > + > > + return now + group->trigger_min_period; > > } > > > > +/* > > + * psi_update_work represents slowpath accounting part while psi_group_change > > + * represents hotpath part. There are two potential races between them: > > + * 1. Changes to group->polling when slowpath checks for new stall, then hotpath > > + * records new stall and then slowpath resets group->polling flag. This leads > > + * to the exit from the polling mode while monitored state is still changing. > > + * 2. Slowpath overwriting an immediate update scheduled from the hotpath with > > + * a regular update further in the future and missing the immediate update. > > + * Both races are handled with a retry cycle in the slowpath: > > + * > > + * HOTPATH: | SLOWPATH: > > + * | > > + * A) times[cpu] += delta | E) delta = times[*] > > + * B) start_poll = (delta[poll_mask] &&| if delta[poll_mask]: > > + * cmpxchg(g->polling, 0, 1) == 0)| F) polling_until = now + grace_period > > + * if start_poll: | if now > polling_until: > > + * C) mod_delayed_work(1) | if g->polling: > > + * else if !delayed_work_pending():| G) g->polling = polling = 0 > > + * D) schedule_delayed_work(PSI_FREQ)| smp_mb > > + * | H) goto SLOWPATH > > + * | else: > > + * | if !g->polling: > > + * | I) g->polling = polling = 1 > > + * | J) if delta && first_pass: > > + * | next_avg = calculate_averages() > > + * | if polling: > > + * | next_poll = poll_triggers() > > + * | if (delta && first_pass) || polling: > > + * | K) mod_delayed_work( > > + * | min(next_avg, next_poll)) > > + * | if !polling: > > + * | first_pass = false > > + * | L) goto SLOWPATH > > + * > > + * Race #1 is represented by (EABGD) sequence in which case slowpath deactivates > > + * polling mode because it misses new monitored stall and hotpath doesn't > > + * activate it because at (B) g->polling is not yet reset by slowpath in (G). > > + * This race is handled by the (H) retry, which in the race described above > > + * results in the new sequence of (EABGDHEIK) that reactivates polling mode. > > + * > > + * Race #2 is represented by polling==false && (JABCK) sequence which overwrites > > + * immediate update scheduled at (C) with a later (next_avg) update scheduled at > > + * (K). This race is handled by the (L) retry which results in the new sequence > > + * of polling==false && (JABCKLEIK) that reactivates polling mode and > > + * reschedules the next polling update (next_poll). > > + * > > + * Note that retries can't result in an infinite loop because retry #1 happens > > + * only during polling reactivation and retry #2 happens only on the first pass. > > + * Constant reactivations are impossible because polling will stay active for at > > + * least grace_period. Worst case scenario involves two retries (HEJKLE) > > + */ > > static void psi_update_work(struct work_struct *work) > > { > > struct delayed_work *dwork; > > struct psi_group *group; > > + bool first_pass = true; > > + u64 next_update; > > + u32 change_mask; > > How about [changed|updated]_states? changed_states sounds good to me.