From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9D929C76196 for ; Thu, 23 Mar 2023 16:45:39 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3948B6B0072; Thu, 23 Mar 2023 12:45:39 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 345A26B0074; Thu, 23 Mar 2023 12:45:39 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 20D1E6B0075; Thu, 23 Mar 2023 12:45:39 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 11B7C6B0072 for ; Thu, 23 Mar 2023 12:45:39 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id A77671603FD for ; Thu, 23 Mar 2023 16:45:38 +0000 (UTC) X-FDA: 80600739156.16.E65F7DF Received: from mail-wr1-f53.google.com (mail-wr1-f53.google.com [209.85.221.53]) by imf11.hostedemail.com (Postfix) with ESMTP id D2E7640014 for ; Thu, 23 Mar 2023 16:45:36 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=qDusDQcI; spf=pass (imf11.hostedemail.com: domain of shakeelb@google.com designates 209.85.221.53 as permitted sender) smtp.mailfrom=shakeelb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1679589936; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=GglKicPIARqjK50ePWIEou5wLn8UHWFeaeY7+rjXEwI=; b=VhUeUP+D2EEXy4UdUH7CBRGTZUPnHOSoXHqnymQQijYHyxVHRsIuSEFtc6ainhJlC/WYbR n1BKMcHrnhw4cNn3nSgC+KepH1NdWiJuhzrZ0xdvZEIBxjQvSfx55fSEs6SLXtoTluUBox e3zcxqBbbfrVjUyRDeGCLGkOI/cARCY= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=qDusDQcI; spf=pass (imf11.hostedemail.com: domain of shakeelb@google.com designates 209.85.221.53 as permitted sender) smtp.mailfrom=shakeelb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1679589936; a=rsa-sha256; cv=none; b=UW2ym2jpIrK8BTnQ4gi1vIPbH8MI1QKEJAqjPOD8TX8Z2bO5sAcS5SXl2JDogJsOge9Lni 4AOo7QGVISl8aXCibT8pYveIWfxbZ+uF4EBuvtSN/37D5f5Kn31OcF1qaF4QG1ndK1Fr4D LsUPkmQZoXf0D489FST0EgYN5ofa25w= Received: by mail-wr1-f53.google.com with SMTP id v1so15215905wrv.1 for ; Thu, 23 Mar 2023 09:45:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1679589935; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=GglKicPIARqjK50ePWIEou5wLn8UHWFeaeY7+rjXEwI=; b=qDusDQcIUB4xyGrh0Cy0pH4GJXpbd3dBvtD3xReDL20t/GutJptx7dOvT4vlZdqKI3 f+fWp4deIzDbuOXho4jdEyhGJ09Xa27VcoxrPtkG5R0wKq0G7Q0d7rZYqyTfujICDJOz iuWKWXvAzM22e22natXMUpus3et4jQPBuPeOT3gbm0Mlyxjo/kxGQFoR4SGwQ3jKsTnT BTidcAWwLqvWaic+X/v0ffejNUKAS5lmND0yzPTIIy2pIZpPFUQvlsCES8MUG15SNwqZ fB6xcKWpDZezHntFyeWpTZwVIMFwjI9Nzh3eHR5yL9Q2SXI2A/gWdF3BMTOj6udQZDKZ wyQw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1679589935; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=GglKicPIARqjK50ePWIEou5wLn8UHWFeaeY7+rjXEwI=; b=BxAW15zBT7De/41leJVbNyqlZeQzSnFLp+XqIoCazqy7ZXM/frnf2pyK0HqerPMarz 2uube1h3Jxd5xFw9IQ6Jq+BmcLsZ5mFj9y3Ljj9qPgc5+NKY3WBWg0EMJd1GGKP50z2q U9dZs6ZwGZON6sgd2Ox2kpGrngw8QfpGnOgADiLjm0BOVjBcJyPd5ssXoYYFysZajkBk lvoP0QtUK70K/rozov36qBTA7/2PFSmBmNrhqImbesugwjzL4u52UOA+rupW9rgHGeUK IAbsi0Naz3MyhcUMmpM/UPo2VqL9QKFvC93IOSQVa4Uok2JxVEx4pcbGua0XJx1lMugm 9pLA== X-Gm-Message-State: AAQBX9cmjejdHIyfzeXwadhC3qTbZe62NXC4u7ULOjds1CIESJqdT8ck receESx007jKXE9dKzwoSn00EmNf+Vn6sVyzpEQP3g== X-Google-Smtp-Source: AKy350ZlSuPs5ToB6LO/yJNmvDf9+my4w0sgXcPgDet+2XYNHi3vh+JlZRxgO7wkE7KZJ50aT5NNg6E0z4j8xc76wLc= X-Received: by 2002:adf:fd81:0:b0:2c7:1483:9479 with SMTP id d1-20020adffd81000000b002c714839479mr897523wrr.11.1679589935242; Thu, 23 Mar 2023 09:45:35 -0700 (PDT) MIME-Version: 1.0 References: <20230323040037.2389095-1-yosryahmed@google.com> <20230323040037.2389095-2-yosryahmed@google.com> In-Reply-To: From: Shakeel Butt Date: Thu, 23 Mar 2023 09:45:20 -0700 Message-ID: Subject: Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock To: Yosry Ahmed Cc: Tejun Heo , Josef Bacik , Jens Axboe , Zefan Li , Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Andrew Morton , Vasily Averin , cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam03 X-Stat-Signature: wypqnmq9r3bdzp5tpzc8fmt17acr6otp X-Rspamd-Queue-Id: D2E7640014 X-HE-Tag: 1679589936-789385 X-HE-Meta: U2FsdGVkX193ZZ6/4VrOzBcD8i/H3wY2AO9WUEiCVzqfsLkzWSl7ZW7jnX6jDiwQRZzffV/YRy4xCOJnFBqAkrb/9fh+oMRRMjizD9354T0sBdGR4ZSSP7xAfixjgimFh82qIKO5+Rj2WDrta2Lb54nSPzhB2+nBTjsgW/xSjlJotn0k/v86796RDhM8O9SWj+DmpgpYOklGdLH7At0DcWeTIqzxH050F+GxzJ0xnebMpK4uHV9vFlt2Tbic2VpH7Mw0Irgt2f8vHHlms7JjhQ7lwlXKxhCH8T4Ir7cLQXKvp0oE5c6ZVQlc9twOv0gL+qwG5FAoVSRU/9T0iH+WSVAKzi9pevAYVFHh+QK1KNFrFZSdWroVUktujpinIXgDz2S44zdVDP5BhNg56zTWWsY8xYgvplDPp20/D2sKBXx9NRt591U3fnzhGoblY1oJg6U3AY8ag/7rHM0+8aEfvNKMgl+dOjhO9bTUYC1UkqFCYEuQxChNQlJC7eXwm+UhnMH7jFsT7QKbQT08v1qs9e+sEznkj2cHXDlDhF+Il3Fsk9iAQwBipXuuBA3hHv0q+LJHRDec4VckWkDsMk2BjSCVsVeKINHgqJK/lYYtD/EpczGg+1vMVPN/mED1FJiF8F21Vtglr5CtlJUAdL7I8mhdRG+8WKELNIMPV6nQl2QcH+tDMV1NvTamG4uxUjRgSphqFc/cs7YRotYSEjh9WpBjzoOUXhz7oGpskCuYaRE1Ja9+qHN5wiDpi72f/WSmD+XHDYX6SgV9Mdaxg2mWd6PdVMrWy7kFn2evtwekLg4vYEwcxCpYkmGlJOe6/B/1JwhghN8SpqNKm6t8Myl2UqrxOBxDdWUeNDcdpLA2JHJ5ulAv7jntRu77JSfjrq+4W+d1lVcnFyGwdL17z2eUUs8gr5N4hd1eVbezppCLzLUXQTP25bd4S+OKl3t2A5lNFZbJpnHPGaVZ/Ig78dO G0MO/sQP XSD2ww9jCoig/zffLe6V07qINqX8lcGCmsXyWiUxcqxzyir0chJpNT2kzPvodyIiOPRjs5x/8te1tFrJmu8CCIPPVpMY9gPIVskSimUs1x8WWKIj8iJEoDhBRdnxV1nsIoe1KbIDiN/M3/BnaoxS6aIK5m65KiGJ7zbBEanCnxBtqOpMCxw9LdAtiAICIaGacEWg9//DL6b+PYHkVhp+dpnybZdgzTH9GJGFqh6f5g5dMeN8Em8eXJ3H/NxilIkNZ43CHCRlpMvSwMBs= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Mar 23, 2023 at 9:37=E2=80=AFAM Yosry Ahmed = wrote: > > On Thu, Mar 23, 2023 at 9:29=E2=80=AFAM Shakeel Butt wrote: > > > > On Thu, Mar 23, 2023 at 9:18=E2=80=AFAM Yosry Ahmed wrote: > > > > > > On Thu, Mar 23, 2023 at 9:10=E2=80=AFAM Shakeel Butt wrote: > > > > > > > > On Thu, Mar 23, 2023 at 8:46=E2=80=AFAM Shakeel Butt wrote: > > > > > > > > > > On Thu, Mar 23, 2023 at 8:43=E2=80=AFAM Yosry Ahmed wrote: > > > > > > > > > > > > On Thu, Mar 23, 2023 at 8:40=E2=80=AFAM Shakeel Butt wrote: > > > > > > > > > > > > > > On Thu, Mar 23, 2023 at 6:36=E2=80=AFAM Yosry Ahmed wrote: > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path= : > > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgrou= p_usage(). I > > > > > > > > > > added the protection against flushing in an interrupt c= ontext for > > > > > > > > > > future callers as well, as it may cause a deadlock if w= e don't disable > > > > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_us= age() is only > > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() in= terested in root > > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_= threshold() ? > > > > > > > > > > > > > > > > > > > > I am not sure, but the code looks like event notificati= ons may be set > > > > > > > > > > up on root memcg, which is why we need to check thresho= lds. > > > > > > > > > > > > > > > > > > This is something we should deprecate as root memcg's usa= ge is ill defined. > > > > > > > > > > > > > > > > Right, but I think this would be orthogonal to this patch s= eries. > > > > > > > > > > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabli= ng lock > > > > > > > without either breaking a link between mem_cgroup_threshold a= nd > > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without d= isabling > > > > > > > irqs. > > > > > > > > > > > > > > So, this patch can not be applied before either of those two = tasks are > > > > > > > done (and we may find more such scenarios). > > > > > > > > > > > > > > > > > > Could you elaborate why? > > > > > > > > > > > > My understanding is that with an in_task() check to make sure w= e only > > > > > > acquire cgroup_rstat_lock from non-irq context it should be fin= e to > > > > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > > > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be = taken > > > > > with irq disabled while other code paths will take cgroup_rstat_l= ock > > > > > with irq enabled. This is a potential deadlock hazard unless > > > > > cgroup_rstat_lock is always taken with irq disabled. > > > > > > > > Oh you are making sure it is not taken in the irq context through > > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommen= ded > > > > to actually remove all such users instead of silently > > > > ignoring/bypassing the functionality. > > > > > > It is a workaround, we simply accept to read stale stats in irq > > > context instead of the expensive flush operation. > > > > > > > > > > > So, how about removing mem_cgroup_flush_stats() from > > > > mem_cgroup_usage(). It will break the known chain which is taking > > > > cgroup_rstat_lock with irq disabled and you can add > > > > WARN_ON_ONCE(!in_task()). > > > > > > This changes the behavior in a more obvious way because: > > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage= () > > > path is also exercised in a lot of paths outside irq context, this > > > will change the behavior for any event thresholds on the root memcg. > > > With proposed skipped flushing in irq context we only change the > > > behavior in a small subset of cases. > > > > > > I think we can skip flushing in irq context for now, and separately > > > deprecate threshold events for the root memcg. When that is done we > > > can come back and remove should_skip_flush() and add a VM_BUG_ON or > > > WARN_ON_ONCE instead. WDYT? > > > > > > 2. mem_cgroup_usage() is also used when reading usage from userspace. > > > This should be an easy workaround though. > > > > This is a cgroup v1 behavior and to me it is totally reasonable to get > > the 2 second stale root's usage. Even if you want to skip flushing in > > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the > > rstat core code. This way we will know if other subsystems are doing > > the same or not. > > We can do that. Basically in mem_cgroup_usage() have: > > /* Some useful comment */ > if (in_task()) > mem_cgroup_flush_stats(); > > and in cgroup_rstat_flush() have: > WARN_ON_ONCE(!in_task()); > > I am assuming VM_BUG_ON is not used outside mm code. > > The only thing that worries me is that if there is another unlikely > path somewhere that flushes stats in irq context we may run into a > deadlock. I am a little bit nervous about not skipping flushing if > !in_task() in cgroup_rstat_flush(). I think it is a good thing. We will find such scenarios and fix those instead of hiding them forever or keeping the door open for new such scenarios.