From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_PASS,URIBL_BLOCKED,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9A243C10F0E for ; Thu, 18 Apr 2019 14:05:38 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 4B2ED217D7 for ; Thu, 18 Apr 2019 14:05:38 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="PKLIIu/2" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4B2ED217D7 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id B05636B0008; Thu, 18 Apr 2019 10:05:37 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id AB6A66B000C; Thu, 18 Apr 2019 10:05:37 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9A3C66B000D; Thu, 18 Apr 2019 10:05:37 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from mail-yb1-f200.google.com (mail-yb1-f200.google.com [209.85.219.200]) by kanga.kvack.org (Postfix) with ESMTP id 7B56D6B0008 for ; Thu, 18 Apr 2019 10:05:37 -0400 (EDT) Received: by mail-yb1-f200.google.com with SMTP id 62so1595455ybg.11 for ; Thu, 18 Apr 2019 07:05:37 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:dkim-signature:mime-version:references :in-reply-to:from:date:message-id:subject:to:cc; bh=lxZOPtOiRntngjfC1zUz7S6vThNcfbNwqp8rwY4BU+o=; b=Tx/tkcEOjZ/E/L2+qEfq2SGPYOllH8O09oKLaCfpx7jJQlmdYDgNeUM+2v2MWA6CSQ twG8yEuTK4imFrThlYfW1gDM28Q5in7wpB3odcvMEP4N2aY4rJN0rErcjmNWXQFB06xN cI1LI0hK3XB/1FOs43DOcbHdaJCyY472O6ktx+T68ndJeFFsHynxTnZMlwHgNxFaPa05 PvOc6c5GWwn5Wqk5dKkch7TMvalLkpQcKbdGm0ztFaafoOBkjJyeqLgVnX1EZMcpb1+e GI5Bx3F3BwyPp5GkoeYhevmmGXjwV8mAETUOGv7YGVavbNrMwIOD/xRoTSAXo1TjSg1B KnlA== X-Gm-Message-State: APjAAAVPOo31s2emEeH7N5fxf6K42bf9WFIcPYzMOvCLTprljGtePIFT ZAgRH4AK3C64YdZS6987hGv6YQvs0PkbUAUn0rM+9cKFUp2S0N0hnMCt5/2HTxoazwTI4mAQSQ6 8+eqNsRvSrwdWVRWNGaGiakgbETQrnsKHaVx0UBrxua5XnV0TmSRbOuyRu6rt38Td/A== X-Received: by 2002:a25:9743:: with SMTP id h3mr50680703ybo.218.1555596337133; Thu, 18 Apr 2019 07:05:37 -0700 (PDT) X-Received: by 2002:a25:9743:: with SMTP id h3mr50680629ybo.218.1555596336436; Thu, 18 Apr 2019 07:05:36 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1555596336; cv=none; d=google.com; s=arc-20160816; b=ClxEEoXP0ZpbNXhqg3BQVx6bSpiD2DlNaBpr2yKjRQ26cNGKaaM9+vlkabjKqHhQlE zzjzyMdoCyv3p+7jMDbGFNBPoE/5bXDxczMaTXsoxqIV/QiiWOSOPmX5641ZQ+vSQNHL mj33SCTZC8zq+ifDgrjD1ekIo/gi2Fpqo0I3sjG7MUnLOaKPCB5Rq5VaUri5zYRpXO0T B7t/55e/Gmz+pBLDsoaALuAd5Y4h5lQQNxWt93Gli44gNQXyo9kczoTmXvBeqgrTCj+J 3zjwOvim0d42bBlKxdYGp+aGcONY/gkEs/am/jRy8nnowwoZ+3POzaiaL5SV/5o+Vfs6 +27A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=lxZOPtOiRntngjfC1zUz7S6vThNcfbNwqp8rwY4BU+o=; b=kjdR4MMlUYIJm3H6esyf2Iu7bSr9Q5WSyRu2K7lflKSRM9scUx82Trsx4EmzBYA23o ihGZ5naZPKpLFVirc5dGHSsUk0SI2/AxoYlVFETdp3BKECwu7yRzuDRZYQyFKGObaTkZ 5ZZyKLHfopNdDzh/x89LuHNtUL1efeb32tlJSsWylIh5YmRH7nkJmWuHT3qHhPt1CKwS Eosqr5KtOnQ/vyRY1t8LQl34Fn9a7m28+vaEpoIrJNspIgYDysAJFK2nTrs+j8CoLOUL c2RpUFDXtuflm41QRfRZ4bGPZwscsdYiDKpRWKVo8gh3CLfkRrQwCC4HHvwKXYMJz0PH ZZpg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b="PKLIIu/2"; spf=pass (google.com: domain of shakeelb@google.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=shakeelb@google.com; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from mail-sor-f65.google.com (mail-sor-f65.google.com. [209.85.220.65]) by mx.google.com with SMTPS id r17sor963166ybb.118.2019.04.18.07.05.36 for (Google Transport Security); Thu, 18 Apr 2019 07:05:36 -0700 (PDT) Received-SPF: pass (google.com: domain of shakeelb@google.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b="PKLIIu/2"; spf=pass (google.com: domain of shakeelb@google.com designates 209.85.220.65 as permitted sender) smtp.mailfrom=shakeelb@google.com; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=lxZOPtOiRntngjfC1zUz7S6vThNcfbNwqp8rwY4BU+o=; b=PKLIIu/2p1mf1PwPOq/Dq4fgAcfc8HRGD9pYpQXnNgCf9RbbH0YRWi21j0rUDYr5wQ Eob7OjAetNA6zLyyXYPOL0nUEo98qJwiD3yTkPG/k5nGP6SimBhHkWL+LlSa6bnTTLo4 GMxZZ/n2prtENzG93LCA1AE7rm0kn1+PyjMbZ5FAPSjbKY9CemkZ6yBzoJtz9ke26xjy iKnQP2356YiuOQxUNHDA5OeSj/ioRpaHWQCYsuEWVpL9SiId0bpGAb/Te1mqcI9aIPrj fu7IlcI8zKJ0kSR2a3dvrD3Mn9XC32xKmal7fY0PNo/p/vqCij5V1Mgnpw+aX+AnLUzk yhtg== X-Google-Smtp-Source: APXvYqxVayXWw+tHK7hWSvZynRaOADWkL/7dyCgTmVc6P5Equy80JD+hVLds59IhUiVRwpl/bfjKkWQKxUgeKU9O4w0= X-Received: by 2002:a25:ac41:: with SMTP id r1mr10138415ybd.377.1555596335704; Thu, 18 Apr 2019 07:05:35 -0700 (PDT) MIME-Version: 1.0 References: <20190417215434.25897-1-guro@fb.com> <20190417215434.25897-5-guro@fb.com> <20190418003850.GA13977@tower.DHCP.thefacebook.com> <20190418030729.GA5038@castle> In-Reply-To: <20190418030729.GA5038@castle> From: Shakeel Butt Date: Thu, 18 Apr 2019 07:05:24 -0700 Message-ID: Subject: Re: [PATCH 4/5] mm: rework non-root kmem_cache lifecycle management To: Roman Gushchin Cc: Roman Gushchin , Andrew Morton , Linux MM , LKML , Kernel Team , Johannes Weiner , Michal Hocko , Rik van Riel , "david@fromorbit.com" , Christoph Lameter , Pekka Enberg , Vladimir Davydov , Cgroups Content-Type: text/plain; charset="UTF-8" X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Apr 17, 2019 at 8:07 PM Roman Gushchin wrote: > > On Wed, Apr 17, 2019 at 06:55:12PM -0700, Shakeel Butt wrote: > > On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin wrote: > > > > > > On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote: > > > > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin wrote: > > > > > > > > > > This commit makes several important changes in the lifecycle > > > > > of a non-root kmem_cache, which also affect the lifecycle > > > > > of a memory cgroup. > > > > > > > > > > Currently each charged slab page has a page->mem_cgroup pointer > > > > > to the memory cgroup and holds a reference to it. > > > > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches > > > > > are freed, all other are freed on cgroup release. > > > > > > > > No, they are not freed (i.e. destroyed) on offlining, only > > > > deactivated. All memcg kmem_caches are freed/destroyed on memcg's > > > > css_free. > > > > > > You're right, my bad. I was thinking about the corresponding sysfs entry > > > when was writing it. We try to free it from the deactivation path too. > > > > > > > > > > > > > > > > > So the current scheme can be illustrated as: > > > > > page->mem_cgroup->kmem_cache. > > > > > > > > > > To implement the slab memory reparenting we need to invert the scheme > > > > > into: page->kmem_cache->mem_cgroup. > > > > > > > > > > Let's make every page to hold a reference to the kmem_cache (we > > > > > already have a stable pointer), and make kmem_caches to hold a single > > > > > reference to the memory cgroup. > > > > > > > > What about memcg_kmem_get_cache()? That function assumes that by > > > > taking reference on memcg, it's kmem_caches will stay. I think you > > > > need to get reference on the kmem_cache in memcg_kmem_get_cache() > > > > within the rcu lock where you get the memcg through css_tryget_online. > > > > > > Yeah, a very good question. > > > > > > I believe it's safe because css_tryget_online() guarantees that > > > the cgroup is online and won't go offline before css_free() in > > > slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1 > > > and drop it on offlining, so it protects the online kmem_cache. > > > > > > > Let's suppose a thread doing a remote charging calls > > memcg_kmem_get_cache() and gets an empty kmem_cache of the remote > > memcg having refcnt equal to 1. That thread got a reference on the > > remote memcg but no reference on the kmem_cache. Let's suppose that > > thread got stuck in the reclaim and scheduled away. In the meantime > > that remote memcg got offlined and decremented the refcnt of all of > > its kmem_caches. The empty kmem_cache which the thread stuck in > > reclaim have pointer to can get deleted and may be using an already > > destroyed kmem_cache after coming back from reclaim. > > > > I think the above situation is possible unless the thread gets the > > reference on the kmem_cache in memcg_kmem_get_cache(). > > Yes, you're right and I'm writing a nonsense: css_tryget_online() > can't prevent the cgroup from being offlined. > The reason I knew about that race is because I tried something similar but for different use-case: https://lkml.org/lkml/2018/3/26/472 > So, the problem with getting a reference in memcg_kmem_get_cache() > is that it's an atomic operation on the hot path, something I'd like > to avoid. > > I can make the refcounter percpu, but it'll add some complexity and size > to the kmem_cache object. Still an option, of course. > I kind of prefer this option. > I wonder if we can use rcu_read_lock() instead, and bump the refcounter > only if we're going into reclaim. > > What do you think? Should it be just reclaim or anything that can reschedule the current thread? I can tell how we resolve the similar issue for our eager-kmem_cache-deletion use-case. Our solution (hack) works only for CONFIG_SLAB (we only use SLAB) and non-preemptible kernel. The underlying motivation was to reduce the overhead of slab reaper of traversing thousands of empty offlined kmem caches. CONFIG_SLAB disables interrupts before accessing the per-cpu caches and reenables the interrupts if it has to fallback to the page allocation. We use this window to call memcg_kmem_get_cache() and only increment the refcnt of kmem_cache if going to the fallback. Thus no need to do atomic operation on the hot path. Anyways, I think having percpu refcounter for each memcg kmem_cache is not that costy for CONFIG_MEMCG_KMEM users and to me that seems like the most simple solution. Shakeel