Date: Fri, 24 Mar 2023 16:25:39 +1100
From: Dave Chinner <david@fromorbit.com>
To: Uladzislau Rezki
Cc: Lorenzo Stoakes, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Andrew Morton, Baoquan He,
	Matthew Wilcox, David Hildenbrand, Liu Shixin, Jiri Olsa
Subject: Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
References: <6c7f1ac0aeb55faaa46a09108d3999e4595870d9.1679209395.git.lstoakes@gmail.com>
On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> Hello, Dave.
> 
> > 
> > I'm travelling right now, but give me a few days and I'll test this
> > against the XFS workloads that hammer the global vmalloc spin lock
> > really, really badly. XFS can use vm_map_ram and vmalloc really
> > heavily for metadata buffers and hit the global spin lock from every
> > CPU in the system at the same time (i.e. highly concurrent
> > workloads). vmalloc is also heavily used in the hottest path
> > through the journal, where we process and calculate delta changes to
> > several million items every second, again spread across every CPU in
> > the system at the same time.
> > 
> > We really need the global spinlock to go away completely, but in the
> > mean time a shared read lock should help a little bit....
> > 
> Could you please share some steps on how to run your workloads in
> order to exercise the vmalloc() code? I would like to have a look at
> it in more detail, just to understand the workloads.

Go search lore for the fsmark scalability benchmarks I've been running
for the past 12-13 years on XFS. Essentially they are highly concurrent
create/walk/modify/remove workloads on cold-cache, limited-memory
configurations, hammering the caches by turning over inodes as fast as
the system can possibly stream them in and out of memory....

> 
> urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> ./xfs_log_priv.h:675: * Log vector and shadow buffers can be large, so we need to use kvmalloc() here
> ./xfs_log_priv.h:676: * to ensure success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts
> ./xfs_log_priv.h:677: * to fall back to vmalloc, so we can't actually do anything useful with gfp
> ./xfs_log_priv.h:678: * flags to control the kmalloc() behaviour within kvmalloc(). Hence kmalloc()
> ./xfs_log_priv.h:681: * vmalloc if it can't get somethign straight away from the free lists or
> ./xfs_log_priv.h:682: * buddy allocator. Hence we have to open code kvmalloc outselves here.
> ./xfs_log_priv.h:686: * allocations. This is actually the only way to make vmalloc() do GFP_NOFS
> ./xfs_log_priv.h:691:xlog_kvmalloc(

Did you read the comment above this function? I mean, it's all about
how poorly kvmalloc() works for the highly concurrent, fail-fast
context that occurs in the journal commit fast path, and how we open
code it with kmalloc and vmalloc to work "ok" in this path.

Then if you go look at the commits related to it, you might find that
XFS developers tend to write properly useful changelogs to document
things like "it's better, but vmalloc will soon have lock contention
problems if we hit it any harder"....

commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner
Date:   Tue Jan 4 17:22:18 2022 -0800

xfs: reduce kvmalloc overhead for CIL shadow buffers

Oh, let me count the ways that the kvmalloc API sucks dog eggs.

The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and behaviour
utterly sucks:

- 49.73% xlog_cil_commit
  - 31.62% kvmalloc_node
    - 29.96% __kmalloc_node
      - 29.38% kmalloc_large_node
        - 29.33% __alloc_pages
          - 24.33% __alloc_pages_slowpath.constprop.0
            - 18.35% __alloc_pages_direct_compact
              - 17.39% try_to_compact_pages
                - compact_zone_order
                  - 15.26% compact_zone
                      5.29% __pageblock_pfn_to_page
                      3.71% PageHuge
                    - 1.44% isolate_migratepages_block
                        0.71% set_pfnblock_flags_mask
                      1.11% get_pfnblock_flags_mask
              - 0.81% get_page_from_freelist
                - 0.59% _raw_spin_lock_irqsave
                  - do_raw_spin_lock
                      __pv_queued_spin_lock_slowpath
            - 3.24% try_to_free_pages
              - 3.14% shrink_node
                - 2.94% shrink_slab.constprop.0
                  - 0.89% super_cache_count
                    - 0.66% xfs_fs_nr_cached_objects
                      - 0.65% xfs_reclaim_inodes_count
                          0.55% xfs_perag_get_tag
                    0.58% kfree_rcu_shrink_count
            - 2.09% get_page_from_freelist
              - 1.03% _raw_spin_lock_irqsave
                - do_raw_spin_lock
                    __pv_queued_spin_lock_slowpath
          - 4.88% get_page_from_freelist
            - 3.66% _raw_spin_lock_irqsave
              - do_raw_spin_lock
                  __pv_queued_spin_lock_slowpath
    - 1.63% __vmalloc_node
      - __vmalloc_node_range
        - 1.10% __alloc_pages_bulk
          - 0.93% __alloc_pages
            - 0.92% get_page_from_freelist
              - 0.89% rmqueue_bulk
                - 0.69% _raw_spin_lock
                  - do_raw_spin_lock
                      __pv_queued_spin_lock_slowpath
    13.73% memcpy_erms
  - 2.22% kvfree

On this workload, that's almost a dozen CPUs all trying to compact and
reclaim memory inside kvmalloc_node at the same time. Yet it is
regularly falling back to vmalloc despite all that compaction, page and
shrinker reclaim that direct reclaim is doing. Copying all the metadata
is taking far less CPU time than allocating the storage!

Direct reclaim should be considered extremely harmful.

This is a high frequency, high throughput, CPU usage and latency
sensitive allocation. We've got memory there, and we're using kvmalloc
to allow memory allocation to avoid doing lots of work to try to do
contiguous allocations.

Except it still does *lots of costly work* that is unnecessary.

Worse: the only way to avoid the slowpath page allocation trying to do
compaction on costly allocations is to turn off direct reclaim (i.e.
remove __GFP_DIRECT_RECLAIM from the gfp flags).

Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
GFP_KERNEL allocation context, so you only get kmalloc!". This cuts off
the vmalloc fallback, and this leads to almost instant OOM problems
which end up in filesystem deadlocks, shutdowns and/or kernel crashes.

I want some basic kvmalloc behaviour:

- kmalloc for a contiguous range with fail fast semantics - no
  compaction direct reclaim if the allocation enters the slow path.
- run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails

The really, really stupid part about this is that these kvmalloc()
calls are run under memalloc_nofs task context, so all the allocations
are always reduced to GFP_NOFS regardless of the fact that kvmalloc
requires GFP_KERNEL to be passed in.
IOWs, we're already telling kvmalloc to behave differently to the gfp
flags we pass in, but it still won't allow vmalloc to be run with
anything other than GFP_KERNEL.

So, this patch open codes the kvmalloc() in the commit path to have the
above described behaviour.

The result is we more than halve the CPU time spent doing kvmalloc() in
this path, and the rate of transaction commits with 64kB objects in
them more than doubles. i.e. we get a ~5x reduction in CPU usage per
costly-sized kvmalloc() invocation, and the profile looks like this:

- 37.60% xlog_cil_commit
    16.01% memcpy_erms
  - 8.45% __kmalloc
    - 8.04% kmalloc_order_trace
      - 8.03% kmalloc_order
        - 7.93% alloc_pages
          - 7.90% __alloc_pages
            - 4.05% __alloc_pages_slowpath.constprop.0
              - 2.18% get_page_from_freelist
              - 1.77% wake_all_kswapds
                  ....
                - __wake_up_common_lock
                  - 0.94% _raw_spin_lock_irqsave
            - 3.72% get_page_from_freelist
              - 2.43% _raw_spin_lock_irqsave
  - 5.72% vmalloc
    - 5.72% __vmalloc_node_range
      - 4.81% __get_vm_area_node.constprop.0
        - 3.26% alloc_vmap_area
          - 2.52% _raw_spin_lock
        - 1.46% _raw_spin_lock
        0.56% __alloc_pages_bulk
  - 4.66% kvfree
    - 3.25% vfree
      - __vfree
        - 3.23% __vunmap
          - 1.95% remove_vm_area
            - 1.06% free_vmap_area_noflush
              - 0.82% _raw_spin_lock
            - 0.68% _raw_spin_lock
          - 0.92% _raw_spin_lock
    - 1.40% kfree
      - 1.36% __free_pages
        - 1.35% __free_pages_ok
          - 1.02% _raw_spin_lock_irqsave

It's worth noting that over 50% of the CPU time spent allocating these
shadow buffers is now spent on spinlocks. So the shadow buffer
allocation overhead is greatly reduced by getting rid of direct reclaim
from kmalloc, and could probably be made even less costly if vmalloc()
didn't use global spinlocks to protect its structures.

Signed-off-by: Dave Chinner
Reviewed-by: Allison Henderson
Reviewed-by: Darrick J. Wong

-- 
Dave Chinner
david@fromorbit.com
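P.S. For anyone who doesn't want to dig through xfs_log_priv.h, the
open-coded behaviour the changelog describes boils down to something
like the sketch below. This is an illustration of the pattern, not the
verbatim kernel source: strip direct reclaim from the gfp flags so the
kmalloc slow path fails fast instead of entering compaction, then fall
back to vmalloc, looping because the journal commit path cannot
tolerate allocation failure.

```c
/*
 * Sketch of the open-coded kvmalloc pattern from the commit message
 * above (not the verbatim xfs_log_priv.h code).
 */
static inline void *
xlog_kvmalloc(size_t buf_size)
{
	gfp_t	flags = GFP_KERNEL;
	void	*p;

	/* Fail fast: no compaction/direct reclaim in the kmalloc slow path. */
	flags &= ~__GFP_DIRECT_RECLAIM;

	do {
		p = kmalloc(buf_size, flags);
		if (!p)
			p = vmalloc(buf_size);	/* GFP_KERNEL fallback */
	} while (!p);

	return p;
}
```

Note that because these calls run under memalloc_nofs task context, the
effective allocation context is still GFP_NOFS even though GFP_KERNEL
is what gets passed down.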