To: "Dave Chinner"
Cc: "guzebing", "Vlastimil Babka"
References: <20260115021108.1913695-1-guzebing1612@gmail.com>
Date: Mon, 16 Mar 2026 19:22:39 +0800
Subject: Re: [PATCH v3] iomap: add allocation cache for iomap_dio
From: "changfengnan"
> From: "Dave Chinner"
> Date: Thu, Jan 15, 2026, 13:02
> Subject: Re: [PATCH v3] iomap: add allocation cache for iomap_dio
> To: "guzebing"
> Cc: "Fengnan Chang", "Vlastimil Babka"
>
> [cc linux-mm]
>
> On Thu, Jan 15, 2026 at 10:11:08AM +0800, guzebing wrote:
> > As implemented by the bio structure, we do the same thing on the
> > iomap_dio structure. Add a per-cpu cache for iomap_dio allocations,
> > enabling us to quickly recycle them instead of going through the slab
> > allocator.
> >
> > By making such changes, we can reduce memory allocation on the direct
> > IO path, so that direct IO will not block due to insufficient system
> > memory. In addition, for direct IO, the read performance of io_uring
> > is improved by about 2.6%.
>
> Honestly, this just feels wrong.
>
> If heap memory allocation has performance issues, then the right
> solution is to fix the memory allocator.
>
> Oh, wait, you're copy-pasting the hacky per-cpu bio allocator cache
> lists into the iomap DIO code.
>
> IMO, this really should be part of the generic memory allocation
> APIs, not repeatedly tacked on the outside of specific individual
> object allocations.
>
> Huh. per-cpu free lists is the traditional SLAB allocator
> architecture. That was removed a while back because SLUB performs
> better in most cases....
>
> ISTR somebody was already working to optimise the SLUB allocator to
> address these corner case shortcomings w.r.t. traditional SLABs.
>
> Yup:
>
> commit 2d517aa09bbc4203f10cdee7e1d42f3bbdc1b1cd
> Author: Vlastimil Babka
> Date:   Wed Sep 3 14:59:45 2025 +0200
>
>     slab: add opt-in caching layer of percpu sheaves
>
>     Specifying a non-zero value for a new struct kmem_cache_args field
>     sheaf_capacity will setup a caching layer of percpu arrays called
>     sheaves of given capacity for the created cache.
>
>     Allocations from the cache will allocate via the percpu sheaves (main or
>     spare) as long as they have no NUMA node preference. Frees will also
>     put the object back into one of the sheaves.
>
>     When both percpu sheaves are found empty during an allocation, an empty
>     sheaf may be replaced with a full one from the per-node barn. If none
>     are available and the allocation is allowed to block, an empty sheaf is
>     refilled from slab(s) by an internal bulk alloc operation. When both
>     percpu sheaves are full during freeing, the barn can replace a full one
>     with an empty one, unless over a full sheaves limit. In that case a
>     sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
>     sheaves and barns is also wired to the existing cpu flushing and cache
>     shrinking operations.
>
>     The sheaves do not distinguish NUMA locality of the cached objects. If
>     an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
>     with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
>     the sheaves are bypassed.
>
>     The bulk operations exposed to slab users also try to utilize the
>     sheaves as long as the necessary (full or empty) sheaves are available
>     on the cpu or in the barn. Once depleted, they will fallback to bulk
>     alloc/free to slabs directly to avoid double copying.
>
>     The sheaf_capacity value is exported in sysfs for observability.
>
>     Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
>     count objects allocated or freed using the sheaves (and thus not
>     counting towards the other alloc/free path counters). Counters
>     sheaf_refill and sheaf_flush count objects filled or flushed from or to
>     slab pages, and can be used to assess how effective the caching is. The
>     refill and flush operations will also count towards the usual
>     alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
>     the backing slabs. For barn operations, barn_get and barn_put count how
>     many full sheaves were get from or put to the barn, the _fail variants
>     count how many such requests could not be satisfied mainly because the
>     barn was either empty or full. While the barn also holds empty sheaves
>     to make some operations easier, these are not as critical to mandate own
>     counters. Finally, there are sheaf_alloc/sheaf_free counters.
>
>     Access to the percpu sheaves is protected by local_trylock() when
>     potential callers include irq context, and local_lock() otherwise (such
>     as when we already know the gfp flags allow blocking). The trylock
>     failures should be rare and we can easily fallback. Each per-NUMA-node
>     barn has a spin_lock.
>
>     When slub_debug is enabled for a cache with sheaf_capacity also
>     specified, the latter is ignored so that allocations and frees reach the
>     slow path where debugging hooks are processed. Similarly, we ignore it
>     with CONFIG_SLUB_TINY which prefers low memory usage to performance.
>
>     [boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ]
>
>     Reported-and-tested-by: Venkat Rao Bagalkote
>     Reviewed-by: Harry Yoo
>     Reviewed-by: Suren Baghdasaryan
>     Signed-off-by: Vlastimil Babka
>
> Yeah, recent code, functionality is not enabled by default yet. So,
> kmem_cache_alloc() with:
>
> struct kmem_cache_args {
> .....
>         /**
>          * @sheaf_capacity: Enable sheaves of given capacity for the cache.
>          *
>          * With a non-zero value, allocations from the cache go through caching
>          * arrays called sheaves. Each cpu has a main sheaf that's always
>          * present, and a spare sheaf that may be not present. When both become
>          * empty, there's an attempt to replace an empty sheaf with a full sheaf
>          * from the per-node barn.
>          *
>          * When no full sheaf is available, and gfp flags allow blocking, a
>          * sheaf is allocated and filled from slab(s) using bulk allocation.
>          * Otherwise the allocation falls back to the normal operation
>          * allocating a single object from a slab.
>          *
>          * Analogically when freeing and both percpu sheaves are full, the barn
>          * may replace it with an empty sheaf, unless it's over capacity. In
>          * that case a sheaf is bulk freed to slab pages.
>          *
>          * The sheaves do not enforce NUMA placement of objects, so allocations
>          * via kmem_cache_alloc_node() with a node specified other than
>          * NUMA_NO_NODE will bypass them.
>          *
>          * Bulk allocation and free operations also try to use the cpu sheaves
>          * and barn, but fallback to using slab pages directly.
>          *
>          * When slub_debug is enabled for the cache, the sheaf_capacity argument
>          * is ignored.
>          *
>          * %0 means no sheaves will be created.
>          */
>         unsigned int sheaf_capacity;
> }
>
> set to the value required is all we need. i.e.
> something like this
> in iomap_dio_init():
>
>         struct kmem_cache_args kmem_args = {
>                 .sheaf_capacity = 256,
>         };
>
>         dio_kmem_cache = kmem_cache_create("iomap_dio", sizeof(struct iomap_dio),
>                         &kmem_args, SLAB_PANIC | SLAB_ACCOUNT);
>
> And changing the allocation to kmem_cache_alloc(dio_kmem_cache,
> GFP_KERNEL) should provide the same sort of performance improvement
> as this patch does.
>
> Can you test this, please?

Hi Dave:

Sorry it took so long to respond. Guzebing was busy with something else, so
I ran this test myself.

I tested sheaf_capacity on 7.0-rc3; it doesn't show any performance
improvement. Besides, I wrote a simple kernel module to measure the
difference: it creates a normal kmem_cache and one with sheaf_capacity, then
times allocating 32 objects and freeing all 32 again. The sheaf_capacity
cache gave a roughly 10% improvement in time spent.

I'm thinking these improvements may not be significant enough to show up in
the IO flow. Using a simple free list still seems to be the most efficient
approach.

Thanks.
Fengnan.

>
> If it doesn't provide any performance improvement, then I suspect
> that Vlastimil will be interested to find out why....
>
> Also, if it does work, it is likely the bioset mempools (which are
> slab based) can be initialised similarly, removing the need for
> custom per-cpu free lists in the block layer, too.
>
> -Dave.
>
> >
> > v3:
> > kmalloc now is called outside the get_cpu/put_cpu code section.
> >
> > v2:
> > Factor percpu cache into common code and the iomap module uses it.
> >
> > v1:
> > https://lore.kernel.org/all/20251121090052.384823-1-guzebing1612@gmail.com/
> >
> > Tested-by: syzbot@syzkaller.appspotmail.com
> >
> > Suggested-by: Fengnan Chang
> > Signed-off-by: guzebing
> > ---
> >  fs/iomap/direct-io.c | 133 ++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 130 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 5d5d63efbd57..4421e4ad3a8f 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -56,6 +56,130 @@ struct iomap_dio {
> >  	};
> >  };
> >
> > +#define PCPU_CACHE_IRQ_THRESHOLD	16
> > +#define PCPU_CACHE_ELEMENT_SIZE(pcpu_cache_list) \
> > +	(sizeof(struct pcpu_cache_element) + pcpu_cache_list->element_size)
> > +#define PCPU_CACHE_ELEMENT_GET_HEAD_FROM_PAYLOAD(payload) \
> > +	((struct pcpu_cache_element *)((unsigned long)(payload) - \
> > +			sizeof(struct pcpu_cache_element)))
> > +#define PCPU_CACHE_ELEMENT_GET_PAYLOAD_FROM_HEAD(head) \
> > +	((void *)((unsigned long)(head) + sizeof(struct pcpu_cache_element)))
> > +
> > +struct pcpu_cache_element {
> > +	struct pcpu_cache_element	*next;
> > +	char	payload[];
> > +};
> > +struct pcpu_cache {
> > +	struct pcpu_cache_element	*free_list;
> > +	struct pcpu_cache_element	*free_list_irq;
> > +	int			nr;
> > +	int			nr_irq;
> > +};
> > +struct pcpu_cache_list {
> > +	struct pcpu_cache __percpu *cache;
> > +	size_t element_size;
> > +	int max_nr;
> > +};
> > +
> > +static struct pcpu_cache_list *pcpu_cache_list_create(int max_nr, size_t size)
> > +{
> > +	struct pcpu_cache_list *pcpu_cache_list;
> > +
> > +	pcpu_cache_list = kmalloc(sizeof(struct pcpu_cache_list), GFP_KERNEL);
> > +	if (!pcpu_cache_list)
> > +		return NULL;
> > +
> > +	pcpu_cache_list->element_size = size;
> > +	pcpu_cache_list->max_nr = max_nr;
> > +	pcpu_cache_list->cache = alloc_percpu(struct pcpu_cache);
> > +	if (!pcpu_cache_list->cache) {
> > +		kfree(pcpu_cache_list);
> > +		return NULL;
> > +	}
> > +	return pcpu_cache_list;
> > +}
> > +
> > +static void pcpu_cache_list_destroy(struct pcpu_cache_list *pcpu_cache_list)
> > +{
> > +	free_percpu(pcpu_cache_list->cache);
> > +	kfree(pcpu_cache_list);
> > +}
> > +
> > +static void irq_cache_splice(struct pcpu_cache *cache)
> > +{
> > +	unsigned long flags;
> > +
> > +	/* cache->free_list must be empty */
> > +	if (WARN_ON_ONCE(cache->free_list))
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +	cache->free_list = cache->free_list_irq;
> > +	cache->free_list_irq = NULL;
> > +	cache->nr += cache->nr_irq;
> > +	cache->nr_irq = 0;
> > +	local_irq_restore(flags);
> > +}
> > +
> > +static void *pcpu_cache_list_alloc(struct pcpu_cache_list *pcpu_cache_list)
> > +{
> > +	struct pcpu_cache *cache;
> > +	struct pcpu_cache_element *cache_element;
> > +
> > +	cache = per_cpu_ptr(pcpu_cache_list->cache, get_cpu());
> > +	if (!cache->free_list) {
> > +		if (READ_ONCE(cache->nr_irq) >= PCPU_CACHE_IRQ_THRESHOLD)
> > +			irq_cache_splice(cache);
> > +		if (!cache->free_list) {
> > +			put_cpu();
> > +			cache_element = kmalloc(PCPU_CACHE_ELEMENT_SIZE(pcpu_cache_list),
> > +						GFP_KERNEL);
> > +			if (!cache_element)
> > +				return NULL;
> > +			return PCPU_CACHE_ELEMENT_GET_PAYLOAD_FROM_HEAD(cache_element);
> > +		}
> > +	}
> > +
> > +	cache_element = cache->free_list;
> > +	cache->free_list = cache_element->next;
> > +	cache->nr--;
> > +	put_cpu();
> > +	return PCPU_CACHE_ELEMENT_GET_PAYLOAD_FROM_HEAD(cache_element);
> > +}
> > +
> > +static void pcpu_cache_list_free(void *payload, struct pcpu_cache_list *pcpu_cache_list)
> > +{
> > +	struct pcpu_cache *cache;
> > +	struct pcpu_cache_element *cache_element;
> > +
> > +	cache_element = PCPU_CACHE_ELEMENT_GET_HEAD_FROM_PAYLOAD(payload);
> > +
> > +	cache = per_cpu_ptr(pcpu_cache_list->cache, get_cpu());
> > +	if (READ_ONCE(cache->nr_irq) + cache->nr >= pcpu_cache_list->max_nr)
> > +		goto out_free;
> > +
> > +	if (in_task()) {
> > +		cache_element->next = cache->free_list;
> > +		cache->free_list = cache_element;
> > +		cache->nr++;
> > +	} else if (in_hardirq()) {
> > +		lockdep_assert_irqs_disabled();
> > +		cache_element->next = cache->free_list_irq;
> > +		cache->free_list_irq = cache_element;
> > +		cache->nr_irq++;
> > +	} else {
> > +		goto out_free;
> > +	}
> > +	put_cpu();
> > +	return;
> > +out_free:
> > +	put_cpu();
> > +	kfree(cache_element);
> > +}
> > +
> > +#define DIO_ALLOC_CACHE_MAX		256
> > +static struct pcpu_cache_list *dio_pcpu_cache_list;
> > +
> >  static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
> >  		struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
> >  {
> > @@ -135,7 +259,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  			ret += dio->done_before;
> >  	}
> >  	trace_iomap_dio_complete(iocb, dio->error, ret);
> > -	kfree(dio);
> > +	pcpu_cache_list_free(dio, dio_pcpu_cache_list);
> >  	return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_dio_complete);
> > @@ -620,7 +744,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	if (!iomi.len)
> >  		return NULL;
> >
> > -	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
> > +	dio = pcpu_cache_list_alloc(dio_pcpu_cache_list);
> >  	if (!dio)
> >  		return ERR_PTR(-ENOMEM);
> >
> > @@ -804,7 +928,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	return dio;
> >
> > out_free_dio:
> > -	kfree(dio);
> > +	pcpu_cache_list_free(dio, dio_pcpu_cache_list);
> >  	if (ret)
> >  		return ERR_PTR(ret);
> >  	return NULL;
> > @@ -834,6 +958,9 @@ static int __init iomap_dio_init(void)
> >  	if (!zero_page)
> >  		return -ENOMEM;
> >
> > +	dio_pcpu_cache_list = pcpu_cache_list_create(DIO_ALLOC_CACHE_MAX, sizeof(struct iomap_dio));
> > +	if (!dio_pcpu_cache_list)
> > +		return -ENOMEM;
> >  	return 0;
> >  }
> >  fs_initcall(iomap_dio_init);
> > --
> > 2.20.1
> >
>
> --
> Dave Chinner
> david@fromorbit.com
>