linux-mm.kvack.org archive mirror
From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev,
	 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	rientjes@google.com
Subject: Re: [PATCH] iommu/iova: use named kmem_cache for iova magazines
Date: Thu, 1 Feb 2024 17:10:59 -0500	[thread overview]
Message-ID: <CA+CK2bBRK=cDe5a-TuLMUs6DWRO06FzL6JDmoMmnwFfdt+Z-kg@mail.gmail.com> (raw)
In-Reply-To: <84c7e816-f749-48d8-a429-8b0ef799cdbb@arm.com>

On Thu, Feb 1, 2024 at 4:23 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2024-02-01 9:06 pm, Pasha Tatashin wrote:
> > On Thu, Feb 1, 2024 at 3:56 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 2024-02-01 7:30 pm, Pasha Tatashin wrote:
> >>> From: Pasha Tatashin <pasha.tatashin@soleen.com>
> >>>
> >>> The magazine buffers can take gigabytes of kmem memory, dominating all
> >>> other allocations. For observability purposes, create a named slab cache so
> >>> that the iova magazine memory overhead can be clearly observed.
> >>>
> >>> With this change:
> >>>
> >>>> slabtop -o | head
> >>>    Active / Total Objects (% used)    : 869731 / 952904 (91.3%)
> >>>    Active / Total Slabs (% used)      : 103411 / 103974 (99.5%)
> >>>    Active / Total Caches (% used)     : 135 / 211 (64.0%)
> >>>    Active / Total Size (% used)       : 395389.68K / 411430.20K (96.1%)
> >>>    Minimum / Average / Maximum Object : 0.02K / 0.43K / 8.00K
> >>>
> >>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>>  244412 244239  99%    1.00K  61103        4    244412K iommu_iova_magazine
> >>>   91636  88343  96%    0.03K    739      124      2956K kmalloc-32
> >>>   75744  74844  98%    0.12K   2367       32      9468K kernfs_node_cache
> >>>
> >>> On this machine it is now clear that magazines use 242M of kmem memory.
> >>
> >> Hmm, something smells there...
> >>
> >> In the "worst" case there should be a maximum of 6 * 2 *
> >> num_online_cpus() empty magazines in the iova_cpu_rcache structures,
> >> i.e., 12KB per CPU. Under normal use those will contain at least some
> >> PFNs, but more importantly every additional magazine stored in a depot
> >> is full, holding 127 PFNs, and each of those PFNs is backed by a
> >> 40-byte struct iova, i.e. ~5KB per 1KB magazine. Unless that machine
> >> has many thousands of CPUs, if iova_magazine allocations are the top
> >> consumer of memory then something's gone wrong.
> >
> > This is an upstream kernel plus a few drivers, booted on an AMD EPYC
> > with 128 CPUs.
> >
> > It has allocation stacks like these:
> > init_iova_domain+0x1ed/0x230 iommu_setup_dma_ops+0xf8/0x4b0
> > amd_iommu_probe_finalize.
> > There are also init_iova_domain() calls from Google's TPU drivers.
> > 242M is actually not that much, compared to the size of the system.
>
> Hmm, I did misspeak slightly (it's late and I really should have left
> this for tomorrow...) - that's 12KB per CPU *per domain*, but still that
> would seem to imply well over 100 domains if you have 242MB of magazine
> allocations while the iommu_iova cache isn't even on the charts... what
> the heck is that driver doing?

I am not sure what the driver is doing. However, I can check the
actual allocation sizes for each init_iova_domain() and report on that
later.
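
To make that estimate concrete, here is a rough back-of-the-envelope
check (only a sketch: the 6 * 2 * num_online_cpus() figure and the ~1KB
magazine size come from Robin's numbers above, the 128 CPUs and the
244412K cache size come from the slabtop output, and the little program
itself is purely illustrative):

#include <stdio.h>

int main(void)
{
	unsigned long cpus = 128;           /* the AMD EPYC in question   */
	unsigned long mag_bytes = 1024;     /* ~1KB per magazine          */
	unsigned long per_cpu = 6 * 2 * mag_bytes;  /* 12KB per CPU per domain */
	unsigned long per_domain_kb = per_cpu * cpus / 1024;  /* ~1536KB  */
	unsigned long observed_kb = 244412; /* CACHE SIZE from slabtop    */

	printf("empty-magazine footprint per domain: %lu KB\n", per_domain_kb);
	printf("domains needed to explain slabtop:   %lu\n",
	       observed_kb / per_domain_kb);
	return 0;
}

Even with completely empty magazines that works out to roughly 160
domains, consistent with the "well over 100 domains" above, which is
why the number of init_iova_domain() calls is the interesting question.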

>
> (I don't necessarily disagree with the spirit of the patch BTW, I just
> really want to understand the situation that prompted it, and make sure
> we don't actually have a subtle leak somewhere.)

Yes, observability is needed here: there have been several
optimizations that reduced the size of these magazines, and they can
still be large. For example, for a while each magazine was 1032 bytes
instead of 1024, which wasted almost half of the magazine memory
because the allocations came from 2K slabs. This was fixed by commit
b4c9bf178ace ("iommu/iova: change IOVA_MAG_SIZE to 127 to save memory").
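
As a reminder of the arithmetic behind that fix (a sketch, assuming the
simplified magazine layout iova.c had at the time: one size field plus
an array of unsigned long PFNs, on a 64-bit kernel):

#include <stdio.h>

#define IOVA_MAG_SIZE_OLD 128	/* before b4c9bf178ace */
#define IOVA_MAG_SIZE_NEW 127	/* after  b4c9bf178ace */

struct iova_magazine_old {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE_OLD];
};

struct iova_magazine_new {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE_NEW];
};

int main(void)
{
	/* 8 + 128 * 8 = 1032 bytes: kmalloc rounds this up to a 2K slab */
	printf("old magazine: %zu bytes\n", sizeof(struct iova_magazine_old));
	/* 8 + 127 * 8 = 1024 bytes: fits a 1K slab exactly */
	printf("new magazine: %zu bytes\n", sizeof(struct iova_magazine_new));
	return 0;
}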

Also, an earlier optimization, commit 32e92d9f6f87 ("iommu/iova:
Separate out rcache init"), reduced the cases in which magazines need
to be allocated at all. That also reduced the overhead on our systems
by a factor of 10.

Yet the magazines are still large, and I think it is time to improve
observability, both to guide future optimizations and to avoid future
regressions.
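
For reference, this is roughly what the change looks like (a minimal
sketch rather than the exact diff; the cache name matches the
"iommu_iova_magazine" entry in the slabtop output above, and the helper
names and flags here are illustrative):

static struct kmem_cache *iova_magazine_cache;

static int __init iova_magazine_cache_init(void)
{
	/*
	 * Dedicated, named cache so magazines show up separately in
	 * slabtop / /proc/slabinfo instead of hiding in kmalloc-*.
	 */
	iova_magazine_cache = kmem_cache_create("iommu_iova_magazine",
						sizeof(struct iova_magazine),
						0, SLAB_HWCACHE_ALIGN, NULL);
	return iova_magazine_cache ? 0 : -ENOMEM;
}

static struct iova_magazine *iova_magazine_alloc(gfp_t flags)
{
	/* was: kzalloc(sizeof(struct iova_magazine), flags) */
	return kmem_cache_zalloc(iova_magazine_cache, flags);
}

static void iova_magazine_free(struct iova_magazine *mag)
{
	kmem_cache_free(iova_magazine_cache, mag);
}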

Pasha



Thread overview: 10+ messages
2024-02-01 19:30 Pasha Tatashin
2024-02-01 20:56 ` Robin Murphy
2024-02-01 21:06   ` Pasha Tatashin
2024-02-01 21:23     ` Robin Murphy
2024-02-01 22:10       ` Pasha Tatashin [this message]
2024-02-02 18:04       ` Pasha Tatashin
2024-02-02 18:27         ` Robin Murphy
2024-02-02 19:14           ` Pasha Tatashin
2024-02-01 22:28 ` Yosry Ahmed
2024-02-02 17:52   ` Pasha Tatashin
