linux-mm.kvack.org archive mirror
* [BUG] ZSwap leaks memory upon being disabled
@ 2024-10-24 13:02 Konstantin Kharlamov
  2024-10-24 20:47 ` Yosry Ahmed
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-24 13:02 UTC (permalink / raw)
  To: linux-mm

When ZSWAP is disabled, the `Zswap` and `Zswapped` rows in /proc/meminfo are still non-zero.
IOW, ZSWAP doesn't free memory upon being disabled.

Stumbled upon this while trying to figure out where ≈4G of my SWAP memory
disappeared. Been seeing some unknown memory in SWAP for years, now I suspect ZSWAP
might be the culprit. But there is no way to know for sure because of this bug.

# Steps to reproduce

1. Enable ZSWAP
2. Wait for `grep Zswap /proc/meminfo` to become non-zero
3. Disable ZSWAP via `sudo sh -c "echo 0 > /sys/module/zswap/parameters/enabled"`
4. Look at `grep Zswap /proc/meminfo`

## Expected

The rows are zero because ZSWAP is disabled.

## Actual

The rows don't change.

# Additional information

Kernel: 6.11.3
OS: Archlinux
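
For anyone repeating the steps above, the `grep Zswap /proc/meminfo` check can be scripted. This is only a minimal sketch: the helper names `zswap_rows` and `read_zswap_rows` are made up for illustration and do not belong to any tool mentioned in this report.

```python
from pathlib import Path


def zswap_rows(meminfo_text: str) -> dict:
    """Return the Zswap and Zswapped rows of /proc/meminfo, in kB.

    zswap_rows is a hypothetical helper name, not an existing API.
    """
    rows = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("Zswap", "Zswapped"):
            # Rows look like "Zswap:           123456 kB".
            rows[key] = int(rest.split()[0])
    return rows


def read_zswap_rows(path: str = "/proc/meminfo") -> dict:
    """Read the live values; assumes a kernel that exposes these rows."""
    return zswap_rows(Path(path).read_text())
```

Calling `read_zswap_rows()` before and after step 3 shows whether the two rows moved.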


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-24 13:02 [BUG] ZSwap leaks memory upon being disabled Konstantin Kharlamov
@ 2024-10-24 20:47 ` Yosry Ahmed
  2024-10-25  6:41   ` Konstantin Kharlamov
  0 siblings, 1 reply; 19+ messages in thread
From: Yosry Ahmed @ 2024-10-24 20:47 UTC (permalink / raw)
  To: Konstantin Kharlamov; +Cc: linux-mm, Johannes Weiner, Nhat Pham, Chengming Zhou

On Thu, Oct 24, 2024 at 6:02 AM Konstantin Kharlamov <Hi-Angel@yandex.ru> wrote:
>
> When ZSWAP is disabled, the `Zswap` and `Zswapped` in meminfo are still non-zero.
> IOW, ZSWAP doesn't free memory upon being disabled.
>
> Stumbled upon this while trying to figure out where did ≈4G of my SWAP memory
> disappear. Been seeing some unknown memory in SWAP for years, now I suspect ZSWAP
> might be the culprit. But no way to know for sure because of this bug.
>
> # Steps to reproduce
>
> 1. Enable ZSWAP
> 2. Wait for `grep Zswap /proc/meminfo` to become non-zero
> 3. Disable ZSWAP via `sudo sh -c "echo 0 > /sys/module/zswap/parameters/enabled"`
> 4. Look at `grep Zswap /proc/meminfo`
>
> ## Expected
>
> The rows are zero because ZSWAP is disabled.

Not really, the expected behavior is that further swapouts will not go
to zswap, but pages that are already compressed in zswap will not be
immediately written out to the backing swapfile or swapped back to
memory. A swapoff would be required to force the latter.

This is documented in:
https://docs.kernel.org/admin-guide/mm/zswap.html#overview.
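
The behavior described above can be told apart from userspace. This is a rough sketch under two assumptions: that the `enabled` parameter reads back as `Y` or `N` (as boolean module parameters normally do), and that the `Zswapped` value is taken from /proc/meminfo. The name `zswap_state` is hypothetical.

```python
def zswap_state(enabled_param: str, zswapped_kb: int) -> str:
    """Classify zswap from its 'enabled' parameter and the Zswapped row.

    zswap_state is an illustrative name. The "Y"/"1" check is an assumption
    about how /sys/module/zswap/parameters/enabled reads back.
    """
    if enabled_param.strip() in ("Y", "1"):
        return "enabled"
    if zswapped_kb > 0:
        # Disabled, but previously compressed pages are still in the pool
        # until they are invalidated, faulted back in, or forced out by
        # a swapoff.
        return "disabled, still holding pages"
    return "disabled, empty"
```

With this classification, a non-zero `Zswapped` after writing 0 to `enabled` lands in the middle state, which is the documented behavior rather than a leak.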

>
> ## Actual
>
> The rows don't change.
>
> # Additional information
>
> Kernel: 6.11.3
> OS: Archlinux
>



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-24 20:47 ` Yosry Ahmed
@ 2024-10-25  6:41   ` Konstantin Kharlamov
  2024-10-25  7:50     ` Yosry Ahmed
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-25  6:41 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: linux-mm, Johannes Weiner, Nhat Pham, Chengming Zhou

On Thu, 2024-10-24 at 13:47 -0700, Yosry Ahmed wrote:
> On Thu, Oct 24, 2024 at 6:02 AM Konstantin Kharlamov
> <Hi-Angel@yandex.ru> wrote:
> > 
> > When ZSWAP is disabled, the `Zswap` and `Zswapped` in meminfo are
> > still non-zero.
> > IOW, ZSWAP doesn't free memory upon being disabled.
> > 
> > Stumbled upon this while trying to figure out where did ≈4G of my
> > SWAP memory
> > disappear. Been seeing some unknown memory in SWAP for years, now I
> > suspect ZSWAP
> > might be the culprit. But no way to know for sure because of this
> > bug.
> > 
> > # Steps to reproduce
> > 
> > 1. Enable ZSWAP
> > 2. Wait for `grep Zswap /proc/meminfo` to become non-zero
> > 3. Disable ZSWAP via `sudo sh -c "echo 0 >
> > /sys/module/zswap/parameters/enabled"`
> > 4. Look at `grep Zswap /proc/meminfo`
> > 
> > ## Expected
> > 
> > The rows are zero because ZSWAP is disabled.
> 
> Not really, the expected behavior is that further swapouts will not
> go
> to zswap, but pages that are already compressed in zswap will not be
> written out to the backing swapfile or swapped back to memory. A
> swapoff would be required for the latter.
> 
> This is documented in:
> https://docs.kernel.org/admin-guide/mm/zswap.html#overview.

Oh, I see, thank you, sorry for the noise.

Then, I'm curious, is it correct to assume that this `Zswap`-prefixed
memory mentioned in meminfo is never the one that is in SWAP? I mean,
Zswap being a buffer before data goes to swap kind of implies that yes,
the data is *either* in zswap or in swap. But I just wanted to hear that
explicitly.

The background to my question is that I'm trying to find the culprit
behind some "phantom memory" that eventually fills up my SWAP. This
memory is not accounted to apps (as calculated via `smem`), nor to
tmpfs. So my next suspect was something related to ZSwap.



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-25  6:41   ` Konstantin Kharlamov
@ 2024-10-25  7:50     ` Yosry Ahmed
  2024-10-26 11:33       ` Konstantin Kharlamov
  0 siblings, 1 reply; 19+ messages in thread
From: Yosry Ahmed @ 2024-10-25  7:50 UTC (permalink / raw)
  To: Konstantin Kharlamov; +Cc: linux-mm, Johannes Weiner, Nhat Pham, Chengming Zhou

On Thu, Oct 24, 2024 at 11:41 PM Konstantin Kharlamov
<Hi-Angel@yandex.ru> wrote:
>
> On Thu, 2024-10-24 at 13:47 -0700, Yosry Ahmed wrote:
> > On Thu, Oct 24, 2024 at 6:02 AM Konstantin Kharlamov
> > <Hi-Angel@yandex.ru> wrote:
> > >
> > > When ZSWAP is disabled, the `Zswap` and `Zswapped` in meminfo are
> > > still non-zero.
> > > IOW, ZSWAP doesn't free memory upon being disabled.
> > >
> > > Stumbled upon this while trying to figure out where did ≈4G of my
> > > SWAP memory
> > > disappear. Been seeing some unknown memory in SWAP for years, now I
> > > suspect ZSWAP
> > > might be the culprit. But no way to know for sure because of this
> > > bug.
> > >
> > > # Steps to reproduce
> > >
> > > 1. Enable ZSWAP
> > > 2. Wait for `grep Zswap /proc/meminfo` to become non-zero
> > > 3. Disable ZSWAP via `sudo sh -c "echo 0 >
> > > /sys/module/zswap/parameters/enabled"`
> > > 4. Look at `grep Zswap /proc/meminfo`
> > >
> > > ## Expected
> > >
> > > The rows are zero because ZSWAP is disabled.
> >
> > Not really, the expected behavior is that further swapouts will not
> > go
> > to zswap, but pages that are already compressed in zswap will not be
> > written out to the backing swapfile or swapped back to memory. A
> > swapoff would be required for the latter.
> >
> > This is documented in:
> > https://docs.kernel.org/admin-guide/mm/zswap.html#overview.
>
> Oh, I see, thank you, sorry for the noise.
>
> Then, I'm curious, is it correct to assume that this `Zswap`-prefixed
> memory mentioned in meminfo is never the one that is in SWAP? I mean,
> Zswap being a buffer before data goes to swap kind of implies that yes,
> the data *either* in zswap or in swap. But just wanted to hear that
> explicitly.

I know this makes sense, but unfortunately no. Zswap is currently
transparent to the rest of the system. For all intents and purposes,
pages in zswap are considered in swap. You cannot even use zswap without
an actual swapfile. So the zswap stats should be a subset of the swap
stats.

FWIW, Nhat is working on restructuring this to have zswap be its own
entity, separate from any swapfiles.

>
> The background to my question is that I'm trying to find the culprit
> some "phantom memory" eventually filling up my SWAP. This memory is not
> one accounted to apps (as calculated via `smem`), nor to tmpfs. So my
> next suspect was something related to ZSwap.
> >

As I mentioned, zswap should be transparent to the rest of the system,
so it shouldn't make a difference in this case whether the pages are
in zswap or in the swapfile.

You can use the memory.swap.current counter to find out which memory
cgroup currently has swapped out pages (in zswap or in the swapfile).
This should help find the application that has memory in swap. If you
want to find the exact type of memory (e.g. anon vs tmpfs), that would
be more tricky. Perhaps you can swapoff and see what counters increase
in memory.stat of the relevant memory cgroup?
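
The `memory.swap.current` suggestion above could be scripted roughly as follows. This sketch assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup; `swap_by_cgroup` is a hypothetical helper name.

```python
from pathlib import Path


def swap_by_cgroup(root: Path) -> list:
    """List (cgroup name, swapped bytes) for the direct children of root.

    swap_by_cgroup is an illustrative name. The counter is hierarchical,
    so each child's value already includes its own descendants.
    """
    usage = []
    for counter in root.glob("*/memory.swap.current"):
        try:
            usage.append((counter.parent.name, int(counter.read_text())))
        except (OSError, ValueError):
            # The cgroup may have gone away or be unreadable; skip it.
            continue
    return sorted(usage, key=lambda item: item[1], reverse=True)
```

Running it as `swap_by_cgroup(Path("/sys/fs/cgroup"))` gives the top-level slices, and pointing it at the largest one drills down one more level.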



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-25  7:50     ` Yosry Ahmed
@ 2024-10-26 11:33       ` Konstantin Kharlamov
  2024-10-26 17:47         ` Yosry Ahmed
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-26 11:33 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: linux-mm, Johannes Weiner, Nhat Pham, Chengming Zhou

On Fri, 2024-10-25 at 00:50 -0700, Yosry Ahmed wrote:
> On Thu, Oct 24, 2024 at 11:41 PM Konstantin Kharlamov
> <Hi-Angel@yandex.ru> wrote:
> > 
> > On Thu, 2024-10-24 at 13:47 -0700, Yosry Ahmed wrote:
> > > On Thu, Oct 24, 2024 at 6:02 AM Konstantin Kharlamov
> > > <Hi-Angel@yandex.ru> wrote:
> > > > 
> > > > When ZSWAP is disabled, the `Zswap` and `Zswapped` in meminfo
> > > > are
> > > > still non-zero.
> > > > IOW, ZSWAP doesn't free memory upon being disabled.
> > > > 
> > > > Stumbled upon this while trying to figure out where did ≈4G of
> > > > my
> > > > SWAP memory
> > > > disappear. Been seeing some unknown memory in SWAP for years,
> > > > now I
> > > > suspect ZSWAP
> > > > might be the culprit. But no way to know for sure because of
> > > > this
> > > > bug.
> > > > 
> > > > # Steps to reproduce
> > > > 
> > > > 1. Enable ZSWAP
> > > > 2. Wait for `grep Zswap /proc/meminfo` to become non-zero
> > > > 3. Disable ZSWAP via `sudo sh -c "echo 0 >
> > > > /sys/module/zswap/parameters/enabled"`
> > > > 4. Look at `grep Zswap /proc/meminfo`
> > > > 
> > > > ## Expected
> > > > 
> > > > The rows are zero because ZSWAP is disabled.
> > > 
> > > Not really, the expected behavior is that further swapouts will
> > > not
> > > go
> > > to zswap, but pages that are already compressed in zswap will not
> > > be
> > > written out to the backing swapfile or swapped back to memory. A
> > > swapoff would be required for the latter.
> > > 
> > > This is documented in:
> > > https://docs.kernel.org/admin-guide/mm/zswap.html#overview.
> > 
> > Oh, I see, thank you, sorry for the noise.
> > 
> > Then, I'm curious, is it correct to assume that this `Zswap`-
> > prefixed
> > memory mentioned in meminfo is never the one that is in SWAP? I
> > mean,
> > Zswap being a buffer before data goes to swap kind of implies that
> > yes,
> > the data *either* in zswap or in swap. But just wanted to hear that
> > explicitly.
> 
> I know this makes sense, but unfortunately no. Zswap is currently
> transparent to the rest of the system. For all intents and purposes,
> pages in zswap are considered in swap. You cannot even use zswap without
> an actual swapfile. So the zswap stats should be a subset of the swap
> stats.
> 
> FWIW, Nhat is working on restructuring this to have zswap be its own
> entity, separate from any swapfiles.
> 
> > 
> > The background to my question is that I'm trying to find the
> > culprit
> > some "phantom memory" eventually filling up my SWAP. This memory is
> > not
> > one accounted to apps (as calculated via `smem`), nor to tmpfs. So
> > my
> > next suspect was something related to ZSwap.
> > > 
> 
> As I mentioned, zswap should be transparent to the rest of the
> system,
> so it shouldn't make a difference in this case whether the pages are
> in zswap or in the swapfile.
> 
> You can use the memory.swap.current counter to find out which memory
> cgroup currently has swapped out pages (in zswap or in the swapfile).
> This should help find the application that has memory in swap. If you
> want to find the exact type of memory (e.g. anon vs tmpfs), that
> would
> be more tricky. Perhaps you can swapoff and see what counters
> increase
> in memory.stat of the relevant memory cgroup?

Thank you. So, I've waited till my SWAP got almost full again
(apparently my new workflow triggers that a lot). It is 7.5G out of 8G
in total. 437M is taken by tmpfs'es; let's subtract that for simplicity,
so I have 7G taken by something else.

Now I'm looking at `/sys/fs/cgroup/user.slice/memory.swap.current` and
it's 4422422528 = 4.1G. That's a lot less than 7G. I'm certain this
"phantom swap memory" is hidden in `user.slice`, because if I wait till
OOM-killer gets triggered and kills some app, my user-systemd gets
crashed for some reason, taking down the entire user session, and
afterwards SWAP is almost free.

I think this memory.swap.current isn't much different compared to just
asking `smem` for SWAP taken by individual apps. As of writing these
words that's 4.6G for the entire system, as calculated by:

	sudo smem -c "name user pid vss pss rss swap" | awk '{total+=$7} END {print "Swap memory: " total "K"}'

So 7 - 4.6 = 2.4G of some "phantom" memory.
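
For reference, the per-process figure that `smem` reports can be approximated by summing the `Swap:` rows of one process's smaps file. This is a minimal sketch, not `smem`'s actual implementation; `swap_kb_from_smaps` is a hypothetical name.

```python
def swap_kb_from_smaps(smaps_text: str) -> int:
    """Sum the 'Swap:' rows (in kB) of one /proc/<pid>/smaps dump.

    swap_kb_from_smaps is an illustrative helper. 'SwapPss:' rows are
    deliberately not matched, since they would double count.
    """
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Swap:"):
            # Rows look like "Swap:                 12 kB".
            total += int(line.split()[1])
    return total
```

Summing this over every readable pid gives a system-wide total comparable to the 4.6G above; whatever is left over from the swap in use is the "phantom" part.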



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-26 11:33       ` Konstantin Kharlamov
@ 2024-10-26 17:47         ` Yosry Ahmed
  2024-10-27  0:29           ` Konstantin Kharlamov
  0 siblings, 1 reply; 19+ messages in thread
From: Yosry Ahmed @ 2024-10-26 17:47 UTC (permalink / raw)
  To: Konstantin Kharlamov; +Cc: linux-mm, Johannes Weiner, Nhat Pham, Chengming Zhou

On Sat, Oct 26, 2024 at 4:33 AM Konstantin Kharlamov <Hi-Angel@yandex.ru> wrote:
>
> On Fri, 2024-10-25 at 00:50 -0700, Yosry Ahmed wrote:
> > On Thu, Oct 24, 2024 at 11:41 PM Konstantin Kharlamov
> > <Hi-Angel@yandex.ru> wrote:
> > >
> > > On Thu, 2024-10-24 at 13:47 -0700, Yosry Ahmed wrote:
> > > > On Thu, Oct 24, 2024 at 6:02 AM Konstantin Kharlamov
> > > > <Hi-Angel@yandex.ru> wrote:
> > > > >
> > > > > When ZSWAP is disabled, the `Zswap` and `Zswapped` in meminfo
> > > > > are
> > > > > still non-zero.
> > > > > IOW, ZSWAP doesn't free memory upon being disabled.
> > > > >
> > > > > Stumbled upon this while trying to figure out where did ≈4G of
> > > > > my
> > > > > SWAP memory
> > > > > disappear. Been seeing some unknown memory in SWAP for years,
> > > > > now I
> > > > > suspect ZSWAP
> > > > > might be the culprit. But no way to know for sure because of
> > > > > this
> > > > > bug.
> > > > >
> > > > > # Steps to reproduce
> > > > >
> > > > > 1. Enable ZSWAP
> > > > > 2. Wait for `grep Zswap /proc/meminfo` to become non-zero
> > > > > 3. Disable ZSWAP via `sudo sh -c "echo 0 >
> > > > > /sys/module/zswap/parameters/enabled"`
> > > > > 4. Look at `grep Zswap /proc/meminfo`
> > > > >
> > > > > ## Expected
> > > > >
> > > > > The rows are zero because ZSWAP is disabled.
> > > >
> > > > Not really, the expected behavior is that further swapouts will
> > > > not
> > > > go
> > > > to zswap, but pages that are already compressed in zswap will not
> > > > be
> > > > written out to the backing swapfile or swapped back to memory. A
> > > > swapoff would be required for the latter.
> > > >
> > > > This is documented in:
> > > > https://docs.kernel.org/admin-guide/mm/zswap.html#overview.
> > >
> > > Oh, I see, thank you, sorry for the noise.
> > >
> > > Then, I'm curious, is it correct to assume that this `Zswap`-
> > > prefixed
> > > memory mentioned in meminfo is never the one that is in SWAP? I
> > > mean,
> > > Zswap being a buffer before data goes to swap kind of implies that
> > > yes,
> > > the data *either* in zswap or in swap. But just wanted to hear that
> > > explicitly.
> >
> > I know this makes sense, but unfortunately no. Zswap is currently
> > transparent to the rest of the system. For all intents and purposes,
> > pages in zswap are considered in swap. You cannot even use zswap without
> > an actual swapfile. So the zswap stats should be a subset of the swap
> > stats.
> >
> > FWIW, Nhat is working on restructuring this to have zswap be its own
> > entity, separate from any swapfiles.
> >
> > >
> > > The background to my question is that I'm trying to find the
> > > culprit
> > > some "phantom memory" eventually filling up my SWAP. This memory is
> > > not
> > > one accounted to apps (as calculated via `smem`), nor to tmpfs. So
> > > my
> > > next suspect was something related to ZSwap.
> > > >
> >
> > As I mentioned, zswap should be transparent to the rest of the
> > system,
> > so it shouldn't make a difference in this case whether the pages are
> > in zswap or in the swapfile.
> >
> > You can use the memory.swap.current counter to find out which memory
> > cgroup currently has swapped out pages (in zswap or in the swapfile).
> > This should help find the application that has memory in swap. If you
> > want to find the exact type of memory (e.g. anon vs tmpfs), that
> > would
> > be more tricky. Perhaps you can swapoff and see what counters
> > increase
> > in memory.stat of the relevant memory cgroup?
>
> Thank you, so, I've waited till my SWAP gets almost full again
> (apparently my new workflow triggers that a lot). It is 7.5G out of 8
> in total. 437M is taken by tmpfs'es, let's subtract for simplicity, so
> I have 7G taken by something else.

If the tmpfs's are created and written to by processes in the user
slice, they should show up in memory.swap.current as well.

>
> Now I'm looking at `/sys/fs/cgroup/user.slice/memory.swap.current` and
> it's 4422422528 = 4.1G. That's a lot less than 7G. I'm certain this

Can you check the memory.swap.current value of other slices?

The other possibility is that the pages are swapped out from the root
cgroup, in which case they won't show up in memory.swap.current as
they are basically unaccounted. Although typically user processes
should not be running in the root cgroup.

> "phantom swap memory" is hidden in `user.slice`, because if I wait till
> OOM-killer gets triggered and kills some app, my user-systemd gets
> crashed for some reason, taking down the entire user session, and
> afterwards SWAP is almost free.

Did you check the OOM logs? It is possible that the OOM killer kills
some system process that has some memory in swap as well.

>
> I think this memory.swap.current isn't much different compared to just
> asking `smem` for SWAP taken by individual apps. As of writing the
> words that's 4.6G for the entire system, as calculated by:
>
>         sudo smem -c "name user pid vss pss rss swap" | awk '{total+=$7} END {print "Swap memory: " total "K"}'
>
> So 7 - 4.6 = 2.4G of some "phantom" memory.

I am not sure about smem, but memory.swap.current should be accounting
pages swapped out from all memory cgroups except the root.
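
One way to estimate how much swap is unaccounted (for example, charged to the root cgroup) is to compare the system-wide usage from /proc/meminfo with the sum of the top-level `memory.swap.current` values. This is a rough sketch: `unaccounted_swap_bytes` is a hypothetical helper, and the result is only approximate because the counters are read at different moments.

```python
def unaccounted_swap_bytes(meminfo_text: str, cgroup_swap_bytes: list) -> int:
    """Swap in use system-wide minus swap charged to non-root cgroups.

    unaccounted_swap_bytes is an illustrative name. cgroup_swap_bytes
    should hold memory.swap.current of each top-level cgroup, in bytes.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            # /proc/meminfo reports kB; convert to bytes.
            fields[key] = int(rest.split()[0]) * 1024
    used = fields["SwapTotal"] - fields["SwapFree"]
    return used - sum(cgroup_swap_bytes)
```

A result close to zero would suggest that all swap is charged to some non-root cgroup, so it is a matter of finding which one.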



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-26 17:47         ` Yosry Ahmed
@ 2024-10-27  0:29           ` Konstantin Kharlamov
  2024-10-27  3:14             ` Nhat Pham
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-27  0:29 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: linux-mm, Johannes Weiner, Nhat Pham, Chengming Zhou

On Sat, 2024-10-26 at 10:47 -0700, Yosry Ahmed wrote:
> On Sat, Oct 26, 2024 at 4:33 AM Konstantin Kharlamov
> <Hi-Angel@yandex.ru> wrote:
> > 
> > On Fri, 2024-10-25 at 00:50 -0700, Yosry Ahmed wrote:
> > > On Thu, Oct 24, 2024 at 11:41 PM Konstantin Kharlamov
> > > <Hi-Angel@yandex.ru> wrote:
> > > > 
> > > > On Thu, 2024-10-24 at 13:47 -0700, Yosry Ahmed wrote:
> > > > > On Thu, Oct 24, 2024 at 6:02 AM Konstantin Kharlamov
> > > > > <Hi-Angel@yandex.ru> wrote:
> > > > > > 
> > > > > > When ZSWAP is disabled, the `Zswap` and `Zswapped` in
> > > > > > meminfo
> > > > > > are
> > > > > > still non-zero.
> > > > > > IOW, ZSWAP doesn't free memory upon being disabled.
> > > > > > 
> > > > > > Stumbled upon this while trying to figure out where did ≈4G
> > > > > > of
> > > > > > my
> > > > > > SWAP memory
> > > > > > disappear. Been seeing some unknown memory in SWAP for
> > > > > > years,
> > > > > > now I
> > > > > > suspect ZSWAP
> > > > > > might be the culprit. But no way to know for sure because
> > > > > > of
> > > > > > this
> > > > > > bug.
> > > > > > 
> > > > > > # Steps to reproduce
> > > > > > 
> > > > > > 1. Enable ZSWAP
> > > > > > 2. Wait for `grep Zswap /proc/meminfo` to become non-zero
> > > > > > 3. Disable ZSWAP via `sudo sh -c "echo 0 >
> > > > > > /sys/module/zswap/parameters/enabled"`
> > > > > > 4. Look at `grep Zswap /proc/meminfo`
> > > > > > 
> > > > > > ## Expected
> > > > > > 
> > > > > > The rows are zero because ZSWAP is disabled.
> > > > > 
> > > > > Not really, the expected behavior is that further swapouts
> > > > > will
> > > > > not
> > > > > go
> > > > > to zswap, but pages that are already compressed in zswap will
> > > > > not
> > > > > be
> > > > > written out to the backing swapfile or swapped back to
> > > > > memory. A
> > > > > swapoff would be required for the latter.
> > > > > 
> > > > > This is documented in:
> > > > > https://docs.kernel.org/admin-guide/mm/zswap.html#overview.
> > > > 
> > > > Oh, I see, thank you, sorry for the noise.
> > > > 
> > > > Then, I'm curious, is it correct to assume that this `Zswap`-
> > > > prefixed
> > > > memory mentioned in meminfo is never the one that is in SWAP? I
> > > > mean,
> > > > Zswap being a buffer before data goes to swap kind of implies
> > > > that
> > > > yes,
> > > > the data *either* in zswap or in swap. But just wanted to hear
> > > > that
> > > > explicitly.
> > > 
> > > I know this makes sense, but unfortunately no. Zswap is currently
> > > transparent to the rest of the system. For all intents and
> > > purposes,
> > > pages in zswap are considered in swap. You cannot even use zswap
> > > without
> > > an actual swapfile. So the zswap stats should be a subset of the
> > > swap
> > > stats.
> > > 
> > > FWIW, Nhat is working on restructuring this to have zswap be its
> > > own
> > > entity, separate from any swapfiles.
> > > 
> > > > 
> > > > The background to my question is that I'm trying to find the
> > > > culprit
> > > > some "phantom memory" eventually filling up my SWAP. This
> > > > memory is
> > > > not
> > > > one accounted to apps (as calculated via `smem`), nor to tmpfs.
> > > > So
> > > > my
> > > > next suspect was something related to ZSwap.
> > > > > 
> > > 
> > > As I mentioned, zswap should be transparent to the rest of the
> > > system,
> > > so it shouldn't make a difference in this case whether the pages
> > > are
> > > in zswap or in the swapfile.
> > > 
> > > You can use the memory.swap.current counter to find out which
> > > memory
> > > cgroup currently has swapped out pages (in zswap or in the
> > > swapfile).
> > > This should help find the application that has memory in swap. If
> > > you
> > > want to find the exact type of memory (e.g. anon vs tmpfs), that
> > > would
> > > be more tricky. Perhaps you can swapoff and see what counters
> > > increase
> > > in memory.stat of the relevant memory cgroup?
> > 
> > Thank you, so, I've waited till my SWAP gets almost full again
> > (apparently my new workflow triggers that a lot). It is 7.5G out of
> > 8
> > in total. 437M is taken by tmpfs'es, let's subtract for simplicity,
> > so
> > I have 7G taken by something else.
> 
> If the tmpfs's are created and written to by processes in the user
> slice, they should show up memory.swap.current as well.
> 
> > 
> > Now I'm looking at `/sys/fs/cgroup/user.slice/memory.swap.current`
> > and
> > it's 4422422528 = 4.1G. That's a lot less than 7G. I'm certain this
> 
> Can you check the memory.swap.current value of other slices?

That was a good idea! The
`/sys/fs/cgroup/system.slice/memory.swap.current` seems to have the
missing half of the SWAP memory. From my understanding of the
`systemctl status` graph, the `system.slice` and `user.slice` groups do
not intersect, and by adding up `system.slice/…` + `user.slice/…` I get
around 8G.

However, I'm still unclear what this memory belongs to.
`system.slice/memory.swap.current` is 4.4G currently, that's a lot and
I'm not seeing anything that could take so much memory.

An even larger related mystery is why this memory does not show up in
the `smem` numbers for individual applications (which it calculates by
going over `/proc/$pid/smaps` for every pid).

> The other possibility is that the pages are swapped out from the root
> cgroup, in which case they won't show up in memory.swap.current as
> they are basically unaccounted. Although typically user processes
> should not be running in the root cgroup.
> 
> > "phantom swap memory" is hidden in `user.slice`, because if I wait
> > till
> > OOM-killer gets triggered and kills some app, my user-systemd gets
> > crashed for some reason, taking down the entire user session, and
> > afterwards SWAP is almost free.
> 
> Did you check the OOM logs? It is possible that the OOM killer kills
> some system process that has some memory in swap as well.

I did, logs are pretty uninteresting. OOM kills `electron` (of element-
desktop), but I tried closing it before the OOM, that didn't have much
influence. Just an arbitrary victim. Then a few lines later a `Process
560296 (systemd) of user 1000 terminated abnormally with signal
11/SEGV`. Wasn't able to get a stacktrace for systemd with Archlinux's
debuginfo servers. And then everything goes down with systemd.

I just tried closing every application I have open and I still got 5.5G
in SWAP. Well, obviously there are services still running, Plasma,
i3wm… Not many suspects left though.



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-27  0:29           ` Konstantin Kharlamov
@ 2024-10-27  3:14             ` Nhat Pham
  2024-10-27  6:46               ` Yosry Ahmed
  2024-10-27 10:25               ` [BUG] ZSwap leaks memory upon being disabled Konstantin Kharlamov
  0 siblings, 2 replies; 19+ messages in thread
From: Nhat Pham @ 2024-10-27  3:14 UTC (permalink / raw)
  To: Konstantin Kharlamov
  Cc: Yosry Ahmed, linux-mm, Johannes Weiner, Chengming Zhou

On Sat, Oct 26, 2024 at 5:29 PM Konstantin Kharlamov <Hi-Angel@yandex.ru> wrote:
>
> That was a good idea! The
> `/sys/fs/cgroup/system.slice/memory.swap.current` seems to have the
> missing half of the SWAP memory. From my understanding of the
> `systemctl status` graph `system.slice` and `user.slice` groups do not
> intersect, and by adding up `system.slice/…` + `user.slice/…` I get
> around 8G.
>
> However, I'm still unclear what does this memory belong to.
> `system.slice/memory.swap.current` is 4.4G currently, that's a lot and
> I'm not seeing anything that could take so much memory.

I assume you do not have any proactive memory reclaimer? :) I believe
the top utility can display swap usage by process. Have you tried
that?

There are a couple of edge cases - for instance, if you disable zswap
writeback and zswap at the same time. We will allocate slots on the
swapfile and store them in the page table entries, but we cannot store
the page's content in zswap or the swapfile, so the page remains in
memory. You're occupying swap space, but are not really saving any
memory.

IIRC, there is also an edge case where a page is faulted back into
memory from swap, but the associated swap space cannot be immediately
released. This should be temporary though - memory reclaimer will
attempt to release these pages later on, or they can be released when
we scan the swapfile for slots during swap out.
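
A per-process swap figure can also be read directly from the `VmSwap` row of /proc/<pid>/status. This is a minimal sketch with a hypothetical helper name, `vmswap_kb`. Note that proc(5) documents `VmSwap` as excluding shmem swap usage, so tmpfs and shared memory will not be visible this way.

```python
def vmswap_kb(status_text: str) -> int:
    """Return the VmSwap value (kB) from one /proc/<pid>/status dump.

    vmswap_kb is an illustrative name. Per proc(5), VmSwap covers
    swapped-out anonymous private pages only, not shmem.
    """
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # Rows look like "VmSwap:\t    1234 kB".
            return int(line.split()[1])
    # Kernel threads have no VmSwap row.
    return 0
```

If the per-process totals stay well below the cgroup counters, the gap would be consistent with shmem or swap cache usage rather than private anonymous memory, though that is only one possible explanation.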

>
> An even larger related mystery is why does this memory not show up in
> `smem` numbers for individual applications (which calculates it by
> going over `/proc/$pid/smaps` for every pid).
>
> > The other possibility is that the pages are swapped out from the root
> > cgroup, in which case they won't show up in memory.swap.current as
> > they are basically unaccounted. Although typically user processes
> > should not be running in the root cgroup.
> >
> > > "phantom swap memory" is hidden in `user.slice`, because if I wait
> > > till
> > > OOM-killer gets triggered and kills some app, my user-systemd gets
> > > crashed for some reason, taking down the entire user session, and
> > > afterwards SWAP is almost free.
> >
> > Did you check the OOM logs? It is possible that the OOM killer kills
> > some system process that has some memory in swap as well.
>
> I did, logs are pretty uninteresting. OOM kills `electron` (of element-
> desktop), but I tried closing it before the OOM, that didn't have much
> influence. Just an arbitrary victim. Then a few lines later a `Process
> 560296 (systemd) of user 1000 terminated abnormally with signal
> 11/SEGV`. Wasn't able to get stacktrace for systemd with Archlinux's
> debuginfo servers. And then everything gets down with systemd.
>
> I just tried closing every application I have open and I still got 5.5
> in SWAP. Well, obviously there are services still running, Plasma,
> i3wm… Not many suspects left though.

This beats me. I don't know the process situation on your laptop. Sorry :)



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-27  3:14             ` Nhat Pham
@ 2024-10-27  6:46               ` Yosry Ahmed
  2024-10-27 10:11                 ` Konstantin Kharlamov
  2024-10-27 10:25               ` [BUG] ZSwap leaks memory upon being disabled Konstantin Kharlamov
  1 sibling, 1 reply; 19+ messages in thread
From: Yosry Ahmed @ 2024-10-27  6:46 UTC (permalink / raw)
  To: Nhat Pham; +Cc: Konstantin Kharlamov, linux-mm, Johannes Weiner, Chengming Zhou

On Sat, Oct 26, 2024 at 8:14 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sat, Oct 26, 2024 at 5:29 PM Konstantin Kharlamov <Hi-Angel@yandex.ru> wrote:
> >
> > That was a good idea! The
> > `/sys/fs/cgroup/system.slice/memory.swap.current` seems to have the
> > missing half of the SWAP memory. From my understanding of the
> > `systemctl status` graph `system.slice` and `user.slice` groups do not
> > intersect, and by adding up `system.slice/…` + `user.slice/…` I get
> > around 8G.
> >
> > However, I'm still unclear what does this memory belong to.
> > `system.slice/memory.swap.current` is 4.4G currently, that's a lot and
> > I'm not seeing anything that could take so much memory.

I am not very familiar with what usually runs in system.slice.

>
> I assume you do not have any proactive memory reclaimer? :) I believe
> the top utility can display swap usage by process. Have you tried
> that?
>
> There are a couple of edge cases - for instance, if you disable zswap
> writeback and zswap at the same time. We will allocate slots on
> swapfile, and store it at the page table entry, but we cannot store
> the page's content in zswap or the swapfile, so the page remains in
> memory. You're occupying swap space, but are not really saving any
> memory usage.
>
> IIRC, there is also an edge case where a page is faulted back into
> memory from swap, but the associated swap space cannot be immediately
> released. This should be temporary though - memory reclaimer will
> attempt to release these pages later on, or they can be released when
> we scan the swapfile for slots during swap out.

I don't think this is an edge case. I think when we swapin a page we
generally leave it in the swapcache if there is no pressure on swap
space. In that case the memory is not really swapped out, but because
it remains in the swapcache it is still reserving a swap slot, so it
shows up as swap usage.

Konstantin, could you check the amount of swapcache you have, whether
through /proc/vmstat or memory.stat on both user and system slices?
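
The per-slice check could look roughly like this. The sketch assumes the cgroup v2 `memory.stat` file carries a `swapcached` row in bytes, which I believe is the case on recent kernels but have not verified for every version. `swapcached_bytes` is a hypothetical name.

```python
def swapcached_bytes(memory_stat_text: str) -> int:
    """Return the 'swapcached' value from a cgroup v2 memory.stat dump.

    swapcached_bytes is an illustrative name. The row name and the
    byte unit are assumptions about the kernel in use.
    """
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "swapcached":
            return int(value)
    # Row missing: older kernel or a different stat layout.
    return 0
```

Feeding it the contents of `user.slice/memory.stat` and `system.slice/memory.stat` gives the amount of memory that is resident yet still holds a swap slot in each slice.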



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-27  6:46               ` Yosry Ahmed
@ 2024-10-27 10:11                 ` Konstantin Kharlamov
  2024-10-27 10:32                   ` Konstantin Kharlamov
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-27 10:11 UTC (permalink / raw)
  To: Yosry Ahmed, Nhat Pham; +Cc: linux-mm, Johannes Weiner, Chengming Zhou

On Sat, 2024-10-26 at 23:46 -0700, Yosry Ahmed wrote:
> On Sat, Oct 26, 2024 at 8:14 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > 
> > On Sat, Oct 26, 2024 at 5:29 PM Konstantin Kharlamov
> > <Hi-Angel@yandex.ru> wrote:
> > > 
> > > That was a good idea! The
> > > `/sys/fs/cgroup/system.slice/memory.swap.current` seems to have
> > > the
> > > missing half of the SWAP memory. From my understanding of the
> > > `systemctl status` graph `sytem.slice` and `user.slice` groups do
> > > not
> > > intersect, and by adding up `system.slice/…` + `user.slice/…` I
> > > get
> > > around 8G.
> > > 
> > > However, I'm still unclear what does this memory belong to.
> > > `system.slice/memory.swap.current` is 4.4G currently, that's a
> > > lot and
> > > I'm not seeing anything that could take so much memory.
> 
> I am not very familiar with what usually runs in system.slice.
> 
> > 
> > I assume you do not have any proactive memory reclaimer? :) I
> > believe
> > the top utility can display swap usage by process. Have you tried
> > that?
> > 
> > There are a couple of edge cases - for instance, if you disable
> > zswap
> > writeback and zswap at the same time. We will allocate slots on
> > swapfile, and store it at the page table entry, but we cannot store
> > the page's content in zswap or the swapfile, so the page remains in
> > memory. You're occupying swap space, but are not really saving any
> > memory usage.
> > 
> > IIRC, there is also an edge case where a page is faulted back into
> > memory from swap, but the associated swap space cannot be
> > immediately
> > released. This should be temporary though - memory reclaimer will
> > attempt to release these pages later on, or they can be released
> > when
> > we scan the swapfile for slots during swap out.
> 
> I don't think this is an edge case. I think when we swapin a page we
> generally leave it in the swapcache if there is no pressure on swap
> space. In that case the memory is not really swapped out, but because
> it remains in the swapcache it is still reserving a swap slot, so it
> shows up as swap usage.
> 
> Konstantin, could you check the amount of swapcache you have, whether
> through /proc/vmstat or memory.stat on both user and system slices?

Sure

	λ grep cache /sys/fs/cgroup/*/memory.stat
	…
	/sys/fs/cgroup/system.slice/memory.stat:swapcached 434917376
	/sys/fs/cgroup/user.slice/memory.stat:swapcached 15478784

`434917376` is 0.4G, not much. In comparison,
`system.slice/memory.swap.current` is currently `4764139520 = 4.4G`.
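
For reference, the byte-to-G conversion used here can be reproduced as follows. This is only a sketch; `to_gib` is a hypothetical helper, and it assumes "G" means GiB (2^30 bytes):

```shell
# Convert a byte count to GiB with one decimal place.
to_gib() {
    awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / (1024 * 1024 * 1024) }'
}

to_gib 434917376    # swapcached of system.slice        -> 0.4
to_gib 4764139520   # memory.swap.current, system.slice -> 4.4
```

So swapcache covers less than a tenth of the swap charged to system.slice.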



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-27  3:14             ` Nhat Pham
  2024-10-27  6:46               ` Yosry Ahmed
@ 2024-10-27 10:25               ` Konstantin Kharlamov
  1 sibling, 0 replies; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-27 10:25 UTC (permalink / raw)
  To: Nhat Pham; +Cc: Yosry Ahmed, linux-mm, Johannes Weiner, Chengming Zhou

On Sat, 2024-10-26 at 20:14 -0700, Nhat Pham wrote:
> On Sat, Oct 26, 2024 at 5:29 PM Konstantin Kharlamov
> <Hi-Angel@yandex.ru> wrote:
> > 
> > That was a good idea! The
> > `/sys/fs/cgroup/system.slice/memory.swap.current` seems to have the
> > missing half of the SWAP memory. From my understanding of the
> > `systemctl status` graph `sytem.slice` and `user.slice` groups do
> > not
> > intersect, and by adding up `system.slice/…` + `user.slice/…` I get
> > around 8G.
> > 
> > However, I'm still unclear what does this memory belong to.
> > `system.slice/memory.swap.current` is 4.4G currently, that's a lot
> > and
> > I'm not seeing anything that could take so much memory.
> 
> I assume you do not have any proactive memory reclaimer? :) 

No, just the kernel with `vm.swappiness = 100` and with ZSWAP (ZSWAP is
enabled by default on Arch Linux nowadays via CONFIG_ZSWAP_DEFAULT_ON).

> I believe
> the top utility can display swap usage by process. Have you tried
> that?

I just tried. Well, the data seems the same as what `smem` shows,
except I can't add up the column numbers because top is interactive 😊
I noticed plasmashell was too bloated, so I restarted it. That didn't solve
the problem with some unknown memory taking gigabytes in SWAP though.

> There are a couple of edge cases - for instance, if you disable zswap
> writeback and zswap at the same time. We will allocate slots on
> swapfile, and store it at the page table entry, but we cannot store
> the page's content in zswap or the swapfile, so the page remains in
> memory. You're occupying swap space, but are not really saving any
> memory usage.

I never disabled zswap writeback, and as of writing zswap is on, so this
is certainly not it.

> IIRC, there is also an edge case where a page is faulted back into
> memory from swap, but the associated swap space cannot be immediately
> released. This should be temporary though - memory reclaimer will
> attempt to release these pages later on, or they can be released when
> we scan the swapfile for slots during swap out.

Replied in a separate email.



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-27 10:11                 ` Konstantin Kharlamov
@ 2024-10-27 10:32                   ` Konstantin Kharlamov
  2024-10-27 11:28                     ` Konstantin Kharlamov
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-27 10:32 UTC (permalink / raw)
  To: Yosry Ahmed, Nhat Pham; +Cc: linux-mm, Johannes Weiner, Chengming Zhou

On Sun, 2024-10-27 at 13:11 +0300, Konstantin Kharlamov wrote:
> On Sat, 2024-10-26 at 23:46 -0700, Yosry Ahmed wrote:
> > On Sat, Oct 26, 2024 at 8:14 PM Nhat Pham <nphamcs@gmail.com>
> > wrote:
> > >
> > > On Sat, Oct 26, 2024 at 5:29 PM Konstantin Kharlamov
> > > <Hi-Angel@yandex.ru> wrote:
> > > >
> > > > That was a good idea! The
> > > > `/sys/fs/cgroup/system.slice/memory.swap.current` seems to have
> > > > the
> > > > missing half of the SWAP memory. From my understanding of the
> > > > `systemctl status` graph `sytem.slice` and `user.slice` groups
> > > > do
> > > > not
> > > > intersect, and by adding up `system.slice/…` + `user.slice/…` I
> > > > get
> > > > around 8G.
> > > >
> > > > However, I'm still unclear what does this memory belong to.
> > > > `system.slice/memory.swap.current` is 4.4G currently, that's a
> > > > lot and
> > > > I'm not seeing anything that could take so much memory.
> >
> > I am not very familiar with what usually runs in system.slice.
> >
> > >
> > > I assume you do not have any proactive memory reclaimer? :) I
> > > believe
> > > the top utility can display swap usage by process. Have you tried
> > > that?
> > >
> > > There are a couple of edge cases - for instance, if you disable
> > > zswap
> > > writeback and zswap at the same time. We will allocate slots on
> > > swapfile, and store it at the page table entry, but we cannot
> > > store
> > > the page's content in zswap or the swapfile, so the page remains
> > > in
> > > memory. You're occupying swap space, but are not really saving
> > > any
> > > memory usage.
> > >
> > > IIRC, there is also an edge case where a page is faulted back
> > > into
> > > memory from swap, but the associated swap space cannot be
> > > immediately
> > > released. This should be temporary though - memory reclaimer will
> > > attempt to release these pages later on, or they can be released
> > > when
> > > we scan the swapfile for slots during swap out.
> >
> > I don't think this is an edge case. I think when we swapin a page
> > we
> > generally leave it in the swapcache if there is no pressure on swap
> > space. In that case the memory is not really swapped out, but
> > because
> > it remains in the swapcache it is still reserving a swap slot, so
> > it
> > shows up as swap usage.
> >
> > Konstantin, could you check the amount of swapcache you have,
> > whether
> > through /proc/vmstat or memory.stat on both user and system slices?
>
> Sure
>
> 	λ grep cache /sys/fs/cgroup/*/memory.stat
> 	…
> 	/sys/fs/cgroup/system.slice/memory.stat:swapcached 434917376
> 	/sys/fs/cgroup/user.slice/memory.stat:swapcached 15478784
>
> `434917376` is a 0.4G, not much. In comparison,
> `system.slice/memory.swap.current` is currently `4764139520 = 4.4G`.

I figured since 4.4G is a ten-digit number of bytes, I'd grep everything
in memory.stat that has ten digits:

    λ grep -P "\d{10}$" /sys/fs/cgroup/system.slice/memory.stat
    file 2671874048
    shmem 2592768000
    zswapped 2997760000
    active_anon 1491247104
    unevictable 1269555200

Well, to me personally this isn't helpful, but perhaps I am missing
something…
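
A variation on the same idea that compares values numerically instead of counting digits. This is a sketch only; `stat_at_least` is a hypothetical helper, and the 1 GiB threshold is an arbitrary choice:

```shell
# Print every "key value" pair in a memory.stat file whose value is at
# least the given number of bytes.
stat_at_least() {
    awk -v min="$2" '($2 + 0) >= (min + 0) { print $1, $2 }' "$1"
}

# Example (path assumes cgroup v2 at /sys/fs/cgroup):
#   stat_at_least /sys/fs/cgroup/system.slice/memory.stat 1073741824
```

Unlike the digit-count regex, this does not miss a 999 MB entry that matters or depend on where the line ends.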



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-27 10:32                   ` Konstantin Kharlamov
@ 2024-10-27 11:28                     ` Konstantin Kharlamov
  2024-10-27 19:31                       ` Yosry Ahmed
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-27 11:28 UTC (permalink / raw)
  To: Yosry Ahmed, Nhat Pham; +Cc: linux-mm, Johannes Weiner, Chengming Zhou

On Sun, 2024-10-27 at 13:32 +0300, Konstantin Kharlamov wrote:
> On Sun, 2024-10-27 at 13:11 +0300, Konstantin Kharlamov wrote:
> > On Sat, 2024-10-26 at 23:46 -0700, Yosry Ahmed wrote:
> > > I don't think this is an edge case. I think when we swapin a page
> > > we
> > > generally leave it in the swapcache if there is no pressure on
> > > swap
> > > space. In that case the memory is not really swapped out, but
> > > because
> > > it remains in the swapcache it is still reserving a swap slot, so
> > > it
> > > shows up as swap usage.
> > >
> > > Konstantin, could you check the amount of swapcache you have,
> > > whether
> > > through /proc/vmstat or memory.stat on both user and system
> > > slices?
> >
> > Sure
> >
> > 	λ grep cache /sys/fs/cgroup/*/memory.stat
> > 	…
> > 	/sys/fs/cgroup/system.slice/memory.stat:swapcached 434917376
> > 	/sys/fs/cgroup/user.slice/memory.stat:swapcached 15478784
> >
> > `434917376` is a 0.4G, not much. In comparison,
> > `system.slice/memory.swap.current` is currently `4764139520 =
> > 4.4G`.
>
> I figured since 434917376 is 10 numbers, I'd grep everything in
> memory.stat that has ten digits:
>
>     λ grep -P "\d{10}$" /sys/fs/cgroup/system.slice/memory.stat
>     file 2671874048
>     shmem 2592768000
>     zswapped 2997760000
>     active_anon 1491247104
>     unevictable 1269555200
>
> well, to me personally this isn't helpful, but perhaps am I missing
> something…

I found what the "phantom memory" belongs to! I just realized that I
can see `memory.swap.current` for individual services inside a cgroup
too, and it turns out 4.3G currently belongs to sddm:

  /sys/fs/cgroup/system.slice/sddm.service/memory.swap.current:4723781632
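
A lookup like this can be scripted. A minimal sketch, assuming cgroup v2 and that each service has its own child directory; `swap_by_child` is a hypothetical helper name:

```shell
# List the direct child cgroups of a parent cgroup together with their
# memory.swap.current (bytes), largest first.
swap_by_child() {
    for f in "$1"/*/memory.swap.current; do
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(cat "$f")" "$(basename "$(dirname "$f")")"
    done | sort -rn
}

# Example:
#   swap_by_child /sys/fs/cgroup/system.slice | head -n 5
```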

systemctl confirms this:

  λ systemctl status sddm
  ● sddm.service - Simple Desktop Display Manager
       Loaded: loaded (/usr/lib/systemd/system/sddm.service; enabled; preset: disabled)
       Active: active (running) since Wed 2024-10-16 15:59:10 MSK; 1 week 3 days ago
   Invocation: daadb3ed391b421b90b216122339be83
         Docs: man:sddm(1)
               man:sddm.conf(5)
     Main PID: 720 (sddm)
        Tasks: 10 (limit: 18621)
       Memory: 3.3G (peak: 4.1G swap: 4.3G swap peak: 5.8G zswap: 67.6M)
          CPU: 21h 30min 56.309s
       CGroup: /system.slice/sddm.service
               ├─720 /usr/bin/sddm
               └─724 /usr/lib/Xorg -nolisten tcp -background none -seat seat0 vt2 -auth /run/sddm/xauth_IKXVXT -noreset -displayfd 16

Note the `swap: 4.3G` part.

So, this is good news, but it still doesn't answer the question of where this memory
went. Out of the 2 processes in the group, `smem` shows 2.1M for sddm and 88M for Xorg.

I even tried manually calculating:

  λ sudo grep Swap /proc/72{0,4}/smaps | awk '{total+=$2} END {print "Swap memory: " total "K"}'
  Swap memory: 184656K

That's 180M, which for some reason differs from the `smem` numbers, but either way it's still very far from 4.3G.

----------

Just to make it clear, the reason why I'm digging is that something's clearly very
wrong. And I can't blame Xorg or sddm currently, because by all accounts they don't
take 4.3G of memory. The cgroup for some reason does, but the processes don't.



* Re: [BUG] ZSwap leaks memory upon being disabled
  2024-10-27 11:28                     ` Konstantin Kharlamov
@ 2024-10-27 19:31                       ` Yosry Ahmed
  2024-10-27 22:13                         ` phantom memory in a cgroup (was [BUG] ZSwap leaks memory upon being disabled) Konstantin Kharlamov
  0 siblings, 1 reply; 19+ messages in thread
From: Yosry Ahmed @ 2024-10-27 19:31 UTC (permalink / raw)
  To: Konstantin Kharlamov; +Cc: Nhat Pham, linux-mm, Johannes Weiner, Chengming Zhou

On Sun, Oct 27, 2024 at 4:28 AM Konstantin Kharlamov <Hi-Angel@yandex.ru> wrote:
>
> On Sun, 2024-10-27 at 13:32 +0300, Konstantin Kharlamov wrote:
> > On Sun, 2024-10-27 at 13:11 +0300, Konstantin Kharlamov wrote:
> > > On Sat, 2024-10-26 at 23:46 -0700, Yosry Ahmed wrote:
> > > > I don't think this is an edge case. I think when we swapin a page
> > > > we
> > > > generally leave it in the swapcache if there is no pressure on
> > > > swap
> > > > space. In that case the memory is not really swapped out, but
> > > > because
> > > > it remains in the swapcache it is still reserving a swap slot, so
> > > > it
> > > > shows up as swap usage.
> > > >
> > > > Konstantin, could you check the amount of swapcache you have,
> > > > whether
> > > > through /proc/vmstat or memory.stat on both user and system
> > > > slices?
> > >
> > > Sure
> > >
> > >     λ grep cache /sys/fs/cgroup/*/memory.stat
> > >     …
> > >     /sys/fs/cgroup/system.slice/memory.stat:swapcached 434917376
> > >     /sys/fs/cgroup/user.slice/memory.stat:swapcached 15478784
> > >
> > > `434917376` is a 0.4G, not much. In comparison,
> > > `system.slice/memory.swap.current` is currently `4764139520 =
> > > 4.4G`.
> >
> > I figured since 434917376 is 10 numbers, I'd grep everything in
> > memory.stat that has ten digits:
> >
> >     λ grep -P "\d{10}$" /sys/fs/cgroup/system.slice/memory.stat
> >     file 2671874048
> >     shmem 2592768000
> >     zswapped 2997760000
> >     active_anon 1491247104
> >     unevictable 1269555200
> >
> > well, to me personally this isn't helpful, but perhaps am I missing
> > something…
>
> I found the process the "phantom memory" belongs to! I just realized
> that I can see `memory.swap.current` for individual processes in a
> cgroup too, and it turns out currently 4.3G belong to sddm:
>
>   /sys/fs/cgroup/system.slice/sddm.service/memory.swap.current:4723781632
>
> systemctl confirms this:
>
>   λ systemctl status sddm
>   ● sddm.service - Simple Desktop Display Manager
>        Loaded: loaded (/usr/lib/systemd/system/sddm.service; enabled; preset: disabled)
>        Active: active (running) since Wed 2024-10-16 15:59:10 MSK; 1 week 3 days ago
>    Invocation: daadb3ed391b421b90b216122339be83
>          Docs: man:sddm(1)
>                man:sddm.conf(5)
>      Main PID: 720 (sddm)
>         Tasks: 10 (limit: 18621)
>        Memory: 3.3G (peak: 4.1G swap: 4.3G swap peak: 5.8G zswap: 67.6M)
>           CPU: 21h 30min 56.309s
>        CGroup: /system.slice/sddm.service
>                ├─720 /usr/bin/sddm
>                └─724 /usr/lib/Xorg -nolisten tcp -background none -seat seat0 vt2 -auth /run/sddm/xauth_IKXVXT -noreset -displayfd 16
>
> Note the `swap: 4.3G` sentence.
>
> So, this is good news, but still doesn't answer the question where did this memory
> go. Out of the 2 processes in the group, `smem` shows 2.1M for sddm and 88M for Xorg.
>
> I even tried manually calculating:
>
>   λ sudo grep Swap /proc/72{0,4}/smaps | awk '{total+=$2} END {print "Swap memory: " total "K"}'
>   Swap memory: 184656K
>
> That's 180M, for some reason very different, but whatever, still very far from 4.3G.

I think smaps will only show you swapped-out memory that is mapped by the
process. The rest could be tmpfs.

One thing you can do is take a snapshot of memory.stat when
memory.swap.current is at a high value (for sddm), then swapoff, then
take another snapshot of memory.stat.

We should see an increase in either anon or shmem, which will tell us
which type of memory was swapped out.
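
A sketch of how the two snapshots could be compared. `stat_delta` is a hypothetical helper and the snapshot file names are placeholders; the swapoff step needs root and enough free RAM to hold everything that is swapped out:

```shell
# Print every memory.stat key whose value differs between two snapshots,
# together with the signed change (after - before).
stat_delta() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && $2 != before[$1] {
             printf "%s %+.0f\n", $1, $2 - before[$1]
         }' "$1" "$2"
}

# Intended use (not run here):
#   cat /sys/fs/cgroup/system.slice/sddm.service/memory.stat > before.stat
#   sudo swapoff -a
#   cat /sys/fs/cgroup/system.slice/sddm.service/memory.stat > after.stat
#   stat_delta before.stat after.stat
```

Whichever of `anon` or `shmem` grows by roughly the former memory.swap.current value is the type that was sitting in swap.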

>
> ----------
>
> Just to make it clear, the reason why I'm digging is that something's clearly very
> wrong. And I can't blame Xorg nor sddm currently, because by all means they don't
> take 4.3G of memory. The cgroup for some reason does, but the processes don't.



* Re: phantom memory in a cgroup (was [BUG] ZSwap leaks memory upon being disabled)
  2024-10-27 19:31                       ` Yosry Ahmed
@ 2024-10-27 22:13                         ` Konstantin Kharlamov
  2024-10-30 14:41                           ` Konstantin Kharlamov
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-27 22:13 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Nhat Pham, linux-mm, Johannes Weiner, Chengming Zhou

On Sun, 2024-10-27 at 12:31 -0700, Yosry Ahmed wrote:
> On Sun, Oct 27, 2024 at 4:28 AM Konstantin Kharlamov
> <Hi-Angel@yandex.ru> wrote:
> > 
> > On Sun, 2024-10-27 at 13:32 +0300, Konstantin Kharlamov wrote:
> > > On Sun, 2024-10-27 at 13:11 +0300, Konstantin Kharlamov wrote:
> > > > On Sat, 2024-10-26 at 23:46 -0700, Yosry Ahmed wrote:
> > > > > I don't think this is an edge case. I think when we swapin a
> > > > > page
> > > > > we
> > > > > generally leave it in the swapcache if there is no pressure
> > > > > on
> > > > > swap
> > > > > space. In that case the memory is not really swapped out, but
> > > > > because
> > > > > it remains in the swapcache it is still reserving a swap
> > > > > slot, so
> > > > > it
> > > > > shows up as swap usage.
> > > > > 
> > > > > Konstantin, could you check the amount of swapcache you have,
> > > > > whether
> > > > > through /proc/vmstat or memory.stat on both user and system
> > > > > slices?
> > > > 
> > > > Sure
> > > > 
> > > >     λ grep cache /sys/fs/cgroup/*/memory.stat
> > > >     …
> > > >     /sys/fs/cgroup/system.slice/memory.stat:swapcached 434917376
> > > >     /sys/fs/cgroup/user.slice/memory.stat:swapcached 15478784
> > > > 
> > > > `434917376` is a 0.4G, not much. In comparison,
> > > > `system.slice/memory.swap.current` is currently `4764139520 =
> > > > 4.4G`.
> > > 
> > > I figured since 434917376 is 10 numbers, I'd grep everything in
> > > memory.stat that has ten digits:
> > > 
> > >     λ grep -P "\d{10}$" /sys/fs/cgroup/system.slice/memory.stat
> > >     file 2671874048
> > >     shmem 2592768000
> > >     zswapped 2997760000
> > >     active_anon 1491247104
> > >     unevictable 1269555200
> > > 
> > > well, to me personally this isn't helpful, but perhaps am I
> > > missing
> > > something…
> > 
> > I found the process the "phantom memory" belongs to! I just
> > realized
> > that I can see `memory.swap.current` for individual processes in a
> > cgroup too, and it turns out currently 4.3G belong to sddm:
> > 
> >   /sys/fs/cgroup/system.slice/sddm.service/memory.swap.current:4723781632
> > 
> > systemctl confirms this:
> > 
> >   λ systemctl status sddm
> >   ● sddm.service - Simple Desktop Display Manager
> >        Loaded: loaded (/usr/lib/systemd/system/sddm.service; enabled; preset: disabled)
> >        Active: active (running) since Wed 2024-10-16 15:59:10 MSK; 1 week 3 days ago
> >    Invocation: daadb3ed391b421b90b216122339be83
> >          Docs: man:sddm(1)
> >                man:sddm.conf(5)
> >      Main PID: 720 (sddm)
> >         Tasks: 10 (limit: 18621)
> >        Memory: 3.3G (peak: 4.1G swap: 4.3G swap peak: 5.8G zswap: 67.6M)
> >           CPU: 21h 30min 56.309s
> >        CGroup: /system.slice/sddm.service
> >                ├─720 /usr/bin/sddm
> >                └─724 /usr/lib/Xorg -nolisten tcp -background none -seat seat0 vt2 -auth /run/sddm/xauth_IKXVXT -noreset -displayfd 16
> > 
> > Note the `swap: 4.3G` sentence.
> > 
> > So, this is good news, but still doesn't answer the question where
> > did this memory
> > go. Out of the 2 processes in the group, `smem` shows 2.1M for sddm
> > and 88M for Xorg.
> > 
> > I even tried manually calculating:
> > 
> >   λ sudo grep Swap /proc/72{0,4}/smaps | awk '{total+=$2} END {print "Swap memory: " total "K"}'
> >   Swap memory: 184656K
> > 
> > That's 180M, for some reason very different, but whatever, still
> > very far from 4.3G.

FTR, the reason I got the "very different 180M" is that I mistakenly
added up `SwapPss` as well.
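
A sketch of the sum with that mistake avoided, matching only the `Swap:` field. `swap_kb` is a hypothetical helper; reading another user's smaps needs root:

```shell
# Sum the "Swap:" fields (kB) of one or more smaps files, ignoring
# "SwapPss:" lines, which a plain `grep Swap` would also match.
swap_kb() {
    awk '$1 == "Swap:" { total += $2 } END { print total + 0 }' "$@"
}

# Example (not run here):
#   sudo sh -c 'cat /proc/720/smaps /proc/724/smaps' > smaps.txt
#   swap_kb smaps.txt
```

If `SwapPss` equals `Swap` for these mappings, the earlier 184656K was simply double-counted, and half of it (92328K) is close to the roughly 90M that `smem` reported.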

> I think smaps will only show you swapped out mapped memory. It could
> be tmpfs.
> 
> One thing you can do is take a snapshot of memory.stat when
> memory.swap.current is at a high value (for sddm), then swapoff, then
> take another snapshot of memory.stat.
> 
> We should see an increase in either anon or shmem, which will tell us
> which type of memory was swapped out.

Okay. I will have to wait, because the session got killed by OOM. But I
think it's gonna reproduce in just a few days, my new workflow seems to
be triggering that a lot.

I took this chance to rename the thread as well, otherwise I'm gonna
forget it upon writing the next email.



* Re: phantom memory in a cgroup (was [BUG] ZSwap leaks memory upon being disabled)
  2024-10-27 22:13                         ` phantom memory in a cgroup (was [BUG] ZSwap leaks memory upon being disabled) Konstantin Kharlamov
@ 2024-10-30 14:41                           ` Konstantin Kharlamov
  2024-10-30 19:44                             ` Yosry Ahmed
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-30 14:41 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Nhat Pham, linux-mm, Johannes Weiner, Chengming Zhou

On Mon, 2024-10-28 at 01:13 +0300, Konstantin Kharlamov wrote:
> On Sun, 2024-10-27 at 12:31 -0700, Yosry Ahmed wrote:
> > One thing you can do is take a snapshot of memory.stat when
> > memory.swap.current is at a high value (for sddm), then swapoff,
> > then
> > take another snapshot of memory.stat.
> >
> > We should see an increase in either anon or shmem, which will tell
> > us
> > which type of memory was swapped out.
>
> Okay. I will have to wait, because the session got killed by OOM. But
> I
> think it's gonna reproduce in just a few days, my new workflow seems
> to
> be triggering that a lot.

Done. I missed one cycle, which again got my session killed by OOM 😅
Now I caught this in time. The information was retrieved by:

    (systemctl status sddm && cat /sys/fs/cgroup/system.slice/sddm.service/memory.stat) > ~/Projects/cgroups-mem-leak/"$(date -R)".log

I wasn't sure how to represent it in email, and decided to post a diff
of "before `swapoff -a`" and "after …", to be viewed with `diffr` or
with `perl /path/to/diff-highlight` of git or similar.

Diff follows:

--- "Wed, 30 Oct 2024 17:27:38 +0300.log"       2024-10-30 17:27:38.401290017 +0300
+++ "Wed, 30 Oct 2024 17:28:12 +0300.log"       2024-10-30 17:28:12.397695798 +0300
@@ -6,8 +6,8 @@
              man:sddm.conf(5)
    Main PID: 710 (sddm)
       Tasks: 9 (limit: 18621)
-     Memory: 1.2G (peak: 2.8G swap: 1.7G swap peak: 3.2G zswap: 58.3M)
-        CPU: 6h 10min 7.847s
+     Memory: 2.8G (peak: 2.8G swap: 0B swap peak: 3.2G)
+        CPU: 6h 10min 14.748s
      CGroup: /system.slice/sddm.service
              ├─710 /usr/bin/sddm
              └─746 /usr/lib/Xorg -nolisten tcp -background none -seat seat0 vt2 -auth /run/sddm/xauth_GXHGRA -noreset -displayfd 20
@@ -22,36 +22,36 @@
 окт 28 09:22:36 dell-g15 sddm-helper[925]: Writing cookie to "/tmp/xauth_RKKlcB"
 окт 28 09:22:36 dell-g15 sddm-helper[925]: Starting X11 session: "" "/usr/share/sddm/scripts/Xsession \"env KDEWM=/usr/bin/i3 /usr/bin/startplasma-x11\""
 окт 28 09:22:36 dell-g15 sddm[710]: Session started true
-anon 42807296
-file 1150423040
-kernel 79376384
+anon 93822976
+file 2957750272
+kernel 18210816
 kernel_stack 147456
 pagetables 7204864
 sec_pagetables 0
 percpu 2184
 sock 0
 vmalloc 12288
-shmem 1150795776
-zswap 61173438
-zswapped 1751408640
+shmem 2958123008
+zswap 0
+zswapped 0
 file_mapped 4108288
 file_dirty 0
 file_writeback 0
-swapcached 17666048
+swapcached 0
 anon_thp 2097152
 file_thp 0
 shmem_thp 0
-inactive_anon 445014016
-active_anon 625201152
+inactive_anon 589209600
+active_anon 2321489920
 inactive_file 2895872
 active_file 2244608
-unevictable 140836864
-slab_reclaimable 8166656
-slab_unreclaimable 2618128
-slab 10784784
-workingset_refault_anon 69854
+unevictable 141144064
+slab_reclaimable 8169032
+slab_unreclaimable 2625208
+slab 10794240
+workingset_refault_anon 177253
 workingset_refault_file 12496
-workingset_activate_anon 33476
+workingset_activate_anon 41579
 workingset_activate_file 2372
 workingset_restore_anon 12558
 workingset_restore_file 2132
@@ -64,14 +64,14 @@
 pgsteal_kswapd 1243374
 pgsteal_direct 348876
 pgsteal_khugepaged 9149
-pgfault 626853
+pgfault 626941
 pgmajfault 11521
 pgrefill 560417
 pgactivate 85087
 pgdeactivate 0
 pglazyfree 0
 pglazyfreed 0
-zswpin 87568
+zswpin 515158
 zswpout 1395410
 zswpwb 211559
 thp_fault_alloc 8



* Re: phantom memory in a cgroup (was [BUG] ZSwap leaks memory upon being disabled)
  2024-10-30 14:41                           ` Konstantin Kharlamov
@ 2024-10-30 19:44                             ` Yosry Ahmed
  2024-10-31 21:59                               ` Konstantin Kharlamov
  0 siblings, 1 reply; 19+ messages in thread
From: Yosry Ahmed @ 2024-10-30 19:44 UTC (permalink / raw)
  To: Konstantin Kharlamov; +Cc: Nhat Pham, linux-mm, Johannes Weiner, Chengming Zhou

On Wed, Oct 30, 2024 at 7:41 AM Konstantin Kharlamov <Hi-Angel@yandex.ru> wrote:
>
> On Mon, 2024-10-28 at 01:13 +0300, Konstantin Kharlamov wrote:
> > On Sun, 2024-10-27 at 12:31 -0700, Yosry Ahmed wrote:
> > > One thing you can do is take a snapshot of memory.stat when
> > > memory.swap.current is at a high value (for sddm), then swapoff,
> > > then
> > > take another snapshot of memory.stat.
> > >
> > > We should see an increase in either anon or shmem, which will tell
> > > us
> > > which type of memory was swapped out.
> >
> > Okay. I will have to wait, because the session got killed by OOM. But
> > I
> > think it's gonna reproduce in just a few days, my new workflow seems
> > to
> > be triggering that a lot.
>
> Done. I missed one cycle, which again got my session killed by OOM 😅
> Now I caught this in time. The information was retrieved by:
>
>     (systemctl status sddm && cat /sys/fs/cgroup/system.slice/sddm.service/memory.stat) > ~/Projects/cgroups-mem-leak/"$(date -R)".log
>
> I wasn't sure how to represent it in email, and decided to post a diff
> of "before `swapoff -a`" and "after …", to be viewed with `diffr` or
> with `perl /path/to/diff-highlight` of git or similar.
>
> Diff follows:
>
> --- "Wed, 30 Oct 2024 17:27:38 +0300.log"       2024-10-30 17:27:38.401290017 +0300
> +++ "Wed, 30 Oct 2024 17:28:12 +0300.log"       2024-10-30 17:28:12.397695798 +0300
> @@ -6,8 +6,8 @@
>               man:sddm.conf(5)
>     Main PID: 710 (sddm)
>        Tasks: 9 (limit: 18621)
> -     Memory: 1.2G (peak: 2.8G swap: 1.7G swap peak: 3.2G zswap: 58.3M)
> -        CPU: 6h 10min 7.847s
> +     Memory: 2.8G (peak: 2.8G swap: 0B swap peak: 3.2G)
> +        CPU: 6h 10min 14.748s
>       CGroup: /system.slice/sddm.service
>               ├─710 /usr/bin/sddm
>               └─746 /usr/lib/Xorg -nolisten tcp -background none -seat seat0 vt2 -auth /run/sddm/xauth_GXHGRA -noreset -displayfd 20
> @@ -22,36 +22,36 @@
>  окт 28 09:22:36 dell-g15 sddm-helper[925]: Writing cookie to "/tmp/xauth_RKKlcB"
>  окт 28 09:22:36 dell-g15 sddm-helper[925]: Starting X11 session: "" "/usr/share/sddm/scripts/Xsession \"env KDEWM=/usr/bin/i3 /usr/bin/startplasma-x11\""
>  окт 28 09:22:36 dell-g15 sddm[710]: Session started true
> -anon 42807296
> -file 1150423040
> -kernel 79376384
> +anon 93822976

Anonymous memory increased, but not by too much.

> +file 2957750272
> +kernel 18210816
>  kernel_stack 147456
>  pagetables 7204864
>  sec_pagetables 0
>  percpu 2184
>  sock 0
>  vmalloc 12288
> -shmem 1150795776
> -zswap 61173438
> -zswapped 1751408640
> +shmem 2958123008

shmem increased by a lot (~1.8G).

So this looks like it could be the answer to your question about where
the swap usage is coming from. I would try to find what tmpfs files
are used by this application.
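
The arithmetic behind that figure, as a quick sketch using the two `shmem` values from the diff (the variable names are only illustrative):

```shell
before=1150795776   # shmem before `swapoff -a`
after=2958123008    # shmem after `swapoff -a`
delta=$((after - before))

# Express the change both in decimal GB and in GiB.
report=$(awk -v d="$delta" 'BEGIN {
    printf "%.1f GB, %.1f GiB", d / 1e9, d / (1024 * 1024 * 1024)
}')
echo "$delta bytes = $report"
# prints: 1807327232 bytes = 1.8 GB, 1.7 GiB
```

Assuming systemctl reports in binary units, the 1.7 GiB lines up with the `swap: 1.7G` it showed before swapoff.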

> +zswap 0
> +zswapped 0
>  file_mapped 4108288
>  file_dirty 0
>  file_writeback 0
> -swapcached 17666048
> +swapcached 0
>  anon_thp 2097152
>  file_thp 0
>  shmem_thp 0
> -inactive_anon 445014016
> -active_anon 625201152
> +inactive_anon 589209600
> +active_anon 2321489920
>  inactive_file 2895872
>  active_file 2244608
> -unevictable 140836864
> -slab_reclaimable 8166656
> -slab_unreclaimable 2618128
> -slab 10784784
> -workingset_refault_anon 69854
> +unevictable 141144064
> +slab_reclaimable 8169032
> +slab_unreclaimable 2625208
> +slab 10794240
> +workingset_refault_anon 177253
>  workingset_refault_file 12496
> -workingset_activate_anon 33476
> +workingset_activate_anon 41579
>  workingset_activate_file 2372
>  workingset_restore_anon 12558
>  workingset_restore_file 2132
> @@ -64,14 +64,14 @@
>  pgsteal_kswapd 1243374
>  pgsteal_direct 348876
>  pgsteal_khugepaged 9149
> -pgfault 626853
> +pgfault 626941
>  pgmajfault 11521
>  pgrefill 560417
>  pgactivate 85087
>  pgdeactivate 0
>  pglazyfree 0
>  pglazyfreed 0
> -zswpin 87568
> +zswpin 515158
>  zswpout 1395410
>  zswpwb 211559
>  thp_fault_alloc 8



* Re: phantom memory in a cgroup (was [BUG] ZSwap leaks memory upon being disabled)
  2024-10-30 19:44                             ` Yosry Ahmed
@ 2024-10-31 21:59                               ` Konstantin Kharlamov
  2024-10-31 22:04                                 ` Yosry Ahmed
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Kharlamov @ 2024-10-31 21:59 UTC (permalink / raw)
  To: Yosry Ahmed; +Cc: Nhat Pham, linux-mm, Johannes Weiner, Chengming Zhou

On Wed, 2024-10-30 at 12:44 -0700, Yosry Ahmed wrote:
> shmem increased by a lot (~1.8G).
> 
> So this looks like it could be the answer to your question about
> where
> the swap usage is coming from. I would try to find what tmpfs files
> are used by this application.

Thank you! After doing more digging I reduced it to `Xorg` having
hundreds of `anon_inode:i915.gem` entries, and afterwards pinned this
down to Picom not freeing resources. Reported on GitHub¹.
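
One way such entries might be counted, assuming they show up as mapping names in `/proc/<pid>/maps` (the thread does not say which tool was actually used; `maps_by_name` is a hypothetical helper):

```shell
# Count the mappings in a /proc/<pid>/maps file by their name (the sixth
# field onward), most frequent first. Unnamed mappings are skipped.
maps_by_name() {
    awk 'NF >= 6 {
             name = $6
             for (i = 7; i <= NF; i++) name = name " " $i
             count[name]++
         }
         END { for (n in count) print count[n], n }' "$1" | sort -rn
}

# Example (not run here; needs root for another user's process):
#   sudo cat "/proc/$xorg_pid/maps" > xorg.maps
#   maps_by_name xorg.maps | head -n 5
```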

That said, isn't there a kernel bug too? If this `shmem` ends up in
swap, then it should be accounted in the `Swap` fields of
`/proc/<pid>/smaps` accordingly, right? After all, that's what the
field is for: the amount of swap taken by a process. Otherwise it is
"phantom memory": something is in swap, but there's no way to know who
owns this "something"; it just kind of "exists" somewhere between the
kernel and process realms.

1: https://github.com/yshui/picom/issues/1378



* Re: phantom memory in a cgroup (was [BUG] ZSwap leaks memory upon being disabled)
  2024-10-31 21:59                               ` Konstantin Kharlamov
@ 2024-10-31 22:04                                 ` Yosry Ahmed
  0 siblings, 0 replies; 19+ messages in thread
From: Yosry Ahmed @ 2024-10-31 22:04 UTC (permalink / raw)
  To: Konstantin Kharlamov; +Cc: Nhat Pham, linux-mm, Johannes Weiner, Chengming Zhou

On Thu, Oct 31, 2024 at 2:59 PM Konstantin Kharlamov <Hi-Angel@yandex.ru> wrote:
>
> On Wed, 2024-10-30 at 12:44 -0700, Yosry Ahmed wrote:
> > shmem increased by a lot (~1.8G).
> >
> > So this looks like it could be the answer to your question about
> > where the swap usage is coming from. I would try to find what tmpfs
> > files are used by this application.
>
> Thank you! After more digging I narrowed it down to `Xorg` having
> hundreds of `anon_inode:i915.gem`, and then pinned this down to Picom
> not freeing resources. Reported on GitHub¹.
>
> That said, isn't there a kernel bug too? If this `shmem` ends up in
> swap, then it should be accounted in the `Swap` fields of
> `/proc/<pid>/smaps` accordingly, right? After all, that's what the
> field is for: the amount of swap taken by a process. Otherwise it is
> "phantom memory": something is in swap, but there's no way to know who
> owns it; it just kind of "exists" somewhere between the kernel and
> process realms.

I don't think so. shmem doesn't really belong to a single process. If
you kill the process but leave the tmpfs files behind, the memory will
not go away.
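
The gap discussed here can be made visible by comparing system-wide
swap usage with the sum of what the per-process files report. The
following is a rough sketch, not a definitive accounting tool: it
assumes the standard `SwapTotal`/`SwapFree` lines of `/proc/meminfo`
and the `SwapPss` line of `/proc/<pid>/smaps_rollup`, and the helper
names are hypothetical. Swapped-out shmem that no process has mapped
would be one possible contributor to a non-zero difference.

```python
import glob
import re

# Matches lines such as "SwapFree:   4194300 kB" or "Swap:   12 kB".
_KB_FIELD = re.compile(r"^(\w+):\s+(\d+) kB[ \t]*$", re.MULTILINE)


def sum_field_kb(text, field):
    """Sum every '<field>: N kB' line found in the given /proc text."""
    return sum(int(v) for k, v in _KB_FIELD.findall(text) if k == field)


def swap_in_use_kb(meminfo_text):
    """System-wide swap in use, computed as SwapTotal - SwapFree."""
    total = sum_field_kb(meminfo_text, "SwapTotal")
    free = sum_field_kb(meminfo_text, "SwapFree")
    return total - free


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        system_kb = swap_in_use_kb(f.read())

    per_process_kb = 0
    for path in glob.glob("/proc/[0-9]*/smaps_rollup"):
        try:
            with open(path) as f:
                # SwapPss splits shared pages between their users, so
                # summing it over processes avoids double counting.
                per_process_kb += sum_field_kb(f.read(), "SwapPss")
        except OSError:
            continue  # process exited or access was denied

    print(f"swap in use (meminfo):      {system_kb} kB")
    print(f"sum of per-process SwapPss: {per_process_kb} kB")
    print(f"not attributed to any pid:  {system_kb - per_process_kb} kB")
```

The numbers are only indicative: processes come and go while the
script runs, and without root most `smaps_rollup` files are
unreadable, which inflates the unattributed figure.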

>
> 1: https://github.com/yshui/picom/issues/1378



Thread overview: 19+ messages
2024-10-24 13:02 [BUG] ZSwap leaks memory upon being disabled Konstantin Kharlamov
2024-10-24 20:47 ` Yosry Ahmed
2024-10-25  6:41   ` Konstantin Kharlamov
2024-10-25  7:50     ` Yosry Ahmed
2024-10-26 11:33       ` Konstantin Kharlamov
2024-10-26 17:47         ` Yosry Ahmed
2024-10-27  0:29           ` Konstantin Kharlamov
2024-10-27  3:14             ` Nhat Pham
2024-10-27  6:46               ` Yosry Ahmed
2024-10-27 10:11                 ` Konstantin Kharlamov
2024-10-27 10:32                   ` Konstantin Kharlamov
2024-10-27 11:28                     ` Konstantin Kharlamov
2024-10-27 19:31                       ` Yosry Ahmed
2024-10-27 22:13                         ` phantom memory in a cgroup (was [BUG] ZSwap leaks memory upon being disabled) Konstantin Kharlamov
2024-10-30 14:41                           ` Konstantin Kharlamov
2024-10-30 19:44                             ` Yosry Ahmed
2024-10-31 21:59                               ` Konstantin Kharlamov
2024-10-31 22:04                                 ` Yosry Ahmed
2024-10-27 10:25               ` [BUG] ZSwap leaks memory upon being disabled Konstantin Kharlamov