Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Yosry Ahmed <yosryahmed@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Nhat Pham <nphamcs@gmail.com>,  Chris Li <chrisl@kernel.org>,
	Chengming Zhou <zhouchengming@bytedance.com>,
	 Huang Ying <ying.huang@intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()
Date: Tue, 23 Jan 2024 22:57:30 -0800	[thread overview]
Message-ID: <CAJD7tkaVdJ9B_UDQs+o1nLdbs62CeKgbCyEXbMdezaBgOruEWw@mail.gmail.com> (raw)
In-Reply-To: <CAJD7tkZC6w2EaE=j2NEVWn1s7Lo2A7YZh8LiZ+w72jQzFFWLUQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5613 bytes --]

> > > > > However, while I have your attention on the swapoff path, there's a
> > > > > slightly irrelevant problem that I think might be there, but I am not
> > > > > sure.
> > > > >
> > > > > It looks to me like swapoff can race with writeback, and there may be
> > > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff()
> > > > > races with shrink_memcg_cb(), I feel like we may free the tree as it
> > > > > is being used. For example if zswap_swapoff()->kfree(tree) happen
> > > > > right before shrink_memcg_cb()->list_lru_isolate(l, item).
> > > > >
> > > > > Please tell me that I am being paranoid and that there is some
> > > > > protection against zswap writeback racing with swapoff. It feels like
> > > > > we are very careful with zswap entries refcounting, but not with the
> > > > > zswap tree itself.
> > > >
> > > > Hm, I don't see how.
> > > >
> > > > Writeback operates on entries from the LRU. By the time
> > > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should
> > > > will have emptied out the LRU and tree.
> > > >
> > > > Writeback could have gotten a refcount to the entry and dropped the
> > > > tree->lock. But then it does __read_swap_cache_async(), and while
> > > > holding the page lock checks the tree under lock once more; if that
> > > > finds the entry valid, it means try_to_unuse() hasn't started on this
> > > > page yet, and would be held up by the page lock/writeback state.
> > >
> > > Consider the following race:
> > >
> > > CPU 1                                 CPU 2
> > > # In shrink_memcg_cb()     # In swap_off
> > > list_lru_isolate()
> > >                                             zswap_invalidate()
> > >                                             ..
> > >                                             zswap_swapoff() -> kfree(tree)
> > > spin_lock(&tree->lock);
> > >
> > > Isn't this a UAF or am I missing something here?
> >
> > Oof. You're right, it looks like there is a bug. Digging through the
> > history, I think this is actually quite old: the original backend
> > shrinkers would pluck something off their LRU, drop all locks, then
> > try to acquire tree->lock. There is no protection against swapoff.
> >
> > The lock that is supposed to protect this is the LRU lock. That's
> > where reclaim and invalidation should synchronize. But it's not right:
> >
> > 1. We drop the LRU lock before acquiring the tree lock. We should
> >    instead trylock the tree while still holding the LRU lock to make
> >    sure the tree is safe against swapoff.
> >
> > 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But
> >    it must always cycle the LRU lock before freeing the tree for that
> >    synchronization to work.
> >
> > Once we're holding a refcount to the entry, it's safe to drop all
> > locks for the next step because we'll then work against the swapcache
> > and entry: __read_swap_cache_async() will try to pin and lock whatever
> > swap entry is at that type+offset. This also pins the type's current
> > tree. HOWEVER, if swapoff + swapon raced, this could be a different
> > tree than what we had in @tree, so
> >
> > 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to
> >    look up zswap_trees[] again after __read_swap_cache_async()
> >    succeeded to validate the entry.
> >
> > Once it succeeded, we can validate the entry. The entry is valid due
> > to our refcount. The zswap_trees[type] is valid due to the cache pin.
> >
> > However, if validation failed and we have a non-zero writeback_result,
> > there is one last bug:
> >
> > 4. the original entry's tree is no longer valid for the entry put.
> >
> > The current scheme handles invalidation fine (which is good because
> > that's quite common). But it's fundamentally unsynchronized against
> > swapoff (which has probably gone undetected because that's rare).
> >
> > I can't think of an immediate solution to this, but I wanted to put my
> > analysis out for comments.
>
>
> Thanks for the great analysis, I missed the swapoff/swapon race myself :)
>
> The first solution that came to mind for me was refcounting the zswap
> tree with RCU with percpu-refcount, similar to how cgroup refs are
> handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> the percpu-refcount may be an overkill in terms of memory usage
> though. I think we can still do our own refcounting with RCU, but it
> may be more complicated.

FWIW, I was able to reproduce the problem in a vm with the following
kernel diff:
diff --git a/mm/zswap.c b/mm/zswap.c
index 78df16d307aa8..6580a4be52a18 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -880,6 +880,9 @@ static enum lru_status shrink_memcg_cb(struct
list_head *item, struct list_lru_o
         */
        spin_unlock(lock);

+       pr_info("Sleeping in shrink_memcg_cb() before
spin_lock(&tree->lock)\n");
+       schedule_timeout_uninterruptible(msecs_to_jiffies(10 * 1000));
+
        /* Check for invalidate() race */
        spin_lock(&tree->lock);
        if (entry != zswap_rb_search(&tree->rbroot, swpoffset))

This basically expands the race window to 10 seconds. I have a
reproducer script attached that utilizes the zswap shrinker (which
makes this much easier to reproduce btw). The steps are as follows:
- Compile the kernel with the above diff, and both ZRAM & KASAN enabled.
- In one terminal, run zswap_wb_swapoff_race.sh.
- In a different terminal, once the "Sleeping in shrink_memcg_cb()"
message is logged, run "swapoff /dev/zram0".
- In the first terminal, once the 10 seconds elapse, I get a UAF BUG
from KASAN (log attached as well).

[-- Attachment #2: zswap_wb_swapoff_race.sh --]
[-- Type: text/x-sh, Size: 988 bytes --]

#!/bin/bash

TOTAL_MB=100
TOTAL_BYTES=$((TOTAL_MB << 20))
ROOT_CG=/sys/fs/cgroup
ZRAM_DEV=/dev/zram0
TMPFS=./tmpfs
A_ASCII=$(printf '%d' "'A")

function cleanup() {
  rm -rf $TMPFS/*
  umount $TMPFS
  rmdir $TMPFS

  if swapon -s | grep -q $ZRAM_DEV; then
    swapoff $ZRAM_DEV
  fi
  echo 1 > /sys/block/zram0/reset

  rmdir $ROOT_CG/test
}
trap cleanup EXIT

echo 1 > /sys/module/zswap/parameters/shrinker_enabled

# Initialize tmpfs
mkdir $TMPFS
mount -t tmpfs none $TMPFS

# Initialize zram
echo $((TOTAL_MB*2))m > /sys/block/zram0/disksize
mkswap -L zram0 $ZRAM_DEV
swapon $ZRAM_DEV

# Initialize cgroup
echo "+memory" > $ROOT_CG/cgroup.subtree_control
mkdir $ROOT_CG/test

# Create a test tmpfs file containing the alphabet on repeat
(
  echo 0 > $ROOT_CG/test/cgroup.procs
  s=$(printf '%s' {A..Z})
  yes $s | dd iflag=fullblock of=$TMPFS/test bs=1M count=$TOTAL_MB status=none
)

# Reclaim everything (invoking the zswap shrinker)
echo $TOTAL_BYTES > $ROOT_CG/test/memory.reclaim

[-- Attachment #3: uaf_log --]
[-- Type: application/octet-stream, Size: 5550 bytes --]

[   30.251114] zswap: Sleeping in shrink_memcg_cb() before spin_lock(&tree->lock)
[   40.408431] ==================================================================
[   40.413062] BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x59/0x110
[   40.417247] Write of size 4 at addr ffff88810252dc98 by task zswap_wb_swapof/383
[   40.421609]
[   40.422648] CPU: 0 PID: 383 Comm: zswap_wb_swapof Not tainted 6.7.0+ #19
[   40.426663] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   40.432151] Call Trace:
[   40.433717]  <TASK>
[   40.435125]  dump_stack_lvl+0x98/0xd0
[   40.437389]  print_address_description+0x7f/0x3a0
[   40.440272]  print_report+0x119/0x220
[   40.442540]  ? _raw_spin_lock+0x59/0x110
[   40.444971]  ? kasan_complete_mode_report_info+0x66/0x80
[   40.448170]  ? _raw_spin_lock+0x59/0x110
[   40.450571]  kasan_report+0xd7/0x110
[   40.452762]  ? _raw_spin_lock+0x59/0x110
[   40.455281]  kasan_check_range+0x250/0x2f0
[   40.457782]  ? _raw_spin_lock+0x59/0x110
[   40.460161]  __kasan_check_write+0x18/0x20
[   40.462638]  _raw_spin_lock+0x59/0x110
[   40.464933]  shrink_memcg_cb+0x11c/0x700
[   40.467360]  ? xa_load+0xcd/0x140
[   40.469429]  __list_lru_walk_one+0x236/0x5c0
[   40.472080]  ? zswap_shrinker_count+0x330/0x330
[   40.474841]  ? zswap_shrinker_count+0x330/0x330
[   40.477586]  list_lru_walk_one+0xa3/0x100
[   40.480023]  zswap_shrinker_scan+0x2c7/0x450
[   40.482649]  do_shrink_slab+0x3a6/0x8f0
[   40.485005]  shrink_slab_memcg+0x3fe/0x7c0
[   40.487499]  shrink_slab+0x7c/0x3a0
[   40.489658]  ? mem_cgroup_iter+0x2ae/0x780
[   40.492137]  shrink_node_memcgs+0x397/0x670
[   40.494671]  shrink_node+0x25f/0xce0
[   40.496877]  shrink_zones+0x531/0x9e0
[   40.499160]  do_try_to_free_pages+0x1f4/0xb10
[   40.501838]  try_to_free_mem_cgroup_pages+0x37d/0x7c0
[   40.504893]  memory_reclaim+0x301/0x370
[   40.507249]  ? _copy_from_iter+0x162/0x11e0
[   40.509851]  ? check_heap_object+0x160/0x410
[   40.512463]  cgroup_file_write+0x1e3/0x3d0
[   40.514960]  ? cgroup_seqfile_stop+0xb0/0xb0
[   40.517601]  kernfs_fop_write_iter+0x28e/0x390
[   40.520349]  vfs_write+0x7d2/0xb10
[   40.522478]  ksys_write+0xde/0x1a0
[   40.524574]  __x64_sys_write+0x7f/0x90
[   40.526891]  do_syscall_64+0x6c/0x150
[   40.529140]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
[   40.532189] RIP: 0033:0x7f718a982b00
[   40.534411] Code: 40 00 48 8b 15 19 b3 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d e1 3a 0e 00 00 74 17 b8 01 00 00 00 09
[   40.545338] RSP: 002b:00007ffff5fb5c68 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[   40.549846] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f718a982b00
[   40.554106] RDX: 000000000000000a RSI: 00005564fb1d5880 RDI: 0000000000000001
[   40.558342] RBP: 00005564fb1d5880 R08: 0000000000000073 R09: 0000000000000001
[   40.562586] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000a
[   40.566859] R13: 00007f718aa5f780 R14: 000000000000000a R15: 00007f718aa5ad00
[   40.571103]  </TASK>
[   40.572489]
[   40.573512] Allocated by task 388:
[   40.575650]  kasan_save_track+0x30/0x60
[   40.577986]  kasan_save_alloc_info+0x69/0x70
[   40.580572]  __kasan_kmalloc+0xa2/0xb0
[   40.582879]  __kmalloc_node+0x1ee/0x420
[   40.585235]  kvmalloc_node+0x54/0x1a0
[   40.587451]  zswap_swapon+0x4a/0x260
[   40.589650]  __do_sys_swapon+0x12ad/0x1fc0
[   40.592109]  __x64_sys_swapon+0x5e/0x70
[   40.594434]  do_syscall_64+0x6c/0x150
[   40.596652]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
[   40.599685]
[   40.600688] Freed by task 400:
[   40.602610]  kasan_save_track+0x30/0x60
[   40.604943]  kasan_save_free_info+0x52/0x60
[   40.607476]  poison_slab_object+0x11a/0x190
[   40.610006]  __kasan_slab_free+0x39/0x70
[   40.612375]  kfree+0xf3/0x280
[   40.614294]  kvfree+0x29/0x30
[   40.616136]  zswap_swapoff+0x103/0x1a0
[   40.618454]  __se_sys_swapoff+0x19d8/0x21d0
[   40.620990]  __x64_sys_swapoff+0x3c/0x40
[   40.623361]  do_syscall_64+0x6c/0x150
[   40.625605]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
[   40.628620]
[   40.629644] The buggy address belongs to the object at ffff88810252dc80
[   40.629644]  which belongs to the cache kmalloc-64 of size 64
[   40.636813] The buggy address is located 24 bytes inside of
[   40.636813]  freed 64-byte region [ffff88810252dc80, ffff88810252dcc0)
[   40.643854]
[   40.644871] The buggy address belongs to the physical page:
[   40.648187] page:00000000985903b6 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10252d
[   40.653713] flags: 0x400000000000800(slab|node=0|zone=1)
[   40.656880] page_type: 0xffffffff()
[   40.659031] raw: 0400000000000800 ffff888100042640 dead000000000100 dead000000000122
[   40.663635] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[   40.668235] page dumped because: kasan: bad access detected
[   40.671564]
[   40.672569] Memory state around the buggy address:
[   40.675463]  ffff88810252db80: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
[   40.679766]  ffff88810252dc00: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
[   40.684120] >ffff88810252dc80: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[   40.688423]                             ^
[   40.690860]  ffff88810252dd00: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
[   40.695136]  ffff88810252dd80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
[   40.699455] ==================================================================
[   40.704258] Disabling lock debugging due to kernel taint

next prev parent reply	other threads:[~2024-01-24  6:58 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-20  2:40 [PATCH 0/2] mm: zswap: simplify zswap_swapoff() Yosry Ahmed
2024-01-20  2:40 ` [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done Yosry Ahmed
2024-01-22 13:17   ` Chengming Zhou
2024-01-23  8:59   ` Huang, Ying
2024-01-23  9:40     ` Yosry Ahmed
2024-01-23  9:54       ` Yosry Ahmed
2024-01-24  3:13       ` Huang, Ying
2024-01-24  3:20         ` Yosry Ahmed
2024-01-24  3:27           ` Huang, Ying
2024-01-24  4:15             ` Yosry Ahmed
2024-01-20  2:40 ` [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() Yosry Ahmed
2024-01-22 13:13   ` Chengming Zhou
2024-01-22 20:19   ` Johannes Weiner
2024-01-22 20:39     ` Yosry Ahmed
2024-01-23 15:38       ` Johannes Weiner
2024-01-23 15:54         ` Yosry Ahmed
2024-01-23 20:12           ` Johannes Weiner
2024-01-23 21:02             ` Yosry Ahmed
2024-01-24  6:57               ` Yosry Ahmed [this message]
2024-01-25  5:28                 ` Chris Li
2024-01-25  7:59                   ` Yosry Ahmed
2024-01-25 18:55                     ` Chris Li
2024-01-25 20:57                       ` Yosry Ahmed
2024-01-25 22:31                         ` Chris Li
2024-01-25 22:33                           ` Yosry Ahmed
2024-01-26  1:09                             ` Chris Li
2024-01-24  7:20               ` Chengming Zhou
2024-01-25  5:44                 ` Chris Li
2024-01-25  8:01                   ` Yosry Ahmed
2024-01-25 19:03                     ` Chris Li
2024-01-25 21:01                       ` Yosry Ahmed
2024-01-25  7:53                 ` Yosry Ahmed
2024-01-25  8:03                   ` Yosry Ahmed
2024-01-25  8:30                   ` Chengming Zhou
2024-01-25  8:42                     ` Yosry Ahmed
2024-01-25  8:52                       ` Chengming Zhou
2024-01-25  9:03                         ` Yosry Ahmed
2024-01-25  9:22                           ` Chengming Zhou
2024-01-25  9:26                             ` Yosry Ahmed
2024-01-25  9:38                               ` Chengming Zhou
2024-01-26  0:03                   ` Chengming Zhou
2024-01-26  0:05                     ` Yosry Ahmed
2024-01-26  0:10                       ` Chengming Zhou
2024-01-23 20:30           ` Nhat Pham
2024-01-23 21:04             ` Yosry Ahmed
2024-01-22 21:21   ` Nhat Pham
2024-01-22 22:31   ` Chris Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJD7tkaVdJ9B_UDQs+o1nLdbs62CeKgbCyEXbMdezaBgOruEWw@mail.gmail.com \
    --to=yosryahmed@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisl@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=ying.huang@intel.com \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox