From mboxrd@z Thu Jan  1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Fri, 16 Feb 2024 03:07:44 +0800
Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache
To: Minchan Kim, Barry Song <21cnbao@gmail.com>
Cc: Chris Li, "Huang, Ying", linux-mm@kvack.org, Andrew Morton, Yu Zhao, Barry Song, SeongJae Park, Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand, stable@vger.kernel.org, linux-kernel@vger.kernel.org
On Thu, Feb 15, 2024 at 8:44 AM Minchan Kim wrote:
>
> Hi Kairui,

Hi, Minchan,

> I also played with the swap refcount ideas and encountered a lot of
> problems (e.g., kernel crashes and/or data missing).
>
> Now I realize your original solution would be best, given how small a
> change it is.
>
> Folks, please chime in if you have another idea.
> >
> > Instead if we add a schedule() or schedule_timeout_uninterruptible(1),

> How much does your workload change if we use schedule() instead of
> schedule_timeout_uninterruptible(1)? If that doesn't increase the
> statistics a lot, I prefer schedule().
> (TBH, I didn't care much about the stat since it would be very rare.)

I've tested both schedule() and schedule_timeout_uninterruptible(1)
locally using the reproducer, and some other cases, and they all looked
good. The reproducer I provided demonstrates an extreme case, so I
think schedule() is good enough already. I currently don't have any
more benchmarks or tests, as the change is very small for most
workloads. I'll use schedule() here in V3 if no one has other
suggestions.

I remember Barry's series about large folio swapin may change the ZRAM
read time depending on folio size, and cause strange races different
from the reproducer. I'm not sure if I can test using some known racy
cases; that could help verify this fix and whether schedule() or
schedule_timeout_uninterruptible(1) is better here.

> > It seems quite enough to avoid repeated page faults. I did the
> > following test with the reproducer I provided in the commit message:
> >
> > Using ZRAM instead of brd for a more realistic workload:
> > $ perf stat -e minor-faults,major-faults -I 30000 ./a.out
> >
> > Unpatched kernel:
> > #           time             counts unit events
> >     30.000096504             33,089      minor-faults:u
> >     30.000096504            958,745      major-faults:u
> >     60.000375308             22,150      minor-faults:u
> >     60.000375308          1,221,665      major-faults:u
> >     90.001062176             24,840      minor-faults:u
> >     90.001062176          1,005,427      major-faults:u
> >    120.004233268             23,634      minor-faults:u
> >    120.004233268          1,045,596      major-faults:u
> >    150.004530091             22,358      minor-faults:u
> >    150.004530091          1,097,871      major-faults:u
> >
> > Patched kernel:
> > #           time             counts unit events
> >     30.000069753            280,094      minor-faults:u
> >     30.000069753          1,198,871      major-faults:u
> >     60.000415915            436,862      minor-faults:u
> >     60.000415915            919,860      major-faults:u
> >     90.000651670            359,176      minor-faults:u
> >     90.000651670            955,340      major-faults:u
> >    120.001088135            289,137      minor-faults:u
> >    120.001088135            992,317      major-faults:u
> >
> > Indeed there is a huge increase in minor page faults, which indicates
> > the raced path returned without handling the fault. This reproducer
> > tries to maximize the race chance, so the reading is more terrifying
> > than usual.
> > But after adding a schedule()/schedule_timeout_uninterruptible(1)
> > after swapcache_prepare fails, still using the same test:
> >
> > Patched kernel (adding a schedule() before goto out):
> > #           time             counts unit events
> >     30.000335901             43,066      minor-faults:u
> >     30.000335901          1,163,232      major-faults:u
> >     60.034629687             35,652      minor-faults:u
> >     60.034629687            844,188      major-faults:u
> >     90.034941472             34,957      minor-faults:u
> >     90.034941472          1,218,174      major-faults:u
> >    120.035255386             36,241      minor-faults:u
> >    120.035255386          1,073,395      major-faults:u
> >    150.035535786             33,057      minor-faults:u
> >    150.035535786          1,171,684      major-faults:u
> >
> > Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out):
> > #           time             counts unit events
> >     30.000044452             26,748      minor-faults:u
> >     30.000044452          1,202,064      major-faults:u
> >     60.000317379             19,883      minor-faults:u
> >     60.000317379          1,186,152      major-faults:u
> >     90.000568779             18,333      minor-faults:u
> >     90.000568779          1,151,015      major-faults:u
> >    120.000787253             22,844      minor-faults:u
> >    120.000787253            887,531      major-faults:u
> >    150.001309065             18,900      minor-faults:u
> >    150.001309065          1,211,894      major-faults:u
> >
> > The minor faults are basically the same as before, or even lower
> > since other races are also reduced, with no observable performance
> > regression so far.
> > If the vague margin of this schedule call is still concerning, I
> > think another approach is just a new swap map value for parallel
> > cache-bypassed swapping to loop on.

> Long term, the swap map value would be good, but for now I prefer
> your original solution with some delay via schedule().

Yes, that's also what I have in mind. With a swap map value, I think
the cache-bypassed path may actually go even faster, as it can simplify
some of the swap map value transitions, but that requires quite some
extra work and change, and could be introduced later as an
optimization.

> Thanks, Kairui!

Thanks for the comments!