Message-ID: <726f3658-a5d2-4c14-a584-312883846303@redhat.com>
Date: Mon, 19 Feb 2024 10:01:17 +0100
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [PATCH v3] mm/swap: fix race when skipping swapcache
To: "Huang, Ying"
Cc: Kairui Song, linux-mm@kvack.org, Andrew Morton, Chris Li,
 Minchan Kim, Yu Zhao, Barry Song, SeongJae Park, Hugh Dickins,
 Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
 stable@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <871q9atd6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240216095105.14502-1-ryncsn@gmail.com>
 <871q9atd6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
On 18.02.24 08:59, Huang, Ying wrote:
> David Hildenbrand writes:
>
>> On 16.02.24 10:51, Kairui Song wrote:
>>> From: Kairui Song
>>>
>>> When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
>>> swapin the same entry at the same time, they get different pages (A, B).
>>> Before one thread (T0) finishes the swapin and installs page (A)
>>> to the PTE, another thread (T1) could finish swapin of page (B),
>>> swap_free the entry, then swap out the possibly modified page
>>> reusing the same entry. It breaks the pte_same check in (T0) because
>>> the PTE value is unchanged, causing an ABA problem. Thread (T0) will
>>> install a stale page (A) into the PTE and cause data corruption.
>>>
>>> One possible callstack is like this:
>>>
>>> CPU0                                 CPU1
>>> ----                                 ----
>>> do_swap_page()                       do_swap_page() with same entry
>>> <direct swapin path>                 <direct swapin path>
>>> <alloc page A>                       <alloc page B>
>>> swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
>>> <slow on later locks or interrupt>   <finished swapin first>
>>> ...                                  set_pte_at()
>>>                                      swap_free() <- entry is free
>>>                                      <write to page B, now page A stale>
>>>                                      <swap out page B to same swap entry>
>>> pte_same() <- Check pass, PTE seems
>>>               unchanged, but page A
>>>               is stale!
>>> swap_free() <- page B content lost!
>>> set_pte_at() <- stale page A installed!
>>>
>>> And besides, for ZRAM, swap_free() allows the swap device to discard
>>> the entry content, so even if page (B) is not modified, if
>>> swap_read_folio() on CPU0 happens later than swap_free() on CPU1,
>>> it may also cause data loss.
>>>
>>> To fix this, reuse swapcache_prepare, which will pin the swap entry
>>> using the cache flag, and allow only one thread to pin it. Release
>>> the pin after the PT is unlocked. Racers will simply wait since it's
>>> a rare and very short event. A schedule() call is added to avoid
>>> wasting too much CPU or adding too much noise to perf statistics.
>>>
>>> Other methods like increasing the swap count don't seem to be a good
>>> idea after some tests: they cause racers to fall back to using the
>>> swap cache again, and parallel swapin using different methods leads
>>> to a much more complex scenario.
>>>
>>> Reproducer:
>>>
>>> This race issue can be triggered easily using a well constructed
>>> reproducer and patched brd (with a delay in the read path) [1]:
>>>
>>> With latest 6.8 mainline, race-caused data loss can be observed easily:
>>> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
>>>   Polulating 32MB of memory region...
>>>   Keep swapping out...
>>>   Starting round 0...
>>>   Spawning 65536 workers...
>>>   32746 workers spawned, wait for done...
>>>   Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
>>>   Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
>>>   Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
>>>   Round 0 Failed, 15 data loss!
>>>
>>> This reproducer spawns multiple threads sharing the same memory region
>>> using a small swap device. Every two threads update mapped pages one by
>>> one in opposite directions trying to create a race, with one dedicated
>>> thread that keeps swapping the data out using madvise.
>>> The reproducer created a reproduce rate of about once every 5 minutes,
>>> so the race should be totally possible in production.
>>>
>>> After this patch, I ran the reproducer for over a few hundred rounds
>>> and no data loss was observed.
>>>
>>> Performance overhead is minimal; microbenchmark swapping in 10G from
>>> 32G zram:
>>> Before:     10934698 us
>>> After:      11157121 us
>>> Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
>>>
>>> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
>>> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
>>> Reported-by: "Huang, Ying"
>>> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
>>> Signed-off-by: Kairui Song
>>> Cc: stable@vger.kernel.org
>>> ---
>>> Update from V2:
>>>   - Add a schedule() if raced, to prevent repeated page faults wasting
>>>     CPU and adding noise to perf statistics.
>>>   - Use a bool to state the special case instead of reusing existing
>>>     variables, fixing error handling [Minchan Kim].
>>> V2: https://lore.kernel.org/all/20240206182559.32264-1-ryncsn@gmail.com/
>>>
>>> Update from V1:
>>>   - Add some words on the ZRAM case; it will discard swap content on
>>>     swap_free so the race window is a bit different, but the cure is
>>>     the same. [Barry Song]
>>>   - Update comments to make them cleaner. [Huang, Ying]
>>>   - Add a function placeholder to fix the CONFIG_SWAP=n build.
>>>     [SeongJae Park]
>>>   - Update the commit message and summary, referring to
>>>     SWP_SYNCHRONOUS_IO instead of "direct swapin path". [Yu Zhao]
>>>   - Update commit message.
>>>   - Collect Reviews and Acks.
>>> V1: https://lore.kernel.org/all/20240205110959.4021-1-ryncsn@gmail.com/
>>>
>>>   include/linux/swap.h |  5 +++++
>>>   mm/memory.c          | 20 ++++++++++++++++++++
>>>   mm/swap.h            |  5 +++++
>>>   mm/swapfile.c        | 13 +++++++++++++
>>>   4 files changed, 43 insertions(+)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 4db00ddad261..8d28f6091a32 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp)
>>>   	return 0;
>>>   }
>>>
>>> +static inline int swapcache_prepare(swp_entry_t swp)
>>> +{
>>> +	return 0;
>>> +}
>>> +
>>>   static inline void swap_free(swp_entry_t swp)
>>>   {
>>>   }
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 7e1f4849463a..7059230d0a54 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3799,6 +3799,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>   	struct page *page;
>>>   	struct swap_info_struct *si = NULL;
>>>   	rmap_t rmap_flags = RMAP_NONE;
>>> +	bool need_clear_cache = false;
>>>   	bool exclusive = false;
>>>   	swp_entry_t entry;
>>>   	pte_t pte;
>>> @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>   	if (!folio) {
>>>   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>   		    __swap_count(entry) == 1) {
>>> +			/*
>>> +			 * Prevent parallel swapin from proceeding with
>>> +			 * the cache flag. Otherwise, another thread may
>>> +			 * finish swapin first, free the entry, and swapout
>>> +			 * reusing the same entry. It's undetectable as
>>> +			 * pte_same() returns true due to entry reuse.
>>> +			 */
>>> +			if (swapcache_prepare(entry)) {
>>> +				/* Relax a bit to prevent rapid repeated page faults */
>>> +				schedule();
>>> +				goto out;
>>> +			}
>>> +			need_clear_cache = true;
>>> +
>>
>> I took a closer look at __read_swap_cache_async() and it essentially
>> does something similar.
>>
>> Instead of returning, it keeps retrying until it finds that
>> swapcache_prepare() fails for another reason than -EEXIST (e.g.,
>> freed concurrently) or it finds the entry in the swapcache.
>>
>> So if you would succeed here on a freed+reused swap entry,
>> __read_swap_cache_async() would simply retry.
>>
>> It spells that out:
>>
>> /*
>>  * We might race against __delete_from_swap_cache(), and
>>  * stumble across a swap_map entry whose SWAP_HAS_CACHE
>>  * has not yet been cleared. Or race against another
>>  * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>>  * in swap_map, but not yet added its folio to swap cache.
>>  */
>>
>> Whereby we could now race against this code here as well, where we
>> speculatively set SWAP_HAS_CACHE and might never add something to the
>> swap cache.
>>
>> I'd probably avoid the wrong returns and do something even closer to
>> __read_swap_cache_async():
>>
>> while (true) {
>> 	/*
>> 	 * Fake that we are trying to insert a page into the swapcache, to
>> 	 * serialize against concurrent threads wanting to do the same.
>> 	 * [more from your description]
>> 	 */
>> 	ret = swapcache_prepare(entry);
>> 	if (likely(!ret))
>> 		/*
>> 		 * Move forward with swapin, we'll recheck if the PTE hasn't
>> 		 * changed later.
>> 		 */
>> 		break;
>> 	else if (ret != -EEXIST)
>> 		goto out;
>
> The swap entry may be kept in the swap cache for a long time. For
> example, it may be read into the swap cache via MADV_WILLNEED.

Right, we'd have to check for the swapcache.

I briefly thought about just factoring out what we have in
__read_swap_cache_async() and reusing it here. Similar problem to solve,
and quite a lot of duplicate code. But not worth the churn in a simple
fix. We could explore that option as a cleanup on top.

-- 
Cheers,

David / dhildenb