Date: Wed, 5 Jul 2023 17:41:01 +0200
To: LKML, linux-mm@kvack.org
From: Marc Gonzalez
Cc: Vladimir Murzin, Will Deacon, Mark Rutland, Thomas Gleixner, Ingo Molnar, Tomas Mudrunka, H. Peter Anvin, Arnd Bergmann, Ard Biesheuvel
Subject: RFC: Faster memtest (possibly bypassing data cache)

Hello,

When dealing with a few million devices (x86 and arm64), it is statistically expected that "a few" of them have at least one bad RAM cell. (How many?)

For one particular model, we've determined that ~0.1% have at least one bad RAM cell (ergo, a few thousand devices).

I've been wondering if someone more experienced knows: Are these RAM cells bad from the start, or do they become bad with time? (I assume both failure modes exist.)

Once the first bad cell is detected, does it become more likely that other bad cells will be detected as time goes by? In other words, what are the failure modes of ageing RAM?

Closing the HW tangent, focusing on the SW side of things:

Since these bad RAM cells wreak havoc for the device's user, especially with ASLR (different stuff crashes across reboots), I've been experimenting with mm/memtest.c as a first line of defense against bad RAM cells.

However, I have run into a few issues.

Even though early_memtest is called, well... early, memory has already been mapped as regular *cached* memory.

This means that when we test an area smaller than the L3 cache, we're not even hitting RAM, we're just testing the cache hierarchy.

I suppose it /might/ make sense to test the cache hierarchy, as it could(?) have errors as well? However, I suspect defects in cache are much rarer (and thus detection might not be worth the added run-time).

On x86, I ran a few tests using SIMD non-temporal stores (to bypass the cache on stores), and got a 30% reduction in run-time. (Minimal run-time is critical for being able to deploy the code to millions of devices for the benefit of a few thousand users.)

AFAIK, there are no non-temporal loads that are effective on regular write-back memory (MOVNTDQA only acts as a streaming load on WC memory), so the normal loads probably thrashed the data cache.

I was hoping to be able to test a different implementation:

When we enter early_memtest(), we remap [start, end] as UC (or maybe WC?) so as to entirely bypass the cache. We read/write using the largest size available for stores/loads, e.g. entire cache lines on recent x86 HW. Then when we leave, we restore the original mapping.

Is that possible?

Hopefully, the other cores are not started at this point? (Otherwise this whole charade would be pointless.)

To summarize: is it possible to tweak memtest to make it run faster while testing RAM in all cases?
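
To make the non-temporal store idea concrete, here is a rough userspace sketch. It is not the code used for the measurement above and not what mm/memtest.c does today; the helper names fill_pattern_nt() and verify_pattern() are made up for illustration. It assumes a compiler that provides the SSE2 intrinsics in <emmintrin.h> on x86, and falls back to plain stores elsewhere. The read-back still uses ordinary loads, so it shows only the store side of the experiment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_set1_epi64x, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: write `pattern` to `count` 64-bit words. */
static void fill_pattern_nt(uint64_t *start, size_t count, uint64_t pattern)
{
    size_t i = 0;

    assert(start != NULL || count == 0);
#if defined(__SSE2__)
    const __m128i v = _mm_set1_epi64x((long long)pattern);

    /* Plain stores until the address is 16-byte aligned. */
    while (i < count && ((uintptr_t)&start[i] & 15) != 0)
        start[i++] = pattern;

    /* 16-byte non-temporal stores (MOVNTDQ): hint not to allocate in cache. */
    for (; i + 2 <= count; i += 2)
        _mm_stream_si128((__m128i *)&start[i], v);

    /* Order the streaming stores before the loads that verify them. */
    _mm_sfence();
#endif
    /* Remaining tail, or the whole range when SSE2 is unavailable. */
    for (; i < count; i++)
        start[i] = pattern;
}

/* Hypothetical helper: index of the first bad word, or `count` if none. */
static size_t verify_pattern(const volatile uint64_t *start, size_t count,
                             uint64_t pattern)
{
    for (size_t i = 0; i < count; i++)
        if (start[i] != pattern)
            return i;
    return count;
}
```

A caller would fill a range, verify it, and treat any return value smaller than count as the offset of a suspect word. In the kernel, the equivalent loop would additionally need the usual FPU/SIMD context handling, which this sketch leaves out.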

Regards,
Marc Gonzalez