* [PATCH] ARM: i.MX8M: enable MMU in PBL around fw-external BL32 verify
@ 2026-06-16 14:49 Johannes Schneider
2026-06-16 14:49 ` [PATCH] crypto: sha256: PBL multi-block transform via ARMv8 Crypto Extensions Johannes Schneider
From: Johannes Schneider @ 2026-06-16 14:49 UTC
To: barebox; +Cc: Johannes Schneider
The BL32 fw-external blob is loaded into DRAM by the PBL and then
SHA-256-verified inside get_builtin_firmware_ext(). The verify runs
in PBL phase 1 with the MMU off and D-cache cold, walking ~720 KiB
through uncached DRAM accesses; on a Cortex-A53 this costs around
2 s of pre-BL31 wall-clock on every boot.
The verify is the only thing anchoring the BL32 content to the
signed PBL: HABv4 on i.MX8M only signs and loads what fits in
on-chip SRAM (= the PBL), and BL31/BL32 reach DRAM via PBL-driven
copies, so skipping the SHA-256 would be a security regression.
Turn on MMU + D-cache once the DRAM is populated and right before
the SHA-256 verify + BL31/BL32 memcpy run, and drop the MMU again
right before the BL31 entry (BL31 expects MMU off). Mirrors the
Rockchip handling in commits f2ae1a4a85 ("ARM: rockchip: atf:
enable MMU in PBL") and a0ef3a1b5c ("ARM: rockchip: atf: pass
correct memsize to mmu_early_enable()").
Measured on i.MX8MM and i.MX8MP (Cortex-A53, ~720 KiB BL32 blob):
the BL32 verify drops from ~2 s to ~300 ms (generic-C SHA-256 in
both cases; the only difference is the D-cache state), and the BL31
early-init also benefits from the warm cache (~200 ms saved).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Johannes Schneider <johannes.schneider@leica-geosystems.com>
---
--- a/arch/arm/mach-imx/atf.c
+++ b/arch/arm/mach-imx/atf.c
@@ -20,6 +20,7 @@
#include <mach/imx/xload.h>
#include <mach/imx/snvs.h>
#include <pbl.h>
+#include <asm/mmu.h>
static void imx_adjust_optee_memory(void **bl32, void **bl32_image, size_t *bl32_size)
{
@@ -187,6 +188,9 @@
"r" (tfa_dest - 16) :
"cc");
+ /* BL31 expects MMU off. */
+ mmu_disable();
+
/*
* If enabled the bl_params are passed via x0 to the TF-A, except for
* the i.MX8MQ which doesn't support bl_params yet.
@@ -284,6 +288,12 @@
imx8m_setup_snvs();
imx8mm_load_bl33(bl33);
+ /* Cache DRAM for the BL32 verify + BL31/BL32 memcpy that follow. */
+ mmu_early_enable(MX8M_DDR_CSD1_BASE_ADDR,
+ imx8m_barebox_earlymem_size(32),
+ MX8M_DDR_CSD1_BASE_ADDR +
+ imx8m_barebox_earlymem_size(32) - OPTEE_SIZE);
+
if (IS_ENABLED(CONFIG_FIRMWARE_IMX8MM_OPTEE)) {
get_builtin_firmware_ext(imx8mm_bl32_bin, bl33, &bl32, &bl32_size);
get_builtin_firmware(imx8mm_bl31_bin_optee, &bl31, &bl31_size);
@@ -349,6 +359,12 @@
imx8m_setup_snvs();
imx8mp_load_bl33(bl33);
+ /* Cache DRAM for the BL32 verify + BL31/BL32 memcpy that follow. */
+ mmu_early_enable(MX8M_DDR_CSD1_BASE_ADDR,
+ imx8m_barebox_earlymem_size(32),
+ MX8M_DDR_CSD1_BASE_ADDR +
+ imx8m_barebox_earlymem_size(32) - OPTEE_SIZE);
+
if (IS_ENABLED(CONFIG_FIRMWARE_IMX8MP_OPTEE)) {
get_builtin_firmware_ext(imx8mp_bl32_bin, bl33, &bl32, &bl32_size);
get_builtin_firmware(imx8mp_bl31_bin_optee, &bl31, &bl31_size);
@@ -414,6 +430,12 @@
imx8m_setup_snvs();
imx8mn_load_bl33(bl33);
+ /* Cache DRAM for the BL32 verify + BL31/BL32 memcpy that follow. */
+ mmu_early_enable(MX8M_DDR_CSD1_BASE_ADDR,
+ imx8m_barebox_earlymem_size(16),
+ MX8M_DDR_CSD1_BASE_ADDR +
+ imx8m_barebox_earlymem_size(16) - OPTEE_SIZE);
+
if (IS_ENABLED(CONFIG_FIRMWARE_IMX8MN_OPTEE)) {
get_builtin_firmware_ext(imx8mn_bl32_bin, bl33, &bl32, &bl32_size);
get_builtin_firmware(imx8mn_bl31_bin_optee, &bl31, &bl31_size);
@@ -473,6 +495,12 @@
imx8m_setup_snvs();
imx8mq_load_bl33(bl33);
+ /* Cache DRAM for the BL32 verify + BL31/BL32 memcpy that follow. */
+ mmu_early_enable(MX8M_DDR_CSD1_BASE_ADDR,
+ imx8m_barebox_earlymem_size(32),
+ MX8M_DDR_CSD1_BASE_ADDR +
+ imx8m_barebox_earlymem_size(32) - OPTEE_SIZE);
+
if (IS_ENABLED(CONFIG_FIRMWARE_IMX8MQ_OPTEE)) {
get_builtin_firmware_ext(imx8mq_bl32_bin, bl33, &bl32, &bl32_size);
get_builtin_firmware(imx8mq_bl31_bin_optee, &bl31, &bl31_size);
* [PATCH] crypto: sha256: PBL multi-block transform via ARMv8 Crypto Extensions
2026-06-16 14:49 [PATCH] ARM: i.MX8M: enable MMU in PBL around fw-external BL32 verify Johannes Schneider
@ 2026-06-16 14:49 ` Johannes Schneider
From: Johannes Schneider @ 2026-06-16 14:49 UTC
To: barebox; +Cc: Johannes Schneider
barebox's PBL ships a generic-C sha256_transform() that runs at roughly
1.6 MB/s on a Cortex-A53. Callers that hash MB-scale blobs in the PBL
-- e.g. the fw-external SHA-256 verify on i.MX8M, ~720 KiB of BL32 --
spend hundreds of ms in the transform even with the D-cache warm.
Wire the asm core in arch/arm/crypto/sha2-ce-core.S into the PBL link
and expose it through a new sha256_transform_blocks() entry point.
The asm has an internal multi-block loop; a single call amortises the
prologue (round-constant load, state load) over the whole input, which
makes the difference between ~200 ms (per-block calls) and ~5 ms
(batched) on the BL32 verify.
Rewire sha256_update()'s bulk path to call sha256_transform_blocks()
with the remaining block count rather than looping over a single-block
transform. The generic-C path gets a trivial blocks-wrapping shim so
both code paths share the same caller-side API.
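For illustration only (this is not code from the patch), the bulk-path
split described above can be sketched as a small standalone helper. The
names sha256_split_input() and struct sha256_split are hypothetical; the
sketch only reproduces the arithmetic: complete a buffered partial block
first, hand all remaining full blocks to one batched call, and buffer the
leftover bytes.

```c
#include <assert.h>
#include <stddef.h>

#define SHA256_BLOCK_SIZE 64

/* Hypothetical helper type, not part of the patch. */
struct sha256_split {
	size_t head;	/* input bytes used to complete the buffered block */
	size_t blocks;	/* full blocks passed to one batched transform call */
	size_t tail;	/* leftover bytes copied into the buffer afterwards */
};

/*
 * Describe how an update of `len` bytes is split when `partial` bytes
 * are already sitting in the context buffer.
 */
static struct sha256_split sha256_split_input(size_t partial, size_t len)
{
	struct sha256_split s = { 0, 0, len };

	assert(partial < SHA256_BLOCK_SIZE);

	/* Not enough data for a full block: everything is buffered. */
	if (partial + len < SHA256_BLOCK_SIZE)
		return s;

	/* Fill up the buffered block first; it is transformed on its own. */
	if (partial)
		s.head = SHA256_BLOCK_SIZE - partial;

	/* All remaining full blocks go into a single batched call. */
	s.blocks = (len - s.head) / SHA256_BLOCK_SIZE;
	s.tail = len - s.head - s.blocks * SHA256_BLOCK_SIZE;
	return s;
}
```

With an empty buffer and a 720 KiB input, the whole input maps to one
batched call covering 11520 blocks, so the asm prologue is paid once.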
The asm needs two link-time constants (sha256_ce_offsetof_count and
sha256_ce_offsetof_finalize) which we provide locally rather than
pulling in sha2-ce-glue.c -- the glue drags crypto-API and
kernel_neon_begin shims that the PBL has no use for.
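As a standalone illustration of the offsetof-based constants (a sketch
using stdint types and hypothetical sketch_* names, not the patch code
itself), the layout below mirrors the field order used by the patch and
checks the resulting byte offsets at compile time. The values 32 and 104
assume natural alignment for the fixed-width types, as on AArch64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the state struct in the patch; the name is hypothetical. */
struct ce_state_sketch {
	uint32_t state[8];	/* bytes 0..31: midstate */
	uint64_t count;		/* byte 32: total byte count */
	uint8_t buf[64];	/* bytes 40..103: partial block buffer */
	uint32_t finalize;	/* byte 104: finalize flag read by the asm */
};

/* Constants of the kind an asm core could load to locate the two fields. */
const uint32_t sketch_offsetof_count =
	offsetof(struct ce_state_sketch, count);
const uint32_t sketch_offsetof_finalize =
	offsetof(struct ce_state_sketch, finalize);

/* Fail the build if the layout ever drifts from what is assumed here. */
static_assert(offsetof(struct ce_state_sketch, count) == 32,
	      "count is expected right after the eight state words");
static_assert(offsetof(struct ce_state_sketch, finalize) == 104,
	      "finalize is expected right after the 64-byte buffer");
```

Deriving the constants from the struct keeps the C layout and the offsets
used by the asm from drifting apart.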
Measured on i.MX8MM and i.MX8MP, ~720 KiB SHA-256 verify with MMU on:
~300 ms (generic-C)
-> 17 ms (crypto-ext, single block per call)
-> 3-5 ms (crypto-ext, batched).
Both crypto-ext savings carry over with MMU off too, just shifted up
by the uncached-DRAM read cost.
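To put the timings above into throughput terms, here is a small
arithmetic sketch. kib_per_second() is a hypothetical helper, and the
inputs are the approximate figures quoted in this message, so the results
are rough orders of magnitude rather than new measurements.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert "size_kib hashed in ms milliseconds" into KiB per second,
 * using integer arithmetic (the result is rounded down).
 */
static uint64_t kib_per_second(uint64_t size_kib, uint64_t ms)
{
	assert(ms > 0);
	return size_kib * 1000 / ms;
}
```

For the ~720 KiB blob this gives roughly 2400 KiB/s at ~300 ms, about
42000 KiB/s at 17 ms and 144000 KiB/s at 5 ms.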
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Johannes Schneider <johannes.schneider@leica-geosystems.com>
---
arch/arm/crypto/Makefile | 3 ++
crypto/Kconfig | 12 ++++++++
crypto/sha2.c | 66 ++++++++++++++++++++++++++++++++++++----
3 files changed, 75 insertions(+), 6 deletions(-)
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 55b3ac0538..72d4bd77c0 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -15,6 +15,9 @@ sha1-ce-y := sha1-ce-glue.o sha1-ce-core.o
obj-$(CONFIG_DIGEST_SHA256_ARM64_CE) += sha2-ce.o
sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
+# Reuse the asm core (glue is provided inline in crypto/sha2.c).
+pbl-$(CONFIG_PBL_DIGEST_SHA256_ARM64_CE) += sha2-ce-core.o
+
quiet_cmd_perl = PERL $@
cmd_perl = $(PERL) $(<) > $(@)
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 528e9a0d22..3dfb316b32 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -107,6 +107,18 @@ config DIGEST_SHA256_ARM64_CE
Architecture: arm64 using:
- ARMv8 Crypto Extensions
+config PBL_DIGEST_SHA256_ARM64_CE
+ bool "SHA-256 in PBL via ARMv8 Crypto Extensions"
+ depends on CPU_V8 && PBL_IMAGE
+ help
+ Use ARMv8 Crypto Extensions (sha256h/sha256h2/sha256su0/sha256su1)
+ for the SHA-256 transform inside the PBL. Roughly 100x faster than
+ the generic-C transform; for callers that hash large blobs (e.g.
+ fw-external SHA-256 verifies) this is the difference between tens
+ of ms and hundreds. Requires Cortex-A53 or later with the optional
+ Crypto Extensions feature.
+
+
endif
config CRYPTO_PBKDF2
diff --git a/crypto/sha2.c b/crypto/sha2.c
index cac5095648..06af886867 100644
--- a/crypto/sha2.c
+++ b/crypto/sha2.c
@@ -29,6 +29,44 @@
#include <crypto/internal.h>
#include <crypto/pbl-sha.h>
+#if defined(__PBL__) && IS_ENABLED(CONFIG_PBL_DIGEST_SHA256_ARM64_CE)
+/*
+ * PBL multi-block sha256 dispatch through the asm core in
+ * arch/arm/crypto/sha2-ce-core.S. The asm expects a sha256_ce_state-
+ * compatible struct and reads its `count` / `finalize` fields at the
+ * offsets advertised by the two link-time constants below. With
+ * finalize == 0 the asm runs just the block transform and writes the
+ * new midstate back into state[]; count/buf are untouched.
+ *
+ * Avoiding sha2-ce-glue.c here keeps the PBL out of the crypto-API and
+ * kernel_neon_begin shims, which add bytes and unrelated dependencies.
+ */
+struct pbl_sha256_ce_state {
+ u32 state[8];
+ u64 count;
+ u8 buf[64];
+ u32 finalize;
+};
+
+const u32 sha256_ce_offsetof_count = offsetof(struct pbl_sha256_ce_state, count);
+const u32 sha256_ce_offsetof_finalize = offsetof(struct pbl_sha256_ce_state, finalize);
+
+extern int sha2_ce_transform(struct pbl_sha256_ce_state *sst,
+ const u8 *src, int blocks);
+
+static void sha256_transform_blocks(u32 *state, const u8 *input,
+ unsigned int blocks)
+{
+ struct pbl_sha256_ce_state sst;
+
+ memcpy(sst.state, state, sizeof(sst.state));
+ sst.finalize = 0;
+ sha2_ce_transform(&sst, input, blocks);
+ memcpy(state, sst.state, sizeof(sst.state));
+}
+
+#else /* generic C transform */
+
static inline u32 Ch(u32 x, u32 y, u32 z)
{
return z ^ (x & (y ^ z));
@@ -213,6 +251,18 @@ static void sha256_transform(u32 *state, const u8 *input)
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
+static void sha256_transform_blocks(u32 *state, const u8 *input,
+ unsigned int blocks)
+{
+ while (blocks--) {
+ sha256_transform(state, input);
+ input += 64;
+ }
+}
+
+#endif /* PBL crypto-ext vs generic */
+
+
static int sha224_init(struct digest *desc)
{
struct sha256_state *sctx = digest_ctx(desc);
@@ -258,18 +308,22 @@ int sha256_update(struct digest *desc, const void *data,
src = data;
if ((partial + len) > 63) {
+ unsigned int blocks;
+
if (partial) {
done = -partial;
memcpy(sctx->buf + partial, data, done + 64);
- src = sctx->buf;
+ sha256_transform_blocks(sctx->state, sctx->buf, 1);
+ done += 64;
}
- do {
- sha256_transform(sctx->state, src);
- done += 64;
- src = data + done;
- } while (done + 63 < len);
+ blocks = (len - done) / 64;
+ if (blocks) {
+ sha256_transform_blocks(sctx->state, data + done, blocks);
+ done += blocks * 64;
+ }
+ src = data + done;
partial = 0;
}
memcpy(sctx->buf + partial, src, len - done);
--
2.43.0