mail archive of the barebox mailing list
From: Johannes Schneider <johannes.schneider@leica-geosystems.com>
To: barebox@lists.infradead.org
Cc: Johannes Schneider <johannes.schneider@leica-geosystems.com>
Subject: [PATCH] crypto: sha256: PBL multi-block transform via ARMv8 Crypto Extensions
Date: Tue, 16 Jun 2026 14:49:24 +0000
Message-ID: <20260616144924.1614561-2-johannes.schneider@leica-geosystems.com>
In-Reply-To: <20260616144924.1614561-1-johannes.schneider@leica-geosystems.com>

barebox's PBL ships a generic-C sha256_transform() that runs at roughly
1.6 MB/s on a Cortex-A53. Callers that hash MB-scale blobs in the PBL
-- e.g. the fw-external SHA-256 verify on i.MX8M, ~720 KiB of BL32 --
spend hundreds of ms in the transform even with the D-cache warm.

Wire the asm core in arch/arm/crypto/sha2-ce-core.S into the PBL link
and expose it through a new sha256_transform_blocks() entry point.
The asm has an internal multi-block loop; a single call amortises the
prologue (round-constant load, state load) over the whole input, which
makes the difference between ~17 ms (per-block calls) and ~5 ms
(batched) on the BL32 verify.

Rewire sha256_update()'s bulk path to call sha256_transform_blocks()
with the remaining block count rather than looping over a single-block
transform. The generic-C path gets a trivial blocks-wrapping shim so
both code paths share the same caller-side API.
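
The split that the reworked bulk path performs can be sketched in
isolation. This is a minimal model, not the barebox code: toy_ctx,
toy_transform_blocks() and toy_update() are hypothetical names, and the
transform only records how often it is called and with how many blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 64u

/* Hypothetical stand-in for the hash context; it only tracks buffering. */
struct toy_ctx {
	unsigned char buf[BLOCK];
	size_t count;		/* total bytes fed so far */
	unsigned int calls;	/* number of transform invocations */
	unsigned int blocks;	/* total blocks handed to the transform */
};

/* Stand-in for a multi-block transform: records the call, hashes nothing. */
static void toy_transform_blocks(struct toy_ctx *c, const unsigned char *in,
				 unsigned int blocks)
{
	assert(in != NULL && blocks > 0);
	c->calls++;
	c->blocks += blocks;
}

static void toy_update(struct toy_ctx *c, const unsigned char *data, size_t len)
{
	size_t partial = c->count % BLOCK;
	size_t done = 0;

	c->count += len;

	if (partial + len >= BLOCK) {
		unsigned int blocks;

		if (partial) {
			/* Top up the buffered block and flush it on its own. */
			done = BLOCK - partial;
			memcpy(c->buf + partial, data, done);
			toy_transform_blocks(c, c->buf, 1);
		}

		/* All remaining whole blocks go out in a single call. */
		blocks = (unsigned int)((len - done) / BLOCK);
		if (blocks) {
			toy_transform_blocks(c, data + done, blocks);
			done += (size_t)blocks * BLOCK;
		}
		partial = 0;
	}

	/* Stash the tail (fewer than 64 bytes) for the next update. */
	memcpy(c->buf + partial, data + done, len - done);
}
```

In this model, feeding a 720 KiB buffer into a fresh context results in
one transform call covering 11520 blocks; with 10 bytes already
buffered, the same input costs two calls (one for the topped-up block,
one for the rest).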

The asm needs two link-time constants (sha256_ce_offsetof_count and
sha256_ce_offsetof_finalize) which we provide locally rather than
pulling in sha2-ce-glue.c -- the glue drags crypto-API and
kernel_neon_begin shims that the PBL has no use for.
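
As an illustration of what those two constants carry, the sketch below
mirrors the field order of the PBL-local state struct and derives the
offsets with offsetof(). The names ce_state_sketch, sketch_offsetof_*
and read_finalize() are hypothetical; read_finalize() only imitates, in
C, a load of the field through the exported offset and is not taken
from the asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the field order of the PBL-local state struct in this patch. */
struct ce_state_sketch {
	uint32_t state[8];	/* 32 bytes of midstate */
	uint64_t count;
	uint8_t  buf[64];
	uint32_t finalize;
};

/* offsetof() is an integer constant expression, so this works at file scope. */
const uint32_t sketch_offsetof_count =
	offsetof(struct ce_state_sketch, count);
const uint32_t sketch_offsetof_finalize =
	offsetof(struct ce_state_sketch, finalize);

/* Hypothetical helper: read `finalize` via base pointer plus offset. */
static uint32_t read_finalize(const void *sst)
{
	const unsigned char *p = sst;
	uint32_t v;

	assert(sst != NULL);
	memcpy(&v, p + sketch_offsetof_finalize, sizeof(v));
	return v;
}
```

With this layout the offsets come out as 32 for count and 104 for
finalize, since state[] occupies 32 bytes, count 8 and buf 64.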

Measured on i.MX8MM and i.MX8MP, ~720 KiB SHA-256 verify with MMU on:
  ~300 ms (generic-C)
  -> 17 ms (crypto-ext, single block per call)
  -> 3-5 ms (crypto-ext, batched).
Both crypto-ext savings carry over with MMU off too, just shifted up
by the uncached-DRAM read cost.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Johannes Schneider <johannes.schneider@leica-geosystems.com>
---
 arch/arm/crypto/Makefile |  3 ++
 crypto/Kconfig           | 12 ++++++++
 crypto/sha2.c            | 66 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 55b3ac0538..72d4bd77c0 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -15,6 +15,9 @@ sha1-ce-y := sha1-ce-glue.o sha1-ce-core.o
 obj-$(CONFIG_DIGEST_SHA256_ARM64_CE) += sha2-ce.o
 sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
 
+# Reuse the asm core (glue is provided inline in crypto/sha2.c).
+pbl-$(CONFIG_PBL_DIGEST_SHA256_ARM64_CE) += sha2-ce-core.o
+
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
 
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 528e9a0d22..3dfb316b32 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -107,6 +107,18 @@ config DIGEST_SHA256_ARM64_CE
 	  Architecture: arm64 using:
 	  - ARMv8 Crypto Extensions
 
+config PBL_DIGEST_SHA256_ARM64_CE
+	bool "SHA-256 in PBL via ARMv8 Crypto Extensions"
+	depends on CPU_V8 && PBL_IMAGE
+	help
+	  Use ARMv8 Crypto Extensions (sha256h/sha256h2/sha256su0/sha256su1)
+	  for the SHA-256 transform inside the PBL. Roughly 100x faster than
+	  the generic-C transform; for callers that hash large blobs (e.g.
+	  fw-external SHA-256 verifies) this is the difference between tens
+	  of ms and hundreds. Requires Cortex-A53 or later with the optional
+	  Crypto Extensions feature.
+
+
 endif
 
 config CRYPTO_PBKDF2
diff --git a/crypto/sha2.c b/crypto/sha2.c
index cac5095648..06af886867 100644
--- a/crypto/sha2.c
+++ b/crypto/sha2.c
@@ -29,6 +29,44 @@
 #include <crypto/internal.h>
 #include <crypto/pbl-sha.h>
 
+#if defined(__PBL__) && IS_ENABLED(CONFIG_PBL_DIGEST_SHA256_ARM64_CE)
+/*
+ * PBL multi-block sha256 dispatch through the asm core in
+ * arch/arm/crypto/sha2-ce-core.S. The asm expects a sha256_ce_state-
+ * compatible struct and reads its `count` / `finalize` fields at the
+ * offsets advertised by the two link-time constants below. With
+ * finalize == 0 the asm runs just the block transform and writes the
+ * new midstate back into state[]; count/buf are untouched.
+ *
+ * Avoiding sha2-ce-glue.c here keeps the PBL out of the crypto-API and
+ * kernel_neon_begin shims, which add bytes and unrelated dependencies.
+ */
+struct pbl_sha256_ce_state {
+	u32	state[8];
+	u64	count;
+	u8	buf[64];
+	u32	finalize;
+};
+
+const u32 sha256_ce_offsetof_count    = offsetof(struct pbl_sha256_ce_state, count);
+const u32 sha256_ce_offsetof_finalize = offsetof(struct pbl_sha256_ce_state, finalize);
+
+extern int sha2_ce_transform(struct pbl_sha256_ce_state *sst,
+			     const u8 *src, int blocks);
+
+static void sha256_transform_blocks(u32 *state, const u8 *input,
+				    unsigned int blocks)
+{
+	struct pbl_sha256_ce_state sst;
+
+	memcpy(sst.state, state, sizeof(sst.state));
+	sst.finalize = 0;
+	sha2_ce_transform(&sst, input, blocks);
+	memcpy(state, sst.state, sizeof(sst.state));
+}
+
+#else /* generic C transform */
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
 	return z ^ (x & (y ^ z));
@@ -213,6 +251,18 @@ static void sha256_transform(u32 *state, const u8 *input)
 	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
 }
 
+static void sha256_transform_blocks(u32 *state, const u8 *input,
+				    unsigned int blocks)
+{
+	while (blocks--) {
+		sha256_transform(state, input);
+		input += 64;
+	}
+}
+
+#endif /* PBL crypto-ext vs generic */
+
+
 static int sha224_init(struct digest *desc)
 {
 	struct sha256_state *sctx = digest_ctx(desc);
@@ -258,18 +308,22 @@ int sha256_update(struct digest *desc, const void *data,
 	src = data;
 
 	if ((partial + len) > 63) {
+		unsigned int blocks;
+
 		if (partial) {
 			done = -partial;
 			memcpy(sctx->buf + partial, data, done + 64);
-			src = sctx->buf;
+			sha256_transform_blocks(sctx->state, sctx->buf, 1);
+			done += 64;
 		}
 
-		do {
-			sha256_transform(sctx->state, src);
-			done += 64;
-			src = data + done;
-		} while (done + 63 < len);
+		blocks = (len - done) / 64;
+		if (blocks) {
+			sha256_transform_blocks(sctx->state, data + done, blocks);
+			done += blocks * 64;
+		}
 
+		src = data + done;
 		partial = 0;
 	}
 	memcpy(sctx->buf + partial, src, len - done);
-- 
2.43.0




Thread overview: 2+ messages
2026-06-16 14:49 [PATCH] ARM: i.MX8M: enable MMU in PBL around fw-external BL32 verify Johannes Schneider
2026-06-16 14:49 ` Johannes Schneider [this message]
