mail archive of the barebox mailing list
From: "Ulrich Ölmann" <u.oelmann@pengutronix.de>
To: Ahmad Fatoum <a.fatoum@pengutronix.de>
Cc: barebox@lists.infradead.org,
	 David Picard <david.picard@clermont.in2p3.fr>
Subject: Re: [PATCH 3/3] Documentation: devel: troubleshooting: add new chapter
Date: Mon, 07 Jul 2025 11:05:38 +0200
Message-ID: <6r8ql0cqvx.fsf@pengutronix.de>
In-Reply-To: <20250704143803.2740813-4-a.fatoum@pengutronix.de> (Ahmad Fatoum's message of "Fri, 4 Jul 2025 16:38:03 +0200")

Hi Ahmad,

just some typos that I stumbled over.
Thanks for writing all those things down!

On Fri, Jul 04 2025 at 16:38 +0200, Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
> A consequence of running bare metal is that early failures are difficult
> to diagnose. Let's add a troubleshooting section to help users take
> the first step in diagnosing issues.
>
> Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
> ---
>  Documentation/devel/devel.rst           |   2 +
>  Documentation/devel/troubleshooting.rst | 377 ++++++++++++++++++++++++
>  Documentation/devicetree/index.rst      |   2 +
>  3 files changed, 381 insertions(+)
>  create mode 100644 Documentation/devel/troubleshooting.rst
>
> diff --git a/Documentation/devel/devel.rst b/Documentation/devel/devel.rst
> index d985bff40d42..b90805263bbd 100644
> --- a/Documentation/devel/devel.rst
> +++ b/Documentation/devel/devel.rst
> @@ -8,7 +8,9 @@ Contents:
>  .. toctree::
>     :maxdepth: 2
>  
> +   architecture
>     porting
> +   troubleshooting
>     filesystems
>     background-execution
>     project-ideas
> diff --git a/Documentation/devel/troubleshooting.rst b/Documentation/devel/troubleshooting.rst
> new file mode 100644
> index 000000000000..67c4e3102be2
> --- /dev/null
> +++ b/Documentation/devel/troubleshooting.rst
> @@ -0,0 +1,377 @@
> +.. _troubleshooting:
> +
> +##########################
> +Boot Troubleshooting Guide
> +##########################
> +
> +Especially during development or bring-up, very early failure situations can leave
> +the system hanging before recovery is even possible.
> +
> +This guide helps diagnose and debug such issues across barebox' different boot stages.
> +
> +Boot Flow Overview
> +==================
> +
> +A barebox binary consists of two main stages:
> +
> +1. **PBL (Pre-Bootloader)**: This is a smaller barebones loader that does
> +   what's necessary to download the full barebox binary.
> +   At the very least, this is decompressing barebox proper and jumping
> +   to it while passing it a device tree.
> +   Depending on platform, it may also need to set up DRAM, install a secure
> +   monitor like TF-A or a secure operating system like OP-TEE and chainload
> +   barebox from a boot medium.
> +2. **barebox proper**: The main bootloader logic. This is always loaded
> +   by a prebootloader passing a device tree and including drivers for
> +   device initialization, environment setup, and booting the OS.
> +
> +If barebox hangs, it's essential to identify *where* in this process the
> +failure occurs. Here's how to debug different stages.
> +
> +Refer to the :ref:`barebox architecture <architecture>` for more background
> +information on the different stages and the images.
> +
> +Completely silent console

s/Completely silent console/Completely Silent Console/

> +=========================
> +
> +Even the barebox prebootloader is most often loaded by another
> +bootloader. This is commonly a mask BootROM hardwired into the
> +System-on-chip.

s/System-on-chip/System-on-Chip/

> +
> +**Common problems**:
> +
> +- Wrong bootloader image or format
> +- Bootloader installed to wrong location
> +- System hang before serial driver probe
> +- enabled, but misconfigured CONFIG_DEBUG_LL

s/enabled/Enabled/

> +
> +**What to try**:
> +
> +- Check for BootROM boot indicators:
> +
> +  Some BootROMs (e.g. AT91) write to a serial port when they start up
> +  or blink a GPIO (e.g. STM32MP) if they fail to boot the next stage
> +  bootloader.
> +
> +- Check that barebox is in the format and at the location that the
> +  previous stage bootloader expects. Compare with a previously working
> +  bootloader image, refer to the barebox documentation and/or the
> +  vendor documentation or ask around.
> +
> +- Enable ``CONFIG_DEBUG_LL``
> +
> +  This enables very early low-level UART debugging.
> +  It bypasses console frameworks and writes directly to UART registers.
> +  Many boards in barebox print a ``>`` character when ``CONFIG_DEBUG_LL``
> +  is enabled. If you see such a character after enabling ``DEBUG_LL``, it
> +  indicates that the barebox prebootloader has been found and control was
> +  successfully handed over to it. Note that on some SoCs, ``DEBUG_LL``
> +  requires co-operation from the board entry point, e.g., the pin muxing for
> +  the serial console needs to be done in software in some situations before
> +  the UART is accessible from the outside.
> +
> +  .. note::
> +     Make sure the correct UART index or address is selected under
> +     **Kernel low-level debugging port** in ``menuconfig``.
> +     Configuring the wrong UART might hang your system, because barebox would
> +     be tricked into accessing hardware that's not there or is powered off.
> +     The numbering/addresses of ports are described in the System-on-Chip
> +     datasheet or reference manual and may differ from labels on the hardware.
> +     Refer to the config symbol help text and ``/chosen/stdout-path`` in the
> +     device tree if unsure.
> +
> +- Enable ``CONFIG_PBL_CONSOLE`` and ``CONFIG_DEBUG_PBL``
> +
> +  For boards that don't have an early ``putc_ll('>');``, the first output
> +  being printed is often the debugging output from the uncompress entry
> +  point (``barebox_pbl_start()``). Enable these options to see if the
> +  CPU gets that far.
> +
> +  .. warning::
> +     CONFIG_DEBUG_PBL increases the size of the PBL, which can make it
> +     exceed a hard limit imposed by a previous stage bootloader.
> +     Best case, this will be caught by the build system, but might not
> +     if you are adding a new board and haven't told it yet.
> +
> +- Toggle a GPIO from the board entry point
> +
> +  A number of platforms (e.g. i.MX or STM32MP) have header-only GPIO helper
> +  functions that can be used to toggle a GPIO. These can be used for
> +  debugging early hangs by toggling an LED for example.
> +
> +- Trace BootROM activity
> +
> +  If you have no indication that the barebox prebootloader is being started,
> +  consider tracing what the BootROM is doing, e.g. via JTAG or a logic analyzer
> +  for the SD-Card.

s/SD-Card/SD card/  <-- this spelling seems to dominate within barebox.

> +
> +If you managed to get some serial output, move along to the next step.
> +
> +Hang after first stage PBL console output

s/Hang after first stage PBL console output/Hang after First Stage PBL Console Output/

> +=========================================
> +
> +The first stage prebootloader handles:
> +- Basic initialization (e.g., clocks, SDRAM)
> +- installation of secure firmware if applicable

s/installation/Installation/

> +- invocation of the second stage

s/invocation/Invocation/

> +
> +**Common problems**:
> +
> +- issues in board entry point

s/issues/Issues/

> +- Hang in firmware
> +
> +**What to try**:
> +
> +- Check where hang occurs
> +
> +  If you get just some early output, you'll need to pinpoint where the issue
> +  occurs. if enabling ``CONFIG_PBL_CONSOLE`` along with a correctly configured

s/if enabling/If enabling/

> +  ``CONFIG_DEBUG_PBL`` doesn't help, try adding ``putc_ll('@')`` (or any other
> +  character) to find out where the startup is stuck. ``putc_ll`` has the
> +  benefit of being usable everywhere, even before ``setup_c()`` or
> +  ``relocate_to_current_adr()`` is called. Once these are called, you may
> +  also use ``puts_ll()`` or just normal ``printf`` if ``CONFIG_PBL_CONSOLE=y``.
> +
> +- Check if hang occurs in other loaded firmware
> +
> +  On platforms like i.MX8/9 and RK35xx, barebox will install ARM trusted
> +  firmware as secure monitor and possibly OP-TEE as secure OS.
> +  Hangs can happen if TF-A or OP-TEE is configured to access the wrong
> +  console (hang/abort on accessing peripheral with gated clock).
> +  If output ends with the banner of the firmware, jumping back to barebox
> +  may have failed. In that case, double check that the memory size
> +  configured for TF-A/OP-TEE is correct and that the entry addresses
> +  used in barebox and TF-A/OP-TEE are identical.
> +
> +Hang during chainloading

s/Hang during chainloading/Hang During Chainloading/

> +========================
> +
> +Once basic system initialization is done, barebox prebootloader
> +will load the second stage.
> +
> +**Common problems**:
> +
> +- wrong SDRAM setup

s/wrong/Wrong/

> +- corrupted barebox proper read from boot medium

s/corrupted/Corrupted/

> +
> +**What to try**:
> +
> +- Check computed addresses
> +
> +  If your last output is ``jumping to uncompressed image``, this suggests that
> +  the hang occurred while trying to execute barebox proper. barebox prints
> +  the regions it uses for its stack, barebox itself and the initial RAM
> +  as debug output. Verify these with the actual size of RAM installed and
> +  check if values are sane.
> +
> +- Check that barebox was loaded correctly
> +
> +  You can enable ``CONFIG_COMPILE_TEST`` and ``CONFIG_PBL_VERIFY_PIGGY``
> +  to have the barebox build system compute a hash of barebox proper,
> +  which the prebootloader will compare against the hash it computes
> +  over the compressed data read from the boot medium.
> +
> +- Check SDRAM setup
> +
> +  SDRAM setup differs according to the RAM chip being used, the System-on-chip,

s/System-on-chip/System-on-Chip/

> +  the PCB traces between them as well as outside factors like temperature.
> +  When a System-on-Module is used, the hardware vendor will optimally provide
> +  a validated RAM setup to be used. If RAM layout is custom, the System-on-Chip
> +  vendor usually provides tools for calculating initial timings and tuning them
> +  at runtime.
> +
> +  Because writes can be posted, issues with wrongly set up SDRAM may only become
> +  apparent on first execution or read and not during mere writing.
> +
> +  Issues of writes silently misbehaving should be detectable by
> +  ``CONFIG_PBL_VERIFY_PIGGY``, which reads back the data to hash it.
> +
> +  If the prebootloader is already running from SDRAM, boot hangs due to completely
> +  wrong SDRAM setup are less likely, but running a memory test from within barebox
> +  proper is still recommended.
> +
> +- Check if an exception happened
> +
> +  barebox can print symbolized stack traces on exceptions, but support for that
> +  is only installed in barebox proper. Early exceptions are currently not enabled
> +  by default, but can be enabled manually with ``CONFIG_ARM_EXCEPTIONS_PBL``.
> +
> +Preinitcall Stage
> +=================
> +
> +The prebootloader ``barebox_pbl_start`` ends up calling ``barebox_non_pbl_start``
> +in barebox proper. This function does:
> +
> +- relocation and setting up the C environment
> +- setting up the malloc area and KASAN
> +- calling ``start_barebox``, which runs the registered initcalls
> +
> +**Common problems**:
> +
> +- None, this is quite straightforward code
> +
> +**What to try**:
> +
> +- Check if the code is executed. This can be done with ``putc_ll``. ``printf``
> +  is not safe to use everywhere in this function, because the C environment
> +  may not be set up yet.
> +
> +initcall Stage

s/initcall/Initcall/

> +=================
> +
> +After decompression and jumping to barebox proper, barebox will walk through
> +the compiled-in initcalls.
> +
> +**Symptoms**:
> +
> +- Hangs after PBL output but before typical barebox banners
> +
> +**What to try**:
> +
> +- Enable ``CONFIG_DEBUG_INITCALLS`` while ``CONFIG_DEBUG_LL`` is enabled
> +
> +  This shows output for each initcall level, helping pinpoint where execution stops.
> +  ``CONFIG_DEBUG_LL`` is useful here, because it allows showing output, even
> +  before the first serial driver is probed.
> +
> +Driver Probe Stage
> +==================
> +
> +Initcalls don't necessarily correspond to driver probes as a driver may be
> +registered before a device or the device probe is postponed until resources
> +become available.
> +
> +**Symptoms**:
> +
> +- Hangs during hardware initialization
> +
> +**What to try**:
> +
> +- Enable ``CONFIG_DEBUG_PROBES``
> +
> +  This prints each driver probe attempt and can help isolate the problematic peripheral.
> +
> +- Disable drivers selectively to see if a shell can be reached.
> +
> +Interactive Console
> +===================
> +
> +If you see output only with ``CONFIG_DEBUG_LL``, but not otherwise, you may not
> +have any consoles enabled or you are looking at the wrong console.
> +
> +For testing, you can enable ``CONFIG_CONSOLE_ACTIVATE_ALL`` to have barebox
> +proper print out logs on all console devices that it registers.
> +
> +Once you have the correct console figured out, consider enabling the option
> +``CONFIG_CONSOLE_ACTIVATE_ALL_FALLBACK``. This will fall back to activating all
> +consoles when no console was activated by normal means (e.g. via the environment
> +or the device tree ``/chosen/stdout-path`` property).
> +
> +Kernel hang

s/Kernel hang/Kernel Hang/

> +===========
> +
> +**Symptoms**:
> +
> +- Hang after a line like
> +  ``Loaded kernel to 0x40000000, devicetree at 0x41730000``
> +
> +With kernel hangs, it's important to find out whether the hang happens in barebox
> +still or already while executing the kernel.
> +Without EFI loader support in barebox, there is no calling back from kernel to barebox,
> +so a kernel hanging is usually indicative of an issue within the kernel itself.
> +
> +It's often useful to copy the kernel image into ``/tmp`` instead of booting directly
> +to verify that the hang is not just a very slow network connection for example.
> +The ``-v`` option to :ref:`command_cp` is useful for that.
> +The file size copied may differ from the original if the means of transport rounds
> +up to a specific block size. In that case, round up the size on the host system
> +and run a digest function like :ref:`command_md5sum` to check that the image
> +was transferred successfully.
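
The round-up-and-hash step could be sketched on the host roughly like this.
This is only an illustration: the helper name ``padded_md5``, the 512-byte
block size and the assumption that the transport pads with zero bytes are
not taken from the patch and need to be checked against the actual setup.

```python
import hashlib


def padded_md5(path, block_size=512):
    """Return the MD5 hex digest of a file as if it were zero-padded
    up to the next multiple of block_size (assumed padding scheme)."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        # Hash the file in 1 MiB chunks to avoid loading it all at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    # Number of bytes missing to reach the next block boundary (0 if aligned).
    padding = -size % block_size
    digest.update(b"\x00" * padding)
    return digest.hexdigest()
```

The resulting digest would then be compared by hand with what ``md5sum``
prints in barebox for the copy in ``/tmp``. A mismatch can also just mean
that the padding assumption is wrong.
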
> +
> +If the image is transferred correctly, the :ref:`command_boot` verbosity is increased
> +by each extra ``-v`` option. At higher verbosity level, this will also print out
> +the device tree passed to the kernel. The :ref:`command_of_diff` command is useful
> +to :ref:`visualize only the fixups that were applied by barebox to the device tree<of_diff>`.
> +
> +If you are sure that the kernel is indeed being loaded, the ``earlycon`` kernel
> +feature can enable early debugging output before kernel serial drivers are loaded.
> +barebox can fixup an earlycon option if ``global.bootm.earlycon=1`` is specified.
> +
> +Spurious aborts/hangs

s|Spurious aborts/hangs|Spurious Aborts/Hangs|

> +=====================
> +
> +**Symptoms**:
> +
> +- Hangs/Panics/Aborts that happen in a non-deterministic fashion and whose

s|Hangs/Panics/Aborts|Hangs/panics/aborts|

> +  probability is greatly influenced by enabling/disabling barebox options
> +  and corresponding shifts in the barebox binary
> +
> +It's generally advisable to run a memory test to verify basic operation and to check
> +if the RAM size is sane. barebox provides two commands for this: :ref:`command_memtest`
> +and :ref:`command_memtester`. In addition, some silicon vendors like NXP provide their
> +own memory test blobs, which barebox can load to SRAM via :ref:`command_memcpy` and
> +execute using :ref:`command_go`. By having the memory test outside DRAM, a much more
> +thorough memory test is possible.
> +
> +With ``CONFIG_MMU=y``, the decompression of barebox proper in the prebootloader
> +and the runtime of barebox proper will execute with MMU enabled for improved performance.
> +
> +This increase in performance is due to caches and speculative execution.
> +barebox will mark memory mapped I/O devices and secure firmware as ineligible for
> +being accessed speculatively, but it can only do so if the memory size it's told
> +is correct and if secure memory is marked reserved in the device tree.
> +
> +The memory map as barebox sees it can be printed with the :ref:`command_iomem`
> +command. Everything outside ``ram`` region is mapped non executible and uncacheable

s/executible/executable/

Best regards
Ulrich


> +by default. Everything inside ``ram`` regions that doesn't have a ``[R]`` next
> +to it is cacheable by default. The :ref:`command_mmuinfo` command can be used
> +to show specific information about the MMU attributes for an address.
> +
> +Memory Corruption Issues
> +========================
> +
> +Some hangs might be caused by heap corruption, stack overflows, or use-after-free bugs.
> +
> +**What to try**:
> +
> +- Enable ``CONFIG_KASAN`` (Kernel Address Sanitizer)
> +
> +  This provides runtime memory checking in barebox proper and can detect
> +  invalid memory accesses.
> +
> +  .. warning::
> +     KASAN greatly increases memory usage and may itself cause hangs in
> +     constrained environments.
> +
> +
> +Summary of Debug Options
> +========================
> +
> ++-----------------------------+-------------------------------------------------------+
> +| Option                      | Description                                           |
> ++=============================+=======================================================+
> +| CONFIG_DEBUG_LL             | Early low-level UART output                           |
> ++-----------------------------+-------------------------------------------------------+
> +| CONFIG_PBL_CONSOLE          | Print statements from PBL                             |
> ++-----------------------------+-------------------------------------------------------+
> +| CONFIG_DEBUG_PBL            | Enable all debug output in the PBL                    |
> ++-----------------------------+-------------------------------------------------------+
> +| CONFIG_PBL_VERIFY_PIGGY     | Verify barebox proper in PBL before decompression     |
> ++-----------------------------+-------------------------------------------------------+
> +| CONFIG_ARM_EXCEPTIONS_PBL   | Enable exception handlers in PBL                      |
> ++-----------------------------+-------------------------------------------------------+
> +| CONFIG_DEBUG_INITCALLS      | Logs each initcall                                    |
> ++-----------------------------+-------------------------------------------------------+
> +| CONFIG_DEBUG_PROBES         | Logs each driver probe                                |
> ++-----------------------------+-------------------------------------------------------+
> +| CONFIG_KASAN                | Detects memory corruption                             |
> ++-----------------------------+-------------------------------------------------------+
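
As an illustration only, the early-output options from this table might be
collected in a config fragment like the one below. Which of them make sense
together depends on the board. As the text above notes, ``CONFIG_DEBUG_LL``
still needs the correct UART selected, and ``CONFIG_DEBUG_PBL`` may push the
PBL over a size limit.

```
CONFIG_DEBUG_LL=y
CONFIG_PBL_CONSOLE=y
CONFIG_DEBUG_PBL=y
CONFIG_DEBUG_INITCALLS=y
CONFIG_DEBUG_PROBES=y
```
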
> +
> +Final Tips
> +==========
> +
> +- If all else fails, a JTAG debugger to single-step through the code can
> +  be very useful. To help with this, ``CONFIG_PBL_BREAK`` triggers an
> +  exception at the start of execution of the individual barebox stages,
> +  which ``scripts/gdb/helper.py`` can use to correctly set the base
> +  address, so symbols are correctly located.
> diff --git a/Documentation/devicetree/index.rst b/Documentation/devicetree/index.rst
> index 94e8d04f63c3..4f25b6c6869b 100644
> --- a/Documentation/devicetree/index.rst
> +++ b/Documentation/devicetree/index.rst
> @@ -175,6 +175,8 @@ In the ``chosen``-node, barebox fixes up
>  These values can be read from the booted linux system in ``/proc/device-tree/``
>  or ``/sys/firmware/devicetree/base``.
>  
> +.. _of_diff:
> +
>  To see a dry run of what barebox would fixup, the ``of_diff`` command can be
>  used::
-- 
Pengutronix e.K.                           | Ulrich Ölmann               |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

Thread overview: 5+ messages
2025-07-04 14:38 [PATCH 0/3] Documentation: devel: add new troubleshooting Ahmad Fatoum
2025-07-04 14:38 ` [PATCH 1/3] Documentation: devel: porting: split out architecture intro Ahmad Fatoum
2025-07-04 14:38 ` [PATCH 2/3] Documentation: devel: architecture: detail first/second stage handling Ahmad Fatoum
2025-07-04 14:38 ` [PATCH 3/3] Documentation: devel: troubleshooting: add new chapter Ahmad Fatoum
2025-07-07  9:05   ` Ulrich Ölmann [this message]