Back to All Posts

Analysis of Watchdog Reset Caused by Firmware Cache Synchronization Issues

·Kyunghwan Kwon
tech
Analysis of Watchdog Reset Caused by Firmware Cache Synchronization Issues

Introduction

Pazzk provides not only its own reference boards but also firmware porting and system integration for customer hardware. To support diverse hardware constraints, the firmware maintains independent porting layers for both FreeRTOS and Zephyr, allowing the RTOS to be selected based on project requirements.

This article analyzes a watchdog reset observed on an ESP32-S3 board running a FreeRTOS-based ESP-IDF environment, where flash operations and cache synchronization paths were allowed to execute simultaneously.

Test environment and problem background

Pazzk uses a three-stage test process to verify firmware quality.

  1. Stage 1 — Development-stage testing: Function verification and initial stability tests performed directly by developers
  2. Stage 2 — In-house aging and automated testing: Long-duration operation and automated scenario tests on multiple boards deployed in the office
  3. Stage 3 — Limited beta testing: Conducted in real user environments within a friend-family or limited beta test group

Each stage uses a different build configuration according to the test objective. In particular, in Stage 2, to ensure both debugging convenience and reproduction of the real deployment environment, we operate by mixing the development build and production build at a 1:1 ratio.

사내 에이징 및 자동화 테스트 환경 — 여러 대의 ESP32-S3 보드를 장시간 병렬 운용

Although both builds use identical application code, they differ in system-level configuration, including MCU secure boot and flash access policies. In the production build, flash protection is activated. As a result, debugging via JTAG is blocked, and all flash data delivered over the external bus is transmitted in an encrypted state.

Classification Development Build Production Build
Flash encryption Disabled Enabled
read-out protection Disabled Enabled
JTAG debugging Available Unavailable
Flash Data Transmission Plain text Cipher text

As a result, in the production build environment, even with physical access, it is difficult to directly observe the internal state of problems that occur at runtime. Therefore, issues discussed in this article that only reproduce under specific timing conditions become difficult to analyze.

Memory requirements increase and PSRAM inroduction

As the product functionality expanded, runtime memory requirements steadily increased. In the early stages, the internal SRAM of the ESP32-S3 was sufficient, but with the introduction of additional features and an increase in the number of concurrent TLS connections, it became clear that relying solely on internal memory had its limitations.

To maintain a single mbedTLS-based TLS session, approximately 40 KiB of heap memory was required. Output buffers and application-level buffers could be adjusted, but the input record size could not be reduced. If the record size is lowered below that, session establishment may fail with some servers.

The TLS standard defines records with a maximum size of 16 KiB. While it is possible to reduce internal buffer sizes in implementations, compatibility issues may arise with some servers.

Taking into account future expansion of additional features such as VAS, we targeted at least six concurrent TLS connections. An additional 240 KiB (40 KiB × 6) of heap memory was required. At the time, the heap low watermark had dropped to around 25 KiB. At that point, the system was already operating with minimal memory headroom.

메모리 사용량 그래프 이미지

During the initial hardware design phase, securing more than 1 MiB of external memory was set as a requirement, and ultimately 2 MiB PSRAM was adopted. Thanks to this decision, starting from v1.0.6, we were able to fully implement memory expansion utilizing PSRAM.

Issue — Interrupt Watchdog Reset occurring with the combination of Flash Encryption and PSRAM

After introducing PSRAM, Interrupt Watchdog Resets began to occur intermittently in the stage 2 test environment. If we organize the conditions under which the issue occurs by function combinations, they are as follows.

WDT Reset Flash Encryption PSRAM
Not occurring Not used Not used
Not occurring Not used Used
Not occurring Used Not used
Occurring Used Used

The issue only occurred when using PSRAM together with flash encryption enabled. Therefore, it did not occur in the existing production build environment where only flash encryption was enabled, nor did it occur in the development build environment where only PSRAM was enabled during development. We were able to identify the issue in the second-stage test environment where the builds were tested at a 1:1 ratio.

Initially, we assumed that flash encryption increased flash access latency, which in turn delayed interrupt handling, resulting in an Interrupt Watchdog. Accordingly, we gradually increased the Interrupt Watchdog timeout to observe the symptoms, but the problem occurred identically even when the timeout was raised to several hundred milliseconds. This degree of timeout increase undermines system responsiveness and the execution guarantees of time-critical tasks, so we determined that the problem could not be solved by merely relaxing timing.

Next, we suspected the possibility of a deadlock between certain tasks during long-running execution. We adjusted the priorities of the suspected tasks and simplified the operating structure by binding them to a single core through processor affinity, but the problem still reproduced.

Because the issue only occurred after long operation, root cause tracing was not easy. However, analysis of the logs confirmed that the problem only occurred at the point when file system access and network communication were performed simultaneously. In other words, the issue reproduced at specific timing when flash I/O and the network stack were simultaneously active.

In the process, we also discovered a separate issue where, due to an implementation that erased the entire flash area at once during OTA updates, major tasks could be blocked for several seconds or more. Although this was unrelated to the main issue, it affected system responsiveness, so we improved it by switching from a whole-area erase method to a chunk-based erase method. As a result, the overall OTA duration slightly increased, but a single blocking interval was significantly reduced.

While we were forming various hypotheses without finding a lead, we began to suspect whether we might be using a memory region not capable of DMA for DMA. To check whether PSRAM was causing a conflict with DMA functionality, we reviewed the entire code and also started examining the relevant ESP-IDF source code. However, DMA requirements were already handled appropriately within the framework, and contention between DMA and PSRAM was ruled out as a possible cause.

Cause — Conflict between Flash Cache Disable section and esp_cache_msync

In the process of tracing the cause of the problem, we came to examine the possibility of a stall occurring during flash operations in sections where the cache is disabled — an angle we had not initially considered. In ESP-IDF, the cache is disabled when performing flash operations, and during this period, any code or data access that requires the cache becomes unavailable. Because this kind of system-level control handled by the framework has already been verified over a long period and is assumed to work well, it was difficult and burdensome to consider such a possibility from the outset.

The call flow of the ESP-IDF internal flash API is as follows:

  1. Before performing flash-related functions, it calls rom_spiflash_api_funcs->start (in esp_flash_api.c)
  2. This function is registered with an appropriate SPI interface in esp_flash_spi1_default_os_unctions.start (eg: spi1_startin components/spi_flash/spi_flash_os_func_app.c)
  3. spi1_start is a function that disables the cache. It calls cache_disable
  4. cache_disable() is a wrapper that calls spi_flash_disable_interrupts_caches_and_other_cpu, which actually disables the cache
  5. spi_flash_disable_interrupts_caches_and_other_cpu stops scheduling and prevents flash-related operations on the other core via an IPC call (in components/spi_flash/cache_utils.c)

In other words, while flash erase/write is being performed, cache access is completely blocked, and in this window, cache writeback or sync cannot be carried out properly.

Therefore, we initially assumed that the issue was caused by accesses to stack or code sections located in PSRAM, and attempted to mitigate it by configuring PSRAM to be used exclusively for data. However, the issue persisted.

Upon further analysis, this assumption did not hold up logically. The cache disabled interval is only temporary, and basic concurrency is already synchronized using spinlocks, making it unlikely that PSRAM-resident stack or code accesses were the root cause.

To find a clue to the problem, we decided to collect a coredump. Because the panic handler located on flash cannot operate normally during a cache disabled interval, we relocated the panic handler to SRAM and configured the Interrupt Watchdog to trigger a panic.

...

#0  0x4004e555 in ?? ()
#1  0x4004e65c in ?? ()
#2  0x4004ea14 in ?? ()
#3  0x40385b20 in cache_ll_invalidate_addr (cache_level=<optimized out>, type=CACHE_TYPE_ALL, cache_id=<optimized out>, vaddr=1010368512, size=65536) at /opt/es
p/idf/components/hal/esp32s3/include/hal/cache_ll.h:360
#4  cache_hal_invalidate_addr (vaddr=1010368512, size=65536) at /opt/esp/idf/components/hal/cache_hal.c:249
#5  0x40378ee0 in s_do_cache_invalidate (vaddr_start=1010368512, size=65536) at /opt/esp/idf/components/esp_mm/esp_mmu_map.c:407
#6  0x40378fd3 in s_do_mapping (target=MMU_TARGET_FLASH0, vaddr_start=1010368512, paddr_start=7864320, size=65536) at /opt/esp/idf/components/esp_mm/esp_mmu_map
.c:452
#7  0x420a8f35 in esp_mmu_map (paddr_start=7864320, size=<optimized out>, target=MMU_TARGET_FLASH0, caps=(MMU_MEM_CAP_READ | MMU_MEM_CAP_8BIT), flags=<optimized
 out>, out_ptr=0x3fcee500) at /opt/esp/idf/components/esp_mm/esp_mmu_map.c:596
#8  0x420a2dc1 in spi_flash_mmap (src_addr=7864320, size=1024, memory=SPI_FLASH_MMAP_DATA, out_ptr=0x3fcee560, out_handle=0x3fcee564) at /opt/esp/idf/components
/spi_flash/flash_mmap.c:88
#9  0x420ac244 in esp_partition_mmap (partition=0x3fcb6188, offset=<optimized out>, size=512, memory=ESP_PARTITION_MMAP_DATA, out_ptr=0x3fcee560, out_handle=0x3
fcee564) at /opt/esp/idf/components/esp_partition/partition_target.c:172
#10 0x420ac2c7 in esp_partition_read (partition=0x3fcb6188, src_offset=512, dst=0x3fcbc454, size=512) at /opt/esp/idf/components/esp_partition/partition_target.
c:50

...

#0  0x4037eedb in Cache_WriteBack_Addr (addr=1008346656, size=0) at /opt/esp/idf/components/esp_rom/patches/esp_rom_cache_esp32s2_esp32s3.c:141
#1  0x40385b48 in cache_ll_writeback_addr (cache_level=<optimized out>, type=CACHE_TYPE_DATA, cache_id=<optimized out>, vaddr=1008346656, size=16) at /opt/esp/i
df/components/hal/esp32s3/include/hal/cache_ll.h:403
#2  cache_hal_writeback_addr (vaddr=1008346656, size=16) at /opt/esp/idf/components/hal/cache_hal.c:264
#3  0x4037d514 in s_c2m_ops (vaddr=1008346656, size=16) at /opt/esp/idf/components/esp_mm/esp_cache_msync.c:56
#4  0x4037d60a in esp_cache_msync (addr=0x3c1a2620, size=16, flags=6) at /opt/esp/idf/components/esp_mm/esp_cache_msync.c:135
#5  0x420efeb7 in esp_aes_process_dma (ctx=0x3c1a2640, input=0x3c1a2620 <error: Cannot access memory at address 0x3c1a2620>, output=0x3fcbae40 "", len=16, strea
m_out=0x0) at /opt/esp/idf/components/mbedtls/port/aes/dma/esp_aes_dma_core.c:1006
#6  0x420e7a89 in esp_aes_crypt_ecb (ctx=0x3c1a2640, mode=1, input=0x3c1a2620 <error: Cannot access memory at address 0x3c1a2620>, output=0x3fcbae40 "") at /opt
/esp/idf/components/mbedtls/port/aes/dma/esp_aes.c:173
#7  0x420d9350 in ctr_drbg_update_internal (ctx=0x3c1a2620, data=0x3fcbae70 "") at /opt/esp/idf/components/mbedtls/mbedtls/library/ctr_drbg.c:363
#8  0x420d95b7 in mbedtls_ctr_drbg_random_with_add (p_rng=0x3c1a2620, output=<optimized out>, output_len=0, additional=0x0, add_len=<optimized out>) at /opt/esp
/idf/components/mbedtls/mbedtls/library/ctr_drbg.c:685
#9  0x420d95ec in mbedtls_ctr_drbg_random (p_rng=0x3c1a2620, output=0x3fcdb710 <error: Cannot access memory at address 0x3fcdb710>, output_len=32) at /opt/esp/i
df/components/mbedtls/mbedtls/library/ctr_drbg.c:708
#10 0x420ec3e1 in mbedtls_mpi_core_fill_random (X=0x3fcdb710, X_limbs=8, n_bytes=32, f_rng=0x420d95dc <mbedtls_ctr_drbg_random>, p_rng=0x3c1a2620) at /opt/esp/i
df/components/mbedtls/mbedtls/library/bignum_core.c:630

...

As a result of analyzing the coredump, the point at which the problem occurs was always the same code path. During flash operations, when cache writeback(esp_cache_msync) is invoked, it conflicts with the state where the cache is already disabled.

Espressif’s official documentation explicitly states, “Do not call esp_cache_msync during flash operations.” ESP-IDF does not internally serialize this situation or provide a concurrency-guarantee mechanism. Therefore, this combination is structurally bound to conflict.

In this case, during the TLS communication process, the AES hardware accelerator processes data via DMA and then calls esp_cache_msync. If this call overlaps with the cache-disabled period caused by flash operations, cache writeback becomes impossible, the system stalls for an extended time, and ultimately it leads to an Interrupt Watchdog Reset.

In other words, the issue is that the structural constraint — flash operations and cache synchronization must be performed mutually exclusively — was not guaranteed in ESP-IDF.

Result — Possible solutions and choices

The approaches to consider for solving this problem were the following five.

  1. Serializing flash operations and TLS operations at the application level
    • A method of sequencing flash operations and TLS processing at the application level. This shifts concurrency constraints that should be guaranteed inside the SDK onto the application, which is inefficient and cannot be considered a fundamental solution.
  2. Disabling the AES hardware accelerator
    • It is possible, but TLS processing speed is greatly reduced.
  3. Discontinuing the use of PSRAM
    • Not a realistic option since memory requirements already exceed internal SRAM limits.
  4. Forcibly bypassing or removing calls to esp_cache_msync
    • Since the DMA output area loses cache coherence, more serious problems may occur.
  5. Applying a patch to serialize flash operations (esp_flash) and cache writeback (esp_cache_msync) within the SDK
    • The only practical alternative to structurally resolve concurrency issues.

We prioritized solutions that did not require modifying third-party code as much as possible. However, since this issue was difficult to work around at the application level and was a structural conflict occurring within the SDK's internal execution path, we ultimately decided to modify the framework using the fifth approach. Since we are also using a patch to address the asynchronous processing issue in the I2S driver, we decided to accept some impact on maintainability.

diff --git a/components/mbedtls/port/aes/dma/esp_aes_dma_core.c b/components/mbedtls/port/aes/dma/esp_aes_dma_core.c
index a4a3d7c43b..78050f3aa1 100644
--- a/components/mbedtls/port/aes/dma/esp_aes_dma_core.c
+++ b/components/mbedtls/port/aes/dma/esp_aes_dma_core.c
@@ -986,6 +986,8 @@ int esp_aes_process_dma(esp_aes_context *ctx, const unsigned char *input, unsign
     bool output_needs_realloc = false;
     int ret = 0;
 
+    extern SemaphoreHandle_t g_extmem_flash_serial_lock;
+
     assert(len > 0); // caller shouldn't ever have len set to zero
     assert(stream_bytes == 0 || stream_out != NULL); // stream_out can be NULL if we're processing full block(s)
 
@@ -1003,11 +1005,14 @@ int esp_aes_process_dma(esp_aes_context *ctx, const unsigned char *input, unsign
         /* Flush cache if input in external ram */
 #if (CONFIG_SPIRAM && SOC_PSRAM_DMA_CAPABLE)
         if (esp_ptr_external_ram(input)) {
+            xSemaphoreTake(g_extmem_flash_serial_lock, portMAX_DELAY);
             if (esp_cache_msync((void *)input, len, ESP_CACHE_MSYNC_FLAG_DIR_C2M | ESP_CACHE_MSYNC_FLAG_UNALIGNED) != ESP_OK) {
                 mbedtls_platform_zeroize(output, len);
                 ESP_LOGE(TAG, "Cache sync failed for the input in external RAM");
+                xSemaphoreGive(g_extmem_flash_serial_lock);
                 return -1;
             }
+            xSemaphoreGive(g_extmem_flash_serial_lock);
         }
         if (esp_ptr_external_ram(output)) {
             size_t dcache_line_size;
@@ -1121,11 +1126,14 @@ int esp_aes_process_dma(esp_aes_context *ctx, const unsigned char *input, unsign
 #if (CONFIG_SPIRAM && SOC_PSRAM_DMA_CAPABLE)
     if (block_bytes > 0) {
         if (esp_ptr_external_ram(output)) {
+            xSemaphoreTake(g_extmem_flash_serial_lock, portMAX_DELAY);
             if(esp_cache_msync((void*)output, block_bytes, ESP_CACHE_MSYNC_FLAG_DIR_M2C) != ESP_OK) {
                 mbedtls_platform_zeroize(output, len);
                 ESP_LOGE(TAG, "Cache sync failed for the output in external RAM");
+                xSemaphoreGive(g_extmem_flash_serial_lock);
                 return -1;
             }
+            xSemaphoreGive(g_extmem_flash_serial_lock);
         }
     }
 #endif
diff --git a/components/mbedtls/port/sha/core/sha.c b/components/mbedtls/port/sha/core/sha.c
index 8aa2f1e0e1..e9731fb9ac 100644
--- a/components/mbedtls/port/sha/core/sha.c
+++ b/components/mbedtls/port/sha/core/sha.c
@@ -318,12 +318,17 @@ int esp_sha_dma(esp_sha_type sha_type, const void *input, uint32_t ilen,
     }
 
 #if (CONFIG_SPIRAM && SOC_PSRAM_DMA_CAPABLE)
+    extern SemaphoreHandle_t g_extmem_flash_serial_lock;
+    xSemaphoreTake(g_extmem_flash_serial_lock, portMAX_DELAY);
+
     if (esp_ptr_external_ram(input)) {
         esp_cache_msync((void *)input, ilen, ESP_CACHE_MSYNC_FLAG_DIR_C2M | ESP_CACHE_MSYNC_FLAG_UNALIGNED);
     }
     if (esp_ptr_external_ram(buf)) {
         esp_cache_msync((void *)buf, buf_len, ESP_CACHE_MSYNC_FLAG_DIR_C2M | ESP_CACHE_MSYNC_FLAG_UNALIGNED);
     }
+
+    xSemaphoreGive(g_extmem_flash_serial_lock);
 #endif
 
     /* Copy to internal buf if buf is in non DMA capable memory */
diff --git a/components/spi_flash/esp_flash_spi_init.c b/components/spi_flash/esp_flash_spi_init.c
index d16506d391..9a097ae15c 100644
--- a/components/spi_flash/esp_flash_spi_init.c
+++ b/components/spi_flash/esp_flash_spi_init.c
@@ -43,6 +43,8 @@ __attribute__((unused)) static const char TAG[] = "spi_flash";
 esp_flash_t *esp_flash_default_chip = NULL;
 #endif
 
+SemaphoreHandle_t g_extmem_flash_serial_lock;
+
 #if defined CONFIG_ESPTOOLPY_FLASHFREQ_120M
 #define DEFAULT_FLASH_SPEED 120
 #elif defined CONFIG_ESPTOOLPY_FLASHFREQ_80M
@@ -488,6 +490,11 @@ esp_err_t esp_flash_app_init(void)
     err = esp_flash_init_main_bus_lock();
     if (err != ESP_OK) return err;
 #endif
+
+    if ((g_extmem_flash_serial_lock = xSemaphoreCreateMutex()) == NULL) {
+        return ESP_ERR_NO_MEM;
+    }
+
     err = esp_flash_app_enable_os_functions(&default_chip);
     return err;
 }
diff --git a/components/spi_flash/flash_mmap.c b/components/spi_flash/flash_mmap.c
index d9476cbe30..b07dccb433 100644
--- a/components/spi_flash/flash_mmap.c
+++ b/components/spi_flash/flash_mmap.c
@@ -85,7 +85,10 @@ esp_err_t spi_flash_mmap(size_t src_addr, size_t size, spi_flash_mmap_memory_t m
     } else {
         caps = MMU_MEM_CAP_READ | MMU_MEM_CAP_8BIT;
     }
+    extern SemaphoreHandle_t g_extmem_flash_serial_lock;
+    xSemaphoreTake(g_extmem_flash_serial_lock, portMAX_DELAY);
     ret = esp_mmu_map(src_addr, size, MMU_TARGET_FLASH0, caps, ESP_MMU_MMAP_FLAG_PADDR_SHARED, &ptr);
+    xSemaphoreGive(g_extmem_flash_serial_lock);
     if (ret == ESP_OK) {
         vaddr_list[0] = (uint32_t)ptr;
         block->list_num = 1;
@@ -195,7 +198,10 @@ esp_err_t spi_flash_mmap_pages(const int *pages, size_t page_count, spi_flash_mm
     }
     for (int i = 0; i < block_num; i++) {
         void *ptr = NULL;
+        extern SemaphoreHandle_t g_extmem_flash_serial_lock;
+        xSemaphoreTake(g_extmem_flash_serial_lock, portMAX_DELAY);
         ret = esp_mmu_map(paddr_blocks[i][0], paddr_blocks[i][1], MMU_TARGET_FLASH0, caps, ESP_MMU_MMAP_FLAG_PADDR_SHARED, &ptr);
+        xSemaphoreGive(g_extmem_flash_serial_lock);
         if (ret == ESP_OK) {
             vaddr_list[i] = (uint32_t)ptr;
             successful_cnt++;

The applied solution is to serialize the flash operation path and the cache writeback path with a global mutex. This is a compromise for an immediate fix, carrying long-term maintenance risks due to being vulnerable to upstream changes in ESP-IDF and potentially altering the latency characteristics of the flash operation path.

Fundamentally, an upstream patch is needed that explicitly models and serializes the concurrency constraints between flash operations and the cache sync path inside the SDK.

Conclusion

The analysis process and patch summarized in this article clearly reveal the structural constraints that already existed within ESP-IDF and aim to ensure minimal execution safety at the framework level so that flash operations and cache synchronization paths cannot overlap in execution.

This patch is not a solution that fundamentally redesigns the internal behavior of the SDK; rather, it has the limitation of being a practical compromise using a global mutex. Nevertheless, in configurations where flash I/O and DMA-based cryptographic operations such as TLS, file system access, and OTA can occur simultaneously in a Flash Encryption environment, it provides meaningful stability improvements.

In a project that uses TLS communications together with flash operations while Flash Encryption is enabled, the reproduction conditions and analysis process summarized in this article can be directly used as reference materials to narrow down the cause of the problem or diagnose similar symptoms.

References

Comments