Rocksolid Light - comp.arch.embedded - Re: Library for save an events log in Flash

Il 19/04/2024 10:28, David Brown ha scritto:
> On 18/04/2024 22:36, pozz wrote:
>> Il 18/04/2024 21:30, David Brown ha scritto:
>>> On 18/04/2024 16:38, pozz wrote:
>>>> The request is very common: when some interested events occur in the
>>>> system, they, with the related timestamp, must be saved in a log.
>>>> The log must be saved in an external SPI Flash connected to a MCU.
>>>> The log has a maximum number of events. After the log is completely
>>>> filled, the new event overwrite the oldest event.
>>>>
>>>> I tried to implement such library, but I eventually found it's not a
>>>> simple task, mostly if you want a reliable library that works even
>>>> when errors occur (for example, when one writing fails).
>>>>
>>>> I started by assigning 5 sectors of SPI Flash to the log. There are
>>>> 256 events in a sector (the event is a fixed struct).
>>>> In this scenario, the maximum log size is 1024 events, because the
>>>> 5th sector can be erased when the pointer reaches the end of a sector.
>>>>
>>>> The first challenge is how the system can understand what is the
>>>> first (newest) event in the log at startup. I solved saving a
>>>> 16-bits counter ID to each event. The initialization routine starts
>>>> reading all the IDs and taks the greatest as the last event.
>>>> However initially the log is empty, so all the IDs are 0xFFFF, the
>>>> maximum. One solution is to stop reading events when 0xFFFF is read
>>>> and wrap-around ID at 0xFFFE and not 0xFFFF.
>>>
>>> Start at 0xfffe, and count down.
>>
>>
>> And what to do when the counter reaches zero? It can wrap-around up to
>> 0xfffe (that is very similar to an increasing counter from
>> 0x0000-0xFFFE).
>>
>
> How big are your log entries? How many entries are realistic in the
> lifetime of the system?

16 bytes each entry. It's difficult to reach 0xFFFF events in the log,
but why limit our fantasy? :-D

With 20 events per day, the log will be filled after 9 years. It's a
very long life, but I think I have solved the problem to understand if
the ID was wrapped-around.

>> > Or xor with 0xffff for storage. Or
>>> wrap at 0xfffe, as you suggest. Or use 32-bit values. Or have
>>> another way to indicate that the log entry is valid.
>>
>> I will add a CRC for each entry and that can be used to validate the
>> event. An empty/erased slot filled with 0xFF will not pass CRC
>> validation.
>>
>
> That's usually fine.
>
>>
>>> Or, since you have a timestamp, there's no need to track an ID - the
>>> timestamp will be increasing monotonically.
>>
>> I don't want to use timestamps for two reasons:
>>
>> - the system wall clock can be changed (the system is isolated)
>> - the library I'm writing doesn't know the content of "events", for
>>    it the event is an opaque sequence of bytes.
>>
>
> OK.
>
>>
>>>> However there's another problem. What happens after writing 65535
>>>> events in the log? The ID restarts from 0, so the latest event
>>>> hasn't the greatest ID anymore.
>>>>
>>>> These are the saved IDs after 65536 events:
>>>>
>>>>      1^ SECT    2^ SECT    3^ SECT    4^ SECT    5^SECT---------->
>>>>      0xFB00 ... 0xFC00 ... 0xFD00 ... 0xFE00 ... 0xFF00 ... 0xFFFF
>>>>
>>>> The rule "newest event has greatest ID" is correct yet. Now a new
>>>> event is written:
>>>>
>>>>      1^ SECT-------> 2^ SECT   3^ SECT   4^ SECT   5^SECT--------->
>>>>      0x0000 0xFB01.. 0xFC00 .. 0xFD00 .. 0xFE00 .. 0xFF00 .. 0xFFFF
>>>>
>>>> Now the rule doesn't work anymore. The solution I found is to detect
>>>> discontinuity. The IDs are consecutive, so the initialization
>>>> routine continues reading while the ID(n+1)=ID(n)+1. When there's a
>>>> gap, the init function stops and found the ID and position of the
>>>> newest event.
>>>
>>> Make your counts from 0 to 256*5 - 1, then wrap. Log entry "n" will
>>> be at address n * sizeof(log entry), with up to 256 log entries
>>> blank. Then you don't need to store a number at all.
>>
>> What do you mean with log entry "0"? Is it the oldest or the newest? I
>> think the oldest, because that formula is imho correct in this case.
>>
>> However it doesn't appear correct when the log has rotated, that
>> happens after writing 5x256+1 events. In this case the newest entry
>> ("n"=1024) is at address 0, not n*sizeof(entry).
>>
>
> (I misread your "5 sectors of an SPI flash chip" as "5 SPI flash chips"
> when first replying. It makes no real difference to what I wrote, but I
> might have used "chip" instead of "sector".)
>
> You have 256 entries per flash sector, and 5 flash sectors. For the log
> entry number "n" - where "n" is an abstract count that never wraps, your
> index "i" into the flash array is (n % 5*256). The sector number is
> then (i / 256), and the index into the sector is (i % 256). The
> position in the log is determined directly by the entry number, and you
> don't actually need to store it anywhere.
>
> Think of this a different way - dispense with the log entry numbers
> entirely. When you start up, scan the flash to find the next free slot.
> You do this by looking at slot 0 first. If that is not empty, keep
> scanning until you find a free slot - that's the next free slot. If
> slot 0 is empty, scan until you have non-empty slots, then keep going
> until you get a free one again, and that's the next free slot. If you
> never find a used slot, or fail to find a free slot after the non-free
> slots, then your first free slot is slot 0.
>
> Any new logs are then put in this slot, moving forward. If you need to
> read out old logs, move backwards. When storing new logs, as you are
> nearing the end of a flash sector (how near depends on the sector erase
> time and how often events can occur), start the erase of the next sector
> in line.

Yes, it is what I already do. However I disagree on the formula.

The higher-layer application requests log entry 0. What is it? The
newest event. My eventlog library should convert 0 to the slot index in
the Flash, that is directly related to the Flash addres (I don't really
need the number of sector here).

If the log is empty, event 0 for the application is slot 0 for the
eventlog library. However, if there are three events in the log, event 0
for the application is slot 2 for the eventlog library.

The application doesn't know anything about what I named the ID of the
event. It's just a number used by the lower-layer eventlog module to
find the first free slot at startup.

>>>> But he problems don't stop here. What happens if an event write
>>>> failed? Should I verify the real written values, by reading back
>>>> them and comparing with original data? In a reliable system, yes, I
>>>> should.
>>>>
>>>> I was thinking to protect an event with a CRC, so to understand at
>>>> any time if an event is corrupted (bad written). However the
>>>> possibility to have isolated corrupted events increase the
>>>> complexity of the task a lot.
>>>
>>> An 8-bit or 16-bit CRC is peanuts to calculate and check.
>>
>> I know. Here the increased complexity wasn't related to the CRC
>> calculation, but to the possibility to have isolated corrupted slots
>> in the buffer. Taking into account these corrupted slots isn't so
>> simple for me.
>>
>
> Think how such corruption could happen, and its consequences. For most
> event logs, it is simply not going to occur in the lifetime of working
> products - and if it does, as an isolated error in an event log, it
> doesn't matter significantly. Errors in a sensibly designed SPI NOR
> flash system would be an indication of serious hardware problems such as
> erratic power supplies, and then the log is the least of your concerns.
>
> The only thing to consider is a reset or power failure in the middle of
> writing a log event.

Yes, I agree with you.

>>> Write the whole log entry except for a byte or two (whatever is the
>>> minimum write size for the flashes), which must be something other
>>> than 0xff's. Check the entry after writing if you really want, then
>>> write the final byte or two. Then if there's a power-fail in the
>>> middle, it's obvious that the log entry is bad.
>>
>> I don't understand why to write in two steps. I think I can write all
>> the bytes of an entry, including CRC. After writing, I can read back
>> all the bytes and check integrity.
>>
>
> No, you can't. Or to be more accurate, you /can/, but it leaves you
> open to the risk of undetectable errors in the event of a power fail in
> the middle of writing. An 8-bit CRC has a 1 in 256 chance of passing if
> data is partially written.

I can use crc16. We are talking of power-fail/reset in the middle of
event writing, this is possible, but rare. I think the case a crc16
reporting a valid slot when the writing was wrong or incomplete is
extremely rare.

> If you have enough capacitance on the board to ensure that any write
> completes, and detection of failure of external power so that you never
> start a log write after power has failed, then it is probably safe to do
> the write as a single action.

No I can't know in advance when the power will fail.

>>> For SPI NOR flash, the risk of a bad write - other than through
>>> unexpected resets or power fails - is negligible for most purposes.
>>> Don't over-complicate things worrying about something that will never
>>> happen.
>>
>> Indeed unexpected resets, but mainly power failes, are possible. Rare,
>> but possible.
>>
>
> Yes.
>
>>
>>>> Suppose to write the 4th event with ID=3 at position 4 in the
>>>> sector. Now suppose the write failed. I can try to re-write the same
>>>> event at poisition 5. Should I use ID=4 or ID=5? At first, I think
>>>> it's better to use ID=5. The CRC should detect that at position 4 is
>>>> stored a corrupted event.
>>>> After that two events are written as well, ID=6 and 7.
>>>>
>>>> Now the higher application wants to read the 4th event of the log
>>>> (starting from the newest). We know that the newest is ID=7, so the
>>>> 4th event is ID=7-4+1=4. However the event with ID=4 is corrupted,
>>>> so we should check the previous ID=3... and so on.
>>>> Now we can't have a mathematical function that returns the position
>>>> of an event starting from it's ordinal number from the newest event.
>>>>
>>>>
>>>> Eventually I think it's better to use a public domain library, if it
>>>> exists.
>>>>
> In my experience, these things are simple enough to implement, and
> specific enough to the application and target, that I don't see the need
> of a library.
>
>

Subject	Replies	Author
Library for save an events log in Flash By: pozz on Thu, 18 Apr 2024	6	pozz

"Turn on, tune up, rock out." -- Billy Gibbons

devel / comp.arch.embedded / Re: Library for save an events log in Flash