A cache block can be in one of three states:
- Invalid: the block is not present in this cache. It must be fetched from memory or from another cache.
- Shared: the block is present in one or more caches and in memory. All processors can read it without having to communicate.
- Modified: the block is present in only one cache. The processor that holds it can read and write it without communication. If another processor tries to access the block, it must get it from this cache.
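These three states are abbreviated I, S, and M throughout the rest of these notes. A minimal sketch in Python (the enum and its names are my own labels, used by the later examples only for illustration):

```python
from enum import Enum

class State(Enum):
    INVALID = "I"   # not present in this cache
    SHARED = "S"    # present in one or more caches and in memory
    MODIFIED = "M"  # present in exactly one cache; the memory copy is stale
```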
To perform an invalidate, the processor simply acquires bus access and broadcasts the address to be invalidated on the bus. All processors continuously snoop on the bus, watching the addresses. The processors check whether the address on the bus is in their cache. If so, the corresponding data in the cache are invalidated.
When a write to a block that is shared occurs, the writing processor must acquire bus access to broadcast its invalidation. If two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation will be serialized when they arbitrate for the bus. The first processor to obtain bus access will cause any other copies of the block it is writing to be invalidated. If the processors were attempting to write the same block, the serialization enforced by the bus also serializes their writes. One implication of this scheme is that a write to a shared data item cannot actually complete until it obtains bus access. All coherence schemes require some method of serializing accesses to the same cache block, either by serializing access to the communication medium or another shared structure.
For writes we would like to know whether any other copies of the block are cached because, if there are no other cached copies, then the write need not be placed on the bus in a write-back cache. Not sending the write reduces both the time to write and the required bandwidth.
To track whether or not a cache block is shared, we can add an extra state bit associated with each cache block, just as we have a valid bit and a dirty bit. By adding a bit indicating whether the block is shared, we can decide whether a write must generate an invalidate. When a write to a block in the shared state occurs, the cache generates an invalidation on the bus and marks the block as modified. No further invalidations will be sent by that core for that block. The core with the sole copy of a cache block is normally called the owner of the cache block (MOSI).
Every bus transaction must check the cache-address tags, which could potentially interfere with processor cache accesses. One way to reduce this interference is to duplicate the tags and have snoop accesses directed to the duplicate tags. Another approach is to use a directory at the shared L3 cache; the directory indicates whether a given block is shared and possibly which cores have copies. With the directory information, invalidates can be directed only to those caches with copies of the cache block. This requires that L3 must always have a copy of any data item in L1 or L2, a property called inclusion.
When an invalidate or a write miss is placed on the bus, any cores whose private caches have copies of the cache block invalidate it. For a write miss in a write-back cache, if the block is exclusive in just one private cache, that cache also writes back the block; otherwise, the data can be read from the shared cache or memory.
To understand why this protocol works, observe that any valid cache block is either in the shared state in one or more private caches or in the exclusive state in exactly one cache. Any transition to the exclusive state (which is required for a processor to write to the block) requires an invalidate or write miss to be placed on the bus, causing all local caches to make the block invalid. In addition, if some other local cache had the block in exclusive state, that local cache generates a write-back, which supplies the block containing the desired address. Finally, if a read miss occurs on the bus to a block in the exclusive state, the local cache with the exclusive copy changes its state to shared.
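The invariant in the paragraph above can be stated directly: for any block, either some number of caches hold it in S, or exactly one cache holds it in M and every other cache holds it in I. A small hypothetical checker, assuming the per-cache states are given as single-letter strings:

```python
def coherence_invariant(states):
    """Check the single-writer/multiple-reader invariant for one block.

    `states` maps a cache id to that cache's state for the block
    ('I', 'S', or 'M').  Valid configurations: any number of 'S'
    copies, or exactly one 'M' copy with all other caches in 'I'.
    """
    modified = [c for c, s in states.items() if s == "M"]
    shared = [c for c, s in states.items() if s == "S"]
    if modified:
        return len(modified) == 1 and not shared
    return True

# Examples:
assert coherence_invariant({"c0": "S", "c1": "S", "c2": "I"})
assert coherence_invariant({"c0": "M", "c1": "I"})
assert not coherence_invariant({"c0": "M", "c1": "S"})
```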
State Diagram – CPU Requests
The top label is the event that caused the transition. The bottom label is the reply to that event.
- CPU tries to read an I block. I means the block is not present in this cache, so it has to be fetched from another cache or from memory, and a read miss has to be placed on the bus. When the missing block arrives, it enters state S.
- CPU reads an S block. This corresponds to a read hit. The block stays in the shared state and no request has to be placed on the bus.
- CPU tries to write an I block. I means the block is not present in this cache. A write miss has to be placed on the bus, and when the block has been fetched it enters state M.
- CPU writes to an S block. This corresponds to a write hit, but there may be other copies of this cache block which need to be invalidated. Therefore an invalidation message is placed on the bus. After the invalidation, this cache holds the only and most recent copy, so the block enters state M.
- CPU reads or writes an M block. Both correspond to cache hits, so no transaction has to be placed on the bus.
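The five CPU-side transitions above can be collected into one table. This is only a sketch of the controller's decision logic; representing the bus by a returned message name is a simplification of mine:

```python
# Maps (current state, CPU operation) -> (bus message or None, next state).
CPU_TRANSITIONS = {
    ("I", "read"):  ("read miss",  "S"),
    ("S", "read"):  (None,         "S"),   # read hit, no bus traffic
    ("I", "write"): ("write miss", "M"),
    ("S", "write"): ("invalidate", "M"),   # write hit, but other copies may exist
    ("M", "read"):  (None,         "M"),   # hit
    ("M", "write"): (None,         "M"),   # hit
}

def cpu_request(state, op):
    return CPU_TRANSITIONS[(state, op)]

assert cpu_request("I", "write") == ("write miss", "M")
assert cpu_request("S", "write") == ("invalidate", "M")
```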
State Diagram – Bus Requests
Bus requests are placed on the bus in response to CPU requests. The diagram shows how a cache block reacts to them.
- If the cache controller snoops an invalidation request for a cache block that is in state S in its cache, it must invalidate this cache block, since it means that another processor wants to write this block.
- If the cache controller snoops a write miss request for a cache block that is in state S in its cache, it must invalidate this cache block: another processor incurred a write miss on the block (which it will fetch from memory) and intends to write it.
- If the cache controller snoops a read miss request for a cache block that is in state S in its cache, no action is required. Memory will return the block to the processor that wants to read it, and this copy of the block can stay in state S.
- If the cache controller snoops a read miss request for a cache block that is in state M in its cache, another processor wants to read this block but does not have it. Since this cache holds the only copy and the memory copy is stale, the controller has to write the block back to the requester as well as to memory. The memory read access must be aborted, since the memory controller would otherwise return a stale copy of the block. The new state of this cache block is S, because after this transaction there are multiple copies of the block: this cache, the requesting cache, and memory.
- If the cache controller snoops a write miss request for a cache block that is in state M in its cache, another processor wants to write this block but does not have it. Since this cache holds the only copy and the memory copy is stale, the controller has to write the block back to the requester as well as to memory; the memory access must be aborted, since the memory controller would otherwise return a stale copy. Because the other processor will write to this block, this copy moves to state I.
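The snoop-side transitions above can be tabulated the same way as the CPU-side ones. Action names here are illustrative labels of mine; "write back" stands for supplying the block to the requester and to memory while the stale memory access is aborted:

```python
# Maps (current state, snooped bus request) -> (action or None, next state).
SNOOP_TRANSITIONS = {
    ("S", "invalidate"): ("invalidate copy", "I"),
    ("S", "write miss"): ("invalidate copy", "I"),
    ("S", "read miss"):  (None,              "S"),  # memory supplies the block
    ("M", "read miss"):  ("write back",      "S"),
    ("M", "write miss"): ("write back",      "I"),
}

def snoop(state, request):
    # Bus requests for blocks this cache holds in I are simply ignored.
    return SNOOP_TRANSITIONS.get((state, request), (None, state))

assert snoop("M", "read miss") == ("write back", "S")
assert snoop("I", "write miss") == (None, "I")
```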
The protocol assumes that operations are atomic—that is, an operation can be done in such a way that no intervening operation can occur. For example, the protocol described assumes that write misses can be detected, acquire the bus, and receive a response as a single atomic action.
In reality this is not true. In fact, even a read miss might not be atomic; after detecting a miss in the L2 of a multi-core, the core must arbitrate for access to the bus connecting to the shared L3. Nonatomic actions introduce the possibility that the protocol can deadlock, meaning that it reaches a state where it cannot continue.
One solution is for the processor that sends the invalidate to hold the bus until all other processors have received the invalidate.
The shared memory bus and snooping bandwidth are a bottleneck for scaling symmetric multiprocessors. A few ways to overcome this bottleneck are:
- Duplicate the tags and have snoop accesses directed to the duplicate tags.
- Use a directory at the shared L3 cache; the directory indicates whether a given block is shared and possibly which cores have copies. With the directory information, invalidates can be directed only to those caches with copies of the cache block.
- Use crossbars or point-to-point networks with banked memory.
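With a directory, an invalidate reaches only the caches the directory lists as sharers, instead of being broadcast to every core. A hypothetical sketch of that filtering step (the data structure and names are my own, not a real directory format):

```python
def send_invalidates(directory, block, requester):
    """Return the set of caches that must be invalidated for `block`.

    `directory` maps a block address to the set of core ids holding a
    copy, as a directory at a shared L3 might.  Only the listed sharers,
    minus the writer itself, receive an invalidate -- a broadcast bus
    would instead disturb every core's tag array.
    """
    sharers = directory.get(block, set())
    targets = sharers - {requester}
    directory[block] = {requester}  # the requester now holds the only copy
    return targets

d = {0x40: {0, 2, 3}}
assert send_invalidates(d, 0x40, 0) == {2, 3}
assert d[0x40] == {0}
```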
Add an exclusive (E) state to indicate a clean block present in only one cache. This removes the need for an invalidate on a write to an E block (MESI).
Add an owned (O) state for a dirty block that may be shared; the owner supplies the block on misses and writes back to memory only when the O block is replaced (MOESI).
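Putting the optional states together with the three base states gives the full MOESI set. The one-line descriptions below are a summary sketch, not tied to any one implementation:

```python
# The five MOESI states; MESI drops O, and MSI drops both E and O.
MOESI = {
    "M": "modified: dirty, only copy",
    "O": "owned: dirty but may be shared; written back only on eviction",
    "E": "exclusive: clean, only copy; a write needs no bus invalidate",
    "S": "shared: clean copy, other copies may exist",
    "I": "invalid: not present in this cache",
}
```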