caching - Write file: Data consistency in practice
I am working on a multi-user file storage system. A real-world system must cope with crashes and power failures, so I'm researching consistency and durability.
Many database systems support ACID, and modern operating systems support journaling file systems. I have noticed that logging is important to such systems: the log lets the system figure out what happened and what did not happen before a crash, so that it can do the appropriate recovery work when it restarts.
A typical logging system works in these steps:
- write the log (data or metadata)
- write the actual data
- commit the log
So when a system crash happens, there are a few possibilities:
- log not complete: ignore it
- log not committed: data not complete, so roll back
- log committed: operation finished
Some journaling file systems work like that.
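As a rough illustration of the three cases above, here is a minimal Python sketch that classifies one log record after a crash. The record format (`BEGIN <payload> END`, optionally followed by ` COMMIT`) and the names `LogState`, `classify` and `recovery_action` are made up for this example; a real journal would use a binary format with lengths and checksums.

```python
from enum import Enum


class LogState(Enum):
    INCOMPLETE = "incomplete"    # the log record itself was cut short
    UNCOMMITTED = "uncommitted"  # record is complete, but no commit mark
    COMMITTED = "committed"      # commit mark is present


def classify(record: bytes) -> LogState:
    """Classify one record in the hypothetical text format described above."""
    if not record.startswith(b"BEGIN ") or b" END" not in record:
        return LogState.INCOMPLETE
    if record.endswith(b" COMMIT"):
        return LogState.COMMITTED
    return LogState.UNCOMMITTED


def recovery_action(state: LogState) -> str:
    """Map each state to the action listed in the question."""
    return {
        LogState.INCOMPLETE: "ignore",
        LogState.UNCOMMITTED: "rollback",
        LogState.COMMITTED: "nothing to do",
    }[state]
```

A record whose commit mark was itself cut short (for example one ending in ` COM`) is classified as uncommitted, so it would be rolled back.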
I have no idea how a database system does this. In general, database software runs in userspace and, as far as I know, there are several things between a file-writing function and the disk surface:
- process cache
- system cache
- on-disk cache
So when the function returns, the data may not be on disk yet; it may still be in one of these caches.
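To make the layers concrete, here is a small Python sketch of a write that is pushed through the first two caches explicitly. The function name `durable_write` is made up for this example. Whether the final call also empties the drive's own cache depends on the operating system, the file system and the drive, so treat that part as an assumption rather than a guarantee.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask each software cache layer to pass it on."""
    with open(path, "wb") as f:
        f.write(data)         # data lands in Python's userspace buffer (process cache)
        f.flush()             # userspace buffer -> operating system cache
        os.fsync(f.fileno())  # ask the OS to push its cache out to the storage device
```

Without the `flush()` and `os.fsync()` calls, `write()` can return while the data is still sitting in one of the caches listed above.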
On Windows, caching can be disabled with the FILE_FLAG_NO_BUFFERING flag when calling CreateFile. MSDN says "when caching is disabled, read and write operations directly access the physical disk". So my first question is: does FILE_FLAG_NO_BUFFERING turn off the on-disk cache as well? If not, how can I make sure the data has reached the surface of the disk?
A further question: SATA and SCSI disks use "command queuing" technology, in which the commands in the queue are re-ordered so they can be processed more efficiently. A logging system depends on time order, so is command queuing bad for logging systems (in userspace)? If so, how can I make sure A has been written before B?
The basic way to overwrite data in a crash-safe way is:
1. Write the data to a new storage location first. (You're not overwriting anything yet.)
2. Tell the OS to flush the above to stable storage, using the POSIX fsync function. This is meant to flush caches and everything, so that when the function returns, the data is physically on disk.
3. Write a "journal" entry somewhere that indicates that the new data for the update has been written and is ready to commit.
4. Flush the journal entry to disk.
5. Read the data you wrote in step 1 and write it to the "real" storage location. (This is the actual overwrite.)
6. Write a journal entry that says the change has been committed.
7. Delete the temporary file created in step 1.
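The seven steps above could be sketched in Python roughly as follows. The file naming (`.new` for the temporary copy, `.journal` for the journal), the `READY` and `COMMITTED` entry strings, and the function names are all assumptions made for this example. The sketch also flushes the overwritten file before writing the commit entry, which the list above does not spell out, and it does not flush directory entries.

```python
import os


def _sync_write(path: str, data: bytes, mode: str) -> None:
    """Write data, then flush the process buffer and ask the OS to sync."""
    with open(path, mode) as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def safe_overwrite(target: str, new_data: bytes) -> None:
    tmp = target + ".new"          # hypothetical temporary location
    journal = target + ".journal"  # hypothetical one-operation journal

    # Steps 1-2: write the new data somewhere else and flush it.
    _sync_write(tmp, new_data, "wb")

    # Steps 3-4: record that the new data is ready, and flush the journal.
    _sync_write(journal, b"READY\n", "wb")

    # Step 5: the actual overwrite, using the data written in step 1.
    with open(tmp, "rb") as src:
        data = src.read()
    mode = "r+b" if os.path.exists(target) else "wb"
    with open(target, mode) as dst:
        dst.write(data)
        dst.truncate()             # drop any leftover tail of the old contents
        dst.flush()
        os.fsync(dst.fileno())     # added here: make the overwrite durable first

    # Step 6: record that the change has been committed.
    _sync_write(journal, b"COMMITTED\n", "ab")

    # Step 7: the temporary copy is no longer needed.
    os.remove(tmp)
```

Each `_sync_write` call doubles as one of the barriers: nothing after it is started until the flush before it has returned.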
The flushes serve as write barriers: they ensure that everything before the flush has been safely stored on disk before anything after the flush can be written. Between a pair of barriers, reordering of writes (e.g. due to the disk's command queueing) isn't a problem, because the barriers ensure that the order is correct in the places where it matters. In step 1, you don't care if the disk physically writes the second half of the file before it writes the first half; you only care that the whole file has been written before the journal entry attesting that the new file is complete.
After a crash, you go through the journal and process each entry:
- If you find a file from step 1 that doesn't have a corresponding entry from step 3, treat the file as incomplete and discard it. This is the rollback of an incomplete change.
- If the entry from step 3 is present but not the one from step 6, repeat step 5. It's possible that step 5 was partially completed before the crash, but that doesn't matter; it just means you might be overwriting some of the data with identical bytes, which is harmless.
- If the entry from step 6 is present, repeat step 7 by deleting the file if it still exists.
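The three recovery cases could look something like this in Python. This is only a sketch under assumptions: the temporary file is named `<target>.new`, the journal is `<target>.journal` and holds the entries of a single operation as the lines `READY` (step 3) and `COMMITTED` (step 6), and the function names are invented for the example.

```python
import os


def _sync_write(path: str, data: bytes, mode: str) -> None:
    """Write data, then flush the process buffer and ask the OS to sync."""
    with open(path, mode) as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def recover(target: str) -> str:
    """Bring one target file back to a consistent state after a crash."""
    tmp = target + ".new"
    journal = target + ".journal"

    entries = []
    if os.path.exists(journal):
        with open(journal, "rb") as f:
            entries = f.read().splitlines()
    ready = b"READY" in entries          # a torn line such as b"REA" does not count
    committed = b"COMMITTED" in entries

    if not os.path.exists(tmp):
        return "clean"                   # no operation was in flight
    if not ready:
        os.remove(tmp)                   # incomplete change: discard it
        return "rolled back"
    if committed:
        os.remove(tmp)                   # only step 7 was missing
        return "cleaned up"

    # READY without COMMITTED: repeat step 5, then finish steps 6 and 7.
    with open(tmp, "rb") as src:
        data = src.read()
    _sync_write(target, data, "wb")
    _sync_write(journal, b"COMMITTED\n", "ab")
    os.remove(tmp)
    return "redone"
```

Running `recover` twice in a row is harmless: the second call finds no temporary file and returns `"clean"`.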
You might find it informative to read PostgreSQL's documentation on reliability and write-ahead logging (which is PostgreSQL's term for the sort of journaling mechanism described above). It incorporates additional safety measures, such as checksumming of WAL (journal) entries to protect against corruption, and disk flushes are deferred and batched for better performance during normal operation (at the expense of crash recovery possibly taking a little longer).
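To give a feel for what checksumming journal entries buys you, here is a small Python sketch. It is not PostgreSQL's actual WAL format: the record layout (a 4-byte length and a 4-byte CRC-32 followed by the payload) and the names `encode_record` and `decode_records` are assumptions made for the example.

```python
import struct
import zlib

# Hypothetical record header: payload length, then CRC-32 of the payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_records(buf: bytes):
    """Yield valid payloads in order; stop at the first torn or corrupt record."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        length, crc = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        payload = buf[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # the record was cut short or damaged: ignore it and the rest
        yield payload
        offset = start + length
```

During recovery, a record that was only half written when the power failed fails the length or checksum test, so it is treated as if it had never been written.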
Speaking of databases, however, it'd be easier and safer to use one, with its robust and well-tested consistency and durability mechanisms, than to try to roll your own. If a full database server like PostgreSQL is too heavyweight for your application, consider using the lighter SQLite or Berkeley DB (which is a low-level key-value store, not a SQL relational database). Both support atomic commits.
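For example, with SQLite through Python's standard `sqlite3` module, an all-or-nothing update of several stored files might look like the sketch below. The `files` table, its columns, and the function names `open_store` and `replace_files` are made up for this illustration.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical file store backed by SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "name TEXT PRIMARY KEY, content BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def replace_files(conn: sqlite3.Connection, updates: dict) -> None:
    """Replace several files atomically: either all updates apply or none."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        for name, content in updates.items():
            conn.execute(
                "INSERT OR REPLACE INTO files (name, content) VALUES (?, ?)",
                (name, content),
            )
```

If any statement in the batch fails, for instance because a content value violates the NOT NULL constraint, the `with conn:` block rolls the whole transaction back and the previously stored contents stay as they were.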