These are my notes for TDT4225 . These are my own understandings of the subject and it may not be completely correct or accurate. It is also written in slightly below average Norwegian English.
Why do I post them here?
There is a couple of reasons. The first is for convenience. I like to write Markdown as it is very easy to structure text easily and it does not take any focus away from the text, it also has great readability both uncompiled and compiled. It is also very practical to have this available on the web as i like to read my notes on my iPad, and it is faster to access here. The last reason is that someone may find them useful.
Chapter: 2.1, 2.2, 2.3, 2.6, 4.1-4.4
- in-memory data structure
- sorted by keys
- key/value pairs or deletion markers
- kept small, a few megabytes
- Sorted string table
- Sequence of entries sorted by key
- key/value pairs or deletion markers
- index at end
- Kept small, typically 2 MB
Organized in levels:
Snapshots and iterators
Recovery and durability
The log contains complete records, which makes things recoverable.
The log may be disabled to increase performance.
Sstables are immutable and only new files are created during compaction.
Paper: Christian Forfang, Evaluation of High-Performance Key-Value Stores Chapter: 3.1, 3.2, 3.3, 3.5, 4.1-4.4
Developed for OpenLDAP as a replacement for BerkeleyDB.
Key/value store implemented using B+-trees.
Ordered map interface for range queries.
Transactions through MVCC (multi-version concurrency control)
Writers don't block readers
Readers don't block writers
Memory-mapped files, recovery free restart
No background compaction
Write transactions create a new version of the tree so that readers can use the old version. New version becomes current when write is finished.
New transactions are redirected to the new tree. Existing transactions may use the old tree.
Only one write transaction at a time.
When pages become unavailable, they are marked as reusable in a subdatabase.
Neighbour pages may be spread around on disk...
Book: Distributed Systems - Concepts and Design Chapter: 10
First really big filesharing application. Quickly grew. Used some central servers for indexing. After initial query, the data was directly shared between clients.
Time and global states
Book: Distributed Systems - Concepts and Design Chapter: 14
Skew: Diff at a given time
Drift: Diff over time
- Point in time
- Can be used to figure out order
- Needs synchronized clocks
Perfect synchronization is impossible.
- Ordering of events
- Ordering in focus
- Usualy implemented using a counter
If the order of events is the only important thing, physical time is overkill.
- Very accurate
- Based on Atomic clock
- Adjusted based on the earths rotation
- Timezones are offsets of UTC
- Distributed via GPS (1micros) and ground (1ms)
External sync: Distributed system synchronizing against external sources (e.g GPS)
Internal sync: Distributed system synchronizing internally in a distributed system. This does not necessarily give the correct time.
Problem: Communication takes time.
Implementation of External sync.
External time server S Local server p
- p sends a request to S
- S sends a response with the current time t
- When p gets response, set clock to time t + half the time since initial request.
Implementation of Internal sync.
One master, rest slaves
- Master polls all slaves
- Slaves responds with current clock
- Master calculates the average clock (ignores big diffs, takes transfer time into account)
- Master sends out individual diffs to every slave.
Network Time Protocol (NTP)
Protocol for time synchronization over the internet. Uses UTC.
Focus: - Security - Scalability - Correct - Reliability
The NTP servers i built up in a hierarchy where the root node is directly connected to a UTC source and the leaf nodes are client machines.
Synchronization: - Multicast - Procedure call (like Christians algorithm) - Symmetric mode (Communication between different servers, high accuracy)
Why: - Distributed garbage collection - Distributed deadlock detection - Distributed debugging
How? Cuts and global consistent states.
Distributed garbage collection
Objects without a active reference is garbage.
Reference: - Local - At other nodes/hosts/processes - In messages
Distributed deadlock detection
A subset of the global history.
For all events e in cut C......
Coordination and agreement
Book: Distributed Systems - Concepts and Design Chapter: 15
Reliable channel: The message always makes it through, it is not changed and faults are handled before the end system.
Async system: Cannot make assumptions of time.
Sync system: - Max time of message transfer - Max time of instruction execution - Timeouts to detect errors
Distributed Mutual exclusion
Multiple processes shares a resource.
Now, distributed with muliple machines, uses messages to communicate.
Algorithms: - Central server - Token-ring - Multicast - Polling
All processes ask for access. The access is granted to one process at a time. The server can quickly become a bottleneck.
Messagetypes: request, grant and release.
A token is sent throug the ring, the process which holds the token can access the critical resource. The token is located at the same process during the entier operation.
Needs 0 to N messages to enter, 1 message to exit.
Needs 2(N-1) messages to enter and 0 messages to exit.
Needs 2sqrt(N) to enter and sqrt(N) to exit. Better than multicast if N > 4.
Book: Operating Systems in Depth Chapter: 6
Fast storage for storing full tracks of information. Improves reads, not writes.
Larger blocks, improved layout
Major innovations of FFS/UFS.
Problem with larger blocks is wasted space. FFS used complicated strategy, split blocks into fragments, where a fragment where two sectors. The FS must keep track of all the free blocks, but also the partially filled blocks. Files must be put in continouos fragments.
FFS tries to arrange tings so files can be expanded.
As long as there is plenty of free blocks, whole allocation is used.
Allocating partial blocks containing just the number of fragments needed and pays the price for copying when files grow.
Keep data blocks close to the corresponding inodes on disk. Keep inodes for directories close to one another.
Regions / Cylinder groups / Block groups
A group contains a space set aside for inodes as well as a space for data and indirect blocks.
It is important to identify cylinder groups with plenty of free space. FFS uses quadratic hashing to quickly find groups with sufficient free space and allocates groups uniformly.
Reducing rotational latency
Reducing latency is done by placing successive blocks one block apart. This technique is called block interleaving.
Clustering and extents
Used in ext2 and later versions of FFS. Instead of using block interleaving, clustering is done by grouping blocks together so that more can be read in a sequence. The preallocations is stashed together as fixed-sized blocks.
Files are treated as collections of larger contiguous regions of disk space called extents. One large file can consist of a single extent, allowing fast access with little metadata. External fragmentation is a problem.