8.4. Self-Organizing Lists¶
8.4.1. Introduction¶
While ordering of lists is most commonly done by key value, this is not the only viable option. Another approach to organizing lists to speed search is to order the records by expected frequency of access. While the benefits might not be as great as when sorted by key value, the cost to organize (at least approximately) by frequency of access can be much cheaper, and thus can speed up sequential search in some situations.
Assume that we know, for each key \(k_i\), the probability \(p_i\) that the record with key \(k_i\) will be requested. Assume also that list \(\mathbf{L}\) is ordered so that the most frequently requested record is first, then the next most frequently requested record, and so on. Search in the list will be done sequentially, beginning with the first position. Over the course of many searches, the expected number of comparisons required for one search is
In other words, the cost to access the record in \(\mathbf{L}[0]\) is 1 (because one key value is looked at), and the probability of this occurring is \(p_0\). The cost to access the record in \(\mathbf{L}[1]\) is 2 (because we must look at the first and the second records' key values), with probability \(p_1\), and so on. For \(n\) records, assuming that all searches are for records that actually exist, the probabilities \(p_0\) through \(p_{n-1}\) must sum to one.
Certain probability distributions give easily computed results.
Example 8.4.1
Calculate the expected cost to search a list when each record has equal chance of being accessed (the classic sequential search through an unsorted list). Setting \(p_i = 1/n\) yields
This result matches our expectation that half the records will be accessed on average by normal sequential search. If the records truly have equal access probabilities, then ordering records by frequency yields no benefit. In the more general case, we must consider the probability (labeled \(p_n\)) that the search key does not match that for any record in the array. In that case, the general formula gives us
Thus, \(\frac{n+1}{2} \leq \overline{C}_n \leq n\), depending on the value of \(p_0\).
A geometric probability distribution can yield quite different results.
Example 8.4.2
Calculate the expected cost for searching a list ordered by frequency when the probabilities are defined as
Then,
For this example, the expected number of accesses is a constant. This is because the probability for accessing the first record is high (one half), the second is much lower (one quarter) but still much higher than for the third record, and so on. This shows that for some probability distributions, ordering the list by frequency can yield an efficient search technique.
In many search applications, real access patterns follow a rule of thumb called the 80/20 rule. The 80/20 rule says that 80% of the record accesses are to 20% of the records. The values of 80 and 20 are only estimates; every data access pattern has its own values. However, behavior of this nature occurs surprisingly often in practice (which explains the success of caching techniques widely used by web browsers for speeding access to web pages, and the use of a buffer pool to speed access to data stored in slower memory such as a disk drive). When the 80/20 rule applies, we can expect considerable improvements to search performance from a list ordered by frequency of access over standard sequential search in an unordered list.
Example 8.4.3
The 80/20 rule is an example of a Zipf distribution. Naturally occurring distributions often follow a Zipf distribution. Examples include the observed frequency for the use of words in a natural language such as English, and the size of the population for cities (i.e., view the relative proportions for the populations as equivalent to the "frequency of use"). Zipf distributions are related to the Harmonic Series. Define the Zipf frequency for item \(i\) in the distribution for \(n\) records as \(1/(i {\cal H}_n)\). The expected cost for the series whose members follow this Zipf distribution will be
When a frequency distribution follows the 80/20 rule, the average search looks at about 10-15% of the records in a list ordered by frequency.
This is potentially a useful observation that typical "real-life" distributions of record accesses, if the records were ordered by frequency, would require that we visit on average only 10-15% of the list when doing sequential search. This means that if we had an application that used sequential search, and we wanted to make it go a bit faster (by a constant amount), we could do so without a major rewrite to the system to implement something like a search tree. But that is only true if there is an easy way to (at least approximately) order the records by frequency.
In most applications, we have no means of knowing in advance the frequencies of access for the data records. To complicate matters further, certain records might be accessed frequently for a brief period of time, and then rarely thereafter. Thus, the probability of access for records might change over time (in most database systems, this is to be expected). Self-organizing lists seek to solve both of these problems.
Self-organizing lists modify the order of records within the list based on the actual pattern of record access. Self-organizing lists use a heuristic for deciding how to reorder the list. These heuristics are similar to the rules for managing buffer pools. In fact, a buffer pool is a form of self-organizing list. Ordering the buffer pool by expected frequency of access is a good strategy, because typically we must search the contents of the buffers to determine if the desired information is already in main memory. When ordered by frequency of access, the buffer at the end of the list will be the one most appropriate for reuse when a new page of information must be read.
8.4.1.1. Frequency Count¶
There are three traditional heuristics for managing self-organizing lists.
The most obvious way to keep a list ordered by frequency would be to store a count of accesses to each record and always maintain records in this order. This method will be referred to as frequency count or just "count". Count is similar to the least frequently used buffer replacement strategy. Whenever a record is accessed, it might move toward the front of the list if its number of accesses becomes greater than a record preceding it. Thus, count will store the records in the order of frequency that has actually occurred so far. Besides requiring space for the access counts, count does not react well to changing frequency of access over time. Once a record has been accessed a large number of times under the frequency count system, it will remain near the front of the list regardless of further access history.
8.4.2. Move to Front¶
Bring a record to the front of the list when it is found, pushing all the other records back one position. This is analogous to the least recently used buffer replacement strategy and is called move-to-front. This heuristic is easy to implement if the records are stored using a linked list. When records are stored in an array, bringing a record forward from near the end of the array will result in a large number of records (slightly) changing position. Move-to-front's cost is bounded in the sense that it requires at most twice the number of accesses required by the optimal static ordering for \(n\) records when at least \(n\) searches are performed. In other words, if we had known the series of (at least \(n\)) searches in advance and had stored the records in order of frequency so as to minimize the total cost for these accesses, this cost would be at least half the cost required by the move-to-front heuristic. (This can be proved using amortized analysis.) Finally, move-to-front responds well to local changes in frequency of access, in that if a record is frequently accessed for a brief period of time it will be near the front of the list during that period of access. Move-to-front does poorly when the records are processed in sequential order, especially if that sequential order is then repeated multiple times.
8.4.3. Transpose¶
Swap any record found with the record immediately preceding it in the list. This heuristic is called transpose. Transpose is good for list implementations based on either linked lists or arrays. Frequently used records will, over time, move to the front of the list. Records that were once frequently accessed but are no longer used will slowly drift toward the back. Thus, it appears to have good properties with respect to changing frequency of access. Unfortunately, there are some pathological sequences of access that can make transpose perform poorly. Consider the case where the last record of the list (call it \(X\)) is accessed. This record is then swapped with the next-to-last record (call it \(Y\)), making \(Y\) the last record. If \(Y\) is now accessed, it swaps with \(X\). A repeated series of accesses alternating between \(X\) and \(Y\) will continually search to the end of the list, because neither record will ever make progress toward the front. However, such pathological cases are unusual in practice. A variation on transpose would be to move the accessed record forward in the list by some fixed number of steps.
8.4.3.1. An Example¶
While self-organizing lists do not generally perform as well as search trees or a sorted list, both of which require \(O(\log n)\) search time, there are many situations in which self-organizing lists prove a valuable tool. Obviously they have an advantage over sorted lists in that they need not be sorted. This means that the cost to insert a new record is low, which could more than make up for the higher search cost when insertions are frequent. Self-organizing lists are simpler to implement than search trees and are likely to be more efficient for small lists. Nor do they require additional space. Finally, in the case of an application where sequential search is "almost" fast enough, changing an unsorted list to a self-organizing list might speed the application enough at a minor cost in additional code.
As an example of applying self-organizing lists, consider an algorithm for compressing and transmitting messages. [1] The list is self-organized by the move-to-front rule. Transmission is in the form of words and numbers, by the following rules:
- If the word has been seen before, transmit the current position of the word in the list. Move the word to the front of the list.
- If the word is seen for the first time, transmit the word. Place the word at the front of the list.
Both the sender and the receiver keep track of the position of words in the list in the same way (using the move-to-front rule), so they agree on the meaning of the numbers that encode repeated occurrences of words. Consider the following example message to be transmitted (for simplicity, ignore case in letters).
The car on the left hit the car I left
The first three words have not been seen before, so they must be sent as full words. The fourth word is the second appearance of "the" which at this point is the third word in the list. Thus, we only need to transmit the position value "3". The next two words have not yet been seen, so must be sent as full words. The seventh word is the third appearance of "the", which coincidentally is again in the third position. The eighth word is the second appearance of "car", which is now in the fifth position of the list. "I" is a new word, and the last word "left" is now in the fifth position. Thus the entire transmission would be
The car on 3 left hit 3 5 I 5
This approach to compression is similar in spirit to Ziv-Lempel coding, which is a class of coding algorithms commonly used in file compression utilities. Ziv-Lempel coding replaces repeated occurrences of strings with a pointer to the location in the file of the first occurrence of the string. The codes are stored in a self-organizing list in order to speed up the time required to search for a string that has previously been seen.
[1] | The compression algorithm and the example used both come from the following paper: J.L. Bentley, D.D. Sleator, R.E. Tarjan, and V.K. Wei, "A Locally Adaptive Data Compression Scheme", Communications of the ACM 29, 4(April 1986), 320-330. |