I can't believe it's been 5 years since my last blog post. Skip a few weeks, then a few months, and next thing you know it's 5 years. In any event, here's a little Fun and Profit with Markov Chain Monte Carlo.
Posted at 09:51 PM | Permalink | Comments (0)
After a recent release I did some crude calculations to see how the new feature performed. We have a business intelligence team to do these analyses but I was eager and wanted early answers. The initial numbers looked promising but were they statistically significant or just the result of dumb luck? This eventually led me to do some interesting power analysis using G*Power 3, which I'll now share with you.
Lest I divulge any company secrets I will use the following fictitious example. A recent study says men with guitars are more attractive to women. Our customers are predominantly women so we want to see if a redesigned registration page, one with a handsome man playing a guitar, would increase signups compared to our current registration page featuring an adorable puppy. Our A/B test scenario looks like this,
After running our A/B test we will be in one of four possible states: a true positive, a false positive (probability alpha), a false negative (probability beta), or a true negative.
What we want is low alpha and high power, where power is 1 - beta. We want to minimize the probability of a false positive or a false negative. Using G*Power 3 our numbers look like this,
For alpha 0.05 and power 0.8 we need a total sample size of 88. This is for an effect size of 0.3.
Effect size is "practical" significance. It measures the magnitude of the difference. Our registration rate may be 40% whereas our purchase rate may only be 4%. The effect sizes for increases to 45% and 4.5% are 0.0269 and 0.0116, respectively, using this handy online calculator (Cramer's V). How does this impact our power analysis?
Analysis: A priori: Compute required sample size
Input: Effect size w = 0.0269
α err prob = 0.05
Power (1-β err prob) = 0.8
Df = 1
Output: Noncentrality parameter λ = 7.8489977
Critical χ² = 3.8414588
Total sample size = 10847
Actual power = 0.8000069
Analysis: A priori: Compute required sample size
Input: Effect size w = 0.0116
α err prob = 0.05
Power (1-β err prob) = 0.8
Df = 1
Output: Noncentrality parameter λ = 7.8488848
Critical χ² = 3.8414588
Total sample size = 58330
Actual power = 0.8000012
We go from a sample size of about 10k to about 58k, which intuitively makes sense, since trying to determine if a 0.5% bump in purchase rate is due to a handsome guitar man or dumb luck will take a lot more samples than the more "obvious" 5% in registration conversion.
So now the picture is complete. We estimate our effect size and along with our desired alpha and power we can determine the sample size. We then run our A/B test for this sample size. Once the results are in we do another calculation to see whether the guitar man makes a difference in our registrations. My bet is still on the adorable puppy.
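That last calculation can be as simple as a chi-square test on the raw signup counts. A rough sketch in Ruby, with made-up numbers (44 visitors per page, matching the total sample size of 88 above):

def chi_square_2x2(a_conv, a_total, b_conv, b_total)
  observed = [a_conv, a_total - a_conv, b_conv, b_total - b_conv]
  n    = (a_total + b_total).to_f
  conv = a_conv + b_conv
  expected = [a_total * conv / n, a_total * (n - conv) / n,
              b_total * conv / n, b_total * (n - conv) / n]
  observed.zip(expected).sum { |o, e| (o - e)**2 / e }
end

# 18 of 44 puppy visitors signed up vs 22 of 44 guitar-man visitors (made-up counts)
chi2 = chi_square_2x2(18, 44, 22, 44)
puts chi2 > 3.841 ? "guitar man wins at alpha 0.05" : "could still be dumb luck"

With one degree of freedom, anything above the critical value of 3.841 is significant at alpha 0.05; these made-up counts fall well short.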
Posted at 06:35 PM | Permalink | Comments (0) | TrackBack (0)
I have always thought the "Y" in Y Combinator was a reference to Yahoo. I knew their co-founder Paul Graham (the "Lisp guy") had ties with Yahoo (they acquired his company Viaweb) so I just assumed the "Y" must be a shortening of Yahoo, just like YUI. Boy was I wrong.
The realization occurred watching Doug Crockford (the "Javascript guy") reflect on the history of Javascript. Crockford initially thought Javascript was a joke until he found out it had lambdas. He proceeded to present Javascript's version of the Y Combinator, at which point I knew I was an idiot.
A quick read of Y Combinator's FAQ confirmed my embarrassing ignorance,
Why did you choose the name "Y Combinator?"
The Y combinator is one of the coolest ideas in computer science. It's also a metaphor for what we do. It's a program that runs programs; we're a company that helps start companies.
In the spirit of WTF is F-Bounded Polymorphism, I now ask, WTF is a Y Combinator? From the factorial example above we can see that it has something to do with recursion but there is no explicit recursion involved (factorial does not reference factorial).
Here is a Ruby explicit recursive factorial function in which we do reference factorial within the body of the factorial,
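something like,

def factorial(n)
  n < 2 ? 1 : n * factorial(n - 1)
end

factorial(5)   # => 120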
We will now try to get rid of this explicit recursion. First, write this as lambdas instead of using def, giving us
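roughly,

factorial = lambda { |n| n < 2 ? 1 : n * factorial.call(n - 1) }

factorial.call(5)   # => 120

The lambda closes over the factorial variable, so the recursion is still explicit.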
We can further abstract out factorial as an argument to the lambda,
This reads: define factorial as a function that takes in a fac function and returns another function that takes an integer n. We can now call factorial, passing in factorial as argument, then pass in an integer (here 5).
Let's abstract lambda { |fac| ...} and call it H. Nothing has changed, just less clutter.
In mathematics, this is called a fixpoint. We pass an argument to a function and the result is that argument. Here the argument is the factorial function. It could just be a number; for example, zero and one are fixpoints of x = x^2 (x squared). In lambda calculus speak, when H is applied to factorial the result is factorial. Or, calling H with argument factorial returns factorial. factorial is a fixpoint of H.
We would like to generalize this. So let us define a function Y that takes in a function and returns a fixpoint of the function. Hey, this is our Y Combinator!
We can now get a recursive factorial function from the non-recursive H function if we had a suitable Y function.
Now here is the crazy part: Y is not recursive. That's correct, Y does not need recursion. Here is the magical Y Combinator in Ruby,
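or at least a sketch of one, built around the le argument and the f.call(f) self-applications that come up below:

y_combinator = lambda do |le|
  lambda { |f| f.call(f) }.call(
    lambda { |f| le.call(lambda { |x| f.call(f).call(x) }) }
  )
end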
With the Y Combinator we can now do this,
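roughly,

factorial = y_combinator.call(
  lambda { |fac| lambda { |n| n < 2 ? 1 : n * fac.call(n - 1) } }
)

factorial.call(5)   # => 120

The lambda handed to y_combinator is essentially the H from above: its body never mentions itself, it just expects a fac argument for the recursive call.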
Our non-recursive factorial function looks very much like a recursive factorial function, except it requires a factorial argument to work beyond the base case. The Y Combinator supplies us with this factorial function argument. You can kind of see how this works even if it's not 100% crystal clear. The le (lambda expression) argument is our non-recursive factorial function. We essentially remember the non-recursive factorial function and through a couple of self applications f.call(f) we're able to generate a copy of the non-recursive factorial when needed.
To make it a little clearer, let us derive the Y Combinator using regular def but without recursion. This time, we'll use a length function as an example. The length function takes in an array and returns the number of elements in the array. Assume arrays have a "rest of" method that returns the remainder of the array (i.e., Array#slice(1..-1)). Also assume we have an "eternity" function that never returns.
Instead of defining a total length function (one that will work with all arrays) let us define a partial length function that only works for arrays of length 0, 1 or 2.
length_2 still requires a length_function, which we currently don't have, so let's use the eternity_function. If we pass in a 3-element array things won't work since the eternity_function won't return.
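Spelled out by hand it might look like this (rest_of and eternity_function are my stand-ins for the helpers described above; note the three nearly identical definitions):

def rest_of(arr)
  arr[1..-1]            # everything but the first element
end

def eternity_function(x)
  loop { }              # never returns
end

length_0 = lambda { |arr| arr.empty? ? 0 : 1 + eternity_function(rest_of(arr)) }
length_1 = lambda { |arr| arr.empty? ? 0 : 1 + length_0.call(rest_of(arr)) }
length_2 = lambda { |arr| arr.empty? ? 0 : 1 + length_1.call(rest_of(arr)) }

length_2.call([1, 2])      # => 2
# length_2.call([1, 2, 3]) never returns; it bottoms out in eternity_function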
To remove the repetitions we will define a make_length function and our length_2 function will now look like this,
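Here make_length takes the length function to delegate to, and method(:eternity_function) just wraps the def so it responds to call; roughly,

def make_length(length_function)
  lambda { |arr| arr.empty? ? 0 : 1 + length_function.call(rest_of(arr)) }
end

length_2 = make_length(make_length(make_length(method(:eternity_function))))

length_2.call([1, 2])      # => 2, but anything longer still hits eternity_function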
Nothing has changed, length_2 will still bomb for arrays with more than 2 elements. From the above you can see how we just need to find a way to generate a new copy of make_length if we want to define length_3, length_4, and so on up to length_infinity.
Instead of passing in the eternity function how about passing in the make_length function? Worth a try,
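One sketch of that idea (the body now refers only to its mk_length argument, never to make_length by name):

def make_length(mk_length)
  lambda { |arr| arr.empty? ? 0 : 1 + mk_length.call(mk_length).call(rest_of(arr)) }
end

length = make_length(method(:make_length))

length.call([1, 2, 3])        # => 3
length.call((1..100).to_a)    # => 100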
With these changes we can now determine the length of a 3 element array. We can rename our functions,
and with that we have a general length function that will work for all arrays. We've generated a recursive function from a non-recursive one. We are able to essentially create copies of the non-recursive function as needed.
There you have it, the Y Combinator. And to think, all these years, I thought it was the Yahoo Combinator. It's rather mind blowing (the Y Combinator, not my ignorance). What are the characteristics of human life? Intelligence and reproduction. We can think and we can make babies.
Here we demonstrated how code can reproduce, make new copies of itself. Code evolution. First there was the non-reproducing make_length function that could only calculate the length of zero-element arrays. As chance would have it, in the primordial soup was also the Y Combinator function. When make_length and the Y Combinator mixed we gained the ability to generate new copies of the make_length function when needed.
Lots of good information on Y Combinators on the web. My two favorites: The Little Schemer and The Implementation of Functional Programming Languages, which the above is largely a summary of.
Posted at 06:30 PM | Permalink | Comments (0) | TrackBack (0)
I recently acquired the writeaheadlog.com domain. It seems like all the cool kids have interesting domains, like allthingsdistributed.com or crazybob.org, and I didn't want to be left out. your-name.com is so 2008. "Ahead" inspires forward thinking and optimism: full steam ahead! "Write" and "log" are appropriate and factual: this is a log of my writing. Taken together, write-ahead logging is used in many database systems, hinting this will be a software and programming centric blog. Surprised such a great domain was still available; out of 2,267,233,742 Internet users I alone deem it worthy of $19.99/year.
So what exactly is a write-ahead log? Imagine your application has two counters. The constraint is the counters must be equal in all consistent states. We might double each counter during a transaction,
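Roughly like this, in pseudocode (read and write touch the in-memory copy t; output pushes the block to disk):

read(A, t);  t := t * 2;  write(A, t)
read(B, t);  t := t * 2;  write(B, t)
output(A)
output(B)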
Problems arise when there are system failures. After output(A) there is a power outage so output(B) does not get executed. On disk, A is 20 but B is still 10, violating our constraint. We need a mechanism to handle such failures since they cannot be prevented.
Enter the log. The log records information about transactions so we can restore our system to a consistent state. The first log approach, the undo log, reverses the changes of incomplete transactions. In our example, upon recovery, changes to A are undone so A is once again 10 and (A == B == 10). The log is of course written to nonvolatile storage.
An undo log looks something like this,
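For the doubling transaction above, recording the old values, it might be,

<start T>
<T, A, 10>
<T, B, 10>
-- flush log --
output(A)
output(B)
<commit T>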
When we update A we log a record indicating its before value, 10. Likewise, when we change B from 10 to 20 we record its before value, 10. Before outputting A and B to disk we must flush the log (the undo log records, not the data). Only after output(A) and output(B) are successful can we record <commit T>.
With the undo log in place, how do we recover from failure? We read the undo log from the end (most recently written record) to the start and find incomplete transactions. Transactions with a <commit> record we can ignore because we know that <commit> can only be recorded after output has been successful. If there is no <commit> record for the transaction we cannot be certain that output was successful, so we use the undo information to revert the changes. <T, B, 10> sets B back to 10 and <T, A, 10> sets A back to 10. The undo records are idempotent, so if there is a crash during recovery we can just recover as usual, setting B to 10 even if B is already 10 and setting A to 10 even if A is already 10. The undo log records <abort T> to indicate we aborted the transaction.
The undo log is great but there is one annoying performance issue. Before we can record <commit T> in the undo log we must do output(A) and output(B), incurring disk I/O. If we have a lot of transactions we keep having to do output in order to maintain the integrity of the undo log. We may want to buffer the output until a convenient time.
Enter the redo log. Instead of undoing a change we will record information (the new value v) so we can rerun transactions, reapplying the change if necessary. Before doing any output we must record the <commit> record. We write-ahead. The write-ahead logging rule,
Before modifying any database element X on disk, it is necessary that all log records pertaining to this modification of X, including both the update record <T, X, v> and the <COMMIT T> record, must appear on disk.
So our redo log will look something like this,
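For our transaction, roughly (new values this time, and the commit goes in before any output):

<start T>
<T, A, 20>
<T, B, 20>
<commit T>
-- flush log --
output(A)
output(B)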
We record the new values (20 and 20) then commit then flush the log. We can only do output after the <commit T> record has been written and the log flushed. This solves the issue of buffering our output. We still will do disk I/O for the log flush but logs are sequential appends so it can be much faster than random outputs to random blocks on the disk (see my blog on Log Structured File Systems for Dummies, not that I'm calling you a dummy).
To recover with a redo log we begin at the head of the log, scanning forward (the opposite of the undo log). If we see an incomplete transaction (no <commit T> record) we can ignore it (aside from adding an <abort T>), knowing that no output was ever done. However, if we see a <commit T> record we don't know whether the output was successful or not, so we just redo the change, even if it's redundant: A would be set to 20 and B would be set to 20, even if those values had already made it to disk before the crash. Like the undo log, the changes are idempotent so repeated applications are fine.
Now you know the inspiration for writeaheadlog.com. There is a lot more information in Database Systems: The Complete Book (which the above is a blatant plagiarism of). For example, we can combine the undo and redo logs to create an undo/redo log that stores both before and after values. This allows us more flexibility in when we write the <commit T> record and when we need to flush the log and do output. There are also challenges around checkpointing so we don't have to read the entire log.
Incidentally, while very few of us have the skills (myself included) or the desire to write a transaction/log/recovery manager, the undo/redo approach to maintaining consistency can be used to mimic transaction-like properties. Imagine remote HTTP calls or NoSQL systems that lack strict transaction support. We might use the undo/redo approach so our application can recover to a consistent state by repeatedly applying idempotent changes in the face of failures.
Posted at 08:58 PM | Permalink | Comments (0) | TrackBack (0)
Premature optimization may be the root of all evil in the go to environment of the 1960s, but these days it may well be the difference between Friendster and Facebook. Performance is important. Some say it's a feature. I've been working on some performance related issues the last few days and thought I'd share some thoughts. In particular, how to measure page loading and rendering using Navigation Timing, Selenium WebDriver and Chrome Content Scripts extensions.
It's easy enough to measure server latency and response times: launch some ec2 instances, run a few scripts. I did this to compare curl pretransfer times for a few hosting services. To my surprise, googleapis was the slowest.
I was somewhat hesitant to say that Google was slow but it looks like Pingdom came to similar conclusions. Should be noted that this may not tell the whole story. Google may have better geo edge distribution, uptime, transfer bandwidth, etc.
What is more tricky is measuring performance from a real user's perspective as seen through the browser. To do this, I had to dig a little deeper into Navigation Timing and how browsers load and render pages. This is the general model of web timing,
I've created a pagespeed test page to help see things in action. Fill in the various download values to simulate slow downloading of Javascript, CSS and images. Some examples,
You can enter in combinations. For example, slow Javascript in the head and even slower CSS at the bottom (wait for the text to turn green).
This is interesting and all, but how can we measure it? Using Selenium WebDriver and the PerformanceTiming interface. Here is a little demo video. The web timing script does the following,
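roughly this, sketched in Ruby with the selenium-webdriver gem (the test-page URL is a placeholder, and the arithmetic is my guess at how the averages below map onto PerformanceTiming fields):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :chrome
driver.get 'http://www.tinou.com/pagespeed.html'   # placeholder URL for the test page

# Copy the PerformanceTiming fields we care about into a plain object.
t = driver.execute_script(<<~JS)
  var pt = window.performance.timing;
  return { dnsStart: pt.domainLookupStart, dnsEnd: pt.domainLookupEnd,
           connStart: pt.connectStart, connEnd: pt.connectEnd,
           sslStart: pt.secureConnectionStart, reqStart: pt.requestStart,
           respStart: pt.responseStart, respEnd: pt.responseEnd,
           fetchStart: pt.fetchStart, domInteractive: pt.domInteractive,
           loadEnd: pt.loadEventEnd };
JS

puts "dns lookup           #{t['dnsEnd'] - t['dnsStart']}"
puts "conn time            #{t['connEnd'] - t['connStart']}"
puts "ssl handshake        #{t['connEnd'] - t['sslStart']}"   # bogus for plain http, where sslStart is 0
puts "response latency     #{t['respStart'] - t['reqStart']}"
puts "transfer time        #{t['respEnd'] - t['respStart']}"
puts "fetch to interactive #{t['domInteractive'] - t['fetchStart']}"
puts "fetch to loaded      #{t['loadEnd'] - t['fetchStart']}"

driver.quit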
I do this thrice then display the averages, which for this run with my bad Comcast internet are,
dns lookup avg 148
conn time avg 59
ssl handshake avg 1335773222840
response latency avg 122
transfer time avg 0
DOM parse time avg 5420 (includes blocking js/css, does not factor in progressive rendering)
fetch to interactive time avg 5757 (time from fetch to page being interactive)
*** page is rendered ***
defer scripts time avg 4977
remaining image loading time avg 0
post load time avg 0
fetch to loaded time avg 10736
elapsed fold avg 3487
elapsed total avg 5420
Took 148ms to do the DNS lookup for tinou.com. Then waited 122ms for my slow server to respond. The response is so small that it didn't really capture any transfer time. It took about 5.4s to parse the DOM, which aligns with the 5 seconds it takes to download the bottom CSS. Since the defer Javascript was loaded in parallel, the defer script time is only 4.97s, not the full 10s. The fetch to loaded time is 10.7s (when everything is loaded) due to the defer Javascript taking so long.
What about the elapsed fold average and elapsed total average? Those measurements are tricky. The PerformanceTiming interface doesn't capture these values. Even though it took 5.4s to render the page (due to the bottom CSS), the user is able to see everything above the fold in 3.5s (due to the head Javascript).
I was able to capture those numbers with a Chrome extension. I could have edited my pagespeed page to calculate the numbers directly but wanted to capture these values unobtrusively, without changing the page. The Chrome extension uses Content Scripts. I tell Chrome to inject some Javascript at "document start." The script does the following,
Then it's just simple math to get the time from the start to when the fold was displayed and when the page bottom was displayed. Note, I am not fully aware of the impact of the extension code but it doesn't appear to cause noticeable issues (like hogging/freezing the page). I ran it for bestbuy.com and amazon.com. Here is Amazon's 3-run result,
runs 3
http://www.amazon.com
Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
dns lookup avg 118
conn time avg 92
ssl handshake avg 1335777335397
response latency avg 164
transfer time avg 461
DOM parse time avg 668 (includes blocking js/css, does not factor in progressive rendering)
fetch to interactive time avg 1052 (time from fetch to page being interactive)
*** page is rendered ***
defer scripts time avg 22
remaining image loading time avg 541
post load time avg 13
fetch to loaded time avg 1628
elapsed fold avg 266
elapsed total avg 648
So it took about 260ms for amazon.com to render above the fold and another 400ms or so for the rest of the page to finish. Here's the Amazon video.
So there you have it, a simple way to get web timing and see how your pages are performing from the user's perspective.
Posted at 02:23 AM | Permalink | Comments (0) | TrackBack (0)
Although I've never used Riak, I've been a distant fan just because it's written in Erlang. Erlang, systems that never stop! ® In one of Basho's whitepapers they mention the use of the Log-structured Merge Tree (LSM-tree) data structure for fast indexing. So what's an LSM-tree? It's a "disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period." Riak is often used in write-heavy environments so it's important that indexing is fast. So what's an LSM-tree again? Hmmm, LSM-trees are inspired by the Log-structured File system (LSF), so I better first learn a little more about LSF.
The driving force behind Ousterhout and Rosenblum's Log-structured File system was (is) the mechanical limitations of the disk drive. Unlike processors or memory, disk drives have mechanical moving parts and are governed by the laws of Newtonian physics. To read or write to disk the arm first has to move to the desired track, then there's a rotational delay until the disk spins to the relevant sector. This access time is in the milliseconds, which is an eternity compared to memory speed or processor cycles. Access time overhead is exacerbated when the workload is frequent, small reads and writes. More (relative) time is spent moving the disk head around than actually transferring data.
[Aside. Slow disk drives are one of the reasons I prefer to develop on desktops and not laptops. You get a fancy new MacBook Pro with the latest processor and a shit load of RAM only to be bounded by I/O. Money is better spent on the fastest disk drive you can buy.]
The situation for reads is "easily" solved with a file cache. More memory, bigger caches, better hit rates, fewer read requests that have to go to disk. But more memory does not help as much with writes. File systems can buffer more writes in memory before flushing to disk but the flushes still need to be frequent to avoid data loss; and the writes still involve accessing random parts of the disk.
To see this clearly, below is a diagram of a traditional Unix File System involving writing two single-block files in two directories.
The Unix FS involves 8 random, non-sequential writes (numbered, but not in that order): 4 to the inodes and 4 to the data blocks (2 directories, 2 files). Half of these are synchronous writes to avoid leaving the file system in an inconsistent state. The other half can be done with an asynchronous delayed write-back. Newer file systems have many optimizations to help with performance, like keeping inodes and data blocks closer together, but the point remains that these types of file systems suffer from the limitations of disk access time.
Ousterhout and Rosenblum's log-structured file system gets around this by avoiding random, non-sequential writes altogether. Writes are done asynchronously in a large sequential transfer. This minimizes the access time latency and allows the file system to operate closer to the disk's maximum throughput rate. As the diagram shows, the same information is written to disk: 4 inodes and 4 data blocks (2 directories, 2 files). But it's written sequentially by appending to the log. Data (both metadata like inode and the actual file data) is never overwritten in-place, just appended to the log.
This is clever and all but how do we get the data back?!? In the traditional Unix FS the inodes are at fixed location(s). Given inode number 123 it's easy to calculate its disk location with a little math, and once we have the inode location we can get the data blocks. This doesn't work with LSF since inodes are not fixed--they're appended to the log just like the data blocks. Easy enough, create an inode map that maps inodes to their locations. Wait a second, how can we then find the location of the inode maps? Finally, it's time to write to a fixed location, the checkpoint region.
The checkpoint region knows the location of the active inode maps. At startup we read in the checkpoint region, load the locations of the inode maps into memory, then load the inode maps into memory. From then on, it's all in-memory. The checkpoint region is periodically written to disk (checkpointed). Once we have the inode maps, read requests behave much like the traditional Unix FS: look up the inode, perform access control, get the data blocks.
In summary, read requests don't change much and we can leverage file cache to improve performance. Write requests, however, show dramatic improvements, especially for frequent, small write requests, since we always write sequentially in large chunks.
But the story doesn't end quite yet. If we always append, never overwrite in-place, we will eventually run out of space unless we can reclaim free space. Reclaiming free space, that sounds like memory garbage collection in programming languages; and that's exactly what the LSF does, garbage collect.
Imagine that segments 5 and 6 have both live and dead blocks (files that have been deleted). The segment cleaner (garbage collector) can compact segments 5 and 6, copying only the live blocks into an available free segment. Each segment has a segment summary block (not shown) with information about itself to help in this process (which blocks are dead, etc.). Then it's just a matter of moving the links in the segment linked list to restore the order. I'm of course hand waving here as things are more involved. Like memory garbage collection, it's the details and optimizations that determine whether the system is performant. Issues like garbage collecting long-lived objects (data), when to run the collector, etc. emerge.
There you have it, the Log-structured File system. Next time, the Log-structured Merge Tree.
Posted at 02:11 AM | Permalink | Comments (0) | TrackBack (0)
The other day I was chatting with a colleague about Memcached. Eviction policy came up, and I casually mentioned that Memcache isn't strictly LRU. But a quick Bing search said Memcache is LRU, like this Wikipedia entry. Hmm, I was 99.9% sure Memcache is not LRU, something to do with how it manages memory, but maybe I was wrong all these years. After reading through some Danga mailing lists and documentation, the answer is, Memcached is LRU per slab class, but not globally LRU.
So what exactly does this mean? Imagine starting Memcached with these command-line arguments to override the defaults,
memcached -vv -m 1 -I 1k -f 3 -M
-vv : verbose output
-m 1 : maximum 1 megabyte cache
-I 1k : 1 kilobyte (1024 byte) page size
-f 3 : chunk size growth factor of 3
-M : return error on memory exhausted (rather than removing items), just to see things more explicitly
The output will be something like this,
slab class 1: chunk size 96 perslab 10
slab class 2: chunk size 288 perslab 3
slab class 3: chunk size 1024 perslab 1
<26 server listening (auto-negotiate)
<27 server listening (auto-negotiate)
<28 send buffer was 9216, now 5592405
<29 send buffer was 9216, now 5592405
<28 server listening (udp)
<28 server listening (udp)
<28 server listening (udp)
<28 server listening (udp)
<29 server listening (udp)
<29 server listening (udp)
<29 server listening (udp)
<29 server listening (udp)
The visual representation below may be helpful,
We have a cache with a max memory limit of 1 megabyte. This 1 megabyte of memory is divided into 1 kilobyte (1024 byte) pages. These 1024 byte pages are assigned, as needed, on a first-come, first-served basis, to the three slab classes. Page assignment to a slab class is permanent. The three slab classes are determined by the page size and the chunk size growth factor. If you change the page size and/or chunk size growth factor you'll get a different slab class configuration.
With our settings, slab 1 has a chunk size of 96 bytes (the smallest, initial chunk size). There are 10 chunks per page (11 chunks would be more than the 1024 byte page size) in slab 1. Slab 2 has a chunk size of 288 (96 x 3 growth factor). There are only 3 chunks per page in slab 2. Finally, slab 3 has a chunk size equal to the page size and obviously only 1 chunk per page.
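You can roughly reproduce those slab classes with a bit of arithmetic (this mirrors the numbers above, not memcached's actual slabs.c, which also accounts for item overhead and alignment):

page_size     = 1024   # -I 1k
growth_factor = 3      # -f 3
chunk         = 96     # smallest chunk size, taken from the output above
klass         = 1

while chunk <= page_size / growth_factor
  puts "slab class #{klass}: chunk size #{chunk} perslab #{page_size / chunk}"
  chunk *= growth_factor
  klass += 1
end
puts "slab class #{klass}: chunk size #{page_size} perslab 1"   # the largest class is a whole page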
Each chunk can hold an item up to its size. By item, I mean the key, the value and some overhead. The item's size determines which slab it is stored in. Small items, say 40 or 60 bytes, are put in slab 1. Leftover storage is wasted. That is, putting a 60 byte item in a 96 byte chunk results in 36 bytes of wasted space. These 36 bytes are not available for other use. Slab 2 holds items between 96 and 288 bytes. Slab 3 holds items between 288 and 1024 bytes. With our arguments, we cannot store items larger than 1024 bytes.
So that's an overview of how Memcached is laid out. Pages, chunks, slabs. Now imagine we store a bunch of small items and a bunch of large items, but no medium items, until all pages from the page pool are used.
Slab 1 might have 800 pages and slab 3 might have 200 pages (those numbers are not exact, won't add up). Everything is good, until we try to store a medium size item.
Memcached will give us an out of memory error (or, without the -M option, simply won't store the item) because slab class 2 has no available pages to use. Pages from slab class 1 and slab class 3 can't be re-assigned to slab 2. You can think of Memcached as having separate internal caches defined by the slab classes.
Finally, this brings us back to the LRU question. Since each slab class is essentially a separate cache, there is no global LRU, only per-slab-class LRU. The least recently used item won't get evicted for a new item if the items are not in the same slab class. Depending on your storage/access patterns, you could end up with one slab class constantly evicting recently used items, while another slab class has a bunch of old items that just sit around.
Which brings me to another point: items/chunks are not actively reclaimed/expired. Memcached does not have a background thread that explicitly expires items, reclaiming used chunks for new items. When a slab needs a chunk and there are no more pages, Memcache will look at the queue tail for items to evict. Memcache will make a best effort to evict expired items (items you've explicitly set to expire after some time). In scenario 1, item 2, an expired item, is evicted. However, in scenario 2, item 1, which has not yet expired, will be evicted, even though item 4 would seem like the better candidate. But since item 4 is not near the tail, Memcached stops looking and just evicts item 1.
Another example of leaky abstraction? The Memcached API is very simple, straightforward. But at a certain point, you need implementation knowledge to make sure things are behaving as expected.
Posted at 04:42 AM | Permalink | Comments (1) | TrackBack (0)
Amazon.com says I bought Programming Ruby (2nd edition) on November 4, 2004. Seven short years of Java later I'm finally putting it to use. One immediate question is, how does garbage collection work in Ruby? Is it a tracing or reference counting collector? Is it generational, and if so, how many generations? These questions eventually led me to an interesting paper by David Bacon, Perry Cheng and VT Rajan on A Unified Theory of Garbage Collection.
Universally, garbage collectors are viewed as either tracing or reference counting. The Sun HotSpot VM, for example, is a tracing collector. It traverses the object graph from the roots, finding reachable live objects and reclaiming unreachable dead objects. PHP, on the other hand, uses reference counting to decide which objects can be safely garbage collected. This table summarizes the typical characterization of each collector type,
|                   | Tracing | Reference Counting |
|-------------------|---------|--------------------|
| Collection Style  | Batch   | Incremental        |
| Cost per Mutation | None    | High               |
| Throughput        | High    | Low                |
| Pause Times       | Long    | Short              |
| Real Time         | No      | Yes                |
| Collects Cycles   | Yes     | No                 |
The first observation is, once optimizations are added to the basic tracing and reference counting collectors, they begin to converge. For example, with deferred reference counting there is a periodic scanning of stack references to reduce per-mutation cost. A generational tracing collector, on the other hand, imposes a per-mutation overhead but gains shorter pause times. So even though collectors are typically described as one or the other, we should really view them as hybrids.
The second interesting observation is, tracing algorithms are the duals of the reference counting algorithms. Both types of algorithms share the same structure.
|                                | Tracing            | Reference Counting      |
|--------------------------------|--------------------|-------------------------|
| Starting Point                 | Roots              | Anti-Roots              |
| Graph Traversal                | Forward from roots | Forward from anti-roots |
| Objects Traversed              | Live               | Dead                    |
| Initial Reference Count        | Low (0)            | High                    |
| Reference Count Reconstruction | Addition           | Subtraction             |
| Extra Iteration                | Sweep Phase        | Trial Deletion          |
Tracing collectors start at the roots, traverse forward, and find live objects. Initially all reference counts are zero (an underestimate) but live objects have their reference count incremented to their true value (keeping a mark bit is just an optimization of the actual count). Reference counting collectors start from non-root objects, traversing forward to find dead objects. Reference counts are initially high (an overestimate that includes references from dead objects) but are decremented until the true reference count is obtained. Ultimately, the difference is just the initial value (an underestimate or overestimate) and whether we increment or decrement to obtain the true reference count.
So that was interesting, but let's return to my original question about Ruby garbage collection and add a few other languages in for fun. Here's a quick summary. Don't quote me on it; corrections appreciated.
| Language | Garbage Collection (for reference/popular implementation - others may be available) |
|----------|--------------------------------------------------------------------------------------|
| Erlang   | tracing - per process (very small processes), compacting then generational optimization |
| Haskell  | tracing - generational (3), 512KB nursery |
| Java     | tracing - generational (2), serial, concurrent, concurrent + incremental |
| Perl     | reference counting - does not handle cycles |
| PHP      | reference counting - 5.3 handles cycles w/ "on the fly" cycle detector, prior versions did not |
| Python   | reference counting - handles cycles w/ tracing cycle detector |
| Ruby     | tracing - non-generational |
| Scala    | see Java |
Posted at 12:17 PM | Permalink | Comments (0) | TrackBack (0)
Spent the other day reading up on ZooKeeper a little. ZooKeeper is a distributed, open-source coordination service for distributed applications. Out of the box ZooKeeper can be used for name service, configuration and group membership. Building on ZooKeeper's primitives, you can create barriers, locks, queues, etc.
Sounds interesting enough, but what got me really interested was ZooKeeper's atomic broadcast, the guts of ZooKeeper. Atomic broadcast is very closely related to consensus (apparently equivalent in certain asynchronous systems). Both are fundamental problems in distributed systems.
So what exactly is atomic broadcast? To answer this, let's work our way up from Reliable Broadcast.
Reliable Broadcast
Reliable Broadcast is the weakest type of fault-tolerant broadcast. Informally, if a correct process broadcasts a message then all correct processes eventually receive that message. Reliable Broadcast doesn't impose any message delivery ordering.
FIFO Broadcast
FIFO Broadcast is Reliable Broadcast that satisfies FIFO Order: if a process broadcasts message m1 before message m2, then no correct process delivers m2 before it delivers m1. For example, if message m1 is a deposit of $100 into your banking account and message m2 is a subsequent withdrawal of $75, you most definitely want FIFO Broadcast, otherwise your bank will charge you an overdraft fee (on top of their ridiculous $5 ATM fee).
Causal Broadcast
Sometimes FIFO Order is not enough because FIFO ordering is limited to the context of a single process. You need Causal Order to guarantee that if the broadcast of message m1 "happens before" or "causally precedes" the broadcast of message m2, then no correct process delivers m2 before it delivers m1.
For example, imagine three processes a, b and c. Process a broadcasts m1a, "Banks to charge $5 ATM fee." Process b delivers m1a then broadcasts message m1b, "That's outrageous!" Without Causal Order, process c could deliver message m1b before m1a,
m1b : "That's outrageous!"
m1a : "Banks to charge $5 ATM fee."
which doesn't make sense (what is outrageous?). Causal Order requires delivery of m1a before m1b.
Atomic Broadcast
Causal Order imposes a partial ordering in the system, so messages without causal relationships are logically concurrent and do not have any delivery order guarantees. This can be problematic in some cases. For example, imagine two replicated databases DB1 and DB2. Process a broadcasts message m1a, "deposit $100." Process b broadcasts message m1b, "charge 10% fee." Since there is no causal relationship between the messages, this situation may arise,
DB1 : $0 + $100 - ($100 * 10%) = $90
DB2 : $0 - ($0 * 10%) + $100 = $100
which is obviously not desirable.
Atomic Broadcast imposes a Total Order on the system so that all messages are delivered in the same order, whatever that order may be. So in this example, it's either $90 or $100 but never $90 in one database and $100 in the other.
We can further classify Atomic Broadcast by the actual ordering it imposes. FIFO Atomic Broadcast is Reliable Broadcast that satisfies both FIFO Order and Total Order. Causal Atomic Broadcast is Reliable Broadcast that satisfies both Causal Order and Total Order. Causal Atomic Broadcast is the strongest guarantee we've examined.
Summary
This diagram nicely summarizes Reliable Broadcast and the orderings above (taken from "A Modular Approach to Fault-Tolerant Broadcasts and Related Problems," which the above is basically a summary of).
Posted at 04:14 PM | Permalink | Comments (0) | TrackBack (0)
Time flies when you're having fun (even when you're not). It's been more than a year since my last work related blog. Writing about random computer science and programming topics is like going to the gym; once you get off the routine it's very easy to just sit on your literal and figurative ass.
For my first topic back I want to relay some sequential random sampling algorithms I came across this week. It started out when we needed some random samples from the database. Performance issues aside, MySQL lets you do
select column from table order by rand() limit n
to select n random rows from a table. The randomness of the rand() function aside, something troubled me about this approach. Won't your results be skewed toward the lower rows when you get duplicate random values?
So I was searching around for some random sampling algorithms and ran across a few, from the works of Jeffrey Vitter, that I found rather interesting.
Algorithm S
Sequentially go through N elements to select n random samples. For each element generate an independent random variable U between 0 and 1 and test that NU > n. If true, skip, otherwise select for the sample and decrement n. In either case decrement N.
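A sketch of Algorithm S in Ruby (the element list is fully known, so N is available up front):

def algorithm_s(elements, n)
  sample    = []
  remaining = n               # samples still needed
  left      = elements.size   # elements not yet examined (N)
  elements.each do |e|
    break if remaining.zero?
    if left * rand < remaining   # select with probability remaining/left
      sample << e
      remaining -= 1
    end
    left -= 1
  end
  sample
end

algorithm_s((1..1000).to_a, 10)   # 10 elements, each equally likely to be picked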
Algorithms A-D
These algorithms optimize Algorithm S by skipping elements. Define a skip function S(n,N) that will tell us how many elements to skip.
Algorithm S and its optimizations are for sequential sampling where N is known. But sometimes N is not known, or it's prohibitive to determine N. Then we turn to reservoir sampling.
Algorithm R
There is a reservoir R of n sample elements. The invariant is that the reservoir is a random sample of all the elements we've processed thus far. To maintain the invariant, as we sequentially process each element we generate a random integer between 1 and the index of the current element. If it falls within the first n positions we put the current element into the reservoir at that slot, replacing what was there; otherwise we ignore it.
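A sketch in Ruby (the stream only needs to be walked once, and N never appears):

def algorithm_r(stream, n)
  reservoir = []
  stream.each_with_index do |e, i|
    if i < n
      reservoir << e               # the first n elements fill the reservoir
    else
      j = rand(i + 1)              # random index in 0..i
      reservoir[j] = e if j < n    # replace a slot with probability n/(i + 1)
    end
  end
  reservoir
end

algorithm_r(1..1_000_000, 10)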
Algorithms X, Y and Z
Optimizations of Algorithm R so we can skip over elements, much like Algorithms A-D.
Posted at 12:11 PM | Permalink | Comments (0) | TrackBack (0)