More precisely, this time I'm going to look at what sample density is needed to resolve "Rayleigh features", by which I mean Airy disks that are spaced at the Rayleigh criterion.
My motivation for writing this up is that I've recently seen a couple of instances where people argued that two pixels per Rayleigh separation distance is enough.
I gather this was based on combining the ideas that (a) Rayleigh separation is the smallest thing we care about, and (b) the Nyquist sampling theorem says that two samples per cycle is enough.
Unfortunately, that line of reasoning is not correct. Here I'll attempt to illustrate and explain why, and at the end talk about some aspects of Nyquist that you may not have been aware of.
To begin, let's define a test problem.
Here is a line cut through a simulated optical image that consists of four Airy disks organized in two pairs. In each pair, the separation between the two Airy peaks is set to the Rayleigh criterion. But there's no particular relationship between one pair and the other pair, and in fact I'm going to adjust the positions of the pairs so as to create vivid examples.

Graph title: "This is the optical image to be captured, two pairs of Airy disks at Rayleigh separation."
What I'm now going to do is "capture" that image by sampling to discrete pixels. I'll do that under three conditions, and in each condition I'll tweak the positions of the two pairs so as to show the best and worst cases.
First, I'll sample at two pixels per Rayleigh separation. This works great IF the pixels happen to line up properly with the peaks and valleys, but it totally fails to resolve the separation if the pixels do not happen to line up well.
In my tests, the captured pixel values showed no valley at all for somewhat more than 1/4 of random pixel alignments. Losing features is not a rare occurrence.


Graph titles: "At two samples per Rayleigh separation, many pairs are not resolved at all. Here are a couple more cases where no valley is seen."
It turns out that there are a couple of things wrong with casually applying the Nyquist sampling theorem this way.
One of them, very important, is that even though we might only be interested in Rayleigh-sized features, the optical image actually contains higher resolution information that can mess up the sampling. The actual cutoff frequency for a diffraction-limited image corresponds to the Abbe separation, about a factor of 1.22 smaller than the Rayleigh separation.
So, to properly apply Nyquist, we need a finer sample spacing: two samples per Abbe separation, which works out to 2*1.22 = 2.44 samples per Rayleigh separation.
If we do that, then this is what happens:

Graph title: "At Nyquist = two samples per cycle at cutoff, all pairs are resolved but contrast varies widely."
Well, that's certainly a lot better -- we don't actually miss the separation even in the worst case. On the other hand, we do find a pretty serious variation in contrast, from about 16.1% in the best case down to about 5.7% in the worst case. (This is calculating contrast=(peak-valley)/peak in both cases.)
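The same machinery can estimate best-case and worst-case contrast at any sampling density by sweeping the pixel phase. This reuses sampled_contrast from the earlier sketch, so the same modeling assumptions apply and the numbers will not exactly match the graphs:

```python
def contrast_range(samples_per_rayleigh, rayleigh=1.0, nphases=400):
    """Worst and best captured contrast over a sweep of pixel phases."""
    pitch = rayleigh / samples_per_rayleigh
    phases = np.linspace(0.0, pitch, nphases, endpoint=False)
    contrasts = [sampled_contrast(pitch, p) for p in phases]
    return min(contrasts), max(contrasts)

# Nyquist relative to the Abbe cutoff: 2 * 1.22 = 2.44 samples per Rayleigh
worst, best = contrast_range(2.44)
print(f"contrast from {worst:.1%} (worst case) to {best:.1%} (best case)")
```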
No surprise, we can do better by increasing the sampling density.
Here is the same exercise, done with 3 samples per cycle at cutoff, rather than Nyquist's 2 samples per cycle:

Graph title: "At 1.5*Nyquist, three samples per cycle at cutoff, contrast is much more uniform."
With this increase in sampling density, we're now looking at a contrast variation only from 19.1% to 15.6%.
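In terms of the sketch above, the denser capture is just a different argument to the same function:

```python
# Three samples per cycle at the Abbe cutoff = 3 * 1.22 = 3.66 samples per Rayleigh
worst, best = contrast_range(3.0 * 1.22)
print(f"contrast from {worst:.1%} (worst case) to {best:.1%} (best case)")
```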
From here, a simple intuition is correct: higher sample density gives a more accurate capture, lower sample density gives a less accurate capture, and all of that is on a continuum with diminishing returns. Many people would agree that 3 samples per cycle is a lot better than 2, while 4 samples per cycle is better than 3 but maybe not worth the cost, and 10 samples per cycle would be crazy.
While I was writing this up, I realized that there's a very important aspect of the Nyquist sampling theorem that I'm hoping is subtle and frequently overlooked. I hope that's the case because it took me a long time to get a good grip on it myself.
Here is an illustration of it:

Graph title: "Given these values sampled at Nyquist density, which show two very different patterns, we can reconstruct the original signal, which has two copies of the same pattern, slightly shifted."
Now, in words...
With some additional fine print, what the sampling theorem guarantees is that a bandwidth-limited signal can be perfectly reconstructed from a set of discrete samples taken at 2 samples per cycle of the bandwidth limit (that is, the highest frequency present in the signal).
The theorem does not say that the samples themselves will "look like" the original signal in any particular way.
To recover the original signal from Nyquist samples, you have to go through a Fourier-based reconstruction process that implicitly depends on prior knowledge of the bandwidth limit. In the above example, the reconstruction process would be figuring out that the original signal must have been overshooting and undershooting between the sample points, because that's the only way it could exactly meet the sample points given the bandwidth restriction.
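For the mathematically inclined, that reconstruction process has a concrete form: Whittaker-Shannon (sinc) interpolation. Here is a sketch, assuming ideal point samples at a spacing fine enough for the true bandwidth limit, and ignoring the edge effects that come from having only finitely many samples:

```python
import numpy as np

def sinc_reconstruct(samples, pitch, x0, x):
    """Whittaker-Shannon reconstruction: the unique signal with no frequency
    above 1/(2*pitch) that passes exactly through point samples taken at
    positions x0 + k*pitch, evaluated at the points in x."""
    k = np.arange(len(samples))
    positions = x0 + k * pitch
    # np.sinc(t) = sin(pi*t)/(pi*t), so each term is a shifted, scaled sinc.
    return np.array([np.sum(samples * np.sinc((xi - positions) / pitch))
                     for xi in x])
```

Note that the sum runs over all of the samples, so every sample influences the reconstructed value everywhere. That is how the reconstruction "knows" to overshoot and undershoot between sample points -- quite different from connect-the-dots interpolation.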
If you do any other sort of reconstruction, for example the casual filling-in-the-gaps interpolation that we humans do without even thinking, then the signal you think you see may be significantly different from the one that was actually measured.
I hope this helps somebody else. I'm sure I've learned a lot in trying to figure out how to explain it.
--Rik