www.photomacrography.net :: View topic - What is the ideal computer for running Zerene?
An online community dedicated to the practices of photomacrography, close-up and macro photography, and photomicrography.
What is the ideal computer for running Zerene?
nathanm



Joined: 02 Jun 2016
Posts: 222
Location: Bellevue, WA

PostPosted: Mon Mar 20, 2017 1:07 pm    Post subject: What is the ideal computer for running Zerene? Reply with quote

This question is directed mainly at Rik, but I think other people may be interested in the answer, or have some practical experience.

I need to process a lot of stacks, and that is a computationally intensive task. So many that I want to buy a machine (or a couple of machines) to run day and night stacking images.

In order to handle big images (mine are 50 to 100 megapixels each), you need a lot of memory. But how much? Does the amount of memory used depend on the size of a single image, or on the number of images in a stack? My guess is that it is a multiple of a single image - i.e. you don't need to have all of the images in the stack in memory at the same time, but I don't know.

One approach is to buy a fast machine (i.e. multiple processors, high speed CPU). But I don't know how well Zerene uses multi-processors. If it does, how many processor cores can it use efficiently?

It is usually the case that the absolute fastest computer at any point in time commands a price premium relative to slightly lower performance. So another approach is to buy somewhat cheaper machines (say, with fewer cores), but several of them. That does not help finish a single stack faster, but it does help with overall throughput, and I have plenty of images to stack.

A final parameter is disk speed. Some programs do a lot of reading and writing to disk. Depending on the program, some will read everything into memory if they have enough of it. Others read and write as they go, so if you configure some of main memory as a disk cache you are better off. Or you can use a solid-state disk to do the stacking from. Is this a factor with Zerene?

So if I am buying a machine (or multiple) for Zerene, how much memory per machine (for 100 mpixel images)?

Am I better off with a multiprocessor machine with a lot of cores? Or multiple machines?

Will disk speed matter much? Is there a disk speed / main memory tradeoff (i.e. with more memory, disk speed doesn't matter)?

A final question is more of a legal issue - do I need multiple licenses to run Zerene on multiple computers doing stacking? I'm happy to buy them if needed - I suppose I could read the license agreement but it is simpler to ask.

The best option would be if there were a Zerene-as-a-service hosted on something like Amazon Web Services, so I could pay to have N stacks running when I need them. I assume that is not a practical thing to hope for in the short term.
_________________
nathanm
rjlittlefield
Site Admin


Joined: 01 Aug 2006
Posts: 18244
Location: Richland, Washington State, USA

PostPosted: Mon Mar 20, 2017 2:35 pm    Post subject: Reply with quote

Most of the issues that you've asked about are at least touched on in the discussion at http://www.photomacrography.net/forum/viewtopic.php?p=141699#141699 and continuing onto the next page.

Standard license terms allow execution on only one computer at a time.

Correct, Zerene-as-a-service is not going to happen any time soon. Even if I were eager to do that, not many potential users would have the bandwidth to make it practical.

--Rik
Chris S.
Site Admin


Joined: 05 Apr 2009
Posts: 2848
Location: Ohio, USA

PostPosted: Mon Mar 20, 2017 10:03 pm    Post subject: Reply with quote

Rik,

When specifying RAM allocation in Zerene Stacker preferences, does each session of Zerene Stacker take the allocated amount of RAM for its own? Or do concurrent sessions of Zerene Stacker share a single allocation?

In other words, given Nathan's 100-megapixel images and your recommendation that for his needs he should run multiple sessions of Zerene Stacker concurrently, is it preferable to allocate RAM appropriately for a single session (say, 20 gigabytes of RAM for 100-megapixel images), and assume that each Zerene Stacker session will claim a unique 20 gigabytes of RAM? Or do all concurrent Zerene Stacker sessions share a common pool of RAM (implying that, say, 80 gigabytes of RAM should be allocated if four Zerene Stacker sessions are to be run at once)?

If the former, should we configure Zerene Stacker to use 20 gigabytes of RAM for stacks of 100-megapixel images? Or if the latter, should we configure Zerene Stacker to use 80 gigabytes of RAM if, in this example, four sessions of Zerene Stacker will be run concurrently?

In either case, can Zerene Stacker gain efficiency from using, say, 80 gigabytes of RAM? This is a lot of RAM! Most users see little gain beyond 32 GB. That said, RAM is cheap, and upgrading is not all that costly.

nathanm wrote:
It usually the case the absolute fastest computer at any point in time commands a price premium relative to slightly lower performance. So another approach is to buy somewhat cheaper machines (say, with fewer cores), but several of them. That does not help with stacking a single image faster, but it does help with overall throughput, and I have plenty of images to stack.

(Snip)

Am I better off with a multiprocessor machine with a lot of cores? Or multiple machines?

Nathan, I've done quite a few PC builds--as, I suspect, you or your team have also done. With each build, the necessary review of current technology--costs, benefits, efficiencies, failure rates, etc.--is the biggest part of the work. My current sense--which falls in line with prior findings--is that it's better to stay away from the bleeding edge's very expensive multiprocessor workstations. Here, my sense echoes yours perfectly. And as usual, the current price/performance curve for image-stacking hardware tracks the inflection point of the curve for gaming performance (which drives the PC market). For image stacking and Photoshop, I typically build for myself and clients what amounts to a moderately high-end gaming computer (with a few changes). To emphasize: moderately high-end--definitely not bleeding edge.

In your situation, it strikes me as preferable for you to have several PCs at this level, rather than fewer at a much higher price, well into the region of diminishing returns.

I'll contact you privately with a parts list I developed about a month ago to build a pair of stacking/Photoshop computers for myself and a friend. Quite a bit of homework went into choosing these components--which I consider to be solid.

--Chris S.
mjkzz



Joined: 01 Jul 2015
Posts: 559
Location: California/Shenzhen

PostPosted: Mon Mar 20, 2017 11:07 pm    Post subject: Reply with quote

One feature I use often and really like is batch processing, where you can start a bunch of stacking jobs and leave them running overnight. It is a lifesaver for me when I do stack-and-stitch.

Also, my experience is that Zerene is not as memory hungry as I thought, at least with the default configuration, nor is it CPU hungry. You can check this using Task Manager while running a stack. I did not find that running multiple sessions improves anything, though; I think that is due to CPU utilization. So batch is the way to go with a single workstation.

One suggestion to Rik is to utilize the GPU; if the algorithm is more or less sequential, multiple GPUs can be installed to scale it.
_________________
https://www.facebook.com/groups/mjkzzfs/
nathanm



Joined: 02 Jun 2016
Posts: 222
Location: Bellevue, WA

PostPosted: Tue Mar 21, 2017 3:55 pm    Post subject: Reply with quote

Thanks Rik for the tips.

mjkzz - I agree that using the GPU would be great. However, it has the problem that moving just small bits of code to the GPU doesn't help much, because you bottleneck on the non-GPU pieces. And it is a ton of work to move it all.

Realistically I need to make a lot of stacks before Rik could possibly be done with that, even if he wanted to do it, which is unclear.

In the short term I am using an existing server that I have, but I am changing the parameters for running Zerene (i.e. running many copies in parallel, and the other suggestions in the thread Rik pointed to).

With previous parameters, ZS was really slow for me. I just hadn't optimized it.

With updated parameters based on the tips in the other thread, but still running a single copy, it takes ~2 hours to do a stack. That includes a little less than an hour for raw conversion (and thus not ZS, but still important), and then a bit more than an hour in ZS (in PMax mode).

However, on the same hardware I am running that on, I ought to be able to get a lot more throughput. My CPU utilization is ~22%, so I should probably be able to get a 3X and maybe 4X improvement if I run enough copies of Zerene simultaneously.

If I can get 2 stacks /hour throughput and I run 24 x 7 I might be OK for the time being.
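The arithmetic behind these estimates can be sketched in a few lines (illustrative only; the ~22% utilization and ~2 hours per stack are the rough figures quoted above, and the 4X speedup is the optimistic end of the estimate, not a measurement):

```python
# Back-of-the-envelope throughput estimate for running several copies of
# Zerene Stacker in parallel. Numbers are the rough figures from this post,
# not measurements of any particular machine.

single_session_cpu = 0.22   # one session uses ~22% of the machine
hours_per_stack = 2.0       # ~2 h end to end (raw conversion + PMax)

# You can't use more than 100% of the CPU, so the speedup from adding
# sessions is bounded by 1 / 0.22 -- i.e. "3X and maybe 4X" after overheads.
max_speedup = 1 / single_session_cpu
print(round(max_speedup, 1))     # 4.5

# At an optimistic 4X, running 24 x 7:
stacks_per_hour = 4 / hours_per_stack
print(stacks_per_hour)           # 2.0 stacks/hour
print(stacks_per_hour * 24)      # 48.0 stacks/day
```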

That is on a really expensive server my company already had, which has tons of CPU and DRAM.

The main tradeoff here is price vs. throughput for memory and processor. In order to stack a 100-megapixel image, I need ~10 GB of DRAM. But one copy of ZS will not max out the CPU at 100%, so you need to run multiple copies so the parallel parts of one will overlap with the non-parallel parts of another. So you need enough DRAM to handle the number of copies you need to max out the CPU. The data from Rik's other thread suggests that you need at least 3 copies of ZS for a 6-core machine. So that is ~30 GB of DRAM.

For people with smaller images, you would need less DRAM.
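The sizing rule above can be captured in a small sketch (a rough model only -- the ~100 MB per megapixel per session and the one-session-per-two-cores ratio are this thread's rules of thumb, not Zerene specifications):

```python
# Rough DRAM sizing for running several Zerene Stacker sessions at once.
# Rules of thumb from this thread (not official Zerene numbers):
#   ~100 MB of RAM per megapixel per session, and roughly one session
#   per two physical cores to keep the CPU saturated.

def sessions_to_saturate(cores, cores_per_session=2):
    """Estimate how many concurrent sessions are needed to max out the CPU."""
    return max(1, cores // cores_per_session)

def dram_needed_gb(megapixels, cores, mb_per_megapixel=100):
    """Total DRAM (GB) for enough concurrent sessions to saturate `cores`."""
    per_session_gb = megapixels * mb_per_megapixel / 1024
    return sessions_to_saturate(cores) * per_session_gb

# The case above: 100-megapixel images on a 6-core machine.
print(sessions_to_saturate(6))        # 3 sessions
print(round(dram_needed_gb(100, 6)))  # 29 -- i.e. the ~30 GB cited above
```

Smaller images scale the answer down linearly, which matches the observation that people with smaller images need less DRAM.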

Going forward, if I need more than my existing monster machine can handle, it looks like the most cost-effective throughput is going to be a bunch of relatively inexpensive PCs. Rack-mount PCs with 32-48 GB and a previous-generation Xeon with 6-8 cores are pretty cheap - like $600 if you are willing to buy refurbished units. Getting 5-10 of these seems very likely to get more throughput than buying monster machines.

I will probably buy one machine like that and test the theory.
_________________
nathanm
curt0909



Joined: 26 Oct 2011
Posts: 607
Location: Pittsburgh, PA

PostPosted: Tue Mar 21, 2017 4:37 pm    Post subject: Reply with quote

Determining the best CPU would depend on how Zerene scales with multiple cores. I have a 14-core Xeon E5 2695 v3 with 64 GB RAM and a fast M.2 solid-state drive. This system is slower than my six-year-old overclocked quad-core i5 in 90% of Photoshop tasks. Photoshop scales well to 4 cores, but beyond that you'll see diminishing returns. As long as you have a minimum of 4 cores, single-core speed > extra cores for most tasks.

However, if you'd be running multiple instances of Zerene, it won't matter much if it suffers from the same multi-core bottleneck as PS. Unfortunately, you can't open more than one instance of PS for raw conversions, so that bottleneck is still there.


https://www.pugetsystems.com/labs/articles/Adobe-Photoshop-CC-Multi-Core-Performance-625/
rjlittlefield
Site Admin


Joined: 01 Aug 2006
Posts: 18244
Location: Richland, Washington State, USA

PostPosted: Tue Mar 21, 2017 4:40 pm    Post subject: Reply with quote

Chris S. wrote:
When specifying RAM allocation in Zerene Stacker preferences, does each session of Zerene Stacker take the allocated amount of Ram for its own? Or do concurrent sessions of Zerene Stacker share a single allocation?

Each session gets its own allocation of the specified size.

Concurrent sessions share nothing except possibly a bit of read-only code space, and then only if the operating system is smart enough to do that on its own.

Quote:
can Zerene Stacker gain efficiency from using, say, 80 gigabytes of RAM?

Yes, to the extent that a) multiple simultaneous sessions allow multiple cores to be used more efficiently, and b) large images take lots of memory.

The normal recommendation of 100-200 megabytes per megapixel is to provide maximum performance for processing a single stack, including retouching and generation of screen preview images, with overlap of I/O and computation.

If you just want to stack, then you can get by with somewhat less than 100 megabytes per megapixel, especially if you turn off generation of screen preview images and turn off overlap of I/O and computation.

Turning off screen preview images is always a good idea if you aren't going to use them, for example if you never retouch, or retouch so seldom that the preview images cost more than they're worth.

Turning off overlap of I/O and computation is normally a bad idea, which is why it defaults to turned on if you're using a Pro license. However, if you're running multiple simultaneous sessions, then there's a potential to overlap I/O on one stack for computation on another. In that case the reduction in RAM requirement, while not large, might turn out to be a good tradeoff by allowing you to bump up the number of simultaneous sessions in limited memory.

mjkzz wrote:
nor it is CPU hungry.

I would be interested to hear some numbers, particularly if they are much different from what is discussed in the other thread that I pointed to. With my setup, I usually see Zerene Stacker consuming about 75% CPU when running a single PMax.

--Rik
mjkzz



Joined: 01 Jul 2015
Posts: 559
Location: California/Shenzhen

PostPosted: Tue Mar 21, 2017 9:06 pm    Post subject: Reply with quote

My CPU is an i7 clocked at 2.8 GHz, with 16 GB of RAM.

Yes, on my computer it runs at about 70%-75% with nothing else running, so I do not think it is CPU hungry (i.e., "nor is it CPU hungry").
_________________
https://www.facebook.com/groups/mjkzzfs/
rjlittlefield
Site Admin


Joined: 01 Aug 2006
Posts: 18244
Location: Richland, Washington State, USA

PostPosted: Tue Mar 21, 2017 9:47 pm    Post subject: Reply with quote

mjkzz wrote:
Yes, on my computer it runs at about 70% - 75% with nothing else is running

That sounds normal.

Internally the program will mostly be alternating between periods with 1 thread active--hence utilization of 100%/N, for whatever N processors are shown by Task Manager--and periods at 100%, when N threads are active in a parallel section with all processors utilized.

That pattern can be difficult to see, because:
a) the program typically bounces back and forth between those states several times per second, while the Task Manager display is updated only once per second, and
b) even when only 1 thread is active, Windows is liable to assign that thread to a different physical processor from time to time.

The result is that Task Manager CPU Usage History tends to show a bunch of graphs that are all more or less equally busy and apparently not saturated, with none of them peaking over 80% or so. Nonetheless the program will be bound by CPU performance under those conditions.
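This alternation is essentially Amdahl's-law behavior, and the averaged Task Manager reading follows from it. A small illustrative model (the fraction of wall-clock time spent in parallel sections and the core count are made-up parameters for the example, not Zerene internals):

```python
def average_cpu_utilization(p_parallel, n_cores):
    """Average whole-machine CPU utilization when a program alternates
    between fully parallel phases (all cores busy, fraction p_parallel of
    wall-clock time) and serial phases (one core busy, the remainder)."""
    serial = 1 - p_parallel
    return p_parallel * 1.0 + serial * (1.0 / n_cores)

# With 4 cores and ~67% of wall-clock time in parallel sections, the
# machine averages about 75% utilization -- busy-looking but not saturated,
# which is the Task Manager appearance described above, even though the
# program is still CPU-bound.
print(round(average_cpu_utilization(0.67, 4), 2))  # 0.75
```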

I'm not completely sure what you mean by "CPU hungry", but I suspect that you're being misled by the appearance of the Task Manager display.

--Rik
rjlittlefield
Site Admin


Joined: 01 Aug 2006
Posts: 18244
Location: Richland, Washington State, USA

PostPosted: Tue Mar 21, 2017 10:36 pm    Post subject: Reply with quote

Regarding GPU, let me share here some email that I wrote fairly recently:

Quote:
Now, about your question:
> is there any way to harness graphics card GPU(s) to accelerate the stacking process?

This is something that I look at from time to time. In naive theory, the potential gain is enormous. In more accurate theory, and in practice, the gain would be far less, with significant uncertainty, while the cost to even find out how much gain is guaranteed to be high.

The problem is that while GPUs are blazingly fast at doing arithmetic on local data, they have limited memory capacity and limited bandwidth.

When I assembled my development system several years ago, I had a pretty powerful graphics card put into it just to cover the possibility that I would want to try exploiting it. What I have is an nVidia GeForce GTX 660 Ti, which has 1344 CUDA cores, giving in theory a whopping 2459.52 GFLOPS. No, that's not a typo -- almost 2.5 teraflops, humming along at 2 floating point operations (probably one multiply-accumulate) per core per cycle, at 915 MHz.

In contrast, the cpu on my system benchmarks only around 28 GFLops even under ideal conditions.

So that's almost 87:1 advantage in favor of the GPU. It's pretty hard to ignore a number like that.

But there's a catch. All those numbers, both GPU and CPU, relate only to the speed of crunching numbers that have already been loaded from memory into registers. As soon as you start doing memory operations, the speed goes way down. For example, the aggregate bandwidth of my GPU card is quoted as 144.192 GB/sec (by http://www.gpureview.com/geforce-gtx-660-ti-card-670.html). There are 4 bytes per single-precision floating point value, so if we run a[i]=b[i]+c[i], where all the vectors have to be fetched and stored, then the overall rate could not be any faster than (144.192 gigabytes per second) / (4 bytes per value) / (3 values per addition) = 12 GFLOPS. Even that number assumes that the vectors are already resident on the GPU card. If they're not, then you have to pay more time to transfer them to and from main memory, so you can pretty much double the data transfer time and halve the FLOPS.

So, for carefully coded large dense matrix-multiply the GPU screams, while for vector addition it's more like "Why did I bother?"

Unfortunately, most of the operations that are done for focus stacking look like they're more on the vector-addition end of the spectrum. So while I can't predict exactly how fast they might run, I'm pretty confident that it's no more than a few times faster than the CPU, not 100 times faster.
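The bandwidth argument above is a standard roofline-style bound. A short sketch using the GTX 660 Ti numbers quoted in this email (peak figures as given; the 3-values-per-FLOP intensity is for a[i]=b[i]+c[i]):

```python
# Roofline-style bound for a[i] = b[i] + c[i] on a GTX 660 Ti,
# using the peak numbers quoted above.

peak_flops_g  = 2459.52   # GFLOPS: 1344 CUDA cores * 2 ops/cycle * 0.915 GHz
bandwidth_gbs = 144.192   # GB/s aggregate memory bandwidth
bytes_per_val = 4         # single-precision float
vals_per_flop = 3         # two loads + one store per addition

# Memory-bound ceiling: bytes/s divided by bytes moved per useful FLOP.
memory_bound_gflops = bandwidth_gbs / (bytes_per_val * vals_per_flop)
print(round(memory_bound_gflops, 1))   # 12.0 GFLOPS

# The compute peak is ~200x higher than what vector addition can reach,
# which is why the "87:1 over the CPU" advantage mostly evaporates here.
print(round(peak_flops_g / memory_bound_gflops))  # 205
```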

As a sort of experimental cross-check on this somewhat depressing assessment, I look periodically at the "cloth" program distributed by jocl.org at http://www.jocl.org/cloth/cloth.html . This demonstration program performs a simulation of a cloth-like material being dropped over a sphere, producing on-screen images like this:

(image: simulated cloth draped over a sphere)

Because it's a demonstration program, it comes with a small GUI that allows one to select how the simulation is to be done. The main options that I find interesting are "Java", which I think means executing traditional Java code with ordinary loops, and "JOCL/GPU", which I think means running the simulation entirely on the GPU. Certainly that's consistent with the device utilizations shown by Windows Task Manager and my GPU monitor (Asus GPU Tweak).

Now, the interesting thing is that on my development system the program runs no more than 2X faster in JOCL/GPU mode than it does in regular Java mode, even for the largest dataset. I find this pretty sobering.

Adding insult to ineffectiveness, when I try to run the cloth demo on my Windows 10 laptop, the batch script reports error 126 on OpenCL.dll, followed by "no OpenCL implementation available". What is necessary to fix this? I have no idea. The web suggests that it's some sort of driver problem.

Finally -- or perhaps I should say "first" -- to make it work at all I would have to completely restructure and recode every function that would run on the GPU. I suspect it's the restructuring that would be really painful. Right now I just assume that memory is large enough to hold the entire image in memory at one time, and iterate over all the pixels. That assumption doesn't work with large images and relatively small graphics cards, so I would have to restructure the algorithms to operate on "tiles" that would get swapped in and out of graphics card memory.

Every time I think about this, I reach the same conclusion: using the graphics card is not worth the trouble for current functions. The one situation where it could make sense is if, in the future, I come up with some clever algorithm that generates great results but has a very high cost in terms of operations per pixel. In that case the graphics card could make it possible to generate a better result in the same time, rather than the same result in less time.

Thanks for prompting me to look at this again.

Best regards,
--Rik
mjkzz



Joined: 01 Jul 2015
Posts: 559
Location: California/Shenzhen

PostPosted: Wed Mar 22, 2017 3:05 am    Post subject: Reply with quote

Quote:
I'm not completely sure what you mean by "CPU hungry", but I suspect that you're being misled by the appearance of the Task Manager display.


OK, I think I might not be expressing myself correctly in English. What I meant to say is that Zerene is neither memory hungry nor CPU hungry. In other words, Zerene does NOT need a lot of RAM to run, and it does NOT need (is not hungry for) a lot of CPU cycles when running.
_________________
https://www.facebook.com/groups/mjkzzfs/
mjkzz



Joined: 01 Jul 2015
Posts: 559
Location: California/Shenzhen

PostPosted: Wed Mar 22, 2017 3:21 am    Post subject: Reply with quote

Quote:
Regarding GPU, let me share here some email that I wrote fairly recently:


Thanks for the detailed info.

Some day I will get a modern GPU (currently I have a 460) and give it a try. I can see that a lot of the code in my software could be made to execute in parallel (actually, right now it is farmed out to different threads, hence my code is memory hungry and CPU hungry--always at 100%).
_________________
https://www.facebook.com/groups/mjkzzfs/
curt0909



Joined: 26 Oct 2011
Posts: 607
Location: Pittsburgh, PA

PostPosted: Wed Mar 22, 2017 11:00 am    Post subject: Reply with quote

Just for reference, here is the CPU monitor from my PC during a single instance of Zerene: a 14-core Xeon E5 2695 v3 with 64 GB RAM.

(image: Task Manager CPU usage graphs)
MacroLab3D



Joined: 31 Jan 2017
Posts: 61
Location: Ukraine

PostPosted: Thu Mar 23, 2017 12:51 pm    Post subject: Reply with quote

rjlittlefield wrote:

I would be interested to hear some numbers. With my setup, I usually see Zerene Stacker consuming about 75% cpu when running a single PMax.

--Rik


98-100% on 6950X

rjlittlefield
Site Admin


Joined: 01 Aug 2006
Posts: 18244
Location: Richland, Washington State, USA

PostPosted: Thu Mar 23, 2017 1:57 pm    Post subject: Reply with quote

To go along with naked numbers, it would be very helpful to know about Options > Preferences settings and other parameters that are important to computation speed.

Probably I have overlooked some in writing this list, but certainly these include:
  • Multiprocessing > "Overlap I/O with computation if possible"
  • Caching & Undo > "Cache [un]aligned screen images" (two checkboxes)
  • Preprocessing > "Use External TIFF Reader"
  • Alignment > "Advanced interpolators"
  • Memory Usage > "Megabytes of memory currently allocated..."
  • image size (pixel counts)
  • input file format (JPEG vs TIFF; for TIFF, uncompressed vs compressed)

If I had to guess, I would predict that curt0909 is seeing his low CPU utilization due to preview images turned on and overlap I/O turned off, while MacroLab3D is seeing his high CPU utilization with overlap turned on, and quite possibly with preview turned off and other optimizations. The only time I see 98-100% on my system is when I'm running multiple sessions in parallel.

--Rik
Page 1 of 2

 