Parallelization
Feb. 12th, 2017 10:05 pm
My boss got back from AMS early the week before last, and decided that yes, she did want me to go all-out to try to get the new dataset fully bias-corrected by mid-March, in time for it to be included in a big assessment report.
This is both a simple and complicated goal. On the one hand, I don't need to do any further method development: I have code that does the bias correction and I've tested it thoroughly on a good set of test cases and I know it works. On the other hand, it takes about 18 seconds to do one location. When you multiply that out by the full spatial domain and all the different variables and simulations, it adds up to about 2500 CPU-hours to do the whole thing. That's about 104 days, which puts us considerably past mid-March.
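(If you want to see the arithmetic, here's the back-of-the-envelope version in R; the half-million figure is implied by the totals rather than something I've counted directly.)

```r
secs_per_location <- 18
total_cpu_hours   <- 2500

# Implied number of tasks (locations x variables x simulations):
total_cpu_hours * 3600 / secs_per_location   # 500,000

# Serial wall-clock time:
total_cpu_hours / 24                         # ~104 days -- well past mid-March

# With half a million processors, one task apiece:
secs_per_location                            # 18 seconds of compute, ignoring spin-up
```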
The good news is that the problem is embarrassingly parallel. (I might say it's even a step beyond "embarrassingly" parallel and into the realm of the ludicrously or stupidly parallel.) So all I have to do is get it set up to run in parallel on our supercomputer or maybe on the cloud and it'll be done in no time. (In principle, if I could get half a million processors, I could do the whole thing in under a minute. In practice, I gather that it always takes at least 5 minutes to get things spun up.)
But there are a lot of unknowns in getting it to run in parallel. I theoretically know how to get R to run in parallel using MPI -- I'm a coauthor on a tech note about it -- but I've never actually done it myself. So the first week was pretty stressful, because I was constantly aware of the clock ticking as I tried, with rather limited success, to Make It Go.
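(For the record, the kind of thing I mean looks roughly like the sketch below. Rmpi is just one of the packages the MPI route can go through, and `bias_correct()` and `locations` are hypothetical stand-ins for my actual function and data, not the real code.)

```r
library(Rmpi)

# Placeholder stand-ins for the real dataset and bias-correction method
locations    <- seq_len(1000)
bias_correct <- function(loc) sqrt(loc)  # stands in for ~18 s of real work

# Spawn one R worker per available MPI slot, minus the master process
mpi.spawn.Rslaves(nslaves = mpi.universe.size() - 1)

# Load-balanced apply: each worker bias-corrects one location at a time
results <- mpi.applyLB(locations, bias_correct)

mpi.close.Rslaves()
mpi.quit()
```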
Happily, early this last week I got some help from one of our computing consultants, and although we didn't get the MPI approach to really work, he pointed me at another approach that was both simpler and better suited to the task at hand. I've replicated my test case and it works beautifully, so now it's just a matter of restructuring the data to scale it up. Which means I no longer have to keep all the time horizons and contingency plans and feasibility evaluation checkpoints floating around in my head; I can just focus on implementing the solution. Which is a huge relief.
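(I won't try to describe the new approach from memory here, but just to show how little code the embarrassingly parallel case can need in R, here's the single-machine pattern from the base parallel package. This is purely illustrative, not necessarily what the consultant recommended, with the same hypothetical stand-ins as above.)

```r
library(parallel)

# Same hypothetical stand-ins as in the MPI sketch
locations    <- seq_len(1000)
bias_correct <- function(loc) sqrt(loc)  # stands in for ~18 s of real work

# Fan the independent locations out across all local cores
# (mclapply forks processes, so this is Unix-only)
results <- mclapply(locations, bias_correct, mc.cores = detectCores())
```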
(Hmm, that turned longer than I planned. I'll save the non-work stuff for another post.)