CE and VirtualBox Apps

entity ID: 7097150 Posts: 29
21 Jun 2022 04:07 PM

I recently reattached to CE to see if the project connection issues had been resolved. Once I attached to the account manager (CE), CE attached me to Rosetta and LHC along with Number Fields. At this point, all kinds of bad things happened. Once attached to the projects (a choice over which I have absolutely no control):

1. All the projects started downloading work to my system (I will focus on only one system but the same thing happened on the other one).

2. Rosetta tasks downloaded rather quickly and started to execute on a 24-thread system. Due to Rosetta's version of the vboxwrapper, each task copied the .vdi file from the projects directory to its own slot directory (that is 7GB per task).

3. At the same time, LHC started downloading work, but these are HUGE files (several hundred megabytes each, and more than one for each task).

4. The consequence of this is a hard disk running 100% busy. Since the boinc-client is doing the copying for Rosetta, it stops communicating with the BOINC manager, and I am essentially locked out of the BOINC system. Next, tasks that have finished the copy process start to execute and use about 2.5GB of memory each, which pushes the system into the swap file (only 32GB on a 24-thread system), but that isn't going to work because the hard disk is 100% busy. Soon tasks fail and new tasks start, which starts the .vdi copy again, and on it goes. What finally terminated this nightmare (remember, I have no control over any of this) was the /var/lib/boinc filesystem filling up: since the boinc-client couldn't write its state file, it terminated.

This scenario was replicated on the 128-thread Dell server with 256GB of memory and a 300GB /var/lib/boinc data directory. The server ran out of memory and the data directory filled up. If one were to just restart the client, it would continue where it left off and the whole nightmare would start again. Since the client was down, I edited the preferences file to use only 10% of the CPUs and extended/cleaned up the data directories, then restarted the client. Once the client had started and settled down with only a few tasks executing, I was able to go in and suspend all the remaining tasks so that I could control their release into the system. I can manage further using staggered starts and app_config files.

The problem is: when I'm first attached to the projects, I don't know the app names (for app_config files), I don't have project directories in which to store the app_config file, I don't know how much memory each task is going to use so I can't set limits, etc. I guess on smaller systems (6 cores or fewer with 16GB of memory and the standard hard disks that PCs have today) it isn't so much of a problem, but on larger systems it is a BIG DEAL. Part of the problem with the memory is that some of the LHC tasks use 9GB to 10GB each, which the user doesn't know when first attached. On a 128-thread system, how many VBox tasks can one run concurrently? There isn't an easy answer to that question due to all the dependencies. It's like asking, "How long is a piece of rope?" Number Fields isn't any problem at all: small downloads, small app, easily run 128 concurrently. So the thinking that one can connect to CE or Science United or Grid Republic and just download work and run it unobtrusively in the background is not reality. CPU priority seems to be the focus, but it is actually the least of your worries.
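
As a rough back-of-envelope bound, using only the figures quoted in this thread (256GB of RAM and a 300GB data directory on the server, about 2.5GB of RAM and a 7GB .vdi copy per Rosetta task, up to 10GB of RAM per LHC task) and ignoring the OS, downloads in flight, and every other project, the ceiling on concurrent VBox tasks works out to something like the following. The symbols N are introduced here purely for illustration.

```latex
\[
N_{\text{Rosetta}} \le \min\!\left(
  \left\lfloor \frac{256\ \text{GB RAM}}{2.5\ \text{GB}} \right\rfloor,\;
  \left\lfloor \frac{300\ \text{GB disk}}{7\ \text{GB}} \right\rfloor
\right) = \min(102,\ 42) = 42
\]
\[
N_{\text{LHC}} \le \left\lfloor \frac{256\ \text{GB RAM}}{10\ \text{GB}} \right\rfloor = 25
\]
```

Under these assumptions, disk space caps Rosetta at about 42 tasks and RAM caps LHC at about 25, both far below 128 threads. The practical limit is likely lower still once disk I/O contention is taken into account.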

Matt ID: 44 Posts: 302
21 Jun 2022 09:12 PM

Thanks for the very informative note.  

Our client application has preferences to cap use of CPU, RAM, disk space, and more.  The default settings for these should prevent the issues you describe.

I wonder: did you at some point change these settings manually, perhaps to optimize resource use?

entity ID: 7097150 Posts: 29
21 Jun 2022 09:35 PM

If you are referring to the Charity Engine app, that only runs on Windows, if I remember correctly. I am 100% a Linux user, so that app doesn't apply to me.

However, if your app is available for Linux, I would be willing to give it a go. Alternatively, if the source code is available, I would be willing to compile it on Linux and try to get it to function. Tristin indicated that he was researching packaging solutions for Linux.

Matt ID: 44 Posts: 302
21 Jun 2022 09:45 PM

BOINC on Linux also has preferences to allow you to manage these dimensions of your system.  

I'm not sure what the defaults are, but you can cap overall CPU usage, number of cores used, RAM usage, disk space used, disk space to retain as free, and a number of dimensions of network usage. These should give you the tools to prevent the issues you describe above. If you have any questions after you take a look, let us know.

(*For Windows users, the Charity Engine defaults should already preempt the behaviors above)
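
For Linux readers, the caps listed above can be sketched as a global_prefs_override.xml file in the BOINC data directory (/var/lib/boinc in this thread). This is a minimal sketch, not a recommendation: every value below is an illustrative assumption (the 10% CPU figure is the one mentioned earlier in the thread), and it should be tuned to the machine.

```xml
<global_preferences>
   <!-- Use at most 10% of the logical CPUs (illustrative value) -->
   <max_ncpus_pct>10</max_ncpus_pct>
   <!-- Run tasks at most 100% of the time on the CPUs that are used -->
   <cpu_usage_limit>100</cpu_usage_limit>
   <!-- RAM caps while the computer is in use / idle (illustrative) -->
   <ram_max_used_busy_pct>50</ram_max_used_busy_pct>
   <ram_max_used_idle_pct>75</ram_max_used_idle_pct>
   <!-- Disk: total BOINC may use, and free space to always leave -->
   <disk_max_used_gb>200</disk_max_used_gb>
   <disk_min_free_gb>20</disk_min_free_gb>
   <!-- Download rate cap in bytes per second (illustrative) -->
   <max_bytes_sec_down>1000000</max_bytes_sec_down>
</global_preferences>
```

The client should pick the file up on restart, or after running boinccmd --read_global_prefs_override. Note that these limits are global to the client, so they apply to every attached project alike.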

entity ID: 7097150 Posts: 29
21 Jun 2022 11:07 PM

As described in the original post, when you first connect to CE those parameters aren't set because one isn't familiar with the applications at the time. Yes, BOINC does have those parameters but, as many BOINC users will tell you, they don't always apply cleanly when running a range of projects. For instance, limits set for LHC or Rosetta will unnecessarily limit Number Fields. Those BOINC values are global to the client, not app-specific. As also stated, I have put in place the necessary app_config files to limit the number of concurrent VBox apps and still allow Number Fields to have the resources available for optimum function. However, even app_config has its limitations for projects like LHC and WCG that have "sub-projects" or multiple apps. I can limit an app with app_config, but if one of the other apps becomes unavailable, I can't take advantage of the freed resources without making additional changes to the app_config and re-reading the config files. Then you get into micro-managing the boinc-client. The problem, as I see it, is the VBox apps. If you had CE attach only to non-VBox apps, the problem would be greatly reduced (although not entirely eliminated). Let users opt in to VBox apps.
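
For readers who have not used one, an app_config.xml of the kind described above might look like the sketch below. It lives in the project's own directory under the BOINC data directory's projects folder. The app name "example_vbox_app" is hypothetical (real names only become known after attaching, for example from the app entries in client_state.xml), and both limits are illustrative assumptions.

```xml
<app_config>
   <!-- Hypothetical app name; replace with the project's real app name -->
   <app>
      <name>example_vbox_app</name>
      <!-- Run at most 4 tasks of this app at once (illustrative) -->
      <max_concurrent>4</max_concurrent>
   </app>
   <!-- Cap on running tasks across all of this project's apps -->
   <project_max_concurrent>8</project_max_concurrent>
</app_config>
```

The project-wide cap may soften the multi-app limitation somewhat, since it limits the project as a whole rather than each app, but it cannot distinguish a 10GB app from a small one. After editing, the client has to re-read its config files before the limits take effect.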

Now that I have put in place the app_config files and set parameters for these projects, what happens if CE attaches me to another project? I don't have the app names to build an app_config file or a projects directory to put it in. What will happen is the new project will download a bunch of tasks and they will start to execute and the original problem will occur again. If I set BOINC CPU, Memory, and Disk parameters to lower values they will impact the currently running work that, as of today, is running quite well.  

Matt ID: 44 Posts: 302
22 Jun 2022 08:54 PM

I think perhaps we're crossing wires here somewhere.

[1] First, by way of context for future readers: if you join Charity Engine as a new user, and attach a device (Windows, Mac, or Linux), you will automatically get a default set of computing preferences which should prevent the issues reported above.

The primary objective of these default preferences is to prevent participation in our service from interfering with volunteers' overall user experience with their device; i.e., Charity Engine should run in the background and not be a nuisance. We want Charity Engine to be an easy way to do good in the world.

[2] As to the issue you are reporting: If I understand correctly, you are aiming to revise your preferences to optimize not for general, "turn on and forget" reliability, but rather for peak throughput.

One thing to keep in mind with your specific goal is that Charity Engine is meaningfully different from most BOINC projects. We run a range of commercial workloads, with revenues being shared with charities and volunteer prize draws; and we also commit substantial resources to a wide range of public interest scientific research. So we run a large and ever-changing variety of applications.

[3] Having said all that: If VirtualBox apps pose a special challenge for your system and how you'd like to configure it, you can disable VBox apps by setting the dont_use_vbox option in the cc_config.xml file on your devices, as follows (*requires a BOINC client restart):

<cc_config>
   <log_flags>
      [ ... ]
   </log_flags>
   <options>
      <dont_use_vbox>1</dont_use_vbox>
      [ ... ]
   </options>
</cc_config>

[4] One last detail: Most of the devices in our pool are consumer PCs (and thus generally get only one VBox task at a time). We're discussing partnerships with other kinds of resource providers, though; I wonder if you have several servers of the type you describe above, and if such a program might be of interest?

entity ID: 7097150 Posts: 29
22 Jun 2022 09:25 PM

Thanks, Matt, for the informative discussion. You may be right in your last post. I think the problem might be due to the fact that my BOINC system was not installed brand new and has been in use since 2004. I have therefore modified the preferences many, many times for a wide range of projects over the years. The problem is more than likely that I have a local preferences file, and the client is ignoring the global preferences provided by CE when first attaching. Regardless, after the initial attach and the resulting mayhem, it has been running quite well since I tweaked preferences, added things to the cc_config, and revamped the supporting filesystem(s). I saw the <dont_use_vbox> parameter but was hesitant to use it because that would have eliminated everything except Number Fields, which kind of negates the original goal of attaching to an account manager (to be assigned to multiple projects without the need to manage everything myself). Some projects let contributors limit the number of tasks from certain apps, but with CE I don't have that ability, as I don't have a logon at the project site.

As to number 4 above: yes, I do have several servers, but not as large as the one described above. I'm only running the large one right now as I'm in transition to a new location. I should be in the new location toward the end of the year. I have the 64-core (128-thread) server, two 32-core (64-thread) servers, and three 12-core (24-thread) desktop-type systems, with the intent to acquire another 64-core (128-thread) or larger system once the move is complete. I tend to like servers because they give me hot-plug capabilities, ECC memory, CPU sparing, etc., which minimize downtime. If the described resources are of interest to you, I would be willing to discuss doing something with CE.

Matt ID: 44 Posts: 302
23 Jun 2022 04:46 PM

Great; and likewise thanks for the informative discussion.

As for #4 above – I'll add you to an email list, and will drop a line if/when we launch that program.

Happy computing in the meantime –