ce6 pub beta opencl ati computation errors

jjch ID: 21776 Posts: 11
23 May 2014 11:00 PM

Hello everyone,

I am a moderately experienced BOINC user and joined the Charity Engine team a few months ago. During this time I have noticed that OpenCL work units do not run efficiently on Nvidia Quadro graphics cards.

What I decided to do was build a system with AMD FirePro cards just to crunch for OpenCL projects and keep those off the Nvidia cards. That way I can use them only for CUDA tasks which they are better at.

I have a server running Windows Server 2012 Datacenter with two AMD FirePro W7000 graphics cards. GPU-Z reports driver version 12.104.2.0. It also shows that OpenCL and DirectCompute 5.0 are enabled and CrossFire is disabled.

Now I have run into a snag and haven't easily found a resolution.

All of the ce6-pub-beta (opencl_ati_101) tasks fail with computation errors within a few seconds. I get a "ce6.exe has stopped working" error box, and when I look at the details I see the APPCRASH from ce6.exe and amdocl.dll.

For what it's worth, SETI@home v7 7.03 (opencl_ati_cat132) work units run fine on both GPUs at the same time. Einstein Gamma-ray pulsar search #3 1.11 (FGRPopencl-ati) tasks work too.

Now I am stumped why the CE tasks are failing. I don't have an easy way to tell if the Work Units are bad or there is something wrong with my system. Maybe I am the first one to try this setup and we will have to break new ground to get it fixed.

Please advise if there is any known issue with the CE6 tasks or something I may have missed while building the new system. If you need more information to get to the bottom of this let me know.

Thanks in advance,

jjch

Jo1252 ID: 1311 Posts: 81
30 May 2014 10:12 PM

Is your issue fixed?

If not, is your driver version the latest available?

The APPCRASH indicates a problem with OpenCL, so it might be worth reinstalling the driver.

Is the OS up to date?

jjch ID: 21776 Posts: 11
01 Jun 2014 04:53 AM

No, this has not been solved yet.

I believe the AMD driver is the latest as of a couple of weeks ago.

I agree this is a problem with OpenCL, but I'm not convinced it is related to the driver. The other two apps seem to run fine. It could be the new beta application.

This server was recently built and updated, so I'm not so sure reinstalling the driver would be helpful.

I was previously running Nvidia cards and uninstalled those drivers, but it is possible there are some remaining issues or a conflict from that.

I have contacted the Charity Engine team and the developers are looking into it. I will provide an update when this is solved.

One thing that I realized is that this is a beta test app, and these tasks were coming in even though my settings were set to "No beta test apps".

Also, in case it is helpful to anyone: McAfee seems to have sent out a new virus update that detects the ce6 app. You will need to configure an exclusion for the BOINC slots folders as well as the BOINC Charity Engine app folder.
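For anyone unsure which folders those are, here is a minimal Python sketch that prints likely candidates. It assumes the default BOINC data directory on Windows (C:\ProgramData\BOINC) and that the Charity Engine project folder under "projects" contains "charityengine" in its name. Both are assumptions, so check the output against your own install before adding anything to the antivirus exclusion list.

```python
from pathlib import Path

# Assumption: default BOINC data directory on Windows. Adjust this if BOINC
# was installed with a custom data directory.
DEFAULT_BOINC_DATA_DIR = Path(r"C:\ProgramData\BOINC")


def exclusion_candidates(data_dir, project_hint="charityengine"):
    """Return folders that are plausible antivirus exclusion targets.

    project_hint is a hypothetical substring used to pick the project folder;
    the real folder name depends on the project URL you attached to.
    """
    data_dir = Path(data_dir)
    # Tasks run from numbered subfolders of "slots", so exclude the parent.
    candidates = [data_dir / "slots"]
    projects_dir = data_dir / "projects"
    if projects_dir.is_dir():
        for child in sorted(projects_dir.iterdir()):
            if child.is_dir() and project_hint in child.name.lower():
                candidates.append(child)
    return candidates


if __name__ == "__main__":
    for path in exclusion_candidates(DEFAULT_BOINC_DATA_DIR):
        print(path)
```

The script only lists folders. The exclusions themselves still have to be entered in the antivirus product by hand.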

jjch

StitchExperimen ID: 3911 Posts: 17
05 Jun 2014 03:08 AM

First off, this advice will be incomplete because I haven't done this in nine months to a year, and I have since moved to server hardware instead of enthusiast hardware [except for video cards... they give away the processing cores (AMD)]. I moved from dual NVidia 660 Tis to dual 7950s in an i7 3770. After what I thought would uninstall the NVidia drivers, I kept finding traces left behind, and then there were problems. After a lot of research on Google, I found a free program that a number of reputable sites referenced, and it worked well at cleaning up leftovers from many video drivers. So it might behoove you to try this approach. Another oddball approach might be to turn on beta testing and see if it allows a unit to progress past the present failure point. Just a thought.

jjch ID: 21776 Posts: 11
12 Jun 2014 05:21 AM

I tried turning these back on and wasn't able to get any more work units. Maybe this application is not available now. I will have to see if there is any update again later.

jjch

jjch ID: 21776 Posts: 11
12 Jun 2014 05:24 AM

Also, these may not have been true beta test work units. It appears that the name was not changed when they were sent out to general users. See the post on the ce7-pub-beta application.

Mark McA ID: 179 Posts: 228
14 Jun 2014 02:11 PM

Hi folks,

ce6 is actually quite old, well out of beta. We'll change the label.

It's an ATI-only app, so you won't see it if you have an Nvidia. Thanks for the heads-up about the AV false positive, we suspended it while we got it whitelisted (doesn't always work though, we've found before).

We're far from the only small software vendor having this issue. The more 'viruses' AV detects, the better it looks to the consumer...

Cheers,

Mark

jjch ID: 21776 Posts: 11
01 Jul 2014 10:31 PM

An update on this issue.

Today when I logged into this server I noticed that I have been receiving "1.10 ce6-pub-beta (opencl_ati 101)" work units for ATI GPUs again. I found dozens of the "ce6.exe has stopped working" popups. Apparently, if you leave the ce6.exe popup open, the jobs continue to run and complete successfully. What I was doing previously was closing the popup, and that was causing the jobs to fail with the computation error.

I went through the process of completely uninstalling the ATI driver and making sure the old NVIDIA driver was cleaned up by using the Display Driver Uninstaller (DDU) program. I also tried excluding jobs from one GPU to see if it had a similar problem to the POEM application and wouldn't run on two GPUs at the same time. It does run on two simultaneously if you don't mess with it, as I mentioned.

Still no luck fixing the popup error. At least I know the jobs will complete if I leave it alone, but I will have to log in and clean them up every so often so the popups don't stack up indefinitely. I may end up causing a couple of jobs to fail each time, since I can't tell which popups belong to jobs that are still running.
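For anyone who wants to repeat the one-GPU test, BOINC supports an exclude_gpu entry in cc_config.xml. The Python sketch below only generates that XML. The project URL, device number, and app name passed in are placeholders rather than the values used on this server, so substitute whatever your BOINC Manager shows for the project and device.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu_config(project_url, device_num, gpu_type="ATI", app=None):
    """Build cc_config.xml text that keeps one GPU away from one project.

    All argument values are supplied by the caller; nothing here is taken
    from the Charity Engine project itself.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    ET.SubElement(exclude, "type").text = gpu_type
    if app is not None:
        # Optional: restrict the exclusion to a single application.
        ET.SubElement(exclude, "app").text = app
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder URL and app name, for illustration only.
    print(build_exclude_gpu_config("http://example.invalid/", 1, app="ce6"))
```

The resulting text goes into cc_config.xml in the BOINC data directory, and the client has to re-read its config files (or be restarted) before the exclusion takes effect.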

Any more ideas before I decide to completely rebuild the server from scratch to see if that clears up the problem?

jjch

Mark McA ID: 179 Posts: 228
01 Jul 2014 10:41 PM

Hi jjch,

Firstly, apologies we haven't changed the 'beta' name yet, it turns out that's going to mess up the AV white-listing and is a *much* bigger task than I thought (my fault for promising things before checking with the tech guys first).

As for the error, it's certainly a strange one. One of the guys will follow up soon.

Thanks for letting us know, it's appreciated.

Cheers,

Mark

Dave Bush ID: 636103 Posts: 1
09 Oct 2014 09:44 AM

I've just installed Charity Engine and I get exactly the same problem. I've a completely generic AMD Radeon HD5450, with the drivers automatically installed by Windows 8.1 and up to date.

I'll try leaving the dialog up and see what happens.

jjch ID: 21776 Posts: 11
09 Oct 2014 06:36 PM

I have not found a resolution to this issue myself. You can minimize the APPCRASH error window and the work units should continue to run and complete. After the WU finishes you can close it.

To me this seems to be more of an annoyance than anything. One thing I don't know is whether CE actually receives good data. The engineering team should investigate further and resolve it if possible.

Over a period of time these windows will stack up, and if you have a lot of them it takes a while to clear them. At that point it may be easier to shut down BOINC processing and reboot.

I have also found that the ce3 work units are experiencing a very similar issue when running the CUDA tasks on my systems running both Windows 7 and Windows Server 2012 64-bit. See this topic for some info: http://www.charityengine.com/forum/show-topic/1251

The 1.51 CE CUDA tasks on my old xw4600 with an NVIDIA Quadro 5000 have been failing with computation errors, and I haven't found a fix for that problem either. This workstation is running Windows 7 32-bit, which is the one difference that could possibly be causing it.

In all of these similar cases I have tried the latest Windows updates, NVIDIA or AMD drivers, Microsoft Visual C++ 2010 and 2013, and even the latest BOINC client version 7.4.21, without finding a resolution.

The CE engineering team really needs to look at these failures in the application, as I think they are losing a lot of potential computing power.

JJCH