Which GPU for Deep Learning

Reposted from: http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/

It is again and again amazing to see how much speedup you get when you use GPUs for deep learning: Compared to CPUs 10x speedups are typical, but on larger problems one can achieve 20x speedups. With GPUs you can try out new ideas, algorithms and experiments much faster than usual and get almost immediate feedback as to what works and what does not. If you do not have a GPU and you are serious about deep learning you should definitely get one. But which one should you get? In this blog post I will guide you through the choices to the GPU which is best for you.

Having a fast GPU is a very important aspect when one begins to learn deep learning as this rapid gain in practical experience is key to building the expertise with which you will be able to apply deep learning to new problems. Without this rapid feedback it just takes too much time to learn from one’s mistakes and it can be discouraging and frustrating to go on with deep learning.

With GPUs I quickly learned how to apply deep learning on a range of Kaggle competitions and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition, where the task was to predict weather ratings for a given tweet. In the competition I used a rather large two-layered deep neural network with rectified linear units and dropout for regularization, and this deep net barely fitted into my 6GB of GPU memory. More details on my approach can be found here.

Should I get multiple GPUs?

Excited by what deep learning can do with GPUs, I plunged myself into multi-GPU territory by assembling a small GPU cluster with a 40Gbit/s InfiniBand interconnect. I was thrilled to see whether even better results could be obtained with multiple GPUs.

I quickly found that it is not only very difficult to parallelize neural networks on multiple GPUs efficiently, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition received almost no speedup.

Figure: Setup in my main computer: you can see three GTX Titans and an InfiniBand card. Is this a good setup for doing deep learning?

However, using model parallelism, I was able to train neural networks that were much larger and that had almost 3 billion connections. But to leverage these connections one needs much larger data sets, which are uncommon outside of large companies – I found such data when I trained a language model on the entire Wikipedia corpus, but that's about it.

On the other hand, one advantage of multiple GPUs is that you can run multiple algorithms or experiments separately on each GPU. You gain no speedups, but you get more information about your performance by using different algorithms or parameters at once. This is highly useful if your main goal is to gain deep learning experience as quickly as possible, and it is also very useful for researchers who want to try multiple versions of a new algorithm at the same time.

If you use deep learning only occasionally, or you use rather small data sets (smaller than say 10-15GB) and foremost dense neural networks, then multiple GPUs are probably not for you. However, when you use convolutional neural networks a lot, then multiple GPUs might still make sense.

Alex Krizhevsky released his new updated code which can run convolutional neural networks on up to four GPUs. Convolutional neural networks – unlike dense neural networks – can be run very efficiently on multiple GPUs because their use of weight sharing makes data parallelism very efficient. On top of that, Alex Krizhevsky’s implementation utilizes model parallelism for the densely connected final layers of the network which gives additional gains in speed.

However, if you want to program similar networks yourself, be aware that to program efficient multiple GPU networks is a difficult undertaking which will consume much more time than programming a simple network on one GPU.

So overall, one can say that one GPU should be sufficient for almost any task and that additional GPUs convey benefits only under very specific circumstances (mainly for very, very large data sets).

So what kind of GPU should I get? NVIDIA or AMD?

NVIDIA’s standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD’s OpenCL. Right now, there are just no good deep learning libraries for AMD cards – so NVIDIA it is. Even if some OpenCL libraries became available in the future, I would stick with NVIDIA: the GPU computing or GPGPU community is very large for CUDA and rather small for OpenCL. Thus, in the CUDA community, good open source solutions and solid advice for your programming are readily available.

Required memory size for simple neural networks

People often ask me if the GPUs with the largest memory are best for them, as this would enable them to run the largest neural networks. I thought like this when I bought my GPU, a GTX Titan with 6GB memory. And I also thought that this was a good choice when my neural network in the Partly Sunny with a Chance of Hashtags Kaggle competition barely fitted in my GPU memory. But later I found out that my neural network implementation was very memory inefficient and much less memory would have been sufficient – so sometimes the underlying code is limiting rather than the GPU. Although standard deep learning libraries are efficient, these libraries are often optimized for speed rather than for memory efficiency.

There is an easy formula for calculating the memory requirements of a simple neural network. The formula gives the memory requirement in GB for a standard implementation of a simple neural network with dropout and momentum/Nesterov/AdaGrad/RMSProp:

$$\mbox{Memory in GB} = 12\times 1024^{-3}\left(\left(\sum\limits_{i=0}^{\mbox{weights}} \mbox{rows}_i\times \mbox{columns}_i \right) + \mbox{batchsize}\sum\limits_{i=0}^{\mbox{layers}} \mbox{units}_i \right)$$
Memory formula: the number of units for the first layer is the dimensionality of the input. In words, this formula means: sum up the weight sizes and the unit counts separately; multiply the unit counts by the batch size; multiply everything by 4 for bytes and by another 3 for the momentum and gradient matrices in the first term, and the dropout and error matrices in the second term; then divide by 1024³ to get gigabytes.

For the Kaggle competition I used a 9000x4000x4000x32 network with batchsize 128 which uses up:
$$12\times 1024^{-3}\left(\left(9000\times 4000 + 4000\times 4000 + 4000 \times 32\right) + 128\left(9000+4000+4000+32\right)\right) \approx 0.62\mbox{ GB}$$
So this fits even into a small GPU with 1.5GBs memory. However, the world looks quite different when you use convolutional neural networks.
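To make this arithmetic easy to repeat for your own architecture, here is a minimal Python sketch of the formula above (the function name and structure are my own, not code from any particular library):

```python
# Memory estimate for a simple dense net with dropout and a momentum-style
# optimizer: 4 bytes per float32 value, times 3 matrices per term
# (weights/momentum/gradient and activations/dropout-mask/error).

def simple_net_memory_gb(layer_sizes, batch_size):
    """layer_sizes includes the input dimensionality, e.g. [9000, 4000, 4000, 32]."""
    weight_params = sum(rows * cols
                        for rows, cols in zip(layer_sizes[:-1], layer_sizes[1:]))
    unit_count = sum(layer_sizes)  # units of all layers, input included
    return 12 * (weight_params + batch_size * unit_count) / 1024**3

# The Kaggle network from above: roughly 0.6 GB, in line with the estimate.
print(simple_net_memory_gb([9000, 4000, 4000, 32], batch_size=128))
```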

Understanding the memory requirements of convolutional nets

In the last update I made some corrections for the memory requirement of convolutional neural networks, but the topic was still unclear and lacked definite advice on the situation.

I sat down and tried to build a similar formula as for simple neural networks, but the many parameters that vary from one net to another, and the implementations that vary from one net to another made such a formula too impractical – it would just contain too many variables and would be too large to give a quick overview of what memory size is needed. Instead I want to give a practical rule of thumb. But first let us dive into the question why convolutional nets require so much memory.

There are generally two types of convolutional implementations. One uses Fourier transforms, the other operates directly on image patches.

$$\mbox{featuremap}(\mathbf{x}) = \int\limits_{-\infty}^\infty \mbox{input}(\mathbf{x} - \mathbf{x_0})\,\mbox{kernel}(\mathbf{x_0})\, d\mathbf{x_0} = \sqrt{2\pi}\times \mbox{input}^\star\times\mbox{kernel}^\star$$
Continuous convolution theorem with abuse of notation: the input represents an image or feature map, and the subtraction of the argument can be thought of as creating image patches of width $\mathbf{x_0}$ with respect to some $\mathbf{x}$, which are then multiplied by the kernel. The integration turns into a multiplication in the Fourier domain; here $f^\star$ denotes a Fourier-transformed function. For discrete "dimensions" $\mathbf{x}$ we have a sum instead of an integral, but the idea is the same.

The mathematical operation of convolution can be described by a simple matrix multiplication in the Fourier frequency domain. So one can perform a fast Fourier transform on the inputs and on each kernel and multiply them to obtain feature maps – the outputs of a convolutional layer. During the backward pass we do an inverse fast Fourier transform to receive gradients in the standard domain so that we can update the weights. Ideally, we store all these Fourier transforms in memory to save the time of allocating the memory during each pass. This can amount to a lot of extra memory and this is the chunk of memory that is added for the Fourier method for convolutional nets – holding all this memory is just required to make everything run smoothly.
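A small numpy sketch of the Fourier method (my own illustration, not the code of any deep learning library) shows how convolution becomes an element-wise multiplication of transforms; the cached transforms of inputs and kernels are exactly the extra memory discussed above:

```python
import numpy as np
from scipy.signal import convolve2d  # direct convolution, used only as a check

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))    # one input feature map
kernel = rng.standard_normal((5, 5))     # one convolution kernel

# Zero-pad both to the full output size, transform, multiply, transform back.
out_shape = (image.shape[0] + kernel.shape[0] - 1,
             image.shape[1] + kernel.shape[1] - 1)
image_fft = np.fft.rfft2(image, out_shape)     # these transforms are what an
kernel_fft = np.fft.rfft2(kernel, out_shape)   # FFT-based net keeps in memory
feature_map = np.fft.irfft2(image_fft * kernel_fft, out_shape)

# The result matches direct convolution of the image with the kernel.
assert np.allclose(feature_map, convolve2d(image, kernel, mode='full'))
```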

The method that operates directly on image patches realigns memory for overlapping patches to allow contiguous memory access. This means that all memory addresses lie next to each other – there is no “skipping” of indices – and this allows much faster memory reads. One can imagine this operation as unrolling a square (or cubic) kernel window in the convolutional net into a single line, a single vector. Slow memory access is probably the thing that hurts an algorithm’s performance the most, and this prefetching and aligning of memory makes the convolution run much faster. There is much more going on in the CUDA code for this approach of calculating convolutions, but prefetching of inputs or pixels seems to be the main reason for the increased memory usage.
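A toy im2col-style sketch (again my own simplification, not the actual CUDA code) makes the memory trade-off visible: overlapping patches are copied out into one contiguous matrix, duplicating pixels so that the convolution becomes a single matrix multiply.

```python
import numpy as np

def im2col(image, k):
    """Unroll every k-by-k patch of a 2D image into one row of a contiguous matrix."""
    h, w = image.shape
    rows = [image[i:i + k, j:j + k].ravel()
            for i in range(h - k + 1)
            for j in range(w - k + 1)]
    return np.array(rows)                       # shape: (num_patches, k*k)

image = np.arange(36, dtype=np.float64).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0                  # a simple box filter
patches = im2col(image, 3)                      # overlapping pixels are duplicated here
feature_map = (patches @ kernel.ravel()).reshape(4, 4)  # convolution as one matrix multiply
```

The `patches` matrix holds each interior pixel up to nine times, which is where the extra memory of this approach goes.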

Another memory-related issue, which is general for any convolutional net implementation, is that convolutional nets have small weight matrices due to weight sharing, but many more units than densely connected, simple neural networks. The result is that the second term in the equation for simple neural networks above is much larger for convolutional neural networks, while the first term is smaller. So this gives another intuition for where the memory requirements come from.

I hope this gives an idea about what is going on with memory in convolutional neural nets. Now let us have a look at what practical advice might look like.

Required memory size for convolutional nets

Now that we understand that implementations and architectures vary wildly for convolutional nets, we know that it might be best to look for other ways to gauge memory requirements.
In general, if you have enough memory, convolutional neural nets can be made nearly arbitrarily large for a given problem (dozens of convolutional layers). However, you will soon run into problems with overfitting, so that larger nets will not be any better than smaller ones. Therefore the data set size and the number of label classes might serve well as a gauge of how big you can make your net and, in turn, how large your GPU memory needs to be.

One example here is the Kaggle plankton detection competition. At first I thought about entering the competition as I might have a huge advantage through my 4-GPU system. I reasoned I might be able to train a very large convolutional net in a very short time – one thing that others cannot do because they lack the hardware. However, due to the small data set (about 50×50 pixels, 2 color channels, 400k training images; about 100 classes) I quickly found that overfitting was an issue even for small nets that neatly fit into one GPU and which are fast to train. So there was hardly a speed advantage of multiple GPUs and no advantage at all of having a large GPU memory.

If you look at the ImageNet competition, you have 1000 classes and over a million 250×250 images with three color channels – that’s more than 250GB of data. Here Alex Krizhevsky’s original convolutional net did not fit into a single GPU’s 3GB memory (but it did not use much more than 3GB). For ImageNet, 6GB will usually be sufficient for a competitive net. You can get slightly better results if you throw dozens of high-memory GPUs at the problem, but the improvement is only marginal compared to the resources that you need for that.

I think it is very likely that the next breakthrough in deep learning will be made on a single GPU by researchers who try new recipes, rather than by researchers who use GPU clusters and try variations of the same recipe. So if you are a researcher you should not fear a small memory size – a faster GPU, or multiple smaller GPUs, will often give you a better experience than a single large GPU.

Right now, there are not many relevant data sets that are larger than ImageNet, and thus for most people 6GB of GPU memory should be plenty; if you want to invest in a smaller GPU and are unsure whether 3GB or 4GB is okay, a good rule might be to compare the size of your problems to ImageNet and judge from that.

Overall, I think memory size is overrated. You can nicely gain some speedups if you have very large memory, but these speedups are rather small. I would say that GPU clusters are nice to have, but that they cause more overhead than they accelerate progress; a single 12GB GPU will last you for 3-6 years; a 6GB GPU is good for now; a 4GB GPU is good but might be limiting on some problems; and a 3GB GPU will be fine for most research that tests new architectures and algorithms on small data sets.

Fastest GPU for a given budget

Processing performance is most often measured in floating-point operations per second (FLOPS). This measure is often advertised in GPU computing and it is also the measure which determines which supercomputer enters the TOP500 list of the fastest supercomputers. However, this measure is misleading, as it measures processing power on problems that do not occur in practice.
It turns out that the most important practical measure for GPU performance is bandwidth in GB/s, which measures how much memory can be read and written per second. This is because almost all mathematical operations, such as dot product, sum, addition etcetera, are bandwidth bound, i.e. limited by the GB/s of the card rather than its FLOPS.
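A back-of-the-envelope sketch illustrates why an element-wise operation is bandwidth bound; the bandwidth and FLOPS figures below are rough numbers I assume for a GTX-980-class card, purely for illustration:

```python
bandwidth_bytes_per_s = 224e9   # assumed: ~224 GB/s memory bandwidth
flops_per_s = 4.6e12            # assumed: ~4.6 single-precision TFLOPS

n = 10**8                       # float32 elements in the update a = a + b
bytes_moved = 3 * 4 * n         # read a, read b, write a (4 bytes each)
flops = n                       # one addition per element

time_memory = bytes_moved / bandwidth_bytes_per_s   # ~5.4 ms just to move data
time_compute = flops / flops_per_s                  # ~0.02 ms of arithmetic
print(time_memory, time_compute)  # memory traffic dominates by a factor of ~250
```

For operations like this, GB/s predicts the runtime and the FLOPS rating barely matters.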

Figure: Comparison of bandwidth for CPUs and GPUs over time. Bandwidth is one of the main reasons why GPUs are faster for computing than CPUs are.

To determine the fastest GPU for a given budget one can use this Wikipedia page and look at the bandwidth in GB/s; the listed prices are quite accurate for newer cards (700 and 900 series), but older cards are significantly cheaper than the listed prices – especially if you buy those cards via eBay. Sometimes cryptocurrency mining benchmarks are also a reliable gauge of performance (you could gauge the performance of the GTX 980 nicely before there were any deep learning benchmarks). But beware: some cryptocurrency mining benchmarks are compute bound and thus are uninformative for deep learning performance.

Another important factor to consider, however, is that the Maxwell and Fermi architectures (Maxwell 900 series; Fermi 400 and 500 series) are quite a bit faster than the Kepler architecture (600 and 700 series); so, for example, the GTX 580 is faster than any GTX 600 series GPU. The new Maxwell GPUs are significantly faster than most Kepler GPUs, and you should prefer Maxwell if you have the money. If you cannot afford a GTX Titan X or a GTX 980, a 4GB GTX 960 or a GTX 680 from eBay will be cheap choices with no troubles. If you run primarily dense and recurrent neural networks a GTX 970 will be a very solid option, but if you use convolutional networks heavily its 3.5GB memory and its weird architecture (see below) will cause you a lot of trouble; a GTX 970 is not recommended in that case. Previously, I recommended a GTX 580 as a cheap solution, but I no longer favor this GPU due to the updated cuDNN library, which features fast implementations of convolution. In its new update, cuDNN is significantly faster, and it is to be expected that more and more libraries will integrate with cuDNN. The bad thing about the GTX 580 is that the card is too dated to be compatible with cuDNN, so I no longer recommend the GTX 580.

To give a rough estimate of how the cards perform with respect to each other on deep learning tasks:

GTX Titan X = GTX 980 Ti = 0.66 GTX 980 = 0.6 GTX 970 = 0.5 GTX Titan

GTX Titan X = 0.35 GTX 680 = 0.35 AWS GPU instance (g2.2 and g2.8) = 0.33 GTX 960
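If you want to turn these relative speeds into a rough cost-efficiency ranking, a few lines of Python are enough; the relative speeds come from the relations above (GTX Titan X = 1.0), while the prices are placeholder values I made up, so substitute current eBay or retail prices:

```python
relative_speed = {          # from the comparison above, GTX Titan X = 1.0
    "GTX Titan X": 1.0, "GTX 980 Ti": 1.0, "GTX 980": 0.66, "GTX 970": 0.6,
    "GTX Titan": 0.5, "GTX 680": 0.35, "GTX 960": 0.33,
}
price_usd = {               # hypothetical prices, for illustration only
    "GTX Titan X": 1000, "GTX 980 Ti": 650, "GTX 980": 500, "GTX 970": 330,
    "GTX Titan": 280, "GTX 680": 160, "GTX 960": 200,
}
# Rank cards by relative speed per dollar spent.
for card in sorted(price_usd, key=lambda c: relative_speed[c] / price_usd[c], reverse=True):
    print(card, round(1000 * relative_speed[card] / price_usd[card], 2))
```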

The 700 series is outdated, but the GTX Titan is still interesting as a cost-effective 6GB option. The GTX 970 might also be an option, but along with the GTX 580 there are some things you will need to consider.

Cheap but troubled

The GTX 970 is a special case you need to be aware of. The GTX 970 has a weird architecture which may cripple performance if more than 3.5GB is used and so it might be troublesome to use if you train large convolutional nets. The problem had been dramatized quite a bit in the news, but it turned out that the whole problem was not as dramatic for deep learning as the original benchmarks showed: If you do not go above 3.75GB, it will still be faster than a GTX 960.

To manage the memory problem, it would still be best if you learned to extend libraries with your own software routines that alert you when you hit the performance-decreasing memory zone above 3.5GB – if you do this, then the GTX 970 is an excellent, very cost-efficient card that is just limited to 3.5GB. You might get the GTX 970 even cheaper on eBay, where disappointed gamers – who do not have control over the memory routines – may sell it cheaply. Other people have made it clear to me that they think a recommendation for the GTX 970 is crazy, but if you are short on money this is just better than no GPU, and other cheap GPUs like the GTX 580 have troubles too.
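As a sketch of such an alerting routine (my own idea of how it could look, not code from any existing library), you could simply poll nvidia-smi and warn once usage crosses the 3.5GB boundary:

```python
import subprocess, time, warnings

THRESHOLD_MB = 3.5 * 1024  # the GTX 970's fast-memory boundary

def used_memory_mb(gpu_index=0):
    """Read the currently used GPU memory in MB via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"])
    return float(out.decode().splitlines()[gpu_index])

while True:                            # run this alongside your training job
    if used_memory_mb() > THRESHOLD_MB:
        warnings.warn("GTX 970: above 3.5GB, expect a large slowdown")
    time.sleep(5)
```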

The GTX 580 is now a bit dated and the important cuDNN library does not support it, but you can still use libraries like cuda-convnet 1 and deepnet for convolution, which are both excellent libraries, but you should be aware that you cannot use (or not fully) libraries like torch7. If you do not use convolutional nets at all, the GTX 580 might still be a good choice.

If you do not like all these troubles, then go with a 4GB GTX 960 or a GTX 680 from eBay for a cheap solution with no troubles.

Amazon Web Services (AWS) GPU instances

Amazon Web Services instances can be a good option if you lack the money for a dedicated machine. They are also a very good option if you need to run multiple small experiments. Be aware, however, that it is hardly possible to use multiple GPUs on a single deep learning architecture: the problem with AWS is that it uses specialized GPUs which support virtualization (the real machine has 8 GPUs in one computer; the virtualized machine only 1-4), and it is this virtualization which cripples the bandwidth between the GPUs. There are some patches which you can apply on the server to improve this, but they will not completely remove the problem either. In the end, most algorithms will run more slowly on multiple GPUs on an AWS instance compared to a single GPU. But this is only a minor issue, and I raise this point only so that you do not waste your time on parallelization on AWS.

The true beauty of AWS is spot instances: spot instances are very cheap virtual computers which you usually rent for a couple of hours to run an algorithm, and after you have completed your work you shut them down again. For about $1.5 you can rent an AWS GPU spot instance for two hours, in which you can easily run 4 experiments on MNIST concurrently and do a total of 160 experiments in those two hours. You can use the same time to run 8-12 full CIFAR-10 or CIFAR-100 experiments.

It will generally be difficult to run ImageNet on AWS GPU instances because they offer only 4GB of memory. With 4GB you will be able to run some models, but you will need more memory to run the newer, more successful architectures. Another difficulty is the data set size of ImageNet, for which you need to rent extra storage for your instance.

The usage of an AWS instance is quite simple, but it may take a couple of hours until you get used to the process of setting up an instance, logging into it and using it; after that it becomes easy to fire up some GPU instances to do some (small) deep learning work. Installing all the necessary programs on an AWS instance can be a pain, and this is why Amazon created a feature where you can launch operating systems with pre-installed software (rather than naked OSes). These pre-installed packages are called Amazon Machine Images (AMIs), and you can use any public AMI which is available in your selected region (you can always change the region at the top right of your AWS console). There are many different AMIs for deep learning where all deep learning software is already pre-installed (try Google or the option introduced below) – so a GPU instance with all deep learning software is essentially only two steps away!

Money-wise, AWS GPU instances will be quite cheap, not only for short experiments but also in the long run. The main disadvantages compared to a dedicated system are that AWS instances are much slower and will not allow you to work with large data sets easily.

Another problem can be the delay between the server and your computer, which can make working with an AWS instance a real pain (especially in the Asian region, so I have heard); another problem is that you only have a console to work with. However, all this hassle can be reduced somewhat with IPython and iTorch notebooks, from which you can execute theano and torch code in your browser as an interactive session which you can save and load. A manual on how to get IPython and iTorch notebooks working for deep learning on an AWS instance can be found here (the only problem is that you need to download cuDNN yourself).

In the developing world, even a cheap deep learning system can create big holes in one’s pocket, and this is the point where AWS GPU instances can help you out – if you have very little money, AWS will just be the best choice for you.

Another use-case is to use AWS GPU instances to run multiple experiments. While this use-case is not so typical for everyday deep learning work, it is a blessing if you really want to learn how to train deep learning architectures. If you grab CIFAR-10 or CIFAR-100 and run four different convolutional nets on a large GPU instance, you will very quickly get the hang of training convolutional nets successfully.

Conclusion

With all the information in this article you should be able to reason which GPU to choose by balancing the required memory size, the bandwidth in GB/s for speed, and the price of the GPU, and this reasoning will be solid for many years to come. But right now my recommendation is to get a GTX Titan X or GTX 980 if you have the money, a GTX 960 or GTX 680 from eBay for a cheap solution, and, if you are fine with their problems, a 3GB GTX 580 or a GTX 970 from eBay might be suitable. If you need cheap memory for large convolutional nets, a 6GB GTX Titan from eBay will be good (a Titan Black if you have the money). If you have very little money then AWS GPU spot instances will be the best choice.

TL;DR advice

Best GPU overall: GTX Titan X
Cost efficient but expensive: GTX Titan X, GTX 980, GTX 980 Ti
Cost efficient but troubled: GTX 580 3GB (lacks software support) or GTX 970 (has memory problem)
Cheapest card with no troubles: GTX 960 4GB or GTX 680
I work with data sets > 250GB: GTX Titan, GTX 980 Ti or GTX Titan X
I have little money: GTX 680 3GB eBay
I have almost no money: AWS GPU spot instance
I do Kaggle: GTX 980 or GTX 960 4GB
I am a researcher: 1-4x GTX Titan X
I want to build a GPU cluster: This is really complicated, you can get some ideas here
I started deep learning and I am serious about it: Start with one GTX 680, GTX 980, or GTX 970 and buy more of those as you feel the need for them; save money for Pascal GPUs in 2016 Q2/Q3 (they will be much faster than current GPUs)

Update 2014-09-28: Added emphasis for memory requirement of CNNs
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Update 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
Update 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation

Acknowledgements

I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances.

[Image source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz3AI18t18Z]

