- Why?
- What & How?
- Experiments on performance

November 29, 2017

Technique | Pursuing | Blocked by |
---|---|---|
SISD | Higher speed (GHz) | Physics |
MIMD | More CPU cores | Synchronization |
SIMD | More CUDA cores | Price :) |

Technique | Instruction | Vector length |
---|---|---|
CPU SIMD | SSE4.1, SSE4.2, AVX, AVX2, FMA | Commonly <= 256 bits |
GPU SIMD | N/A, use CUDA | Thousands of bytes |

Technique | Hard mode | Easy mode |
---|---|---|
CPU SIMD | C/C++, assembly; instruction-oriented | NumPy |
CPU MIMD | Java, etc.; deadlock, data race | Java actors, Go, Rust |
GPU SIMD | CUDA; register-oriented | ??? |

On Reddit, four years ago.

- PyCUDA and PyOpenCL: register-oriented :(
- NumbaPro: register-oriented :(
- Theano: now discontinued :(
- gnumpy: only a small subset of NumPy; now discontinued :(

- Currently, PyTorch is free for everyone.
- PyTorch feature: Tensor computation (like NumPy) with strong GPU acceleration.

- Target: balance Consistency, Availability, and Partition tolerance (CAP)

- MapReduce, by Google, in 2004
- Hadoop (fair mode), Spark (easy mode)
- MPI (hard mode)

```python
for i in range(m):
    for j in range(n):
        for k in range(r):
            A[i][j] += B[i][k] * C[k][j]
```

```python
for i in range(m):
    for j in range(n):
        A[i, j] = np.sum(B[i, :] * C[:, j])
```
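A quick NumPy sketch to check that replacing the innermost loop with a vectorized dot product gives the same result as the naive triple loop (the sizes m, r, n below are arbitrary picks for illustration):

```python
import numpy as np

# Small, arbitrary sizes chosen just for the check.
m, r, n = 4, 5, 3
rng = np.random.default_rng(0)
B = rng.random((m, r))
C = rng.random((r, n))

# Naive triple loop.
A_loop = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        for k in range(r):
            A_loop[i][j] += B[i][k] * C[k][j]

# Innermost loop replaced by a vectorized elementwise product and sum.
A_vec = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        A_vec[i, j] = np.sum(B[i, :] * C[:, j])

print(np.allclose(A_loop, A_vec))  # True; both equal B @ C
```

Only the k loop moves into NumPy here; the i and j loops still run in Python, so only a fraction of the work is vectorized.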

```python
for i in range(m):
    for j in range(n):
        for k in range(r):
            A[i][j] += B[i][k] * C[k][j]
```

```python
m_r_n = B.reshape(m, r, 1) * C.reshape(1, r, n)
A = np.sum(m_r_n, axis=1)
```
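The fully broadcast formulation can be checked against NumPy's built-in matrix product; again the sizes are arbitrary small values for illustration:

```python
import numpy as np

# Small, arbitrary sizes chosen just for the check.
m, r, n = 4, 5, 3
rng = np.random.default_rng(1)
B = rng.random((m, r))
C = rng.random((r, n))

# Broadcast to an (m, r, n) cube of elementwise products, then reduce over r.
m_r_n = B.reshape(m, r, 1) * C.reshape(1, r, n)
A = np.sum(m_r_n, axis=1)

print(np.allclose(A, B @ C))  # True
```

This removes every Python loop, at the cost of materializing an O(m·r·n) intermediate cube in memory.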

```python
for i in range(m):
    for j in range(n):
        for k in range(r):
            A[i][j] += B[i][k] * C[k][j]
```

```python
import numpy as np

B = np.array([[0, 1]])         # shape: 1x2
C = np.array([[0], [1], [2]])  # shape: 3x1
A = B + C
print(A)
## [[0 1]
##  [1 2]
##  [2 3]]
```

G is an n×n adjacency matrix

```python
for k in range(n):
    for i in range(n):
        for j in range(n):
            G[i][j] = min(G[i][j], G[i][k] + G[k][j])
```

```python
for k in range(n):
    n_n = G[:, k].reshape(n, 1) + G[k, :].reshape(1, n)
    G = np.minimum(G, n_n)
```
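A sketch checking the vectorized update against the in-place triple loop on an arbitrary 5-node graph. The two agree because with a zero diagonal, row k and column k of G do not change during iteration k, so computing the whole update from the pre-iteration matrix is equivalent:

```python
import numpy as np

# Arbitrary 5-node graph with random positive weights and a zero diagonal.
n = 5
rng = np.random.default_rng(2)
G = rng.random((n, n))
np.fill_diagonal(G, 0.0)

# In-place triple-loop Floyd-Warshall.
G_loop = G.copy()
for k in range(n):
    for i in range(n):
        for j in range(n):
            G_loop[i][j] = min(G_loop[i][j], G_loop[i][k] + G_loop[k][j])

# Vectorized: one broadcast add and one elementwise minimum per k.
G_vec = G.copy()
for k in range(n):
    n_n = G_vec[:, k].reshape(n, 1) + G_vec[k, :].reshape(1, n)
    G_vec = np.minimum(G_vec, n_n)

print(np.allclose(G_loop, G_vec))  # True
```

The outer k loop cannot be vectorized away, since iteration k depends on the result of iteration k-1; only the inner n² work moves into NumPy.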

```python
for k in range(n):
    for i in range(n):
        for j in range(n):
            G[i][j] = min(G[i][j], G[i][k] + G[k][j])
```

```python
import random
import sys
import time

# floyd(g, n) is the implementation under test, defined above.
n = int(sys.argv[1])
g = [[random.random() for _ in range(n)] for _ in range(n)]
floyd(g, n)  # warm up
g = [[random.random() for _ in range(n)] for _ in range(n)]
tic = time.time()
floyd(g, n)
toc = time.time()
print(toc - tic)
```

> Closing since I think this is out of reach of easy contributions. A trivial implementation is trivial, but users are likely to want fast versions that are hard to write.
>
> — Geoffrey Irving at @openai

- for | for_python << for_go < for_java < for_c
- k_cpu | k_numpy = k_tf = k_torch = for_java < for_c
- k_gpu | for_java < for_c < k_tf_gpu = k_torch_gpu

- I.e.
- for_python << for_go < k_numpy = k_tf = k_torch = for_java < for_c < k_tf_gpu = k_torch_gpu

- I.e.
- for_python << for_go < k_cpu = for_java < for_c < k_gpu

- for | for_python << for_go < for_java <= for_c
- ij | ij_*_gpu < ij_cpu < for_+
- full | full_cpu < for_+ < full_tf_gpu < full_torch_gpu
- mat | for_+ < full_*_gpu < mat_*

- I.e.
- ij_gpu < ij_cpu < full_cpu < for_java <= for_c < full_tf_gpu < full_torch_gpu < mat_cpu < mat_gpu

- I.e.
- ij < full_cpu < for_java/c < full_gpu < mat_*

- for_python << for_go < k_cpu = for_java < for_c < k_gpu
- ij < full_cpu < for_java/c < full_gpu < mat_*

- Per-element (ij) indexing hurts
- Prefer fully-vectorized GPU code, or Java/C

- The code is available at https://github.com/meijun/vectorized-computing