- Why?
- What & How?
- Experiments on performance
Jun Mei
November 29, 2017
Technique | Pursuing | Blocked by |
---|---|---|
SISD | Higher speed (GHz) | Physics |
MIMD | More CPU cores | Synchronization |
SIMD | More CUDA cores | Price :) |
Technique | Instruction | Vector length |
---|---|---|
CPU SIMD | SSE4.1, SSE4.2, AVX, AVX2, FMA | Commonly <= 256 bits |
GPU SIMD | N/A, use CUDA | Thousands of bytes |
Technique | Hard mode | Easy mode |
---|---|---|
CPU SIMD | C/C++, assembly; instruction-oriented | NumPy |
CPU MIMD | Java, etc.; deadlock, data race | Java actors, Go, Rust |
GPU SIMD | CUDA; register-oriented | ??? |
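To make "easy mode" concrete (an illustration of mine, not from the slides): the same reduction written as a scalar Python loop versus a single NumPy call that dispatches into vectorized C kernels, timed with the same tic/toc pattern used later in these slides.

```python
import time
import numpy as np

x = np.random.rand(10**6)

# "Hard mode" spirit: a scalar loop, one element per interpreter step.
tic = time.time()
total = 0.0
for v in x:
    total += v
print("python loop:", time.time() - tic)

# "Easy mode": one call into NumPy's vectorized (SIMD-capable) C code.
tic = time.time()
total = np.sum(x)
print("np.sum:     ", time.time() - tic)
```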
On Reddit, four years ago:
```python
for i in range(m):
    for j in range(n):
        for k in range(r):
            A[i][j] += B[i][k] * C[k][j]
```
```python
for i in range(m):
    for j in range(n):
        A[i, j] = np.sum(B[i, :] * C[:, j])
```
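A quick sanity check (my sketch, not from the slides; shapes and names are illustrative) that the vectorized inner product matches the naive loop:

```python
import numpy as np

m, r, n = 4, 5, 6
B = np.random.rand(m, r)
C = np.random.rand(r, n)

# Naive triple loop.
A_loop = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        for k in range(r):
            A_loop[i][j] += B[i][k] * C[k][j]

# Innermost loop replaced by one vectorized dot product per entry.
A_vec = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        A_vec[i, j] = np.sum(B[i, :] * C[:, j])

print(np.allclose(A_loop, A_vec))  # True
```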
```python
for i in range(m):
    for j in range(n):
        for k in range(r):
            A[i][j] += B[i][k] * C[k][j]
```
```python
m_r_n = B.reshape(m, r, 1) * C.reshape(1, r, n)
A = np.sum(m_r_n, axis=1)
```
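One caveat worth adding (my note, not from the slides): the broadcast product materializes an intermediate array of shape (m, r, n), which can dominate memory for large inputs. np.dot / the @ operator computes the same result through BLAS without that temporary:

```python
import numpy as np

m, r, n = 100, 200, 300
B = np.random.rand(m, r)
C = np.random.rand(r, n)

# Broadcast trick: allocates an (m, r, n) temporary, then reduces over axis 1.
A_broadcast = np.sum(B.reshape(m, r, 1) * C.reshape(1, r, n), axis=1)

# BLAS-backed matrix multiply: same result, no (m, r, n) temporary.
A_blas = B @ C

print(np.allclose(A_broadcast, A_blas))  # True
```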
```python
for i in range(m):
    for j in range(n):
        for k in range(r):
            A[i][j] += B[i][k] * C[k][j]
```
```python
import numpy as np

B = np.array([[0, 1]])         # shape: 1x2
C = np.array([[0], [1], [2]])  # shape: 3x1
A = B + C
print(A)
## [[0 1]
##  [1 2]
##  [2 3]]
```
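The rule behind this (my summary of NumPy's documented behavior): shapes are aligned from the trailing axis, and any axis of size 1 is virtually stretched to match the other operand, without copying. np.broadcast reports the resulting shape:

```python
import numpy as np

B = np.zeros((1, 2))
C = np.zeros((3, 1))
# Align trailing axes: (1, 2) vs (3, 1) -> each size-1 axis stretches -> (3, 2).
print(np.broadcast(B, C).shape)  # (3, 2)
```

The matrix-multiplication trick above relies on exactly this: (m, r, 1) * (1, r, n) broadcasts to (m, r, n).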
G is an n×n adjacency matrix:

```python
for k in range(n):
    for i in range(n):
        for j in range(n):
            G[i][j] = min(G[i][j], G[i][k] + G[k][j])
```
```python
for k in range(n):
    n_n = G[:, k].reshape(n, 1) + G[k, :].reshape(1, n)
    G = np.minimum(G, n_n)
```
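A check I added (not on the slides) that the broadcasted update matches the scalar loop. With nonnegative weights, row k and column k cannot improve during iteration k, so using the pre-iteration values is safe:

```python
import numpy as np

n = 30
dist = np.random.rand(n, n)  # nonnegative weights, so no negative cycles

# Scalar Floyd-Warshall.
G_loop = dist.copy()
for k in range(n):
    for i in range(n):
        for j in range(n):
            G_loop[i][j] = min(G_loop[i][j], G_loop[i][k] + G_loop[k][j])

# Vectorized Floyd-Warshall: one broadcasted n x n update per k.
G_vec = dist.copy()
for k in range(n):
    n_n = G_vec[:, k].reshape(n, 1) + G_vec[k, :].reshape(1, n)
    G_vec = np.minimum(G_vec, n_n)

print(np.allclose(G_loop, G_vec))  # True
```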
```python
for k in range(n):
    for i in range(n):
        for j in range(n):
            G[i][j] = min(G[i][j], G[i][k] + G[k][j])
```
```python
import random
import sys
import time

n = int(sys.argv[1])

g = [[random.random() for _ in range(n)] for _ in range(n)]
floyd(g, n)  # warm up

g = [[random.random() for _ in range(n)] for _ in range(n)]
tic = time.time()
floyd(g, n)
toc = time.time()
print(toc - tic)
```
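The benchmark calls a floyd(g, n) that the slide does not define; a minimal sketch matching the scalar version above (assumption: the NumPy variant would first convert g to an ndarray):

```python
def floyd(g, n):
    # In-place Floyd-Warshall on a list-of-lists distance matrix.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][k] + g[k][j] < g[i][j]:
                    g[i][j] = g[i][k] + g[k][j]
```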
> Closing since I think this is out of reach of easy contributions. A trivial implementation is trivial, but users are likely to want fast versions that are hard to write.
>
> – Geoffrey Irving at @openai