- Why?
- What & How?
- Experiments on performance
November 29, 2017
| Technique | Pursuing | Blocked by |
|---|---|---|
| SISD | Higher speed (GHz) | Physics |
| MIMD | More CPU cores | Synchronization |
| SIMD | More CUDA cores | Price :) |
| Technique | Instruction sets | Vector length |
|---|---|---|
| CPU SIMD | SSE4.1, SSE4.2, AVX, AVX2, FMA | commonly ≤ 256 bits |
| GPU SIMD | N/A; use CUDA | thousands of bytes |
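As a quick aside, you can ask your own NumPy build what it was compiled against; `np.show_config()` is a standard NumPy function, and on recent versions its output also lists detected CPU SIMD features (a minimal sketch — the exact output varies by installation):

```python
import numpy as np

# Print NumPy's build configuration; recent versions include a section
# listing baseline and runtime-detected CPU SIMD extensions (e.g. AVX2).
np.show_config()
```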
| Technique | Hard mode | Easy mode |
|---|---|---|
| CPU SIMD | C/C++, assembly; instruction-oriented | NumPy |
| CPU MIMD | Java, etc.; deadlocks, data races | Java actors, Go, Rust |
| GPU SIMD | CUDA; register-oriented | ??? |
On Reddit, four years ago:
```python
# Naive matrix multiplication: A (m x n) = B (m x r) @ C (r x n).
for i in range(m):
    for j in range(n):
        for k in range(r):
            A[i][j] += B[i][k] * C[k][j]
```
```python
# Vectorize the inner k loop: each entry of A is the dot product
# of a row of B and a column of C.
for i in range(m):
    for j in range(n):
        A[i, j] = np.sum(B[i, :] * C[:, j])
```
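As a sanity check (a minimal sketch, with sizes of my own choosing), the vectorized version should agree with the naive loop and with NumPy's built-in matrix multiply:

```python
import numpy as np

m, r, n = 4, 5, 3            # arbitrary small sizes for the check
B = np.random.random((m, r))
C = np.random.random((r, n))

# Naive triple loop.
A_loop = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        for k in range(r):
            A_loop[i][j] += B[i][k] * C[k][j]

# Version with the k loop vectorized.
A_vec = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        A_vec[i, j] = np.sum(B[i, :] * C[:, j])

assert np.allclose(A_loop, A_vec)
assert np.allclose(A_loop, B @ C)   # NumPy's own matrix multiply
```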
Going further, broadcasting vectorizes all three loops at once:

```python
# Broadcast B against C into an (m, r, n) array of elementwise
# products, then sum over the shared r axis.
m_r_n = B.reshape(m, r, 1) * C.reshape(1, r, n)
A = np.sum(m_r_n, axis=1)
```
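One caveat with this formulation: the intermediate `m_r_n` array has shape `(m, r, n)`, so memory grows with the product of all three dimensions. `np.einsum` (a standard NumPy function) computes the same contraction without materializing the intermediate; a sketch:

```python
import numpy as np

m, r, n = 4, 5, 3
B = np.random.random((m, r))
C = np.random.random((r, n))

# Broadcast version: materializes the full (m, r, n) intermediate.
A_bcast = np.sum(B.reshape(m, r, 1) * C.reshape(1, r, n), axis=1)

# Same contraction, without the large intermediate.
A_einsum = np.einsum('ik,kj->ij', B, C)

assert np.allclose(A_bcast, A_einsum)
assert np.allclose(A_bcast, B @ C)
```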
The reshape trick works because of NumPy broadcasting: when two arrays' shapes differ, axes of length 1 are stretched to match.

```python
import numpy as np

B = np.array([[0, 1]])  # shape: 1x2
C = np.array([[0],
              [1],
              [2]])     # shape: 3x1

# B is stretched down the rows and C across the columns,
# so the sum has shape 3x2.
A = B + C
print(A)
## [[0 1]
##  [1 2]
##  [2 3]]
```
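The matmul trick above is this same mechanism one dimension up: shapes `(m, r, 1)` and `(1, r, n)` broadcast to `(m, r, n)`, so entry `[i, k, j]` of the product is `B[i, k] * C[k, j]`, and summing over the middle axis yields the matrix product. A small sketch with concrete shapes:

```python
import numpy as np

B = np.arange(6).reshape(2, 3)     # m=2, r=3
C = np.arange(12).reshape(3, 4)    # r=3, n=4

# (2, 3, 1) * (1, 3, 4) broadcasts to (2, 3, 4).
prod = B.reshape(2, 3, 1) * C.reshape(1, 3, 4)
print(prod.shape)                            # (2, 3, 4)
print(np.allclose(prod.sum(axis=1), B @ C))  # True
```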
The same pattern applies beyond matrix multiplication; consider Floyd-Warshall all-pairs shortest paths:

```python
# G is an n x n adjacency matrix of edge weights.
for k in range(n):
    for i in range(n):
        for j in range(n):
            G[i][j] = min(G[i][j], G[i][k] + G[k][j])
```
The two inner loops vectorize; the outer k loop has to stay, since iteration k depends on the result of iteration k-1:

```python
for k in range(n):
    # Broadcast column k against row k into an n x n matrix of
    # candidate path lengths through vertex k.
    n_n = G[:, k].reshape(n, 1) + G[k, :].reshape(1, n)
    G = np.minimum(G, n_n)
```
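To convince yourself the broadcast update matches the reference loop, here is a minimal self-contained check (the function names `floyd_loop` and `floyd_numpy` are mine):

```python
import numpy as np

def floyd_loop(G):
    # Reference Floyd-Warshall on a copied list-of-lists matrix.
    G = [row[:] for row in G]
    n = len(G)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                G[i][j] = min(G[i][j], G[i][k] + G[k][j])
    return G

def floyd_numpy(G):
    # Vectorized update: row k and column k are unchanged during
    # iteration k (for nonnegative weights), so this matches the loop.
    G = np.array(G, dtype=float)
    n = G.shape[0]
    for k in range(n):
        G = np.minimum(G, G[:, k].reshape(n, 1) + G[k, :].reshape(1, n))
    return G

g = np.random.random((8, 8)).tolist()
assert np.allclose(floyd_loop(g), floyd_numpy(g))
```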
The timing harness for the experiments, where `floyd` is the implementation under test:

```python
import random
import sys
import time

n = int(sys.argv[1])
g = [[random.random() for _ in range(n)] for _ in range(n)]
floyd(g, n)  # warm up
g = [[random.random() for _ in range(n)] for _ in range(n)]
tic = time.time()
floyd(g, n)
toc = time.time()
print(toc - tic)
```
> Closing since I think this is out of reach of easy contributions. A trivial implementation is trivial, but users are likely to want fast versions that are hard to write.
>
> – Geoffrey Irving at @openai