Task complexity has two parts: step complexity (how many sequential steps remain once the operations are parallelized) and work complexity (how many operations have to be done in total).
e.g. In the tree-structured diagram below, each node represents an operand and each edge represents an operation; edges on the same level represent the same operation. Question: what are the step complexity and work complexity of the task shown in this diagram?
Ans:
step complexity: 3;
work complexity: 6.
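(If the figure is the usual balanced binary reduction tree, then in general a reduction over n operands has work complexity n − 1 and step complexity ⌈log₂n⌉, which matches the answer above for n = 7 operands.)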
A more concrete example follows below.
__global__ void parallel_reduce_kernel(float *d_out, float *d_in){
    int myID = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;
    // Split the block's threads in half by thread ID and add each element of the right
    // half onto the left half; this halves the number of live elements per iteration,
    // and we iterate until only one element (the block's sum) is left.
    for(unsigned int s = blockDim.x / 2; s > 0; s >>= 1){
        if(tid < s){
            d_in[myID] += d_in[myID + s];
        }
        __syncthreads();    // ensure all adds of this iteration are done
    }
    // thread 0 writes this block's partial result to global memory
    if(tid == 0){
        d_out[blockIdx.x] = d_in[myID];
    }
}
__global__ void parallel_shared_reduce_kernel(float *d_out, float *d_in){
    int myID = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;
    extern __shared__ float sdata[];
    // load this block's slice of the input from global into shared memory
    sdata[tid] = d_in[myID];
    __syncthreads();
    // Same halving scheme as above, but the intermediate reads and writes now hit
    // shared memory; only the initial load and the final result touch global memory.
    for(unsigned int s = blockDim.x / 2; s > 0; s >>= 1){
        if(tid < s){
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();    // ensure all adds of this iteration are done
    }
    if(tid == 0){
        d_out[blockIdx.x] = sdata[0];
    }
}
// launch syntax: kernel<<<grid of blocks, block of threads, shared memory in bytes>>>
// (the third argument, the dynamic shared-memory size, is only needed by the shared-memory kernel)
parallel_reduce_kernel<<<blocks, threads>>>(data_out, data_in);
parallel_shared_reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(data_out, data_in);
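For completeness, here is a minimal host-side sketch of how these kernels might be driven. The sizes, variable names, and the two-stage strategy (reduce each block, then reduce the per-block partial sums) are my own assumptions, not part of the original code.

```
#include <cuda_runtime.h>
#include <cstdio>

// Assumes the parallel_shared_reduce_kernel definition above is in scope,
// and that n is a multiple of `threads` with blocks <= 1024.
int main(){
    const int n = 1024 * 1024, threads = 1024, blocks = n / threads;

    float *d_in, *d_mid, *d_out;
    cudaMalloc(&d_in,  n      * sizeof(float));
    cudaMalloc(&d_mid, blocks * sizeof(float));
    cudaMalloc(&d_out, 1      * sizeof(float));
    // ... copy the input data into d_in with cudaMemcpy ...

    // stage 1: one partial sum per block
    parallel_shared_reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(d_mid, d_in);
    // stage 2: reduce the per-block partial sums with a single block
    parallel_shared_reduce_kernel<<<1, blocks, blocks * sizeof(float)>>>(d_out, d_mid);

    float result;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", result);

    cudaFree(d_in); cudaFree(d_mid); cudaFree(d_out);
    return 0;
}
```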
parallel_reduce_kernel (global-memory operations per iteration, for a block of 1024 elements):

| Iteration | Read Ops | Write Ops |
|---|---|---|
| 1 | 1024 | 512 |
| 2 | 512 | 256 |
| 3 | 256 | 128 |
| … | … | … |
| n | 1 | 1 |
parallel_shared_reduce_kernel (global-memory operations, for the same block of 1024 elements):

| Iteration | Read Ops | Write Ops |
|---|---|---|
| 1 | 1024 | 1 |
So parallel_reduce_kernel needs roughly 3× the global-memory bandwidth of parallel_shared_reduce_kernel.
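To spell out the arithmetic behind that factor of 3 (my own tally, based on the tables above): the global-memory version reads 1024 + 512 + … + 2 ≈ 2046 floats and writes 512 + 256 + … + 1, plus the final result, ≈ 1024 floats, about 3 × 1024 global accesses per block in total; the shared-memory version reads 1024 floats and writes 1, about 1024 global accesses per block.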
Example:
Goal: handle a problem that is hard to parallelize.
An O(n) serial solution to the problem above that you can write off the top of your head is out[i] = out[i-1] + in[i]. Next we introduce scan.
Inputs to scan:
| op | Identity |
|---|---|
| addition (+) | 0 |
| multiplication (×) | 1 |
| logical OR (\|\|) | False |
| logical AND (&&) | True |
| I/O | content |
|---|---|
| input | [ … ] |
| output | [ … ] |

where out[i] = in[0] op in[1] op … op in[i], i.e. an inclusive scan of the input under the operator op.
It is very simple (serial pseudocode; `op` is the binary operator and `identity` its identity element):
int acc = identity;
for(i = 0; i < elements.length(); i++){
    acc = acc op elements[i];   // running result: in[0] op ... op in[i]
    out[i] = acc;
}
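As a concrete instance of the pseudocode above (the function name and sample values are purely illustrative), here is a serial inclusive sum-scan with op = + and identity = 0:

```
#include <stdio.h>

// Serial inclusive scan with op = + and identity = 0.
void inclusive_sum_scan(const int *in, int *out, int n){
    int acc = 0;                       // identity for addition
    for(int i = 0; i < n; i++){
        acc += in[i];                  // acc = in[0] + ... + in[i]
        out[i] = acc;
    }
}

int main(){
    int in[5] = {1, 2, 3, 4, 5};
    int out[5];
    inclusive_sum_scan(in, out, 5);
    for(int i = 0; i < 5; i++) printf("%d ", out[i]);   // prints: 1 3 6 10 15
    return 0;
}
```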
work complexity: O(n) (one op per element);
step complexity: O(n) (every iteration depends on the previous acc, so nothing runs in parallel).
So, how can we parallelize scan?
Looking at how to parallelize scan: the n outputs can all be computed in parallel, where output element i amounts to a reduce over in[0..i].
Q: What do the work complexity and step complexity of the problem become then?
Ans: work complexity O(n²), since output i takes about i ops and 1 + 2 + … + n = O(n²); step complexity O(log n), since each reduce is a tree of depth at most log₂n and all the reduces run in parallel.
As you can see, the step complexity comes down, but unfortunately the work complexity goes up. How do we fix that? There are two classic scan algorithms:
| | more step-efficient | more work-efficient |
|---|---|---|
| Hillis + Steele (1986) | √ | |
| Blelloch (1990) | | √ |
The Hillis + Steele scan proceeds step by step as follows:
step 0: out[i] = in[i] + in[i-1];
step 1: out[i] = in[i] + in[i-2];
step 2: out[i] = in[i] + in[i-4];
If an element does not exist (the index falls below zero), treat it as 0; at each step, `in` refers to the output of the previous step. With that convention, the output of step 2 is exactly the result of the add-scan for an 8-element input (think about why; we will analyze it in a moment).
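The following is a minimal single-block CUDA sketch of this idea; the kernel name, the shared-memory double buffering, and the n ≤ blockDim.x restriction are my own choices rather than anything from the original text:

```
// Hillis + Steele inclusive scan for n <= blockDim.x elements in one block.
// Two ping-pong buffers in shared memory avoid read/write races within a step.
__global__ void hillis_steele_scan_kernel(float *d_out, const float *d_in, int n){
    extern __shared__ float temp[];          // 2 * n floats
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    if(tid < n) temp[pout * n + tid] = d_in[tid];
    __syncthreads();

    for(int offset = 1; offset < n; offset <<= 1){
        pout = 1 - pout;                     // swap the roles of the two buffers
        pin  = 1 - pout;
        if(tid < n){
            if(tid >= offset)
                temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
            else
                temp[pout * n + tid] = temp[pin * n + tid];   // missing neighbor counts as 0
        }
        __syncthreads();                     // finish this step before starting the next
    }

    if(tid < n) d_out[tid] = temp[pout * n + tid];
}

// example launch: hillis_steele_scan_kernel<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);
```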
Now the question is...
Q: What are the work complexity and step complexity of the Hillis + Steele algorithm?
Ans: work complexity: O(n log n); step complexity: O(log n).
Explanation: the offset doubles on every step (1, 2, 4, …), so there are log₂n steps; on the step with offset 2^d, every element with index ≥ 2^d does one addition, i.e. n − 2^d additions.
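Summing this over all steps (my own arithmetic, just to make the bound explicit): work = (n−1) + (n−2) + (n−4) + … + (n − n/2) = n·log₂n − (n−1) = O(n log n), while the number of steps is log₂n = O(log n).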