I've recently been playing around with CUDA and wanted to try out the unified memory model. I experimented with some sample code, and strangely, when launching the kernel, no values seemed to update. Modifying unified data from the host works fine, yet kernels that are launched simply won't modify the unified data.

My card is a GTX 770 with 4 GB of memory. I'm running Arch Linux, kernel 3.14-2, and using GCC 4.8 to compile my samples. I'm setting the compute architecture to sm_30 and activating the -m64 flag.

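For reference, the compile command that setup implies would look something like this (the file names here are my own):

nvcc -arch=sm_30 -m64 -o sample sample.cu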

Here's one sample that I was playing with. x[0] and x[1] always evaluate to 0, even when the kernel launches.

#include <stdio.h>
#include <cuda.h>

__global__ void kernel(int* x){
    x[threadIdx.x] = 2;
}

int main(){
    int* x;
    cudaMallocManaged(&x, sizeof(int) * 2);
    cudaError_t error = cudaGetLastError();
    // cudaGetErrorString turns the error code into readable text;
    // printing the enum itself with %s is undefined behaviour
    printf("%s\n", cudaGetErrorString(error));
    x[0] = 0;
    x[1] = 0;

    kernel<<<1, 2>>>(x);
    cudaDeviceSynchronize();

    printf("result = %d\n", x[1]);

    cudaFree(x);
    return 0;
}
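
For what it's worth, here is the same test with every API call checked (a minimal sketch; the CHECK macro is my own helper, not part of the CUDA API). Checking the launch and the synchronize would surface the underlying failure directly instead of silently printing 0:

#include <stdio.h>
#include <cuda_runtime.h>

// Report and bail out on any CUDA runtime error (helper macro, my own)
#define CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        printf("%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
        return 1; \
    } \
} while (0)

__global__ void kernel(int* x){
    x[threadIdx.x] = 2;
}

int main(){
    int* x;
    CHECK(cudaMallocManaged(&x, sizeof(int) * 2));
    x[0] = 0;
    x[1] = 0;

    kernel<<<1, 2>>>(x);
    CHECK(cudaGetLastError());       // catches launch failures
    CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution

    printf("result = %d\n", x[1]);
    CHECK(cudaFree(x));
    return 0;
}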

Another sample is this:

__global__ void adjacency_map_init_gpu(adjacency_map_t* map){
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;

    int i = row * map->width + col;

    // clamp the index (the original max/min calls discarded their results)
    i = max(i, 0);
    i = min(i, map->width * map->height - 1);

    map->connections[i] = 0;
}

__global__ void adjacency_map_connect_gpu(edge_t* edges, int num_edges, adjacency_map_t* map){

    int i = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x);

    // clamp the index (the original max/min calls discarded their results)
    i = max(i, 0);
    i = min(i, num_edges - 1);

    int n_start = edges[i].n_start;
    int n_end = edges[i].n_end;

    int map_index = n_start * map->width + n_end;
    map->connections[map_index] = 1;
    printf("%d new value: %d\n", map_index, map->connections[map_index]);
}

adjacency_map_t* adjacency_map_init(int num_nodes, edge_t* edges, int num_edges){
    adjacency_map_t *map;// = (adjacency_map_t*)malloc(sizeof(adjacency_map_t));
    cudaMallocManaged(&map, sizeof(adjacency_map_t));
    cudaMallocManaged(&(map->connections), num_nodes * num_nodes * sizeof(int));
    //map->connections = (int*)malloc(sizeof(int) * num_nodes * num_nodes);

    map->width = num_nodes;
    map->height = num_nodes;

    map->stride = 0;

    //GPU stuff
//    adjacency_map_t *d_map;
//    int* d_connections;

//    cudaMalloc((void**) &d_map, sizeof(adjacency_map_t));
//    cudaMalloc((void**) &d_connections, num_nodes * num_nodes * sizeof(int));

//    cudaMemcpy(d_map, map, sizeof(adjacency_map_t), cudaMemcpyHostToDevice);
//    cudaMemcpy(d_connections, map->connections, num_nodes * num_nodes, cudaMemcpyHostToDevice);
//cudaMemcpy(&(d_map->connections), &d_connections, sizeof(int*), cudaMemcpyHostToDevice);

//    edge_t* d_edges;
//    cudaMalloc((void**) &d_edges, num_edges * sizeof(edge_t));
//    cudaMemcpy(d_edges, edges, num_edges * sizeof(edge_t), cudaMemcpyHostToDevice);

    adjacency_map_init_gpu<<<1, 3>>>(map);
    cudaDeviceSynchronize();
    //adjacency_map_connect_gpu<<<1, 3>>>(edges, num_edges, map);

    cudaDeviceSynchronize();

//    cudaMemcpy(map, d_map, sizeof(adjacency_map_t), cudaMemcpyDeviceToHost);
//Synchronize everything
//    cudaFree(map);
//    cudaFree(edges);

    return map;
}
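
For completeness, the sample above relies on two structs the post doesn't show; here is a plausible reconstruction from the fields the code accesses (the field names come from the code, but the exact layout is my assumption):

typedef struct {
    int n_start;  // index of the edge's start node
    int n_end;    // index of the edge's end node
} edge_t;

typedef struct {
    int width;        // number of nodes (matrix columns)
    int height;       // number of nodes (matrix rows)
    int stride;
    int* connections; // width * height adjacency matrix
} adjacency_map_t;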

Basically, for the second snippet, I can access all the elements of the original structure on the host. Once I try to launch a kernel function, however, the pointer becomes inaccessible (at least when tested from gdb) and the entire object's data is inaccessible. The only part of the edges and map pointers I can still see after the first kernel launch is their respective addresses.

Any help would be greatly appreciated! Thanks so much!

2 Solutions

#1

Got it!

Turns out it was a problem with the IOMMU kernel option being enabled. My motherboard, a GIGABYTE 990FXA-UD3, seems to have an IOMMU bug affecting traffic between the GPU and the CPU.

Detection: whenever you run code that accesses unified memory from the console (without X), you should see an error message resembling this scroll down the screen:

AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0017 address=0x00000002d80d5000 flags=0x0010]

There might also be some discolouration in the top right of the screen (there was for me, at least).

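The same event is written to the kernel log, so you can also check for it after the fact with dmesg (a standard Linux utility):

dmesg | grep AMD-Vi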

Here's the solution (assuming you use GRUB):

Open /etc/default/grub and, on the line GRUB_CMDLINE_LINUX_DEFAULT="", add the option iommu=soft inside the quotes.

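The line should end up looking like this; afterwards regenerate the GRUB configuration (on Arch that is typically grub-mkconfig -o /boot/grub/grub.cfg) and reboot:

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="iommu=soft"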

Hope this helps people out! Big thanks to Robert Crovella for helping me narrow down the problem!
