    • 4. Granted patent
    • Title: Maximized memory throughput on parallel processing devices
    • Publication number: US08327123B2
    • Publication date: 2012-12-04
    • Application number: US13069384
    • Filing date: 2011-03-23
    • Inventors: Norbert Juffa; Brett W. Coon
    • IPC: G06F9/30
    • CPC: G06F9/3887; G06F9/3455; G06F9/3851; G06F9/3889
    • Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to the memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    • 8. Patent application
    • Title: MAXIMIZED MEMORY THROUGHPUT ON PARALLEL PROCESSING DEVICES
    • Publication number: US20110173414A1
    • Publication date: 2011-07-14
    • Application number: US13069384
    • Filing date: 2011-03-23
    • Inventors: Norbert Juffa; Brett W. Coon
    • IPC: G06F9/38
    • CPC: G06F9/3887; G06F9/3455; G06F9/3851; G06F9/3889
    • Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to the memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
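The two records above cover the same patent family on memory-throughput-limited streaming computations: per-element work is cheap, so kernel speed is set by how fast the threads can collectively read the input stream and write the results. The snippet below is a minimal CUDA sketch of such a memory-bound streaming kernel, not the patented load-balancing mechanism; the kernel name (saxpy_stream), launch configuration, and array sizes are illustrative assumptions.

```cuda
// Illustrative memory-bound streaming kernel (hypothetical example, not the
// patented mechanism). Per-element arithmetic is trivial, so throughput is
// limited by memory traffic; a grid-stride loop keeps every thread issuing
// coalesced loads/stores so the memory interface stays saturated.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy_stream(const float* __restrict__ x,
                             const float* __restrict__ y,
                             float* __restrict__ out,
                             float a, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        out[i] = a * x[i] + y[i];   // one multiply-add per two loads and a store
    }
}

int main() {
    const size_t n = 1 << 24;        // 16M elements, illustrative size
    float *x, *y, *out;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));
    // Enough blocks to occupy all multiprocessors; exact numbers are illustrative.
    saxpy_stream<<<1024, 256>>>(x, y, out, 2.0f, n);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```

With consecutive threads touching consecutive elements, the loads and stores coalesce, so the resource being saturated is the aggregate bandwidth of the memory interface rather than arithmetic throughput.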
    • 9. Patent application
    • Title: COOPERATIVE THREAD ARRAY REDUCTION AND SCAN OPERATIONS
    • Publication number: US20110078417A1
    • Publication date: 2011-03-31
    • Application number: US12890227
    • Filing date: 2010-09-24
    • Inventors: Brian FAHS; Ming Y. Siu; Brett W. Coon; John R. Nickolls; Lars Nyland
    • IPC: G06F9/38
    • CPC: G06F9/522; G06F8/458; G06F9/3004; G06F9/30087; G06F9/30145; G06F9/3851
    • Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction, the thread contributes to a scan or reduction result and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction, and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
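The abstract above describes fusing a barrier with an aggregation: each thread contributes a value at the barrier, and the reduction (or scan) result becomes available once all threads have arrived. The sketch below illustrates the idea with standard CUDA, a cooperative-thread-array (thread block) sum reduction in shared memory where __syncthreads() barriers mark the points at which partial and final results become visible; it is not the patented barrier-aggregation instruction. (CUDA does expose fused barrier-plus-reduction forms for predicates: __syncthreads_count, __syncthreads_and, __syncthreads_or.) The kernel name and problem sizes are illustrative assumptions.

```cuda
// Illustrative cooperative-thread-array (thread block) sum reduction
// (hypothetical example, not the patented barrier-aggregation instruction).
// Each thread contributes one value; __syncthreads() barriers mark where the
// partial and final aggregates become visible to all threads of the block.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void block_reduce(const float* in, float* out, int n) {
    extern __shared__ float buf[];          // one slot per thread
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    buf[tid] = (gid < n) ? in[gid] : 0.0f;  // each thread supplies its value
    __syncthreads();                        // barrier: all contributions visible

    // Tree reduction in shared memory; blockDim.x is assumed a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();                    // barrier after every step
    }

    // After the final barrier every thread of the block could read buf[0];
    // here only thread 0 writes the block's result.
    if (tid == 0) out[blockIdx.x] = buf[0];
}

int main() {
    const int n = 1024, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    block_reduce<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %.1f (expected %d)\n", total, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

In this sketch the per-block results in out[] are summed on the host; the patented approach instead folds the aggregation into the barrier instruction itself, so no separate reduction pass over shared memory is needed.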