    • 2. Invention grant
    • Title: Distributing processing tasks within a processor
    • Publication No.: US07865894B1
    • Publication date: 2011-01-04
    • Application No.: US11311997
    • Filing date: 2005-12-19
    • Inventors: Bryon S. Nordquist, John R. Nickolls
    • IPC: G06F9/46
    • CPC: G06F9/5044
    • Abstract: Embodiments of the present invention facilitate distributing processing tasks within a processor. In one embodiment, processing clusters keep track of resource requirements. If sufficient resources are available within a particular processing cluster, the available processing cluster asserts a ready signal to a dispatch unit. The dispatch unit is configured to pass a processing task (such as a cooperative thread array or CTA) to an available processing cluster that asserted a ready signal. In another embodiment, a processing task is passed around a ring of processing clusters until a processing cluster with sufficient resources available accepts the processing task.
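The ring-passing embodiment described in the abstract can be sketched in a few lines of Python. The class and function names here are illustrative, and "free registers" stands in for whatever resources a real cluster would track; none of these identifiers come from the patent itself.

```python
# Hypothetical sketch: a task circulates around a ring of processing
# clusters until one with sufficient free resources accepts it.

class Cluster:
    def __init__(self, name, free_registers):
        self.name = name
        self.free_registers = free_registers

    def try_accept(self, task):
        """Accept the task only if sufficient resources are available."""
        if self.free_registers >= task["registers"]:
            self.free_registers -= task["registers"]
            return True
        return False

def dispatch_on_ring(clusters, task, start=0):
    """Pass the task around the ring once; return the accepting cluster."""
    n = len(clusters)
    for hop in range(n):
        cluster = clusters[(start + hop) % n]
        if cluster.try_accept(task):
            return cluster.name
    return None  # no cluster on the ring had sufficient resources

clusters = [Cluster("c0", 8), Cluster("c1", 32), Cluster("c2", 64)]
print(dispatch_on_ring(clusters, {"registers": 16}))  # c1 accepts first
```

The first embodiment (ready signals to a central dispatch unit) differs only in who initiates: clusters push readiness instead of the task being handed around.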
    • 3. Invention grant
    • Title: Coalescing memory barrier operations across multiple parallel threads
    • Publication No.: US09223578B2
    • Publication date: 2015-12-29
    • Application No.: US12887081
    • Filing date: 2010-09-21
    • Inventors: John R. Nickolls, Steven James Heinrich, Brett W. Coon, Michael C. Shebanow
    • IPC: G06F9/46, G06F9/38, G06F9/30
    • CPC: G06F9/3834, G06F9/3004, G06F9/30087, G06F9/3851
    • Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
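The coalescing idea above can be modeled simply: many per-thread barrier requests collapse into a single barrier at the widest scope any thread asked for. The scope names below mirror the three levels in the abstract (L1-sharing cooperating threads, global memory, whole system); the function and ranking are our illustration, not the patent's mechanism.

```python
# Illustrative model: coalesce per-thread memory-barrier requests into
# one barrier at the widest requested scope.

SCOPE_RANK = {"cta": 0, "global": 1, "system": 2}  # narrow -> wide

def coalesce_membars(requests):
    """Collapse (thread_id, scope) membar requests into one barrier.
    Returns (widest_scope, number_of_requests_coalesced), or None."""
    if not requests:
        return None
    widest = max((scope for _, scope in requests),
                 key=lambda s: SCOPE_RANK[s])
    return widest, len(requests)

reqs = [(0, "cta"), (1, "cta"), (2, "global"), (3, "cta")]
print(coalesce_membars(reqs))  # ('global', 4): one barrier serves four threads
```

The abstract's latency observation falls out naturally: a "system" barrier must wait on more of the memory hierarchy than a "cta" barrier, so the coalesced scope also determines cost.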
    • 5. Invention grant
    • Title: Generating event signals for performance register control using non-operative instructions
    • Publication No.: US07809928B1
    • Publication date: 2010-10-05
    • Application No.: US11313872
    • Filing date: 2005-12-20
    • Inventors: Roger L. Allen, Brett W. Coon, Ian A. Buck, John R. Nickolls
    • IPC: G06F9/30, G06F17/00, G09G5/02
    • CPC: G06T1/20, G06F9/30072, G06F9/30076, G06F11/3466, G06F2201/86, G06F2201/865, G06F2201/88
    • Abstract: One embodiment of an instruction decoder includes an instruction parser configured to process a first non-operative instruction and to generate a first event signal corresponding to the first non-operative instruction, and a first event multiplexer configured to receive the first event signal from the instruction parser, to select the first event signal from one or more event signals and to transmit the first event signal to an event logic block. The instruction decoder may be implemented in a multithreaded processing unit, such as a shader unit, and the occurrences of the first event signal may be tracked when one or more threads are executed within the processing unit. The resulting event signal count may provide a designer with a better understanding of the behavior of a program, such as a shader program, executed within the processing unit, thereby facilitating overall processing unit and program design.
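A rough software analogue of the idea: "non-operative" instructions do no work but carry an event identifier, and the decoder raises the matching event signal so a counter can track occurrences. The `NOP.EVENT:` encoding and event names below are invented purely for illustration.

```python
# Hypothetical sketch: count event signals generated by non-operative
# instructions embedded in an instruction trace.

from collections import Counter

def run_trace(instructions):
    """Decode a trace; NOP.EVENT instructions only bump event counters."""
    counts = Counter()
    for instr in instructions:
        if instr.startswith("NOP.EVENT:"):
            counts[instr.split(":", 1)[1]] += 1
        # ordinary instructions would execute here; ignored in this sketch
    return counts

trace = ["MUL", "NOP.EVENT:loop_head", "ADD", "NOP.EVENT:loop_head",
         "NOP.EVENT:branch_taken", "MUL"]
print(run_trace(trace))  # loop_head seen twice, branch_taken once
```

Because the event-carrying instructions are NOPs, a compiler can sprinkle them into a shader program without changing its results, which is what makes the counts useful for profiling.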
    • 6. Invention grant
    • Title: Bit reversal methods for a parallel processor
    • Publication No.: US07640284B1
    • Publication date: 2009-12-29
    • Application No.: US11424514
    • Filing date: 2006-06-15
    • Inventors: Nolan D. Goodnight, John R. Nickolls
    • IPC: G06F17/14
    • CPC: G06F17/142, G06F7/76
    • Abstract: Parallelism in a processor is exploited to permute a data set based on bit reversal of indices associated with data points in the data set. Permuted data can be stored in a memory having entries arranged in banks, where entries in different banks can be accessed in parallel. A destination location in the memory for a particular data point from the data set is determined based on the bit-reversed index associated with that data point. The bit-reversed index can be further modified so that at least some of the destination locations determined by different parallel processes are in different banks, allowing multiple points of the bit-reversed data set to be written in parallel.
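The core permutation is easy to sketch. `bit_reverse` and `permute` below implement the standard bit-reversal reordering (familiar from FFTs); the `bank_of` skew is one illustrative way to spread neighboring destinations across banks, not the specific modification claimed in the patent.

```python
# Bit-reversal permutation, plus an illustrative bank-skew function.

def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def permute(data):
    """Return data permuted by bit-reversed index (len must be 2**bits)."""
    bits = len(data).bit_length() - 1
    out = [None] * len(data)
    for i, x in enumerate(data):
        out[bit_reverse(i, bits)] = x
    return out

def bank_of(index, num_banks=4):
    """Skewed bank assignment: adding index // num_banks rotates the
    bank pattern each row, reducing conflicts between parallel writes.
    This particular skew is an assumption for illustration."""
    return (index + index // num_banks) % num_banks

print(permute([0, 1, 2, 3, 4, 5, 6, 7]))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

With a plain layout, parallel writes to bit-reversed destinations often collide in one bank; a skew like `bank_of` lets several of those writes proceed in the same cycle.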
    • 7. Invention application
    • Title: Systems and methods for coalescing memory accesses of parallel threads
    • Publication No.: US20090240895A1
    • Publication date: 2009-09-24
    • Application No.: US12054330
    • Filing date: 2008-03-24
    • Inventors: Lars Nyland, John R. Nickolls, Gentaro Hirota, Tanmoy Mandal
    • IPC: G06F12/00
    • CPC: G06F9/3824, G06F9/3851, G06F9/3885, G06F9/3891
    • Abstract: One embodiment of the present invention sets forth a technique for efficiently and flexibly performing coalesced memory accesses for a thread group. For each read application request that services a thread group, the core interface generates one pending request table (PRT) entry and one or more memory access requests. The core interface determines the number of memory access requests and the size of each memory access request based on the spread of the memory access addresses in the application request. Each memory access request specifies the particular threads that the memory access request services. The PRT entry tracks the number of pending memory access requests. As the memory interface completes each memory access request, the core interface uses information in the memory access request and the corresponding PRT entry to route the returned data. When all the memory access requests associated with a particular PRT entry are complete, the core interface satisfies the corresponding application request and frees the PRT entry.
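The address-grouping step can be modeled directly: per-thread addresses are binned into aligned segments, one memory access request per segment, each tagged with the threads it services. The 128-byte segment size and the dictionary representation are assumptions for the sketch; the number of segments produced plays the role of the PRT entry's pending-request count.

```python
# Simplified model: group per-thread byte addresses into aligned
# segment-sized memory access requests.

SEGMENT = 128  # bytes per coalesced memory transaction (illustrative)

def coalesce(addresses):
    """Group per-thread byte addresses into aligned segment requests.
    Returns a dict: segment base address -> list of thread ids served."""
    requests = {}
    for tid, addr in enumerate(addresses):
        base = (addr // SEGMENT) * SEGMENT
        requests.setdefault(base, []).append(tid)
    return requests

# 4 threads reading consecutive words: a single 128-byte request
print(coalesce([0, 4, 8, 12]))         # {0: [0, 1, 2, 3]}
# scattered reads fan out into multiple requests
print(len(coalesce([0, 4096, 8192])))  # 3
```

In the patent's flow, `len(coalesce(...))` would seed the PRT entry's counter; each completed request decrements it, and the application request is satisfied when it reaches zero.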
    • 8. Invention grant
    • Title: Defect tolerant redundancy
    • Publication No.: US07477091B2
    • Publication date: 2009-01-13
    • Application No.: US11105326
    • Filing date: 2005-04-12
    • Inventors: John R. Nickolls
    • IPC: G06F11/16
    • CPC: G11C29/848
    • Abstract: Circuits, methods, and apparatus for using redundant circuitry on integrated circuits in order to increase manufacturing yields. One exemplary embodiment of the present invention provides a circuit configuration wherein functional circuit blocks in a group of circuit blocks are selected by multiplexers. Multiplexers at the input and output of the group of circuit blocks steer input and output signals to and from functional circuit blocks, avoiding circuit blocks found to be defective or nonfunctional. Multiple groups of these circuit blocks may be arranged in series and in parallel. Alternate multiplexer configurations may be used in order to provide a higher level of redundancy. Other embodiments use all functional circuit blocks and sort integrated circuits based on the level of functionality or performance. Other embodiments provide methods of testing integrated circuits having one or more of these circuit configurations.
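A toy model of the mux steering: with N+1 physical blocks and at most one defect, each logical lane is routed to a physical block, shifting past the bad one. The shift-style select pattern below is one common way such redundancy muxes are wired; the patent covers other configurations too, so treat this as an assumption-laden sketch.

```python
# Toy model: compute mux select values that steer N logical lanes
# around a single defective physical block (N+1 blocks available).

def mux_selects(num_logical, defective):
    """Map each logical lane to a physical block, skipping `defective`.
    Lanes below the defect map straight through; lanes at or above it
    shift up by one onto the spare block."""
    return [lane if defective is None or lane < defective else lane + 1
            for lane in range(num_logical)]

print(mux_selects(4, defective=None))  # [0, 1, 2, 3] -- all blocks good
print(mux_selects(4, defective=2))     # [0, 1, 3, 4] -- block 2 bypassed
```

Input and output muxes share the same select values, so signals enter and leave the same physical block and the defect is invisible to the rest of the chip.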
    • 9. Invention grant
    • Title: Register based queuing for texture requests
    • Publication No.: US07456835B2
    • Publication date: 2008-11-25
    • Application No.: US11339937
    • Filing date: 2006-01-25
    • Inventors: John Erik Lindholm, John R. Nickolls, Simon S. Moy, Brett W. Coon
    • IPC: G06T11/40, G06T15/00, G06T1/00, G09G5/00
    • CPC: G06T11/60, G09G5/363
    • Abstract: A graphics processing unit can queue a large number of texture requests to balance out the variability of texture requests without the need for a large texture request buffer. A dedicated texture request buffer queues the relatively small texture commands and parameters. Additionally, for each queued texture command, an associated set of texture arguments, which are typically much larger than the texture command, are stored in a general purpose register. The texture unit retrieves texture commands from the texture request buffer and then fetches the associated texture arguments from the appropriate general purpose register. The texture arguments may be stored in the general purpose register designated as the destination of the final texture value computed by the texture unit. Because the destination register must be allocated for the final texture value as texture commands are queued, storing the texture arguments in this register does not consume any additional registers.
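The register trick lends itself to a small simulation: the queue holds only the compact command, while the bulky arguments sit in the general purpose register that is already reserved for the texture result, so no extra registers are consumed. All names below (`issue_texture`, `tex2d`, register ids) are illustrative, and the fake `sample` callback stands in for a real texture fetch.

```python
# Sketch: queue small texture commands; stash bulky arguments in the
# destination register, which the result later overwrites.

from collections import deque

registers = {}           # register file: reg id -> value
texture_queue = deque()  # small commands only: (op, dest_reg)

def issue_texture(op, dest_reg, args):
    """Queue a command; park its arguments in the destination register."""
    registers[dest_reg] = args
    texture_queue.append((op, dest_reg))

def texture_unit_step(sample):
    """Pop one command, read args from the dest register, then overwrite
    that register with the final texture value."""
    op, dest_reg = texture_queue.popleft()
    args = registers[dest_reg]
    registers[dest_reg] = sample(op, args)

issue_texture("tex2d", dest_reg="r5", args=(0.25, 0.75))
texture_unit_step(lambda op, args: sum(args))  # stand-in for a real fetch
print(registers["r5"])  # 1.0
```

The key invariant is visible in `texture_unit_step`: the arguments are consumed before the result lands, so one register safely serves both roles.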
    • 10. Invention grant
    • Title: Galois field arithmetic unit for use within a processor
    • Publication No.: US07313583B2
    • Publication date: 2007-12-25
    • Application No.: US10460599
    • Filing date: 2003-06-12
    • Inventors: Joshua Porten, Won Kim, Scott D. Johnson, John R. Nickolls
    • IPC: G06F15/00, H03M13/00
    • CPC: G06F7/724
    • Abstract: A Galois field arithmetic unit includes a Galois field multiplier section and a Galois field adder section. The Galois field multiplier section includes a plurality of Galois field multiplier arrays that perform a Galois field multiplication by multiplying, in accordance with a generating polynomial, a 1st operand and a 2nd operand. The bit size of the 1st and 2nd operands corresponds to the bit size of a processor data path, where each of the Galois field multiplier arrays performs a portion of the Galois field multiplication by multiplying, in accordance with a corresponding portion of the generating polynomial, corresponding portions of the 1st and 2nd operands. The bit size of the corresponding portions of the 1st and 2nd operands corresponds to a symbol size of symbols of a coding scheme being implemented by the corresponding processor.
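The per-symbol multiplication each array performs can be sketched in software. Below is a generic GF(2^n) multiply, shown with the AES generating polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) as the symbol-size-8 example; the polynomial choice is ours, since the patent parameterizes it by the coding scheme. A hardware multiplier array computes this combinationally rather than iteratively.

```python
# Software sketch of one Galois field multiplier array's job:
# carry-less multiply reduced modulo a generating polynomial.

def gf_mul(a, b, poly=0x11B, bits=8):
    """Multiply a and b in GF(2**bits) under generating polynomial `poly`.
    Shift-and-add with XOR as addition, reducing whenever the running
    operand's degree reaches `bits`."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # GF addition is XOR
        b >>= 1
        a <<= 1
        if a & (1 << bits):  # degree hit `bits`: reduce by the polynomial
            a ^= poly
    return result

print(hex(gf_mul(0x57, 0x83)))  # 0xc1 (the worked example in FIPS-197)
```

The abstract's wider datapath simply packs several such symbol-sized multiplies side by side, one array per symbol, each reducing by its portion of the generating polynomial.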