SSE4

SSE4 (Streaming SIMD Extensions 4) is a SIMD CPU instruction set used in the Intel Core microarchitecture and AMD K10 (K8L). It was announced on September 27, 2006, at the Fall 2006 Intel Developer Forum, with vague details in a white paper;^[1] more precise details of 47 instructions became available at the Spring 2007 Intel Developer Forum in Beijing, in the presentation.^[2]

SSE4 is fully compatible with software written for previous generations of Intel 64 and IA-32 architecture microprocessors. All existing software continues to run correctly without modification on microprocessors that incorporate SSE4, as well as in the presence of existing and new applications that incorporate SSE4.^[3]

SSE4 subsets

Intel SSE4 consists of 54 instructions. A subset of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in Penryn. SSE4.2, a second subset consisting of the 7 remaining instructions, first became available in the Nehalem-based Core i7. Intel credits feedback from developers with playing an important role in the development of the instruction set.

Starting with Barcelona-based processors, AMD introduced the SSE4a instruction set, which has 4 SSE4 instructions and 4 new SSE instructions. These instructions are not found in Intel processors that support SSE4.1, and AMD processors only began to support Intel's SSE4.1 and SSE4.2 (the full SSE4 instruction set) with the Bulldozer-based FX processors. SSE4a also introduced the misaligned SSE feature, which meant that unaligned load instructions were as fast as the aligned versions on aligned addresses. It also allowed the alignment check to be disabled on non-load SSE operations that access memory.^[4] Intel later introduced similar speed improvements for unaligned SSE in its Nehalem processors, but did not introduce misaligned access by non-load SSE instructions until AVX.^[5]

Name confusion

What is now known as SSSE3 (Supplemental Streaming SIMD Extensions 3), introduced in the Intel Core 2 processor line, was referred to as SSE4 by some media until Intel came up with the SSSE3 moniker. The instructions were internally dubbed Merom New Instructions, and Intel originally did not plan to assign them a special name, which was criticized by some journalists.^[6] Intel eventually cleared up the confusion and reserved the SSE4 name for its next instruction set extension.^[7]

Intel uses the marketing term HD Boost to refer to SSE4.^[8]

New instructions

Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field, and a set of instructions that take XMM0 as an implicit third operand.

Several of these instructions are enabled by the single-cycle shuffle engine in Penryn. (Shuffle operations reorder bytes within a register.)
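The two operand styles described above, a constant (immediate) field versus XMM0 as an implicit third operand, can be illustrated with the BLENDPS/BLENDVPS pair. The following is only a scalar sketch in Python, not the instructions themselves: registers are modeled as lists of four elements, and the function names and list representation are assumptions made for illustration.

```python
def blendps(dest, src, imm8):
    """Model of BLENDPS: bit i of the immediate selects src[i], else dest[i]."""
    return [src[i] if (imm8 >> i) & 1 else dest[i] for i in range(4)]


def blendvps(dest, src, xmm0):
    """Model of BLENDVPS: the selection mask comes from the implicit XMM0.

    xmm0 is given as four 32-bit integers (an assumption of this model);
    the most significant bit of each element decides whether src[i] is copied.
    """
    return [src[i] if (xmm0[i] >> 31) & 1 else dest[i] for i in range(4)]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [10.0, 20.0, 30.0, 40.0]
    # Immediate form: the choice is fixed when the code is assembled.
    print(blendps(a, b, 0b0101))
    # Variable form: the choice is made at run time from the mask register.
    print(blendvps(a, b, [0x80000000, 0, 0xFFFFFFFF, 0x7FFFFFFF]))
```

In this model both calls select elements 0 and 2 from the second operand; the difference is only in where the selection bits come from.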

SSE4.1

These instructions were introduced with the Penryn microarchitecture, the 45 nm shrink of Intel's Core microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE41[Bit 19] flag.

Instruction	Description
MPSADBW	Compute eight offset sums of absolute differences, four at a time (i.e., |x₀−y₀|+|x₁−y₁|+|x₂−y₂|+|x₃−y₃|, |x₀−y₁|+|x₁−y₂|+|x₂−y₃|+|x₃−y₄|, ..., |x₀−y₇|+|x₁−y₈|+|x₂−y₉|+|x₃−y₁₀|); this operation is important for some HD codecs, and allows an 8×8 block difference to be computed in fewer than seven cycles.^[9] One bit of a three-bit immediate operand indicates whether y₀..y₁₀ or y₄..y₁₄ should be used from the destination operand, the other two whether x₀..x₃, x₄..x₇, x₈..x₁₁ or x₁₂..x₁₅ should be used from the source.
PHMINPOSUW	Sets the bottom unsigned 16-bit word of the destination to the smallest unsigned 16-bit word in the source, and the next-from-bottom word to the index of that word in the source.
PMULDQ	Packed signed 32-bit "long" multiplication: two (the 1st and 3rd) of four packed integers are multiplied, giving two packed 64-bit results.
PMULLD	Packed signed 32-bit "low" multiplication: four packed sets of integers are multiplied, giving four packed 32-bit results.
DPPS, DPPD	Dot product for AOS (array of structs) data. This takes an immediate operand consisting of four (or two for DPPD) bits to select which of the entries in the input to multiply and accumulate, and another four (or two for DPPD) to select whether to put 0 or the dot product in the appropriate field of the output.
BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW	Conditional copying of elements in one location with another, based (for the non-V form) on the bits of an immediate operand, and (for the V form) on the bits of register XMM0.
PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD	Packed minimum/maximum for different integer operand types.
ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD	Round values in a floating-point register to integers, using one of four rounding modes specified by an immediate operand.
INSERTPS, PINSRB, PINSRD/PINSRQ, EXTRACTPS, PEXTRB, PEXTRD/PEXTRQ	The INSERTPS and PINSR instructions read 8, 16 or 32 bits from an x86 register or memory location and insert them into a field of the destination register given by an immediate operand. EXTRACTPS and PEXTR read a field from the source register and insert it into an x86 register or memory location. For example, PEXTRD eax, xmm0, 1; EXTRACTPS [addr+4*eax], xmm1, 1 stores field 1 of xmm1 at the address given by field 1 of xmm0.
PMOVSXBW , PMOVZXBW , PMOVSXBD , PMOVZXBD , PMOVSXBQ , PMOVZXBQ , PMOVSXWD , PMOVZXWD , PMOVSXWQ , PMOVZXWQ , PMOVSXDQ , PMOVZXDQ	Packed sign/zero extension to wider types
PTEST	This is similar to the TEST instruction, in that it sets the Z flag to the result of an AND between its operands: ZF is set, if DEST AND SRC is equal to 0. Additionally it sets the C flag if (NOT DEST) AND SRC equals zero. This is equivalent to setting the Z flag if none of the bits masked by SRC are set, and the C flag if all of the bits masked by SRC are set.
PCMPEQQ	Quadword (64 bits) compare for equality
PACKUSDW	Convert signed DWORDs into unsigned WORDs with saturation.
MOVNTDQA	Efficient read from write-combining memory area into SSE register; this is useful for retrieving results from peripherals attached to the memory bus.
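
The behaviour of a few of the instructions in the table can be made concrete with a scalar model. The sketch below is written in Python purely for illustration: it follows the descriptions given in the table, represents registers as byte strings, lists or 128-bit integers (an assumption of the model), and does not reproduce single-precision rounding for DPPS. All function names are hypothetical.

```python
MASK128 = (1 << 128) - 1


def mpsadbw(dest, src, imm8):
    """Eight offset sums of absolute differences over 16-byte operands.

    Bit 2 of imm8 selects y0..y10 or y4..y14 from dest;
    bits 1:0 select which group of four x bytes is taken from src.
    """
    y_off = 4 if imm8 & 0b100 else 0
    x_off = 4 * (imm8 & 0b011)
    x = src[x_off:x_off + 4]
    return [sum(abs(x[k] - dest[y_off + i + k]) for k in range(4))
            for i in range(8)]


def phminposuw(words):
    """Word 0 = minimum of the eight source words, word 1 = its index."""
    smallest = min(words)
    return [smallest, words.index(smallest), 0, 0, 0, 0, 0, 0]


def dpps(a, b, imm8):
    """Dot product: high nibble picks the products, low nibble the outputs."""
    total = sum(a[i] * b[i] for i in range(4) if (imm8 >> (4 + i)) & 1)
    return [total if (imm8 >> i) & 1 else 0.0 for i in range(4)]


def ptest(dest, src):
    """Return (ZF, CF) as described for PTEST, on 128-bit integers."""
    zf = (dest & src & MASK128) == 0
    cf = (~dest & src & MASK128) == 0
    return zf, cf


if __name__ == "__main__":
    block = bytes(range(16))
    print(mpsadbw(block, block, 0b000))
    print(phminposuw([7, 3, 9, 3, 8, 8, 8, 8]))
    print(dpps([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 0xF1))
    print(ptest(0b1010, 0b0101))
```

In the PTEST model, ZF is true when none of the bits masked by the second operand are set in the first, and CF is true when all of them are set, matching the description in the table.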

SSE4.2

SSE4.2 added STTNI (String and Text New Instructions),^[10] several new instructions that perform character searches and comparison on two operands of 16 bytes at a time. These were designed (among other things) to speed up the parsing of XML documents.^[11] It also added a CRC32 instruction to compute cyclic redundancy checks as used in certain data transfer protocols. These instructions were first implemented in the Nehalem-based Intel Core i7 product line and complete the SSE4 instruction set. Support is indicated via the CPUID.01H:ECX.SSE42[Bit 20] flag.

Instruction	Description
CRC32	Accumulate CRC32C value using the polynomial 0x11EDC6F41 (or, without the high order bit, 0x1EDC6F41).^[12]^[13]
PCMPESTRI	Packed Compare Explicit Length Strings, Return Index
PCMPESTRM	Packed Compare Explicit Length Strings, Return Mask
PCMPISTRI	Packed Compare Implicit Length Strings, Return Index
PCMPISTRM	Packed Compare Implicit Length Strings, Return Mask
PCMPGTQ	Compare Packed Signed 64-bit data For Greater Than
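
The accumulate step performed by CRC32 can be sketched in software. The model below is a bit-at-a-time Python illustration rather than the instruction itself; it uses the bit-reflected form of the polynomial given in the table (0x82F63B78 is 0x1EDC6F41 with its bits reversed). The function names are hypothetical, and the initial value and final inversion in `crc32c` are a common protocol-level convention assumed here, not something the instruction does on its own.

```python
POLY_REFLECTED = 0x82F63B78  # bit-reversed form of 0x1EDC6F41


def crc32_u8(crc, byte):
    """Fold one byte into a running 32-bit CRC32C value (accumulate step)."""
    crc ^= byte
    for _ in range(8):
        # Shift out one bit; apply the polynomial when that bit was set.
        crc = (crc >> 1) ^ (POLY_REFLECTED if crc & 1 else 0)
    return crc


def crc32c(data):
    """CRC32C of a byte string, with the usual all-ones init and final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = crc32_u8(crc, byte)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    # "123456789" is the conventional check input for CRC algorithms.
    print(hex(crc32c(b"123456789")))  # 0xe3069283
```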

POPCNT and LZCNT

These instructions operate on integer registers rather than SSE registers, because they are not SIMD instructions; they merely appeared at the same time. Although AMD introduced them together with the SSE4a instruction set, they are counted as separate extensions with their own dedicated CPUID bits to indicate support. Intel implements POPCNT beginning with the Nehalem microarchitecture and LZCNT beginning with the Haswell microarchitecture. AMD implements both beginning with the Barcelona microarchitecture.

AMD calls this pair of instructions Advanced Bit Manipulation (ABM).

Instruction	Description
POPCNT	Population count (count number of bits set to 1). Support is indicated via the CPUID.01H:ECX.POPCNT[Bit 23] flag.^[14]
LZCNT	Leading zero count. Support is indicated via the CPUID.80000001H:ECX.ABM[Bit 5] flag.^[15]

The encoding of lzcnt is similar enough to bsr (bit scan reverse) that if lzcnt is executed on a CPU that does not support it, such as Intel CPUs prior to Haswell, the CPU performs the bsr operation instead of raising an invalid-instruction error, even though lzcnt and bsr produce different result values.

Trailing zeros can be counted using the bsf (bit scan forward) or tzcnt instructions.
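
The difference between the result values of lzcnt and bsr, and the relationship between the bit-counting operations, can be shown with a small model. This is a sketch for 32-bit operands in Python; the function names are chosen for illustration, and a zero input to the bsr/bsf models raises an exception here because the hardware leaves the destination undefined in that case.

```python
MASK32 = 0xFFFFFFFF


def popcnt32(x):
    """Number of bits set to 1."""
    return bin(x & MASK32).count("1")


def lzcnt32(x):
    """Number of leading zero bits; 32 for an input of zero."""
    return 32 - (x & MASK32).bit_length()


def tzcnt32(x):
    """Number of trailing zero bits; 32 for an input of zero."""
    x &= MASK32
    return 32 if x == 0 else (x & -x).bit_length() - 1


def bsr32(x):
    """Index of the highest set bit (not a count of zeros)."""
    x &= MASK32
    if x == 0:
        raise ValueError("bsr result is undefined for zero")
    return x.bit_length() - 1


def bsf32(x):
    """Index of the lowest set bit."""
    x &= MASK32
    if x == 0:
        raise ValueError("bsf result is undefined for zero")
    return (x & -x).bit_length() - 1


if __name__ == "__main__":
    value = 0x00010000
    # The same input gives different answers: for nonzero 32-bit inputs,
    # lzcnt32(x) == 31 - bsr32(x).
    print(lzcnt32(value), bsr32(value))
    print(tzcnt32(value), bsf32(value), popcnt32(value))
```

For nonzero inputs, tzcnt and bsf agree, while lzcnt and bsr differ, which is why silently executing bsr in place of lzcnt yields wrong results on older CPUs.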

SSE4a

The SSE4a instruction group was introduced in AMD's Barcelona microarchitecture. These instructions are not available in Intel processors. Support is indicated via the CPUID.80000001H:ECX.SSE4A[Bit 6] flag.^[15]

Instruction	Description
EXTRQ/INSERTQ	Combined mask-shift instructions.^[16]
MOVNTSD/MOVNTSS	Scalar streaming store instructions.^[17]
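
The CPUID feature bits quoted in the sections above can be collected in one table and decoded from raw register values. The sketch below is a Python illustration only: it assumes the ECX values for leaves 01H and 80000001H have already been obtained by some other means (plain Python cannot execute CPUID), and the `FEATURE_BITS` table, the `decode_features` helper and the example register values are hypothetical.

```python
# (CPUID leaf, ECX bit) for each extension, as listed in the text above.
FEATURE_BITS = {
    "SSE4.1": (0x00000001, 19),
    "SSE4.2": (0x00000001, 20),
    "POPCNT": (0x00000001, 23),
    "ABM (LZCNT)": (0x80000001, 5),
    "SSE4a": (0x80000001, 6),
}


def decode_features(ecx_by_leaf):
    """Map each feature name to True/False given {leaf: ECX value}.

    A leaf missing from the mapping is treated as reporting no features.
    """
    return {
        name: bool((ecx_by_leaf.get(leaf, 0) >> bit) & 1)
        for name, (leaf, bit) in FEATURE_BITS.items()
    }


if __name__ == "__main__":
    # Made-up register values resembling a CPU with SSE4.1, SSE4.2,
    # POPCNT and LZCNT but without SSE4a.
    example = {
        0x00000001: (1 << 19) | (1 << 20) | (1 << 23),
        0x80000001: (1 << 5),
    }
    for name, present in decode_features(example).items():
        print(f"{name}: {'yes' if present else 'no'}")
```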

Supporting CPUs

Intel
- Silvermont processors (SSE4.1, SSE4.2 and POPCNT supported)
- Goldmont processors (SSE4.1, SSE4.2 and POPCNT supported)
- Goldmont Plus processors (SSE4.1, SSE4.2 and POPCNT supported)
- Tremont processors (SSE4.1, SSE4.2 and POPCNT supported)
- Penryn processors (SSE4.1 supported, except Pentium Dual-Core and Celeron)
- Nehalem processors and Westmere processors (SSE4.1, SSE4.2 and POPCNT supported, except Pentium and Celeron)
- Sandy Bridge processors and newer (SSE4.1, SSE4.2 and POPCNT supported, including Pentium and Celeron)
- Haswell processors and newer (SSE4.1, SSE4.2, POPCNT and LZCNT supported)
AMD
- K10-based processors (SSE4a, POPCNT and LZCNT supported)
- "Cat" low-power processors
  - Bobcat-based processors (SSE4a, POPCNT and LZCNT supported)
  - Jaguar-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  - Puma-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
- "Heavy Equipment" processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  - Bulldozer-based processors
  - Piledriver-based processors ^[18]
  - Steamroller-based processors
  - Excavator-based processors and newer
- Zen-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
- Zen+-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
- Zen2-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
- Zen3-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
VIA
- Nano 3000, X2, QuadCore processors (SSE4.1 supported)
- Nano QuadCore C4000-series processors (SSE4.1, SSE4.2 supported)
- Eden X4 processors (SSE4.1, SSE4.2 supported)
Zhaoxin
- ZX-C processors and newer (SSE4.1, SSE4.2 supported)

References

[1] Intel Streaming SIMD Extensions 4 (SSE4) Instruction Set Innovation, Intel.
[2] Tuning for Intel SSE4 for the 45nm Next Generation Intel Core Microarchitecture, Intel.
[3] Intel SSE4 Programming Reference.
[4] "'Barcelona' Processor Feature: SSE Misaligned Access". AMD. Archived from the original on August 9, 2016. Retrieved March 3, 2015.
[5] "Inside Intel Nehalem Microarchitecture". Retrieved March 3, 2015.
[6] My Experience With "Conroe", DailyTech.
[7] Extending the World's Most Popular Processor Architecture, Intel. Archived November 24, 2011, at the Wayback Machine.
[8] "Intel - Data Center Solutions, IOT, and PC Innovation". Intel.
[9] Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4), Intel.
[10] "Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)".
[11] "XML Parsing Accelerator with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)".
[12] Intel SSE4 Programming Reference, p. 61. See also RFC 3385 for discussion of the CRC32C polynomial.
[13] Fast, Parallelized CRC Computation Using the Nehalem CRC32 Instruction, Dr. Dobbs, April 12, 2011.
[14] Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, N–Z.
[15] AMD CPUID Specification.
[16] Rahul Chaturvedi (17 September 2007). "'Barcelona' Processor Feature: SSE4a Instruction Set". Archived from the original on 25 October 2013.
[17] Rahul Chaturvedi (2 October 2007). "'Barcelona' Processor Feature: SSE4a, part 2". Archived from the original on 25 October 2013.
[18] "AMD FX-Series FX-6300 - FD6300WMW6KHK / FD6300WMHKBOX".

External links

SSE4 Programming Reference by Intel
PCMPSTR calculator for the SSE 4.2 string instructions
