Opcode | Encoding | 16-bit | 32-bit | 64-bit | CPUID Feature Flag(s) | Description |
---|---|---|---|---|---|---|
66 0F 3A 40 /r ib DPPS xmm1, xmm2/m128, imm8 | rmi | Invalid | Valid | Valid | sse4.1 | Compute the dot product of packed single-precision floating-point values in xmm1 and xmm2/m128. Use imm8 to control the operation. Store the result in xmm1. |
VEX.128.66.0F3A.WIG 40 /r ib VDPPS xmm1, xmm2, xmm3/m128, imm8 | rvmi | Invalid | Valid | Valid | avx | Compute the dot product of packed single-precision floating-point values in xmm2 and xmm3/m128. Use imm8 to control the operation. Store the result in xmm1. |
VEX.256.66.0F3A.WIG 40 /r ib VDPPS ymm1, ymm2, ymm3/m256, imm8 | rvmi | Invalid | Valid | Valid | avx | Compute two dot products of packed single-precision floating-point values in ymm2 and ymm3/m256. Use imm8 to control the operation. Store the result in ymm1. |
Encoding
Encoding | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
---|---|---|---|---|
rmi | ModRM.reg[rw] | ModRM.r/m[r] | imm8 | |
rvmi | ModRM.reg[w] | VEX.vvvv[r] | ModRM.r/m[r] | imm8 |
Description
The (V)DPPS instruction conditionally computes the dot product of packed single-precision floating-point values from the two source operands. The immediate selects which elements are multiplied and which destination elements receive the result. The result is stored in the destination operand.
Beginning with a sum of 0, the immediate's bits are interpreted as per this table:
Bit | Meaning if Set | Meaning if Clear |
---|---|---|
0 | Store the computed dot product in dest(0..31) | Store 0.0 in dest(0..31) |
1 | Store the computed dot product in dest(32..63) | Store 0.0 in dest(32..63) |
2 | Store the computed dot product in dest(64..95) | Store 0.0 in dest(64..95) |
3 | Store the computed dot product in dest(96..127) | Store 0.0 in dest(96..127) |
4 | Add src1(0..31) × src2(0..31) to the sum | Add 0.0 to the sum |
5 | Add src1(32..63) × src2(32..63) to the sum | Add 0.0 to the sum |
6 | Add src1(64..95) × src2(64..95) to the sum | Add 0.0 to the sum |
7 | Add src1(96..127) × src2(96..127) to the sum | Add 0.0 to the sum |
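As an illustration of the table above, the following is a minimal scalar sketch in C of how the immediate drives the 128-bit form. The function name `dpps_model` and the array-based signature are assumptions made for this example and are not part of any real API. The left-to-right summation mirrors the pseudocode in the Operation section and does not claim to reproduce the hardware's exact rounding order or exception behavior.

```c
#include <stdint.h>

/* Hypothetical scalar model of the 128-bit dot product.
 * Bits 4..7 of imm8 select which products enter the sum;
 * bits 0..3 select which destination elements receive it. */
void dpps_model(float dest[4], const float src1[4], const float src2[4],
                uint8_t imm8)
{
    float partial[4];

    /* Compute every partial before touching dest, so dest may alias src1
     * (as it does in the legacy SSE form). */
    for (int i = 0; i < 4; i++) {
        int selected = (imm8 >> (4 + i)) & 1;
        partial[i] = selected ? src1[i] * src2[i] : 0.0f;
    }

    float sum = partial[0] + partial[1] + partial[2] + partial[3];

    for (int i = 0; i < 4; i++) {
        int store = (imm8 >> i) & 1;
        dest[i] = store ? sum : 0.0f;
    }
}
```

For example, an immediate of 0x71 sets bits 4, 5, 6, and 0: elements 0 to 2 are multiplied and summed, element 3 is ignored, and the sum lands in element 0 only. With `src1 = {1, 2, 3, 100}` and `src2 = {4, 5, 6, 100}`, this model yields `dest = {32, 0, 0, 0}`.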
The VEX.256 form of the instruction behaves like the 128-bit forms, but operates independently on both 128-bit halves of the operands. In other words, each bit of the immediate controls two operations: one for the lower half and one for the upper half.
All forms except the legacy SSE one will zero the upper (untouched) bits.
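The per-half behavior of the VEX.256 form can be sketched in scalar C as follows. The names `dp_half` and `vdpps256_model` are hypothetical, chosen only for this example. The sketch models elements 0 to 7 only; it does not model the zeroing of bits above the destination, nor exception behavior.

```c
#include <stdint.h>

/* Hypothetical helper: one 128-bit half (four floats) of the operation. */
static void dp_half(float *dest, const float *src1, const float *src2,
                    uint8_t imm8)
{
    float partial[4];

    for (int i = 0; i < 4; i++) {
        int selected = (imm8 >> (4 + i)) & 1;
        partial[i] = selected ? src1[i] * src2[i] : 0.0f;
    }

    float sum = partial[0] + partial[1] + partial[2] + partial[3];

    for (int i = 0; i < 4; i++) {
        int store = (imm8 >> i) & 1;
        dest[i] = store ? sum : 0.0f;
    }
}

/* Hypothetical scalar model of the 256-bit form: the same imm8 is applied
 * to the lower half (elements 0..3) and the upper half (elements 4..7),
 * and the two sums never mix. */
void vdpps256_model(float dest[8], const float src1[8], const float src2[8],
                    uint8_t imm8)
{
    dp_half(dest, src1, src2, imm8);
    dp_half(dest + 4, src1 + 4, src2 + 4, imm8);
}
```

With `src1 = {1, 2, 3, 4, 5, 6, 7, 8}`, `src2 = {1, 1, 1, 1, 2, 2, 2, 2}`, and an immediate of 0xF1, this model writes 10 to element 0 and 52 to element 4, and 0.0 to every other element.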
Operation
public void DPPS(SimdF32 dest, SimdF32 src, byte imm8)
{
// see note 1
F32 partial0 = imm8.Bit[4] ? dest[0] * src[0] : 0.0;
F32 partial1 = imm8.Bit[5] ? dest[1] * src[1] : 0.0;
F32 partial2 = imm8.Bit[6] ? dest[2] * src[2] : 0.0;
F32 partial3 = imm8.Bit[7] ? dest[3] * src[3] : 0.0;
F32 sum = partial0 + partial1 + partial2 + partial3;
dest[0] = imm8.Bit[0] ? sum : 0.0;
dest[1] = imm8.Bit[1] ? sum : 0.0;
dest[2] = imm8.Bit[2] ? sum : 0.0;
dest[3] = imm8.Bit[3] ? sum : 0.0;
// dest[4..] is unmodified
}
public void VDPPS_Vex128(SimdF32 dest, SimdF32 src1, SimdF32 src2, byte imm8)
{
// see note 1
F32 partial0 = imm8.Bit[4] ? src1[0] * src2[0] : 0.0;
F32 partial1 = imm8.Bit[5] ? src1[1] * src2[1] : 0.0;
F32 partial2 = imm8.Bit[6] ? src1[2] * src2[2] : 0.0;
F32 partial3 = imm8.Bit[7] ? src1[3] * src2[3] : 0.0;
F32 sum = partial0 + partial1 + partial2 + partial3;
dest[0] = imm8.Bit[0] ? sum : 0.0;
dest[1] = imm8.Bit[1] ? sum : 0.0;
dest[2] = imm8.Bit[2] ? sum : 0.0;
dest[3] = imm8.Bit[3] ? sum : 0.0;
dest[4..] = 0;
}
public void VDPPS_Vex256(SimdF32 dest, SimdF32 src1, SimdF32 src2, byte imm8)
{
// see note 1
F32 partial00 = imm8.Bit[4] ? src1[0] * src2[0] : 0.0;
F32 partial01 = imm8.Bit[5] ? src1[1] * src2[1] : 0.0;
F32 partial02 = imm8.Bit[6] ? src1[2] * src2[2] : 0.0;
F32 partial03 = imm8.Bit[7] ? src1[3] * src2[3] : 0.0;
F32 partial10 = imm8.Bit[4] ? src1[4] * src2[4] : 0.0;
F32 partial11 = imm8.Bit[5] ? src1[5] * src2[5] : 0.0;
F32 partial12 = imm8.Bit[6] ? src1[6] * src2[6] : 0.0;
F32 partial13 = imm8.Bit[7] ? src1[7] * src2[7] : 0.0;
F32 sum0 = partial00 + partial01 + partial02 + partial03;
F32 sum1 = partial10 + partial11 + partial12 + partial13;
dest[0] = imm8.Bit[0] ? sum0 : 0.0;
dest[1] = imm8.Bit[1] ? sum0 : 0.0;
dest[2] = imm8.Bit[2] ? sum0 : 0.0;
dest[3] = imm8.Bit[3] ? sum0 : 0.0;
dest[4] = imm8.Bit[0] ? sum1 : 0.0;
dest[5] = imm8.Bit[1] ? sum1 : 0.0;
dest[6] = imm8.Bit[2] ? sum1 : 0.0;
dest[7] = imm8.Bit[3] ? sum1 : 0.0;
dest[8..] = 0;
}
1. The SIMD exception flags are updated after each multiplication (if it occurs), and after the addition. If an unmasked exception is reported during the multiplications, it will be raised before the sums. If the sums report an unmasked exception, it will be raised before the destination is updated. Any unmasked exceptions will leave the destination unmodified.
Intrinsics
__m128 _mm_dp_ps(__m128 a, __m128 b, const int mask)
__m256 _mm256_dp_ps(__m256 a, __m256 b, const int mask)
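A common use of the 128-bit intrinsic is a three-component dot product. The sketch below is illustrative only: the function name `dot3` is hypothetical, and the example assumes the inputs are arrays of four floats whose fourth element is ignored. It uses `_mm_dp_ps` when the compiler defines `__SSE4_1__` (for example with `-msse4.1` on GCC or Clang) and otherwise falls back to plain scalar code, so the two paths may differ in rounding for inputs that are not exactly representable.

```c
#if defined(__SSE4_1__)
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_dp_ps */
#endif

/* Hypothetical helper: dot product of the first three elements of a and b. */
float dot3(const float a[4], const float b[4])
{
#if defined(__SSE4_1__)
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* 0x71: multiply elements 0..2 (bits 4..6), store the sum in
     * element 0 (bit 0), and write 0.0 to elements 1..3. */
    __m128 dp = _mm_dp_ps(va, vb, 0x71);
    return _mm_cvtss_f32(dp);
#else
    /* Scalar fallback when SSE4.1 is not enabled at compile time. */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
#endif
}
```

The mask must be a compile-time constant, because it is encoded as the instruction's imm8.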
Exceptions
SIMD Floating-Point
#XM
- #D - Denormal operand.
- #I - Invalid operation.
- #O - Numeric overflow.
- #P - Inexact result.
- #U - Numeric underflow.
Other Exceptions
VEX Encoded Form: See Type 2 Exception Conditions.