x86 MOV instructions & a short history of the most popular CPU architecture


Almost all desktop computers and
most server computers of today – run on x86 processors. You have two main players:
Intel, and AMD; both of which produce
processors that utilize – the same instruction set
for the most part. This architecture defines
sixteen base registers, called RAX, RBX, RCX, RDX, RSP, RBP,
RSI, RDI, and R8 through R15. These registers are 64 bits wide, but their lower half is
accessible by a separate name. The lower half of that is also
accessible by a separate name. And the lower half of that is
accessible by a separate name, but for the first
four registers, so is the upper 8-bit half
of the lowest 16 bits. How did such a wildly inconsistent
design come to be? I am Bisqwit, and today
we study the x86. [Beeps and sound effects] In 1971, Intel Corporation
made history. They released the first commercially
available microprocessor. The Intel 4004! In a single beautiful
ceramic package, it contained the
heart of a design – that would serve the world
for decades to come. It was called 4004 because – this was part of
a family of chips – that together would form
the core of a computer. The CPU was the fourth
component of that series. This processor had sixteen
general-purpose registers, called R0 through R15,
each of them four bits wide. It also had a special
accumulator register. The CPU would perform most of
its operations on the accumulator, but the index registers could
store intermediate values – without having to transfer data
between the RAM and the CPU. The sixteen 4-bit registers
could also be treated – as eight pairs of registers,
each pair being 8 bits wide. In other words, you had
eight eight-bit registers, and the lower and upper halves
of each register – could be accessed separately. There was also a 12-bit
program counter called PC, and three other program counters. These four program counters
would form a stack, so that you could do
function calls. Finally there was also
one single-bit carry flag. One year later, they released
the 8008 microprocessor. The 4004 was
a 4-bit microprocessor, and the 8008 was
an 8-bit microprocessor. In this processor, they still had a dedicated
accumulator, called A, and also six other registers. All of these registers were
eight bits wide, but the six other registers
could be paired – to form 16-bit units
called BC, DE and HL. H and L stand for high
and low, respectively. This register had a different
naming scheme than the others – maybe because it
had a separate role: It was the only register
that could be used – for indirect memory access. The 8008 now had four flags. Besides the carry-flag,
there was now an even-parity flag, a zero-flag and a sign-flag. The stack depth was increased
from 4 units to a total of 9 units, and the memory address width was
increased from 12 bits to 14 bits. Two years later they came up
with another processor. This time called the 8080. It didn’t have that many changes, but they had now
consolidated the flags into a single 8-bit register. The flags together with
the accumulator – could be thought of as a
single 16-bit register, called the Program Status Word. The memory address width was now
increased to the full 16 bits. In addition to the
program counter register – there was now a stack pointer
register called SP. This meant that the processor
no longer had an internal stack, but instead it used the system RAM, allowing for much
more complex programs – and call structures
than ever before. At this point, Intel was pretty happy and content
with their product line. [battle music] Until two years later, in 1976,
when another company, AMD, no wait, Zilog, released a processor called Z80. The Z80 was software-compatible
with the Intel 8080, but it added a number
of extra registers, many more CPU instructions, and improved the overall
design in many parts. It was a direct assault
on their competitor, and they were not
even hiding it. Intel was caught with
their pants down – with only their 8080
to offer to the market. A few months later they scrambled
to release the 8085 processor, but the 8085 was pretty much
exactly the same as the 8080, except it used a single 5 volt
supply for its operation, hence the 5 in its name. So Intel went back
to the drawing board, investing heavily
in CPU design. [calm music] Two years later,
in 1978, they came out with
the Intel 8086. This CPU had eight fully
16-bit general-purpose registers. The old BC had been
renamed into BX, and the old DE had been
renamed into DX. In between those two – they also added a new
register called CX. The HL register was renamed
into BP (base pointer), because it was no longer
accessible as two separate halves, and because it matched nicely with
the name of the stack pointer, SP. The program counter was renamed
into IP (instruction pointer), and the flags were moved into
a separate entity once again. The idea that Zilog had
with their index registers – was incorporated into the 8086, with two registers called
source index (SI), and destination index (DI). They also adopted much of the
assembler syntax from Zilog. But here’s the real kicker: The 8086 could access a
whopping megabyte of memory. This meant not just
64 kilobytes of memory, but 16 × 64 kilobytes of memory. This was such an unprecedented
wealth of memory, that when IBM designed
the world-famous IBM PC, they decided to place
the video RAM – at the 10th
64-kilobyte segment of the RAM, and all the system stuff above it, leaving only the lowest 640 kB
of contiguous memory – for the applications
and the operating system. And even that amount was not
fully available for applications, because the processor itself reserved
the first kilobyte for interrupt vectors, and the BIOS reserved the
next 256 bytes for itself, and the operating system would
need several kilobytes more, but that is another story. In any case, in a move that is
equally brilliant and crazy, Intel decided to not make their
index registers 20 bits – in order to access the
entire memory space, but instead, they used a pair: a 16-bit
segment and a 16-bit offset. This meant that – as long as you accessed
only 64 kilobytes of code, and 64 kilobytes of data, you could just set the segment registers
once and then forget about them, and just use 16-bit offsets. And the 8086 had
four segment registers. One for the code,
one for the data, one for the stack,
and for good measure, another one for data,
called extra segment. Anyway, the first four base registers
were sixteen bits wide, but their lower and higher 8-bit
halves could be accessed individually, because why not. All preceding Intel processors
had a similar feature, and it had often proven to be
very useful in programming. So when they made 32-bit processors,
they kept the same design. The accumulator was extended to 32 bits,
and its lower half was called AX, and the entire register
was called EAX. This stands for Extended AX,
I believe. The two lowest 8-bit-wide units
could still be accessed individually. So in 2000 when AMD came up
with the 64-bit architecture – that they decided to call “amd64”, because why not, they extended the
registers to 64 bits. The whole register now
would be called RAX, and its lower 32-bit half
would be called EAX, and the lower 16-bit half of
that would be called AX, and the two lowest 8-bit-wide
units could still be accessed – by their individual names
given in 1978. And that’s where we are now. We have sixteen 64-bit registers. You already know the names
of the first eight of them. AMD added eight more, which
are called R8 through R15. Besides the sixteen
general-purpose registers, there are also
eight MMX registers, each of which is 64 bits wide. They are actually aliases – to the FPU register stack
which is 80 bits wide… But we don’t speak about the
FPU stack anymore. There are also sixteen
128-bit vector registers, called XMM0 through XMM15. But since AVX, they are now actually
256-bit registers – called YMM0 through YMM15. Which is to say, if your CPU has AVX-512, they are now actually
512-bit registers, and there are 32 of them. If you have AVX-512, your CPU also has seven
opmask registers – called K1 through K7. These registers are either
16 bits wide or 64 bits wide – depending on which CPU you have. And there is a heap of other registers
of various specialized uses, some of which you already
saw in earlier slides, but I am going to
totally ignore these – in the rest of the presentation. So now that we have the
introduction out of the way, let’s move on to the presentation. I am going to discuss the
register-to-register move instructions – in the x86 architecture. When Intel made the 8086,
they copied from Zilog – the fantastic idea that one move
instruction should be enough. You don’t need a heap
of different mnemonics – to do absolutely the same thing. And that’s why any time you move
data between the base registers, or even from and to the memory, or load constants into registers, you just use the MOV instruction. You can use it for
64-bit data, 32-bit data, 16-bit data and even for 8-bit data. However, when they added
the MMX extension, they kind of forgot
about this principle. When MMX registers
are in question, you would use the MOVD and MOVQ
instructions instead. Why? I have no idea. The names are logical though. D stands for double-word,
and Q stands for quad-word, so MOVD copies 32 bits
and MOVQ copies 64 bits. At least when they added
the Streaming SIMD extensions, or SSE, they were consistent. When you copy 32 bits between
a base register and an XMM register, you would use MOVD,
and MOVQ when copying 64 bits. Later on, when they added AVX, they added V to name of
every instruction – encoded using the VEX coding, but other than that,
they stayed consistent. That is, until AVX-512 came along, with its op-mask registers. Yes, you still
use MOVQ and MOVD, but now you would have
to add K – to the beginning of
the instruction name, to indicate that you are
operating with op-mask registers. As if that wasn’t already clear
from the parameters. You can only use these instructions
with op-mask registers. I digress. At least the pattern
is still the same. [record scratch sound] [slowly] MOVDQ2Q. Apparently that’s how they
decided to call an instruction – that copies 64 bits of data from
an SSE register to an MMX register. MOVSS and MOVSD are used to copy
a single floating point value – from the low part of
one register to another. These instructions contain
a hint for the instruction decoder – that they should be processed in this
particular calculation unit in the CPU. If the entire register must be copied,
then there are plenty of choices: movaps, movups, movapd,
movupd, movdqa, movdqu, and some variants that deal
with different size op-masks. All of these instructions
essentially do the same thing, but they contain hints for the
instruction decoder in the CPU – about the purpose of the data. There are additional constraints, if these instructions are used
with memory operands, but for simplicity I decided
to limit this video – to register-to-register moves only. Of course these instructions can be
also used with 256-bit registers. The move does not always have to address
the lowest part of the register. The PINSRB instruction inserts 8 bits
from the source register – into any position
in the target register, chosen by an index number. Similarly there is
PINSRW for 16 bits, which can also be used
with MMX registers, PINSRD for 32-bit moves, and PINSRQ for 64-bit moves. All four of these require the source
to be a general-purpose register. If the source and target
are vector registers, the word INSERT is spelled in full, and the name of the instruction
must contain a hint – for the type of the data
to be moved. Now the opposite of insertion
is extraction. The PEXTRB instruction pulls out
a particular byte from a vector register – and stores it in a general-purpose register. The same goes
for 16-bit units, and 64-bit units. Wait,
I skipped one slide. Oh yeah, for 32 bits there
are two different instructions – that do exactly the same thing. PEXTRD, and EXTRACTPS. There is absolutely no difference
between these two, except a possible hint
to the instruction decoder. But then we also have – the vector-to-vector EXTRACT
instructions which came in AVX. So you just briefly saw
the EXTRACTPS instruction. The INSERTPS instruction
does not insert data – from a general-purpose register, but it copies between
two vector registers. It is kind of a swiss army knife
of an instruction really. You can select which 32-bit unit
you take from the source register. You can also select which 32-bit unit
you replace with that value. But you can also
selectively clear – some or all of the other
32-bit units in the target register, replacing them with zero. So many functions
in this single opcode. But we are only
getting started. All of the previous
instructions were – single-source
single-target moves. The broadcast operation takes
the lowest value – from the source vector register, and populates the
entire target register – with copies of that same value. This is very handy – if you need the target register
to be nothing – but copies of some
one particular value. If you need more
fine-grained control, there is MOVSLDUP, which divides both the source
and target registers into halves, and does the broadcast operation
for both halves individually. The MOVSHDUP instruction
is almost the same, but it takes the high value
from the source register, not the low one. The BLEND family of instructions
work like sieves. You have 8-bit sieves,
16-bit sieves, 32-bit sieves
and 64-bit sieves. In each of these, you supply a bitmask – selecting which portions of the
target register are copied – from the same section
in the source register, and which portions
are left intact. Some of these instructions – take the bitmask as an
immediate parameter, and some of them take it
in another vector register. If there is one SIMD instruction – that warrants the
“guide dang it” status, it is the unpack instruction, which in my opinion – should have been called
a riffle or an interleave. It essentially takes
two source registers, and interleaves their
lower halves bytewise – into the target register. There are two variants: one which takes
the lower halves, and the other which
takes the high halves. Of course, the same operation also
exists wordwise, double-wordwise, and quad-wordwise. In this usage, there are again two instructions
that do exactly the same thing. The PUNPCKLQDQ,
and the UNPCKLPD. Again, this probably has
something to do with hinting. The shuffle-instructions are the
most free form of permutations. There are no byte-size
shuffles as far as I know, but the 16-bit shuffle
instruction – allows populating the first four
16-bit units of a target register – with any arbitrary copies
of the first four 16-bit units – of the source register. There is a similar instruction
that operates on the high portions – rather than on the lower portions
of those registers. There are two dword-sized shuffle
operations that work differently. Pictured is the simpler one, which operates on a
single source register – and a single target register. The more complex one operates
on two source registers – and is rather complicated
as you can see, but it becomes slightly
easier to understand – when you look at its
64-bit counterpart. There is also a 128-bit
shuffle instruction in AVX. AVX-512 adds a couple
more of these, but I kind of ran
into a problem – where Intel kept adding
new instructions – faster than I could add them
into this presentation, so I gave up. The final set of reg-to-reg
copy instructions – that we are going
to look at – involves changing the size
of the data – while it is getting copied. These PACK instructions
take 16-bit values – from two source registers and
pack them into 8-bit slots – in a single target register. There are variants that
deal with unsigned values, and variants that deal
with signed values. If the value is out of range, the value is clamped
to the target range. It does not just simply
discard the excess bits. There is another set of instructions – that does the same with
32-bit sources and 16-bit targets. And finally we have
MOVSX and MOVZX. These instructions were already
added in Intel 80386. SX stands for sign extension,
and ZX stands for zero extension. And that is my brief,
abridged take on MOV instructions – in the x86 architecture. See the top or pinned comment
in the comments section – for more information that
I left out from the video! Happy programming,
see you again, and God bless you.
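
Addendum: a few of the mechanics mentioned in the video can be sketched in plain Python. This is only a model in integer arithmetic, not real register access. It covers how RAX, EAX, AX, AH and AL alias one another, how an 8086 segment and offset combine into a 20-bit address, what MOVZX and MOVSX do, and how a signed PACK clamps instead of discarding bits. All function names are invented for this sketch. The segment × 16 + offset rule is the standard 8086 real-mode formula, which the video does not spell out, and the vectors are modelled as short Python lists rather than full 128-bit registers.

```python
# Illustrative model only: plain-Python arithmetic, no real CPU registers involved.
# Every function name below is hypothetical and exists only for this sketch.

def subregisters(rax):
    """Split a 64-bit value the way RAX, EAX, AX, AH and AL alias each other."""
    rax &= 0xFFFFFFFFFFFFFFFF
    return {
        "RAX": rax,
        "EAX": rax & 0xFFFFFFFF,   # lower 32 bits of RAX
        "AX":  rax & 0xFFFF,       # lower 16 bits of EAX
        "AH":  (rax >> 8) & 0xFF,  # upper 8-bit half of AX
        "AL":  rax & 0xFF,         # lower 8-bit half of AX
    }

def physical_address(segment, offset):
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF

def movzx(value, bits):
    """Zero extension: keep the low `bits` bits, the rest become zero."""
    return value & ((1 << bits) - 1)

def movsx(value, bits):
    """Sign extension: read the low `bits` bits as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value

def pack_signed_saturate(words):
    """PACKSSWB-style narrowing: clamp signed 16-bit values into -128..127."""
    return [max(-128, min(127, w)) for w in words]

def unpack_low_bytes(a, b):
    """PUNPCKLBW-style interleave of the low halves of two byte vectors."""
    half = len(a) // 2
    out = []
    for x, y in zip(a[:half], b[:half]):
        out += [x, y]
    return out

if __name__ == "__main__":
    regs = subregisters(0x1122334455667788)
    print({name: hex(value) for name, value in regs.items()})
    # Segment 0xA000 lands at 10 * 64 KiB, the video RAM area mentioned above.
    print(hex(physical_address(0xA000, 0x0000)))
    print(movzx(0xFF, 8), movsx(0xFF, 8))
    print(pack_signed_saturate([300, -300, 5]))
    print(unpack_low_bytes([0, 1, 2, 3], [10, 11, 12, 13]))
```

Returning a dictionary from `subregisters` is just a convenient way to see all the aliases side by side; on a real CPU these are views of the same storage, not copies.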