Persistence, or data storage, is an essential aspect of any enterprise or consumer software. However, in terms of speed or latency, persistent media have always been orders of magnitude slower than volatile memory. There is a constant effort to reduce this gap between volatile and non-volatile media. As shown in the pyramid, there has been a progression from rotating hard disks to NAND-based SSDs to NVMe-based SSDs; access times improved from milliseconds to tens of microseconds.
More recently, one of the most disruptive yet promising technologies has emerged: Non-Volatile Memory (NVM), or Persistent Memory (PM). Unlike most persistent media technologies, which are block-oriented, NVM is byte-addressable and has latencies close to those of DRAM, with densities better than DRAM. NVM technologies include Phase Change Memory (PCM), Resistive RAM (ReRAM), and Intel’s 3D XPoint. This media is attached to the DDR bus and is directly addressable by the CPU; it requires no DMA. Applications can therefore access it as memory using load/store instructions and still have their data stored durably.
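To make the load/store access model concrete, here is a minimal sketch in C. It assumes the persistent memory is exposed to the application as a memory-mapped file (on real hardware this would typically be a file on a DAX-capable file system; that is an assumption, and an ordinary file works for experimenting). The helper names pm_map and pm_store_string are hypothetical and exist only for this illustration, and msync is used here as a generic, portable way to request durability rather than as the fastest one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of the file at `path` so that it can
 * be read and written with ordinary loads and stores. Returns NULL on
 * failure. The file is created and sized if it does not exist yet. */
static void *pm_map(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return base == MAP_FAILED ? NULL : base;
}

/* Hypothetical helper: write a string with plain CPU stores (memcpy), then
 * ask the OS to make the mapped range durable. Returns 0 on success. */
static int pm_store_string(void *base, size_t len, const char *text)
{
    assert(base != NULL);
    size_t n = strlen(text) + 1; /* include the terminating NUL */
    if (n > len)
        return -1;
    memcpy(base, text, n);            /* no read()/write() calls, no DMA setup */
    return msync(base, len, MS_SYNC); /* generic durability request */
}
```

A caller would map a region once, use it like any other pointer, and find the same bytes there after unmapping and mapping the file again.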
Developments in persistent memory technology
Persistent memory technology theoretically provides a great leap in speed, but in practice it poses some interesting software challenges. A study by Professor Steven Swanson and his students found that when running the existing Linux kernel on emulated PM, around 14 µs was spent in the Linux software stack, compared to only 8-9 µs in hardware: the software was slower than the hardware!
Existing software carries many optimizations designed for slower, block-oriented storage. These become redundant and introduce overhead with the much faster, byte-addressable, random-access persistent memory. It is not simple to disable or remove these optimizations, since they are an integral part of software such as file systems and databases. Hence, there is a lot of research on how to access data on NVM optimally using existing software and on how to build new software for it.
But there are some challenges
For example, even after a MOV instruction completes, the data is not guaranteed to have reached the NV-DIMM, so there is always a chance of data loss on a sudden power failure. Because the caches can evict data and write it to PM in any order, transactions that depend on strict ordering can break (e.g., allocating a linked list node and then updating the address in the previous linked list node).
Existing Intel x86 instructions such as CLFLUSH and SFENCE help flush CPU caches to PM in a specific order. Intel has more recently introduced a faster, optimized version of CLFLUSH called CLFLUSHOPT, which is better suited to flushing large buffers. CLWB is another variant for flushing caches; it does not invalidate the cache line but only writes modified data back to memory or NVM. To ensure that the data in a cache line is persisted, and in order with respect to other writes, one needs either CLFLUSHOPT or CLWB followed by SFENCE, or CLFLUSH. We will see later the APIs that application programmers can call to achieve this functionality.
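The flush-then-fence pattern, applied to the linked list example, can be sketched in C as follows. This is an illustration under stated assumptions, not a definitive implementation: persist and list_append are hypothetical names, the 64-byte cache-line size is assumed, and malloc stands in for a persistent memory allocator. The sketch uses the _mm_clflush and _mm_sfence compiler intrinsics because they are available on any SSE2-capable x86 build; where the CPU and compiler flags allow it, _mm_clflushopt or _mm_clwb could take the place of _mm_clflush. The SFENCE after CLFLUSH is kept so the structure stays the same if one of those variants is swapped in. On non-x86 targets the helper only issues a memory fence as a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_clflush, _mm_sfence */
#endif

#define CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Hypothetical helper: flush every cache line that overlaps
 * [addr, addr + len), then fence so that later stores are ordered
 * after the flushes. */
static void persist(const void *addr, size_t len)
{
    if (len == 0)
        return;
#if defined(__SSE2__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);
    _mm_sfence();
#else
    (void)addr;
    __atomic_thread_fence(__ATOMIC_SEQ_CST); /* placeholder, not a real flush */
#endif
}

struct node {
    int value;
    struct node *next;
};

/* Append a node after `tail` in two ordered steps:
 *   1. fill in the new node and persist it;
 *   2. only then publish it through tail->next and persist that pointer.
 * If power fails between the steps, the list is still consistent; the new
 * node is simply not linked in. malloc is a stand-in for a PM allocator. */
static struct node *list_append(struct node *tail, int value)
{
    assert(tail != NULL);
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->value = value;
    n->next = NULL;
    persist(n, sizeof *n);                   /* step 1: node contents first */

    tail->next = n;
    persist(&tail->next, sizeof tail->next); /* step 2: then the link */
    return n;
}
```

Without the first persist call, the cache could write back the updated tail->next before the node it points to, which is exactly the reordering hazard described earlier.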
To achieve optimal performance from NVM and exploit its defining characteristic (byte-addressable persistence), there is a lot of active research and many standards-body initiatives in this area. One example is the NVM Programming Model (NPM) specification proposed by SNIA. NPM defines recommended behavior between the various user-space and operating system (OS) kernel components that support NVM. The specification describes the various access or programming modes and discusses atomicity, durability, ordering, error handling, etc. with respect to NVM.