#smt cache
songmingisthighs · 2 years
Text
Cache Masterlist
high school chaebol!wooyoung × reader, ??? × reader
start : September 1st 2022 KST / August 31st 2022 (author time)
status : completed
updates : daily, 12.30 am KST
✨️ - written chapter
introduction pt. i | pt. ii | pt. iii
ch. i | ch. ii | ch. iii | ch. iv | ch. v | ch. vi | ch. vii | ch. viii | ch. ix | ch. x | ch. xi | ch. xii | ch. xiii | ch. xiv ✨️ | ch. xv | ch. xvi | ch. xvii | ch. xviii | ch. xix | ch. xx | ch. xxi | ch. xxii | ch. xxiii | ch. xxiv | ch. xxv ✨️ | ch. xxvi | ch. xxvii | ch. xxviii | ch. xxix ✨️ | ch. xxx | ch. xxxi | ch. xxxii ✨️ | ch. xxxiii | ch. xxxiv | ch. xxxv | ch. xxxvi | ch. xxxvii | ch. xxxviii | ch. xxxix | ch. xl ✨️ | ch. xli | ch. xlii | ch. xliii | ch. xliv | ch. xlv | ch. xlvi ✨️ | ch. xlvii | ch. xlviii | ch. xlix | ch. l | ch. li | ch. lii ✨️ | ch. liii | ch. liv ✨️ | ch. lv | ch. lvi | ch. lvii | ch. lviii ✨️ | ch. lix | ch. lx | ch. lxi | ch. lxii ✨️ | ch. lxiii | ch. lxiv ✨️ | ch. lxv | ch. lxvi | ch. lxvii | ch. lxviii ✨️ | ch. lxix | ch. lxx | ch. lxxi | ch. lxxii | ch. lxxiii | ch. lxxiv ✨️ | ch. lxxv | ch. lxxvi | ch. lxxvii | ch. lxxviii | ch. lxxix | ch. lxxx | ch. lxxxi ✨️ | ch. lxxxii | ch. lxxxiii | ch. lxxxiv | ch. lxxxv | ch. lxxxvi | ch. lxxxvii ✨️ | ch. lxxxviii ✨️ | ch. lxxxix | epilogue
721 notes · View notes
eirikrjs · 2 years
Note
If SMT VI happens, and they do include Stolas, but he gets a redesign, would you still want him in your party? For me, if he looks worse than the Hazbin Hotel/Helluva Boss version of him, I wouldn't even recruit him or make him through fusion, even if he's one of the best demons in that hypothetical game
Man. The one time I watched Helluva Boss for five minutes it was abject misery but I guess it wasn't because of Stolas. So I'd have to agree.
Personally, I'd want Stolas to be as exact to the Le Breton illustration as possible, since he's just that perfect. But even if Doi wants to get a little creative, I can't imagine it being as bad as this old doompost:
Could it? One thing Stolas has going for him is a lack of Japanese pop culture cachet. So he's freer to just be himself. As he should be.
But he's not a cute idol girl or Yoshitsune, so his hypothetical exclusive skill ("Gem Finder"?) will be garbage.
18 notes · View notes
maychusieutoc1 · 2 months
Text
Intel Xeon Gold 5220 Processor (18C/36T, 2.20 GHz, 24.75 MB)
The Intel Xeon Gold 5220 processor is a leading product in Intel's Xeon line, designed specifically to meet the high-end requirements of server and workstation environments in large enterprises and organizations. With high computing power, strong multitasking, and flexibility, this processor is an impressive choice for handling performance-demanding applications and workloads.
Outstanding performance: With 18 physical cores and 36 threads via simultaneous multithreading (SMT), the Xeon Gold 5220 delivers remarkable computing power. This makes it an ideal choice for performance-hungry applications such as scientific computing, data analytics, and virtualization servers.
Boosted performance with Turbo Boost: The Xeon Gold 5220 has a base clock speed of 2.20 GHz, but it can raise performance through Intel Turbo Boost technology. This allows it to temporarily increase its clock speed when needed to handle demanding tasks.
Fast data storage and access: With a large cache of up to 24.75 MB, this processor stores and accesses data efficiently, minimizing latency and optimizing overall performance.
Advanced features and security: The Xeon Gold 5220 is built on Intel's Xeon Scalable architecture and comes with advanced features such as Intel Hyper-Threading and Intel Virtualization Technology, helping optimize system performance and security.
Energy savings and operational efficiency: With a reasonable thermal design power (TDP), the Xeon Gold 5220 helps reduce operating costs and save energy in server and data center environments.
In short, the Intel Xeon Gold 5220 Processor is an ideal choice for organizations and enterprises that want to optimize performance and flexibility when handling applications and workloads that demand high computing power. Its power, performance, and security make it one of the leading processor choices on the market for enterprise environments.
0 notes
govindhtech · 5 months
Text
In Dual-Socket Systems, Ampere's 192-Core CPUs Stress the ARM64 Linux Kernel
In the realm of ARM-based server CPUs, the abundance of cores can present unforeseen challenges for Linux operating systems. Ampere, a prominent player in this space, has recently launched its AmpereOne data center CPUs, boasting an impressive 192 cores. However, this surplus of computing power has led to complications in Linux support, especially in systems employing two of Ampere’s 192-core chips (totaling a whopping 384 cores) within a single server.
The Core Conundrum
According to reports from Phoronix, the ARM64 Linux kernel currently struggles to support configurations exceeding 256 cores. In response, Ampere has taken the initiative by proposing a patch aimed at elevating the Linux kernel’s core limit to 512. The proposed solution involves implementing the “CPUMASK_OFFSTACK” method, a mechanism allowing Linux to override the default 256-core limit. This approach strategically allocates free bitmaps for CPU masks from memory, enabling an expansion of the core limit without inflating the kernel image’s memory footprint.
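In configuration terms, the change boils down to raising the kernel's compile-time CPU ceiling while switching the cpumask allocation strategy. A rough sketch using the real kernel config symbols involved (not the literal patch):

    CONFIG_NR_CPUS=512          # raise the compile-time CPU ceiling
    CONFIG_CPUMASK_OFFSTACK=y   # allocate CPU-mask bitmaps from memory instead of the stack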
Tackling Technicalities
Implementing the CPUMASK_OFFSTACK method is crucial, given that each additional core otherwise adds 8KB to the kernel image size. Ampere's cutting-edge CPUs stand out with the highest core count in the industry, surpassing even AMD's latest Zen 4c EPYC CPUs, which cap out at 128 cores. This unprecedented core count places Ampere in uncharted territory, making it the first CPU manufacturer to grapple with the ARM64 Linux kernel's 256-core threshold.
The Impact on Data Centers
While the core limit predicament does not affect systems equipped with a single 192-core AmpereOne chip, it poses a significant challenge for data center servers housing two of these powerhouse chips in a dual-socket configuration. Notably, SMT logical cores, or threads, also exceed the 256 figure on various systems, further compounding the complexity of the issue.
AmpereOne: A Revolutionary CPU Lineup
AmpereOne represents a paradigm shift in CPU design, featuring models with core counts ranging from 136 to an astounding 192 cores. Built on the ARMv8.6+ instruction set and leveraging TSMC’s cutting-edge 5nm node, these CPUs boast dual 128b Vector Units, 2MB of L2 cache per core, a 3 GHz clock speed, an eight-channel DDR5 memory controller, 128 PCIe Gen 5 lanes, and a TDP ranging from 200 to 350W. Tailored for high-performance data center workloads that can leverage substantial core counts, AmpereOne is at the forefront of innovation in the CPU landscape.
The Road Ahead
Despite Ampere's proactive approach in submitting a patch to address the core-limit challenge, achieving 512-core support might take some time. A similar proposal to raise the ARM64 Linux CPU core limit to 512 was put forth in 2021, but Linux maintainers rejected it because no available CPU hardware exceeded 256 cores at that time. Even optimistically, 512-core support may not become a reality until the release of Linux kernel 6.8 in 2024.
A Glimmer of Hope
It's important to note that the current Linux kernel already supports the CPUMASK_OFFSTACK method for raising CPU core count limits. The ball is now in the court of the Linux maintainers to decide whether to enable this feature by default, potentially expediting the timeline for the much-needed 512-core support.
In conclusion, Ampere’s 192-core CPUs have thrust the industry into uncharted territory, necessitating innovative solutions to overcome the limitations of current ARM64 Linux kernel support. As technology continues to advance, collaborations between hardware manufacturers and software developers become increasingly pivotal in ensuring seamless compatibility and optimal performance for the next generation of data center systems.
Read more on Govindhtech.com
0 notes
usi-thesipcompany · 2 years
Text
What are the types of storage servers?
A storage server is a type of server used to keep, secure, store, and manage digital files and folders. Its purpose is ample data storage and shared access to that data over a network; it can also be termed a file server. The storage server serves as a central point for data storage and access.
Local client nodes access it through a GUI or an FTP control panel, and it can also serve as a backup server for data storage.
Storage servers are an integral part of direct-attached storage (DAS), network-attached storage (NAS), and similar architectures.
Types of storage
Storage servers come in two types: dedicated and non-dedicated servers.
A dedicated server is used exclusively as a file server, with specific workstations for reading and writing the database. Data files are typically stored on a disk array, a technology developed to operate multiple disk drives together as a single unit. A disk array has a cache (faster than a magnetic disk), advanced storage virtualization, and RAID. The type of disk array used depends on the storage network.
Once a machine is configured and made public on the network, users can start accessing the available storage space on the storage server by 'mapping' the drives on their computers. After mapping, the computer's operating system identifies the storage server as an additional drive. If the network is configured correctly, all computers are granted permission to create, modify, and execute files directly from the server, adding extra shared storage space to each connected computer.
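As a concrete illustration, on Windows this 'mapping' step is typically a single command (the server and share names here are hypothetical):

    net use Z: \\fileserver01\shared

After this, drive Z: appears in the operating system as an additional drive backed by the storage server.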
USI provides customers with ODM/JDM/EMS Server, Storage, NAS, and SSD products and manufacturing services. We offer an L10 system design service, which includes the M/B, firmware (BIOS & BMC), sub-cards (backplane, add-on card, etc.), enclosure & thermal design, and system integration.
Server
In USI, customers get both ODM/JDM server product development and EMS server board build services. The customer's NPI is managed in the Taiwan factory, and mass production is handled in the China factories in Shenzhen and Kunshan for board and system builds.
Strengths
10+ years of Server MB, Cards, ODM/JDM design experience
Expertise in Intel x86 platform hardware, BMC, and BIOS development
Total solution on system integration validation 
Certification and Regulatory Service 
Advanced SMT manufacturing, assembly, test process
Worldwide logistics and service
0 notes
dailytechnologynews · 5 years
Photo
Musings on Vega / GCN Architecture
Originally posted to /r/AMD, but someone asked that I copy/paste it here to /r/hardware.
In this topic, I'm just going to stream some ideas about what I know about Vega64. I hope I can inspire some programmers to try to program their GPU! Also, If anyone has more experience programming GPUs (NVidia ones even), please chime in!
For the most part, I assume that the reader is a decent C Programmer who doesn't know anything about GPUs or SIMD.
Vega Introduction
Before going further, I feel like it's important to define a few things for AMD's Vega architecture. I will come back later to better describe some concepts.
64 CUs (Compute Units) -- 64 CUs on Vega64. 56 CUs on Vega56.
16kB L1 (Level 1) data-cache per CU
64kB LDS (Local Data Store) per CU
4-vALUs (vector Arithmetic Logic Unit) per CU
16 PE (Processing Elements) per vALU
4 x 256 vGPRs (vector General Purpose Registers) per PE
1-sALU (scalar Arithmetic Logic Unit) per CU
8GB of HBM2 RAM
Grand Total: 64 CUs x 4 vALUs x 16 PEs == 4096 "shaders", just as advertised. I'll go into more detail later what a vGPR or sGPR is, but lets first cover the programmer-model.
GPU Programming in a nutshell
Here's some simple C code. Lets assume "x" and "y" are the input to the problem, and "output" is the output:
for(int i=0; i<1000000; i++){ // One Million Items
    output[i] = x[i] + y[i];
}
"Work Items", (SIMD Threads in CUDA) are the individual units of work that the programmer wishes to accomplish in parallel with each other. Given the example above, a good work item would be "output[i] = x[i] + y[i]". You would have one-million of these commands, and the programmer instinctively knows that all 1-million of these statements could be executed in parallel. OpenCL, CUDA, HCC, and other grossly-parallel languages are designed to help the programmer specify millions of work-items that can be run on a GPU.
"NDRange" ("Grid" in CUDA) specifies the size of your work items. In the example "for loop" case above, 1000000 would be the NDRange. Aka: there are 1-million things to do. The NDRange or Grid may be 2-dimentional (for 2d images), or 3-dimentional (for videos).
"Wavefronts" ("Warp" in CUDA) are the smallest group of work-items that a GPU can work on at a time. In the case of Vega, 64-work items constitutes a Wavefront. In the case of the for-loop described earlier, a wave-front would execute between [0, 1, 2, 3... 63] iterations together. A 2nd wave front would execute [64, 65, 66, 67, ... 127] together (and probably in parallel).
"Workgroups" ("Thread Blocks" in CUDA) are logical groups that the programmer wants to work together. While Wavefronts are what the system actually executes, the Vega system can combine up to 16-wavefronts together and logically work as a single Workgroup. Vega64 supports workgroups of size 1 through 16 Wavefronts, which correlates to 64, 128, ... 1024 WorkItems (1024 == 16 WaveFronts * 64 Threads per Wavefront).
In summary: OpenCL / CUDA programmers setup their code. First, they specify a very large number of work items (or CUDA Threads) which represents parallelism. For example: perhaps you want to calculate something on every pixel of a picture, or calculate individual "Rays" of a Raytracer. The programmer then groups the work items into workgroups. Finally, the GPU itself splits workgroups into Wavefronts (64-threads on Vega).
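On the host side, those concepts appear directly as the arguments of the enqueue call. A sketch, assuming "queue" and "kernel" were created earlier:

    size_t global_size = 1000000; // NDRange: one work item per array element
    size_t local_size  = 64;      // workgroup of one wavefront (1000000 divides evenly by 64)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);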
SIMD Segue
Have you ever tried controlling multiple characters with only one controller? When you hook up one controller, but somehow trick the computer into thinking it is 8-different-controllers? SIMD: Single Instruction Multiple Data, is the GPU-technique for actually executing these thousands-of-threads efficiently.
The chief "innovation" of GPUs is just this multi-control concept, but applied to data instead. Instead of building these huge CPU cores which can execute different threads, you build tiny GPU cores (or shaders) which are forced to play the same program. Instead of 8x wide (like in the GIF I shared), its 64x wide on AMD.
To handle "if" statements or "loops" (which may vary between work-items), there's an additional "execution mask" which the GPU can control. If the execution-mask is "off", an individual thread can be turned off. For example:
if(foo()){
    doA(); // Lets say 10 threads want to do this
} else {
    doB(); // But 54 threads want to do this
}
The 64-threads of the wavefront will be forced to doA() first, with the 10-threads having "execution mask == on", and with the 54-remaining threads having "execution mask == off". Then, doB() will happen next, with 10-threads off, and 54-threads on. This means that any "if-else" statement on a GPU will have BOTH LEGS executed by all threads.
In general, this is called the "thread divergence" problem. The more your threads "split up", the more legs of if-statements (and more generally: loops) have to be executed.
Before I reintroduce Vega's Architecture, keep the multiple-characters / one-controller concept in mind.
Vega Re-Introduction
So here's the crazy part. A single Vega CU doesn't execute just one wavefront at a time. The CU is designed to run up to 40 wavefronts (x 64 threads, so 2560 threads total). These threads don't really all execute simultaneously: the 40 wavefronts are there to give the GPU something to do while waiting for RAM.
Vega's main memory controller can take 350ns or longer to respond. For a 1200MHz system like Vega64, that is 420 cycles of waiting whenever something needs to be fetched from memory. That's a long time to wait! So the overall goal of the system, is to have lots of wavefronts ready to run.
With that out of the way, lets dive back into Vega's architecture. This time focusing on CUs, vALUs, and sALUs.
64 CUs (Compute Units) -- 64 CUs on Vega64.
4-vALUs (vector Arithmetic Logic Unit) per CU
16 PE (Processing Elements) per vALU
4 x 256 vGPRs (vector General Purpose Register) per PE
1-sALU (scalar Arithmetic Logic Unit) per CU
The sALU is easiest to explain: sALUs is what handles those "if" statements and "while" statements I talked about in the SIMD section above. sALUs track which threads are "executing" and which aren't. sALUs also handle constants and a couple of other nifty things.
Second order of business: vALUs. The vALUs are where Vega actually gets all of their math power from. While sALUs are needed to build the illusion of wavefronts, vALUs truly execute the wavefront. But how? With only 16-PEs per vALU, how does a wavefront of size 64 actually work?
And btw: your first guess is likely wrong. It is NOT from #vALUs x 16 PEs. Yes, this number is 64, but its an utterly wrong explanation which tripped me up the first time.
The dirty little secret is that each PE repeats itself 4-times in a row, across 4-cycles. This is a hidden fact deep in AMD documentation. In any case, 4-cycles x 16 PE == 64 Workitems per vALU. x4 vALUs == 256 work-items per Compute Unit (every 4 clock cycles).
Why repeat themselves? Because if a simple addition takes 4-clock cycles to operate, then Vega only has to perform ~30 math operations while waiting for RAM (remember: 100ns, or 120-clock cycles, to wait for RAM to respond). Repeating commands over-and-over again helps Vega to hide the memory-latency problem.
Full Occupancy: 4-clocks x 16 PEs x 4 vALUs == 256 Work Items
Full Occupancy, or more like "Occupancy 1", is when each CU (compute unit) has one-work item for each physical thread that could run. Across the 4-clock cycles, 16 PEs, and 4 vALUs per CU, the Compute Unit reaches full occupancy at 256 work items (or 4-Wavefronts).
Alas: RAM is slow. So very, very slow. Even at Occupancy 1 with super-powered HBM2 RAM, Vega would spend too much time waiting for RAM. As such, Vega supports "Occupancy 10"... but only IF the programmer can split the limited resources between threads.
In practice, programmers typically reach "Occupancy 4". At occupancy 4, the CU still only executes 256-work items every 4-clock cycles (4-wavefronts), but the 1024 total items (16-wavefronts) give the CU "extra work" to do whenever it notices that one wavefront is waiting for RAM.
Memory hiding problem
Main memory latency is incredibly high, and also variable. RAM may take 350 or more cycles to respond. Even the LDS may respond in a variable amount of time (depending on how many atomic operations are going on, or bank conflicts).
AMD has two primary mechanisms to hide memory latency.
Instruction Level -- AMD's assembly language requires explicit wait-states to hold the pipeline. The "s_waitcnt lgkmcnt(0)" instruction you see in the assembly is just that: wait for the local/global/konstant/message counter to reach (zero). Careful use of the s_waitcnt instruction can be used to hide latency behind calculations: you can start a memory load into some vGPRs, and then calculate with other vGPRs before waiting. (A short assembly sketch follows this list.)
Wavefront Level -- The wavefronts at a system-level allow the CU to find other work, just in case any particular wavefront gets stuck on a s_waitcnt instruction.
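To make the Instruction Level point concrete, here is a small hypothetical GCN snippet (real instructions, illustrative register choices): the scalar load is issued early, independent vector math fills the gap, and the s_waitcnt stalls only when the loaded data is finally needed.

    s_load_dwordx2 s[0:1], s[4:5], 0x0   ; issue the scalar load early
    v_mul_f32 v2, v0, v1                 ; independent math while the load is in flight
    s_waitcnt lgkmcnt(0)                 ; stall here, right before s[0:1] is consumed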
While CPUs use out-of-order execution to hide latency and search for instruction-level parallelism... GPUs require the programmer (or compiler) to explicitly put the wait-states in. It is far less flexible, but far cheaper an option to do.
Wavefront level latency hiding is roughly equivalent to a CPU's SMT / Hyperthreading. Except instead of 2-way hyperthreading, the Vega GPU supports 10-way hyperthreads.
Misc. Optimization Notes
On AMD Systems, 64 is your magic minimum number. Try to have at least 64 threads running at any given time. Ideally, have your workload evenly divisible by 64. For example, 100 threads will be run as 64 thread wavefront + 36 thread wavefront (with 28 wasted vALU states!). 128 threads is more efficient.
vGPRs (vector General Purpose Registers) are your most precious resource. Each vGPR is 32 bits of memory that executes at the full speed of Vega (1 operation every 4 clock cycles). Any add, subtract, or multiply in any work-item will have to travel through a vGPR before it can be manipulated. vGPRs roughly correlate to "OpenCL Private Memory", or "CUDA Local Memory".
At occupancy 1, you can use all 256 vGPRs (1024 bytes). However, "Occupancy 1" is not good enough to keep the GPU busy when its waiting for RAM. The extreme case of "Occupancy 10" gives you only 25 vGPRs to work with (256/10, rounded down). A reasonable occupancy to aim for is Occupancy 4 and above (64 vGPRs at Occupancy 4)
FP16 Packed Floats will stuff 2x16-bit floats per vGPR. "Pack" things more tightly to save vGPRs and achieve higher occupancy.
The OpenCL Compiler, as well as HCC, HIP, Vulkan compilers, will overflow OpenCL Private Memory into main-memory (Vega's HBM2) if it doesn't fit into vGPRs. There are compiler flags to tune how many vGPRs the compiler will target. However, your code will be waiting for RAM on an overflow, which is counterproductive. Expect a lot of compiler-tweaking to figure out what the optimal vGPRs for your code will be.
sGPRs (scalar General Purpose Registers) are similarly precious, but Vega has a lot more of them. I believe Vega has around 800 SGPRs per SIMD unit. That is 4x800 SGPRs per CU. Unfortunately, Vega has an assembly-language limit of 102 SGPRs allocated per wavefront. But an occupancy 8 Vega system should be able to hold 100 sGPRs per wavefront.
sGPRs implement OpenCL Constant memory specification (also called CUDA Constant memory). sGPRs are more flexible in practice: as long as they are uniform across the 64-item wavefront, an sGPR can be used instead of 64-individual precious vGPRs. This can implement a uniform loop (like for(int i=0; i<10; i++) {}) without using a precious vGPR.
If you can branch using sGPR registers ("constant" across the whole 64-item wavefront), then you will not need to execute the "else". Effectively, sGPR branching never has a divergence problem. sGPR-based branching and looping has absolutely no penalty on the Vega architecture. (In contrast, vGPR-based branching will cause thread-divergence).
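As an illustrative OpenCL fragment (hypothetical variables), the two cases look like this:

    // Uniform: 'n' is a kernel argument, identical across the wavefront,
    // so it lives in an sGPR and the sALU branches with no divergence penalty.
    for (int i = 0; i < n; i++) {
        sum += x[i];
    }

    // Divergent: the condition depends on per-work-item data held in a vGPR,
    // so the wavefront executes both legs with the execution mask toggled.
    if (x[get_global_id(0)] > 0.0f) {
        sum += 1.0f;
    } else {
        sum -= 1.0f;
    }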
The sALU can operate on 64-bit integers. sGPRs are of size 32-bits, and so any 64-bit operation will use two sGPRs. There is absolutely no floating-point support on the sALU.
LDS (Local Data Store) is the 2nd fastest RAM, and is therefore the 2nd most important resource after vGPRs. LDS RAM correlates to "OpenCL Local" and "CUDA Shared". (Yes, "Local" means different things between CUDA and OpenCL. Its very confusing). There is 64kB of LDS per CU.
LDS can share data between anything within your workgroup. The LDS is the primary reason to use a large 1024-thread workgroup: the workgroup can share the entire LDS space. LDS has full support of atomics (ex: CAS) to provide a basis of thread-safe communications.
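As one sketch of the pattern, here is a classic workgroup-level partial-sum reduction that stages data through the LDS (OpenCL "local" memory), assuming 256-work-item workgroups:

    __kernel void partial_sums(__global const float* in, __global float* out) {
        __local float scratch[256];            // lives in the CU's 64kB LDS
        int lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);          // wait until all 256 items have stored
        for (int s = 256 / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0]; // one partial sum per workgroup
    }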
LDS is roughly 32-banks (per CU) of RAM which can respond within 2-clock ticks under ideal circumstances. (It may be as slow as 64-clock ticks however). At 1200 MHz (Vega64 base clock), the LDS has 153GBps of bandwidth per compute unit. Across the 64-CUs of Vega64, that's a grand total of 9830.4 GBps bandwidth (and it goes faster as Vega boost-clocks!). Compared to HBM2, which is only 483.8 GBps, you can see why proper use of the LDS can accelerate your code.
Occupancy will force you to split the LDS. The exact calculation is harder to formulate, because the LDS is shared by workgroups (and there can be 1 to 16 wavefronts per workgroup). If you have 40 workgroups (1 wavefront per workgroup), the 64kB LDS must be split into 1638-byte chunks between workgroups. However, if there are 5 workgroups (8 wavefronts, aka 512 work-items per workgroup), the 64kB LDS only needs to be split into 13107-byte chunks between the 5 workgroups, even at max Occupancy 10.
As a rule of thumb: bigger workgroups that share more data will more effectively use the LDS. However, not all workloads allow you to share data easily.
The minimum workgroup size of 1 wavefront / 64 work-items is treated as special. Barriers and synchronization never have to happen! A workgroup of 1 wavefront (64 work-items) by definition executes synchronously with itself. Still, use barrier instructions (and let the compiler figure out that it can turn the barriers into no-ops).
A secondary use of LDS is to use it as a manually managed cache. Don't feel bad if you do this: the LDS is faster than L1 cache.
L1 vector data cache is 16kB, and slower than even the LDS. In general, any program serious about speed will use the LDS explicitly, instead of relying upon the L1 cache. Still, it's helpful to know that 16kB of global RAM will be cached for your CU.
L1 scalar data cache is 16kB, shared between 4 CUs (!!). While this seems smaller than the vector L1 cache, remember that each sALU is running 64 threads / work items. In effect, the 40 wavefronts (x4 == 160 wavefronts max across 4 CUs) represent 10240 threads. But the sALU doesn't store data per-thread... it stores data per wavefront. Despite being small, this L1 scalar data cache can be quite useful in optimized code.
Profile your code. While the theoretical discussion of this thread may be helpful to understanding why your GPGPU code is slow, you only truly understand performance if you read the hard-data.
HBM2 Main Memory is very slow (~120 cycles to respond), and relatively low bandwidth ("only" 480 GBps). At Occupancy 1, there will be a total of 16384 work-items (or CUDA Threads) running on your Vega64. The 8GB of HBM2 main memory can therefore be split up into 512kB per work-item.
As Bill Gates used to say, 640kB should be enough for everyone. Unfortunately, GPUs have such huge amounts of parallelism, you really can't even afford to dedicate that much RAM even in an "Occupancy 1" situation. The secret to GPUs is that your work-items will strongly share data with each other.
Yeah yeah yeah, GPUs are "embarrassingly parallel", or at least are designed to work that way. But in practice, you MUST share data if you want to get things done. Even at "Occupancy 1", the 512kB of HBM2 RAM per work-item is too small to accomplish most embarrassingly parallel tasks.
References
AMD OpenCL Optimization Guide
AMD GCN Crash Course
Advanced Shader Programming on GCN
GCN Assembly Tutorial -- Seeing the assembly helps understand how sGPR or vGPRs work, and solidify your "wavefront" ideas.
Vega Assembly Language Manual -- 247 pages of dense, raw, assembly language.
1 note · View note
taimoorzaheer · 3 years
Text
Ryzen 5 5600G Falls to the Core i5-11400 In New Benchmarks
AMD’s Ryzen 5000 (Cezanne) desktop APUs will make their debut in OEM and pre-built systems before hitting the retail market by the end of this year. However, the hexa-core Zen 3 APU (via Tum_Apisak) is already showing up in multiple benchmarks around the Internet. The Ryzen 5 5600G comes equipped with six Zen 3 cores with simultaneous multithreading (SMT) and 16MB of L3 cache. The 7nm APU…
View On WordPress
0 notes
songmingisthighs · 2 years
Text
HELLAUR HELLAUR
so recently (literally today) rie added a forum to their discord server and i'm planning on abusing this forum for Cache (and future series) purposes. i will be providing spoilers, early access, forum discussion (projections, suggestions, direct comments, etc.) so i want to invite EVERYONE who wants to participate or is looking for a community in general to join our discord server ! :D
tagging :
@paralumanniluna @rdiamond2727 @miaatiny @baguette-atiny @kpopnightingale @dear-dreamie @potaeto-writes-on-wp @kwanisms @qghosty @charreddonuts @noonaishere @bbymatz @maddiebabyxoxo @kawennote09 @woo-stars @treasure-1117 @starjoongie1117 @cutie-wooyo @linhyyboo12 @kitty4hwa @dreamlesswonder86 @glitterhongjoong @kitty4hwa @ateezourstars @starlight-channie @jo-hwaberry @yla-aira @hyuckilstan @phenomenalgirl9 @flamingi @tannie13 @gxlden-bxbyy @kirooz @leagreenly @seoulscenarios @purenjuniverse @meowmeowminnie @star1117-archives @ilsedingsx @kkayfan @ckline35 @jaxavance @yoongiigolden @jayeonnature @hwanchaesong
27 notes · View notes
smtmarketing · 4 years
Text
4 Basic Ways to Optimize Your Business's Website Speed
Today, a website is an indispensable tool for any business; it is not only where a business introduces its products and services but also a way to find new customers. However, not every business gives its website enough attention, and in particular, many websites suffer from serious page-speed problems. In this article, we will look at the 4 most common ways to optimize website speed.
Why prioritize website speed optimization?
In a study conducted by Akamai, about half of online users expect a website to load, or be accessible, in two seconds or less. If it cannot be accessed within three seconds, visitors tend to leave.
An even more alarming statistic is that 64% of users who are unhappy with a site's load time will go buy from another website. This means you not only lose potential customers and lower your conversion rate, but your site also risks losing the customers who would have recommended your products/services to others.
In particular, if your business is interested in the Inbound Marketing approach, a website that is too slow will make every Inbound tactic far less effective.
Optimizing website speed brings many benefits to a business
See also: 3 Big Problems When Businesses Implement Inbound Marketing
So, apply the following 4 ways to optimize website speed:
Minimize HTTP requests
An HTTP (Hypertext Transfer Protocol) request is counted whenever the browser fetches a file, page, or image from the web server. According to Yahoo, these requests tend to account for about 80% of a website's load time. Browsers also limit requests to 4-8 simultaneous connections per domain, meaning more than 30 assets cannot be loaded at once.
This means that the more HTTP requests you have to load, the longer it takes to load the page and retrieve all of them, increasing your website's load time. Common ways to reduce HTTP requests are to combine CSS/JS files, to load content only when needed (for example, some content shown only on desktop and other content shown only on mobile), and to reduce the number of heavy images.
Use CDN technology
Not everyone who visits your website is in the same area, city, or country, so CDN technology makes accessing the website much faster no matter where users are. Simply put, this technology helps users reach the website faster by reducing the distance between the visitor and the server.
For example, if your website only has a server in Ho Chi Minh City, access will be slower for users in Hanoi or in other countries than for users in HCMC; with CDN enabled, access speed becomes comparable regardless of where users are. Most hosting services today include CDN technology, so you can ask your hosting provider to enable it for your website.
Use browser caching
Browser caching allows your website's content to be downloaded to the hard drive once, into a cache or temporary storage space. Those files are then stored locally on the visitor's system, which speeds up subsequent page loads.
In short, when you visit a website for the first time, caching saves that website's assets, and on your next visit, the site loads noticeably faster because everything was saved beforehand. On WordPress, dedicated speed-optimization plugins such as WP Rocket or Autoptimize support this technology.
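At the HTTP level, this behavior is driven by response headers. For example, a static asset can be marked cacheable for up to a year with a header like the following (the exact lifetime depends on how often the asset changes):

    Cache-Control: public, max-age=31536000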
Browser caching significantly improves website speed
Optimize image file sizes
Images are among the factors that slow down website access the most, so you should regularly optimize the images on your site. Images should be under 150KB in size, under 1920 pixels in width, and at a quality of 72 dpi. Beyond these limits, you will find images appearing very slowly on the website.
You should also pay attention to image format: PNG files suit situations where you don't need a background (such as a logo), while JPG files suit most situations where you don't need highly detailed images. If you build your website on WordPress, there are plugins that help optimize images, such as TinyPNG.
Conclusion
Optimizing website speed is not a one-time task; you will need to regularly re-check your website's speed after a period of use. Ideally, schedule periodic speed checks and survey your customers about their experience on the site. Otherwise, you will not realize that your website has become very slow and is losing you many potential customers.
See also: How did the Thegioidong website grow so quickly with Content Marketing?
from SMT Marketing https://ift.tt/33RYqHp
0 notes
foremostlist · 4 years
Text
AMD Ryzen 3 3300X and Ryzen 3 3100 Review
The latest Ryzen processors from AMD are coming in as low as $100 with quad-cores and SMT support.
Meet the new Ryzen 3 3100, set to cost just $100. It features 4 cores, 8 threads, and clocks between 3.6 and 3.9 GHz depending on the workload. For such an affordable processor it also features a rather large 18MB of cache. This part is rated for a 65-watt TDP, and so you get the Wraith Stealth…
View On WordPress
0 notes
govindhtech · 6 months
Text
Thin and Light AMD Ryzen 7040U ‘Phoenix’ Gets You Ahead
AMD introduced the Ryzen 7040U series
In May, AMD introduced the Ryzen 7040U series of thin and light notebook CPUs, codenamed ‘Phoenix.’ AMD initially announced four Zen 4-based CPUs for ultra-portable notebooks, with the Ryzen 7 7840U (8C/16T) and Ryzen 5 7640U (6C/12T) leading the way with AMD’s first-generation NPU for on-chip AI and inferencing.
A few months later, AMD introduces two more Ryzen 7040U processors. However, these chips use their smaller, optimized Zen 4c cores. AMD’s smaller Zen 4c core debuted with their EPYC 97×4 ‘Bergamo’ processors for native cloud deployments, but server chips were never their only use. Zen 4c was silently released in the consumer market as part of the Ryzen Z1 (non-extreme) CPU used in ASUS’s ROG Ally handheld (2x Zen 4 + 4x Zen 4c), but now it’s being properly introduced in Ryzen laptop chips.
The first of the two new Ryzen 7040U processors with Zen 4c is the Ryzen 5 7545U, which has similar specs to the 7540U; the main difference is the Ryzen 5 7545U's unusual implementation of two full-fat Zen 4 cores and four Zen 4c cores. The second, the Ryzen 3 7440U, directly supersedes the originally announced Ryzen 3 7440U and swaps most of its Zen 4 CPU cores for Zen 4c cores.
At the launch of Bergamo, AMD explained that the ‘c’ in Zen 4c stands for ‘Cloud,’ and it’s interesting that AMD chose to integrate Zen 4c in a low-end consumer segment. Comparing Zen 4 and Zen 4c’s core architectures and use cases is intriguing due to their significant differences.
Understanding the technicalities is crucial given AMD’s strategic decision to bring Zen 4c-based parts to consumers with the Ryzen 7040U series. With the Ryzen 5 7545U and Ryzen 3 7440U with Zen 4c hitting the market, understanding the decision is almost as important as extrapolating the performance and capability benefits.
The Zen 4c core is a smaller, feature-identical implementation of Zen 4. It uses denser libraries that can’t clock as high but make the core smaller and more power efficient.
On TSMC’s 5nm process, a conventional Zen 4 core and L2 cache have an area of 3.84 mm², similar to AMD’s initial Zen 4c EPYC chips. Compacting the same architecture on a 5nm process yields a smaller Zen 4c core, measuring 2.48 mm², or 35% smaller. Zen 4c’s basic goals are the same for servers and mobile: a smaller footprint lets AMD fit more cores into the same package. Alternatively, to fit the same CPU cores on a smaller, cheaper die.
AMD went both ways with its EPYC server designs. Due to their higher density/smaller size, the EPYC 97×4 'Bergamo' chips have up to 128 Zen 4c cores, 32 more than the top EPYC 9004 'Genoa' chips. AMD's "budget" EPYC 8004 'Siena' chips, with 64 Zen 4c cores over 4 CCDs, were cheaper and lower power.
Enabling this smaller Zen 4 requires switching from high-performance libraries and high clockspeeds to high-density libraries. Despite those plumbing changes, Zen 4 and Zen 4c cores have the same features, core IPC, and L2 cache per core for mobile chips. Every feature and buffer, down to SMT, is present and runs clock-for-clock alike.
However, CPU performance is also about clockspeeds, so Zen 4’s area budget was spent on enabling it to clock over 5GHz. High-performance libraries perform well but are not space-efficient. However, high-performance libraries provide the space and features needed to reach chart-topping clockspeeds.
Zen 4c cores performance-wise are identical to Zen 4, but they have lower core clock frequencies. This, combined with high-density libraries’ lower power consumption, improves energy efficiency.
This brings us back to the Zen 4c-based Ryzen 7040U series. In their Ryzen 7040U series, AMD claims that using smaller Zen 4c cores with the same IPC is more power efficient at low TDPs, improving performance for sub-15W chips. The bottom of AMD’s mobile chip stack has lower performance, so we’re first seeing Zen 4c cores here.
Phoenix 2 vs. Phoenix
AMD’s new mobile chips use Phoenix 2, a monolithic silicon die. This is a budget version of AMD’s Ryzen 7000 mobile chips’ Phoenix die. It’s smaller with fewer functional blocks and uses Zen 4c CPU cores instead of Zen 4.
In fact, AMD has used Phoenix 2 before, most notably in the Ryzen Z1 series, which launched quietly over the summer. The Ryzen Z1 Extreme uses the Phoenix die, while the weaker Ryzen Z1 was the first chip to use Phoenix 2.
AMD making Phoenix 2 at all is impressive. Over the past few generations, the company has kept a light mobile footprint for various reasons; it has never issued two mobile dies for a single architecture before, always using different bins of the same die (e.g. all Rembrandt for Ryzen Mobile 6000). Making a second Phoenix chip with Zen 4c cores is a change from the norm: with multiple chips, AMD can reach more laptop buyers instead of only selling Phoenix (1) silicon at high prices. Whether this means AMD will have a wider market presence is unclear.
According to AMD, Phoenix 2’s die size is 137mm², which is 23% smaller than the original 178 mm² Phoenix die. AMD’s budget mobile chip saves die space by removing CPU cores (8x Zen 4 -> 2x Zen 4 + 4x Zen 4c), GPU CUs (12 -> 4), and the Ryzen AI NPU.
AMD could have used Zen 4c cores to pack more cores into a Phoenix-sized die, but they are starting small with smaller chips.
Ryzen 5 7545U and 7440U chips
The Ryzen 5 7545U, a 2x Zen 4 + 4x Zen 4c chip (essentially a Phoenix 2), will headline these new chips. Zen 4c aside, this is the same configuration as the Ryzen 5 7540U, so it will be replaced by the 7545U.
The Ryzen 5 7545U’s peak turbo frequency of 4.9GHz is unchanged because AMD includes two full-fat Zen 4 cores.
Other than that, AMD claims both chips share a 3.2 GHz base frequency, meaning the Ryzen 5 7545U has 3.2 GHz Zen 4 and Zen 4c cores. The Ryzen 5 7545U shares 16 MB of L3 cache and 1 MB of L2 cache per core (6 MB).
The second AMD chip with Zen 4c is the Ryzen 3 7440U, which has one Zen 4 and three Zen 4c cores. This budget AMD mobile stack part has the fewest CPU cores (4 total) and the lowest peak clockspeed (4.7GHz for a single core). Besides CPU cores, the chip has 8 MB of L3 cache shared between the cores, 1 MB of L2 cache per core (4 MB total), and AMD’s RDNA 3-based Radeon 740M with four CUs clocked up to 2.5 GHz.
It’s still confusing to distinguish between the ‘new’ and ‘old’ Ryzen 3 7440U. AMD’s website still lists the original 7440U SKU as a pure Zen 4 part from May. AMD doesn’t distinguish between the original Ryzen 3 7440U and the new one with Zen 4c cores; both are the 7440U. AMD confirmed that the Ryzen 3 7440U was always one Zen 4c-based SKU. Despite AMD’s catalog entry, the official line is that the Zen 4-based Ryzen 3 7440U SKU doesn’t exist and that there has always been one.
Unsaid: Zen 4c CPU Cores and Clockspeeds
AMD has treated this, its first use of silicon-heterogeneous CPU cores in consumer processors, lightly. AMD markets and documents Zen 4c as Zen 4 because it has the same IPC. While there's something to be said for simplifying things for the masses, AMD's briefing left us with some reservations and concerns about what wasn't said.
AMD clearly states that none of the Zen 4c chips clock higher than 3.1GHz, 1.3GHz (30%) slower than the fastest Genoa chip (9174F) on its server processors. On their consumer chips, AMD only discloses the max turbo clockspeed for the regular Zen 4 core(s) and the base clockspeed for the entire chip. The fastest 7545U is 3.2GHz.
AMD's server chips have different clockspeed guarantees than their consumer chips. On servers, the company guarantees any CPU core can reach the same max clockspeed (if not all at once), while on the consumer side there are favored cores, allowing the best couple of cores to turbo higher.
All of AMD's disclosures suggest that Zen 4c won't clock above 3GHz, as expected.
This is a major difference from a Phoenix (1) chip with Zen 4 cores. Phoenix can get all 8 cores to 4GHz+ under the right power and thermal conditions, but Zen 4c's lower clockspeed ceiling is a hard limit. Zen 4c is simply much slower than Zen 4.
In practice, things won’t be so different. AMD’s performance graphs from their slides are accurate, and a 6/8 core Zen 4 setup can’t reach those clockspeeds in a 15W device. Phoenix 2 is likely more efficient and scores higher in multithreaded scenarios.
However, AMD is not helping itself by not disclosing the Zen 4c cores' maximum clockspeeds. AMD wants to hide the differences, but Zen 4 and Zen 4c are different CPU cores. Zen 4c is AMD's efficiency core and should be treated as such, which means its clockspeeds must be disclosed separately from the other cores', as Intel and Qualcomm do today.
Based on AMD's messaging, would you have known that the Ryzen 3 7440U has only 1 full Zen 4 CPU core? How are consumers supposed to be informed?
Zen 4c is unique in offering identical IPC to Zen 4, but AMD is hurting its customers by ignoring the differences. IPC and clockspeeds determine CPU performance, so both must be considered. Since AMD has their own efficiency cores, AnandTech believes they should disclose the CPU cores and clockspeeds in their products. Anything less risks deceiving customers, even if AMD doesn’t mean to.
Read more on Govindhtech.com
0 notes
techphyte · 4 years
Photo
AMD announces Ryzen 3 3100 and Ryzen 3 3300X desktop CPUs. Both are 4-core, 8-thread processors with SMT, a 65W TDP, and 18MB of cache (L2 + L3). The difference is clock speed: the 3100 is clocked at 3.6GHz base & 3.9GHz boost, while the 3300X is clocked at 3.8GHz base & 4.3GHz boost. The 3100 is believed to be around $99 USD while the 3300X will be around $120 USD. Available for sale on May 21. Photo Source: Christian Wiediger #cpu #cpugaming #amd #ryzen #motherboard #gaming #nvidia #msi #intel https://www.instagram.com/p/B_XfsM6Dfm1/?igshid=14yjxbk6o0v0e
0 notes
dailytechnologynews · 6 years
Photo
Numa Numa Yeah: How to Manage Huge-Core Counts in HEDT Builds
Most computer builds here, in /r/buildapc, and around the net are focused on "client" computers. That is, either the Intel i7-8700k or Ryzen 2700x based builds. But as we go towards the high-end space: AMD Threadripper 1950x or Intel i9-7900x (and beyond to Xeons and EPYC), the amount of information available online drops off dramatically, especially how it relates to Windows.
If anyone out there is actually willing to spend $500+ to $1000 on the CPU alone, I hope this information will be useful.
Introduction with Ryzen 2700x or Intel i7-8700k: Client systems and their limitations.
I think most people understand client-systems. But I'll type it out to catch people up to speed. Feel free to skip this section if you know stuff already.
Sad note: While Windows has an elegant job / process / thread model, it's difficult to learn. In fact, "Task Manager" (Ctrl-Shift-Esc) doesn't really show you thread or process information very well. Use Process Explorer if you want to know the truth of your system.
Windows has "jobs" at the highest level. I'm going to ignore jobs: this post is too long already.
Windows has "processes", which are basically the programs you run (ex: Chrome). Every process is composed of multiple "threads", each thread potentially executing on its own core simultaneously. For example: in Chrome, one thread would be talking to the network, while a 2nd thread might be talking to your GPU to display the actual webpage, while a 3rd thread might be running the Javascript that makes a website interactive.
Every thread in Windows runs until it is forced to wait. For example, your Web Browser like Chrome will be forced to wait for the network before it can display a webpage. Or the Javascript thread will be waiting for you to push the left-mouse button. Every thread runs until it is forced to wait. Windows keeps track of which threads are waiting for what, and efficiently switches between them.
Threads which are "ready" to run are scheduled to run on a core. A Ryzen 2700x has 16-logical cores, or Intel i7-8700k has 12-logical cores. This means that the CPU can execute up to 16 (2700x) or 12 (8700k) things "simultaneously", for a peculiar definition of simultaneous.
The CPU itself "lies" to the OS to some degree. There are really only 8 cores on Ryzen 2700x or 6 cores on Intel. This is called Hyperthreading (Intel) or SMT (AMD). In effect, Hyperthreading / SMT allows the CPU to perform these task switches instead of Windows. Think of it as hardware-assisted task switching. The CPU will physically only execute 8 things at a time, but it can very efficiently switch between the 16 tasks that Windows gives it. Due to the hardware-assisted nature and incredible speed of these task switches, the CPU core can schedule the "hyperthread-brother" to execute while its sibling waits the ~10 nanoseconds for L3 cache to respond. In contrast: a Windows task switch can take on the order of 10s of microseconds, or up to 10,000x longer. (Sorta-kinda not really. There are more details such as super-scalar, multiple execution ports, register stalls and the like. But consider this paragraph to be a basic "conceptual" outline of what is going on.)
Scalability concerns on client systems
This is all good and all: but we start to hit scalability issues as we push for more cores:
The GHz barrier means cores rarely get faster anymore. Historically, Intel / AMD offered faster cores to make faster computers. But physical limitations are being reached, and it's becoming less and less possible to make faster cores. Instead, Intel / AMD want to offer more cores (at roughly the same speed) to the users.
Cores need to communicate. For example: when Chrome's "Network" thread reads from the internet, it needs to pass the data to the renderer to display the webpage. When the "Javascript" thread notices you clicked on the mouse, it needs to talk to the renderer (to display "button down"). This communication requires synchronization and time, and is built from a concept called "Cache Coherency". As such, more and more cores require a more complicated cache-coherency model.
As your cores do more and more work, the CPU starts to wait on memory more and more. You can't do work unless you can access data more quickly. To actually go faster, you'll need to talk to more memory faster somehow. As you'll soon see, solving this problem makes things complicated.
These problems give rise to the high-end build. AMD Threadripper and Intel i9-extreme / Intel Scalable platform. In short: HEDT builds give the user more memory-units and MANY more cores to play with. But this 2nd memory unit grossly complicates the architecture.
NUMA: Scaling above and beyond SMT
NUMA stands for "Non-uniform memory access", and it is what happens when multiple memory units begin to interact with the CPU. Internal to the CPU are the memory-controllers, which talk to your DDR4 RAM. It's easy to forget, but everything in a CPU is physically located somewhere, and physically, some cores will be closer to one memory unit. It's most obvious to see in the Intel i9-7980x die-shot. You can see that some cores can talk to one of the memory controllers faster than the other memory controller. Yes, these cores need to be laid out on the chip, and at the microscopic level, these differences can have an impact on performance.
All a "NUMA Node" is, are the group of processors that are "close" to one memory controller. In both Ryzen 1950x and Intel i9-7900x, there are two memory units / NUMA Nodes. Sure, Ryzen 1950x has "CCX" grouped by "infinity fabric", while Intel i9-7900x has a "mesh network" broken into "Sub-NUMA Clustering". But these details are abstracted away into simply "NUMA Node #0", and "Numa Node #1".
Furthermore, Core #0 and Core #2 (both on NUMA Node #0) can communicate extremely efficiently, while Core #0 and Core #18 (one on NUMA #0 and the other on NUMA #1) will talk more slowly. By exposing the "truth" of this matter to programmers and system administrators, NUMA allows the users and OS to more efficiently schedule your programs. Of course, Core #0 can talk to either NUMA Node #0 or #1, but the fact remains that Core #0 can talk to Node #0 faster than NUMA Node #1. Keeping programs efficient is the job of the Operating System. And the OS has sane defaults as you can see, but understanding how the computer works can lead to smarter use of your computer.
The most important thing to keep in mind: NUMA is an abstraction. Threadripper 1950x has 4 CCX and 16-cores, but associates these with two NUMA nodes. Perhaps AMD could have made it four-NUMA nodes (one per CCX), but for whatever reason, 2x NUMA nodes is how AMD decided to lay out the 1950x. A big portion of the AMD Community focuses on inter-CCX latencies. With NUMA-controls in Windows, you, the administrator, can control CCX-latencies directly (although not on a per-CCX level. But at least on a per-memory unit level).
Intel i9-7900x also have latency issues, but to a smaller degree than Threadripper. And thus, NUMA controls apply to the HEDT Intel builds as well.
Window's Job: Scaling across NUMA
Windows is a NUMA-ready operating system, but how so is rarely discussed. Details are included in the EXCELLENT Windows Internals books (two parts divided into two books) if you're interested in learning all of these details. The NUMA-relevant portion is divided into two parts: Memory (chapter 10 from the 2nd book), and Scheduling (Chapter 5 from the 1st book). Memory is pretty obvious: NUMA allocations occur on one memory controller or the other. Software can control this, but otherwise, memory and NUMA is obvious (although tedious to do at the programming level).
The Windows scheduler however, is the "interesting" portion and is rarely discussed. The scheduler works as follows
Processes are assigned an "Ideal NUMA Node", and it alternates between your NUMA Nodes. For example, if you open up MS Word, MS Word will be assigned to "Ideal NUMA Node #0". When you open up Chrome, then it goes to "Ideal Numa Node #1". Open up Paint, and it goes back to NUMA Node #0. (Or if you are insane and bought a $6000+ computer: it goes to NUMA Node #2). Processes alternate between NUMA nodes. Whenever Processes ask for memory, Windows will default to memory located on the "Ideal" NUMA node.
Threads are assigned an "Ideal Core" that exists on a NUMA Node. In Threadripper, there are 16-logical per NUMA node (8-physical, x2 from SMT). Windows is smart enough to assign threads to physical cores first. So Chrome in this example would be allocated to NUMA Node #1 (Because MS Word was on Node 0 already). Then, Chrome's threads would be set to an ideal core #0, #2, #4, #6 of Numa#1... in that order. (Skipping over the "Hyperthread" siblings). After all, half the cores are "fake", and Windows optimizes as appropriate. Continuing this example, NUMA #0 has MS Word and Paint running on it. Paint would be assigned Core #1, Core #3, Core #5, etc. etc. while MS Word would be assigned Core #0, Core #2, Core #4.
"Ideal NUMA" and "Ideal Core" are hints to Windows. If Windows notices that a physical core is idle, it will often migrate the task to the non-ideal core. Sure, the thread will run slower, but that's better than not running at all. This step is incredibly complicated: Windows is juggling power-efficiency (for laptops), frequency, Turbo-boosts, NUMA, multiple threads and more. So if a core is "parked" for power-efficiency, Windows may decide to keep the core sleeping to save power. But in the typical case, Windows will try to keep as many cores active as possible.
These are the default settings of Windows. Programmers are able to customize their programs to change arbitrarily. Administrators can set "Thread Affinity", which allows you to LOCK a thread to a particular core. But as you can see, the default settings generally lead to good utilization of all NUMA Nodes and all cores, physical and logical.
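For the curious, Windows exposes this NUMA topology through the Win32 API. A minimal C sketch (real kernel32 calls) that prints each node's processor mask:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        ULONG highest;
        GetNumaHighestNodeNumber(&highest); // e.g. 1 on a two-node Threadripper
        for (UCHAR node = 0; node <= highest; node++) {
            ULONGLONG mask;
            GetNumaNodeProcessorMask(node, &mask);
            printf("NUMA node %u: processor mask 0x%llx\n", node, mask);
        }
        return 0;
    }

On a two-node system, this should report two masks covering opposite halves of the logical cores.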
Note: Any user can play with affinity masks by simply hitting Ctrl-Shift-Esc to open Task Manager, then right-clicking on threads (in Win10) or "Processes" in Win7 (Spoiler alert: these are really threads even though Win7 calls them processes... confusing, isn't it??) to set their affinity. If you use "start /NODE" or "start /AFFINITY" through the command line, you can set a program's NUMA node or affinity as it starts up. I suggest learning some .bat scripting to make "start /NODE" more convenient for yourself.
While Windows has very sane default settings, Windows optimizes these programs for low latency. There is a slight issue: some major programs require tons of memory, are NOT NUMA-aware, and benefit from bandwidth instead of latency. These programs would prefer to allocate memory from both NUMA nodes. Sure, accessing some memory will be slower, but at least both memory units can work together on the same problem, almost like RAID0 for disks.
UMA by default: "Creator" mode Memory-interleaving
The solution to the problem described in the previous section is to turn off NUMA. In fact, many BIOSes default to UMA, called "Creator Mode" by AMD. When in UMA mode or "Memory-Interleaving" mode, every thread will try its best to take advantage of both memory nodes, alternating between them as if they were in RAID0. But half of the memory requests will be done quickly, while the other half will have a cross-NUMA delay. In many "creative" applications, the latency increase is not a big deal.
Unfortunately: this causes a problem in video games, as video game latency (CAS-latency) can have a real effect on performance.
Choosing UMA mode vs NUMA mode is therefore an interesting exercise in computer configuration. Some programs may run faster in NUMA mode (if they care more about latency, which Windows optimizes for), while other programs may run faster in UMA mode.
Intel systems have "Sub-NUMA Clustering" instead. Turn it on for the NUMA-like behavior, turn it off to have memory-interleaving.
HEDT is about the 2nd Memory Unit: offering more scalability by talking to more RAM independently.
So now we've come full circle to the most important question. Do you need a HEDT build? What does HEDT get you?
Video Games are often designed for client systems -- It seems like NUMA / HEDT systems can play video games just fine, but video games are unlikely to take advantage of the high-end features of HEDT processors. As such, HEDT systems at best perform equally in video games, and in many cases can perform worse (!!) due to lower clocks and increased latencies. "Creator Mode" or "UMA mode" definitely slows down video games in the typical case.
Multi-tasking -- While Video Games themselves may not perform any better, having a 2nd NUMA Node for multi-threading purposes can lead to benefits. Streamers can run their video game on NUMA Node #0, while they run x264 + OpenBroadcaster Studio on NUMA Node #1 (see the sketch after this list). You can "force" applications to a NUMA node by thread affinity (Ctrl-Shift-Esc), or with the "start /NODE" command through the Windows command line. The two programs will run independently, with independent memory controllers on separate cores, for maximum multi-tasking efficiency. This takes a bit of effort (or luck with the Windows Scheduler), but leads to incredibly efficient results.
NUMA-aware apps -- It seems like some applications are NUMA-aware directly, such as Matlab or SAS. In these cases, Matlab code will efficiently run across NUMA-nodes in as efficient manner as possible.
UMA / Memory-interleaving Creative apps -- Older applications may not be NUMA-aware, but they can benefit from the default "UMA Mode" (or "Creator Mode") that these systems ship with. The increased bandwidth in this case is a bigger benefit than the latency slowdown.
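As referenced in the multi-tasking item above, here is a minimal sketch of the streamer scenario using the real /NODE flag of the start command (program names are hypothetical):

    start /NODE 0 game.exe
    start /NODE 1 obs64.exe

Each process then allocates memory from its own node by default, and its threads are preferentially scheduled on that node's cores.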
In short: the high-end Desktop space adds a 2nd set of memory controllers (increasing your memory-bandwidth), which leads to a major complication of how your computer works. In many cases, a HEDT will perform better. But in many other cases, this 2nd memory controller may not be able to help you play video games any faster. Only buy an HEDT system if you know you can use that 2nd set of memory controllers.
HEDT isn't strictly superior. Its certainly more expensive, and even has a bandwidth / latency tradeoff due to these settings. Windows will do its best to automatically manage the system, but HEDTs will really shine if you put forth the effort to learn their niche and how to control them properly. I hope this post is a good stepping stone to understanding your HEDT system (or at least, whether or not you want to get one).
Misc Links
Cool Intel Forum Post on NUMA -- https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/392519
Official NUMA Docs from Microsoft -- https://msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx
Official NUMA Docs for Red Hat Linux -- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/virtualization_tuning_and_optimization_guide/chap-virtualization_tuning_optimization_guide-numa
1 note · View note