Arch Guidelines for Performance & Scalability
Static Not Dynamic
Cached
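As one minimal illustration of caching, here is a sketch using Python's stdlib memoization; `expensive_lookup` is a hypothetical stand-in for any slow computation or remote fetch:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)  # cache up to 1024 distinct argument tuples
def expensive_lookup(key):
    # stand-in for a slow computation or remote fetch
    return sum(ord(c) for c in key)

expensive_lookup("alpha")                  # computed
expensive_lookup("alpha")                  # served from cache
print(expensive_lookup.cache_info().hits)  # → 1
```

The same shape (bounded key/value store, evict least-recently-used) applies at every layer, from in-process memoization up to a dedicated cache tier.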
Distributed
Parallelized
(eg. a Map/Reduce architecture)
Asynchronous
Incremental
Queues and Worker Pools
(queues themselves don't necessarily help performance directly -- in fact, the overhead of enqueue/dequeue may hurt it slightly if considered only in isolation, in the small, and under ideal smooth, flat, low-traffic conditions -- but they can help performance, scale, and reliability/availability overall by putting a ceiling on host resource usage; they enforce this ceiling with a pool of worker threads of fixed or maximum size, giving worker threads and their host machines only as much active work as they can handle and nothing more -- typically by having the workers 'pull' requests onto their plate only when they are ready, rather than having work 'pushed' at them; excess work requests are simply noted in memory temporarily, and ideally persisted to a durable store if they must never be lost; without queues or worker pools, especially with blind round-robin load balancing across the frontmost work nodes, a box can receive so many work requests, and spin up so many processes or threads to work them concurrently, that in the worst case both CPU load and memory pressure go through the roof, causing latencies to skyrocket and throughput to drop toward zero; typically all real queues have some max capacity and so must turn away new additions, though usually gracefully, allowing the requesting client (or message passer) to retry later if/when it wishes; note that network traffic buffers are somewhat queue-like in this respect, except that network packet persistence is almost nonexistent by design)
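The pull-based worker pool described above can be sketched with Python's stdlib; the bounded queue, the pool size of 4, and the squaring "work" are illustrative assumptions:

```python
import queue
import threading

task_q = queue.Queue(maxsize=100)  # bounded: enforces a ceiling on pending work
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = task_q.get()          # workers *pull* work only when ready
        if item is None:             # sentinel: shut down this worker
            task_q.task_done()
            break
        with results_lock:
            results.append(item * item)  # stand-in for real work
        task_q.task_done()

NUM_WORKERS = 4  # fixed pool size caps concurrent load on this host
threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for n in range(10):
    # put() blocks when the queue is full; put_nowait() would instead
    # reject gracefully, letting the requesting client retry later
    task_q.put(n)

task_q.join()              # wait until all enqueued work is done
for _ in threads:
    task_q.put(None)       # one sentinel per worker to stop the pool
for t in threads:
    t.join()

print(sorted(results))     # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Note that no matter how fast requests arrive, at most `NUM_WORKERS` are ever in flight and at most `maxsize` are pending, which is exactly the resource ceiling the item above describes.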
Step Minimization (ie. do less)
Optimal Algorithms
(eg. search in O(n) rather than O(n^2))
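A concrete instance of the O(n) vs O(n^2) contrast, using duplicate detection as an assumed example task:

```python
def has_duplicate_quadratic(items):
    # O(n^2): compare every pair
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    # O(n): a set gives expected O(1) membership tests
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

data = list(range(10_000)) + [42]   # contains one duplicate
print(has_duplicate_linear(data))   # → True, after a single pass
```

Both functions return the same answers; only the growth rate differs, which dominates at scale.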
Event-Driven Not Polled
(ie. don't use cpu/mem to just "run in place")
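A small sketch of the event-driven alternative to polling, using a stdlib `threading.Event` (the producer/consumer roles here are illustrative):

```python
import threading
import time

data_ready = threading.Event()
result = []

def consumer():
    # Event-driven: the thread sleeps inside wait() using no CPU and is
    # woken only when set() is called -- unlike a
    # `while not flag: time.sleep(0.01)` polling loop that burns cycles
    # just to "run in place".
    data_ready.wait()
    result.append("processed")

t = threading.Thread(target=consumer)
t.start()
time.sleep(0.1)    # producer does other work meanwhile
data_ready.set()   # signal: wakes the waiting consumer immediately
t.join()
print(result)      # → ['processed']
```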
Non-Blocking IO
(can help reduce "idle" memory use/swaps, and other held/shared resources like file descriptors, pool connections, etc.)
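One way to get this effect in Python is `asyncio`, sketched below; the `fetch` coroutine and its delays are stand-ins for real network calls:

```python
import asyncio

async def fetch(name, delay):
    # stand-in for a network call; `await` yields control so other tasks
    # run, instead of a blocked thread holding memory and descriptors
    await asyncio.sleep(delay)
    return name

async def main():
    # three "requests" overlap on one thread: total time is ~0.1s, not 0.3s
    return await asyncio.gather(
        fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1)
    )

results = asyncio.run(main())
print(results)  # → ['a', 'b', 'c']
```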
Paginated Results
(eg. don't show all 1000 results, only 10 at once, with Next links)
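The 10-at-a-time example above can be sketched as a simple offset/limit helper (the 1-based `page` convention is an assumption):

```python
def paginate(items, page, per_page=10):
    # page is 1-based; return one page plus whether a Next link applies
    start = (page - 1) * per_page
    chunk = items[start:start + per_page]
    has_next = start + per_page < len(items)
    return chunk, has_next

all_results = list(range(1000))     # imagine 1000 matching rows
page1, more = paginate(all_results, 1)
print(len(page1), more)             # → 10 True
```

In a real system the slicing would happen in the data store (eg. SQL LIMIT/OFFSET or a cursor), so the other 990 rows are never fetched or rendered at all.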
Web Page Request Minimization
Network Locality
(one of the main points to using a CDN)
Machine Task-to-Data Locality
(eg. a big part of what Hadoop provides, in addition to Distributed, Parallelized and Queued)
Prediction Prefetch
(term originally associated with a CPU trick, but you can also guess end-user requests beforehand: fetch sneakily behind the scenes, then present fresh results "instantly" when actually requested)
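A toy sketch of the guess-ahead idea; the `fetch` function and the naive "next page" heuristic are both hypothetical stand-ins:

```python
cache = {}

def fetch(page_id):
    # stand-in for an expensive fetch/render
    return f"content-{page_id}"

def serve(page_id):
    # serve from the prefetch cache if we guessed right, else fetch now;
    # then sneakily prefetch the page we *predict* comes next
    content = cache.pop(page_id, None) or fetch(page_id)
    cache[page_id + 1] = fetch(page_id + 1)  # naive prediction heuristic
    return content

serve(1)           # fetched on demand; page 2 prefetched behind the scenes
print(2 in cache)  # → True  (a request for page 2 is now "instant")
```

In practice the prefetch would run asynchronously (so it never delays the current response) and the prediction would come from access patterns rather than simple increment.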
Beefier Hardware
Doing Tasks in Hardware Rather than Software
(eg. cards/chips dedicated to doing graphics, crypto, network, physics)
Leaner Languages/Runtimes
(eg. if a task path is CPU-bound and/or memory-constrained, consider writing that portion in C rather than Ruby)
Tuning OS Parameters
Custom OS Kernel Build
Minimize Context Switching
(by a CPU's thread scheduler)
Avoid Swapping, Especially Thrashing
(modern OSs with virtual memory, like Linux, will swap unused memory chunks to disk, then read them back in again as needed; if the total actively needed memory exceeds physical memory, the system can go into a repeated swap-out/swap-in cycle, an even bigger hit, which will often take the box past the point of usability, requiring a restart)
Minimize Chattiness of Comm Protocols
Pass Only Small Messages
Avoid Lock Contention
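One common contention-reduction pattern is lock striping, sketched below; the shard count and the counter workload are illustrative assumptions:

```python
import threading

NUM_SHARDS = 16

class ShardedCounter:
    # lock striping: each shard has its own lock, so threads touching
    # different keys rarely contend -- unlike one coarse lock over
    # everything, which serializes all writers
    def __init__(self):
        self.shards = [{} for _ in range(NUM_SHARDS)]
        self.locks = [threading.Lock() for _ in range(NUM_SHARDS)]

    def _index(self, key):
        return hash(key) % NUM_SHARDS

    def incr(self, key):
        i = self._index(key)
        with self.locks[i]:
            self.shards[i][key] = self.shards[i].get(key, 0) + 1

    def get(self, key):
        i = self._index(key)
        with self.locks[i]:
            return self.shards[i].get(key, 0)

c = ShardedCounter()
for _ in range(100):
    c.incr("hits")
print(c.get("hits"))  # → 100
```

The other standard tools here are holding locks for the shortest possible span and preferring lock-free/atomic primitives where the runtime offers them.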
Pass/Store Diffs Rather Than Complete Snapshots
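A minimal sketch of diffing two state snapshots (here assumed to be flat dicts) so only the delta need be transmitted or stored:

```python
def diff(old, new):
    # record only keys that changed or were added, plus keys removed
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    removed = [k for k in old if k not in new]
    return changed, removed

def apply_diff(old, delta):
    # reconstruct the new snapshot from the old one plus the delta
    changed, removed = delta
    result = dict(old)
    result.update(changed)
    for k in removed:
        result.pop(k, None)
    return result

snap1 = {"a": 1, "b": 2, "c": 3}
snap2 = {"a": 1, "b": 5, "d": 4}
delta = diff(snap1, snap2)
print(apply_diff(snap1, delta) == snap2)  # → True
```

The delta carries two entries instead of the full four-key snapshot; the win grows with snapshot size and shrinks with churn rate.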
Client/Server Architecture
(since the server portion is already running and has completed initialization, and the client can be thin, fast-starting, and transient if needed)
Function Inlining
(only significant if the function body is very small and it is called a very large number of times)
Less Memory Churn and GC Inside Your App Process
TODO: add more detail for each item