BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160544Z
LOCATION:Track 10
DTSTART;TZID=America/New_York:20201113T115000
DTEND;TZID=America/New_York:20201113T121500
UID:submissions.supercomputing.org_SC20_sess230_ws_waccpd104@linklings.com
SUMMARY:Performance and Portability of a Linear Solver Across Emerging Arc
 hitectures
DESCRIPTION:Workshop\n\nPerformance and Portability of a Linear Solver Acr
 oss Emerging Architectures\n\nWalden, Zubair, Nielsen\n\nA linear solver a
 lgorithm used by a large-scale unstructured-grid computational fluid dynam
 ics application is examined for a broad range of familiar and emerging ar
 chitectures. Efficient implementation of a linear solver is challenging on
  recent CPUs offering vector architectures. Vector loads and stores are es
 sential to effectively utilize available memory bandwidth on CPUs, and m
 aintaining performance across different CPUs can be difficult in the face 
 of varying vector lengths offered by each. A similar challenge occurs on G
 PU architectures, where it is essential to have coalesced memory accesses 
 to utilize memory bandwidth effectively. In this work, we demonstrate that
  restructuring a computation, and possibly data layout, with regard to arc
 hitecture is essential to achieve optimal performance by establishing a pe
 rformance benchmark for each target architecture in a low-level language s
 uch as vector intrinsics or CUDA. In doing so, we demonstrate how a linear
  solver kernel can be mapped to Intel Xeon and Xeon Phi, Marvell ThunderX2
 , NEC SX-Aurora TSUBASA Vector Engine, and NVIDIA and AMD GPUs. We further
  demonstrate that the required code restructuring can be achieved in highe
 r-level programming environments such as OpenACC, OCCA and Intel OneAPI/SY
 CL, and that each generally results in optimal performance on the target a
 rchitecture. Relative performance metrics for all implementations are show
 n, and subjective ratings for ease of implementation and optimization are 
 suggested.\n\nRegistration Category: Workshop Reg Pass
END:VEVENT
END:VCALENDAR

