Exposing these as intrinsics allows for more optimisation, but it's a good point: might inline assembly work?
I believe it does. The C headers call compiler built-ins that generally correspond closely to the intrinsics themselves. For example, here is an excerpt from clang 3.6.0's xmmintrin.h (the x86 SSE intrinsics):
static __inline__ __m64 __attribute__((__always_inline__, __nodebug__))
_mm_max_pu8(__m64 __a, __m64 __b)
{
  return (__m64)__builtin_ia32_pmaxub((__v8qi)__a, (__v8qi)__b);
}

static __inline__ __m64 __attribute__((__always_inline__, __nodebug__))
_mm_min_pi16(__m64 __a, __m64 __b)
{
  return (__m64)__builtin_ia32_pminsw((__v4hi)__a, (__v4hi)__b);
}
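
To make the pattern concrete without depending on x86, here is a rough sketch of the same wrapper shape using target-independent built-ins. It assumes GCC or Clang; the names `my_popcount` and `my_bswap32` are hypothetical and do not come from any real header.

```c
#include <stdint.h>

/* Hypothetical wrappers (names invented for illustration), assuming GCC or
   Clang. Each is forced inline and simply forwards to a compiler built-in,
   mirroring the shape of the xmmintrin.h definitions. */
static __inline__ int __attribute__((__always_inline__))
my_popcount(unsigned int x)
{
    /* The compiler knows what the built-in computes, so it is free to
       optimise calls to it like any other expression. */
    return __builtin_popcount(x);
}

static __inline__ uint32_t __attribute__((__always_inline__))
my_bswap32(uint32_t x)
{
    /* Reverses the byte order of a 32-bit value. */
    return __builtin_bswap32(x);
}
```

For instance, `my_popcount(0xFFu)` evaluates to 8. Because the optimiser can see through the built-in, a call with a constant argument can typically be folded at compile time, which is not generally possible with an opaque inline assembly block.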