Part 2: Inline Assembler

2. Modify add.c to calculate b mod a using inline assembler and print the result.

```
#include <stdio.h>
// A very simple example of inline assembler.
// On AArch64, this code calculates c = b mod a, then
// prints the value of c.
//
int main() {
    int a = 3;
    int b = 19;
    int c;
    // __asm__ ("assembly code template" : outputs : inputs : clobbers)
    // Original addition example:
    // __asm__ ("add %0, %1, %2" : "=r"(c) : "r"(a),"r"(b) );
    // The w modifier selects the 32-bit register names that match int operands.
    // c = b / a (unsigned divide; fine for these non-negative values)
    __asm__("udiv %w0, %w1, %w2" : "=r"(c) : "r"(b), "r"(a) );
    // c = b - c * a, i.e. the remainder
    __asm__("msub %w0, %w1, %w2, %w3" : "=r"(c) : "r"(c), "r"(a), "r"(b) );
    printf("%d\n", c);
    return 0;
}
```
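
The divide-then-multiply-subtract pattern used above can be wrapped in a function and checked against plain C. The following is a minimal sketch, not part of the lab files: the name `mod_udiv_msub` is hypothetical, and the `#else` branch is an assumed fallback so the same source also builds on machines that are not AArch64.

```c
#include <assert.h>

// Hypothetical helper: remainder of b / a computed the same way as the
// udiv + msub pair, with a plain C fallback on other architectures.
unsigned int mod_udiv_msub(unsigned int b, unsigned int a) {
    assert(a != 0); // AArch64 udiv yields 0 for a zero divisor; C's / is undefined
    unsigned int q, r;
#if defined(__aarch64__)
    // q = b / a
    __asm__("udiv %w0, %w1, %w2" : "=r"(q) : "r"(b), "r"(a));
    // r = b - q * a
    __asm__("msub %w0, %w1, %w2, %w3" : "=r"(r) : "r"(q), "r"(a), "r"(b));
#else
    q = b / a;
    r = b - q * a;
#endif
    return r;
}
```

Calling `mod_udiv_msub(19, 3)` should give the same value as `19 % 3` in C.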

Result:

```
[cle@aarchie simd_lab]$ time ./add
1
real 0m0.005s
user 0m0.001s
sys 0m0.004s
```

3. vol_inline.c contains a version of the volume scaling problem which uses inline assembler and the SQDMULH instruction. Copy, build and verify the operation of this program.
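
To make the arithmetic behind SQDMULH concrete, here is a scalar sketch of what the instruction computes for a single 16-bit lane. It is not taken from vol_inline.c: the function names are hypothetical, and the conversion of a volume factor to a fixed-point integer by multiplying by 32767 is an assumption about how such a program would typically prepare its scaling factor.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of one SQDMULH lane: saturate(2 * a * b) >> 16,
// which equals floor(a * b / 32768) except for the one saturating case.
int16_t sqdmulh_s16(int16_t a, int16_t b) {
    if (a == INT16_MIN && b == INT16_MIN) {
        return INT16_MAX; // 2 * 2^30 overflows 32 bits, so the result saturates
    }
    int32_t product = (int32_t)a * (int32_t)b;
    int32_t q = product / 32768;          // C division truncates toward zero
    if (product % 32768 < 0) {
        q -= 1;                           // adjust to floor, matching the high half
    }
    return (int16_t)q;
}

// Assumed conversion from a volume factor in [0.0, 1.0] to Q15 fixed point.
int16_t volume_to_q15(double volume) {
    assert(volume >= 0.0 && volume <= 1.0);
    return (int16_t)(volume * 32767.0);
}
```

Under these assumptions, scaling a sample `s` by 0.75 would be `sqdmulh_s16(s, volume_to_q15(0.75))`. The instruction performs the same operation on eight samples at once.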

Default number of samples:

```
vol.h
#define SAMPLES 5000000
```

Results:

```
[cle@aarchie simd_lab]$ time ./vol_inline
Generating sample data.
Scaling samples.
Summing samples.
Result: 930
real 0m0.521s
user 0m0.500s
sys 0m0.020s
```

Decreased number of samples:

```
vol.h
#define SAMPLES 500
```

Results:

```
[cle@aarchie simd_lab]$ time ./vol_inline
Generating sample data.
Scaling samples.
Summing samples.
Result: 152
real 0m0.005s
user 0m0.001s
sys 0m0.004s
```

Increased number of samples:

```
vol.h
#define SAMPLES 90000000
```

Results:

```
[cle@aarchie simd_lab]$ time ./vol_inline
Generating sample data.
Scaling samples.
Summing samples.
Result: 713
real 0m9.404s
user 0m8.999s
sys 0m0.379s
```

Part 3: C Intrinsics

1. Default Result (changed SAMPLES in vol.h back to 5000000):

```
[cle@aarchie simd_lab]$ time ./vol_intrinsics
Generating sample data.
Scaling samples.
Summing samples.
Result: 930
real 0m0.522s
user 0m0.500s
sys 0m0.020s
```

Increased number of samples:

`#define SAMPLES 80000000`

Results:

```
[cle@aarchie simd_lab]$ time ./vol_intrinsics
Generating sample data.
Scaling samples.
Summing samples.
Result: -219
real 0m8.294s
user 0m8.012s
sys 0m0.239s
```

Q1: What do these intrinsic functions do?

`vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));`
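
One way to read the nested call is to split it into its four steps. The sketch below does that in a hypothetical helper, `scale_8_samples`; the intrinsics are the ones from the line above, while the `#else` branch is an assumed scalar fallback with the same per-lane arithmetic so the code also builds where `arm_neon.h` is unavailable.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: scale the 8 samples at cursor in place by vol_int (Q15).
void scale_8_samples(int16_t *cursor, int16_t vol_int) {
#if defined(__aarch64__)
    int16x8_t samples = vld1q_s16(cursor);             // load 8 x int16 (128 bits)
    int16x8_t volume  = vdupq_n_s16(vol_int);          // copy vol_int into all 8 lanes
    int16x8_t scaled  = vqdmulhq_s16(samples, volume); // per lane: sat(2*a*b) >> 16
    vst1q_s16(cursor, scaled);                         // store the 8 lanes back
#else
    // Portable fallback with the same per-lane arithmetic.
    for (int lane = 0; lane < 8; lane++) {
        int32_t product = (int32_t)cursor[lane] * vol_int;
        int32_t q = product / 32768;
        if (product % 32768 < 0) q -= 1;               // floor, like the high half
        if (q > INT16_MAX) q = INT16_MAX;              // only for -32768 * -32768
        cursor[lane] = (int16_t)q;
    }
#endif
}
```

Each intrinsic maps to one step: load, broadcast, saturating doubling multiply that keeps the high half, and store.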

Q2: Why is the increment below 8 instead of 16 or some other value?
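
The value 8 follows from the register width: a 128-bit register holds 16 bytes, and each `int16_t` sample takes 2 bytes. The sketch below works through that arithmetic. It assumes the increment in question is applied to an `int16_t` pointer, so it counts elements rather than bytes; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q_REGISTER_BYTES 16 // one 128-bit SIMD register

// Number of elements of the given size that fit in one 128-bit register.
size_t lanes_per_q_register(size_t element_bytes) {
    assert(element_bytes > 0 && Q_REGISTER_BYTES % element_bytes == 0);
    return Q_REGISTER_BYTES / element_bytes;
}

// Bytes skipped when an int16_t pointer is advanced by the given element count.
size_t bytes_advanced(size_t elements) {
    int16_t buffer[16];
    assert(elements <= 16);
    return (size_t)((char *)(buffer + elements) - (char *)buffer);
}
```

Advancing an `int16_t` pointer by 8 elements moves it 16 bytes, which is exactly one register's worth of samples.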

Q3: Why is this line not needed in the inline assembler version of this program?

Q4: Are the results usable? Are they accurate?
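
One way to approach the accuracy question is to compare fixed-point scaling against a floating-point reference for every possible 16-bit sample. The sketch below does that under stated assumptions: the volume factor is converted with `volume * 32767.0`, the fixed-point result is the floor of `x * vol_int / 32768` (the arithmetic SQDMULH performs for non-negative factors), and the reference is the floating-point product truncated to an integer. The function name is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

// Largest absolute difference, over every int16_t sample, between Q15
// fixed-point scaling and a floating-point reference (both as integers).
int max_scaling_error(double volume) {
    int32_t vol_int = (int16_t)(volume * 32767.0);  // assumed Q15 conversion
    int worst = 0;
    for (int32_t x = INT16_MIN; x <= INT16_MAX; x++) {
        int32_t product = x * vol_int;
        int32_t fixed = product / 32768;
        if (product % 32768 < 0) fixed -= 1;        // floor, like SQDMULH's high half
        int32_t reference = (int32_t)(x * volume);  // float scaling, truncated
        int diff = abs(fixed - reference);
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

Under these assumptions, a factor of 0.75 gives a worst-case difference of one unit per sample, which is small relative to the 16-bit range.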