Sun Microsystems Inc. system solutions based on the AMD Opteron
processor have attracted worldwide attention due to their
outstanding performance, low prices, and unique performance/Watt
energy utilization. These AMD64-based systems have achieved 65
world records over the past 2 years
(http://www.sun.com/x64/benchmarks/).
The new Sun Fire X4100 and X4200 systems set world performance
records just after being released. This outstanding achievement is
a result of productive collaboration between AMD and Sun.
Performance is a function of both hardware and software. To
extract the maximum performance from the new AMD64-based systems
on your critical C/C++ and Fortran applications, choose the best
compilers in the industry: Sun Studio 11. Then by setting
compiler options to take advantage of the Opteron system features,
you'll maximize your performance benefit. This article will show
you how.
Sun Studio 11 Compilers and Tools
While our focus below is Opteron based systems, the techniques
that we discuss in this note can be applied with small changes to
the SPARC platform and to Intel processor based systems running
Solaris. In most cases what's required are changes to the compiler
option arguments.
In this note our focus is compilation. Hence we do not address
coding techniques. Refer to the Sun
Studio 11 documentation for more details.
Test environment
Our test environment consists of a Sun W2100 workstation
running Solaris 10.
We use the GNU sed 4.1.4 utility source to illustrate the
compiler optimizations. The sed utility is used to filter text.
Specifically sed takes text as input and then performs one or more
operations on the text and outputs the modified text. The sed
utility is a C application. You can obtain the sed source from
http://ftp.gnu.org/pub/gnu/sed/sed-4.1.4.tar.gz
We run a sequence of performance tests. For each test, we
compile sed with the Sun Studio 11 compilers and with selected
compiler options. We then determine the impact of the selected
compiler options on performance by measuring the execution time of
sed to convert a text file to html using the following script:
demo% cat txt2htm.sed
s/\&/\&amp\;/g
s/[\<]/\&lt\;/g
s/[\>]/\&gt\;/g
s/^\s\+/\<\/p\>\<p\>/
s/^$/\<\/p\>\&nbsp\;\<p\>/
s/^/\<html\>\n\<head>\n\<\/head\>\n\<body\>\n\<\/p\>/
s/\s\s/\&nbsp\;\&nbsp\;/g
$s/$/\<\/p\>\n\<\/body\>\n/
We then call the time command to obtain the
performance numbers:
time -p sed -f txt2htm.sed >/dev/null <test.txt
If you have access to an
Opteron-based Sun workstation or server running Solaris 10 you
can follow along to get a sense of the impact of the option
setting on the performance of your application. If you do not
already have Sun Studio 11 installed on your system, you can
download it without charge from:
http://developers.sun.com/sunstudio/downloads/
Measuring with default options
Let's build and measure the performance of our sample sed
program. First we compile and run with no optimization options set, using the
default options provided by the configure script.
We get a test execution time of 12.22 seconds.
Specifying Target Computer: -xarch, -xchip, -xcache
Often programmers compile their source code with the compiler's
default option settings. By careful choice of compiler options,
significant performance improvements can be had. These compiler
options are not turned on by default: you must set them
explicitly.
-xarch
The -xarch compiler option tells the compiler what
instruction set is available for code generation.
If you know the target platform for your application, specify
it with the -xarch compiler option. The compiler will
then generate code optimized for the target platform.
What x86 platform architectures are available? First, let's
look at the Pentium III. Some time ago computers had separate
chips for integer and floating point calculations. The floating
point coprocessor used a special floating point stack for
operations. Later, the floating point and integer units were placed
on the same chip, so now any CPU is able to perform both kinds of
calculations. Some Pentium CPUs have the MMX (Multimedia
Extension) instruction set, which provides a faster and more convenient
way to perform typical packed integer calculations. That support was
expanded in the Pentium III with the SSE (Streaming SIMD
Extensions) floating point instruction set, which makes the old
floating point stack calculations largely obsolete. With SSE you get
faster floating point calculations and 8 additional XMM floating
point registers available to the compiler. Each XMM register is
128 bits wide, while the general purpose registers on these
processors are 32-bit.
Starting with Pentium IV, the SSE2 instruction set was
introduced as an expansion of the older SSE set. If you intend to
run your executable program on a Pentium IV, you need to compile
with -xarch=sse2.
Compared to the older Pentium CPUs, the current generation AMD
Opteron and Athlon64 architectures have much improved multimedia
extension support. These processors have 16 XMM registers, and 8
more general purpose registers. The general purpose registers are now 64 bits wide, suitable
for direct 64-bit calculations in addition to 32-bit
computation, and the XMM registers are 128 bits wide.
You can now directly address much more than 4GB of memory. To use
these features, compile your application with -xarch=amd64.
Note that the resulting executable will work not only on AMD64-based
computers but also on 64-bit Intel Xeon processors with EM64T.
-xchip
A second target computer specific option is -xchip. This
option specifies the target computer CPU type. By default the
compiler will generate code for the generic x86 chip,
not taking advantage of the advanced Pentium IV or Opteron
features. But you can set -xchip=opteron or
-xchip=pentium4 to specify your primary target.
Unlike the -xarch option, where the compiler uses the architecture argument
to generate code specific to an instruction set, the -xchip option is more of
a "hint" to the compiler: it helps the compiler optimize the program for a family of CPUs
with the same minimal instruction set.
-xcache
A third target computer specific option is -xcache. This
option specifies the target computer cache configuration. For an
Opteron based system, this should be set to -xcache=64/64/2:1024/64/16.
-xtarget
You can specify the -xarch,
-xchip and -xcache options together by
using the composite option -xtarget. -xtarget
is a macro holding three settings: -xarch, -xchip,
and -xcache. Specifying -xtarget=opteron is
the equivalent of -xarch=sse2 -xchip=opteron
-xcache=64/64/2:1024/64/16.
Note that -xtarget=opteron conservatively sets
-xarch=sse2, and not -xarch=amd64.
So to get the best-performing code for an AMD64 CPU you need
to set -xarch=amd64 in addition
to -xtarget=opteron. Note that if an option is repeated
on the compiler command line, the last occurrence of the option
takes precedence. So if you set -xarch=amd64 -xtarget=opteron
you will not get 64-bit code, since the
macro expansion yields -xarch=sse2. In the current
example, set the options in this order: -xtarget=opteron -xarch=amd64.
64-bit Memory Considerations
Setting -xarch=amd64 tells the compiler about
resources that are significant for code generation: more
registers, additional instructions, expanded direct memory access.
For instance your long type in C/C++ becomes 64 bit
and you do not need to use the long long data type
for 64-bit numerical calculations because the compiler can do that in
a more efficient way. Function arguments are now passed in
registers instead of memory, giving you an additional
performance boost.
But 64-bit mode is not always faster than 32-bit mode. In 64-bit
mode, even if your program is far from using 4GB of memory, you
still point to a memory location with 64-bit addressing. If you
are using pointers, they will take 8 bytes in 64-bit mode instead
of 4 bytes in 32-bit mode. That is not usually a problem until you
have a lot of them. For example, if you have large arrays of pointers,
or many structure or class instances that contain pointers, you will
use considerably more memory. Your program may become slower due
to swapping or to memory cache misses. So if your program uses
large arrays of pointers, consider using 32-bit SSE2 mode instead
of the 64-bit mode. It may be more efficient. You may need
to experiment to determine the best option setting.
Also be aware that C/C++ long type becomes 8 bytes
in 64-bit mode. If you are using long data types
extensively, consider using int instead or switch to
32 bit (int is 32 bits in both 32-bit and 64-bit
modes).
This is not a concern for Fortran since it has fixed data type
sizes for all platforms and also because Fortran does not make
extensive use of pointers.
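To make the pointer-size effect concrete, here is a minimal C sketch. The names ptr_node, idx_node and idx_list_sum are hypothetical and are not taken from sed. The sketch assumes that all nodes live in a single array, so a 32-bit index can stand in for a pointer and the link field keeps the same size in 32-bit and 64-bit mode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical node linked by pointer: the link takes 8 bytes in 64-bit
   mode and 4 bytes in 32-bit mode. */
struct ptr_node {
    struct ptr_node *next;
    int32_t value;
};

/* Hypothetical node linked by a 32-bit array index: the link takes
   4 bytes in both modes. An index of -1 marks the end of the list. */
struct idx_node {
    int32_t next;
    int32_t value;
};

/* Sum the values of an index-linked list stored in nodes[], starting
   at position head. */
int64_t idx_list_sum(const struct idx_node *nodes, int32_t head)
{
    int64_t sum = 0;
    int32_t i;

    assert(head >= -1);
    for (i = head; i != -1; i = nodes[i].next)
        sum += nodes[i].value;
    return sum;
}
```

Comparing sizeof(struct ptr_node) with sizeof(struct idx_node) in a 32-bit build and in a 64-bit build shows how much of the growth comes from the pointers alone. Whether the index form is worth the extra indirection is something to measure for your own application.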
Safe versus aggressive compiler option settings
Several of the compiler options described below can be used to
obtain significant performance gains, albeit with a possible loss
of robustness for programs that do not strictly follow
conservative coding techniques. Proper use of these advanced
options is based on an agreement: the programmer promises not to
code in specific ways, and the compiler uses that knowledge
for optimization. These advanced optimizations need to be enabled
explicitly by the programmer. They are not optimizations done by
default.
Setting optimization levels: -xO1 to -xO5
There are five optimization levels available for the Sun Studio
compilers, -xO1 to -xO5. All the
optimizations selected with -xOn options are safe
unless your program performs low level memory manipulations that depend on
knowledge of the internal data layout. Use the highest optimization level, -xO5, unless
it exposes some significant programming errors in your code.
Re-measuring sed performance
The executable we started with was targeted at
the generic (386)
architecture. Selecting the amd64 architecture
gives us 13.74 seconds, which is slower than the 32-bit build. But watch what
happens as you start using the optimization options.
We compile sed with optimization level -xO5.
We get 7.49 seconds for the sse2 32-bit architecture
and 6.68 seconds for the amd64 architecture. That is
about 1.6 times as fast as the un-optimized 32-bit code and
about twice as fast as the un-optimized 64-bit code.
Notice also that the 64-bit version now runs faster than the
32-bit version. To get the full power of the
64-bit processor working with sed, we had to enable compiler optimizations.
-fast macro
The -fast macro option conveniently collects a
number of powerful and relatively safe optimizations. It is a good first step
for getting the best performance out of most applications.
So what are the options collected in -fast and
under what conditions are they safe? Let's take a closer look at
some of the options included in -fast and see what
they do.
With the C compiler, the -fast option expands to the
following options on x86/x64 processors:
-fns -fsimple=2 -fsingle -nofstore -xalias_level=basic -xbuiltin=%all
-xdepend -xlibmil -xlibmopt -xO5 -xregs=frameptr -xtarget=native
Details on all these options can be found in the user guides
and man pages for each of the compilers.
Note that -fast includes -xO5. It also
includes -xtarget=native. The native argument
instructs the compiler to assume that the architecture of the
target computer is identical to that of the development machine.
On an AMD64 platform, the compiler will still select -xarch=sse2,
which is a 32-bit target. You need to specify your specific target
options after -fast. For example,
-fast -xarch=amd64
Other options included in the -fast macro may also
be overridden by adding the changed value following -fast
on the command line. It is good practice to specify -fast
as the first compiler option.
Re-measuring sed performance
We now compile the sed source using -fast with
-xarch=amd64. We get 6.33 seconds.
Floating Point Options: -fns -fsimple=2 -fsingle -nofstore
The set of options -fns -fsimple=2 -fsingle -nofstore
tell the compiler that it may relax concerns about floating point
precision and runtime exceptions while compiling floating point
expressions. In most cases this is a reasonable assumption, and
allows the compiler to optimize over floating-point instructions.
However, use of these options may result in some loss of standards
conformance for floating-point operations and possible numerical
differences if the program algorithms are sensitive to rounding
errors.
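As an illustration of why relaxed floating point settings can change results, the following C sketch evaluates the same three-term sum with two different groupings. The function names are hypothetical. Regrouping is used here only as an example of the kind of transformation a relaxed floating point mode may be allowed to make; we do not claim that a particular option performs exactly this rewrite.

```c
#include <assert.h>

/* Evaluate (a + b) + c as written. The volatile temporary keeps the
   intermediate result rounded to double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Evaluate a + (b + c), a regrouping that is equal in exact
   arithmetic but not always in floating point. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}

/* With one huge and one small term the two groupings disagree:
   the small term is lost when it is first added to -1e30. */
void check_regrouping(void)
{
    assert(sum_left(1e30, -1e30, 1.0) == 1.0);
    assert(sum_right(1e30, -1e30, 1.0) == 0.0);
}
```

If your algorithms depend on results like the first one, verify the numerical output after enabling these options.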
Aliasing: -xalias_level=<level>
The -xalias_level option is a powerful
optimization for most C/C++ programs. In general, optimizing C/C++
is made difficult by the use of pointers in the source code. The
compiler can make few assumptions regarding how data is used in
the program, thus inhibiting many optimizations and code
restructuring.
Because the actual data and pointer values will be known only
at run time, the compiler cannot make any assumptions at compile
time. The situation where two pointers could cause the same data
to be changed or read indirectly is called aliasing. There are
many aliasing situations. Hence seven values are available for the
-xalias_level option of C compiler, the lowest being
-xalias_level=any, which prevents any optimization
that requires the assumption that some pointer is not aliasing
another one.
While the -xalias_level=basic setting that
appears in -fast is the next restrictive level after
any, it still allows the compiler to perform certain
optimizations even though pointers are used in the code. This
leads to a performance boost in most situations. The
-xalias_level=basic option guarantees to the
compiler, for example, that nowhere in the code will a double
variable be accessed by a pointer to long. But the
common practice of accessing any data with a char*
pointer is allowed with this option setting.
You can assert other aliasing levels by adding the appropriate
-xalias_level option after -fast. If you
are confident that your program is written without pointer tricks,
try using -xalias_level=strong for C programs or
-xalias_level=compatible for C++ first. But be careful.
The compiler will not warn you if the assumption regarding
aliasing is wrong. In most cases, if you have set the alias level
too aggressively, you will get runtime errors or memory
exceptions. More information about the various aliasing levels can
be found in the C User's Guide.
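The following C sketch shows one way to stay within the rule described above, under which a double must not be accessed through a pointer to an unrelated type while access through char* remains allowed. The function name double_bits is hypothetical, and the sketch assumes that double is a 64-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the bit pattern of d.
   A cast such as *(uint64_t *)&d reads a double through a pointer to
   another type, which is the pattern the aliasing levels above basic
   assume never happens. Copying the bytes with memcpy is a
   character-wise access and does not break that assumption. */
uint64_t double_bits(double d)
{
    uint64_t bits;

    assert(sizeof bits == sizeof d);   /* assumption: 64-bit double */
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

Code written this way should behave the same at every alias level, so it is a reasonable pattern to adopt before trying the more aggressive settings.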
Aliasing is one of the biggest problems for any C/C++ compiler.
If performance is a main concern, avoid using pointers in
compute-intense loops. Array indexing performs better than doing
pointer arithmetic. For example:
for( int i=0; array[i]; i++ ) foo(array[i]); // recommended
for( T* p=array; *p; p++ ) foo(*p); // avoid
-xrestrict
You may also consider using -xrestrict, which
tells the compiler there is no aliasing between the arguments in
functions. For example strcpy is a typical function
which takes two pointers (destination and source). These two
pointers can never alias each other. So for this example, make it
clear to the compiler by setting the -xrestrict
option.
A counter-example, where the compiler option -xrestrict
is not safe, is memmove. This is an example of a
function where aliasing is expected. If you compile the memmove
source with the -xrestrict option, the result would be incorrect
code. In practice it is better to declare each function's
pointer arguments as restrict pointers where appropriate. However, when
compiling a big third party project it may be much easier to
specify -xrestrict. Experiment, but keep in mind that
the option applies to all the source code in a compile unit.
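Declaring individual arguments as restrict pointers, as recommended above, can be sketched as follows. The function add_arrays is a hypothetical example, and the restrict keyword assumes a C99 compiler mode.

```c
#include <assert.h>
#include <stddef.h>

/* Add src[i] to dst[i] for n elements.
   The restrict qualifiers are a promise by the caller that dst and src
   do not overlap, so the compiler may keep values in registers and
   reorder loads and stores. Calling this with overlapping arrays
   breaks the promise and gives undefined results. */
void add_arrays(float *restrict dst, const float *restrict src, size_t n)
{
    size_t i;

    assert(n == 0 || dst != src);   /* catches only the most obvious misuse */
    for (i = 0; i < n; i++)
        dst[i] += src[i];
}
```

Unlike the -xrestrict option, this promise covers only the functions you mark, so a function like memmove in the same compile unit is unaffected.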
Re-measuring sed performance
We do not get any further performance benefits by using
-xrestrict or -xalias_level=strong for
our sed code right now. But note that adding
-xalias_level=any after -fast to tell
the compiler to be most conservative increases the run time to
7.03 seconds.
Intrinsics and Library Functions: -xbuiltin, -xlibmil, -xlibmopt
Many compiler optimizations are possible when the compiler
knows exactly what the code is doing. A good example is a call to
a standard library function. The compiler could replace the call
with optimized code for the function inserted right in place. This
depends on an understanding between the programmer and the
compiler that the program does not override a standard function
with its own custom version. When this agreement holds, the
compiler can assume that all calls to standard functions use the
library version, and replace the function calls with the intrinsic
code itself. The option asserting this is -xbuiltin=%all,
which is a part of the -fast expansion.
To benefit from intrinsics, do not forget to include the system
headers into your source declaring the particular functions that
you use. The compiler's builtin intrinsics will not be used if you
do not include the right header files. Look for the compilation
warnings about undefined functions.
There are also two libraries of pre-optimized functions -
libm.il and libmopt.so. Use of these libraries
is specified by the -xlibmil -xlibmopt options, which
are a part of the -fast macro. These two libraries
provide optimized versions of standard math routines. The
trade-off for the higher performance of these libraries is that
information about arithmetic exceptions might not be available and
that the errno variable will not be set. In many cases
this is a reasonable trade-off between functionality and
performance.
Frame Pointers: -xregs
Each function normally uses a frame pointer. This is a
register that points to the function's stack frame and helps manage the function's data.
Without frame pointers the system will have trouble unwinding the
stack during exception handling and debugging. On the other hand,
if you choose not to use frame pointers in your program you free one
general purpose register for the compiler. The compiler can also
generate a shorter function prologue, so function execution
should be faster. The option -xregs=frameptr,
which is a part of -fast, tells the compiler that it may use the
frame pointer register as a general purpose register.
While stack unwinding during exception handling is critical for
C++, this ability will not be lost when the option
-xregs=frameptr is set. The compiler does not omit
frame pointers where they are really necessary. You may still have
difficulty debugging so consider this option for achieving higher
optimization in production code.
Inter-procedural Optimizations: -xipo
Optimization can be much more effective if the compiler has
access to the full application project source, and not just
individual source modules. For example, by looking at the entire
application source code tree, the compiler could determine that it
can inline a function called in one source file and defined
in another. Function inlining removes the expensive function call
by replacing it inline with the actual code of the called
function. Inlining and related optimizations that read the entire
application source code tree are called interprocedural
optimizations. Use of these options can lead to significant
performance gains.
To take advantage of inter-procedural optimization, you need to
provide the
compiler with all the source files in a single-step compile
and link. However this is not always possible, especially with
large projects. If you are doing mixed language development
requiring compilation with a combination of the cc, CC, and f95
commands, use one of the compiler commands to do the linking rather than
using the ld command directly.
To enable interprocedural optimization add -xipo=2
on the compiler command line in both the compile and link
steps. Even with separate module compilation you will
still benefit from interprocedural optimization if you
specify that option at every compiler invocation,
including linking.
Re-measuring sed performance
When we add -xipo=2 to the compilation options for
sed, we get 5.85 seconds, about 7.5% faster than the 6.33 seconds measured without it.
Profile Feedback: -xprofile
The optimizations discussed so far are made by analyzing the
program source code only. Profile feedback takes into account the
actual program's runtime data to help the compiler determine
whether certain optimizations are worth performing or not. If the
compiler knows the typical data values at critical program
junctions it can effectively rearrange some branches or even
restructure the code completely so as to make it work faster in
typical use cases. The performance benefits you can get with
this option are high, but they come at the cost of additional programmer
effort.
To use profile feedback you need to provide the compiler with
representative data samples. That is, they should be almost the
same as the real working data in most typical cases. Also, the
sample data should be small enough to keep compilation
time reasonable. Note also that the profiled executable
collecting the sample run data will take longer to run than
usual.
To generate a "compiler training
run", compile your program with the same options that you
plan to use in production, but with the additional option
-xprofile=collect:./feedback. Now run the executable
using your typical input data. This run will create the
subdirectory ./feedback.profile in your current
directory. The data collected will be used later for optimization.
Now recompile everything with the same options but replacing
"collect" with "use", as in
-xprofile=use:./feedback. The compiler will now use
data from ./feedback.profile to direct its
optimizations.
To obtain the best representative data, you may want to run
several "collect" executions with different data sets.
When you do this, the results are merged. However, if you
change your source or compilation options you must erase
the ./feedback.profile directory and run one or more new "collect"
passes to get relevant results.
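To see what kind of information a training run can supply, consider the following C sketch. The function escaped_length is hypothetical and is not part of sed. It computes the length a string would have after HTML escaping, similar in spirit to the first three commands of our test script.

```c
#include <assert.h>
#include <stddef.h>

/* Length of s after replacing '&', '<' and '>' with "&amp;", "&lt;"
   and "&gt;". In ordinary text the default branch is taken for nearly
   every character. The source code alone does not say so, but a
   profile collected on representative input does. */
size_t escaped_length(const char *s)
{
    size_t len = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        switch (*s) {
        case '&': len += 5; break;   /* "&amp;" */
        case '<':                    /* "&lt;"  */
        case '>': len += 4; break;   /* "&gt;"  */
        default:  len += 1; break;   /* common case */
        }
    }
    return len;
}
```

With branch frequencies like these recorded in the profile data, the compiler has a basis for arranging the code so that the common path is the fast one. How much that helps depends on how closely the training input matches real use.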
Re-measuring sed performance
Let's try profiling on sed. First we generate the
profile data by compiling sed with the options -fast
-xarch=amd64 -xipo=2 -xprofile=collect. Then
we use the profile data by compiling with -fast
-xarch=amd64 -xipo=2 -xprofile=use. We get 5.18 seconds.
Currently the Sun Studio x86 compiler has two different
profilers. The one that we selected with -xprofile is
the current production version in Sun Studio 11. There is also a new
profiler which is still in development, and which will
become the default profiler with the next release of Sun Studio.
Let's try this new profiler. To enable the new profiler add the
driver option -iropt-prof along with -xprofile
option. To add it to the C compiler driver use -Wd,-iropt-prof.
To add it to the C++ or Fortran compiler use -qoption
CC -iropt-prof or -qoption f90 -iropt-prof.
We recompile sed with the -iropt-prof set in
addition to -xprofile. We now get 4.78 seconds.
Sometimes an optimization option that may not improve performance
by itself will provide a performance boost when used together with
profiling and interprocedural optimization. The -xalias_level
and -xrestrict options are examples of such options.
Earlier, when we tried the -xalias_level and
-xrestrict options, we did not get performance
improvements. Let's try them again: -fast -xarch=amd64
-xipo=2 -xprofile=... -Wd,-iropt-prof -xalias_level=strong
-xrestrict. We now get 4.70 seconds with this combination.
Memory Allocation
Memory layout is a critical factor for performance. If you use
the malloc function in your program, be aware that
there are various versions and some perform better than others.
The default malloc tries to save memory. However if
you only use malloc occasionally, you might find that
adding the option -lbsdmalloc to your compilation
string can improve performance. Using the bsdmalloc library
results in better alignment of allocated memory, and possibly
better performance through better cache utilization.
Suppose that your application has declared the following
structure of 64 bytes size and that memory is allocated for the
array of such structures:
typedef struct _s_t {
//...
}s_t;
//...
assert(sizeof(s_t)==64);
s_t *sa = (s_t*) malloc(sizeof(s_t)*NELEM);
When using the default malloc we can fall into the
situation where the starting address of the array and each
individual structure is not aligned to a 64-byte boundary. As a
result each structure will be placed in two cachelines instead of
a single one. Referencing a field at the beginning and at the end
of the structure will result in the request for two different
cachelines from memory. If we are doing non sequential array
access this may result in twice as many memory cache misses as
would occur if there were strict alignment. With -lbsdmalloc,
each structure fits into a single cache line.
The default malloc and free use
mutexes, which is time consuming. bsdmalloc does not, so the
allocation itself is faster. The trade-off is between memory
allocation density on one side and allocation and access speed on the other.
However, it seems unwise to use bsdmalloc
instead of the default malloc in our sed example,
since sed is a memory intensive application with numerous memory
allocation calls. We would just waste memory and most
probably lose performance if we did not use the
default malloc.
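The cache line arithmetic above can be checked with a small C sketch. The names lines_spanned and alloc_line_aligned are hypothetical, and the 64-byte line size is an assumption taken from the -xcache setting shown earlier. The second function is a portable way to obtain aligned storage by over-allocating; it is not how bsdmalloc works internally.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u   /* assumed cache line size in bytes */

/* Number of cache lines touched by an object of size bytes that
   starts at address addr. */
size_t lines_spanned(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return (size_t)((addr + size - 1) / CACHE_LINE - addr / CACHE_LINE + 1);
}

/* Allocate size bytes aligned to CACHE_LINE by asking malloc for a
   little extra and rounding the address up. *raw receives the pointer
   that must later be passed to free(). Returns NULL on failure. */
void *alloc_line_aligned(size_t size, void **raw)
{
    uintptr_t p;

    *raw = malloc(size + CACHE_LINE - 1);
    if (*raw == NULL)
        return NULL;
    p = ((uintptr_t)*raw + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    return (void *)p;
}
```

For a 64-byte structure, lines_spanned returns 1 when the address is a multiple of 64 and 2 otherwise, which is the doubling of cache line requests described above.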
SIMD: -xvector=simd
The SSE2 instruction set includes some special SIMD
instructions. SIMD stands for Single Instruction Multiple Data.
That means you may process several data values concurrently with
one operation. Suppose that you scale a vector:
for( i=0; i < NELEM; i++ ) v[i] = v[i] * 2;
That would normally require NELEM multiplications. With the
SIMD "vector" instructions, this can be done with fewer
operations. If the data being multiplied is of type float
it will require 4 times fewer operations. The size of float
is 32 bits. SIMD instructions operate with data packed into a 128
bit XMM register. So we may place four 32 bit floats into one 128
bit XMM register and process it at once. Similar operations on
data of type char would require 16 times fewer
multiplications. SSE2-style vectorization requires
that the data being processed be in adjacent memory locations. So
this would not work if we were striding through the odd elements of
an array, for example.
Sun Studio 11 introduced basic support for vectorization on
SSE2 platforms. The compiler looks for such operations and
vectorizes them whenever possible. That serves both floating point
and integer calculations. Currently the mode is still experimental
and not as beneficial as it can be. The support will be extended
in the next compiler releases. To enable vectorization support,
you need to specify the instruction set (at least -xarch=sse2)
along with the compiler flag -xvector=simd.
We get no performance benefit from -xvector=simd
for sed. So not every program benefits from every
optimization.
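The operation counts quoted above follow from simple arithmetic, sketched here in C. The names packed_ops and scale_by_two are hypothetical. The sketch only counts how many 128-bit packed operations would cover an array; whether the compiler actually vectorizes a given loop is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>

#define XMM_BYTES 16u   /* a 128-bit XMM register holds 16 bytes */

/* Packed operations needed to process n elements of elem_size bytes,
   assuming every operation fills one XMM register (the last one may be
   only partly filled). */
size_t packed_ops(size_t n, size_t elem_size)
{
    size_t per_reg;

    assert(elem_size > 0 && elem_size <= XMM_BYTES);
    per_reg = XMM_BYTES / elem_size;
    return (n + per_reg - 1) / per_reg;
}

/* A unit-stride loop over adjacent elements: the loop shape that the
   text describes as suitable for SIMD instructions. */
void scale_by_two(float *v, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        v[i] = v[i] * 2.0f;
}
```

For 1024 float elements packed_ops gives 256, four times fewer than the scalar loop, and for 1024 char elements it gives 64, sixteen times fewer.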
Prefetching: -xprefetch, -xprefetch_level=<level>
One of the biggest performance bottlenecks is memory speed.
While utilizing a high-speed cache can serve data to the CPU
quickly, these cache buffers have limited size. If your program
processes big data arrays that do not fit into the cache, the CPU
stalls waiting for more data. Programs like this are called memory
bound in comparison to CPU bound, where the CPU speed
is the bottleneck. And, as raw CPU processing speed increases,
more and more programs fall into the memory bound category. This
situation can be aggravated with SIMD which processes data arrays
many times faster than before.
Most memory accesses to fill the cache can be done in parallel
with computations. Advanced CPU architectures such as the AMD
Opteron can automatically prefetch data into cache. While some
automatic prefetching is done by the CPU, in certain situations
the compiler may generate additional prefetch instructions. The
heuristics used by the CPU or by the compiler to fill the cache in
advance are speculative and might not always fetch the data
actually needed. In practice prefetching could degrade performance
by pushing needed data out of the cache. Another challenge is
knowing how much data to prefetch. That depends on the CPU to
memory speed ratio and varies from one computer to another.
You can enable the generation of prefetching
instructions by compiling with the -xprefetch option. Fortran
includes prefetching in its expansion of the -fast
option. The -xprefetch=no option disables it
completely. You can regulate how aggressively the compiler
generates prefetches by setting the -xprefetch_level=<level>
option. The higher the level value (from one to three), the more
aggressively the compiler inserts prefetches.
The AMD CPUs implement the special 3DNow! instruction extensions.
One of the 3DNow! instructions supports prefetches for memory
stores. Store prefetches are generated along with read prefetches
if one of the special -xarch values is specified: pentium_proa,
ssea, sse2a, or amd64a. Note that compiling
for these architectures will make your executable program
incompatible with non-AMD platforms.
While beneficial in many cases, prefetching does not improve
performance for sed. That is an expected result, as
sed does not access large data arrays sequentially –
the normal case where prefetching usually helps.
Automatic Parallelization: -xautopar
Today's modern Sun Opteron-based systems are configured with
multiple multi-core CPUs. To fully leverage the available CPU
resources, applications are best parallelized through the
extensive use of threads. The compiler can help you in the
parallelization effort. Try compiling your application with the
-xautopar option to see if there's a benefit. Then request
the number of execution units at runtime by setting the PARALLEL
environment variable to, typically, the maximum number of CPUs (or cores)
on your system minus 1. You may have to experiment with the best setting for your application. Some applications may run faster with auto parallelization when fewer than the maximum number of CPUs or cores is specified.
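Whether automatic parallelization can help depends largely on the shape of your loops. The C sketch below contrasts two hypothetical functions: one whose iterations are independent of each other, and one where each iteration needs the result of the previous one. It illustrates the general idea only and makes no claim about what the compiler does with these particular loops.

```c
#include <assert.h>
#include <stddef.h>

/* Every iteration reads and writes only its own element, so the
   iterations could be divided among several threads. The restrict
   qualifiers (C99) promise that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    size_t i;

    assert(n == 0 || x != y);
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Each iteration uses the value produced by the previous one. This
   loop-carried dependence prevents the iterations from simply being
   run side by side. */
void prefix_sum(size_t n, float *y)
{
    size_t i;

    for (i = 1; i < n; i++)
        y[i] += y[i - 1];
}
```

If the time-consuming loops in your application look like the first function, they are natural candidates to try with automatic parallelization.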
Conclusion
By using the Sun Studio 11 compilers and the right
compiler options we've sped up the sed utility from
12 seconds to less than 5 seconds, cutting its run time by about 60%
without recoding. We now have a version of sed that
runs 35% faster than the gcc compiled version.
We started with -fast -xarch=amd64 and added extra
options, some of which helped while others did not.
We saw that some compiler options may not help on their own,
but do so when selected in conjunction with other options, such as
-xalias_level option in conjunction with profiling.
For More Information
This portal has a number of other technical articles on
performance and parallelization that are worth reading.
Stanislav Mekhanoshin is the team lead of the Sun Studio Opteron Performance team. The team is based at the Sun Saint Petersburg Development Center in Russia. The main goal of the team is to improve the performance of code produced by the Sun Studio compilers for the x86 platform, and especially for Sun's Opteron based systems.
Stanislav graduated from Saint-Petersburg State Technological University in 1995 with a masters degree in computer science.
Prior to joining Sun, Stanislav worked on various software technologies in Russia -- including database, speech recognition, and programming for mobile devices.