# Interface Concepts¶

This page provides a brief description of the theoretical foundations and the practical implementation of the libCEED library.

## Theoretical Framework¶

In finite element formulations, the weak form of a Partial Differential Equation (PDE) is evaluated on a subdomain \(\Omega_e\) (element) and the local results are composed into a larger system of equations that models the entire problem on the global domain \(\Omega\). In particular, when high-order finite elements or spectral elements are used, the resulting sparse matrix representation of the global operator is computationally expensive, with respect to both the memory transfer and floating point operations needed for its evaluation. libCEED provides an interface for matrix-free operator description that enables efficient evaluation on a variety of computational device types (selectable at run time). We present here the notation and the mathematical formulation adopted in libCEED.

We start by considering the discrete residual \(F(u)=0\) formulation in weak form. We first define the \(L^2\) inner product between real-valued functions

where \(\bm{x} \in \mathbb{R}^d \supset \Omega\).

We want to find \(u\) in a suitable space \(V_D\), such that

for all \(\bm v\) in the corresponding homogeneous space \(V_0\), where \(\bm f_0\) and \(\bm f_1\) contain all possible sources in the problem. We notice here that \(\bm f_0\) represents all terms in (1) which multiply the (possibly vector-valued) test function \(\bm v\) and \(\bm f_1\) all terms which multiply its gradient \(\nabla \bm v\). For an n-component problems in \(d\) dimensions, \(\bm f_0 \in \mathbb{R}^n\) and \(\bm f_1 \in \mathbb{R}^{nd}\).

Note

The notation \(\nabla \bm v \!:\! \bm f_1\) represents contraction over both fields and spatial dimensions while a single dot represents contraction in just one, which should be clear from context, e.g., \(\bm v \cdot \bm f_0\) contracts only over fields.

Note

In the code, the function that represents the weak form at quadrature
points is called the CeedQFunction. In the Examples provided with the
library (in the `examples/`

directory), we store the term \(\bm f_0\) directly
into `v`

, and the term \(\bm f_1\) directly into `dv`

(which stands for
\(\nabla \bm v\)). If equation (1) only presents a term of the
type \(\bm f_0\), the CeedQFunction will only have one output argument,
namely `v`

. If equation (1) also presents a term of the type
\(\bm f_1\), then the CeedQFunction will have two output arguments, namely,
`v`

and `dv`

.

## Finite Element Operator Decomposition¶

Finite element operators are typically defined through weak formulations of
partial differential equations that involve integration over a computational
mesh. The required integrals are computed by splitting them as a sum over the
mesh elements, mapping each element to a simple *reference* element (e.g. the
unit square) and applying a quadrature rule in reference space.

This sequence of operations highlights an inherent hierarchical structure
present in all finite element operators where the evaluation starts on *global
(trial) degrees of freedom (dofs) or nodes on the whole mesh*, restricts to
*dofs on subdomains* (groups of elements), then moves to independent
*dofs on each element*, transitions to independent *quadrature points* in
reference space, performs the integration, and then goes back in reverse order
to global (test) degrees of freedom on the whole mesh.

This is illustrated below for the simple case of symmetric linear operator on
third order (\(Q_3\)) scalar continuous (\(H^1\)) elements, where we use
the notions **T-vector**, **L-vector**, **E-vector** and **Q-vector** to represent
the sets corresponding to the (true) degrees of freedom on the global mesh, the split
local degrees of freedom on the subdomains, the split degrees of freedom on the
mesh elements, and the values at quadrature points, respectively.

We refer to the operators that connect the different types of vectors as:

Subdomain restriction \(\bm{P}\)

Element restriction \(\bm{G}\)

Basis (Dofs-to-Qpts) evaluator \(\bm{B}\)

Operator at quadrature points \(\bm{D}\)

More generally, when the test and trial space differ, they get their own versions of \(\bm{P}\), \(\bm{G}\) and \(\bm{B}\).

Note that in the case of adaptive mesh refinement (AMR), the restrictions \(\bm{P}\) and \(\bm{G}\) will involve not just extracting sub-vectors, but evaluating values at constrained degrees of freedom through the AMR interpolation. There can also be several levels of subdomains (\(\bm P_1\), \(\bm P_2\), etc.), and it may be convenient to split \(\bm{D}\) as the product of several operators (\(\bm D_1\), \(\bm D_2\), etc.).

### Terminology and Notation¶

Vector representation/storage categories:

True degrees of freedom/unknowns,

**T-vector**:each unknown \(i\) has exactly one copy, on exactly one processor, \(rank(i)\)

this is a non-overlapping vector decomposition

usually includes any essential (fixed) dofs.

Local (w.r.t. processors) degrees of freedom/unknowns,

**L-vector**:each unknown \(i\) has exactly one copy on each processor that owns an element containing \(i\)

this is an overlapping vector decomposition with overlaps only across different processors—there is no duplication of unknowns on a single processor

the shared dofs/unknowns are the overlapping dofs, i.e. the ones that have more than one copy, on different processors.

Per element decomposition,

**E-vector**:each unknown \(i\) has as many copies as the number of elements that contain \(i\)

usually, the copies of the unknowns are grouped by the element they belong to.

In the case of AMR with hanging nodes (giving rise to hanging dofs):

the

**L-vector**is enhanced with the hanging/dependent dofsthe additional hanging/dependent dofs are duplicated when they are shared by multiple processors

this way, an

**E-vector**can be derived from an**L-vector**without any communications and without additional computations to derive the dependent dofsin other words, an entry in an

**E-vector**is obtained by copying an entry from the corresponding**L-vector**, optionally switching the sign of the entry (for \(H(\mathrm{div})\)—and \(H(\mathrm{curl})\)-conforming spaces).

In the case of variable order spaces:

the dependent dofs (usually on the higher-order side of a face/edge) can be treated just like the hanging/dependent dofs case.

Quadrature point vector,

**Q-vector**:this is similar to

**E-vector**where instead of dofs, the vector represents values at quadrature points, grouped by element.

In many cases it is useful to distinguish two types of vectors:

**X-vector**, or**primal X-vector**, and**X’-vector**, or**dual X-vector**here X can be any of the T, L, E, or Q categories

for example, the mass matrix operator maps a

**T-vector**to a**T’-vector**the solutions vector is a

**T-vector**, and the RHS vector is a**T’-vector**using the parallel prolongation operator, one can map the solution

**T-vector**to a solution**L-vector**, etc.

Operator representation/storage/action categories:

Full true-dof parallel assembly,

**TA**, or**A**:ParCSR or similar format

the T in TA indicates that the data format represents an operator from a

**T-vector**to a**T’-vector**.

Full local assembly,

**LA**:CSR matrix on each rank

the parallel prolongation operator, \(\bm{P}\), (and its transpose) should use optimized matrix-free action

note that \(\bm{P}\) is the operator mapping T-vectors to L-vectors.

Element matrix assembly,

**EA**:each element matrix is stored as a dense matrix

optimized element and parallel prolongation operators

note that the element prolongation operator is the mapping from an

**L-vector**to an**E-vector**.

Quadrature-point/partial assembly,

**QA**or**PA**:precompute and store \(w\det(J)\) at all quadrature points in all mesh elements

the stored data can be viewed as a

**Q-vector**.

Unassembled option,

**UA**or**U**:no assembly step

the action uses directly the mesh node coordinates, and assumes specific form of the coefficient, e.g. constant, piecewise-constant, or given as a

**Q-vector**(Q-coefficient).

### Partial Assembly¶

Since the global operator \(\bm{A}\) is just a series of variational restrictions
with \(\bm{B}\), \(\bm{G}\) and \(\bm{P}\), starting from its
point-wise kernel \(\bm{D}\), a “matvec” with \(\bm{A}\) can be
performed by evaluating and storing some of the innermost variational restriction
matrices, and applying the rest of the operators “on-the-fly”. For example, one can
compute and store a global matrix on **T-vector** level. Alternatively, one can compute
and store only the subdomain (**L-vector**) or element (**E-vector**) matrices and
perform the action of \(\bm{A}\) using matvecs with \(\bm{P}\) or
\(\bm{P}\) and \(\bm{G}\). While these options are natural for
low-order discretizations, they are not a good fit for high-order methods due to
the amount of FLOPs needed for their evaluation, as well as the memory transfer
needed for a matvec.

Our focus in libCEED, instead, is on **partial assembly**, where we compute and
store only \(\bm{D}\) (or portions of it) and evaluate the actions of
\(\bm{P}\), \(\bm{G}\) and \(\bm{B}\) on-the-fly.
Critically for performance, we take advantage of the tensor-product structure of the
degrees of freedom and quadrature points on *quad* and *hex* elements to perform the
action of \(\bm{B}\) without storing it as a matrix.

Implemented properly, the partial assembly algorithm requires optimal amount of
memory transfers (with respect to the polynomial order) and near-optimal FLOPs
for operator evaluation. It consists of an operator *setup* phase, that
evaluates and stores \(\bm{D}\) and an operator *apply* (evaluation) phase that
computes the action of \(\bm{A}\) on an input vector. When desired, the setup
phase may be done as a side-effect of evaluating a different operator, such as a
nonlinear residual. The relative costs of the setup and apply phases are
different depending on the physics being expressed and the representation of
\(\bm{D}\).

### Parallel Decomposition¶

After the application of each of the first three transition operators, \(\bm{P}\), \(\bm{G}\) and \(\bm{B}\), the operator evaluation is decoupled on their ranges, so \(\bm{P}\), \(\bm{G}\) and \(\bm{B}\) allow us to “zoom-in” to subdomain, element and quadrature point level, ignoring the coupling at higher levels.

Thus, a natural mapping of \(\bm{A}\) on a parallel computer is to split the
**T-vector** over MPI ranks (a non-overlapping decomposition, as is typically
used for sparse matrices), and then split the rest of the vector types over
computational devices (CPUs, GPUs, etc.) as indicated by the shaded regions in
the diagram above.

One of the advantages of the decomposition perspective in these settings is that the operators \(\bm{P}\), \(\bm{G}\), \(\bm{B}\) and \(\bm{D}\) clearly separate the MPI parallelism in the operator (\(\bm{P}\)) from the unstructured mesh topology (\(\bm{G}\)), the choice of the finite element space/basis (\(\bm{B}\)) and the geometry and point-wise physics \(\bm{D}\). These components also naturally fall in different classes of numerical algorithms – parallel (multi-device) linear algebra for \(\bm{P}\), sparse (on-device) linear algebra for \(\bm{G}\), dense/structured linear algebra (tensor contractions) for \(\bm{B}\) and parallel point-wise evaluations for \(\bm{D}\).

Currently in libCEED, it is assumed that the host application manages the global
**T-vectors** and the required communications among devices (which are generally
on different compute nodes) with **P**. Our API is thus focused on the
**L-vector** level, where the logical devices, which in the library are
represented by the Ceed object, are independent. Each MPI rank can use one or
more Ceeds, and each Ceed, in turn, can represent one or more physical
devices, as long as libCEED backends support such configurations. The idea is
that every MPI rank can use any logical device it is assigned at runtime. For
example, on a node with 2 CPU sockets and 4 GPUs, one may decide to use 6 MPI
ranks (each using a single Ceed object): 2 ranks using 1 CPU socket each, and
4 using 1 GPU each. Another choice could be to run 1 MPI rank on the whole node
and use 5 Ceed objects: 1 managing all CPU cores on the 2 sockets and 4
managing 1 GPU each. The communications among the devices, e.g. required for
applying the action of \(\bm{P}\), are currently out of scope of libCEED. The
interface is non-blocking for all operations involving more than O(1) data,
allowing operations performed on a coprocessor or worker threads to overlap with
operations on the host.

## API Description¶

The libCEED API takes an algebraic approach, where the user essentially
describes in the *frontend* the operators **G**, **B** and **D** and the library
provides *backend* implementations and coordinates their action to the original
operator on **L-vector** level (i.e. independently on each device / MPI task).

One of the advantages of this purely algebraic description is that it already includes all the finite element information, so the backends can operate on linear algebra level without explicit finite element code. The frontend description is general enough to support a wide variety of finite element algorithms, as well as some other types algorithms such as spectral finite differences. The separation of the front- and backends enables applications to easily switch/try different backends. It also enables backend developers to impact many applications from a single implementation.

Our long-term vision is to include a variety of backend implementations in libCEED, ranging from reference kernels to highly optimized kernels targeting specific devices (e.g. GPUs) or specific polynomial orders. A simple reference backend implementation is provided in the file ceed-ref.c.

On the frontend, the mapping between the decomposition concepts and the code implementation is as follows:

**L-**,**E-**and**Q-vector**are represented as variables of type CeedVector. (A backend may choose to operate incrementally without forming explicit**E-**or**Q-vectors**.)\(\bm{G}\) is represented as variable of type CeedElemRestriction.

\(\bm{B}\) is represented as variable of type CeedBasis.

the action of \(\bm{D}\) is represented as variable of type CeedQFunction.

the overall operator \(\bm{G}^T \bm{B}^T \bm{D} \bm{B} \bm{G}\) is represented as variable of type CeedOperator and its action is accessible through

`CeedOperatorApply()`

.

To clarify these concepts and illustrate how they are combined in the API, consider the implementation of the action of a simple 1D mass matrix (cf. tests/t500-operator.c).

```
1/// @file
2/// Test creation, action, and destruction for mass matrix operator
3/// \test Test creation, action, and destruction for mass matrix operator
4#include <ceed.h>
5#include <stdlib.h>
6#include <math.h>
7
8#include "t500-operator.h"
9
10int main(int argc, char **argv) {
11 Ceed ceed;
12 CeedElemRestriction elem_restr_x, elem_restr_u, elem_restr_qd_i;
13 CeedBasis basis_x, basis_u;
14 CeedQFunction qf_setup, qf_mass;
15 CeedOperator op_setup, op_mass;
16 CeedVector q_data, X, U, V;
17 const CeedScalar *hv;
18 CeedInt num_elem = 15, P = 5, Q = 8;
19 CeedInt num_nodes_x = num_elem+1, num_nodes_u = num_elem*(P-1)+1;
20 CeedInt ind_x[num_elem*2], ind_u[num_elem*P];
21 CeedScalar x[num_nodes_x];
22
23//! [Ceed Init]
24 CeedInit(argv[1], &ceed);
25//! [Ceed Init]
26 for (CeedInt i=0; i<num_nodes_x; i++)
27 x[i] = (CeedScalar) i / (num_nodes_x - 1);
28 for (CeedInt i=0; i<num_elem; i++) {
29 ind_x[2*i+0] = i;
30 ind_x[2*i+1] = i+1;
31 }
32//! [ElemRestr Create]
33 CeedElemRestrictionCreate(ceed, num_elem, 2, 1, 1, num_nodes_x, CEED_MEM_HOST,
34 CEED_USE_POINTER, ind_x, &elem_restr_x);
35//! [ElemRestr Create]
36
37 for (CeedInt i=0; i<num_elem; i++) {
38 for (CeedInt j=0; j<P; j++) {
39 ind_u[P*i+j] = i*(P-1) + j;
40 }
41 }
42//! [ElemRestrU Create]
43 CeedElemRestrictionCreate(ceed, num_elem, P, 1, 1, num_nodes_u, CEED_MEM_HOST,
44 CEED_USE_POINTER, ind_u, &elem_restr_u);
45 CeedInt strides_qd[3] = {1, Q, Q};
46 CeedElemRestrictionCreateStrided(ceed, num_elem, Q, 1, Q*num_elem, strides_qd,
47 &elem_restr_qd_i);
48//! [ElemRestrU Create]
49
50//! [Basis Create]
51 CeedBasisCreateTensorH1Lagrange(ceed, 1, 1, 2, Q, CEED_GAUSS, &basis_x);
52 CeedBasisCreateTensorH1Lagrange(ceed, 1, 1, P, Q, CEED_GAUSS, &basis_u);
53//! [Basis Create]
54
55//! [QFunction Create]
56 CeedQFunctionCreateInterior(ceed, 1, setup, setup_loc, &qf_setup);
57 CeedQFunctionAddInput(qf_setup, "_weight", 1, CEED_EVAL_WEIGHT);
58 CeedQFunctionAddInput(qf_setup, "dx", 1, CEED_EVAL_GRAD);
59 CeedQFunctionAddOutput(qf_setup, "rho", 1, CEED_EVAL_NONE);
60
61 CeedQFunctionCreateInterior(ceed, 1, mass, mass_loc, &qf_mass);
62 CeedQFunctionAddInput(qf_mass, "rho", 1, CEED_EVAL_NONE);
63 CeedQFunctionAddInput(qf_mass, "u", 1, CEED_EVAL_INTERP);
64 CeedQFunctionAddOutput(qf_mass, "v", 1, CEED_EVAL_INTERP);
65//! [QFunction Create]
66
67//! [Setup Create]
68 CeedOperatorCreate(ceed, qf_setup, CEED_QFUNCTION_NONE, CEED_QFUNCTION_NONE,
69 &op_setup);
70//! [Setup Create]
71
72//! [Operator Create]
73 CeedOperatorCreate(ceed, qf_mass, CEED_QFUNCTION_NONE, CEED_QFUNCTION_NONE,
74 &op_mass);
75//! [Operator Create]
76
77 CeedVectorCreate(ceed, num_nodes_x, &X);
78 CeedVectorSetArray(X, CEED_MEM_HOST, CEED_USE_POINTER, x);
79 CeedVectorCreate(ceed, num_elem*Q, &q_data);
80
81//! [Setup Set]
82 CeedOperatorSetField(op_setup, "_weight", CEED_ELEMRESTRICTION_NONE, basis_x,
83 CEED_VECTOR_NONE);
84 CeedOperatorSetField(op_setup, "dx", elem_restr_x, basis_x, CEED_VECTOR_ACTIVE);
85 CeedOperatorSetField(op_setup, "rho", elem_restr_qd_i, CEED_BASIS_COLLOCATED,
86 CEED_VECTOR_ACTIVE);
87//! [Setup Set]
88
89//! [Operator Set]
90 CeedOperatorSetField(op_mass, "rho", elem_restr_qd_i,CEED_BASIS_COLLOCATED,
91 q_data);
92 CeedOperatorSetField(op_mass, "u", elem_restr_u, basis_u, CEED_VECTOR_ACTIVE);
93 CeedOperatorSetField(op_mass, "v", elem_restr_u, basis_u, CEED_VECTOR_ACTIVE);
94//! [Operator Set]
95
96//! [Setup Apply]
97 CeedOperatorApply(op_setup, X, q_data, CEED_REQUEST_IMMEDIATE);
98//! [Setup Apply]
99
100 CeedVectorCreate(ceed, num_nodes_u, &U);
101 CeedVectorSetValue(U, 0.0);
102 CeedVectorCreate(ceed, num_nodes_u, &V);
103//! [Operator Apply]
104 CeedOperatorApply(op_mass, U, V, CEED_REQUEST_IMMEDIATE);
105//! [Operator Apply]
106
107 CeedVectorGetArrayRead(V, CEED_MEM_HOST, &hv);
108 for (CeedInt i=0; i<num_nodes_u; i++)
109 if (fabs(hv[i]) > 1e-14) printf("[%d] v %g != 0.0\n",i, hv[i]);
110 CeedVectorRestoreArrayRead(V, &hv);
111
112 CeedQFunctionDestroy(&qf_setup);
113 CeedQFunctionDestroy(&qf_mass);
114 CeedOperatorDestroy(&op_setup);
115 CeedOperatorDestroy(&op_mass);
116 CeedElemRestrictionDestroy(&elem_restr_u);
117 CeedElemRestrictionDestroy(&elem_restr_x);
118 CeedElemRestrictionDestroy(&elem_restr_qd_i);
119 CeedBasisDestroy(&basis_u);
120 CeedBasisDestroy(&basis_x);
121 CeedVectorDestroy(&X);
122 CeedVectorDestroy(&U);
123 CeedVectorDestroy(&V);
124 CeedVectorDestroy(&q_data);
125 CeedDestroy(&ceed);
126 return 0;
127}
```

The constructor

```
CeedInit(argv[1], &ceed);
```

creates a logical device `ceed`

on the specified *resource*, which could also be
a coprocessor such as `"/nvidia/0"`

. There can be any number of such devices,
including multiple logical devices driving the same resource (though performance
may suffer in case of oversubscription). The resource is used to locate a
suitable backend which will have discretion over the implementations of all
objects created with this logical device.

The `setup`

routine above computes and stores \(\bm{D}\), in this case a
scalar value in each quadrature point, while `mass`

uses these saved values to perform
the action of \(\bm{D}\). These functions are turned into the CeedQFunction
variables `qf_setup`

and `qf_mass`

in the `CeedQFunctionCreateInterior()`

calls:

```
CeedQFunctionCreateInterior(ceed, 1, setup, setup_loc, &qf_setup);
CeedQFunctionAddInput(qf_setup, "_weight", 1, CEED_EVAL_WEIGHT);
CeedQFunctionAddInput(qf_setup, "dx", 1, CEED_EVAL_GRAD);
CeedQFunctionAddOutput(qf_setup, "rho", 1, CEED_EVAL_NONE);
CeedQFunctionCreateInterior(ceed, 1, mass, mass_loc, &qf_mass);
CeedQFunctionAddInput(qf_mass, "rho", 1, CEED_EVAL_NONE);
CeedQFunctionAddInput(qf_mass, "u", 1, CEED_EVAL_INTERP);
CeedQFunctionAddOutput(qf_mass, "v", 1, CEED_EVAL_INTERP);
```

A CeedQFunction performs independent operations at each quadrature point and
the interface is intended to facilitate vectorization. The second argument is
an expected vector length. If greater than 1, the caller must ensure that the
number of quadrature points `Q`

is divisible by the vector length. This is
often satisfied automatically due to the element size or by batching elements
together to facilitate vectorization in other stages, and can always be ensured
by padding.

In addition to the function pointers (`setup`

and `mass`

), CeedQFunction
constructors take a string representation specifying where the source for the
implementation is found. This is used by backends that support Just-In-Time
(JIT) compilation (i.e., CUDA and OCCA) to compile for coprocessors.
For full support across all backends, these CeedQFunction source files must only contain constructs mutually supported by C99, C++11, and CUDA.
For example, explicit type casting of void pointers and explicit use of compatible arguments for `math`

library functions is required, and variable-length array (VLA) syntax for array reshaping is only available via libCEED’s `CEED_Q_VLA`

macro.

Different input and output fields are added individually, specifying the field name, size of the field, and evaluation mode.

The size of the field is provided by a combination of the number of components the effect of any basis evaluations.

The evaluation mode (see Typedefs and Enumerations) `CEED_EVAL_INTERP`

for both input and output fields indicates that the mass operator only contains terms of
the form

where \(v\) are test functions (see the Theoretical Framework). More general operators, such as those of the form

can be expressed.

For fields with derivatives, such as with the basis evaluation mode
(see Typedefs and Enumerations) `CEED_EVAL_GRAD`

, the size of the
field needs to reflect both the number of components and the geometric dimension.
A 3-dimensional gradient on four components would therefore mean the field has a size of
12.

The \(\bm{B}\) operators for the mesh nodes, `basis_x`

, and the unknown field,
`basis_u`

, are defined in the calls to the function `CeedBasisCreateTensorH1Lagrange()`

.
In this example, both the mesh and the unknown field use \(H^1\) Lagrange finite
elements of order 1 and 4 respectively (the `P`

argument represents the number of 1D
degrees of freedom on each element). Both basis operators use the same integration rule,
which is Gauss-Legendre with 8 points (the `Q`

argument).

```
CeedBasisCreateTensorH1Lagrange(ceed, 1, 1, 2, Q, CEED_GAUSS, &basis_x);
CeedBasisCreateTensorH1Lagrange(ceed, 1, 1, P, Q, CEED_GAUSS, &basis_u);
```

Other elements with this structure can be specified in terms of the `Q×P`

matrices that evaluate values and gradients at quadrature points in one
dimension using `CeedBasisCreateTensorH1()`

. Elements that do not have tensor
product structure, such as symmetric elements on simplices, will be created
using different constructors.

The \(\bm{G}\) operators for the mesh nodes, `elem_restr_x`

, and the unknown field,
`elem_restr_u`

, are specified in the `CeedElemRestrictionCreate()`

. Both of these
specify directly the dof indices for each element in the `ind_x`

and `ind_u`

arrays:

```
CeedElemRestrictionCreate(ceed, num_elem, 2, 1, 1, num_nodes_x, CEED_MEM_HOST,
CEED_USE_POINTER, ind_x, &elem_restr_x);
```

```
CeedElemRestrictionCreate(ceed, num_elem, P, 1, 1, num_nodes_u, CEED_MEM_HOST,
CEED_USE_POINTER, ind_u, &elem_restr_u);
CeedInt strides_qd[3] = {1, Q, Q};
CeedElemRestrictionCreateStrided(ceed, num_elem, Q, 1, Q*num_elem, strides_qd,
&elem_restr_qd_i);
```

If the user has arrays available on a device, they can be provided using
`CEED_MEM_DEVICE`

. This technique is used to provide no-copy interfaces in all
contexts that involve problem-sized data.

For discontinuous Galerkin and for applications such as Nek5000 that only
explicitly store **E-vectors** (inter-element continuity has been subsumed by
the parallel restriction \(\bm{P}\)), the element restriction \(\bm{G}\)
is the identity and `CeedElemRestrictionCreateStrided()`

is used instead.
We plan to support other structured representations of \(\bm{G}\) which will
be added according to demand.
There are two common approaches for supporting non-conforming elements: applying the node constraints via \(\bm P\) so that the **L-vector** can be processed uniformly and applying the constraints via \(\bm G\) so that the **E-vector** is uniform.
The former can be done with the existing interface while the latter will require a generalization to element restriction that would define field values at constrained nodes as linear combinations of the values at primary nodes.

These operations, \(\bm{P}\), \(\bm{B}\), and \(\bm{D}\),
are combined with a CeedOperator. As with CeedQFunctions, operator fields are added
separately with a matching field name, basis (\(\bm{B}\)), element restriction
(\(\bm{G}\)), and **L-vector**. The flag
`CEED_VECTOR_ACTIVE`

indicates that the vector corresponding to that field will
be provided to the operator when `CeedOperatorApply()`

is called. Otherwise the
input/output will be read from/written to the specified **L-vector**.

With partial assembly, we first perform a setup stage where \(\bm{D}\) is evaluated
and stored. This is accomplished by the operator `op_setup`

and its application
to `X`

, the nodes of the mesh (these are needed to compute Jacobians at
quadrature points). Note that the corresponding `CeedOperatorApply()`

has no basis
evaluation on the output, as the quadrature data is not needed at the dofs:

```
CeedOperatorCreate(ceed, qf_setup, CEED_QFUNCTION_NONE, CEED_QFUNCTION_NONE,
&op_setup);
```

```
CeedOperatorSetField(op_setup, "_weight", CEED_ELEMRESTRICTION_NONE, basis_x,
CEED_VECTOR_NONE);
CeedOperatorSetField(op_setup, "dx", elem_restr_x, basis_x, CEED_VECTOR_ACTIVE);
CeedOperatorSetField(op_setup, "rho", elem_restr_qd_i, CEED_BASIS_COLLOCATED,
CEED_VECTOR_ACTIVE);
```

```
CeedOperatorApply(op_setup, X, q_data, CEED_REQUEST_IMMEDIATE);
```

The action of the operator is then represented by operator `op_mass`

and its
`CeedOperatorApply()`

to the input **L-vector** `U`

with output in `V`

:

```
CeedOperatorCreate(ceed, qf_mass, CEED_QFUNCTION_NONE, CEED_QFUNCTION_NONE,
&op_mass);
```

```
CeedOperatorSetField(op_mass, "rho", elem_restr_qd_i,CEED_BASIS_COLLOCATED,
q_data);
CeedOperatorSetField(op_mass, "u", elem_restr_u, basis_u, CEED_VECTOR_ACTIVE);
CeedOperatorSetField(op_mass, "v", elem_restr_u, basis_u, CEED_VECTOR_ACTIVE);
```

```
CeedOperatorApply(op_mass, U, V, CEED_REQUEST_IMMEDIATE);
```

A number of function calls in the interface, such as `CeedOperatorApply()`

, are
intended to support asynchronous execution via their last argument,
`CeedRequest*`

. The specific (pointer) value used in the above example,
`CEED_REQUEST_IMMEDIATE`

, is used to express the request (from the user) for the
operation to complete before returning from the function call, i.e. to make sure
that the result of the operation is available in the output parameters
immediately after the call. For a true asynchronous call, one needs to provide
the address of a user defined variable. Such a variable can be used later to
explicitly wait for the completion of the operation.

## Gallery of QFunctions¶

LibCEED provides a gallery of built-in CeedQFunctions in the `gallery/`

directory.
The available QFunctions are the ones associated with the mass, the Laplacian, and
the identity operators. To illustrate how the user can declare a CeedQFunction
via the gallery of available QFunctions, consider the selection of the
CeedQFunction associated with a simple 1D mass matrix
(cf. tests/t410-qfunction.c).

```
1/// @file
2/// Test creation, evaluation, and destruction for QFunction by name
3/// \test Test creation, evaluation, and destruction for QFunction by name
4#include <ceed.h>
5
6int main(int argc, char **argv) {
7 Ceed ceed;
8 CeedVector in[16], out[16];
9 CeedVector Q_data, J, W, U, V;
10 CeedQFunction qf_setup, qf_mass;
11 CeedInt Q = 8;
12 const CeedScalar *vv;
13 CeedScalar j[Q], w[Q], u[Q], v[Q];
14
15 CeedInit(argv[1], &ceed);
16
17 CeedQFunctionCreateInteriorByName(ceed, "Mass1DBuild", &qf_setup);
18 CeedQFunctionCreateInteriorByName(ceed, "MassApply", &qf_mass);
19
20 for (CeedInt i=0; i<Q; i++) {
21 CeedScalar x = 2.*i/(Q-1) - 1;
22 j[i] = 1;
23 w[i] = 1 - x*x;
24 u[i] = 2 + 3*x + 5*x*x;
25 v[i] = w[i] * u[i];
26 }
27
28 CeedVectorCreate(ceed, Q, &J);
29 CeedVectorSetArray(J, CEED_MEM_HOST, CEED_USE_POINTER, j);
30 CeedVectorCreate(ceed, Q, &W);
31 CeedVectorSetArray(W, CEED_MEM_HOST, CEED_USE_POINTER, w);
32 CeedVectorCreate(ceed, Q, &U);
33 CeedVectorSetArray(U, CEED_MEM_HOST, CEED_USE_POINTER, u);
34 CeedVectorCreate(ceed, Q, &V);
35 CeedVectorSetValue(V, 0);
36 CeedVectorCreate(ceed, Q, &Q_data);
37 CeedVectorSetValue(Q_data, 0);
38
39 {
40 in[0] = J;
41 in[1] = W;
42 out[0] = Q_data;
43 CeedQFunctionApply(qf_setup, Q, in, out);
44 }
45 {
46 in[0] = W;
47 in[1] = U;
48 out[0] = V;
49 CeedQFunctionApply(qf_mass, Q, in, out);
50 }
51
52 CeedVectorGetArrayRead(V, CEED_MEM_HOST, &vv);
53 for (CeedInt i=0; i<Q; i++)
54 if (v[i] != vv[i])
55 // LCOV_EXCL_START
56 printf("[%d] v %f != vv %f\n",i, v[i], vv[i]);
57 // LCOV_EXCL_STOP
58 CeedVectorRestoreArrayRead(V, &vv);
59
60 CeedVectorDestroy(&J);
61 CeedVectorDestroy(&W);
62 CeedVectorDestroy(&U);
63 CeedVectorDestroy(&V);
64 CeedVectorDestroy(&Q_data);
65 CeedQFunctionDestroy(&qf_setup);
66 CeedQFunctionDestroy(&qf_mass);
67 CeedDestroy(&ceed);
68 return 0;
69}
```

## Interface Principles and Evolution¶

LibCEED is intended to be extensible via backends that are packaged with the library and packaged separately (possibly as a binary containing proprietary code). Backends are registered by calling

```
CeedRegister("/cpu/self/ref/serial", CeedInit_Ref, 50);
```

typically in a library initializer or “constructor” that runs automatically.
`CeedInit`

uses this prefix to find an appropriate backend for the resource.

Source (API) and binary (ABI) stability are important to libCEED. Prior to reaching version 1.0, libCEED does not implement strict semantic versioning across the entire interface. However, user code, including libraries of CeedQFunctions, should be source and binary compatible moving from 0.x.y to any later release 0.x.z. We have less experience with external packaging of backends and do not presently guarantee source or binary stability, but we intend to define stability guarantees for libCEED 1.0. We’d love to talk with you if you’re interested in packaging backends externally, and will work with you on a practical stability policy.