RESTful Web Services on Standalone Disaggregated FPGAs

We present an architecture for field-programmable gate arrays (FPGAs) to expose RESTful web services. This architecture allows clients to access accelerated web services from any platform and programming language that can perform RESTful API calls. By using this architecture, the client's application benefits from a high-throughput and low-latency web service interface. Traditionally, FPGAs are deployed in CPU-centric infrastructures as worker devices in the form of accelerators. However, for FPGA-centric applications, the overhead of a host CPU diminishes the performance, scalability and energy efficiency. cloudFPGA solves these issues by deploying FPGAs as standalone, disaggregated resources in the data center (DC). Building on top of the cloudFPGA platform, the presented architecture simplifies the integration of FPGA-accelerated functions with cloud applications. An FPGA-based application is deployed together with a configurable hardware block that can be generated from an OpenAPI-based specification of the web service. We compare a natural language processing (NLP) application that is exposed as a web service using the traditional server infrastructure and our RESTful service layer. Measurements show a 20x improvement in throughput and a 4x reduction in mean latency.


Introduction
To provide sufficient compute resources for the age of Big Data, the cloud computing environment is becoming a heterogeneous place [1]. New generations of general-purpose processors (CPUs) cannot keep up with the performance requirements of some applications, causing new processing devices to be introduced into server systems. Most prominent are general-purpose graphics processing units (GPGPUs), which are well suited for highly parallel and math-oriented algorithms [2]. Another type of processing device is the field-programmable gate array (FPGA). Because of their low power consumption, FPGAs can be integrated into compact-sized servers and can accelerate a wide range of applications.
In today's cloud environments many services can be accessed via RESTful APIs [3]. Most commonly layered on top of the hypertext transfer protocol (HTTP), representational state transfer (REST) provides interoperability between computer platforms and programming languages. A RESTful API uses the HTTP verbs (GET, POST, etc.) together with the uniform resource identifier (URI) to trigger an operation, which returns a response in a pre-defined format, most commonly XML or JSON. Example services are natural language processing (NLP) tasks such as speech-to-text, text-to-speech or sentiment analysis, or image analysis.
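The verb-plus-URI dispatch model described above can be sketched in a few lines of Python. The route table, endpoint name, and handler are hypothetical and serve only to illustrate how a (verb, URI) pair triggers an operation that returns a JSON response.

```python
import json

def annotate(payload):
    # Placeholder handler for illustration: report the document length
    # alongside an (empty) list of annotations.
    return {"annotations": [], "length": len(payload)}

# Hypothetical routing table: (HTTP verb, URI) pairs mapped to handlers.
ROUTES = {
    ("POST", "/annotate/"): annotate,
}

def dispatch(verb, uri, payload=""):
    handler = ROUTES.get((verb, uri))
    if handler is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(handler(payload))

status, body = dispatch("POST", "/annotate/", "Some plain text document.")
```

A real RESTful service adds content negotiation, authentication and error handling on top of this basic verb/URI dispatch.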
Traditionally, these web services are provided via application servers that implement the HTTP protocol and then communicate via the common gateway interface (CGI) or similar protocols with the application that implements the service function. Such an application can in turn access an FPGA to utilize accelerated functions and serve requests faster.
We present a configurable intellectual property (IP) block that is implemented on an FPGA to expose these accelerated functions to the HTTP client. The IP block implements the basic functionality of the HTTP protocol and decodes requests according to the developer's specification in the OpenAPI format [4]. The block manages the connections for timeouts and inspects the HTTP header for all required fields and their respective values. The accelerated function remains in charge of processing the payload and generating a response payload.
Traditionally, FPGAs are deployed in CPU-centric infrastructures, connected over the PCIe bus to the host. The overhead of the CPU diminishes the performance, scalability, and energy efficiency gains for FPGA-centric applications. The cloudFPGA platform solves these issues by deploying FPGAs as standalone, disaggregated resources in cloud DCs.
We demonstrate that using the cloudFPGA platform, the throughput of the above RESTful application can be improved by 20x compared to a CPU-centric accelerated function running on a high-end server. The main contributions of the paper are as follows:
• A re-usable and configurable intellectual property block that enables an FPGA to expose web services directly via HTTP without the need for a server system.
• An example application from the natural language processing domain which extracts entities from scientific documents.
• A comparison with state-of-the-art implementations running on a high-end enterprise server hosting an FPGA.
The remainder of the paper is organized as follows. Section 2 touches on the background of FPGAs and introduces the cloudFPGA platform. Section 3 explains the configurable IP block and its use with the example NLP application in detail. Section 4 describes the experimental setup and the measurement results. Section 5 outlines related publications. Conclusion and outlook are given in Section 6.

Background
Both Moore's law and Dennard scaling held until the last decade, but both have now become less applicable [5]. This leads to innovation in two directions: (i) alternatives to CMOS technology and (ii) innovations in the system stack. Although not much credible success has yet been achieved with the first approach, the second has resulted in improved extensions of the system stack. These extensions evolved along multiple paths: (i) low-level micro-architectural extensions, such as multicore, SMP, SIMD, MIMD, and SoC in the processor road map, (ii) advanced memories, (iii) improved system software integration and cloud in the application space, and (iv) HW acceleration. However, the stringent physical limitations of CMOS technology inhibit extracting ever more performance from CPUs, except in the case of HW accelerators, which are built using customized HW.
Today, the domain of HW accelerators is dominated by graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). While GPUs and ASICs can effectively accelerate DC tasks, the rapidly changing nature of DC application algorithms can quickly render a dedicated accelerator obsolete. This is especially true for ASICs, whose design, verification and bring-up may take several years while the applications change on a monthly basis. A GPU, on the other hand, consumes an order of magnitude more power than an FPGA. Meanwhile, FPGAs deliver the performance advantages of a fixed-purpose accelerator in a low-power, flexible hardware platform that can enable faster time to innovation.

FPGAs
Because of the high flexibility of FPGAs, the scope of potential applications is broad. Microsoft showed one of the prominent use cases of FPGAs at large-scale in DCs, by improving the throughput and latency of their Bing web search service using FPGAs [6]. FPGAs have successfully been used in other application areas as well. A few examples are, big-data analytics [7], cryptography [8], HPC [9], deep learning [10], network security [11], and natural language processing [12].
However, one disadvantage of FPGAs is their programmability. Traditionally, FPGA-based applications are developed using hardware description languages (HDLs) such as VHDL or Verilog, which require the programmer to understand how the HW works at a low level. To alleviate these issues, one approach is to use high-level synthesis (HLS) tools, which offer a C- or C++-based programming environment. Another approach is to abstract the low-level HW details as much as possible by offering cloud-based FPGAs [13]. Typically, these offerings use shell-role architectures [6] [13] [14], in which the shell abstracts the FPGA's interaction with the CPU and DRAM, and the role hosts the user application.
In DCs, FPGAs are traditionally deployed in CPU-centric infrastructures, connected to the host over the PCIe bus as worker devices [6] [13]. These infrastructures are suitable for CPU-centric applications, where most of the application processing is done in the CPU and the PCIe-attached FPGA acts as an accelerator. However, for FPGA-centric applications, where most of the processing is done in the FPGA, the overhead of CPU-centric infrastructures diminishes the performance, scalability and power efficiency gains.
When offering FPGAs as compute resources to cloud users, existing CPU-centric solutions [15] face the following issues: (i) the CPU-centric approach requires the data path of the inter-FPGA communication to traverse a CPU for applications that use more FPGAs than a single CPU-centric unit can provide; (ii) the number of FPGAs that an application uses cannot be scaled independently from the number of CPU resources at the infrastructure level; (iii) the tight coupling of FPGAs to CPUs limits the number of FPGAs that can be deployed in a DC rack. Microsoft solved issue (i) by connecting PCIe-attached FPGAs in a mesh in their first implementation [6] and by connecting each PCIe-attached FPGA directly to the DC network as a bump in the wire (host ↔ NIC ↔ FPGA ↔ DC network) in their second implementation [16]. Another approach to solving issue (i) without increasing the number of physical network connections to the host is to connect only the NIC directly to the DC network and attach the FPGA to the NIC through PCIe (host ↔ FPGA ↔ NIC ↔ DC network). This is theoretically possible using, for example, SR-IOV (single root I/O virtualization) based vNICs, but to the best of our knowledge no FPGA implementations supporting this communication are available yet. Even if there are alternative solutions for issue (i), issues (ii) and (iii) cannot be solved in the CPU-centric approach.

cloudFPGA
In contrast to the traditional CPU-centric approach, our cloudFPGA platform [17] takes an FPGA-centric approach, in which the FPGA is directly connected to the DC network as a standalone, disaggregated compute resource without the need of any hosting server (Figure 1). This standalone, disaggregated nature enables FPGA-centric applications to scale independently of the number of CPUs, while enabling an efficient communication path (low latency and jitter with high throughput) within FPGA-based distributed applications [18] [19]. Similarly to other cloud-based FPGA approaches, cloudFPGA uses a shell-role architecture, in which the shell is called the cloudSHELL (Figure 2). The cloudSHELL provides the environment needed to run user applications on the cloudFPGA, abstracting CPU-FPGA, FPGA-FPGA, and FPGA-DRAM communication. It also relieves the user from other FPGA bring-up tasks, such as the specification of board-specific pin assignments and clocks.
As shown in Figure 2, the user application interacts only with the application interface in the cloudSHELL. The application interface in turn is connected to the DRAM controller, to the TCP/IP stack, and to the management module. It offers simple FIFO-based ports through which the application interacts with the DRAM and the TCP-based network. To manage each FPGA in the DC in a centralized manner, the management module consists of agents which listen on predetermined TCP ports for commands from centralized software.
Interconnecting multiple FPGAs to run distributed applications and changing the inter-FPGA data path on demand traditionally requires a new bitstream to be generated (full or partial) and the FPGA to be reconfigured. This disrupts the operation of the distributed application and introduces a high latency in its control path, which is not suitable for inherently dynamic cloud environments.
To address this issue, an agent called the "fabric agent" is added to the management module of the cloudSHELL. The fabric agent listens for external commands on a predetermined TCP port when the FPGA is configured with the cloudSHELL. An external FPGA manager connects to the fabric agent over this TCP port and issues fabric commands to dynamically interconnect ROLEs in multiple FPGAs over the DC network. After the fabric command decoder decodes the received command, the relevant commands (TCP listen, TCP connect, TCP close) are sent to the TCP connection manager in the application interface. After executing the commands, the TCP connection manager updates an internal table that associates FIFO IDs with network link IDs. The FIFO ID corresponds to the index of the FIFO exposed to the ROLE, whereas the network link ID corresponds to a TCP connection between two FPGAs. By updating the FIFO ID that corresponds to a link ID, the data path between ROLEs can be changed dynamically; fabric commands also support this dynamic update of the FIFO ID. This approach reduces the multi-FPGA fabric control path latency from tens of minutes down to sub-milliseconds: forming a multi-FPGA fabric with 2 FPGAs took 0.754 ms, whereas the traditional approach of building a new bitstream and invoking reconfiguration took 29 minutes.
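The connection manager's table update can be modeled in a few lines. This is an illustrative software sketch only; the class and method names are hypothetical and the real logic is implemented in FPGA hardware.

```python
# Software model of the TCP connection manager's internal table: each
# network link ID (a TCP connection between two FPGAs) is bound to the
# FIFO ID exposed to the ROLE.
class ConnectionManager:
    def __init__(self):
        self.table = {}  # link_id -> fifo_id

    def tcp_connect(self, link_id, fifo_id):
        # A "TCP connect" fabric command binds a new inter-FPGA link
        # to one of the ROLE-facing FIFOs.
        self.table[link_id] = fifo_id

    def update_fifo(self, link_id, fifo_id):
        # Re-binding the FIFO changes the data path between ROLEs
        # without rebuilding the bitstream or reconfiguring the FPGA.
        self.table[link_id] = fifo_id

    def tcp_close(self, link_id):
        self.table.pop(link_id, None)

mgr = ConnectionManager()
mgr.tcp_connect(link_id=0, fifo_id=1)   # route link 0 to FIFO 1
mgr.update_fifo(link_id=0, fifo_id=2)   # dynamically re-route link 0 to FIFO 2
```

Because only a small table entry changes, the re-routing cost is bounded by the command round trip rather than by a multi-minute bitstream build.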
The above cloudFPGA HW is integrated into an OpenStack-based cloud infrastructure. In OpenStack, usually one provider network is used in combination with port VLANs in the Open vSwitch-based integration bridge to maintain multi-tenancy across multiple hypervisors. However, the cloudSHELL does not yet support Open vSwitch extensions, and we do not expect that to change in the near future due to resource constraints. Hence, to integrate the cloudFPGA HW into OpenStack while maintaining multi-tenancy, we add a secondary network interface, called the FPGA network, to VMs and CTs. The FPGA network corresponding to each tenant is based on a unique provider network. The traditional Open vSwitch-based data network is shown in blue in Figure 4, whereas the FPGA network is shown in green.

REST IP Block
Any FPGA application that wants to expose its functionality as a RESTful web service must implement the HTTP protocol. Because HTTP is a standard protocol, many aspects of the communication can be implemented in an IP block that is shared by all applications. A common FPGA design technique is to configure IP blocks which are then used by the application. Our REST IP block can be configured using an OpenAPI specification, of which a subset of features is currently supported.

OpenAPI Configuration
To configure the REST IP block, we leverage an OpenAPI specification (OAS). An OAS is a machine-readable interface file for describing and documenting RESTful web services. It defines the URI paths and HTTP verbs that are available for the API. Furthermore, it defines the required parameters for each API method and specifies which MIME type each method consumes and produces.
An FPGA developer writes this specification according to the services that the application supports. The specification is then consumed by our generator tool, which creates a set of customized Verilog files implementing the specification. The customized IP block can then be instantiated in the FPGA design and connected with the actual application. The generator tool also creates a mapping between URIs and binary command words that are used to communicate with the application. Figure 5 outlines the design flow for creating the REST IP block for a specific application.
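The URI-to-command-word mapping step of the generator can be sketched as follows. The spec fragment and the numbering scheme are assumptions for illustration; the actual tool consumes a full OAS document and emits customized Verilog rather than a Python dictionary.

```python
# Minimal OpenAPI-style description: two paths, each with one verb.
spec = {
    "paths": {
        "/annotate/": {"post": {"consumes": "text/plain"}},
        "/annotate/ValueUnit": {"post": {"consumes": "text/plain"}},
    }
}

def build_command_map(spec):
    # Flatten (verb, path) pairs into consecutive binary command words.
    # Sorting makes the assignment deterministic across runs.
    mapping = {}
    code = 0
    for path, verbs in sorted(spec["paths"].items()):
        for verb in sorted(verbs):
            mapping[(verb.upper(), path)] = code
            code += 1
    return mapping

commands = build_command_map(spec)
```

In the hardware flow, each command word becomes the binary instruction supplied to the application block when the corresponding API method is invoked.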

Architecture
The architecture of the REST IP block (Figure 6) is designed for use with the shell-role architecture of the cloudFPGA platform. The REST IP block resides in the ROLE and interfaces with the cloudSHELL over two FIFO-based AXI4-Stream interfaces, an on-chip interconnect standard by ARM. There is one input stream and one output stream, which carry only the application layer data. The AXI4-Stream standard defines an ID field, which is used to indicate a connection ID and to differentiate between clients. The first stage of the REST IP block decodes the HTTP header of an incoming request. Because the TCP protocol transmits at a segment level, the REST IP block needs to re-assemble these segments into a complete HTTP request message. While decoding the HTTP header, the IP block collects and stores the segments in DDR memory on the cloudFPGA. It uses the Content-Length field to determine the overall length of the message. If this field is missing in the HTTP header, the request is flushed and an error response is sent to the client (411 Length Required). A limit can also be set on the payload length; if that limit is exceeded, the request is flushed and a 413 error response is sent to the client, indicating that the payload is too large.
The decoding stage uses a set of finite-state machines that can consume the input stream at wire speed (10 Gbps). Each state machine is responsible for a specific header field and translates the decoded element into a binary integer identifier, which is later used by the top-level state machine. If only a segment of the message has been received, the current states of the state machines are saved to a connection buffer on the FPGA. When the next segment of a connection arrives, the states are restored and the decoding continues. This context swap requires two clock cycles on the FPGA, which is roughly 15 ns for this implementation.
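The reassembly and header-checking behavior described in the two paragraphs above can be sketched in software. The class and field names are illustrative assumptions; the real decoder is a set of hardware state machines, and only its observable behavior (save/restore per connection, 411 and 413 error paths) is modeled here.

```python
class RequestDecoder:
    def __init__(self, max_payload=4096):
        self.max_payload = max_payload
        self.contexts = {}  # connection ID -> saved decoding state

    def feed(self, conn_id, segment):
        """Consume one TCP segment; return (status, body) when a full
        request is available, or None if more segments are needed."""
        buf = self.contexts.get(conn_id, b"") + segment
        if b"\r\n\r\n" not in buf:
            self.contexts[conn_id] = buf      # context swap: save state
            return None
        header, _, rest = buf.partition(b"\r\n\r\n")
        length = None
        for line in header.split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.decode())
        if length is None:
            self.contexts.pop(conn_id, None)
            return (411, b"")                 # 411 Length Required
        if length > self.max_payload:
            self.contexts.pop(conn_id, None)
            return (413, b"")                 # 413 Payload Too Large
        if len(rest) < length:
            self.contexts[conn_id] = buf      # payload still incomplete
            return None
        self.contexts.pop(conn_id, None)
        return (200, rest[:length])

dec = RequestDecoder(max_payload=64)
first = dec.feed(1, b"POST /annotate/ HTTP/1.1\r\nContent-Length: 5\r\n")
done = dec.feed(1, b"\r\nhello")
```

In hardware, the equivalent of `self.contexts` is the on-chip connection buffer, and the save/restore corresponds to the two-cycle context swap.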
The request's type and URI are decoded using the OpenAPI specification provided by the developer. The specification is used by our generator tool to create a custom state machine that maps the various API methods to binary instruction words, which are supplied to the actual application block. These commands are sent via an AXI4-Stream interface once the complete payload has been received. The payload data is supplied to the application module via a second AXI4-Stream interface. Any mismatching URIs or MIME types are immediately flushed and an error response is sent to the client without involving the application.
Responses are generated by the response encoder of the REST IP block. It consists of two finite-state machines: one responsible for the overall response and one for generating the header information. A response can be triggered from two sources: the application or the internal request decoding logic. All internal responses are error codes or the informational Continue response and do not carry a payload. Applications in turn may trigger the OK or Internal Server Error responses, which can both have a payload. A third alternative for applications is No Content, which indicates successful execution of an API call whose response requires no payload.
If the application provides a payload, it has two options: either it knows the length of the payload at the time it submits the response command, or the payload length is unknown, which is encoded as a zero-length payload with a generic OK response command. In the latter scenario, the response encoder logic sends an HTTP response header with the transfer encoding set to chunked. In this mode the payload is sent via HTTP in chunks generated by the response encoder. Although this introduces some overhead to the processing performance of the FPGA, it is a useful feature of the HTTP protocol because it avoids storing data in the local DDR memory. In this mode, the AXI4-Stream-conformant TLAST signal indicates the last piece of data sent from the application and concludes the response payload.
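The chunked framing the response encoder produces follows the standard HTTP/1.1 format: each chunk is prefixed with its length in hexadecimal, and a zero-length chunk terminates the payload. A minimal software sketch (the function name is ours, not the IP block's):

```python
def encode_chunked(status_line, pieces):
    # Build an HTTP/1.1 response using chunked transfer encoding:
    # hex length, CRLF, chunk data, CRLF for each piece, then "0\r\n\r\n".
    out = [status_line + b"\r\n",
           b"Transfer-Encoding: chunked\r\n\r\n"]
    for piece in pieces:
        out.append(b"%x\r\n" % len(piece) + piece + b"\r\n")
    out.append(b"0\r\n\r\n")  # zero-length chunk concludes the payload
    return b"".join(out)

resp = encode_chunked(b"HTTP/1.1 200 OK",
                      [b'{"annotations":', b"[]}"])
```

Because each chunk is framed independently, the encoder can stream pieces as the application produces them instead of buffering the full payload in DDR memory first.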

Example Service
As an example service we chose a natural language processing (NLP) application which scans scientific documents for relevant entities. The service accepts plain text documents and returns a set of annotations as a JSON object. The application logic is implemented using the annotation query language (AQL) FPGA compiler framework from [20]. The logic processes the document in a single pass by evaluating one byte per clock cycle, which is an eighth of the line rate. This performance could be increased by using multiple instances of the processing logic, but for the initial evaluation this has not been implemented.
The RESTful API specification defines which annotations should be returned to the client. One example is /annotate/, which returns all annotation types. Another example is /annotate/ValueUnit, which returns only the annotations of the ValueUnit type. The text analytics application core generates all of these annotations in parallel, and we filter the annotation results at the output of the core. The filtered results are then forwarded to the REST IP block, where they are sent in chunked mode to the client.
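The output filter's behavior can be sketched as follows. The annotation records and field names are hypothetical; only the URI-driven filtering described above is modeled.

```python
def filter_annotations(uri, annotations):
    # The path suffix selects the annotation type; an empty suffix
    # (i.e. /annotate/) means all types pass through.
    prefix = "/annotate/"
    wanted = uri[len(prefix):]
    if not wanted:
        return annotations
    return [a for a in annotations if a["type"] == wanted]

results = [{"type": "ValueUnit", "text": "25 W"},
           {"type": "Material", "text": "CMOS"}]
only_vu = filter_annotations("/annotate/ValueUnit", results)
```

In the hardware design this filtering happens on the annotation stream at the core's output, so unselected annotations never reach the REST IP block.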

Experiments
To evaluate our example web service application presented in Section 3, we compare its performance against a server-based implementation. The server version has been evaluated as a pure software implementation and as an accelerated service using an FPGA.

Setup
All components are connected via a 10 Gigabit Ethernet network. The application client runs on an x86-based server at 2.6 GHz and 32 GB of memory. The server machine is an IBM POWER8-based server, which runs at 2.92 GHz and has 512 GB of memory. The POWER8 processor has 20 physical cores and can run up to 160 hardware threads simultaneously. The server hosts a commercial off-the-shelf (COTS) FPGA accelerator board based on a Xilinx Kintex UltraScale FPGA, which is connected via the CAPI interface to the processor. The cloudFPGA prototype is also implemented on the same FPGA card, which is powered by a PCIe extension chassis. The card is connected via SFP+ connectors to the 10 GbE network. Figure 7 illustrates the experimental setup.
We use Apache JMeter [21] as the client application to generate the API requests. We use HTTP/1.1 as the communication protocol, which defines all connections to be persistent. This means that the TCP connection is kept alive and multiple request/response pairs can be sent over it, reducing the overhead of re-establishing a TCP connection for each call. JMeter uses multiple threads to create the requests and collects information about the processing performance.
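The persistent-connection behavior of HTTP/1.1 described above can be demonstrated with the Python standard library alone (no JMeter): a throwaway local server answers several requests sent over one TCP connection. The handler and port are illustrative, not part of the measured setup.

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"    # HTTP/1.1: connections persist by default
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):    # silence per-request logging
        pass

# Start a throwaway server on an ephemeral port in a background thread.
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Three request/response pairs over a single persistent TCP connection.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
statuses = []
for _ in range(3):
    conn.request("GET", "/")
    resp = conn.getresponse()
    statuses.append(resp.status)
    resp.read()                      # drain body so the connection is reusable
conn.close()
server.shutdown()
```

JMeter does the same at scale, keeping many such persistent connections open in parallel while recording throughput and latency.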
On the server machine we run an nginx HTTP server [22] with four worker processes. The actual web service is a Python application hosted by the uWSGI [23] application server running 20 processes. The two servers communicate via the Python standard web server gateway interface (WSGI). The web service application can access the accelerated functions on the CAPI-attached FPGA via a Python extension and can run in accelerated or in standard mode.
To evaluate the impact of the individual components we ran the tests in five scenarios:
• nginx: nginx only serving a static file
• WSGI: nginx forwarding to the uWSGI application server without application processing
• SW: the full web service API call processed in software
• CAPI FPGA: the web service using the accelerated function on the CAPI-attached FPGA
• cloudFPGA: the entire service running standalone on the cloudFPGA

Results
Apache JMeter reports the overall processing throughput in requests per second as well as the minimum, maximum and mean processing time of the individual API calls. Table 1 summarizes the results for all five test scenarios with 100 simultaneous requests in flight.
The pure nginx baseline performance is 37,389 requests/s, which is in line with publicly reported numbers. As there is no application processing involved at all, this is the highest rate at which the service could be operated on the POWER8 server. When a WSGI call is added, the throughput drops by nearly a factor of four, but the jitter of the individual processing times remains relatively small.
When running the full application API call, the performance drops significantly. The maximum throughput observed was 540 requests/s, which results in a mean processing time of 175 ms for the client. The variation in processing time also becomes much wider: while the fastest call required 31 ms, the slowest took more than six times longer, at 197 ms. When the application uses the CAPI-attached FPGA, the performance improves again. The throughput increases to 7,917 requests/s and the processing times do not exceed 9 ms for the client.
With the proposed architecture the entire service runs on the cloudFPGA. The client makes a direct RESTful API call to the FPGA, so the entire communication and application processing occurs on a single chip. Our architecture provided a processing throughput of 166,093 requests/s, which is more than 20 times higher than the accelerated version of the service on a high-end server node. All processing calls required at most 1 ms to complete. In a real-life environment this number would depend on the actual network distance from the client to the server. The results show that application throughput is increased by 15x when only the application logic is accelerated, whereas throughput is increased by 308x when the whole stack, including TCP, HTTP, REST, and the application, is accelerated. Therefore, compared to the traditional approach of acceleration using PCIe-attached FPGAs, cloudFPGA performs 20x better.
Another interesting effect was observed with the number of concurrent requests. The cloudFPGA performs significantly better than the server implementation under both very low and very high load. Figure 9 shows the number of requests per second over the number of concurrent requests in flight. If there is only one request in flight, the performance is mainly latency driven; therefore the pure hardware implementation on the cloudFPGA is much faster. With an increasing number of simultaneous requests, the performance depends more on the overall processing power of the server. The cloudFPGA and nginx measurements reach their peak at 10 concurrent requests and then remain fairly constant. All measurements involving uWSGI increase up to 100 simultaneous requests and then drop at 1,000.

Power Consumption
Power efficiency has always been a strength of FPGAs. With the cloudFPGA platform this efficiency can be fully exploited in a cloud environment. While the cloudFPGA requires at most 25 W to operate the web service, the fully equipped POWER8 server requires 340 W in idle mode and more than 360 W when running the web service, measured using the on-system sensors of the server [24]. This shows that the power consumption of the POWER8-based system is more than 13 times higher than that of the cloudFPGA.

System Cost
Adding an FPGA to each server in a DC environment significantly increases the cost of a server unit; Microsoft's Catapult implementation increased this cost by 30% [6]. For FPGA-centric applications, the server that hosts the FPGA might not be used efficiently. Hence, by deploying standalone, disaggregated FPGAs, the cost of the server can be omitted from the total cost of the FPGA infrastructure. As an example, for the application used in this paper, the system cost can be cut by around $5,000 (according to public web prices) by completely eliminating the server hosting the web application.

Resource Consumption
When developing on FPGAs, the individual functional modules require resources such as configurable logic blocks (CLBs) or internal memory elements (BlockRAM). It is therefore important to keep the resource consumption of the interface logic low, so that the actual application can utilize the remainder. Together with the cloudFPGA's network service layer, the REST IP block requires about 20% of the overall logic resources of the FPGA and 30% of the internal memory blocks. This is comparable with the resources required by the POWER service layer to implement the CAPI protocol. With newer generations of FPGAs, more resources will be available for the application module.

Related Work
Cuenca-Asensi et al. [25] present the WS reconfigurable platform to support access to web services using the simple object access protocol (SOAP). As a communication layer, HTTP is implemented in the protocol module, which also allows communication with a service broker via the UDDI protocol. The example implementation shows a Wake-on-LAN-over-Internet application that activates the device. The response times are up to four times lower compared to a software implementation on an Intel-based computer.
Yu et al. [26] present an FPGA implementation of a web server. It uses a MicroBlaze soft-core processor for some configuration and processing steps, while the main HTTP operations are performed in a web processing module in FPGA logic. Only the GET verb of the HTTP protocol is supported. For documents smaller than 10 kB, the performance is actually worse than that of the nginx web server. The main benefit claimed is the power efficiency compared to an Intel Xeon-based server.
A rudimentary implementation for RESTful webservices is presented by Chang et al. [27]. The complete TCP/IP and HTTP stack is implemented to enable an example application for home device control. The system is limited to a single TCP message on a single connection. No performance numbers are provided, but the ability to implement such services on an embedded device is shown.
Alternative implementations of the HTTP protocol and services are often based on soft-core processors. Joshi et al. [28] demonstrate this with the Nios II processor provided by Altera for its FPGAs.

Conclusion
Heterogeneous computing has arrived in the cloud. With the ever-increasing amounts of data collected and stored, the need for higher processing performance is inevitable. Using accelerator devices such as GPGPUs or FPGAs, this need can be satisfied for suitable applications.
RESTful APIs have become a de-facto standard for web applications to communicate with each other. Web services can easily be accessed from anywhere and from any platform that supports the HTTP protocol. But when using HTTP for accelerated web services, the protocol handling on a conventional processor can become a bottleneck.
In this work we propose a reusable hardware IP block that can be used on the cloudFPGA platform to create standalone web services on an FPGA. The IP block can be generated from an OpenAPI specification of the implemented API and frees the application developer from dealing with the HTTP protocol and erroneous API calls. Our example application from the natural language processing domain demonstrates a 20x throughput improvement compared to the same accelerated service on a high-end server. The mean processing time is reduced by a factor of 4. This shows the potential of such an architecture in the cloud.

Future Directions
There are remaining features that need to be implemented on the platform to provide general usability.
Encryption: While annotating publicly available documents can be done with an unencrypted connection, many services require encrypted connections to provide secure communication. To enable this type of confidentiality the secure sockets layer (SSL) must be implemented on the platform.
Authorization: Users of the accelerated device must be identified and granted access to the service. Using an external authentication server with token-based access could be a solution.
Accountability: To know how often a specific user has used a web service, its usage needs to be tracked. To maintain performance levels, this accounting can be done in the FPGA's logic and periodically reported to a master server.
Scalability: Traditionally, services are scaled by running the same service on multiple nodes. These nodes may be accessed directly by load balancing at the DNS level or using HTTP load balancers [29]. Finding an adequate load balancing solution is essential to enable scalability of the presented platform. To support HTTP load balancing, the number of cloudFPGA instances must be dynamically scaled. For dynamic scaling, we plan to use the fabric agent of the cloudSHELL.