1. Introduction
In recent times, deep neural networks (DNNs), positioned as a cornerstone technology for Artificial Intelligence (AI) and Machine Learning (ML) [1], have achieved remarkable development. This technology has been widely applied in various fields, including Computer Vision [2], Natural Language Processing [3] and Speech Recognition [4].
Nevertheless, with the improvement of universality and accuracy, the scale of DNN models is also growing, which means more memory and computational resources are required. For instance, executing inference on a 224x224 image using VGG16 entails processing over 138 million parameters through more than 15 billion operations. If executed on a Nexus 5 smartphone, the task would take approximately 16 seconds [5], which is clearly intolerable for real-time tasks. Consequently, to meet the memory and computation requirements, DNN inference tasks are typically offloaded to cloud servers with extensive computational resources. However, this traditional cloud computing paradigm encounters several challenges. First, it struggles to meet the real-time requirements of some Internet of Things (IoT) applications when the network condition is poor. Second, massive data may impose a considerable burden on network communication and cloud server processing. Moreover, concerns regarding privacy leaks due to data transmission to the cloud cannot be ignored [6]. To address these problems, device-edge collaborative inference has emerged as a promising paradigm to promote edge intelligence.
Model partition is an important technology in collaborative inference. Motivated by the significant reduction in data size of some intermediate layers compared with the input layer, a DNN model is partitioned so that the inference task can be executed sequentially on the end device and the edge server. Proper partition can make full use of the computational resources of servers within limited communication overhead [7]. Most prior research on collaborative inference has been limited to the simple scenario involving a single task and a single server. However, realistic scenarios often involve multiple edge servers (ESs) and multiple end devices (EDs) with distinct DNN tasks. Meanwhile, different end devices and edge servers, including smartphones, base stations and gateways, may exhibit different computational capacities, forming heterogeneous edge environments in which collaborative inference needs to be considered.
This paper studies the DNN partition, task offloading and scheduling problem in heterogeneous collaborative inference systems, which aims to minimize the average weighted inference latency of DNN tasks. In this problem, each task can be partitioned at different layers according to the computational capacities of devices and the network conditions, so we must determine the layers at which the DNN task is partitioned, i.e., the partition strategy. Prior explorations have been predominantly limited to DNNs with chain topology. However, many advanced DNN models adopt a DAG topology, e.g., GoogLeNet [8] and ResNet [9], which brings new challenges to collaborative inference. Besides, each task can be offloaded to one of the servers in the system, so we must determine the server to which the task is offloaded, i.e., the offloading strategy. Furthermore, more than one task can be offloaded to the same server, so we must determine the order in which the tasks are executed, i.e., the scheduling strategy. The FCFS (First-Come-First-Served) policy is commonly adopted in previous works [11]. However, in real-world scenarios, different tasks often have different priorities. For example, in a smart home system, tasks responsible for the security system need a higher priority than other tasks (such as audio control). When multiple tasks are offloaded to the same server, the scheduling strategy has an undeniable impact on their weighted inference latency. Hence, the FCFS policy can hardly adapt to this priority-aware scenario.
To fill these gaps, this paper deeply studies the collaboration of EDs and ESs in a heterogeneous scenario. We formulate this problem as an ILP problem and denote it as POSP, short for the task Partition, Offloading and Scheduling Problem. Then a heuristic scheme CIS, i.e., collaborative inference scheme, is proposed for POSP. The main contributions of this paper are summarized as follows:
1) This paper puts forward the collaborative inference in a heterogeneous scenario. The stated problem seeks to minimize the average weighted inference latency by optimizing partition strategy, scheduling strategy and offloading strategy.
2) This paper builds a system model for heterogeneous collaborative inference, and proposes a scheme CIS to minimize average weighted inference latency based on this model. CIS decouples the optimization problem into three subproblems: DNN partition, task offloading and task scheduling.
3) Based on DADS [7], a widely used scheme for DNN partition, an algorithm MCP is proposed to handle the partition of different DNN models regardless of their topology. Then we design the SWRTF policy for the task scheduling problem since tasks have different priorities. Finally, CIS utilizes branch and bound to obtain the final strategies, traversing all feasible solutions in a breadth-first manner with proper pruning.
4) Extensive experiments are conducted to verify the performance of this scheme. The comprehensive and in-depth analysis of the results demonstrates that our scheme can greatly reduce the inference latency compared with current approaches.
2. Related Work
Collaborative inference is a significant research direction in edge intelligence, which means end devices complete the DNN inference tasks with the assistance of edge servers or cloud servers.
Kang et al. [16] initially proposed layer-wise partition of DNN models as an approach to enable collaborative inference. However, their approach is limited to linearly-structured DNNs and proved ineffective for more general Directed Acyclic Graph (DAG) structured DNNs. Given that many DNNs exhibit DAG structures, Hu et al. [7] modeled the partition of these DNNs as a min-cut problem and provided a method for computing optimal partition points using max-flow solutions. On this basis, they introduced a system named DADS (Dynamic Adaptive DNN Splitting) that can handle model partition in dynamic network environments. Zhang et al. [17] noted that the min-cut-based partition method has a high time complexity, making it challenging to adapt to scenarios with rapidly changing network conditions. Consequently, they simplified the problem and introduced a two-stage system called QDMP for finding the optimal partitioning point. Wang et al. [18] proposed a hierarchical scheduling optimization strategy called DeepInference-L. By executing computations and data transfers between layers in a pipelined manner, they further reduced the overall latency of collaborative inference. Furthermore, Duan et al. [19] considered the scenarios where multiple DNN inference tasks run on a single mobile device. They employed convex optimization techniques to comprehensively address multi-task partition and scheduling strategies. However, these studies only consider the scenarios of a single device and a single server, which is not applicable to general edge computing scenarios.
Gao et al. [10] designed a dynamic evaluation strategy under a time slot model, dividing a DNN inference task into multiple subtasks and dynamically determining its offloading strategy. Tang et al. [11] proposed an iterative alternative optimization (IAO) algorithm to solve the problem of task partition in a multi-user scenario. Mohammed et al. [12] proposed that in the context of fog computing, a DNN model can be divided into multiple parts, each of which can be executed at fog nodes or locally. Combined with matching theory, an adaptive dynamic task partition and scheduling system DINA was proposed, which can greatly reduce the inference latency. Although the aforementioned studies take multiple devices into consideration, they ignore the fact that offloading all tasks to a single edge server would lead to excessive load on that server and underutilization of the resources on other servers.
To address this problem, Yang et al. [13] introduced an edge-device collaborative inference system called CoopAI, which employs a novel partition algorithm to offload a DNN inference task onto multiple edge servers. By analyzing the characteristics of DNN inference, it permits servers to pre-fetch necessary data, reducing the cost of data exchange and consequently reducing inference latency. Liao et al. [14] delved into the DNN partitioning and task offloading challenges in heterogeneous edge computing scenarios. They conducted an analysis of the task offloading issue involving multiple terminal devices and multiple edge servers. Employing an optimal matching algorithm, they proposed an algorithm that comprehensively addresses both partitioning and offloading concerns, thereby reducing overall system inference latency and energy consumption. Shi et al. [15] presented an offline partitioning and scheduling algorithm, GSPI, for enhancing the speed of DNN inference tasks in a multi-user multi-server setting. However, it's important to note that they exclusively considered scenarios where all users execute the same DNN inference task.
This paper focuses on the collaborative DNN inference problem in the scenario with multiple end devices and multiple edge servers. Given the varying computational capacities of end devices and their distinct upload bandwidths to different edge servers, the manner in which DNN models are partitioned and the selection of servers to which tasks are offloaded significantly impact the inference latency. By comprehensively considering these factors, we propose the collaborative inference scheme CIS for heterogeneous edge computing environments.
3. System Model and Problem Formulation
We first introduce the heterogeneous collaborative inference system mentioned above in Section 3.1. Then we formalize our problem in Section 3.2 with the objective of minimizing the average weighted inference latency and the decision variables comprising the partition strategy 𝑃, offloading strategy 𝑋 and scheduling strategy Φ.
3.1 Heterogeneous Collaborative Inference System
An edge computing system comprises a set of end devices and a set of resource-constrained edge servers. As shown in Fig. 1, each ED is equipped with a pretrained DNN model and executes the DNN inference task of this model. To accelerate the execution of DNN inference tasks, each DNN model can be partitioned at layer level and then offloaded to one of the ESs. The EDs and ESs are connected in a LAN, where each ES is accessible to each ED.
Fig. 1. Collaborative inference system in heterogeneous edge computing scenarios
We denote the set of 𝑛 end devices as 𝐷 = {𝑑1, 𝑑2, ⋯, 𝑑𝑛}. For convenience, we use task 𝑗 to denote the task on 𝑑𝑗. Each end device 𝑑𝑗 is associated with two parameters: 𝑤𝑗 and Cap𝑑(𝑗). Here 𝑤𝑗 represents the priority of task 𝑗, and a task with greater 𝑤𝑗 has a higher priority. Cap𝑑(𝑗) represents the computational capacity of ED 𝑑𝑗 measured in FLOPS. It is worth noting that in the heterogeneous system, these parameters may vary across EDs. In this paper, we assume each task can be partitioned at most once, which means each task can be offloaded to at most one server.
We denote the set of 𝑚 edge servers as 𝑆 = {𝑠1, 𝑠2, ⋯, 𝑠𝑚}. Let Cap𝑠(𝑖) denote the computational capacity of 𝑠𝑖, measured in FLOPS. Let 𝑏𝑖j denote the bandwidth between 𝑠𝑖 and 𝑑𝑗. In this paper, we assume that each 𝑏𝑖j is given and constant. ESs pre-load the DNN models of the tasks offloaded to them. After the intermediate data is sent to the server, the task is added to a waiting list to wait for scheduling. Once a server is idle, it selects a task from the waiting list to execute, and the execution cannot be interrupted until the task is finished. Table 1 lists the main symbols used in this article.
Table 1. Main notations
3.2 System Model
3.2.1 DNN Layer-level Computation and Output data Model
DNN models are usually composed of a series of layers, such as convolutional layers, activation layers, pooling layers and fully connected layers. To compute the inference latency of a DNN, we must analyze the computational cost and output data size of each layer of the DNN model.
Layer-level Computational Cost: We measure the computational cost of each layer by its FLoating point OPerations (FLOPs), which represent the number of basic mathematical operations (such as additions, subtractions, multiplications, etc.) to be performed. Similar methods have been used in [14]. Let com(𝑣) denote the computational cost of layer 𝑣. Since some layers, such as activation layers, have a very small computational cost, we only consider the main DNN layers whose computational cost has a noticeable impact on the inference latency of the model, as follows:
• Convolutional Layer: Convolutional layer is one of the most basic layers in DNNs. It performs convolution operations on the input data through a set of convolution kernels to extract local features at different locations. The computational cost of convolution layer depends on the size of the input feature map and the size and number of the convolution kernel. For the convolution layer 𝑣, assuming the size of the input feature map is 𝑤𝑖n × ℎ𝑖n, the size of the convolution kernel is 𝑤𝑘 × ℎ𝑘, and the number of channels of the input feature map is 𝐶𝑖n, the number of channels of the output feature map is 𝐶out, the size of stride is 𝑤𝑠 × ℎ𝑠, then its computational cost is:
\(\begin{align}\operatorname{com}(v)=\left(\frac{w_{\text {in }}-w_{k}}{w_{s}}+1\right) *\left(\frac{h_{\text {in }}-h_{k}}{h_{s}}+1\right) * C_{\text {in }} * C_{\text {out }} * w_{k} * h_{k} * 2\end{align}\), (1)
where \(\begin{align}\left(\frac{w_{i n}-w_{k}}{w_{s}}+1\right) *\left(\frac{h_{i n}-h_{k}}{h_{s}}+1\right)\end{align}\) represents the number of multiplicative operations required for each output position and it is also the number of additive operations required for each output position.
• Fully Connected Layer: Fully connected layer is also one of the most basic layers in deep neural networks. By connecting each input neuron to an output neuron and giving each connection a weight, the features extracted from the previous layers are combined and integrated to generate the final output. For fully connected layer 𝑣, assuming that the dimension of the input feature vector is 𝑑𝑖n and the dimension of the output feature vector is 𝑑𝑜ut, then its computational cost is:
\(\begin{align}\operatorname{com}(v)=\left(d_{in}+\left(d_{in}-1\right)\right) * d_{out}\end{align}\), (2)
where \(d_{in}\) counts the multiplicative operations and \((d_{in}-1)\) counts the additive operations required for each output neuron.
Output Data Size: We use data(𝑣) to denote the output data size of layer 𝑣. For layer 𝑣, assuming the size of its output feature map is 𝐶𝑜ut × 𝑤𝑜ut × ℎ𝑜ut, the output data size of this layer is:
\(\begin{align}\operatorname{data}(v)=C_{out} * w_{out} * h_{out}\end{align}\), (3)
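For concreteness, the cost model of Eqs. (1)-(3) can be written down directly; the following is a minimal Python sketch rather than the profiling code used in our experiments, and the layer parameters in the example are illustrative only.

```python
# A minimal sketch of the per-layer cost model in Eqs. (1)-(3).

def conv_flops(w_in, h_in, c_in, c_out, w_k, h_k, w_s, h_s):
    """Eq. (1): FLOPs of a convolutional layer (multiplications + additions)."""
    w_out = (w_in - w_k) // w_s + 1   # output positions along the width
    h_out = (h_in - h_k) // h_s + 1   # output positions along the height
    return w_out * h_out * c_in * c_out * w_k * h_k * 2

def fc_flops(d_in, d_out):
    """Eq. (2): d_in multiplications and (d_in - 1) additions per output neuron."""
    return (d_in + (d_in - 1)) * d_out

def output_data(c_out, w_out, h_out):
    """Eq. (3): number of elements in the output feature map of a layer."""
    return c_out * w_out * h_out

# Illustrative values only: an AlexNet-style first convolutional layer
# (224x224x3 input, 64 kernels of size 11x11, stride 4) and a 4096 -> 1000
# fully connected layer.
print(conv_flops(224, 224, 3, 64, 11, 11, 4, 4))   # about 1.35e8 FLOPs
print(fc_flops(4096, 1000))                        # about 8.19e6 FLOPs
print(output_data(64, 54, 54))                     # 186,624 elements
```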
If the tensor size of the input image is (3 × 224 × 224), the computation and output data size of the layers of MobileNet_V2 model are shown in Fig. 2.
Fig. 2. Computation and output data size of each layer of MobileNetV2 model.
3.2.2 DNN Partition Model
The inference process of a DNN is actually a process of forward propagation: starting from the input layer and moving forward, each layer performs a series of calculations on its own input and sends the results to its subsequent layers as their input. Thus, given a DNN model 𝑀, we can represent 𝑀 as a DAG (directed acyclic graph) 𝐺 = <𝑉, 𝐸>, where each 𝑣𝑖 ∈ 𝑉 corresponds to one layer and a directed edge 𝑒𝑖j ∈ 𝐸 represents the dependency between 𝑣𝑖 and 𝑣𝑗. It should be emphasized that each vertex may have multiple edges starting from it and multiple edges ending at it. For example, Fig. 3(a) shows a piece of GoogLeNet [8], which can be modeled as the DAG shown in Fig. 3(b).
Fig. 3. A piece of GoogLeNet (a) and the DAG corresponding to it (b)
In the context of deep neural networks (DNNs), it should be noted that the computational cost and output data size of each layer differ and are independent of each other, which provides an opportunity for DNN partition. DNN partition divides a DNN model into two parts that are executed on different devices. Formally, we define 𝑝𝑗 = <𝑉𝑙𝑗, 𝑉𝑟𝑗> as the partition of task 𝑗, which partitions its vertex set 𝑉𝑗 into two disjoint subsets 𝑉𝑙𝑗 and 𝑉𝑟𝑗. The layers corresponding to the vertices in 𝑉𝑙𝑗 are executed on an ED, and the layers corresponding to the vertices in 𝑉𝑟𝑗 are executed on an ES. Fig. 3(b) shows a partition of the piece of GoogLeNet mentioned above.
Thus, as to task 𝑗 and one of its partitions 𝑝𝑗, the local computational cost is:
\(\begin{align}L_{j}\left(p_{j}\right)=\sum_{v \in V_{j}^{l}} \operatorname{com}(v)\end{align}\), (4)
the transmission data size is:
\(\begin{align}C_{j}\left(p_{j}\right)=\sum_{v \in V_{j}^{c}} \operatorname{data}(v)\end{align}\), (5)
and the remote computational cost is:
\(\begin{align}R_{j}\left(p_{j}\right)=\sum_{v \in V_{j}^{r}} \operatorname{com}(v)\end{align}\), (6)
where 𝑉𝑙𝑗 and 𝑉𝑟𝑗 denote the set of layers executed on the ED and the set of layers executed on the ES, respectively. 𝑉𝑐𝑗 represents the set of layers that need to send their output to the ES, i.e., each layer in 𝑉𝑐𝑗 belongs to 𝑉𝑙𝑗 and has a successor layer in 𝑉𝑟𝑗.
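Given the per-layer profiles com(·) and data(·) from Section 3.2.1, Eqs. (4)-(6) can be evaluated directly from the DAG. The sketch below is a minimal illustration under these definitions and is not tied to any particular framework.

```python
# A minimal sketch of Eqs. (4)-(6) for one task j and one partition <V_l, V_r>.

def partition_costs(edges, com, data, V_l, V_r):
    """edges: iterable of (u, v) layer-dependency pairs of the DAG;
    com/data: dicts mapping a layer to its FLOPs / output data size;
    V_l/V_r: sets of layers executed on the ED / on the ES."""
    L_j = sum(com[v] for v in V_l)                      # Eq. (4), local cost
    R_j = sum(com[v] for v in V_r)                      # Eq. (6), remote cost
    # V_c: layers on the ED with at least one successor on the ES
    V_c = {u for (u, v) in edges if u in V_l and v in V_r}
    C_j = sum(data[v] for v in V_c)                     # Eq. (5), transmitted data
    return L_j, C_j, R_j
```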
3.2.3 Task Scheduling Model
In reality, there are often fewer edge servers than end devices, so it is common for multiple tasks to be offloaded to the same server. In our system, a server can only execute one task at a time, so tasks need to wait for scheduling before execution. As a result, the scheduling policy, i.e., the execution order of tasks, has a great impact on the average weighted inference latency. Previous related works schedule tasks in a first-come-first-served (FCFS) manner [11], but this cannot solve our problem well because tasks have different priorities. For example, suppose three tasks are offloaded to the same server. We define the arrival time of a task as the time elapsed before its intermediate data reaches the server, including the local computing time and the data transmission time, and the server computing time as the time of task execution on the server. Let the (arrival time, server computing time, priority) triples of the three tasks be (5, 5, 3), (7, 2, 2) and (3, 6, 1), respectively. Fig. 4 shows the results of two scheduling strategies, where the left one follows FCFS and the right one uses a different order. Their average weighted inference latencies are 83/3 and 24 respectively, which shows that different scheduling strategies have an important impact on inference latency. To formally represent the scheduling strategy, let 𝜙(𝑠𝑖) denote the task sequence on 𝑠𝑖; then the scheduling strategy of the system can be represented as Φ = {𝜙(𝑠1), 𝜙(𝑠2), ⋯, 𝜙(𝑠𝑚)}. The kth scheduled task on 𝑠𝑖 can be represented as 𝜙𝑘(𝑠𝑖), where 1 ≤ 𝑘 ≤ |𝜙(𝑠𝑖)|.
Fig. 4. Two different scheduling strategies
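The two averages quoted above can be checked with a few lines of Python. This is a minimal sketch; the alternative order used here is simply one ordering that attains 24 and is not necessarily the exact schedule drawn in Fig. 4.

```python
# Each task is (arrival, server_time, priority); all tasks are released at
# time 0, so the weighted latency of a task is priority * completion time.

def avg_weighted_latency(order):
    clock, total = 0.0, 0.0
    for arrival, server_time, priority in order:
        clock = max(clock, arrival) + server_time   # wait if the ES is busy
        total += priority * clock
    return total / len(order)

t1, t2, t3 = (5, 5, 3), (7, 2, 2), (3, 6, 1)
print(avg_weighted_latency([t3, t1, t2]))   # FCFS by arrival: 83/3 ~ 27.67
print(avg_weighted_latency([t1, t2, t3]))   # an alternative order: 24.0
```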
3.2.4 Task Offloading and Inference Latency Model
In the heterogeneous system, due to the disparity in computational capacity and bandwidth between ESs, the task offloading strategy obviously affects the inference latency. Formally, we use a set of binary variables 𝑋 = {𝑥11, 𝑥12, ⋯, 𝑥1𝑚, ⋯, 𝑥𝑛1, 𝑥𝑛2, ⋯, 𝑥𝑛m} to denote the offloading strategy. Specifically, 𝑥𝑖j = 1 if and only if task 𝑗 is offloaded to ES 𝑠𝑖; otherwise, 𝑥𝑖j = 0. Since one task can be offloaded to at most one ES, we have \(\sum_{i=1}^{m} x_{ij} \leq 1, \forall 1 \leq j \leq n\); if all 𝑥𝑖j = 0, task 𝑗 is executed entirely locally.
Once a task is partitioned and decided to be offloaded to an ES, the execution of this task can be divided into four stages: local computing, data transmission, waiting for scheduling and remote computing. Therefore, if task 𝑗 is partitioned by 𝑝𝑗 and offloaded to ES 𝑠𝑖, its inference latency is:
\(\begin{align}T_{ij}=t_{ij}^{\text{local}}+t_{ij}^{\text{trans}}+t_{ij}^{\text{wait}}+t_{ij}^{\text{remote}}\end{align}\), (7)
where \(t_{ij}^{\text{local}}\) denotes the latency of local computing:
\(\begin{align}t_{ij}^{\text{local}}=\frac{L_{j}\left(p_{j}\right)}{\operatorname{Cap}_{d}(j)}\end{align}\), (8)
\(t_{ij}^{\text{trans}}\) denotes the latency of data transmission:
\(\begin{align}t_{i j}^{\text {trans }}=\frac{C_{j}\left(p_{j}\right)}{b_{i j}}\end{align}\), (9)
\(t_{ij}^{\text{wait}}\) denotes the latency of waiting for scheduling:
\(\begin{align}t_{ij}^{\text{wait}}=\max \left(t_{ij^{\prime}}^{\text{local}}+t_{ij^{\prime}}^{\text{trans}}+t_{ij^{\prime}}^{\text{wait}}+t_{ij^{\prime}}^{\text{remote}},\; t_{ij}^{\text{local}}+t_{ij}^{\text{trans}}\right)-\left(t_{ij}^{\text{local}}+t_{ij}^{\text{trans}}\right)\end{align}\), (10)
where 𝑗′ denotes the task scheduled immediately before 𝑗 on 𝑠𝑖, i.e., if 𝑗 is 𝜙𝑘(𝑠𝑖), then 𝑗′ is 𝜙𝑘−1(𝑠𝑖); task 𝑗 waits only if the server is still executing the previous task when its intermediate data arrives, and the first task scheduled on 𝑠𝑖 has \(t_{ij}^{\text{wait}}=0\). \(t_{ij}^{\text{remote}}\) denotes the latency of remote computing:
\(\begin{align}t_{ij}^{\text{remote}}=\frac{R_{j}\left(p_{j}\right)}{\operatorname{Cap}_{s}(i)}\end{align}\). (11)
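For a fixed execution order on one ES, Eqs. (7)-(11) can be evaluated sequentially. The sketch below is a minimal illustration assuming each task record carries L_j(p_j), C_j(p_j), R_j(p_j), the bandwidth b_ij and the two capacities.

```python
# A minimal sketch of Eqs. (7)-(11) for the tasks offloaded to one ES, listed
# in their scheduling order phi(s_i).

def inference_latencies(tasks):
    """tasks: list of dicts with keys L, C, R, b, cap_d, cap_s."""
    latencies, prev_finish = [], 0.0
    for t in tasks:
        t_local = t["L"] / t["cap_d"]                   # Eq. (8)
        t_trans = t["C"] / t["b"]                       # Eq. (9)
        arrival = t_local + t_trans                     # intermediate data arrives
        t_wait = max(prev_finish, arrival) - arrival    # Eq. (10)
        t_remote = t["R"] / t["cap_s"]                  # Eq. (11)
        prev_finish = arrival + t_wait + t_remote       # ES becomes idle here
        latencies.append(t_local + t_trans + t_wait + t_remote)  # Eq. (7)
    return latencies
```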
Assuming that all tasks start at the same time, the average weighted inference latency of the entire system can then be computed; it is formulated as the optimization objective in the next subsection.
3.2.5 Problem Formulation
The QoE [23] of applications based on DNN models improves as the inference latency decreases. Since different tasks have different priorities, we try to minimize the average weighted inference latency 𝑇 of the system by collaborative inference. Considering the computational capacities and network bandwidths of the system, an inference latency minimization problem is formulated in this subsection. We refer to this optimization problem, with partition strategy 𝑃, offloading strategy 𝑋 and scheduling strategy Φ, as POSP, i.e., the task Partition, Offloading and Scheduling Problem, and define it as follows:
\(\begin{align}\begin{aligned} \text { POSP: } & \min _{P, \phi, X} T \\ \text { s.t. } & \text { C1: } V_{j}^{l} \cup V_{j}^{r}=V_{j}, V_{j}^{l} \cap V_{j}^{r}=\emptyset, \forall 1 \leq j \leq n \\ & \text { C2: } \sum_{i=1}^{m} x_{i j} \leq 1, \forall 1 \leq j \leq n \\ & C 3: x_{i j}=1 \text { or } 0, \forall 1 \leq j \leq n, \forall 1 \leq i \leq m \\ & C 4: \phi\left(s_{i}\right)=\left\{j \mid x_{i j}=1\right\} .\end{aligned}\end{align}\). (12)
The optimization objective function 𝑇 is the average weighted inference latency of the entire system, which can be formulated as:
\(\begin{align}T=\frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n} x_{i j} w_{j} T_{i j}\end{align}\). (13)
Constraint 𝐶1 guarantees the validity of the partition strategy of each task. Constraints 𝐶2 and 𝐶3 indicate that each task can be offloaded to at most one ES, or simply be completed locally. Constraint 𝐶4 guarantees that the task sequence on 𝑠𝑖 is consistent with the offloading strategy 𝑋. Due to the existence of multiple variables and the high coupling between them, problem POSP is too complex to be solved directly. Thus, we need to further analyze and decompose problem POSP to obtain the optimal strategies.
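As a small illustration, the objective of Eq. (13) can be evaluated for any complete strategy. The sketch below assumes x, w and T are plain nested dictionaries holding the offloading variables, the priorities and the per-task latencies T_ij of Eq. (7).

```python
# A minimal sketch of Eq. (13): the average weighted inference latency.

def objective(x, w, T, n):
    """x[i][j] in {0, 1}; w[j] is the priority of task j; T[i][j] is its
    inference latency when offloaded to ES s_i; n is the number of tasks."""
    return sum(x[i][j] * w[j] * T[i][j]
               for i in x for j in x[i]) / n
```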
4. Algorithm Design
We have modeled our problem as an optimization problem POSP and pointed out that it is complex owing to the existence of multiple decision variables. In this section, we first reveal several structural properties of our problem and then decompose it into three parts. After that, we propose an algorithm MCP for DNN partition, a scheduling policy SWRTF for task scheduling and an algorithm BBO for task offloading to solve POSP step by step.
4.1 Problem Decomposition
Since there are multiple sets of variables in POSP and the variables are coupled with each other, we need to decouple them to decompose the complex problem. The main idea of our scheme is to give the corresponding partition strategy 𝑃 and scheduling strategy Φ for each offloading strategy 𝑋, thus binding 𝑃 and Φ to 𝑋, which means that once 𝑋 is determined, 𝑃 and Φ can be determined accordingly. Then we only need to solve the new problem 𝒫, which has only 𝑋 as its decision variables. Thus, our scheme can be decomposed into three steps: 1) give the partition strategy 𝑃 for each offloading strategy 𝑋, 2) give the scheduling strategy Φ for each offloading strategy 𝑋, and 3) generate the new problem 𝒫, which has only 𝑋 as its decision variables, and solve it to obtain the optimal offloading strategy 𝑋. Since steps 1) and 2) have bound 𝑃 and Φ to 𝑋, with the result of step 3) we can give the final optimal strategies 𝑃, Φ and 𝑋 for problem POSP.
Fig. 5. CIS flow
4.1.1 DNN Partition
Most DNNs have many different partitions, which makes the problem more difficult to solve. However, some partitions lead to excessive inference latency and are thus almost impossible to be part of the optimal solution. Therefore, we can reduce the solution space by selecting an optimal partition 𝑝𝑖j for each DNN task 𝑗 when trying to offload it to ES 𝑠𝑖. Based on the method proposed in [7], we design an algorithm MCP, i.e., min-cut based partition, which first constructs a latency graph 𝐺′ for each task 𝑗 and ES 𝑠𝑖 based on the DAG 𝐺 of the DNN model of task 𝑗, converting the optimal partition problem into the minimum weighted s–t cut problem of 𝐺′, and then obtains the optimal partition 𝑝𝑖j using a min-cut algorithm.
For task 𝑗 and ES 𝑠𝑖, let 𝐺 = <𝑉, 𝐸> denote the corresponding DAG of the DNN model of task 𝑗. We can construct a weighted DAG 𝐺′, i.e., its latency graph, as follows:
1) Add a source node 𝑠 and a sink node 𝑡 to 𝐺′.
2) Add the remote computing edges 𝐸remote: For each node 𝑣 ∈ 𝑉, add an edge from 𝑠 to 𝑣, whose weight is com(𝑣)/Cap𝑠(𝑖), i.e., the time for executing this layer on 𝑠𝑖.
3) Add the local computing edges 𝐸local: For each node 𝑣 ∈ 𝑉, add an edge from 𝑣 to 𝑡, whose weight is com(𝑣)/Cap𝑑(𝑗), i.e., the time for executing this layer on 𝑑𝑗.
4) Add the data transmission edges 𝐸trans: It is worth noting that the output data of a layer needs to be transmitted at most once even if it has more than one successor layer. Thus, there are two cases in which we add data transmission edges.
a) For each node 𝑣 ∈ 𝑉 that has only one successor 𝑣′ in 𝐺, add an edge from 𝑣 to 𝑣′, whose weight is \(\begin{align}\frac{\operatorname{data}(v)}{b_{i j}}\end{align}\), i.e., the time for transmitting 𝑣's output data from 𝑑𝑗 to 𝑠𝑖.
b) For each node 𝑣 ∈ 𝑉 that has more than one successor in 𝐺, we first add a virtual node 𝑣virtual to 𝐺′, then add an edge from 𝑣 to 𝑣virtual, whose weight is \(\begin{align}\frac{\operatorname{data}(v)}{b_{i j}}\end{align}\), and then add edges from 𝑣virtual to all the successors of 𝑣 in 𝐺, whose weights are positive infinity.
Fig. 6 shows an example of constructing the latency graph of a DAG. At this stage, the optimal partition problem has been converted into the minimum weighted s–t cut problem of the latency graph 𝐺′. For an s–t cut 𝐶 of 𝐺′, the edges in 𝐶 comprise three parts: remote computing edges, local computing edges and data transmission edges. Thus, the value of 𝐶 is exactly equal to the inference latency of the DNN task under the partition induced by 𝐶, without considering the time waiting for scheduling. We then use a min-cut algorithm to obtain the minimum weighted s–t cut of 𝐺′ and derive the optimal partition 𝑝𝑖j from it. In the following steps of determining the scheduling and offloading strategies, we suppose task 𝑗 is partitioned at 𝑝𝑖j when offloaded to 𝑠𝑖. Now we have bound 𝑃 to 𝑋, i.e., the partition strategy 𝑃 can be determined accordingly once the offloading strategy 𝑋 is determined. The time complexity of the min-cut algorithm is 𝑂(|𝑉|2 |𝐸| ln |𝑉|), where |𝑉| and |𝐸| are the numbers of nodes and edges in the DAG, and the time complexity of constructing the latency graph is 𝑂(|𝑉| + |𝐸|). Therefore, the time complexity of MCP is 𝑂(𝑚n|𝑉′|2 |𝐸′| ln |𝑉′|), where |𝑉′| and |𝐸′| are the largest numbers of nodes and edges among all the DAGs.
Fig. 6. Constructing the latency graph
Algorithm 1. MCP
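To complement Algorithm 1, the sketch below illustrates steps 1)-4) and the min-cut step; it assumes the networkx package is available and is an illustration rather than the exact implementation of Algorithm 1. Layers on the source side of the cut stay on the ED, and those on the sink side are offloaded.

```python
# A minimal sketch of MCP for one task j and one ES s_i: build the latency
# graph G' from the DNN's DAG, then take a minimum weighted s-t cut.
import networkx as nx

def mcp_partition(dag, com, data, cap_d, cap_s, b_ij):
    """dag: nx.DiGraph of the DNN; returns (cut_value, V_l, V_r)."""
    G = nx.DiGraph()
    for v in dag.nodes:
        G.add_edge("s", v, capacity=com[v] / cap_s)    # remote computing edge
        G.add_edge(v, "t", capacity=com[v] / cap_d)    # local computing edge
        succ = list(dag.successors(v))
        if len(succ) == 1:                             # case a): single successor
            G.add_edge(v, succ[0], capacity=data[v] / b_ij)
        elif len(succ) > 1:                            # case b): virtual node
            virt = ("virtual", v)
            G.add_edge(v, virt, capacity=data[v] / b_ij)
            for u in succ:
                G.add_edge(virt, u)   # no capacity attr => infinite capacity
    cut_value, (s_side, t_side) = nx.minimum_cut(G, "s", "t")
    V_l = set(dag.nodes) & s_side      # layers kept on the end device
    V_r = set(dag.nodes) & t_side      # layers offloaded to the edge server
    return cut_value, V_l, V_r
```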
4.1.2 Task Scheduling
To give the scheduling strategy Φ for each offloading strategy 𝑋, we design a new scheduling policy for tasks offloaded to the same ES called “shortest weighted remaining time first” (SWRTF). In particular, SWRTF has three rules:
1) The ES will not be idle unless there are no tasks in the waiting list.
2) Tasks are executed non-preemptively, that is, once a task starts executing, other tasks must wait until the execution of the task ends.
3) The task with the shortest weighted remaining time RT𝑗 is scheduled first, where \(RT_j = t_{ij}^{\text{remote}}/w_j\).
For a given task 𝑗 and ES 𝑠𝑖, if task 𝑗 is decided to be offloaded to 𝑠𝑖, then the partition 𝑝𝑖j of task 𝑗 is given as described in Section 4.1.1. Suppose the set of tasks offloaded to 𝑠𝑖 is 𝐽𝑖 = {𝑗1, 𝑗2, ⋯, 𝑗𝑞}. Then for each 𝑗𝑘 in 𝐽𝑖, the local computation 𝐿𝑗𝑘(𝑝𝑖𝑗𝑘), transmission data size 𝐶𝑗𝑘(𝑝𝑖𝑗𝑘) and remote computation 𝑅𝑗𝑘(𝑝𝑖𝑗𝑘) of 𝑗𝑘 are given, so \(t_{ij_k}^{\text{local}}\), \(t_{ij_k}^{\text{trans}}\) and \(t_{ij_k}^{\text{remote}}\) can be obtained from Eqs. (8), (9) and (11). Thus, the arrival time, i.e., the sum of the local computing and data transmission times, and the weighted remaining time of each task are known, and the scheduling strategy 𝜙(𝑠𝑖) for these tasks can be given based on SWRTF. The same goes for the other ESs. Now we have bound Φ to 𝑋, i.e., the scheduling strategy Φ can be determined accordingly once the offloading strategy 𝑋 is determined.
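A minimal sketch of SWRTF on a single ES follows; each offloaded task is assumed to be described by its arrival time (local computing plus transmission), its remote computing time and its priority.

```python
# A minimal sketch of the SWRTF policy: the ES never idles while a task is
# waiting (rule 1), runs tasks non-preemptively (rule 2), and among the
# arrived tasks picks the one with the smallest RT_j = t_remote / w_j (rule 3).

def swrtf_order(tasks):
    """tasks: list of (task_id, arrival, t_remote, weight); returns phi(s_i)."""
    pending = sorted(tasks, key=lambda t: t[1])       # sort by arrival time
    order, clock = [], 0.0
    while pending:
        ready = [t for t in pending if t[1] <= clock]
        if not ready:                                 # nothing has arrived yet:
            clock = pending[0][1]                     # jump to the next arrival
            continue
        nxt = min(ready, key=lambda t: t[2] / t[3])   # shortest weighted remaining time
        pending.remove(nxt)
        order.append(nxt[0])
        clock += nxt[2]                               # execute to completion
    return order
```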
4.1.3 Task Offloading
Based on the practical considerations on DNN partition and task scheduling in Sections 4.1.1 and 4.1.2, we can bind the partition strategy 𝑃 and scheduling strategy Φ to the offloading strategy 𝑋, i.e., we only need to decide the offloading strategy 𝑋, and the partition strategy 𝑃 and scheduling strategy Φ will be decided accordingly. As a result, our initial optimization problem POSP can be transformed into a 0-1 integer optimization problem that has only one set of variables 𝑋:
\(\begin{align}\begin{array}{ll}\mathcal{P}: & \min _{X} T \\ \text { s.t. } & \text { C1: } x_{i j}=1 \xrightarrow{\text { yields }} p_{j}=p_{i j} \\ & \text { C2: } \sum_{i=1}^{m} x_{i j} \leq 1, \forall 1 \leq j \leq n \\ & \text { C3: } x_{i j}=1 \text { or } 0, \forall 1 \leq j \leq n, \forall 1 \leq i \leq m \\ & \text { C4: } \phi\left(s_{i}\right)=\left\{j \mid x_{i j}=1\right\}\end{array}\end{align}\). (14)
Actually, this is a variant of the generalized assignment problem (GAP) in which the cost of assigning a task to a server, i.e., its weighted inference latency 𝑤𝑗𝑇𝑖j, can be changed by the other tasks assigned to the same server, since the waiting time \(t_{ij}^{\text{wait}}\) is affected by 𝑋. Motivated by the algorithm proposed by Ross and Soland [20], we design a heuristic algorithm BBO, i.e., branch and bound optimization, to solve this problem. The detailed introduction of BBO is in Section 4.2. Once the offloading strategy 𝑋 is determined by BBO, the partition strategy 𝑃 and scheduling strategy Φ can be determined accordingly, thus determining the solution of the initial problem POSP.
4.2 Branch and Bound Optimization Algorithm (BBO)
Actually, branch and bound is a way to traverse the entire solution space of the problem with pruning to limit the time complexity: branching generates the solutions and bounding enables pruning. Following branch and bound, the solution set of 𝒫 is separated into two mutually exclusive and collectively exhaustive subsets based on the 0-1 dichotomy of a variable's value, and so are the subsets created subsequently. Fig. 7 gives an example of separating the solution set of 𝒫. Each separation creates two new candidate problems whose solution sets differ only in the value assigned to a particular variable. We use BFS (breadth-first search) to traverse the solution sets with bounding and pruning.
Fig. 7. An example of the solution space of this problem
The main processing procedures for each candidate problem 𝒫𝑘 are:
1) Bounding: make a relaxation of 𝒫𝑘 to get a lower bound Θ𝑘 of the objective function in this branch according to the relaxed problem 𝒫ℛ𝑘 and update the upper bound of the objective function for problem 𝒫 by substituting a feasible solution into the objective function.
2) Branching: select a variable as the separation variable to further separate the solution set.
3) Pruning: check whether the lower bound of this branch is so large that the branch should be pruned. Detailed explanations of these procedures are given below. For notational convenience, let 𝒫𝑘 denote the candidate problem of a branch and 𝐹𝑖 denote the set of tasks that have been fixed to be offloaded to ES 𝑠𝑖 in 𝒫𝑘.
4.2.1 Bounding
Bounding is the procedure of computing the upper and lower bounds of each candidate problem for pruning. Relaxing the problem to obtain a bound is a traditional technique in branch and bound algorithms. The current candidate problem 𝒫𝑘 is too complex to solve directly since the assignment cost of each task is tightly related to the offloading strategy of the other tasks, i.e., the assignment cost of each task is unknown at the beginning. Thus, we obtain the relaxed problem 𝒫ℛ𝑘 by fixing the assignment cost of each task relative to each ES before solving the problem. In the current candidate problem 𝒫𝑘, the assignment of some tasks has already been decided. Therefore, we first compute the total weighted inference latency of an ES 𝑠𝑖 considering only the tasks that have been decided to be offloaded to it. Then, for each task 𝑗 whose assignment has not been decided, the assignment cost 𝐶𝑖j equals the increment of the total weighted inference latency of 𝑠𝑖 after assigning task 𝑗 to 𝑠𝑖:
𝐶𝑖j = 𝑇𝑖(𝑗) − 𝑇𝑖, (15)
where 𝑇𝑖 denotes the existing total weighted inference latency of tasks in 𝐹𝑖 utilizing SWRTF and 𝑇𝑖(𝑗) is the new total weighted inference latency of tasks in 𝐹𝑖⋃{𝑗} utilizing SWRTF. The relaxed problem 𝒫ℛ𝑘 can be formalized as:
\(\begin{align}\begin{array}{l} \mathcal{P} \mathcal{R}_{k}: \min _{X} T^{\prime} \\ \text { s.t. } \\ \text { C1: } \sum_{i=1}^{m} x_{i j} \leq 1, \forall 1 \leq j \leq n \text {, } \\ \text { C2: } x_{i j}=1 \text { or } 0, \forall 1 \leq j \leq n, \forall 1 \leq i \leq m \text {, } \end{array}\end{align}\), (16)
where \(\begin{align}T^{\prime}=\frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n} x_{i j} C_{i j}\end{align}\) is the average weighted assignment cost. Since 𝐶𝑖j is known, problem 𝒫ℛ𝑘 has an obvious solution 𝑋𝑘: assign every unassigned task \(j \notin \bigcup_{i=1}^{m} F_{i}\) to the ES 𝑠𝑖 that minimizes 𝐶𝑖j. Substituting this solution into 𝒫ℛ𝑘 yields the lower bound of this branch, denoted by Θ𝑘. Obviously, 𝑋𝑘 is a feasible solution for 𝒫𝑘, so we can calculate a valid objective function value \(T_k^{valid}\) of the initial problem 𝒫 by substituting 𝑋𝑘 into 𝒫𝑘. Since the problem seeks the minimum objective function value, \(T_k^{valid}\) can be used to update the upper bound Ω of 𝒫.
Function ProblemRelax(𝒫𝑘)
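The following sketch shows what ProblemRelax might compute. Here total_latency is a hypothetical helper that returns the total weighted inference latency of a task set on one ES under SWRTF (Section 4.1.2), and the latency already incurred by the fixed tasks is added so that Θ𝑘 is comparable with the objective of Eq. (13); both points are assumptions of this illustration.

```python
# A minimal sketch of the bounding step: compute the assignment costs C_ij of
# Eq. (15), build the obvious solution X_k of the relaxed problem PR_k, and
# return the lower bound Theta_k together with X_k.

def problem_relax(fixed, free, servers, total_latency, n):
    """fixed: dict i -> set F_i of tasks already fixed to ES s_i;
    free: tasks whose assignment is still open; n: total number of tasks;
    total_latency(i, task_set): hypothetical SWRTF evaluator."""
    base = {i: total_latency(i, fixed[i]) for i in servers}        # T_i
    x_k, incremental = {}, 0.0
    for j in free:
        costs = {i: total_latency(i, fixed[i] | {j}) - base[i]     # C_ij, Eq. (15)
                 for i in servers}
        i_star = min(costs, key=costs.get)     # assign j where C_ij is smallest
        x_k[j] = i_star
        incremental += costs[i_star]
    theta_k = (sum(base.values()) + incremental) / n   # lower bound of this branch
    return theta_k, x_k
```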
4.2.2 Branching
Branching is the procedure of separating the solution space of the problem into different sub-problems for traversal. To select an 𝑥𝑖j ∈ 𝑋 as the separation variable, i.e., to separate the problem according to the value of 𝑥𝑖j, we compute a "re-offloading profit" 𝛿𝑖j for each ES 𝑠𝑖 and each task 𝑗 whose offloading target has not been determined in the current candidate problem, that is, \(j \notin \bigcup_{i=1}^{m} F_{i}\). Let 𝑋𝑘(𝑖) denote the solution obtained by modifying the feasible solution 𝑋𝑘 such that task 𝑗 is re-offloaded to ES 𝑠𝑖. Obviously, 𝑋𝑘(𝑖) is also a feasible solution for 𝒫𝑘. Then 𝛿𝑖j is defined as the reduction of the objective function 𝑇 in 𝒫𝑘:
𝛿𝑖j = 𝑇(𝑋𝑘) − 𝑇(𝑋𝑘(𝑖)), (17)
where 𝑇(𝑋𝑘) denotes the value of the objective function obtained by substituting 𝑋𝑘 into problem 𝒫𝑘. The selected separation variable 𝑥𝑖∗𝑗∗ is the one with the maximum 𝛿𝑖∗𝑗∗ among those with \(j \notin \bigcup_{i=1}^{m} F_{i}\) in the candidate problem 𝒫𝑘. Then, the solution set is further separated into two subsets, one with 𝑥𝑖∗𝑗∗ = 1 and the other with 𝑥𝑖∗𝑗∗ = 0.
Function MaxProfit(𝒫𝑘, X𝑘)
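Similarly, a sketch of what MaxProfit might do is given below; objective is a hypothetical helper that evaluates T of Eq. (13) for a complete assignment, with the partitions and SWRTF schedules implied by it.

```python
# A minimal sketch of the branching step: pick the separation variable
# x_{i*j*} with the largest re-offloading profit delta_ij of Eq. (17).

def max_profit(x_k, free, servers, objective):
    """x_k: dict task -> ES from ProblemRelax; free: tasks not yet fixed;
    objective(x): hypothetical evaluator of T (Eq. (13)) for a complete x."""
    t_xk = objective(x_k)
    best_var, best_delta = None, float("-inf")
    for j in free:
        for i in servers:
            x_mod = dict(x_k)
            x_mod[j] = i                      # X_k with task j re-offloaded to s_i
            delta = t_xk - objective(x_mod)   # re-offloading profit, Eq. (17)
            if delta > best_delta:
                best_var, best_delta = (i, j), delta
    return best_var                           # the separation variable indices (i*, j*)
```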
4.2.3 Pruning
Pruning is the procedure that limits the time complexity of traversing the solution space. To avoid redundant computation, we can safely discard some solution sets based on two rules. First, as described in the bounding procedure, we maintain the upper bound Ω of the initial problem 𝒫 and compute a lower bound Θ𝑘 for each candidate problem 𝒫𝑘. If Θ𝑘 > Ω, this branch can obviously be pruned since it cannot generate the optimal solution. Second, during the BFS of the solution sets, we set a threshold 𝜔 on the maximum number of candidate problems in each level. Specifically, the candidate problems in the same level are processed in one round, and if the number of candidate problems in a round exceeds 𝜔, those with the largest lower bounds Θ𝑘 are pruned.
4.2.4 Algorithm Design and Analysis
Algorithm 2 illustrates the pseudocode of our BBO algorithm. We maintain two problem sets 𝑄 and 𝑄′ for BFS and an upper bound Ω of the initial problem 𝒫 (line 1). The main loop of this algorithm is the BFS process over the solution sets, and each loop can be decomposed into three steps. In step 1, we compute the lower bound Θ𝑘 and feasible solution 𝑋𝑘 of the candidate problem 𝒫𝑘 using the function ProblemRelax(𝒫𝑘) (line 4). In step 2, if the lower bound Θ𝑘 of 𝒫𝑘 is not greater than Ω, we update the upper bound Ω (lines 6-7) and select a separation variable 𝑥𝑖∗𝑗∗ for branching with the function MaxProfit(𝒫𝑘, 𝑋𝑘) (lines 8-11). After conducting steps 1 and 2 for each candidate problem in this round (line 3), we check whether the problem set for the next round is too large and remove some candidate problems if necessary (lines 12-14). Finally, we return the solution with the minimum objective function value (lines 21-22). There are at most 𝑚n processing rounds, at most 𝜔 candidate problems are processed in each round, and the time complexity of processing each candidate problem is 𝑂(𝑛2 log 𝑛). Thus, the time complexity of the BBO algorithm is 𝑂(𝜔mn3 log 𝑛).
Algorithm 2. BBO (Branch and Bound Optimization)
Since BBO gives the offloading strategy 𝑋, the partition strategy 𝑃 and scheduling strategy Φ can be given accordingly based on Section 4.1.
5. Implementation and Evaluation
In this section, we first introduce the prototype setup for our experiment, and then compare our scheme with several existing schemes.
5.1 Prototype Setup
To evaluate the performance of our scheme, we build a heterogeneous device-edge system prototype. We use two laptops as the edge servers to assist the end devices in executing their inference tasks, one equipped with a 6-core 2.60 GHz Intel CPU and 16 GB RAM and the other equipped with a 4-core 1.60 GHz Intel CPU and 8 GB RAM. The end devices comprise two Raspberry Pis, each equipped with a 4-core ARM Cortex-A72 CPU, a Jetson TX2 equipped with a 4-core ARM Cortex-A57 CPU, and a Jetson Xavier NX equipped with a 6-core ARMv8 CPU. All end devices are connected to the edge servers through a LAN. We use the pretrained DNN models from the standard implementations provided by PyTorch. An AlexNet model is deployed on Raspberry Pi 1, a MobileNet_V2 model on Raspberry Pi 2, a ResNet18 model on the Jetson TX2, and a VGG19 model on the Jetson Xavier NX. The priorities of the tasks on the four end devices are 1, 2, 3, and 4, respectively. The settings are listed in Table 2. We use tiny-ImageNet [21], a subset of the ILSVRC2012 classification dataset, as our dataset.
Table 2. Experimental Settings
5.2 Benchmarks
We evaluate our scheme by comparing its performance with four naive schemes and a state-of-the-art (SOTA) scheme as follows:
• Local-Only (LO): All inference tasks are executed locally.
• Edge-Only (EO): All EDs offload their entire tasks to the edge servers without DNN partition; the edge server for each task is selected randomly and tasks on each server are executed in the FCFS order.
• DADS [7] with random allocation and FCFS scheduling (RA-FS): We first use DADS to compute the optimal partition of each task for each ES, and then randomly allocate tasks to ESs where tasks are scheduled in the FCFS manner.
• DADS with random allocation and SWRTF scheduling (RA): We first use DADS to compute the optimal partition of each task for each ES, and then randomly allocate tasks to ESs where tasks are scheduled in the SWRTF manner.
• CCORAO [22]: This is a state-of-the-art algorithm for cloud-assisted mobile edge computing in vehicular networks. The main idea of CCORAO is to decide the offloading strategy and resource allocation strategy iteratively. We adapt it to our problem POSP, i.e., deciding the offloading strategy and scheduling strategy iteratively.
5.3 Experimental Results
We first test the inference latency of each task on different devices. It can be seen in Fig. 8 that the computing capacity of different devices varies and the inference latency when tasks are executed locally is too high to support some real-time applications. Thus, we need to carefully design the collaborative inference scheme to reduce the inference latency in this heterogeneous system.
Fig. 8. The inference latency of different models on different devices.
Then we conduct multiple experiments by varying the bandwidth between devices. The bandwidth configurations for the experiments are shown in Table 3. We use four different bandwidth configurations to simulate different network conditions of the system: Configuration 1 simulates a system whose overall network condition is good; Configuration 2 simulates a system whose overall network condition is bad; Configuration 3 simulates a system in which the network condition varies between different devices; and Configuration 4 simulates a system in which the network conditions from the end devices to the two ESs differ greatly.
Table 3. Four bandwidth configurations
Table 4 shows the partition and offloading strategies for tasks given by CIS at different configurations, where 2/11 means the model AlexNet has 11 layers and is partitioned at the 2nd layer, 0/11 means offloading the entire task to the edge server and 11/11 means executing the entire task locally. We can draw from the results that the better the network condition, the earlier the tasks are offloaded. When the system suffers adverse network conditions, the end devices tend to execute their tasks locally.
Table 4. Partition and offloading strategy at different configurations.
Fig. 9 shows the average weighted inference latency of different schemes. It can be seen that our scheme CIS performs similarly to CCORAO, and both reduce the average weighted inference latency by about 29% to 71% compared with the other four naive schemes. Further analysis of the experimental results reveals that the result of LO is stable but far from optimal; the result of EO is influenced by network conditions and may lead to intolerable latency when the network is poor; and the results of RA-FS and RA are relatively better since they improve on LO and EO. CIS and CCORAO jointly consider DNN partition, task offloading and scheduling, so they can always obtain the optimal solutions when the problem space is small.
Fig. 9. The average weighted inference latency of different schemes at different configurations.
5.4 Simulation Experiments
In this section, a series of simulation experiments are conducted to further evaluate the performance of our proposed scheme CIS. Initially, we evaluate the performance of CIS under different network conditions in Section 5.4.1. Then we test the schemes when the size and number of tasks change in Section 5.4.2. Finally, the robustness of CIS under different computational capacity patterns of the multiple edge servers and devices is validated in Section 5.4.3. For the numerical analysis, the computational capacities of the EDs, the computational capacities of the ESs and the bandwidth between each ED and ES are drawn uniformly from [1, 5] FLOPS, [10, 20] FLOPS and [0.1, 2.0] Mbps, respectively. The DNN models in our experiments include AlexNet, MobileNet_V2, ResNet18 and VGG19.
5.4.1 Performance with different network conditions
In this section, we evaluate the performance of CIS under different network conditions. The simulation scenario has 12 EDs and 6 ESs; the computational capacities of the EDs are [1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 2.9, 2.3, 4.0], the computational capacities of the ESs are [11, 15, 17.5, 20, 24, 22], and the priorities of the tasks on the EDs are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. An AlexNet model is deployed on 𝑑1, 𝑑5, 𝑑9; a MobileNet_V2 model on 𝑑2, 𝑑6, 𝑑10; a ResNet18 model on 𝑑3, 𝑑7, 𝑑11; and a VGG19 model on 𝑑4, 𝑑8, 𝑑12. For convenience, we set the bandwidth between all devices to the same value. Fig. 10 shows the simulation results at different bandwidths. The inference latencies of CIS and CCORAO are similar and always smaller than those of the other schemes since the problem space is small and they can always obtain the optimal solution. The inference latency of LO does not vary with bandwidth, while the performance of the other five schemes improves as the network bandwidth increases. At the beginning, the improvement is quite significant since the network condition is the main bottleneck; when the bandwidth is high enough, the improvement becomes smaller as the computational capacity becomes the main bottleneck. Overall, our proposed scheme CIS performs well under different network conditions.
Fig. 10. Average weighted inference latency at different bandwidths.
5.4.2 Performance with different numbers of tasks
In this section, we evaluate the performance of CIS with different numbers of tasks. We fix 100 ESs whose computational capacities are randomly drawn from [10, 20] FLOPS. Then we conduct a series of experiments with different numbers of EDs, i.e., different numbers of tasks. For each number of EDs, we repeat the experiment 50 times. For each experiment, the computational capacity of each ED is randomly drawn from [1, 5] FLOPS, the DNN model deployed on it is randomly picked from AlexNet, MobileNet_V2, ResNet18 and VGG19, and the bandwidth between an ED and an ES is randomly drawn from [0.1, 2.0] Mbps. We take the results of scheme LO as the baseline and compute the relative average weighted inference latency of the other five schemes, i.e., the average weighted inference latency of each scheme divided by that of LO. Then the results of the 50 experiments are averaged to compare the different schemes. Fig. 11 shows the simulation results. There are four main observations regarding these results:
Fig. 11. Relative average weighted inference latency with different numbers of tasks
• The performance of each scheme decreases as the number of tasks increases. This is intuitive since the number and computational capacities of the ESs are fixed.
• EO is always the worst scheme and can lead to more than 140% of the average weighted inference latency of LO. This is because the ES to offload to is selected randomly for each task, so many tasks may be offloaded to the same ES, increasing the latency. Likewise, RA-FS and RA can also perform worse than LO since their selection of the ES to offload to is random. They are better than EO since they first partition the DNN models, which reduces the computational overhead on the ESs.
• The performance gap between CIS and RA-FS or RA increases as the number of tasks increases, because the number of tasks offloaded to the same ES is small at the beginning, that is, the scheduling strategy of tasks has little impact on the results. The more the tasks, the more important it is to decide proper offloading strategy and scheduling strategy.
• Compared to CCORAO, when the number of tasks is small, the advantage of CIS is not obvious since both can obtain the optimal solutions. However, as the number of tasks increases, CCORAO cannot conduct enough rounds of iteration to converge. Although CIS also discards some subproblems to limit its complexity, its carefully designed heuristic rules for choosing which subproblems to discard have an undeniable positive impact on the results.
5.4.3 Performance with different computational capacity patterns
In this section, we validate the robustness of CIS under different computational capacity patterns of the edge servers and devices. We fix the numbers of EDs and ESs to 300 and 100, respectively. Each DNN model is deployed on 75 EDs. The computational capacities of the ESs and EDs are randomly drawn from [10, 20] FLOPS and [1, 5] FLOPS, respectively. We conduct the experiments 100 times and compare the relative average weighted inference latency of the other five schemes. The results are shown in Fig. 12. Compared with EO, RA-FS and RA, our scheme CIS has obvious advantages: the inference latency under CIS is always lower. What's more, the results of CIS are less dispersed, which means unacceptable results rarely occur. This is mainly because the first three schemes select the offloading strategy randomly, which is clearly not feasible when the number of tasks is large. As for CCORAO and CIS, both almost always demonstrate a certain level of performance improvement over the baseline scheme LO. However, the results of CIS exhibit a more concentrated distribution and a lower maximum inference latency. This indicates that our carefully designed heuristic rules for discarding subproblems in Section 4.2.3 are indeed effective.
Fig. 12. Relative average weighted inference latency with different computational capacity patterns
6. Conclusion
In this paper, we study DNN inference acceleration in a heterogeneous edge computing scenario. We present a comprehensive analysis of collaborative inference in the heterogeneous scenario and point out the complexity of this problem. A scheme CIS is proposed that jointly combines DNN partition, task offloading and task scheduling to accelerate DNN inference tasks. Extensive experiments are conducted to evaluate our scheme. With a detailed analysis of the evaluation results, CIS is validated to be effective in reducing the average weighted inference latency of the system.
Acknowledgement
This work was supported in part by the Science and Technology Project of State Grid Co., LTD (Research on data aggregation and dynamic interaction technology of enterprise-level real-time measurement data center, 5108-202218280A-2-399-XG).
References
- Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol.521, pp.436-444, May, 2015. https://doi.org/10.1038/nature14539
- J. Chen, and X. Ran, "Deep Learning With Edge Computing: A Review," Proceedings of the IEEE, vol.107, no.8, pp.1655-1674, Aug. 2019. https://doi.org/10.1109/JPROC.2019.2921977
- J. Chai, and A. Li, "Deep Learning in Natural Language Processing: A State-of-the-Art Survey," in Proc. of 2019 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 1-6, 2019.
- W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4960-4964, 2016.
- J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen, "MoDNN: Local distributed mobile computing system for Deep Neural Network," in Proc. of Design, Automation & Test in Europe Conference & Exhibition, pp.1396-1401, 2017.
- Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica, "Low Latency Geo-distributed Data Analytics," ACM SIGCOMM Computer Communication Review, vol.45, no.4, pp.421-434, Aug. 2015. https://doi.org/10.1145/2829988.2787505
- C. Hu, W. Bao, D. Wang, and F. Liu, "Dynamic Adaptive DNN Surgery for Inference Acceleration on the Edge," in Proc. of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pp.1423-1431, 2019.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," in Proc. of 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-9, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.
- M. Gao, R. Shen, L. Shi, W. Qi, J. Li, and Y. Li, "Task Partitioning and Offloading in DNN-Task Enabled Mobile Edge Computing Networks," IEEE Transactions on Mobile Computing, vol.22, no.4, pp.2435-2445, Apr. 2023. https://doi.org/10.1109/TMC.2021.3114193
- X. Tang, X. Chen, L. Zeng, S. Yu, and L. Chen, "Joint Multiuser DNN Partitioning and Computational Resource Allocation for Collaborative Edge Intelligence," IEEE Internet of Things Journal, vol.8, no.12, pp.9511-9522, 2021. https://doi.org/10.1109/JIOT.2020.3010258
- T. Mohammed, C. Joe-Wong, R. Babbar, and M. Di Francesco, "Distributed Inference Acceleration with Adaptive DNN Partitioning and Offloading," in Proc. of IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pp.854-863, 2020.
- C.-Y. Yang, J.-J. Kuo, J.-P. Sheu, and K.-J. Zheng, "Cooperative Distributed Deep Neural Network Deployment with Edge Computing," in Proc. of ICC 2021 - IEEE International Conference on Communications, pp.1-6, 2021.
- Z. Liao, W. Hu, J. Huang, and J. Wang, "Joint multi-user DNN partitioning and task offloading in mobile edge computing," Ad Hoc Networks, vol.144, 2023.
- L. Shi, Z. Xu, Y. Sun, Y. Shi, Y. Fan, and X. Ding, "A DNN inference acceleration algorithm combining model partition and task allocation in heterogeneous edge computing system," Peer-to-Peer Networking and Applications, vol.14, pp.4031-4045, 2021. https://doi.org/10.1007/s12083-021-01223-1
- Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge," ACM SIGARCH Computer Architecture News, vol.45, no.1, pp.615-629, 2017. https://doi.org/10.1145/3093337.3037698
- S. Zhang, Y. Li, X. Liu, S. Guo, W. Wang, J. Wang, B. Ding, and D. Wu, "Towards Real-time Cooperative Deep Inference over the Cloud and Edge End Devices," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol.4, no.2, pp.1-24, Jun. 2020. https://doi.org/10.1145/3397315
- N. Wang, Y. Duan, and J. Wu, "Accelerate Cooperative Deep Inference via Layer-wise Processing Schedule Optimization," in Proc. of 2021 International Conference on Computer Communications and Networks, pp.1-9, 2021.
- Y. Duan, and J. Wu, "Joint Optimization of DNN Partition and Scheduling for Mobile Cloud Computing," in Proc. of ICPP '21: Proceedings of the 50th International Conference on Parallel Processing, pp.1-10, 2021.
- G. T. Ross, and R. M. Soland, "A branch and bound algorithm for the generalized assignment problem," Mathematical programming, vol.8, pp.91-103, Dec. 1975. https://doi.org/10.1007/BF01580430
- Y. Le, and X. Yang, "Tiny imagenet visual recognition challenge," CS 231N, vol.7, no.7, 2015. https://cs231n.stanford.edu/reports/2015/pdfs/yle_project.pdf
- J. Zhao, Q. Li, Y. Gong and K. Zhang, "Computation Offloading and Resource Allocation for Cloud Assisted Mobile Edge Computing in Vehicular Networks," IEEE Transactions on Vehicular Technology, vol.68, no.8, pp.7944-7956, 2019. https://doi.org/10.1109/TVT.2019.2917890
- Y. Chen, K. Wu, Q. Zhang, "From QoS to QoE: A Tutorial on Video Quality Assessment", IEEE Communications Surveys & Tutorials, vol.17, no.2, pp.1126-1165, 2015. https://doi.org/10.1109/COMST.2014.2363139