huggingchat/papers-content / 2301 /2301.02494.md

|

64.1 kB

Adaptive Pattern Extraction Multi-Task Learning for Multi-Step Conversion Estimations

Xuewen Tao*
xuewen.txw@mybank.cn
MYbank, Ant Group
Beijing, China

Mingming Ha*
hamingming_0705@foxmail.com
School of Automation and Electrical
Engineering, University of Science
and Technology Beijing; MYbank, Ant
Group
Beijing, China

Qiongxu Ma
qiongxu.mqx@mybank.cn
MYbank, Ant Group
Shanghai, China

Hongwei Cheng
chw286885@mybank.cn
MYbank, Ant Group
Shanghai, China

Wenfang Lin
moxi.lwf@mybank.cn
MYbank, Ant Group
Hangzhou, Zhejiang, China

Xiaobo Guo†
xb_guo@bjtu.edu.cn
Institute of Information Science,
Beijing Jiaotong University, Mybank,
Ant Group,
Beijing, China

Abstract

Multi-task learning (MTL) has been successfully used in many real-world applications, which aims to simultaneously solve multiple tasks with a single model. The general idea of multi-task learning is designing kinds of global parameter sharing mechanism and task-specific feature extractor to improve the performance of all tasks. However, challenge still remains in balancing the trade-off of various tasks since model performance is sensitive to the relationships between them. Less correlated or even conflict tasks will deteriorate the performance by introducing unhelpful or negative information. Therefore, it is important to efficiently exploit and learn fine-grained feature representation corresponding to each task. In this paper, we propose a Adaptive Pattern Extraction Multi-task (APEM) framework, which is adaptive and flexible for large-scale industrial application. APEM is able to fully utilize the feature information by learning the interactions between the input feature fields and extracted corresponding tasks-specific information. We first introduce a DeepAuto Group Transformer module to automatically and efficiently enhance the feature expressivity with a modified set attention mechanism and a Squeeze-and-Excitation operation. Second, explicit Pattern Selector is introduced to further enable selectively feature representation learning by adaptive task-indicator vectors. Empirical evaluations show that APEM outperforms the state-of-the-art MTL methods on public and real-world financial services datasets. More importantly, we explore the online performance of APEM in a real industrial-level recommendation scenario.

CCS Concepts: • Information systems → Information systems applications; Computational advertising; • Multi-task Learning;

Keywords: Recommender System, Sequential Dependency, Multi-Task Learning, Representation Learning

1 Introduction

Multi-task learning (MTL) has enjoyed rather remarkable successes for various real-world scenarios, such as the online recommendation [7, 22], display advertising [5], customer acquisition management, financial service [30] and so forth. MTL techniques simultaneously learn multiple tasks by implicitly passing the message among related tasks [3, 24], compared with single-task learning, which can improve the overall performance of these tasks [4, 20]. In online advertising, recommendation, and customer acquisition etc, post-view click-through rate (CTR), post-click conversion rate (CVR) and post-view click-through & conversion rate (CTCVR) estimations are a series of classical tasks [15, 29] with sequential dependence of the customer acquisition process. In this case, sequential pattern of user behaviours means the later action only occur after the former action. More generally, this sequential pattern can be extended to multi-step conversion. As shown in Fig. 1, a multi-step conversion example in fiance service strictly follow this sequential dependence pattern. A customer will convert through stages of Impression → Click → Authorize → Conversion. Conversion behaviors such like applying loans, make deposit or purchasing investment products are only permitted after an authorization. Aimed at various kinds of concrete industrial applications, extensive studies [6, 14, 15, 25, 26, 28] focus on the CTR and CVR estimations. However, there are rare works to provide a formalized definition of the sequential dependence MTL (SDMTL) problem, which is of particular

*These authors contributed equally to this research.

†Xiaobo Guo is the corresponding author.significance in multi-step conversion estimations applicable to more diverse scenarios. In addition, the connection and difference between the general MTL and SDMTL is also unclear.

Figure 1. An illustration of a multi-step conversion in Finance Service.

Series works from ESMM [15] to ESCM² [29] pay more attention on the unbiased CVR estimation problem in a view of causality to correct sample selection bias. The dependent relation between the tasks like CTR and CVR is implicitly implied via the distribution of sample space. Recently, [30] captures the task dependency through the information transfer between different conversion steps and combines a calibrator to further constrain the dependent relationship. However, the dependency between each steps is still not deeply defined and discussed in most MTL works from a theoretical format.

Besides, as mentioned above, MTL methods improve prediction result through an information passing mechanism between tasks, which suggests improper feature sharing will result in even poorer or imbalanced performance in different tasks known as a negative transfer phenomenon. Therefore, general approach in MTL mainly focuses on designing kinds of information extraction modules (experts) to learn common and task-specific representations. Such as Cross-Stitch Network [16] and Sluice Network [19] employ a linear combination to leverage representations of different tasks but also require much more training parameters. SOTA method of Multi-gate Mixture-of-Experts (MMoE) approach [14] adopts an ensemble of experts submodules and gating network to model task relationships while consuming less computation. Progressive Layered Extraction (PLE) [22], separates task-common and task-specific parameters explicitly which could further avoid parameter conflicts caused by complex task correlation. These approaches assign individual parameters to each task to better exploit task information and improve model generalization. Nonetheless, feature expressivity with respect to each task is still limited since task-irrelevant information passing from the shared structure and more fine-grained representation learning is necessary.

In this paper, we first provide a formal definition of MTL on sequential dependence problem, and propose an optimizing object paradigm for recover the dependent relationship based on theoretical proof. And we also present a novel MTL

framework called Adaptive Pattern Extraction Multi-task (APEM) framework to selectively and dynamically enhance the representation learning for respective tasks along with the dependency-based object. APEM consists of two main modules: Adaptive Sample-wise Representation Generator (ASRG) and explicit Pattern Selector (PS). ASRG employs a dynamic selection mechanism to learn the hierarchical feature interaction from a sample-wise view to further separates the task-irrelevant information. The implement of explicit PS enables fine-grained feature learning by introducing task-specific indicator vectors. In a summary, main contributions of this paper are presented as follows:

• The SDMTL problem is first formally formulated, and its connections and differences with the general MTL problem are illustrated. Moreover, the distribution dependence relationship between the adjacent task spaces is revealed from a theoretical perspective.
• We present a multi-task learning framework named APEM for selectively fine-grained feature representation learning from a sample-wise view. ASRG and PS modules within APEM adaptively reconstruct the implicit shared representations and extract explicit task-specific information in an more efficient way.
• Extensive experiments on public and real-world industrial dataset are conducted to evaluate the effectiveness of APEM. Experiment results demonstrate that our proposed approach outperforms the state-of-the-art MTL methods. Furthermore, we explore the boundary of APEM in real-world industrial applications to prove its efficiency for large-scale online recommendations.

2 Preliminaries

In this section, the SDMTL problem and the connection between SDMTL and general MTL are elaborated. Then, from the expected loss's point of view, the distribution relationship between the adjacent task domains is revealed.

2.1 Problem Formulation

Consider a SDMTL problem over an input space $\mathbb{X}$ and a set of task ${\mathcal{T}i}{i=1}^N$ , where $N$ is the number of tasks and the corresponding task spaces are denoted as ${\mathbb{T}1, \dots, \mathbb{T}N}$ . A large dataset of data points ${x_j, o_j^1, \dots, o_j^N}{j=1}^M$ are given, where $M$ is the number of data points and $o_j^i \in {0, 1}$ corresponding to a binary classification problem or $o_j^i \in \mathbb{R}$ for a regression problem is the label of the $i$ -th task for the $j$ -th data point. Differing from the general MTL problem, for the SDMTL problem, there exists the sequential dependence relationship between tasks in the sense that the current task $\mathcal{T}i$ depends on the previous task $\mathcal{T}{i-1}$ , i.e., $\mathcal{T}{i-1} \rightarrow \mathcal{T}i$ . Let $X$ and $T_i$ be the random variables over the input space $\mathbb{X}$ and task output space $\mathbb{T}i$ , respectively. In this paper, for convenience of analysis, each task is set as a binary classification task. Asmentioned in literature [17] and shown in Fig. 1, for sequential dependence, one of core properties is that if the event $T{i-1}$ is not triggered, then the event $T_i$ must not occur, i.e., $P(T_i = 1, T{i-1} = 0|X) = 0$ , where $P(\cdot|\cdot)$ denotes the conditional probability. Therefore, according to this property, the random variables $T_i$ satisfies

$\begin{aligned} P(T_i = 1|X) &= \sum_{t_{i-1}, \dots, t_1 \in \{0,1\}} P(T_i = 1, \dots, T_1 = t_1|X) \\ &= P(T_i = 1, T_{i-1} = 1, \dots, T_1 = 1|X) \\ &= P(T_i = 1, T_{i-1} = 1|X), \\ P(T_i = 0|X) &= \sum_{t_{i-1}, \dots, t_1 \in \{0,1\}} P(T_i = 0, \dots, T_1 = t_1|X) \\ &= P(T_i = 0, T_{i-1} = 0|X) + P(T_i = 0, T_{i-1} = 1|X), \end{aligned} \quad (1)$

which implies that the positive samples of $T_i$ are derived from the positive samples of the task $T_i$ while the negative samples consist of the negative samples of tasks $T_i$ and $T_{i-1}$ due to the sequential dependence.

In addition, the sequential dependence relationship is also embodied in the constraints with respect to the conversion probabilities of the adjacent tasks. In [30], the sequential dependence relationship is formalized as

$\begin{aligned} P(T_1 = 1|X) &\geq P(T_2 = 1, T_1 = 1|X) \\ &\dots \\ &\geq P(T_{i-1} = 1, \dots, T_1 = 1|X) \\ &\geq P(T_i = 1, T_{i-1} = 1, \dots, T_1 = 1|X). \end{aligned} \quad (2)$

Then, a behavioral expectation calibrator is introduced into AITM [30] to guarantee the sequential dependence relationship (2). When the outputs of the model violate this condition, the designed loss will output a positive penalty term. However, the condition (2) cannot completely reflect the dependence relationship between tasks. Reconsidering the dependence relationship between $P(T_{i-1} = 1, \dots, T_1 = 1|X)$ and $P(T_i = 1, \dots, T_1 = 1|X)$ , it leads to

$\begin{aligned} P(T_{i-1} = 1|X) - P(T_i = 1|X) &= P(T_{i-1} = 1|X) - P(T_i = 1, T_{i-1} = 1|X) \\ &= P(T_{i-1} = 1|X) [1 - P(T_i = 1|T_{i-1} = 1, X)] \\ &= P(T_{i-1} = 1|X) P(T_i = 0|T_{i-1} = 1, X) \\ &= P(T_i = 0, T_{i-1} = 1|X). \end{aligned} \quad (3)$

Therefore, the dependence relationship between the adjacent tasks needs to satisfy the equality constraints (3), which also contains the dependence information $P(T_i = 1, T_{i-1} = 0|X) = 0$ .

Define a parametric hypothesis class per task as $f_i(x; \theta^s, \theta^i): \mathbb{X} \rightarrow \mathbb{T}_i$ , where $\theta^s$ and $\theta^i$ are shared parameters and task-specific parameters of the task $i$ . Also, the task-specific loss function is defined as $L_i(\cdot, \cdot): \mathbb{T}_i \times \mathbb{T}_i \rightarrow \mathbb{R}^+$ . Similar to the general MTL problem, the objective of SDMTL is to minimize

the following expected loss:

$\begin{aligned} \min_{\theta_s, \theta_1, \dots, \theta_N} \sum_{i=1}^N E_{X, T_1, \dots, T_N \sim \mathcal{O}} [w_i L_i(f_i(X; \theta^s, \theta^i), T_i)] \\ \text{s.t. } f_i(X; \theta^s, \theta^i) - f_{i-1}(X; \theta^s, \theta^i) = P(T_i = 0, T_{i-1} = 1|X), \\ i = 2, \dots, N, \end{aligned} \quad (4)$

where $\mathcal{O}$ is the distribution with domain $\mathbb{X} \times \mathbb{T}1 \times \dots \times \mathbb{T}N$ , and $w_i$ is the static or dynamically computed weight per task. Therefore, the SDMTL problem can be considered as a general MTL with the constraints $f{i-1}(x_j; \theta^s, \theta^{i-1}) - f_i(x_j; \theta^s, \theta^i) = P(T_i = 0, T{i-1} = 1|X)$ , which implies that the difference of $f_{i-1}(x_j; \theta^s, \theta^{i-1})$ and $f_i(x_j; \theta^s, \theta^i)$ is the probability of the event $T_i$ not occurring when the event $T_{i-1}$ is triggered. Considering the binary classification tasks and the labels $o_j^i$ and $o_j^{i-1}$ of the adjacent tasks, we can obtain the label corresponding to the probability $P(T_i = 0, T_{i-1} = 1|X)$ as

Table 1. Labels corresponding to tasks $\mathcal{T}_{i-1}$ , $\mathcal{T}_i$ , and the dependence relationship.

$o_j^i$	$o_j^{i-1}$	$o_j^i - o_j^{i-1}$	$P(T_i = 0, T_{i-1} = 1\|X)$
0	0	0	0
1	0	1	1
1	1	0	0

Therefore, according to Table 1, the label of the sequential dependence between the adjacent tasks is equivalent to the difference between $o_j^i$ and $o_j^{i-1}$ , i.e., $o_j^i - o_j^{i-1}$ .

Since there exists the dependence relationship between the previous and current tasks, i.e., $\mathcal{T}_1 \rightarrow \mathcal{T}_2 \rightarrow \dots \rightarrow \mathcal{T}_N$ , the sample space of the current task depends on that of the previous task. In general, the sample space of the previous task contains the sample space of the current one as shown in Fig. 2, which leads to the data distribution discrepancy between these two sample spaces. Consider the general CTR, CVR and CTCVR estimation tasks, i.e., impression $\rightarrow$ click $\rightarrow$ $\dots$ $\rightarrow$ conversion. We use the random variables $Y \in {0, 1}$ and $Z \in {0, 1}$ to denote the click event and the conversion event, respectively. Then, CTR, CVR and CTCVR with feature input $X$ are defined as $P(Y|X)$ , $P(Z|Y = 1, X)$ and $P(Z, Y = 1|X)$ , which satisfy

$P(Z = 1, Y = 1|X) = P(Y = 1|X)P(Z = 1|Y = 1, X). \quad (5)$

In this case, the training space of the traditional CVR estimation task is generally determined by the samples with $Y = 1$ in the CTR estimation task. However, for a new user, there are no impression and click records. The conversion rate estimation task is actually to estimate $P(Z = 1|X)$ . Considering $P(Z = 1, Y = 0|X) = 0$ [17] and according to (1), we can obtain

$\begin{aligned} P(Z = 1|X) &= P(Z = 1, Y = 1|X), \\ P(Z = 0|X) &= P(Z = 0, Y = 0|X) + P(Z = 0, Y = 1|X). \end{aligned} \quad (6)$ Figure 2. Distribution discrepancy of different task spaces in SDMTL. The curved surface represents the distribution $\mathcal{O}$ with domain $\mathbb{X} \times \mathbb{T}_1 \times \dots \times \mathbb{T}_N$ . The colored circles from the outside to the inside denote the domains of tasks $\mathcal{T}_1$ , $\mathcal{T}_2$ , $\mathcal{T}_3$ , and $\mathcal{T}_4$ , respectively.

According to (6), if the negative data points of the event $Z$ only derived from the space with $Y = 1$ is used to predict $P(Z = 1|X)$ , then the data distribution discrepancy between training space and inference space leads to inaccurate predictions.

2.2 Distribution Dependence Relationship Between Inference Space and Local Space

In this subsection, the relationship of expected losses between domains of adjacent tasks $\mathcal{T}_{i-1}$ and $\mathcal{T}i$ is established. In tasks $\mathcal{T}{i-1}$ and $\mathcal{T}i$ , the sample space with data points ${x_j \in \mathbb{X}, o_j^{i-1} \in {0, 1}, o_j^i \in {0, 1}}$ is called inference space, i.e., entire space for $\mathcal{T}{i-1}$ and $\mathcal{T}_i$ , and the sample space with data points ${x_j \in \mathbb{X}, o_j^{i-1} \in {1}, o_j^i \in {0, 1}}$ is called local space, also called training space in some traditional CVR estimation methods [15]. The distributions of inference and local spaces are denoted as $\mathcal{D}$ and $\mathcal{C}$ , respectively.

Therefore, the objective of these two tasks $\mathcal{T}_{i-1}$ and $\mathcal{T}_i$ with sequential dependence in inference space is to minimize the following expected loss:

$\begin{aligned} & E_{X, T_{i-1}, T_i \sim \mathcal{D}} [L(f_{i-1}(X), T_{i-1}) + L(f_i(X), T_i)] \\ &= E_{X, T_{i-1} \sim \mathcal{D}} [L(f_{i-1}(X), T_{i-1})] + E_{X, T_i \sim \mathcal{D}} [L(f_i(X), T_i)] \\ &= \int_{\mathcal{D}} L(f_{i-1}(x), t_{i-1}) P_{\mathcal{D}}(x, t_{i-1}) dx dt_{i-1} \\ &+ \int_{\mathcal{D}} L(f_i(x), t_i) P_{\mathcal{D}}(x, t_i) dx dt_i, \end{aligned} \quad (7)$

where $P_{\mathcal{D}}(\cdot, \cdot)$ is the joint distribution in inference space. On the other hand, if the model is trained in the local space $\mathcal{C}$ , then the task $\mathcal{T}i$ determines the sample distribution of the task $\mathcal{T}{i-1}$ . With this operation, the expected loss becomes the

following form:

$\begin{aligned} & E_{X, T_{i-1} \sim \mathcal{D}} [L(f_{i-1}(X), T_{i-1})] + E_{X, T_i \sim \mathcal{C}} [L(f_i(X), T_i)] \\ &= \int_{\mathcal{D}} L(f_{i-1}(x), t_{i-1}) P_{\mathcal{D}}(x, t_{i-1}) dx dt_{i-1} \\ &+ \int_{\mathcal{C}} L(f_i(x), t_i) P_{\mathcal{C}}(x, t_i) dx dt_i, \end{aligned} \quad (8)$

where $P_{\mathcal{C}}(\cdot, \cdot)$ is the joint distribution in local space. Next, the relationship between expected losses in (7) and (8) is revealed.

Theorem 2.1. If the expected losses in the inference and local spaces are defined as in (7) and (8), then, for any loss function $L(\cdot, \cdot)$ , they satisfy

$\begin{aligned} & E_{X, T_{i-1}, T_i \sim \mathcal{D}} [L(f_{i-1}(X), T_{i-1}) + L(f_i(X), T_i)] \\ &= E_{X, T_{i-1} \sim \mathcal{D}} [L(f_{i-1}(X), T_{i-1})] \\ &+ E_{X, T_i \sim \mathcal{C}} \left[ P_{\mathcal{D}}(T_{i-1} = 1) \frac{P_{\mathcal{D}}(T_i|X)}{P_{\mathcal{D}}(T_i, T_{i-1} = 1|X)} L(f_i(X), T_i) \right]. \end{aligned} \quad (9)$

Obviously, the distribution shift also exists in CTR, CVR and CTCVR estimations when they are trained in different spaces.

3 The Adaptive Pattern Extraction Multi-task Framework

The whole architecture of proposed APEM for sequential dependence multi-task learning is illustrated in Figure 3. APEM consists of two representation learning modules ASRG and PS to dynamically extract implicit and explicit feature information from a sample-wise view, and a sequential dependence task learning loss to reconstruct an unbiased task relationship on a global training space. Adaptive Sample-wise Representation Generator (ASRG) is responsible for hierarchical shared-representation learning, adopting inducing points to interact with different feature field corresponding to each input. Task Specific Adapter (PS) module cooperates with ASRG but works as a task-aware information extractor through designed task indicator and is with an independent message passing structure to better solve the task conflict. Besides those two, a sequence dependency learning loss between tasks is proposed and theoretical proved, which is able to describe the conditional dependent probability for sequential based multi-task learning from the whole training space and consequently improve the prediction result by precisely capturing the task relationship. We will elaborate ASRG and PS in section 3.1 and 3.2, and lastly discuss the relationship between sequence dependence tasks in section 3.3.Figure 3. An illustration of the overall architecture of APEM.

3.1 Adaptive Sample-wise Representation Generator

Fine-grained feature information extraction corresponding to different tasks is crucial in multi-task learning and significantly affects model performance. But feature generalization also needs to be included to balance the trade-off between tasks in terms of shared information. Based on these considerations, we propose a novel representation learning module, named Adaptive Sample-wise Representation Generator (ASRG). Besides learning generalized shared-information, we design a dynamic selector to learn the feature interaction from a sample-wise view to further separates the task-irrelevant info. The structure of ASRG is shown in Figure 4, which mainly consists of a dynamic activation layer and a feature interaction learning layer.

Dynamic Activation Layer. In recommendation scenario, input field usually contains kinds of user and item features. Given an input $\mathbf{x}$ from $F$ different feature fields, we denote $\mathbf{x}$ as the concatenation of all feature fields:

$\mathbf{x} = [x_1, x_2, \dots, x_F], \quad (10)$

where $x_i$ represents the value of the $i$ -th feature. As a commonly data preprocessing for online recommendation scenario with better generalization, we discretize numerical features $x_i$ through a Log-round operation to get an unique

Figure 4. The detail structure of Adaptive Sample-wise Representation Generator

value, and randomly initialize it with a vector of $d_f$ dimension. Thus, we obtain the input embedding for each feature field as $H = [h_1, h_2, \dots, h_F]^T$ , where $H \in \mathbb{R}^{F \times d_f}$ .

A transformation Layer is first applied to project the input embeddings into a $K$ dimension vector. The transformation layer can be any type of deep neural network structure and here we chose a standard MLP layer just for simplicity. The output $z_K$ is defined as a dynamic selector:

$z_K = \text{MLP}(H) \quad (11)$

where $z_K \in \mathbb{R}^K$ . Then, we implement a dynamic activation function $f_D$ inspired by [9] to get a sparser representation of $z_K$ , whose formulation is as follows:

$z_K = f_D(z_K) \quad (12)$

where $f_D$ is formulated as:

$f_D(z) = \begin{cases} 0, & z \leq -\frac{\gamma}{2} \\ -\frac{2}{\gamma^3} z^3 + \frac{3}{2\gamma} z + \frac{1}{2}, & -\frac{\gamma}{2} < z < \frac{\gamma}{2} \\ 1, & z \geq \frac{\gamma}{2} \end{cases} \quad (13)$

where $\gamma = \text{Max}{10 - 2e - 4 \cdot \text{step}, 1e - 3}$ and maximum $\text{step}$ during the training process is around $1e6$ . Dynamic selector $z_K$ works as a information filter which selectively interacts with input from the sample-wise view due to the Transformation Layer. As visualized in Figure 5, the output shape of $f_D$ becomes steeper with the increase of training step. Byutilizing $f_D, z_K$ creates a $K$ dimension sparse vector only contains values of 0 and 1 corresponding to each input sample.

Figure 5. Output of the dynamic activation function $f_D$ with the increase of training step.

Feature Interaction Learning Layer. Attention mechanism for learning hierarchical feature interaction is generally adopted but requires quadratic time complexity in standard self-attention structure. Here we design a learnable matrix called inducing points $I$ , enlightened from Set Transformer [12] to reduce the computational complexity from quadratic to linear. We define the inducing points $I \in \mathbb{R}^{K \times d_f}$ , where $K$ is the same as in dynamic selector $z_K$ . After an element-wise operation, we get a modified query $\hat{Q}$ as:

$\hat{Q} = I \odot z_K \quad (14)$

Then, we calculate the output $O_j$ from the attention operation according to the following formulation:

$O_j = \text{Attention}(\hat{Q}_j, K_j, V_j; \lambda) \quad (15)$

where $\hat{Q}j = I_j \odot z_K$ , $K_j = HW_j^K$ , $V_j = HW_j^V$ with trainable parameter $\lambda = {I_j, W_j^K, W_j^V}{j=1}^m$ and $m$ represents the number of multi-head. Then we get output $O$ from the multi-head attention with parameter $W^O$ as:

$O = \text{concat}(O_1, \dots, O_h) W^O \quad (16)$

Finally, the adaptive representation $Y_{ASRG}$ learned from ASRG can be formulated in a way of residual network:

$Y_{ASRG} = \text{LayerNorm}(O, H) \quad (17)$

The time complexity of feature interaction learning layer reduces from $O(F^2)$ to $O(K \times F)$ by introducing $I$ . As suggested in [18], $K$ , the reduced dimension of $I$ could be viewed as $K$ independent memory cells interacting with each feature field, which is further automatically selected by $z_k$ to distinguish feature information explicitly from a sample-wise view. Compared with traditional shared-representation learning structure in most MTL methods, ASRG learns more distinctive info in terms of a dynamic activation layer and feature interaction learning layer, which is mainly attributed to the former one combining a transformation layer and dynamic activation function to generate an adaptive mask corresponding to each input sample.

3.2 Explicit Pattern Selector

Besides effective shared-representation generated by ASRG, specific feature learning according to each task will strongly affect the model performance since it directly enhances the task-relevant information. In most MTL works, task-targeted feature extractors, such as the task-specific experts proposed by PLE are deliberately designed to learn the representation for each task. However, mutual interference between different tasks still exists since the shared and task-specific components are not completely separated in these cases.

In order to learn the task-aware information among different tasks with a more independent and thoroughly separated structure, we introduce a module named explicit Task Specific Adapters (PSs) as detail plotted in Figure 6. PS utilizes parameterized task indicator vector to interact with previous sample-wise common shared info from ASRG, which is able to extract task-specific representation by directly optimizing respective task object. The approach is similarly adopted in PAL [21] and K-adapter [27].

Figure 6. The detail structure of Pattern Selector.

As illustrated in Figure 6, we take the output $Y_{ASRG} \in \mathbb{R}^{K \times d_f}$ from the Adaptive Sample-wise Representation Generator to interact with a learnable task indicator vector $\alpha_i$ corresponding to each task $i$ . The $F_i$ is the output calculated through an attention operation between $Y_{ASRG}$ and $\alpha_i$ as:

$F_i = \text{Attention}(\alpha_i, Y_{ASRG}, Y_{ASRG}), \quad (18)$

where $\alpha_i \in \mathbb{R}^{1 \times d_f}$ is the task indicator vector and $F_i \in \mathbb{R}^{1 \times d_f}$ denotes the middle output of task-aware representation for each task $i$ correspondingly. Here, Attention is the same attention calculation operation as in formula (15). Consequently, for each task $i$ , we refer the task-specific information generated from the $k$ -th PS layer as $T_i^k$ and we calculate it also through a residual network and layer normalization for training efficiency as in (17):

$T_i^k = \text{LayerNorm}(T_i^{k-1} + F_i), \quad (19)$

where $T_i^{k-1} \in \mathbb{R}^{1 \times d_f}$ means the output from the previous PS layer for task $i$ , and $F_i^k \in \mathbb{R}^{1 \times d_f}$ is the task-aware embedding learned by the interaction between the task indicator for the$k$ -th layer with task $i$ and shared-common embedding. Note, $T_i^0$ is ignored at the first iteration.

It can be observed in Figure 6, task aware embedding $T$ obtained from Task Specific Adapters is trained independently and whose message doesn't pass into ASRG module among different layers. The proposed structure keeps the implicit (from ASRG) and explicit (from PS) representation learning modules more separated, which not only isolates the negative interference between tasks more thoroughly but also provides a extendable multi-task learning framework especially necessary in industrial implementation.

3.3 Loss Function Design Towards Sequential Dependence Multi-Task Learning

For the multi-task learning without sequential dependence, the loss function is generally designed as the following form:

$\mathcal{L}(\theta^s, \theta^1, \dots, \theta^N) = \sum_{i=1}^N \sum_{j=1}^M \frac{w_i}{M} L(f_i(x_j; \theta^s, \theta^i), o_j^i). \quad (20)$

From the loss function (20), it can be observed that this loss function cannot learn the sequential dependence relationship. As mentioned in subsection 2.1, in this paper, the constrained optimization problem can be transformed into the unconstrained case by using a penalty function. Therefore, the corresponding loss function for SDMTL is designed as

$\begin{aligned} \mathcal{L}(\theta^s, \theta^1, \dots, \theta^N) &= \mathcal{L}_{M-Task} + \mathcal{L}_{D-Task} \\ &= \sum_{i=1}^N \frac{w_i}{M} \sum_{j=1}^M L(f_i(x_j; \theta^s, \theta^i), o_j^i) \\ &\quad + \sum_{i=2}^N \frac{\sigma_{i-1}}{M} \sum_{j=1}^M L(f_{i-1}(x_j; \theta^s, \theta^{i-1}) - f_i(x_j; \theta^s, \theta^i), o_j^{i-1} - o_j^i), \end{aligned} \quad (21)$

where $\sigma_i$ is the penalty coefficients, $\mathcal{L}{M-Task}$ and $\mathcal{L}{D-Task}$ are the loss functions of the main tasks, i.e., $\mathcal{T}i$ , and the loss functions of the sequential dependence relationship, respectively. With this operation, each task and their corresponding dependence relationship can be trained separately. The loss functions of the dependence relationship $\mathcal{L}{D-Task}$ can be regarded as a regularization term. Note that the selection of negative samples determines the training space. Therefore, the positive samples of the task $\mathcal{T}_i$ are derived from the current task while the negative samples of $\mathcal{T}_i$ are derived from different tasks $\mathcal{T}i$ and $\mathcal{T}{i-1}$ . Similar to the subsection 3.2, expected losses of the dependence relationship derived from the entire space $\mathcal{D}$ and local space $\mathcal{C}$ are discussed as follows.

Theorem 3.1. If the dependence relationship is learned in the entire space $\mathcal{D}$ and the local space $\mathcal{C}$ , respectively, and the corresponding expected losses are denoted as $E_{X, T_{i-1}, T_i \sim \mathcal{D}}[L(f_{i-1}(X) -$

$f_i(X), T_{i-1} - T_i)]$ and $E_{X, T_{i-1}, T_i \sim \mathcal{C}}[L(f_{i-1}(X) - f_i(X), T_{i-1} - T_i)]$ , then these two expected losses satisfy

$\begin{aligned} &E_{X, T_{i-1}, T_i \sim \mathcal{D}}[L(f_{i-1}(X) - f_i(X), T_{i-1} - T_i)] \\ &= E_{X, T_{i-1}, T_i \sim \mathcal{C}} \left[ \frac{P_{\mathcal{D}}(T_{i-1} = 1) P_{\mathcal{D}}(T_{i-1} - T_i | X)}{P_{\mathcal{D}}(T_{i-1} - T_i, T_{i-1} = 1 | X)} \right. \\ &\quad \left. \times L(f_{i-1}(X) - f_i(X), T_{i-1} - T_i) \right] \end{aligned} \quad (22)$

4 Experiments

In this section, we describe the experiments to evaluate the performance of the proposed APEM framework, which are conducted on both public benchmark dataset and real-world industrial dataset in financial service. We also analyze the contribution of each modules consisting of APEM to further understand the working mechanism and demonstrate the effectiveness of proposed method for sequential dependence multi-task learning.

4.1 Experimental Setup

4.1.1 Datasets. Experiments are conducted on two dataset: the public benchmark Ali-CCP and an industrial dataset from the financial scenario.

• Ali-CCP dataset:¹ The public dataset Ali-CCP is used as the benchmark for model comparison with Alibaba Click and Conversion Prediction tasks. We use all the single-valued categorical features as generally adopted and randomly take 10% of the train dataset as the validation dataset for all models.
• Industrial Dataset: The industrial dataset is collected from our online recommendation platform in financial service, which describes users' clicking and conversion behavior responding to financial advertising. The dataset is divided for training, validation and test chronologically and we downsample the negative samples in the training set to keep the ratio of each set is 8:1:1.

4.1.2 Baseline Methods. To validate the effectiveness of APEM, we conduct our experiments on the following representative methods for comparison, which are SOTA multi-task learning approaches or recent sequence dependency learning method:

• Single-Task is a three-layer MLP network with hidden layer size of [256, 128, 64] for single-task optimization.
• Shared-Bottom constructs a shared bottom layer to learn the common representation across all tasks and introduces separated task tower for the object optimization respectively.

¹https://tianchi.aliyun.com/dataset/dataDetail?dataId=408Table 2. Detailed hyper-parameters settings for each dataset.

Dataset	Hyper-parameters Settings
Ali-CCP	$B = 1024, d_f = 18, M = 2, K = 64, L = 4, \lambda = 10^{-3}$
Industrial dataset	$B = 1024, d_f = 18, M = 2, K = 64, L = 4, \lambda = 10^{-3}$

• MMOE is inspired by the classic MoE method which adopts a group of shared bottom subnetworks as experts and introduces gating network assigning different tasks with distinctive weights.
• PLE generalizes CGC method and employs a progressive routing mechanism to extract and separate deeper semantic knowledge.
• AITM is a shared-bottom structure with adaptive information transfer for modeling sequential dependency among multi-step conversions.
• APEM is our proposed approach which adopts adaptive sample-wise representation generator and the explicit Pattern Selector, with dependency learning loss for the sequence dependence multi-task learning.

4.1.3 Implementation of $\mathcal{L}_{D-Task}$ . As discussed in section 3.3, $\mathcal{L}_{D-Task}$ can be regarded as a regularization, which constraints the relationship between sequential dependence tasks during training process. In this paper, we constructs a MSE (Mean-Squared Loss) as an implementation for $L$ in formula (21):

$L_{MSE} = \frac{1}{M} \sum_{j=1}^M w \cdot (y_j - \hat{y}_j)^2 \quad (23)$

where $y_j$ is label from $o_j^{i-1} - o_j^i$ and $\hat{y}j$ is the output from $f{i-1}(x_j; \theta^s, \theta^{i-1}) - f_i(x_j; \theta^s, \theta^i)$ for input $j$ . Each sample is equally treated with $w = 1$ .

4.1.4 Training Setup. In the experiments, each experiment is repeated 5 times, the average performance and the p value are both reported. We select the optimal hyper-parameters for each model in terms of grid search [13] for fair comparison. The batch size $B$ on each datasets is set as 1024 respectively during the training process. Adam optimizer [11] is applied with a learning rate $\lambda$ of 0.001. The dimension $d_f$ of input embedding layer is 18. The number of the stacked layers $L$ , the number of the attention heads $M$ and the number of the inducing points $K$ is illustrated in Table 2. The activation function of MLP in single-task modeling is ReLU.

4.2 Performance Comparison

The experimental results for all comparison methods with the evaluation metric AUC for each task are presented in Table 4. The best performance on different datasets are highlighted in boldface and underline for the best SOTA methods.

As can be observed, APEM outperforms most baseline models for each task on both datasets respectively.

The average performance on Ali-CCP dataset is poor both on CTR and CVR targets on all compared methods, which probably implies the input features are not qualified enough to express the targets or the irrelevant information affects significantly. For the latter case, task-specific feature extraction will play a key role to the prediction results in terms of filtering negative interference. As observed, APEM achieves 0.6203 and 0.6456 of AUC for CTR and CVR tasks respectively, with gains of 1.16% and 7.31% compared to the Singel-Task method. The improvements of CTR is significant with comparison to other methods but slightly poorer than PLE in the object of CVR. The result seems to be attributed to the trade-off between tasks considered by APEM and APEM. The difference improvements between CTR and CVR is smaller in APEM and suggests a more balanced optimization among tasks. The performance on the Industrial dataset of APEM obtains an considerable gain of 1.29% and 1.43% for both targets and significantly outperforms other methods. Compared with PLE, which achieves a second best result, the proposed model still gets an increase of the gain by 43% and 55% and further demonstrates its effectiveness.

4.3 Ablation Study

We conduct ablation study on different submodules in APEM in order to provide a detail analysis of its function and efficiency. The variant models of APEM consists of following structures and the notation is just for simplicity:

• APEM without ASRG: removing the dynamic activation layer in ASRG and replacing with a standard self attention operation.
• APEM without PS: removing task indicator in PS layers for all corresponding tasks.
• APEM without $\mathcal{L}_{D-Task}$ : removing the sequence dependence learning loss $\mathcal{L}_{D-Task}$ as denoted as in (21).
• APEM: complete structure of APEM.

The results of ablation study are presented in Table ?? with an evaluation metric AUC on both datasets for CTR and CVR tasks. As observed, the complete structure of APEM outperforms all other APEM-variants and we can draw the following conclusions for each submodule:

(1). Adaptive Sample-wise Representation Generator contributes to learn fine-grained and generalized shared representation for both tasks. In which, dynamic selector enables to select essential information for each sample which enhance the knowledge learning. Without the dynamic selection layer, model performance drops most for both CTR and CVR target as -0.5% and -0.57% in Industrial dataset. Fully interaction learning via a standard multi-head self-attention can't provide enough shared info. We believe that ASRG reconstruct the necessary information in an adaptive manner which not only learns the feature field interaction but filterTable 3. The performance (AUC) comparison with baselines. The Gain means the mean AUC improvement compared with Single-Task method. ** indicates that the improvement of the proposed APEM is statistically significant compared with the best baseline at a p-value < 0.01 over paired samples t-test.

Models	Ali-CPP				Industrial Dataset
Models	CTR	CVR	$Gain_{CTR}$	$Gain_{CVR}$	CTR	CVR	$Gain_{CTR}$	$Gain_{CVR}$
Single-Task	0.6089	0.6011	–	–	0.7081	0.7616	–	–
Shared-Bottom	0.6098	0.6225	0.15%	3.56%	0.7050	0.7614	-0.37%	0.11%
MMOE	0.6177	0.6223	1.45%	3.53%	0.7134	0.7673	0.82%	0.90%
PLE	0.6195	0.6355	1.74%	5.72%	0.7140	0.7675	0.90%	0.92%
AITM	0.6133	0.6391	0.72%	6.32%	0.7110	0.7667	0.47%	0.81%
ESMM	0.6193	0.6333	1.71%	5.36%	0.7154	0.7680	1.10%	0.99%
ESCM2	0.6153	0.6258	1.05%	4.11%	0.7146	0.7701	0.99%	1.26%
APEM	0.6198	0.6436**	1.79%	7.07%	0.7167**	0.7714**	1.29%	1.43%

Table 4. The performance (AUC) comparison of ablation study.

Models	Ali-CPP				Industrial Dataset
Models	CTR	CVR	$Gain_{CTR}$	$Gain_{CVR}$	CTR	CVR	$Gain_{CTR}$	$Gain_{CVR}$
APEM w/o ASRG	0.6178	0.6379	-0.32%	-0.89%	0.7131	0.7670	-0.50%	-0.57%
APEM w/o PS	0.6158	0.6382	-0.65%	-0.84%	0.7141	0.7695	-0.36%	-0.25%
APEM w/o $\mathcal{L}_{D-Task}$	0.6199	0.6319	0.02%	-1.82%	0.7160	0.7695	-0.10%	-0.25%
APEM	0.6198	0.6436	–	–	0.7167	0.7714	–	–

the noise by utilizing a group-level attention from a sample-wise view. The contribution in Ali-CPP is still obvious since whose features seems less expressive as discussed in section 4.2 and is greatly benefited by ASRG.

(2). Explicit Pattern Selectors works as a task-sensitive feature extractor, which is quite crucial in the multi-task learning to precisely extract task-relevant information for each task. It can be evidently observed that without the task attention mechanism (proposed task indicator), model performance drops dramatically and is slightly better than without ASRG in Industrial dataset but worse in Ali-CPP. It is suggested that a vanilla task-specific tower structure doesn't generate enough information during task optimizing process.

(3). The proposed sequence dependence learning loss $\mathcal{L}{D-Task}$ based on theoretical proof contributes to the model performance in terms of the additional information passing among related tasks. Although it seems less significant compared to other submodules in Industrial dataset but contributes most in the CVR task in Ali-CPP. CVR task probably depends on CTR task heavily and $\mathcal{L}{D-Task}$ modifies the biased object of original definition, which further optimizes the parameters by recovering their complete probability-dependent relationship.

4.4 Analysis of Dynamic Selector

Dynamic selector $z_K$ defined in formula (12) functions as a sparse mask generated based on the input sample, which cooperates with Inducing points $I$ interacting with feature fields selectively. We conduct several case studies of $z_K$ to provide an intuitive analysis as visualized in Figure 7.

Figure 7. An illustration of dynamic selector $z_K$ . (a). Distribution of the selection rate for different samples. (b). Plot of sample embeddings with high, middle and low selection rate is colored in green, yellow and blue respectively.

As noted in section 3.1, $z_K$ is a $K$ dimension vector only with values of 0 and 1, where 1 means interacting with implicit field group in terms of $I$ correspondingly and vice versa. We first plot the distribution of the non-zero rate (selection rate) of $z_K$ on the test samples of industrial dataset in Figure 7Table 5. The offline performances (AUC) comparison for two real-world financial scenes.

Models	Scene 1		Scene 2
Models	CTR	CVR	CTR	CVR
MMOE	0.8102	0.8034	0.8110	0.8719
APEM	0.8102	0.8072	0.8123	0.8773
Gain	-	0.47%	0.16%	0.62%

(a). It can be observed for most samples, the selection rate is between 55% and 60% and indicates that more than half the interaction groups are required for information extraction. For specific cases, some needs just less interaction groups and some needs more. We could regard this as a multi-view representations for each sample, such as different numbers of perspectives qualified enough to describe a customer's interest specially on online recommendation scenario. We further randomly plots sample embeddings with high (top 1%), mid (around 58%) and low (bottom 1%) selection rates in Figure 7 (b). As illustrated, samples with different interaction degrees show significant difference in embedding space and probably implying distinctive intentions. .

4.5 Efficiency Evaluation

In this section, we evaluate time and storage efficiency of our proposed method. We record the time cost during the training (per epoch) and inference process for APEM and other baseline models in Figure 8 (a), and their respective memory cost in Figure 8 (b). As illustrated in (a), APEM requires 1177 seconds for training an epoch on the Ali-CCP dataset with 38 millions samples, which is less efficient than other methods (295s for the best results from Shared-Bottom) but similar to the PLE (1195s). Since we generally train the model in an offline manner especially for large-scale data, relatively higher training efficiency is within tolerance. On the inference time, the essential factor considered in the online industrial application, APEM spends 47 seconds for forward propagation on the test data with 4.2 millions samples. Its deviation from top performance models like AITM and Shared-Bottom is 12 seconds and it is acceptable considering most industrial online inference situation's QPS threshold. Besides, APEM has the least parameters with 89 Mb (178 Mb for the largest model of Single-Task) as plotted in Figure 8 (b), which makes it easily deployed and portable. In a conclusion, APEM achieves a significant improvement with an appropriate computational time and advantageous storage capacity compared with other approved MTL methods.

4.6 Online A/B Performance

We implement an online A/B test between APEM and SOTA multi-task learning method of MMOE for one week. Two models are deployed on two real-world financial advertising scenarios with the objective of maximizing the CVR for the

Figure 8. Efficiency comparison among models. (a). Training and Inference time (b). Memory Cost

financial products as investment and credit loan. The offline comparison is presented in Table 5, APEM gets an improvement of 0.16% for CTR in Scene 2, 0.47% and 0.62% for CVR on each scene correspondingly. In Figure 9, we can observe the online performances of APEM compared with MMOE. As shown in the plot, APEM achieves significant and consistent improvements for both scenarios during the whole period, average increase of 9.22 % in scenario (a) and 3.76% in scenario (b) on the CVR task. The experiment result proves the efficiency and the stability of proposed method, which is qualified enough for large-scale industrial application.

5 Related Work

Multi-task Learning. Multi-task Learning (MTL) is proposed to learn the shared information among tasks to improve the model generalization and performance [2]. However, multi-task learning scenario usually suffers from the performance deterioration as negative transfer because of the complex relationship between different tasks [1, 23]. Therefore, much feature learning works in structure designing are proposed for necessary information extraction according to specific task and balancing the performances among all tasks. Cross-Stitch Network [16] use a linear combination of shared representations to learn the task-specific embeddings for each task. Based on the idea of Cross-Stitch, Sluice Network [19] is a generalized meta-architecture with more task-specific parameters by dividing each layer into task-specific and shared subspaces and achieves better performance specially for less correlated tasks. However, these approaches could not capture the sample dependence andFigure 9. Online A/B results on two real-world scenarios (a) and (b).

require more training data and less efficient for large-scale application. Inspired by the MoE [10] structure, multi-gate Mixture-of-Experts (MMoE) [14] employs an ensemble of experts submodules and gating network to combine the representation of the bottom experts to learn the task relationship while consuming less computation. Similarly, Multiple Relational Attention Network (MRAN) [32] models multiple relationships by three attention-based learning mechanism. Compared with MMoE, Progressive Layered Extraction (PLE) method [22] propose a novel MTL framework which separates task-common and task-specific parameters more explicitly and adopts a progressive separation routing mechanism to better alleviate parameter conflicts caused by complex task correlation.

Sequential Dependence Multi-task Learning. The most classical applications of the sequential dependence MTL (SDMTL) are the multi-step conversion process of customer acquisition in e-commerce, display advertising or finance systems. In general, the multi-step conversion process involves impression $\rightarrow$ click $\rightarrow$ $\dots$ $\rightarrow$ conversion, which corresponds to several estimation tasks like post-view click-through rate (CTR), post-click conversion rate (CVR) and post-view click-through & conversion rate (CTCVR) estimations and so fourth. Differing from the general MTL, there exist dependence relationships between the adjacent tasks in the SDMTL problem. For the CVR estimation problem, Entire Space Multi-task Model (ESMM) is proposed in [15] to overcome Sample Selection Bias (SSB) and Data Sparsity (DS) issues by introducing two auxiliary tasks of predicting CTR

and CTCVR. With this operation, the performance of the CVR estimation will depend heavily on the performance auxiliary tasks. As the number of steps increases in multi-step conversion path, the accumulation of performance errors becomes intolerable. Aimed at the DS problem of the CVR estimation, in [29], a novel user sequential behavior graph is established to achieve post-click behavior decomposition by inserting disjoint purchase-related deterministic action and other action into between click and conversion. Considering micro behaviors (user's interactions with items) and macro behaviors (user's interactions with specific components on the item detail page) of users, Wen et al. [28] propose a Hierarchically Modeling both Micro and Macro behaviors for CVR prediction to address SSB and DS issues by using the abundant supervisory labels from micro and macro behaviors. To models the sequential dependence among multi-step conversions, Adaptive Information Transfer Multi-task (AITM) framework with adaptive information transfer module is developed in [30] to directly predict the end-to-end conversion probabilities of each step. Besides, causal approaches have also been applied to achieve the debiasing post-click conversion rate estimation lately [6, 8, 26, 31]. However, for the sequential dependence multi-task learning problem, there is rare literature to develop a formalization description.

6 Conclusions

In this paper, we propose a sequence dependence multi-task learning framework named as Adaptive Pattern Extraction Multi-task (APEM) framework, which could selectively reconstruct implicit shared representations from a sample-wise view and extract explicit task-specific information in an more efficient way compared with common task-aware tower structure. We accomplish this by involving an Adaptive Sample-wise Representation Generator and a Pattern Selector. For the multi-task learning with dependency generally encountered in E-commerce online recommendation, we provide a detail theoretical proof about the dependent relationship from rigorous mathematical perspective. Based on our analysis, we design a dependence task learning loss to complete optimizing object in an unbiased format. The performance gains of APEM compared to several SOTA multi-task approaches on both public and real-world industrial datasets demonstrates its effectiveness and generalization characteristics. Besides, we carefully conduct ablation study, case study, efficiency evaluation and online A/B test to further analyze the contributions from different modules and its applicability for large-scale industrial scenarios.

7 Appendices

7.1 Proof for Theorem 2.1

Proof. Considering the definitions of the inference and local spaces, and their corresponding expected losses given in (7)and (8), we can obtain

$P_C(X, T_i) = P_{\mathcal{D}}(X, T_i | T_{i-1} = 1) \quad (24)$

in the sense that the joint distribution of $X$ and $T_i$ in $C$ is equivalent to, under $T_{i-1} = 1$ , the joint distribution of $X$ and $T_i$ in $\mathcal{D}$ .

According to (24) and the definition of $E_{X, T_i \sim \mathcal{D}}[L(f_i(X), T_i)]$ , the second term in the right-hand side of (9) satisfies

$\begin{aligned} & E_{X, T_i \sim C} \left[ P_{\mathcal{D}}(T_{i-1} = 1) \frac{P_{\mathcal{D}}(T_i | X)}{P_{\mathcal{D}}(T_i, T_{i-1} = 1 | X)} L(f_i(X), T_i) \right] \\ &= E_{X, T_i \sim C} \left[ \frac{P_{\mathcal{D}}(T_{i-1} = 1)}{P_{\mathcal{D}}(T_{i-1} = 1 | X, T_i)} L(f_i(X), T_i) \right] \\ &= \int_C \frac{P_{\mathcal{D}}(T_{i-1} = 1)}{P_{\mathcal{D}}(T_{i-1} = 1 | x, t_i)} L(f_i(x), t_i) P_C(x, t_i) dx dt_i \\ &= \int_{\mathcal{D}} L(f_i(x), t_i) \frac{P_{\mathcal{D}}(T_{i-1} = 1)}{P_{\mathcal{D}}(T_{i-1} = 1 | x, t_i)} P_{\mathcal{D}}(x, t_i | T_{i-1} = 1) dx dt_i \\ &= \int_{\mathcal{D}} L(f_i(x), t_i) \frac{P_{\mathcal{D}}(T_{i-1} = 1)}{P_{\mathcal{D}}(T_{i-1} = 1 | x, t_i)} \frac{P_{\mathcal{D}}(x, t_i, T_{i-1} = 1)}{P_{\mathcal{D}}(T_{i-1} = 1)} dx dt_i \\ &= \int_{\mathcal{D}} L(f_i(x), t_i) P_{\mathcal{D}}(x, t_i) dx dt_i \\ &= E_{X, T_i \sim \mathcal{D}}[L(f_i(X), T_i)]. \end{aligned} \quad (25)$

Therefore, equations (7), (8) and (25) imply that the relationship (9) holds. $\square$

7.2 Proof for Theorem 3.1

Proof. According to Bayes' theorem, we can obtain the following equality:

$\frac{P_{\mathcal{D}}(T_{i-1} = 1) P_{\mathcal{D}}(T_{i-1} - T_i | X)}{P_{\mathcal{D}}(T_{i-1} - T_i, T_{i-1} = 1 | X)} = \frac{P_{\mathcal{D}}(T_{i-1} = 1)}{P_{\mathcal{D}}(T_{i-1} = 1 | X, T_{i-1} - T_i)}. \quad (26)$

Considering the right-hand side of (22) and (26), it leads to

$\begin{aligned} & E_{X, T_{i-1}, T_i \sim C} \left[ \frac{P_{\mathcal{D}}(T_{i-1} = 1) L(f_{i-1}(X) - f_i(X), T_{i-1} - T_i)}{P_{\mathcal{D}}(T_{i-1} = 1 | X, T_{i-1} - T_i)} \right] \\ &= \int_C \left[ \frac{P_{\mathcal{D}}(T_{i-1} = 1)}{P_{\mathcal{D}}(T_{i-1} = 1 | x, t_{i-1} - t_i)} L(f_{i-1}(x) - f_i(x), t_{i-1} - t_i) \right. \\ & \quad \left. P_C(x, t_{i-1} - t_i) \right] dx dt_i dt_i \\ &= \int_{\mathcal{D}} \left[ \frac{P_{\mathcal{D}}(T_{i-1} = 1)}{P_{\mathcal{D}}(T_{i-1} = 1 | x, t_{i-1} - t_i)} P_{\mathcal{D}}(x, t_{i-1} - t_i | T_{i-1} = 1) \right. \\ & \quad \left. \times L(f_{i-1}(x) - f_i(x), t_{i-1} - t_i) \right] dx dt_{i-1} dt_i \\ &= \int_{\mathcal{D}} L(f_{i-1}(x) - f_i(x), t_{i-1} - t_i) P_{\mathcal{D}}(x, t_{i-1} - t_i) dx dt_{i-1} dt_i \\ &= E_{X, T_{i-1}, T_i \sim \mathcal{D}}[L(f_{i-1}(X) - f_i(X), T_{i-1} - T_i)], \end{aligned} \quad (27)$

which implies that (22) holds. The proof is completed. $\square$

References

[1] Jonathan Baxter. 1997. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine learning 28, 1 (1997), 7–39.
[2] Rich Caruana. 1997. Multitask learning. Machine learning 28, 1 (1997), 41–75.
[3] Ling Chen, Donghui Chen, Fan Yang, and Jianling Sun. 2021. Neural episodic control. In A deep multi-task representation learning method for time series classification and retrieval. Information Sciences, 17–32.
[4] Michael Crawshaw. 2020. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796 (2020).
[5] Hongliang Fei, Jingyuan Zhang, Xingxuan Zhou, Junhao Zhao, Xinyang Qi, and Ping Li. 2021. GemNN: gating-enhanced multi-task neural networks with feature interaction learning for CTR prediction. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2166–2171.
[6] Tiankai Gu, Kun Kuang, Hong Zhu, Jingjie Li, Zhenhua Dong, Wenjie Hu, Zhenguo Li, Xiuqiang He, and Yue Liu. 2021. Estimating true post-click conversion via group-stratified counterfactual inference.
[7] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[8] Siyuan Guo, Lixin Zou, Yiding Liu, Wenwen Ye, Suqi Cheng, Shuaiqiang Wang, Hechang Chen, Dawei Yin, and Yi Chang. 2021. Enhanced doubly robust learning for debiasing post-click conversion rate estimation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 275–284.
[9] H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathiamoorthy, and E. H. Chi. 2021. DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning. (2021).
[10] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation 3, 1 (1991), 79–87.
[11] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[12] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning. PMLR, 3744–3753.
[13] PM Lerman. 1980. Fitting segmented regression models by grid search. Journal of the Royal Statistical Society: Series C (Applied Statistics) 29, 1 (1980), 77–84.
[14] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1930–1939.
[15] Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1137–1140.
[16] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3994–4003.
[17] Conor O'Brien, Kin Sum Liu, James Neufeld, Rafael Barreto, and Jonathan J Hunt. 2021. An Analysis Of Entire Space Multi-Task Models For Post-Click Conversion Prediction. In Fifteenth ACM Conference on Recommender Systems. 613–619.
[18] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, andCharles Blundell. 2017. Neural episodic control. In International Conference on Machine Learning. PMLR, 2827–2836.

[19] Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2019. Latent multi-task architecture learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4822–4829.

[20] Jiayi Shen, Xiantong Zhen, Marcel Worring, and Ling Shao. 2021. Variational multi-task learning with gumbel-softmax priors. Advances in Neural Information Processing Systems 34 (2021), 21031–21042.

[21] Asa Cooper Stickland and Iain Murray. 2019. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. In International Conference on Machine Learning. PMLR, 5986–5995.

[22] Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (PLE): A novel multi-task learning (MTL) model for personalized recommendations. In Fourteenth ACM Conference on Recommender Systems. 269–278.

[23] Partoo Vafaeikia, Khashayar Namdar, and Farzad Khalvati. 2020. A Brief Review of Deep Multi-task Learning and Auxiliary Task Learning. arXiv preprint arXiv:2007.01126 (2020).

[24] Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, Dengxin Dai, and Luc Van Gool. 2020. Revisiting multi-task learning in the deep learning era. arXiv preprint arXiv:2004.13379 2 (2020).

[25] Fangye Wang, Yingxu Wang, Dongsheng Li, Hansu Gu, Tun Lu, Peng Zhang, and Ning Gu. 2022. Enhancing CTR Prediction with Context-Aware Feature Representation Learning. arXiv preprint arXiv:2204.08758 (2022).

[26] Hao Wang, Tai-Wei Chang, Tianqiao Liu, Jianmin Huang, Zhichao Chen, Chao Yu, Ruopeng Li, and Wei Chu. 2022. ESCM²: Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation. arXiv preprint arXiv:2204.05125 (2022).

[27] Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Guihong Cao, Daxin Jiang, Ming Zhou, et al. 2020. K-adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808 (2020).

[28] Hong Wen, Jing Zhang, Fuyu Lv, Wentian Bao, Tianyi Wang, and Zulong Chen. 2021. Hierarchically modeling micro and macro behaviors via multi-task learning for conversion rate prediction. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2187–2191.

[29] Hong Wen, Jing Zhang, Yuan Wang, Fuyu Lv, Wentian Bao, Quan Lin, and Keping Yang. 2020. Entire space multi-task modeling via post-click behavior decomposition for conversion rate prediction. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 2377–2386.

[30] Dongbo Xi, Zhen Chen, Peng Yan, Yinger Zhang, Yongchun Zhu, Fuzhen Zhuang, and Yu Chen. 2021. Modeling the sequential dependence among audience multi-step conversions with multi-task learning in targeted display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3745–3755.

[31] Wenhao Zhang, Wentian Bao, Xiao-Yang Liu, Keping Yang, Quan Lin, Hong Wen, and Ramin Ramezani. 2020. Large-scale causal approaches to debiasing post-click conversion rate estimation with multi-task learning. In Proceedings of The Web Conference 2020. 2775–2781.

[32] Jiejie Zhao, Bowen Du, Leilei Sun, Fuzhen Zhuang, Weifeng Lv, and Hui Xiong. 2019. Multiple relational attention network for multi-task learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1123–1131.

Buckets:

huggingchat
/

papers-content