1 Introduction
Graph neural networks (GNNs)
[18, 40, 42, 12] are gaining increasing attention in the realm of graph representation learning. By generally following a recursive neighborhood aggregation scheme, GNNs have shown impressive representational power in various domains, such as point clouds [34], social networks [7], and chemical analysis [6]. Most existing GNN models are trained in an end-to-end supervised fashion, which relies on a high volume of fine-annotated data. However, labeling graph data requires a huge amount of effort from professional annotators with domain knowledge. To alleviate this issue, GAE [19] and GraphSAGE [12] exploit a naive unsupervised pretraining strategy that reconstructs the vertex adjacency information. Some recent works [16, 46] introduce self-supervised pretraining strategies to GNNs, which further improve generalization performance. More recently, with developments in contrastive multi-view learning in computer vision
[15, 3, 37] and natural language processing
[44, 24], some self-supervised pretraining approaches perform as well as (or even better than) supervised methods. In general, contrastive methods generate training views using data augmentations: views of the same input (positive pairs) are pulled together in the representation space, while views of different inputs (negative pairs) are pushed apart. To work on graphs, DGI [41] treats the graph-level and node-level representations of the same graph as positive pairs, pursuing consistent representations from local and global features. CMRLG [13] achieves a similar goal by grouping the adjacency matrix (local features) and its diffusion matrix (global features) as positive pairs. GCA [49] generates positive view pairs through subgraph sampling with structural priors and randomly masked node attributes. GraphCL [45] offers even more augmentation strategies, such as node dropping and edge perturbation. While the above attempts incorporate contrastive learning into graphs, they usually fail to generate views that respect the semantics of the original graphs, or to adapt their augmentation policies to specific graph learning tasks.
Blessed by the invariance of image semantics under various transformations, image data augmentation has been widely used [5] to generate contrastive views. However, graph data augmentation can be ineffective here, as transformations of a graph may severely disrupt its semantics and properties for learning. Meanwhile, InfoMin [38] improves contrastive learning for vision tasks and proposes to replace image data augmentation with a flow-based generative model for contrastive view generation. Thus, learning a probability distribution of contrastive views conditioned on an input graph might be an alternative to simple data augmentation for graph contrastive learning, but it still requires nontrivial effort, as the performance and scalability of common graph generative models are poor in real-world scenarios.
In this work, we propose a learnable graph view generation method, namely AutoGCL, to address the above issues via learning a probability distribution over node-level augmentations. While conventional predefined view generation methods, such as random dropout or graph node masking, may inevitably change the semantic labels of graphs and ultimately hurt contrastive learning, AutoGCL adapts to the input graph so that it can well preserve the graph's semantic label. In addition, thanks to the Gumbel-Softmax trick [17], AutoGCL is end-to-end differentiable while providing sufficient variance for contrastive sample generation. We further propose a joint training strategy to train the learnable view generators, the graph encoder, and the classifier in an end-to-end manner. The strategy combines a view similarity loss, a contrastive loss, and a classification loss, and it drives the view generators to produce augmented graphs that carry similar semantic information but different topological properties. In Table 1, we summarize the properties of existing graph augmentation methods, where AutoGCL dominates the comparison.
We conduct extensive graph classification experiments on semi-supervised learning, unsupervised learning, and transfer learning tasks to evaluate the effectiveness of AutoGCL. The results show that AutoGCL improves the state-of-the-art graph contrastive learning performance on most of the datasets. In addition, we visualize the generated graphs on the MNIST-Superpixel dataset
[27] and reveal that AutoGCL preserves the semantic structures of the input data better than existing predefined view generators.

Our contributions can be summarized as follows.

We propose a graph contrastive learning framework with learnable graph view generators embedded into an automated augmentation strategy. To the best of our knowledge, this is the first work that builds learnable generative augmentation policies for graph contrastive learning.

We propose a joint training strategy for training the graph view generators, the graph encoder, and the graph classifier in an end-to-end manner under the graph contrastive learning setting.

We extensively evaluate the proposed method on a variety of graph classification datasets under semi-supervised, unsupervised, and transfer learning settings. The t-SNE and view visualization results also demonstrate the effectiveness of our method.
2 Related Work
2.1 Graph Neural Networks
Denote a graph as $G = (V, E)$, where the node features are $x_v$ for $v \in V$. In this paper, we focus on the graph classification task using Graph Neural Networks (GNNs). GNNs generate node-level embeddings by aggregating the node features of each node's neighbors. Each GNN layer serves as one iteration of aggregation, such that the node embedding after the $k$-th layer aggregates the information within the node's $k$-hop neighborhood. The $k$-th layer of a GNN can be formulated as
(1) $a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\big)$
(2) $h_v^{(k)} = \mathrm{COMBINE}^{(k)}\big(h_v^{(k-1)}, a_v^{(k)}\big)$
For downstream tasks such as graph classification, the graph-level representation $h_G$ is obtained via a READOUT function followed by MLP layers:
(3) $h_G = \mathrm{READOUT}\big(\{h_v^{(K)} : v \in V\}\big)$
(4) $z_G = \mathrm{MLP}(h_G)$
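As a concrete illustration, Eqs. (1)-(4) can be sketched in plain Python under simplifying assumptions: scalar node features, sum aggregation in the GIN style, a sum-pooling readout, and no MLP head. The helper names `gnn_layer` and `readout` are ours, not the paper's.

```python
# Minimal sketch of one GNN aggregation layer and the graph-level readout.
# Assumptions (ours): scalar features, GIN-style sum aggregation, sum readout.
def gnn_layer(h, neighbors, eps=0.0):
    """One aggregation step: h'_v = (1 + eps) * h_v + sum of neighbor states."""
    out = {}
    for v, hv in h.items():
        agg = sum(h[u] for u in neighbors[v])   # AGGREGATE over N(v)
        out[v] = (1.0 + eps) * hv + agg         # COMBINE (GIN-style)
    return out

def readout(h):
    """Graph-level representation: sum-pool over all node embeddings."""
    return sum(h.values())

# Toy graph: a 3-node path 0-1-2 with scalar node features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h = {0: 1.0, 1: 2.0, 2: 3.0}
h = gnn_layer(h, neighbors)   # each node absorbs its 1-hop neighborhood
print(readout(h))             # 14.0
```

Stacking `gnn_layer` $K$ times gives each node a view of its $K$-hop neighborhood, which is exactly the recursion the equations above describe.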
In this work, we follow the existing graph contrastive learning literature and employ two state-of-the-art GNNs, i.e., GIN [42] and ResGCN [2], as our backbone GNNs.
2.2 Pretraining Graph Neural Networks
Pretraining GNNs on graph datasets remains a challenging task, since the semantics of graphs are not straightforward, and the annotation of graphs (proteins, chemicals, etc.) usually requires professional domain knowledge. It is very costly to collect large-scale and fine-annotated graph datasets like ImageNet
[20]. An alternative way is to pretrain the GNNs in an unsupervised manner. GAE [19] first explored unsupervised GNN pretraining by reconstructing the graph topological structure. GraphSAGE [12] proposed an inductive way of unsupervised node embedding by learning the neighborhood aggregation function. PretrainGNN [16] conducted the first systematic large-scale investigation of strategies for pretraining GNNs under the transfer learning setting, proposing self-supervised pretraining strategies to learn both the local and global features of graphs. However, the benefits of graph transfer learning may be limited and can lead to negative transfer [30], as graphs from different domains differ greatly in their structures, scales, and node/edge attributes. Therefore, many of the following works explored an alternative approach, i.e., contrastive learning, for GNN pretraining.
2.3 Contrastive Learning
In recent years, contrastive learning (CL) has received considerable attention among self-supervised learning approaches, and a series of CL methods including SimCLR [3] and MoCo v2 [4] even outperform supervised baselines. By minimizing the contrastive loss [11], the views generated from the same input (i.e., positive view pairs) are pulled close in the representation space, while the views of different inputs (i.e., negative view pairs) are pushed apart. Most existing CL methods [15, 47, 3, 9] generate views using data augmentation, which is still challenging and under-explored for graph data. Instead of data augmentation, DGI [41] treated the graph-level and node-level representations of the same graph as positive view pairs. CMRLG [13] achieved an analogous goal by treating the adjacency matrix (local features) and the diffusion matrix (global features) as positive pairs. More recently, the GraphCL framework [45] employed four types of graph augmentations, including node dropping, edge perturbation, subgraph sampling, and node attribute masking, enabling the most diverse augmentations so far for graph view generation. GCA [49] used subgraph sampling and node attribute masking as augmentations and introduced a prior augmentation probability based on node centrality measures, enabling more adaptiveness than GraphCL [45]. However, these graph augmentation methods are not label-preserving. Moreover, the augmentation intensity needs to be manually tuned, and the augmentation policy is not adaptive to different tasks. In this work, we propose to learn the optimal augmentation policy from the graph data.
2.4 Learnable Data Augmentation
As mentioned above, data augmentation is a significant component of CL. The existing literature [3, 45] has revealed that the optimal augmentation policies are task-dependent and that the choice of augmentations makes a considerable difference to CL performance. Researchers have explored automatically discovering the optimal policy for image augmentations in computer vision. For instance, AutoAugment [5]
first optimized the combination of augmentation functions through reinforcement learning. FasterAA
[14] and DADA [22] proposed differentiable augmentation optimization frameworks in the DARTS [23] style. However, learnable data augmentation methods are barely explored for CL, except for the InfoMin framework [38], which argues that good views for CL should maintain the label information while minimizing the mutual information of positive view pairs. InfoMin employs a flow-based generative model as the view generator for data augmentation and trains the view generator in a semi-supervised manner. However, transferring this idea to graph CL is a nontrivial task, since current graph generative models either have limited generation quality [19] or are designed for specific tasks such as molecular data [6, 25]. To overcome this issue, in this work we build a learnable graph view generator that learns a probability distribution over node-level augmentations. Compared to the existing graph CL methods, our method well preserves the semantic structures of the original graphs. Moreover, it is end-to-end differentiable and can be efficiently trained.
3 Methodology
3.1 What Makes a Good Graph View Generator?
Our goal is to design a learnable graph view generator that learns to generate augmented graph views in a data-driven manner. Although various graph data augmentation methods have been proposed, there has been little discussion of what makes a good graph view generator. From our perspective, an ideal graph view generator for data augmentation and contrastive learning should satisfy the following properties:

It supports augmentation of both the graph topology and the node features.

It is labelpreserving, i.e., the augmented graph should maintain the semantic information in the original graph.

It is adaptive to different data distributions and scalable to large graphs.

It provides sufficient variances for contrastive multiview pretraining.

It is endtoend differentiable and efficient enough for fast gradient computation via backpropagation (BP).
Here we provide an overview of the augmentation methods proposed in the existing graph contrastive learning literature in Table 1. CMRLG [13] applies a diffusion kernel to the adjacency matrix to obtain different topological structures. GRACE [48] uses random edge dropping and node attribute masking (randomly masking the attributes of a certain ratio of nodes). GCA [49] uses node dropping and node attribute masking along with a structural prior. Among all the previous works, GraphCL [45] enables the most flexible set of graph data augmentations so far, as it includes node dropping, edge perturbation (randomly replacing a certain ratio of edges with random edges), subgraph sampling (randomly selecting a connected subgraph of a certain size), and attribute masking. We provide a detailed ablation study and analysis of GraphCL augmentations with different augmentation ratios in Section 1.1 of the supplementary.
In this work, we propose a learnable view generator to address all the above issues. Our view generator includes both node dropping and attribute masking, but it is much more flexible, since the two augmentations can be employed simultaneously in a node-wise manner, without the need to tune an "aug ratio". Besides model performance, another reason for not incorporating edge perturbation in our view generator is that generating edges with learnable methods (e.g., VGAE [19]) requires predicting the full adjacency matrix, which contains $|V|^2$ elements, a heavy burden for backpropagation when dealing with large-scale graphs.
3.2 Learnable Graph View Generator
Figure 1 illustrates the scheme of our proposed learnable graph view generator. We use GIN [42] layers to obtain node embeddings from the node attributes. For each node, we use the embedded node feature to predict the probability of selecting a certain augmentation operation. The augmentation pool for each node is drop, keep, and mask. We employ the gumbel-softmax [17] to sample from these probabilities and assign an augmentation operation to each node. Formally, if we use $K$ GIN layers as the embedding layers, we denote $h_v^{(k)}$ as the hidden state of node $v$ at the $k$-th layer. For node $v$, we have the node feature $x_v$, the augmentation choice $c_v$, and the function $f_{\mathrm{aug}}$ for applying the augmentation. The augmented feature $x'_v$ of node $v$ is then obtained via
(5) $h_v^{(k)} = \mathrm{GIN}^{(k)}\big(h_v^{(k-1)}, \{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\big)$
(6) $p_v = \mathrm{softmax}\big(h_v^{(K)}\big)$
(7) $c_v = \mathrm{gumbel\mbox{-}softmax}(p_v)$
(8) $x'_v = f_{\mathrm{aug}}(x_v, c_v)$
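The per-node sampling step in Eqs. (5)-(8) can be sketched as follows. This is a minimal, library-free approximation: we assume given per-node logits in place of the GIN embedding head, and we use a plain (not straight-through) Gumbel-Softmax, so it illustrates the sampling behavior only, not the gradient path. The names `AUGS` and `apply_aug` are our own, hypothetical helpers.

```python
import math, random

random.seed(0)
AUGS = ["drop", "keep", "mask"]  # per-node augmentation pool

def gumbel_softmax(logits, tau=1.0):
    """Add Gumbel noise to the logits and apply a temperature-scaled softmax."""
    g = [-math.log(-math.log(random.random())) for _ in logits]
    z = [(l + n) / tau for l, n in zip(logits, g)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def apply_aug(x, choice):
    """Apply the sampled choice to a node feature (hypothetical convention)."""
    if choice == "drop":
        return None          # node (and its incident edges) leave the view
    if choice == "mask":
        return 0.0           # attribute replaced by a mask token
    return x                 # "keep": feature left unchanged

# Per-node logits, standing in for the output of the GIN embedding head.
node_logits = {0: [0.1, 2.0, 0.3], 1: [1.5, 0.2, 0.2]}
choices = {}
for v, logits in node_logits.items():
    probs = gumbel_softmax(logits)          # one noisy softmax per node
    choices[v] = AUGS[probs.index(max(probs))]
features = {0: 0.7, 1: -1.2}
augmented = {v: apply_aug(features[v], c) for v, c in choices.items()}
```

In the real model the soft probabilities, rather than a hard argmax, are what carry the gradients back into the generator's weights.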
The dimension of the last layer $K$ is set to the number of possible augmentations for each node, so $p_v$ denotes the probability distribution for selecting each kind of augmentation. $c_v$ is a one-hot vector sampled from this distribution via gumbel-softmax
[17], and it is differentiable thanks to the reparameterization trick. The augmentation applying function $f_{\mathrm{aug}}$ combines the node attribute $x_v$ and $c_v$ using differentiable operations (e.g., multiplication), so the gradients of the view generator's weights are kept in the augmented node features and can be computed using backpropagation. For the augmented graph, the edge table is updated using $c_v$ for all $v \in V$, where the edges connected to any dropped node are removed. As the edge table only guides node feature aggregation and does not participate in the gradient computation, it does not need to be updated in a differentiable manner. Therefore, our view generator is end-to-end differentiable. The GIN embedding layers and the gumbel-softmax can be efficiently scaled up to larger graph datasets and more augmentation choices.
3.3 Contrastive Pretraining Strategy
Since contrastive learning requires multiple views to form a positive view pair, our framework has two view generators and one classifier. According to the InfoMin principle [38], a good positive view pair for contrastive learning should maximize the label-related information while minimizing the mutual information (similarity) between the views. To achieve this, our framework uses two separate graph view generators and trains them and the classifier jointly.
3.3.1 Loss Function Definition
Here we define three loss functions: the contrastive loss $\mathcal{L}_{cl}$, the similarity loss $\mathcal{L}_{sim}$, and the classification loss $\mathcal{L}_{cls}$. For the contrastive loss, we follow previous works [3, 45] and use the normalized temperature-scaled cross-entropy loss (NT-Xent) [35]. We formulate the similarity function as
(9) $\mathrm{sim}(z_i, z_j) = \dfrac{z_i^\top z_j}{\|z_i\|\,\|z_j\|}$
Suppose we have a data batch of $N$ graphs. We pass the batch through the two view generators to obtain $2N$ graph views. We regard the two augmented views from the same input graph as a positive view pair and use $\mathbb{1}_{[k \neq i]}$ to denote the indicator function. Denoting the contrastive loss for a positive pair of samples $(i, j)$ as $\ell(i, j)$, the contrastive loss of the data batch as $\mathcal{L}_{cl}$, and the temperature parameter as $\tau$, we have
(10) $\ell(i, j) = -\log \dfrac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$
(11) $\mathcal{L}_{cl} = \dfrac{1}{2N} \sum_{k=1}^{N} \big[\ell(2k{-}1, 2k) + \ell(2k, 2k{-}1)\big]$
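A plain-Python sketch of the NT-Xent loss in Eqs. (10)-(11). We assume the views are interleaved so that (z[2k], z[2k+1]) is the k-th positive pair; that indexing convention is ours.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors, as in Eq. (9)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N views; (z[2k], z[2k+1]) is the k-th positive pair."""
    n2 = len(z)
    def ell(i, j):
        num = math.exp(cos_sim(z[i], z[j]) / tau)
        den = sum(math.exp(cos_sim(z[i], z[k]) / tau)
                  for k in range(n2) if k != i)      # indicator 1_[k != i]
        return -math.log(num / den)
    return sum(ell(2 * k, 2 * k + 1) + ell(2 * k + 1, 2 * k)
               for k in range(n2 // 2)) / n2

# Two graphs, two views each; matched views point in similar directions.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = nt_xent(views)
# Mismatching the pairs (same vectors, wrong partners) increases the loss.
views_bad = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
```

The loss is minimized when each view is close to its own positive partner and far from every other view in the batch.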
The similarity loss is used to minimize the mutual information between the views generated by the two view generators. During the view generation process, we have a sampled state matrix indicating each node's corresponding augmentation operation (see Figure 1). For a graph $G$, we denote the sampled augmentation choice matrices of the two view generators as $S^{(1)}$ and $S^{(2)}$; then we formulate the similarity loss as
(12) $\mathcal{L}_{sim} = \mathrm{sim}\big(S^{(1)}, S^{(2)}\big)$
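Assuming the similarity loss is the cosine similarity between the two generators' flattened one-hot augmentation-choice matrices (our reading of Eq. (12)), a minimal sketch:

```python
import math

def flatten(m):
    return [x for row in m for x in row]

def matrix_cos_sim(s1, s2):
    """Cosine similarity between flattened augmentation-choice matrices;
    minimizing it pushes the two generators toward different choices."""
    a, b = flatten(s1), flatten(s2)
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# One-hot rows: per-node choice over (drop, keep, mask).
s1 = [[0, 1, 0], [1, 0, 0]]   # generator 1: keep node 0, drop node 1
s2 = [[0, 1, 0], [0, 0, 1]]   # generator 2: keep node 0, mask node 1
print(matrix_cos_sim(s1, s2))  # the generators agree on one of two nodes
```

A value of 1 means both generators made identical choices on every node; driving this toward 0 yields maximally different views.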
Finally, for the classification loss, we directly use the cross-entropy loss (CE). For a graph sample $G$ with class label $y$, we denote the augmented views as $G'_1$ and $G'_2$ and the classifier as $F$. Then the classification loss is formulated as
(13) $\mathcal{L}_{cls} = \mathrm{CE}\big(F(G'_1), y\big) + \mathrm{CE}\big(F(G'_2), y\big)$
$\mathcal{L}_{cls}$ is employed in the semi-supervised pretraining task to encourage the view generators to generate label-preserving augmentations.
3.3.2 Naive Training Strategy
For unsupervised learning and transfer learning tasks, we use a naive training strategy (naive-strategy). Since the labels are unknown in the pretraining stage, $\mathcal{L}_{sim}$ is not used: it does not make sense to encourage the views to differ without keeping the label-related information, which could lead to useless or even harmful view samples. We simply train the view generators and the classifier jointly to minimize $\mathcal{L}_{cl}$ in the pretraining stage.
We also note that the quality of the generated views will not be as good as the original data. Therefore, instead of only minimizing $\mathcal{L}_{cl}$ between the two augmented views as in GraphCL [45], we also make use of the original data: by pulling the original data and the augmented views close in the embedding space, the view generators are more likely to preserve the label-related information. The details of the naive training strategy are described in Algorithm 1.
3.3.3 Joint Training Strategy
For semi-supervised learning tasks, we propose a joint training strategy that performs contrastive training and supervised training alternately. This strategy generates label-preserving augmentations and outperforms the naive-strategy; the experimental results and detailed analysis are shown in Section 4.1.3 and Section 4.3.
For the joint-strategy, during the unsupervised training stage, we fix the view generators and train the classifier by contrastive learning using unlabeled data. During the supervised training stage, we jointly train the view generators with the classifier using labeled data. By simultaneously optimizing $\mathcal{L}_{cls}$ and $\mathcal{L}_{sim}$, the two view generators are encouraged to generate label-preserving augmentations that are nevertheless different enough from each other. The unsupervised and supervised training stages are repeated alternately. This is very different from previous graph contrastive learning methods: previous work like GraphCL [45] uses a pretraining/fine-tuning strategy, which first minimizes the contrastive loss ($\mathcal{L}_{cl}$) until convergence using the unlabeled data and then fine-tunes with the labeled data.
However, we found that for graph contrastive learning, the pretraining/fine-tuning strategy is more likely to cause overfitting in the fine-tuning stage, and minimizing $\mathcal{L}_{cl}$ too much may have a negative effect on the fine-tuning stage (see Section 4.3). We speculate that excessively minimizing $\mathcal{L}_{cl}$ pushes data points near the decision boundary too close to each other, making it more difficult for the classifier to separate them: no matter how well we train the GNN classifier, there are still misclassified samples due to the natural overlap between the data distributions of different classes, yet in the contrastive pretraining stage the classifier is not aware of whether the samples being pulled together really come from the same class.
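The alternation described above can be summarized as a control-flow sketch. The two step functions are stand-in stubs that only record which losses would be optimized in each stage; the real stages run gradient updates on the actual models.

```python
# High-level sketch of the alternating joint-training schedule (our stubs).
def contrastive_step(classifier_log):
    # Unsupervised stage: generators fixed, classifier trained on L_cl.
    return classifier_log + ["cl"]

def supervised_step(generator_log, classifier_log):
    # Supervised stage: generators trained on L_cls + L_sim,
    # classifier trained on L_cls, using the labeled subset.
    return generator_log + ["cls+sim"], classifier_log + ["cls"]

gen, clf = [], []
for epoch in range(3):            # the two stages alternate every epoch
    clf = contrastive_step(clf)
    gen, clf = supervised_step(gen, clf)
print(gen, clf)
```

The point of the schedule is that the generators are only ever updated with label supervision, which is what keeps their augmentations label-preserving.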
Model | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M5K
--- | --- | --- | --- | --- | --- | --- | --- | ---
GL | 81.66±2.11 | – | – | – | – | 65.87±0.98 | 77.34±0.18 | 41.01±0.17
WL | 80.72±3.00 | 72.92±0.56 | – | 80.01±0.50 | – | 72.30±3.44 | 68.82±0.41 | 46.06±0.21
DGK | 87.44±2.72 | 73.30±0.82 | – | 80.31±0.46 | – | 66.96±0.56 | 78.04±0.39 | 41.27±0.18
node2vec | 72.63±10.20 | 57.49±3.57 | – | 54.89±1.61 | – | – | – | –
sub2vec | 61.05±15.80 | 53.03±5.55 | – | 52.84±1.47 | – | 55.26±1.54 | 71.48±0.41 | 36.68±0.42
graph2vec | 83.15±9.25 | 73.30±2.05 | – | 73.22±1.81 | – | 71.10±0.54 | 75.78±1.03 | 47.86±0.26
InfoGraph | 89.01±1.13 | 74.44±0.31 | 72.85±1.78 | 76.20±1.06 | 70.65±1.13 | 73.03±0.87 | 82.50±1.42 | 53.46±1.03
GraphCL | 86.80±1.34 | 74.39±0.45 | 78.62±0.40 | 77.87±0.41 | 71.36±1.15 | 71.14±0.44 | 89.53±0.84 | 55.99±0.28
Ours | 88.64±1.08 | 75.80±0.36 | 77.57±0.60 | 82.00±0.29 | 70.12±0.68 | 73.30±0.40 | 88.58±1.49 | 56.75±0.18
Model | BBBP | Tox21 | ToxCast | SIDER | ClinTox | MUV | HIV | BACE
--- | --- | --- | --- | --- | --- | --- | --- | ---
No Pretrain | 65.8±4.5 | 74.0±0.8 | 63.4±0.6 | 57.3±1.6 | 58.0±4.4 | 71.8±2.5 | 75.3±1.9 | 70.1±5.4
Infomax | 68.8±0.8 | 75.3±0.5 | 62.7±0.4 | 58.4±0.8 | 69.9±3.0 | 75.3±2.5 | 76.0±0.7 | 75.9±1.6
EdgePred | 67.3±2.4 | 76.0±0.6 | 64.1±0.6 | 60.4±0.7 | 64.1±3.7 | 74.1±2.1 | 76.3±1.0 | 79.9±0.9
AttrMasking | 64.3±2.8 | 76.7±0.4 | 64.2±0.5 | 61.0±0.7 | 71.8±4.1 | 74.7±1.4 | 77.2±1.1 | 79.3±1.6
ContextPred | 68.0±2.0 | 75.7±0.7 | 63.9±0.6 | 60.9±0.6 | 65.9±3.8 | 75.8±1.7 | 77.3±1.0 | 79.6±1.2
GraphCL | 69.68±0.67 | 73.87±0.66 | 62.40±0.57 | 60.53±0.88 | 75.99±2.65 | 69.80±2.66 | 78.47±1.22 | 75.38±1.44
Ours | 73.36±0.77 | 75.69±0.29 | 63.47±0.38 | 62.51±0.63 | 80.99±3.38 | 75.83±1.30 | 78.35±0.64 | 83.26±1.13
Therefore, we propose a new semi-supervised training strategy, namely the joint-strategy, which alternately minimizes $\mathcal{L}_{cl}$ and $\mathcal{L}_{cls} + \mathcal{L}_{sim}$. Minimizing $\mathcal{L}_{sim}$ is inspired by InfoMin [38], so as to make the two view generators keep label-related information while sharing less mutual information. However, since only a small portion of labeled data is available to train the view generators, it is still beneficial to use the original data, just like in the naive-strategy. Interestingly, since $\mathcal{L}_{cls}$ and $\mathcal{L}_{sim}$ are minimized simultaneously, a weight $\lambda$ can be applied to better balance the optimization, but we found that setting $\lambda = 1$ works well in the experiments in Section 4.1. The detailed training strategy is described in Algorithm 2, and an overview of our whole framework is shown in Figure 2.
4 Experiment
4.1 Comparison with StateoftheArt Methods
4.1.1 Unsupervised Learning
For the unsupervised graph classification task, we contrastively train a representation model using unlabeled data, then fix the representation model and train the classifier using labeled data. Following GraphCL [45], we use a 5-layer GIN with a hidden size of 128 as our representation model and an SVM as our classifier. We train the GIN with a batch size of 128 and a learning rate of 0.001, with 30 epochs of contrastive pretraining under the naive-strategy. We perform 10-fold cross-validation on every dataset. For each fold, we employ 90% of the total data as unlabeled data for contrastive pretraining and 10% as labeled testing data. We repeat every experiment 5 times using different random seeds.
We compare with kernel-based methods such as the graphlet kernel (GL) [33], the Weisfeiler-Lehman subtree kernel (WL) [32], and the deep graph kernel (DGK) [43]; other unsupervised graph representation methods such as node2vec [10], sub2vec [1], and graph2vec [29]; and contrastive learning methods such as InfoGraph [36] and GraphCL [45]. Table 2 shows the comparison among different models for unsupervised learning. Our proposed model achieves the best results on the PROTEINS, NCI1, IMDB-binary, and REDDIT-Multi-5K datasets and the second-best performance on the MUTAG, DD, and REDDIT-binary datasets, outperforming the current state-of-the-art contrastive learning method GraphCL.
4.1.2 Transfer Learning
We also evaluate the transfer learning performance of the proposed method. A strong baseline method for graph transfer learning is PretrainGNN [16]. The network backbone of PretrainGNN, GraphCL, and our method is a variant of GIN [42] that incorporates edge attributes. We perform 100 epochs of supervised pretraining on the preprocessed ChEMBL dataset [26, 8], which contains 456K molecules with 1,310 kinds of diverse and extensive biochemical assays.
We perform 30 epochs of fine-tuning on the 8 chemistry evaluation subsets. We use a hidden size of 300 for the classifier and a hidden size of 128 for the view generator. We train the model using a batch size of 256 and a learning rate of 0.001. The results in Table 3 are the mean±std of the ROC-AUC scores over 10 runs. Infomax, EdgePred, AttrMasking, and ContextPred are manually designed pretraining strategies from PretrainGNN [16].
Table 3 presents the comparison among different methods. Our proposed method achieves the best performance on most datasets, such as BBBP, SIDER, ClinTox, MUV, and BACE. Compared with the current state-of-the-art model, GraphCL [45], our method performs considerably better; for example, on the BBBP dataset, the ROC-AUC rises from 69.68±0.67 to 73.36±0.77. Over all datasets, the average gain of our proposed method is around 3.42. Interestingly, AttrMasking achieves the best performance on Tox21 and ToxCast, slightly better than our method. One possible reason is that node attributes are particularly important for classification on the Tox21 and ToxCast datasets.
4.1.3 SemiSupervised Learning
Model | PROTEINS | DD | NCI1 | COLLAB | GITHUB | IMDB-B | REDDIT-B | REDDIT-M5K
--- | --- | --- | --- | --- | --- | --- | --- | ---
Full Data | 78.25±1.61 | 80.73±3.78 | 83.65±1.16 | 83.44±0.77 | 66.89±1.04 | 76.60±4.20 | 89.95±2.06 | 55.59±2.24
10% Data | 69.72±6.71 | 74.36±5.86 | 75.16±2.07 | 74.34±2.00 | 61.05±1.57 | 64.80±4.92 | 76.75±5.60 | 49.71±3.20
10% GCA | 73.85±5.56 | 76.74±4.09 | 68.73±2.36 | 74.32±2.30 | 59.24±3.21 | 73.70±4.88 | 77.15±6.96 | 32.95±10.89
10% GraphCL Aug Only | 70.71±5.63 | 76.48±4.12 | 70.97±2.08 | 73.56±2.52 | 59.80±1.94 | 71.10±5.11 | 76.45±4.83 | 47.33±4.02
10% GraphCL CL | 74.21±4.50 | 76.65±5.12 | 73.16±2.90 | 75.50±2.15 | 63.51±1.02 | 68.10±5.15 | 78.05±2.65 | 48.09±1.74
10% Our Aug Only | 75.49±5.15 | 77.16±4.53 | 73.33±2.86 | 75.92±1.93 | 60.65±1.04 | 71.90±2.88 | 79.65±2.84 | 47.97±2.22
10% Our CL Naive | 74.57±3.29 | 75.55±4.76 | 73.22±2.48 | 76.60±2.15 | 60.95±1.32 | 71.00±2.91 | 79.10±4.38 | 46.71±2.64
10% Our CL Joint ($\mathcal{L}_{cl}$) | 74.66±2.58 | 76.57±5.08 | 71.78±1.61 | 75.38±2.15 | 60.39±1.50 | 70.60±4.17 | 78.90±3.11 | 46.89±3.13
10% Our CL Joint ($\mathcal{L}_{cl}$+$\mathcal{L}_{sim}$) | 75.12±3.35 | 76.23±3.57 | 72.55±2.72 | 75.60±2.08 | 60.18±1.75 | 71.70±3.86 | 79.25±2.88 | 47.51±2.51
10% Our CL Joint ($\mathcal{L}_{cl}$+$\mathcal{L}_{cls}$) | 74.75±3.35 | 76.82±3.85 | 73.07±2.31 | 76.18±2.46 | 61.75±1.30 | 71.50±5.32 | 78.35±4.21 | 47.73±2.69
10% Our CL Joint ($\mathcal{L}_{cl}$+$\mathcal{L}_{cls}$+$\mathcal{L}_{sim}$) | 75.65±2.40 | 77.50±4.41 | 73.75±2.25 | 77.16±1.48 | 62.46±1.51 | 71.90±4.79 | 79.80±3.47 | 49.91±2.70
We perform the semi-supervised graph classification task on TUDataset [28]. For our view generator, we use a 5-layer GIN with a hidden size of 128 as the embedding model. We use ResGCN [2] with a hidden size of 128 as the classifier. For GraphCL, we use the default policy (random), which randomly selects two augmentations from node dropping, edge perturbation, subgraph sampling, and attribute masking for every minibatch. For all augmentations, a node or edge can be dropped or perturbed with a probability of 0.2, which is the default setting in GraphCL [45].
We employ 10-fold cross-validation on each dataset. For each fold, we use 80% of the total data as unlabeled data, 10% as labeled training data, and 10% as labeled testing data. For the augmentation-only (Aug Only) experiments, we only perform 30 epochs of supervised training with augmentations using labeled data. For the contrastive learning experiments of GraphCL and our naive-strategy, we perform 30 epochs of contrastive pretraining followed by 30 epochs of supervised training. For our joint-strategy, there are 30 epochs of joint contrastive and supervised training.
Table 4 compares the performance obtained by different training strategies: augmentation only (Aug Only), naive-strategy (CL Naive), and joint-strategy (CL Joint). We also conduct an ablation study of our joint loss function. The proposed CL Joint approach achieves relatively high accuracy on most datasets; for example, on the DD, COLLAB, REDDIT-B, and REDDIT-M5K datasets, the joint strategy obtains the best performance, with an average gain of around 0.31 over the second-best performance. On the other datasets, the joint strategy achieves the second-best performance. Comparing Aug Only, CL Naive, and CL Joint, CL Joint is superior to the other two approaches, in particular to CL Naive.
4.2 Effectiveness of Learnable View Generators
In this section, we demonstrate the superiority of learnable graph augmentation policies over fixed ones. Since graph datasets are usually difficult to classify and visualize manually, we train a view generator on the MNIST-Superpixel dataset [27] to verify that our graph view generator captures the semantic information in graphs more effectively than GraphCL [45]; MNIST-Superpixel graphs have clear semantics that do not require any domain knowledge. The visualization results are shown in Figure 4.
Here we jointly trained the view generators with the classifier until the test accuracy (evaluated on generated views) reached a preset threshold. Since our only topological augmentation is node dropping, we compare against GraphCL's node dropping augmentation with its default setting. Figure 4 shows that our view generator is more likely to keep the key nodes of the original graph, preserving its semantic features while still providing enough variance for contrastive learning. Details of the MNIST-Superpixel dataset and more visualization examples are shown in Section 1.2 of the supplementary.
4.3 Analysis for Joint Training Strategy
We compare the naive-strategy (Algorithm 1) with the joint-strategy (Algorithm 2). We train on the COLLAB [31] dataset, which contains 5,000 social network graphs of 3 classes, with an average of 74.49 nodes and 2,457.78 edges per graph. Here we use a 5-layer GIN [42] as the backbone for both the view generator and the classifier. For the naive-strategy, there are 30 epochs of contrastive pretraining using 80% of the data as unlabeled data and 30 epochs of fine-tuning using 10% of the data. For the joint-strategy, there are 30 epochs of joint training. The learning curves are shown in Section 1.3 of the supplementary. Our results show that the joint strategy considerably alleviates overfitting and that our label-preserving view generator is very effective. We also visualize the embedding learning process of each strategy using t-SNE [39] in the supplementary. The joint training strategy learns better representations much faster, since labeled data is used for supervision, and this supervision signal also benefits the view generator's learning.
5 Conclusion
In this paper, we presented a learnable data augmentation approach for graph contrastive learning, in which GIN-based view generators produce different views of the original graphs. To preserve the semantic label of the input graph, we developed a joint training strategy that alternately optimizes the view generators, graph encoder, and classifier. We conducted extensive experiments on a number of datasets and tasks, including semi-supervised learning, unsupervised learning, and transfer learning; the results demonstrate that our proposed method outperforms its counterparts on most datasets and tasks. In addition, we visualized the generated graph views, which preserve the discriminative structure of the input graph and thus benefit classification. Finally, the t-SNE visualization illustrated that the proposed joint training strategy is a better choice for semi-supervised graph representation learning.
Appendix A More Analysis
A.1 An Insight into GraphCL Augmentations
Here we show that both the selection policy and the intensity of augmentations matter to the final results. Among previous works, GraphCL [45] enables the most flexible set of graph data augmentations to date: node dropping, edge perturbation, subgraph sampling, and attribute masking.

- Node dropping randomly removes a certain ratio of the nodes.
- Edge perturbation first randomly removes a certain ratio of the edges, then randomly adds the same number of new edges.
- Subgraph sampling chooses a connected subgraph by first picking a random center node, then gradually adding its neighbors until a certain ratio of the total nodes is reached.
- Attribute masking randomly masks the attributes of a certain ratio of the nodes.
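For concreteness, the four augmentations above can be sketched on an edge-list graph representation as follows. This is a minimal numpy sketch with illustrative function names, not GraphCL's actual implementation; "aug_ratio" plays the role of the "aug ratio" hyperparameter discussed below.

```python
import numpy as np

def node_drop(edges, num_nodes, aug_ratio=0.2, rng=None):
    """Randomly remove a fraction of nodes and all edges touching them."""
    rng = rng or np.random.default_rng(0)
    drop = set(rng.choice(num_nodes, int(num_nodes * aug_ratio), replace=False))
    return [(u, v) for u, v in edges if u not in drop and v not in drop]

def edge_perturb(edges, num_nodes, aug_ratio=0.2, rng=None):
    """Remove a fraction of edges, then add the same number of random edges."""
    rng = rng or np.random.default_rng(0)
    k = int(len(edges) * aug_ratio)
    keep = [edges[i] for i in rng.choice(len(edges), len(edges) - k, replace=False)]
    added = [(int(rng.integers(num_nodes)), int(rng.integers(num_nodes)))
             for _ in range(k)]
    return keep + added

def subgraph(edges, num_nodes, aug_ratio=0.2, rng=None):
    """Grow a connected subgraph from a random center node until it
    contains aug_ratio of all nodes; keep only its induced edges."""
    rng = rng or np.random.default_rng(0)
    adj = {i: set() for i in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    target = max(1, int(num_nodes * aug_ratio))
    start = int(rng.integers(num_nodes))
    kept, frontier = {start}, set(adj[start])
    while frontier and len(kept) < target:
        n = int(rng.choice(sorted(frontier)))
        kept.add(n)
        frontier = (frontier | adj[n]) - kept
    return [(u, v) for u, v in edges if u in kept and v in kept]

def attr_mask(x, aug_ratio=0.2, rng=None):
    """Zero out the attribute vectors of a fraction of nodes."""
    rng = rng or np.random.default_rng(0)
    x = x.copy()
    idx = rng.choice(len(x), int(len(x) * aug_ratio), replace=False)
    x[idx] = 0.0
    return x
```

Note that all four operations are driven purely by random sampling: none of them looks at the semantics of the input graph, which is the limitation our learnable view generator addresses.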
We note that the only augmentation selection policy in all existing works is uniform sampling, and that every augmentation method takes a hyperparameter "aug ratio" that controls the portion of nodes/edges selected for augmentation. The "aug ratio" is set to a constant in every experiment (e.g., 20% by GraphCL's default). We perform an ablation study of these augmentation methods, shown in Table LABEL:tabgraphclaugselectablation and Table 6, and conclude that:

- The positive contributions of edge perturbation and subgraph augmentation to graph contrastive learning are very limited (or even negative).
- The subgraph augmentation is contained in the augmentation space of node dropping: for instance, the potential view space of dropping 80% of the nodes contains the potential view space of selecting a connected subgraph with 20% of the nodes.
- The choice of "aug ratio" has a considerable effect on the final performance; it is inappropriate to apply the same "aug ratio" to different augmentations, datasets, and tasks.
A.2 The Effectiveness of Our Learnable Graph Augmentations
Here we demonstrate the superiority of learnable graph augmentation policies over fixed ones. Since graph datasets are usually difficult to classify and visualize manually, we trained a view generator on the MNIST-Superpixel dataset [27] to verify that our graph view generator captures the semantic information in graphs more effectively than GraphCL [45]. The visualization is shown in Figure 4.
The MNIST-Superpixel dataset [27] consists of superpixel graphs built from the MNIST dataset [21]; it contains 60,000 training samples and 10,000 testing samples, and each graph has 75 nodes. The node attribute can be understood as the intensity of each superpixel.
We jointly trained the view generators with the classifier until the test accuracy (evaluated on generated views) reached . Since node dropping is our only topological augmentation, we compare against GraphCL's node dropping augmentation with its default setting . Figure 4 shows that our view generator is more likely to keep the key nodes of the original graph, preserving its semantic features while still providing enough variance for contrastive learning.
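The learned node-dropping generator samples a keep/drop decision per node; the paper relies on Gumbel-softmax sampling [17] to keep this step differentiable. Below is a minimal numpy sketch of that sampling step under the assumption that the per-node logits come from the view-generator GNN; the function name and array layout are illustrative, not the actual implementation.

```python
import numpy as np

def gumbel_keep_mask(logits, tau=1.0, rng=None):
    """Sample a per-node keep/drop decision from (N, 2) logits.

    Column 0 scores "drop", column 1 scores "keep". Returns the soft
    keep probability (what a differentiable pipeline would propagate)
    and the hard boolean mask derived from it.
    """
    rng = rng or np.random.default_rng(0)
    # Gumbel(0, 1) noise via the inverse-CDF trick
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = np.exp((logits + gumbel) / tau)
    y /= y.sum(axis=1, keepdims=True)   # soft one-hot over {drop, keep}
    return y[:, 1], y.argmax(axis=1) == 1
```

Lowering the temperature `tau` makes the soft probabilities approach the hard mask, which is how the generator can be trained end-to-end while still emitting discrete node drops.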
A.3 Analysis for Joint Training Strategy
Here we compare the naive strategy (Algorithm 1 in the paper) with the joint strategy (Algorithm 2 in the paper) on the COLLAB [31] dataset, which contains 5,000 social network graphs of 3 classes, with an average of 74.49 nodes and 2,457.78 edges per graph. We use a 5-layer GIN [42] as the backbone for both the view generator and the classifier. The naive strategy runs 30 epochs of contrastive pretraining on 80% of the data (unlabeled), followed by 30 epochs of fine-tuning on 10% of the data; the joint strategy runs 30 epochs of joint training.
We compare the learning curves in Figure 6, where both contrastive losses are scaled to fit in the figure. The contrastive loss of the naive strategy drops much faster than that of the joint strategy; however, the test accuracy of the naive strategy is lower and shows a downward tendency, indicating overfitting. The joint strategy considerably alleviates this overfitting, which also shows the effectiveness of our label-preserving view generator.
We also visualize the embedding learning process under each strategy using t-SNE [39] in Figure 5. Figure 5 (a) demonstrates that during contrastive learning, graphs with the same semantic label gradually cluster together, but the decision boundary between classes remains difficult to recognize. Fine-tuning the model with labeled data (Figure 5 (b)) yields much better graph representations for classification, indicating that contrastive learning alone benefits classification to some extent but still falls short of supervised learning. Figure 5 (c) presents the joint training process: once label supervision is introduced, the model learns better representations within a few epochs. Moreover, the sim and cl loss values both decrease, indicating that the views of one input graph become more diverse while their representations remain close; hence the view generators learn to generate different views that preserve the semantic label of the input graph.
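For reference, the contrastive (cl) objective over view similarities (sim) discussed above can be written as an NT-Xent loss over the two generated views of each graph in a batch, following the multi-class N-pair formulation [35] used by GraphCL [45]. The sketch below is a minimal numpy version with illustrative names; the paper's actual loss may differ in details such as symmetrization.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss over two batches of view embeddings.

    z1[i] and z2[i] are embeddings of two generated views of graph i
    (a positive pair); all other rows in the batch act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature               # (B, B) cosine similarities
    # cross-entropy with the diagonal (matching views) as the target
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))
```

A decreasing cl loss therefore means exactly what the t-SNE plots suggest: representations of paired views move together relative to representations of other graphs in the batch.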
References
 [1] (2018) Sub2vec: feature learning for subgraphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 170–182. Cited by: §4.1.1.
 [2] (2019) Are powerful graph neural nets necessary? a dissection on graph classification. arXiv preprint arXiv:1905.04579. Cited by: §2.1, §4.1.3.

 [3] (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. Cited by: §1, §2.3, §2.4, §3.3.1.
 [4] (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §2.3.

 [5] (2019) AutoAugment: learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. Cited by: §1, §2.4.
 [6] (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973. Cited by: §1, §2.4.
 [7] (2019) Graph neural networks for social recommendation. In The World Wide Web Conference, pp. 417–426. Cited by: §1.
 [8] (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research 40 (D1), pp. D1100–D1107. Cited by: §4.1.2.
 [9] (2020) Bootstrap your own latent: a new approach to selfsupervised learning. arXiv preprint arXiv:2006.07733. Cited by: §2.3.
 [10] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §4.1.1.
 [11] (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §2.3.
 [12] (2017) Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216. Cited by: §1, §2.2.
 [13] (2020) Contrastive multiview representation learning on graphs. In International Conference on Machine Learning, pp. 4116–4126. Cited by: §1, §2.3, §3.1.

 [14] (2020) Faster AutoAugment: learning augmentation strategies using backpropagation. In European Conference on Computer Vision, pp. 1–16. Cited by: §2.4.
 [15] (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §2.3.
 [16] (2019) Strategies for pretraining graph neural networks. arXiv preprint arXiv:1905.12265. Cited by: §1, §2.2, §4.1.2, §4.1.2.
 [17] (2016) Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §1, §3.2.
 [18] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1.
 [19] (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §1, §2.2, §2.4, §3.1.
 [20] (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105. Cited by: §2.2.
 [21] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §A.2.
 [22] (2020) Differentiable automatic data augmentation. In European Conference on Computer Vision, pp. 580–595. Cited by: §2.4.
 [23] (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2.4.
 [24] (2018) An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893. Cited by: §1.
 [25] (2019) GraphNVP: an invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600. Cited by: §2.4.
 [26] (2018) Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chemical Science 9 (24), pp. 5441–5451. Cited by: §4.1.2.

 [27] (2017) Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124. Cited by: §A.2, §1, §4.2.
 [28] (2020) TUDataset: a collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), arXiv:2007.08663. Cited by: §4.1.3.

 [29] (2017) graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: §4.1.1.
 [30] (2005) To transfer or not to transfer. In NIPS 2005 Workshop on Transfer Learning, Vol. 898, pp. 1–4. Cited by: §2.2.
 [31] (2015) The network data repository with interactive graph analytics and visualization. In AAAI. Cited by: §A.3, §4.3.
 [32] (2011) Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (9). Cited by: §4.1.1.
 [33] (2009) Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495. Cited by: §4.1.1.
 [34] (2020) Point-GNN: graph neural network for 3D object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1711–1719. Cited by: §1.
 [35] (2016) Improved deep metric learning with multiclass npair loss objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 1857–1865. Cited by: §3.3.1.
 [36] (2019) InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000. Cited by: §4.1.1.
 [37] (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1.
 [38] (2020) What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §1, §2.4, §3.3.3, §3.3.
 [39] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11). Cited by: §A.3, §4.3.
 [40] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1.
 [41] (2018) Deep graph infomax. arXiv preprint arXiv:1809.10341. Cited by: §1, §2.3.
 [42] (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §A.3, §1, §2.1, §3.2, §4.1.2, §4.3.
 [43] (2015) Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. Cited by: §4.1.1.
 [44] (2019) Xlnet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1.
 [45] (2020) Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33. Cited by: §A.1, §A.2, §1, §2.3, §2.4, §3.1, §3.3.1, §3.3.2, §3.3.3, §4.1.1, §4.1.1, §4.1.2, §4.1.3, §4.2.
 [46] (2020) When does self-supervision help graph convolutional networks? In International Conference on Machine Learning, pp. 10871–10880. Cited by: §1.
 [47] (2021) Barlow Twins: self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230. Cited by: §2.3.
 [48] (2020) Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131. Cited by: §3.1.
 [49] (2020) Graph contrastive learning with adaptive augmentation. arXiv preprint arXiv:2010.14945. Cited by: §1, §2.3, §3.1.