# Graph-based representation for identifying individual travel activities with spatiotemporal trajectories and POI data

### Problem formulation

Travel activity zones aggregated from travel trajectory footprints of an individual (*u*) comprise a graph:

$$begin{aligned} G_{u} = (V_{u},E_{u}), end{aligned}$$

(1)

where (V_{u}) represents a set of graph nodes:

$$begin{aligned} V_{u}={v_{u,i}|i in [0,n]}, end{aligned}$$

(2)

and each node ((v_{u,i})) represents an individual representative travel activity zone (ie, activity node) resulting from aggregating individual travel footprints. Activity nodes of the same individual are connected via a set of graph edges, denoted as:

$$begin{aligned} E_{u}={e_{u,j}|j in [0,m]}. end{aligned}$$

(3)

Each edge ((e_{u,j})) represents a directed trip between two end-on activity nodes. General statistics representing spatiotemporal properties of travel trajectories are thus encoded as node features ((X_{v})) and edge weights ((Y_{e})). Distributions of surrounding POIs are also encoded as node features. The encoding mechanism is explained in the next two subsections. Each activity node has two types of neighbors, namely in-neighbors ((N_{in})) and out-neighbors ((N_{out})). For activity node (v’) as a neighbor node of activity node *v*,

$$begin{aligned}&{if (v’,v) in E, v’ in N_{in}}, end{aligned}$$

(4)

$$begin{aligned}&{if (v,v’) in E, v’ in N_{out}}. end{aligned}$$

(5)

Based on the travel activity graph, Gstp2Vec is demonstrated in Fig. 1. Essentially, two sets of fully-connected feedforward neural networks (NN) are created by combining weights with feature embeddings for propagating the information from nodes’ neighbors through the graph structure^{15.37}. One set of NNs are wrapped as multihop (ie, *K*-hop) aggregators for accumulating neighborhood embeddings from sampled neighbor nodes and edges within *K* hops. Specifically, each aggregator function (eg, *AGG*1, *AGG*2) includes a fully connected NN layer with a nonlinear activation function (sigma). Node embeddings (initially node features) and edge weights are concatenated to generate low-dimensional embeddings through the aggregator.

Another set of NNs are built to generate updated node embeddings with the input as node embeddings concatenated with aggregated neighborhood embeddings. Updated node embeddings are treated as predictive representations, which are input into another activation function for inferring travel activity types. This process is called forward propagation^{15}, which is iterated over all activity nodes during one epoch for training the model. In this way, information of activity nodes far away from the current one is propagated to it through multihop neighbors and thus contributes to identifying its activity type. Weight matrices and aggregator parameters in the forward propagation are tuned by minimizing graph based cross-entropy loss with an Adam optimizer^{38}thus, making Gstp2Vec a supervised learning model.

### Activity nodes

Individual travel trajectories are represented as sequences of travel footprints, with each footprint representing individual presence at a location and a time point, and denoted as a pair of geographic coordinates with a timestamp. Travel stay points with a speed slower than 1300 m/h^{39} are detected based on spatial adjacency using density-based spatial clustering of applications with noise (DBSCAN)^{40}. Activity zones are generated as convex hulls of the detected spatial clusters (ie, stay regions) so that spatial scopes are specified for the represented travel activities^{41}. Then activity zones are used to produce features for activity type identification.

### Node features

The topological relationship between each node and its neighbor nodes, and the distribution of node features on its neighborhood are encoded and propagated to identify the activity type represented by each node. General statistics signifying distribution patterns of footprints on time and space, and distributions of surrounding POIs for each activity zone are calculated and concatenated as node features.

Specifically, the total or average numbers of footprints within each of 24 h on either weekdays or weekends are counted to represent time properties (*t*) of each individual activity zone, where *T* denotes the transpose of a matrix:

$$begin{aligned} t_{weekday}= & {} [n_{1},n_{2},ldots ,n_{24}]^T, end{aligned}$$

(6)

$$begin{aligned} {overline{t}}_{weekday}= & {} [{overline{n}}_{1},{overline{n}}_{2},ldots ,{overline{n}}_{24}]^T, end{aligned}$$

(7)

$$begin{aligned} t_{weekend}= & {} [n_{1}’,n_{2}’,ldots ,n_{24}’]^T, end{aligned}$$

(8)

$$begin{aligned} {overline{t}}_{weekend}= & {} [{overline{n}}_{1}’,{overline{n}}_{2}’,ldots ,{overline{n}}_{24}’]^T, end{aligned}$$

(9)

$$begin{aligned} t= & {} left[ t_{weekday},{{overline{t}}_{weekday}}, t_{weekend}, {{overline{t}}_{weekend}}right] . end{aligned}$$

(10)

Additionally, average durations spent at an individual activity zone during each date of a week are calculated and concatenated to generate an augmented representation^{42} ((t^{+})) of temporal patterns:

$$begin{aligned} overline{Delta t}_{dow}= & {} left[overline{Delta t}_{1}’,overline{Delta t}_{2}’,ldots ,overline{Delta t}_{7}’right]^T, end{aligned}$$

(11)

$$begin{aligned} t^{+}= & {} [t,overline{Delta t}_{dow}]. end{aligned}$$

(12)

The maximum and average values of elapsed time ((Delta t_{max}) and (overline{Delta t})) or distance ((Delta d_{max}) and (overline{Delta d})) to the next travel footprint for all footprints within an activity zone are also calculated as spatiotemporal features (*s*):

$$begin{aligned} s=left[ Delta t_{max},overline{Delta t},Delta d_{max},overline{Delta d}right] . end{aligned}$$

(13)

To encode POI distribution characteristics, in analogy to natural language processing^{43}each distinct POI feature class (eg, dormitory, café, bar, hospital, etc.) is considered as a word^{44}. All possible POI feature classes are considered as a dictionary, and feature classes of the POIs overlapped with an activity zone are considered as a corpus. A total of 335 distinct POI feature classes (ie, words) are collected from the OSM dictionary. Then the occurrences of each possible POI feature class are counted for an activity zone to produce a sparse POI feature vector (*p*).

Additionally, 335 POI feature classes are aggregated into 18 distinct place types (eg, home, eating, education) based on their functionality in urban settings (eg, café (rightarrow) eating)^{25}. Then a smaller word dictionary is built to produce a denser vector, which is concatenated with *p* to generate an augmented POI feature vector ((p^{+})). Next, (p^{+}) is concatenated with the aforementioned spatiotemporal feature vector to produce a node feature matrix ((X_v)) for each individual activity zone:

$$begin{aligned} {X_v}=left[ {t^{+}},s,{p^{+}}right] . end{aligned}$$

(14)

### Edge weights

A trip is defined as the transition from one travel activity (ie, origin) to another (ie, destination) for an individual, which usually also indicates the spatial transition of the individual from one location to the other. In our proposed Gstp2Vec framework, trip directions are consistent with edge directions. In addition to trip direction and properties of its end-on activity nodes, trip properties also include statistics measuring individual transitions over space and time, such as travel frequency (*f*), average travel duration (({overline{t}})), and average travel distance (({overline{d}})), which are encoded as edge weights ((Y_e)).

Specifically, *f* is calculated by counting trip occurrences from each origin activity zone to the corresponding destination zone for each individual by going through all travel footprints within the origin zone. Then ({overline{t}}) is measured by averaging the time spent on those trips, and ({overline{d}}) is measured by averaging their straight line distances on 2D space. These statistics measuring different aspects of trip properties are concatenated to represent *Y*:

$$begin{aligned} {Y_e}=left[ f,{overline{t}},{overline{d}}right] . end{aligned}$$

(15)

### Aggregators

As shown in Table 1 and Fig. 2, aggregators (ie, aggregation functions) in Gstp2Vec accept feature embeddings of sampled neighbor nodes, which are initialized as node features concatenated with their corresponding edge weights. Since neighbor nodes are not ordered by nature in our proposed framework, aggregation functions should be symmetric to be operated on arbitrarily ordered node embeddings. Besides, they need to be simple and trainable^{15}. Max pooling aggregator is both symmetric and trainable, and is thus applied in our proposed framework^{45}.

Specifically, a single-layer perceptron is applied as the fully-connected NN inside an aggregator. During each iteration of the forward propagation, a fixed number (eg, 2) of neighbor nodes are sampled for each activity node. Then the perceptron is applied on the feature embedding matrix of each sampled neighbor node to compute a series of features, and an element-wise maximum value is generated for each computed feature among all sampled neighbor nodes and passed to the current node. In this way, the model effectively captures different aspects of the neighborhood set^{15}.

### Supervised learning

Model weights are tuned iteratively in a manner of end-to-end supervised learning^{14,15,31}. First, graphs consisting of activity zones and trips are split into training, validation, and test sets based on individuals (Fig. 3). As such, activity zones of the same individual would not appear in different sets (eg, both training and test sets). For example, the graph ((G_{u_{2}}) in Fig. 3) constituted by activity zones of individuals (u_{2}) is divided into the test set, while two other graphs (ie, (G_{u_{1}}) and (G_{u_{3}})) are in the training set and the remaining one (ie, (G_{u_{4}}) is in the validation set.

The training process includes two steps, namely forward propagation and parameter learning. Forward propagation (Z in Eq. (16)^{46}) first generates node embeddings by concatenating node features with neighborhood embeddings (Fig. 2), which in turn are generated via aggregators as discussed above. Then the concatenated node features are assimilated by a single-layer NN with an activation function, which produces updated node feature embeddings and eventually generates the predictive representations (ie, (h^{k-1}_v)) (Table 2). Next, another softmax function is applied on the output representations to predict travel activity types via multicategory classification (argmax in Eq. (16)). For parameter learning, graph based cross-entropy loss is applied on the predicted results to tune previous weight matrices.

$$begin{aligned} {Z=fbigg (h_{v}^{k-1},h_{v’}^{k}bigg )=argmaxbigg (softmaxbigg (sigma bigg (W^kcdot CONCATbigg (h_{v}^{k-1},h_{N(v)}^{k}bigg )bigg )bigg )bigg )}. end{aligned}$$

(16)