Multiagent-Based Simulation of Temporal-Spatial Characteristics of Activity-Travel Patterns Using Interactive Reinforcement Learning



Introduction
Over the last few decades, activity-based approaches have become the main theme in transportation demand modeling, taking the place of trip-based approaches. The trip-based approach has several drawbacks: trip generation is fixed and independent of the transportation system; the fact that travel demand derives from the need to participate in activities is not represented; and the spatial and temporal relationships among trips and activity patterns are ignored. These drawbacks brought the activity-based approach into transportation demand modeling.
The first activity-based approaches appeared in the 1970s [1][2][3]. Those pioneering studies explored choices and constraints in travel demand. Since then, activity-based modeling has flourished. Various methodologies have been introduced, and they can be classified into three categories.
The first category is the utility-maximizing model (or econometric model), which assumes that individuals seek to maximize their cumulative utility when performing activities. These models link the sociodemographics of individuals or households, transportation policies, and other environmental factors to activity and travel patterns. Econometric models, ranging from discrete choice models (such as multinomial logit and nested logit models) to hazard duration models, remain a powerful approach in activity-travel analysis [4][5][6][7].
The second category is the computational process model (CPM), which uses context-dependent choice heuristics to model an individual's decision process. A computational process model is a set of condition-action rules that specify how a decision is made. One precursor of CPM is the time-space prism method. Hägerstrand [3] introduced three-dimensional space-time models, in which limited resources of time and space constrain each individual's behavioral alternatives [8]. Techniques used in more recent studies include decision trees, neural networks, and Bayesian networks [9][10][11].
Combining the two approaches above leads to hybrid models, which integrate econometric models and CPM. For example, decision trees have been combined with parametric modeling [12], and random utility maximization has been incorporated into an activity scheduling model [13]. New algorithms such as reinforcement learning have also been introduced into the field.
Reinforcement learning integrates the concepts of reward (utility) maximization and context-dependent choice heuristics. Its applications include robotics, game theory, dispatching systems, and financial trading [14][15][16][17]. Tan used reinforcement learning to formalize an automated process for determining stock cycles by tuning the momentum and the average periods; across five stocks, the experimental results beat the market by about 50 percentage points [18]. Lahkar and Seymour studied reinforcement learning in a population game, in which agents revise mixed strategies using the Cross rule of reinforcement learning [17]. In addition, Jasmin et al. formulated economic dispatch as a multistage decision-making problem and solved it using reinforcement learning [15].
Applying reinforcement learning in transportation demand modeling has several advantages. First, its imitation of human learning through trial and error interactions with a dynamic environment helps to explain behavioral mechanisms [19]. The RL mechanism is distinguished from other computational cognitive mechanisms by its emphasis on an individual learning from direct interaction with the decision environment, in the presence of an explicit goal and feedback and without relying on any exemplary supervision. Second, it does not need an expert system to tell it which selection is right and which is wrong. Third, it can react to unforeseen events and take both long-term learning and short-term dynamics into account. Among the first attempts, Charypar and Nagel built a basic model of activity time plans using Q-learning and obtained quite realistic results [20]. This model was then modified to handle both the time and location choices of activity-travel patterns [21]. Because Q-learning generally takes a long time to converge and suffers from the curse of dimensionality as the problem grows more complex, Q-learning was combined with the regression tree method to form a new algorithm called Q-tree [22].
The studies mentioned above leave several aspects that need further development.
(i) In most reinforcement-learning-based studies, although the form of the reward function has been scrutinized, the rewards are based on assumed values that are hard to obtain from survey data, so the results are hard to put into practical use.
(ii) In many multiagent systems, "multi" refers to several components of the system, such as roads, intersections, and travelers, rather than to multiple travelers. Interactions among travelers are neglected.
(iii) The analysis of results is often limited to individual activity-travel schedules. Macroscopic characteristics such as traffic flow distribution are often ignored.
In this study we propose an interactive reinforcement learning algorithm in which individuals not only receive information from the environment but also give feedback to it. We do this by adding the road congestion degree, which is determined by travelers' decisions, to the algorithm. The dynamic environment is the medium that passes the influence of one traveler's decision on to others. The self-organization effect produced by this mechanism brings the system to a dynamic equilibrium. The algorithm not only ensures the rationality of each traveler's behavior but also yields aggregate temporal-spatial traffic features such as the traffic flow distribution and the distribution of activity locations. We also seek a compromise between the well-established theoretical form of the reward function and the quality of the data that practical surveys can actually provide. The simplified reward function makes the algorithm immediately applicable.
The rest of this paper is organized as follows. Section 2 introduces the modified multiagent-based Q-learning algorithm. Section 3 is devoted to the analysis and calculation of the survey data. Section 4 presents the temporal-spatial simulation results for Shangyu city's traffic system. Section 5 summarizes the findings of this paper and discusses future research directions.

Multiagent-Based Q-Learning Method
2.1. Reinforcement Learning. A multiagent system focuses on the dynamic and complex collective behavior of several agents. Because a multiagent system has no global control and each agent may have incomplete information, the system must learn repeatedly to improve its performance. Reinforcement learning is a major method of this kind. Kaelbling et al. [19] define reinforcement learning as the problem faced by an agent that must learn behavior through trial and error interactions with a dynamic environment. Moreover, the consequences of actions change over time and depend on the current and future states of the environment. Reinforcement learning can deal with this uncertainty through continuous observation of the environment and through consideration of the indirect and delayed effects of actions.
Basic concepts concerning reinforcement learning include the following.
(i) Agent: in this paper, an agent means a traveler.
(ii) State: an agent's state is a vector (activity, start time, duration, location, and congestion degree); the congestion degree is denoted by V.
(iii) Location: the unit of location is the traffic zone, an area that serves multiple functions including leisure, shopping, and working.
(iv) Activity: activities include home, work, maintenance, and leisure.
(v) Action: there are 4 actions: staying at the current activity or moving to one of the other 3 activities. Following the way activities are represented, actions are denoted as h, w, s, and l for brevity.
(vi) Duration and start time: time variables must be discrete in Q-learning.
Reinforcement learning tasks are generally treated in discrete time steps. At each time step $t$, the agent observes the current state $s_t$ and chooses a possible action $a_t$ to perform, which leads to the succeeding state $s_{t+1} = \delta(s_t, a_t)$. The environment responds by giving the agent a reward $r(s_t, a_t)$. These rewards can be positive, zero, or negative. Desirable rewards may well come with a delay: some actions and their resulting state transitions bring low rewards in the short term but lead later to state-action pairs with much higher rewards.
For this reason, the task of the agent is to learn a policy $\pi$, mapping each state $s$ to an action $a$, that yields the maximal accumulative reward. Following an arbitrary policy $\pi$ from an arbitrary state $s_t$, the accumulative reward can be formulated as follows:
$$V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i},$$
where $r_{t+i}$ represents the scalar reward received $i$ steps in the future and $\gamma$ is the discounting factor. The agent takes only the immediate reward into account if $\gamma$ is set to zero.
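As a small numerical illustration of the accumulative reward described above, the following sketch sums a finite sequence of rewards with a discounting factor. The function name `discounted_return` and the reward values are hypothetical and chosen only for illustration; the infinite sum is truncated to the length of the list.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i] over a finite reward sequence."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total


# Made-up rewards received 0, 1, and 2 steps in the future.
example_rewards = [1.0, 0.0, 10.0]

# With gamma = 0 only the immediate reward counts.
immediate_only = discounted_return(example_rewards, 0.0)

# With gamma = 0.5 the delayed reward of 10 is weighted by 0.25.
with_delay = discounted_return(example_rewards, 0.5)
```

A larger discounting factor makes the agent value delayed rewards more, which is what allows a low-reward action now to be preferred when it leads to a high-reward state later.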

Q-Learning Algorithm.
The agent needs to learn the optimal policy $\pi^{*}(s)$ that maximizes the accumulative reward. Learning it directly, however, requires that the immediate reward function $r$ and the state transition function $\delta$ be known in advance. In reality, it is usually impossible for the agent to predict in advance the exact outcome of applying an arbitrary action to an arbitrary state. In other words, the domain knowledge is probably not perfect. Q-learning was devised to select optimal actions even when the agent has no knowledge of the reward and state transition functions.
We define $\hat{Q}$ as the estimate of the true $Q$-value. The Q-learning algorithm maintains a large table with an entry for each state-action pair. At the start, the values $\hat{Q}(s, a)$ are filled with random numbers. The agent repeatedly observes its current state $s$, chooses a possible action $a$ to perform, and determines its immediate reward $r(s, a)$ and the resulting new state $\delta(s, a)$. The $\hat{Q}(s, a)$ value is then updated according to the following rule:
$$\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(\delta(s, a), a').$$
That is to say, the Q-value of the current state-action pair is refined based on its immediate reward and the Q-value of its next state. The agent can reach a globally optimal solution by repeatedly selecting the action that maximizes the local value of $\hat{Q}$ for the current state. This is only a brief introduction to Q-learning; a detailed introduction can be found in [20]. The process can be described as follows:
(1) initialize the $\hat{Q}$-values;
(2) select a random starting state $s$ that has at least one possible action to select from;
(3) select one of the possible actions, which leads to the next state;
(4) update the $\hat{Q}$-value of the state-action pair according to the update rule above;
(5) go back to Step 3 if the new state has at least one possible action; if not, go to Step 2.
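The five steps above can be sketched as a tabular procedure. This is a minimal illustration, not the authors' implementation: it assumes the deterministic form of the update, in which the new estimate is the immediate reward plus the discounted best estimate for the next state, and it initializes the table with zeros rather than random numbers so that the run is reproducible. The toy chain of four states, the function names, and the reward values are all hypothetical.

```python
import random


def q_learning(states, actions, delta, reward, gamma=0.9,
               episodes=500, max_steps=50, seed=0):
    """Tabular Q-learning with a deterministic update rule."""
    rng = random.Random(seed)
    # (1) Initialize the Q-table. Zeros are used here for reproducibility;
    #     the text describes random initial values.
    q = {(s, a): 0.0 for s in states for a in actions(s)}
    startable = [s for s in states if actions(s)]
    for _ in range(episodes):
        # (2) Random starting state with at least one possible action.
        s = rng.choice(startable)
        for _ in range(max_steps):
            if not actions(s):
                break  # (5) no action left: start again from step (2)
            # (3) Pick one possible action; it leads to the next state.
            a = rng.choice(actions(s))
            s_next = delta(s, a)
            # (4) Immediate reward plus discounted best value of the next state.
            best_next = max((q[(s_next, b)] for b in actions(s_next)),
                            default=0.0)
            q[(s, a)] = reward(s, a) + gamma * best_next
            s = s_next
    return q


# Hypothetical toy chain 0 -> 1 -> 2 -> 3, where state 3 is terminal.
STATES = [0, 1, 2, 3]


def actions(s):
    return [] if s == 3 else ["stay", "move"]


def delta(s, a):
    return s + 1 if a == "move" else s


def reward(s, a):
    return 10.0 if (s == 2 and a == "move") else 0.0


q = q_learning(STATES, actions, delta, reward)
```

In this toy problem the only reward is earned on the last move, yet the learned values for earlier states are positive because the reward is propagated backward through the discounted maximum, which is the delayed-reward behavior the text describes.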

Reward Function.
Previous researchers in this domain constructed their reward functions based on activity start time, duration, length of travel, and so on [20,22]. We adopt this approach; our reward function contains the following parts.

Reward Based on Attraction Degree of Zones.
In this paper a location is a zone that has multiple land use functions. In reality, people are sometimes willing to travel a long time to go shopping downtown because its land use characteristics make downtown more attractive. To quantify this, a reward based on the attraction degree of zones is added to the reward function. It applies only to maintenance and leisure activities because home and work have fixed locations. We assume that the more maintenance activities are conducted in a zone, the higher the attraction degree of that zone; the same applies to leisure activities. The attraction degree of a zone for a given activity type is computed from the number of leisure or maintenance activities conducted in that zone, the average number of such activities conducted per zone, and the maximum number of such activities conducted in any zone. The reward is 50 times the attraction degree of the zone for the given activity type.

Reward Based on Activity Duration.
When an agent conducts an activity whose duration is within a reasonable range, it should receive a fairly large accumulative reward. While the duration is less than the expected value, the marginal benefit is positive; once the duration exceeds the expected value, the marginal benefit is negative. The duration-based reward is defined in terms of the reasonable minimum, average, and maximum duration of each activity. These are, respectively, the 5th, 50th, and 95th percentiles of the duration of that activity in the survey data.
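The three duration parameters can be read off the survey durations as percentiles. The sketch below assumes durations are already expressed in time steps and uses `numpy.percentile` with its default linear interpolation; the function name `duration_bounds` and the sample data are hypothetical, and the reward formula built on these parameters is not reproduced here.

```python
import numpy as np


def duration_bounds(durations):
    """Return the 5th, 50th, and 95th percentile of observed durations."""
    low, mid, high = np.percentile(durations, [5, 50, 95])
    return float(low), float(mid), float(high)


# Hypothetical durations of one activity, in time steps.
sample_durations = np.arange(1, 101)
d_min, d_avg, d_max = duration_bounds(sample_durations)
```

Using percentiles instead of the raw minimum and maximum keeps a few extreme diary entries from stretching the reasonable range.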

Reward Based on Activity Start Time.
Each activity's start time distribution is calculated from the survey data.
To smooth the distribution curve and diminish the effect of randomness, we fit it with polynomial functions, which are then normalized. The resulting reward depends on the type of activity and on the start time of the activity, which takes values from 1 to 96.
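A minimal sketch of the smoothing step, assuming 96 time steps per day and using `numpy.polyfit` as a stand-in for the MATLAB polyfit function the paper reports using. The polynomial degree, the clipping of negative fitted values, and the normalization by the peak value are assumptions made for illustration; the source does not state which normalization it applies. The synthetic counts are made up.

```python
import numpy as np


def smoothed_start_time_profile(counts, degree=6):
    """Fit a polynomial to start-time counts and scale its peak to 1."""
    steps = np.arange(1, len(counts) + 1)
    # Center and scale the time axis so the fit is well conditioned.
    x = (steps - steps.mean()) / steps.std()
    coeffs = np.polyfit(x, counts, degree)
    # A fitted polynomial can dip below zero; clip so the profile is nonnegative.
    fitted = np.clip(np.polyval(coeffs, x), 0.0, None)
    # Assumed normalization: divide by the maximum fitted value.
    return fitted / fitted.max()


# Synthetic start-time counts with a single morning bump (illustration only).
time_steps = np.arange(1, 97)
synthetic_counts = 100.0 * np.exp(-0.5 * ((time_steps - 32) / 12.0) ** 2)
profile = smoothed_start_time_profile(synthetic_counts)
```

The resulting profile can be looked up by time step to give a start-time score between 0 and 1 for each activity type.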

Reward Based on Travel Time.
Some scholars define the travel-time-based reward as a negative function of travel time [23]. We adopt this form with some modifications because the influence of the congestion degree is taken into account: travel time is no longer a fixed value determined by the distance between zones but depends on the congestion degree of the OD pair. We use the impedance function widely accepted in China [21], which relates the actual speed $v$ to the free flow speed $v_0$. With $t_0$ denoting the free flow travel time, the actual travel time $t$ is defined as $t = t_0 \cdot v_0 / v$.
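The last relation can be checked with a short calculation. The sketch below covers only the step from speed to travel time; the actual speed would come from the impedance function, which is not reproduced here, so it is passed in as a plain argument. The function name and the numbers are hypothetical.

```python
def actual_travel_time(free_flow_time, free_flow_speed, actual_speed):
    """Scale free flow travel time by the ratio of free flow to actual speed."""
    if actual_speed <= 0:
        raise ValueError("actual speed must be positive")
    return free_flow_time * free_flow_speed / actual_speed


# A trip that takes 10 minutes at 40 km/h takes twice as long at 20 km/h.
congested_time = actual_travel_time(10.0, 40.0, 20.0)
uncongested_time = actual_travel_time(10.0, 40.0, 40.0)
```

Because travel time grows as the actual speed drops, a congested OD pair yields a lower travel-time reward, which is how congestion feeds back into each agent's choice.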

Flow-Chart of Calculation.
When Q-learning is applied in this paper, the process follows the flow chart in Figure 1. The whole process is separated into 3 steps.
Step 1 uses the travel diary survey data to extract typical activity patterns and to form different kinds of agents according to their activity patterns. The survey data are also used to calculate the reward function for each kind of agent.
Step 2 estimates the value function (in this algorithm, the Q-values) through trial and error until the Q-value matrix converges.
Step 3 places agents on the network and then uses the Q-values to decide the activity-travel schedule of each agent. Finally, the temporal-spatial characteristics of the simulation result and each agent's activity-travel schedule are calculated and recorded.
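The way sequential decisions in Step 3 let one agent's choice affect the next can be illustrated with a deliberately simplified sketch. It is not the paper's model: a hypothetical score (attraction minus a congestion penalty) stands in for the learned Q-value, congestion is taken to be the share of a made-up zone capacity already used, and the zone names, `capacity`, and `beta` are assumptions for illustration.

```python
def assign_zones(n_agents, attraction, capacity, beta):
    """Agents choose a zone one by one; each choice raises that zone's load."""
    load = {zone: 0 for zone in attraction}
    for _ in range(n_agents):
        # Hypothetical score standing in for the learned Q-value:
        # attraction reward minus a penalty that grows with congestion.
        def score(zone):
            return attraction[zone] - beta * load[zone] / capacity

        best = max(attraction, key=score)
        load[best] += 1  # this agent's choice changes what later agents see
    return load


zone_attraction = {"A": 50.0, "B": 45.0}

# Static environment: congestion is ignored, so every agent picks zone A.
static_choice = assign_zones(6, zone_attraction, capacity=10, beta=0.0)

# Interactive environment: congestion feedback spreads agents over both zones.
interactive_choice = assign_zones(6, zone_attraction, capacity=10, beta=100.0)
```

With the penalty switched off all six agents crowd into the more attractive zone, while with it switched on they split evenly, which mirrors the dispersion effect that congestion feedback is meant to produce.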
Taking the congestion degree into account enables interactions among agents and lets them cooperate and compete in the environment. In the simulation of a network, a number of agents are placed on the network and their states are initialized. Then, at each time step, the agents decide their actions one by one, each choosing the action with the maximum Q-value given the congestion degree and other aspects of the environment. Their actions in turn influence the congestion degree and therefore influence other agents' actions. In this way, all agents' activity-travel schedules are decided.
Travel records include trip starting and ending times, origin and destination, mode used, and trip purpose. Trip purpose is divided into nine categories: work, school, official business, shopping, socializing-recreation, serving passengers, personal business, returning home, and returning to work. Among these purposes, work, school, and official business are named commute activities or simply work. Shopping, serving passengers, and personal business are called maintenance activities. Socializing-recreation is called leisure activity. Maintenance and leisure activities together are called non-working activities. Hence, the 9 categories can be grouped into 4 types of activity: work, maintenance activity, leisure activity, and staying at home. Shangyu city has a population of 204,900. A total of 4,101 residents from 1,564 households were surveyed. After deleting incorrect records, data from 3,368 people are used, representing 82.1% of the people surveyed. The 486 students account for 14.4% of the valid data. Because students' activity-travel schedules are rather fixed and the main focus of this paper is on working and non-working groups, the students' data are not considered. Thus, the data obtained from the remaining 2,882 people, accounting for 85.6% of the valid data, are used for the analysis.

Typical Activity Patterns.
The first step in processing the valid survey data is to extract typical activity patterns. A tour is defined as the travel from home to one or more activity locations and back home again [4]. An activity pattern is defined here as all the tours an individual conducts in a single day. In the valid data, 10 patterns are each shared by more than 20 samples. We call these 10 patterns typical activity patterns; together they cover 2397 of the 2882 valid samples. Agents are classified according to their activity patterns, so the 10 typical patterns form 10 types of agents. Patterns that include a working activity are called commuting patterns, and the others are called non-working patterns. The characteristics of the 10 patterns are described in Table 1 (the 4 activities are abbreviated as h, w, s, and l).

Reward Function Calculation.
The reward function was constructed in Section 2.3. The paragraphs below give the values of the parameters used in the reward function, calculated from the survey data. The ten types of agents each have their own parameter values, although the functional forms are the same.

Attraction Degree of Zone.
The attraction degrees of zones are shown in Figure 2, next to the traffic zone division of Shangyu city. Because these degrees are determined by the land use characteristics of the zones, they are the same for all groups of people.
Zones 2, 8, and 13 are clearly the centers of leisure activity, while zones 3 and 5 are the centers of maintenance activity. This result corresponds to the land use characteristics of Shangyu because these zones are in the center of the city.

Reward Based on Duration.
To make the results more realistic, we calculate the duration-based rewards for the 10 typical activity patterns according to the definition in Section 2.3. The unit of these parameters is 15 min. The relatively small standard deviations show that people who belong to the same group behave similarly, at least in the duration of their activities.

Reward Based on Activity Start Time.
The process of calculating this reward was described in Section 2.3. We use the polyfit function in MATLAB to fit each group's start time distribution for every activity with a smooth curve.
The start time-duration-reward graphs of the four activities are shown in Figure 3. Note that the minimum in Figure 3 is not 0.

Assumptions and Preparations for the Simulation.
To simulate traffic conditions in Shangyu, the first step is to expand the number of agents from the sample size to the share of the population that these types of agent represent in Shangyu. By calculation, the 2397 samples in the survey should be expanded to a population of 145684 people. Apart from the existing data of 2397 people, we need to generate attribute data for 143287 people. Each person's attributes include the activity pattern and the home and work locations (if the person works). To make the distribution of each attribute in the generated data the same as in the survey data, one person's attributes are generated as follows.
(1) Randomly generate a natural number $n$ from 1 to 2397. The activity pattern of this person is set equal to that of the $n$th person in the survey data.
(2) Likewise, the home and work locations are decided by randomly choosing one of the 2397 survey samples.
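The two-step procedure above amounts to resampling the survey with replacement. A minimal sketch follows, assuming each survey record is a dictionary with hypothetical keys `pattern`, `home`, and `work`; the three records shown are made up, and the real survey would supply 2397 of them.

```python
import random


def synthesize_population(survey, target_size, seed=0):
    """Expand a survey sample to target_size by resampling with replacement."""
    rng = random.Random(seed)
    population = list(survey)  # the surveyed people are kept as they are
    while len(population) < target_size:
        # Step (1): draw a natural number n and copy that person's pattern.
        n = rng.randint(1, len(survey))
        # Step (2): draw again and copy that person's home and work locations.
        m = rng.randint(1, len(survey))
        population.append({
            "pattern": survey[n - 1]["pattern"],
            "home": survey[m - 1]["home"],
            "work": survey[m - 1]["work"],
        })
    return population


# Made-up survey records (zone numbers are placeholders).
toy_survey = [
    {"pattern": "hwh", "home": 1, "work": 5},
    {"pattern": "hsh", "home": 2, "work": None},
    {"pattern": "hlh", "home": 3, "work": None},
]
toy_population = synthesize_population(toy_survey, target_size=10)
```

Because the pattern and the locations are drawn independently in this sketch, a commuting pattern can be paired with a record that has no work location; how the source handles that case is not stated, so a real implementation would need an explicit rule for it.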
Each agent's state is initialized and 1000 time steps are simulated. We take the last 96 time steps for analysis. Both the individual agents' activity-travel schedules and the spatial-temporal characteristics are analyzed.

Simulation Results of Activity-Travel Schedules.
Each agent's activity-travel schedule for one day is recorded. We randomly choose one agent from each pattern and show his or her activity-travel schedule for a day (from 0:00 to 24:00). The result is shown in Table 2. The first part in each parenthesis is the activity time and the second part is the activity location. The table shows that no abnormal sequence, such as staying at one activity for too long or conducting activities at improper times, occurs in these 10 examples. One flaw is that, to avoid the morning peak of commuting agents, the non-working agents' trips are generally a little earlier than the peak shown by the survey.
With the activity-travel schedules of all agents in hand, we can move on to the macroscopic temporal-spatial characteristics of the traffic.

Temporal Characteristics of the Simulation Result.
Figure 4 compares the traffic flow distributions of this paper's algorithm and of the traditional algorithm, which does not take interactions between agents into account. Both methods show clear morning and evening peaks. In the traditional method, however, the environment is static, which means that one agent's action does not affect other agents' choices; agents with the same attributes therefore all do the same activity at the same time and in the same zone. As a result, the traditional method's flow distribution is ladder-like, and the peak hour flow is very large.
By comparison, because this paper's method takes the congestion degree into account, some agents avoid traveling in the rush hour because doing so would lead to lower rewards. As a result, the peak hour flow is much lower. Even agents with the same attributes have different activity-travel schedules because the environment is dynamic. Thus, the members of the population do not behave in isolation but interact with one another.
Traffic flow distributions of the 2397 samples' survey result and of the corresponding agents' simulation result are shown in Figure 5. The simulation result matches the survey result well: the peak hour flows deviate by less than 5%.
To show the features of different patterns' traffic flow distributions, we mark the flows of different patterns with different colors, as shown in Figures 6(a) and 6(b). Figure 6(a) shows the flow distribution of the 5 commuting patterns. It shows clear morning and evening peaks, at about 7 am and 6 pm, respectively. Compared with Figure 3, we find that these commuting patterns, especially patterns hwh and hwhwh, account for a large percentage of the flow in the morning and evening peaks. The peak at noon is caused by pattern hwhwh agents who go home at noon. On the whole, patterns hwh and hwhwh determine the commuting patterns' flow distribution, and the other commuting patterns have too few people to influence the trend.
Figure 6(b) shows the non-working agents' traffic flow distribution, which is totally different from that of the commuting agents: there is no such dominant pattern. On the contrary, all non-working patterns contribute to the shape of the curve. Both peaks of the flow are in the morning, at about 5 am and 9 am, respectively. The survey shows that 42.9% of the non-working groups in Shangyu are retired people. In China the elderly usually like to go out to exercise early in the morning, and food markets usually open very early; this explains why both the survey result and the simulation result show that non-working people's travel peak is in the morning. Agents of pattern hlh tend to go out early, at about 5 am, while the flow of pattern hsh is distributed almost evenly from 5 am to 10 am. In China, most people, especially non-working people, tend to stay at home in the evening, so there is not much traffic in the evening, as Figure 6(b) shows.
Table 3 compares the peak hour ratio (PHR) of the two methods. The peak hour ratio is defined as the ratio of the peak hour flow to the traffic flow of the whole day. In China, the measured PHR is often between 10% and 15%. Because there are too many OD pairs to list, the table gives the results for the 6 OD pairs with the largest traffic flow as representatives. Over all OD pairs, the average PHR of the original method is 30.5% and that of the new method is 16.2%. The latter is clearly closer to reality.
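The PHR definition can be made concrete with a short calculation. The sketch below assumes that flows are recorded in 96 time steps of 15 min and that the peak hour is the busiest run of four consecutive steps; both the function name and this reading of "peak hour flow" are assumptions for illustration.

```python
def peak_hour_ratio(flows, steps_per_hour=4):
    """Ratio of the busiest hour's flow to the whole-day flow."""
    total = sum(flows)
    if total == 0:
        return 0.0
    # Slide a one-hour window over the day and keep the largest hourly flow.
    peak = max(
        sum(flows[i:i + steps_per_hour])
        for i in range(len(flows) - steps_per_hour + 1)
    )
    return peak / total


# Perfectly even traffic over 96 steps: any hour carries 4/96 of the flow.
uniform_phr = peak_hour_ratio([1.0] * 96)

# All trips packed into one hour: the ladder-like extreme, PHR equal to 1.
packed_flows = [0.0] * 96
packed_flows[28:32] = [25.0, 25.0, 25.0, 25.0]
packed_phr = peak_hour_ratio(packed_flows)
```

The two extremes bracket the reported values: a method without congestion feedback pushes the ratio toward the packed case, while dispersion of departure times pulls it toward the measured range.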

Spatial Characteristics of the Simulation Result.
In the traditional method, because the congestion degree is not taken into account and the effect of the attraction degree is quite pronounced, all agents conduct their maintenance and leisure activities in the zone with the maximum attraction degree: all 20094 leisure activities are conducted in zone 8, and all 70433 maintenance activities are conducted in zone 5. Note that because Shangyu is a small city and the distances between zones are short, the influence of the distance between OD pairs is small. Once the congestion degree is taken into account, the choice of location is much more dispersed: agents choose to conduct their activities in zones with lower attraction degrees when the central zones are crowded. In the end, 5845 leisure activities are conducted in zone 8, accounting for 29.0% of all leisure activities, and 22392 maintenance activities are conducted in zone 5, accounting for 31.7% of all maintenance activities.
The choice of activity zones is shown in Figure 7. The activity location distribution of the survey data is calculated and then scaled up by the same proportion as the sample, to show how the 145684 people would choose their locations according to the survey data. The figure compares this with the simulation result and shows that the two are quite close. The correlation coefficient between the survey data and the simulation result is 0.921 for the leisure activity location distribution and 0.902 for the maintenance activity location distribution.
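A comparison of this kind can be computed as a Pearson correlation between per-zone activity counts. The sketch below uses `numpy.corrcoef` and made-up counts; it assumes the coefficient reported in the text is a Pearson correlation, which the source does not state explicitly.

```python
import numpy as np


def location_correlation(survey_counts, simulated_counts):
    """Pearson correlation between two per-zone activity count vectors."""
    return float(np.corrcoef(survey_counts, simulated_counts)[0, 1])


# Made-up per-zone counts: the second vector is exactly twice the first.
proportional = location_correlation([1, 2, 3, 4], [2, 4, 6, 8])

# Made-up per-zone counts that move in opposite directions.
reversed_order = location_correlation([1, 2, 3], [3, 2, 1])
```

Because the correlation is insensitive to scale, the survey counts do not need to be expanded to the full population before the comparison; only the relative distribution over zones matters.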

Conclusions and Future Directions
In this paper we use a modified multiagent-based reinforcement learning algorithm to simulate the traffic conditions of Shangyu city. Both the spatial-temporal features of the entire population and the activity-travel schedules of individuals are analyzed. The main findings are as follows.
(i) This paper's method takes the congestion degree between OD pairs into account, which enables agents' actions to influence the environment. Agents' actions thus interact with each other. Because of this interaction, both the spatial-temporal features of the entire population and the activity-travel schedules of individual agents are close to the actual situation.
(ii) Because the agents in this paper are no longer separate individuals but an integrated whole whose members interact with each other, the spatial-temporal features of the whole population, such as the traffic flow distribution, the PHR, and the location choice distribution, can be calculated, which is rarely seen in previous research in this field.
(iii) Survey data are used throughout the whole process, including the setting of traffic zones, the extraction of typical activity patterns, the formation of agents, and the reward functions. Using the survey data brings the simulation result closer to the actual situation in Shangyu; the result therefore has practical value and could be further used in transportation planning and management, for example, in analyzing the effects of TDM policies.
(iv) The data used in this paper come from a survey of a typical small city in east China. Both the survey data and the simulation results have distinctly Chinese characteristics. For example, maintenance and leisure activities are conducted mostly in the morning, and people tend to stay at home in the evening; commuting groups have few leisure and maintenance activities on weekdays. These features provide material for future research on Chinese traffic.
The analysis above shows that this paper's simulation method can better reflect actual traffic conditions. Both the macroscopic spatial-temporal features and the microscopic activity-travel schedules support the validity of the method. The accuracy of the simulation result and the use of survey data enable this method to better serve practical transportation planning and management.
Because of the limitations of the survey data and the algorithm, several aspects of the research can be improved in the future.
(i) Route choice in the current model is simplified. Travelers "jump" directly from the origin to the destination, and the influence on intermediate regions is neglected.
(ii) In this paper, the reward function contains four parts, based respectively on the attraction degree of zones, activity start time, duration, and travel time. When they are summed, their weights are taken to be equal. In reality, however, these factors affect people's trip decisions to different degrees. One future direction is therefore to calculate these weights from the survey data, making the simulation results more accurate.
(iii) Road impedance varies greatly with traffic mode, since different modes occupy different amounts of road space and travel at different speeds. It would therefore be better to take each agent's traffic mode into consideration when calculating the congestion degree.
(iv) Reaction to uncertain events is a special characteristic of reinforcement learning. In this paper we focus on the most probable or "average" state of the system. It would also be interesting to explore how agents react to radical changes in the environment and how they interact with each other under such circumstances.

3.1. Data Survey. We used travel diary survey data collected in Shangyu city in 2006. The survey includes individual and household sociodemographics and travel records.

Figure 1: Three steps of multiagent-based Q-learning simulation.

Figure 2: Attraction degrees of zones and traffic zone division of Shangyu city.

Figure 4: Comparison of traffic flow distribution between new method and survey data (flipped with Figure 5).

Figure 5: Comparison of traffic flow distribution between traditional method and new one.

Figure 6: Traffic flow distributions by activity pattern. (a) The 5 commuting patterns, with clear morning and evening peaks at about 7 am and 6 pm. (b) The non-working patterns, whose distribution differs markedly from (a) and has no dominant pattern.

Table 1: Description of typical activity patterns.

Table 3: Comparison of two methods' PHR values.